
Add layoutlm layoutxlm support #2980

Merged

Conversation

@helpmefindaname (Member) commented Nov 7, 2022

This PR adds the following:

  • Metadata for data points: there are add_metadata, get_metadata and has_metadata methods on each data point (Token, Sentence, Span); see the usage sketch after this list.
  • OCR support for embeddings: you can use embeddings like LayoutLM or LayoutLMv3 as follows:
     # imports as assumed from this PR's additions to flair.data / flair.embeddings
     from flair.data import BoundingBox, Sentence
     from flair.embeddings import TransformerWordEmbeddings

     sentence = Sentence(["I", "love", "Berlin"])
     sentence[0].add_metadata("bbox", BoundingBox(0, 0, 10, 10))
     sentence[1].add_metadata("bbox", (12, 0, 22, 10))
     sentence[2].add_metadata("bbox", (0, 12, 10, 22))
     emb = TransformerWordEmbeddings("microsoft/layoutlm-base-uncased", layers="-1,-2,-3,-4", layer_mean=True)
     emb.embed(sentence)
    
    when using LayoutLMv3 or LayoutLMv2, the image also needs to be added:

     from PIL import Image

     with Image.open("tests/resources/tasks/example_images/i_love_berlin.png") as img:
         img.load()
         img = img.convert("RGB")
     sentence.add_metadata("image", img)
    
  • Support for the SROIE dataset, which can be loaded as follows:

     from flair.datasets import SROIE

     corpus = SROIE()
     corpus = SROIE(load_images=True)  # for embeddings such as LayoutLMv2 or LayoutLMv3, which take information from the images
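
    For the metadata API from the first bullet, a minimal usage sketch (reusing the sentence from the example above):

     token = sentence[0]
     if token.has_metadata("bbox"):          # True once add_metadata has been called
         bbox = token.get_metadata("bbox")   # returns the stored BoundingBox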
    

I tested training using 100 epochs, batch_size=16, lr=5e-5 and train_with_dev=True for the following embeddings (a training sketch follows the table):

Embedding          Micro-F1
LayoutLM-base      93.59%
LayoutLM-large     93.59%
LayoutLMv3-base    94.60%
LayoutLMv3-large   94.80%
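
For reference, a minimal sketch of such a training run (the "ner" label type, hidden_size and output path are assumptions, not taken from this PR):

    from flair.datasets import SROIE
    from flair.embeddings import TransformerWordEmbeddings
    from flair.models import SequenceTagger
    from flair.trainers import ModelTrainer

    corpus = SROIE()
    label_dict = corpus.make_label_dictionary(label_type="ner")  # label type "ner" is an assumption

    embeddings = TransformerWordEmbeddings(
        "microsoft/layoutlm-base-uncased",
        layers="-1,-2,-3,-4",
        layer_mean=True,
        fine_tune=True,
    )
    tagger = SequenceTagger(
        hidden_size=256,  # assumed value, not stated in the PR
        embeddings=embeddings,
        tag_dictionary=label_dict,
        tag_type="ner",
    )
    trainer = ModelTrainer(tagger, corpus)
    trainer.fine_tune(
        "resources/taggers/sroie-layoutlm",  # hypothetical output path
        learning_rate=5e-5,
        mini_batch_size=16,
        max_epochs=100,
        train_with_dev=True,
    )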

@whoisjones (Member) left a comment

Looks good. I couldn't really check the I/O operations on images in the dataset. I have some minor things you might want to look at before merging.

        left_context = sentence.tokens + left_context
        if len(left_context) > context_length:
            left_context = left_context[-context_length:]
            break
    return left_context
Member:

left_context += sentence.tokens to be consistent with right context

Member Author:

Notice that addition on lists is not commutative, hence your suggestion would lead to wrong results (see the illustration below).
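
A toy illustration of the difference (hypothetical values):

    prev_tokens = ["I", "love"]   # tokens of the previous sentence
    left_context = ["Berlin"]     # context gathered so far

    # intended: prepend the previous sentence's tokens
    left_context = prev_tokens + left_context   # ['I', 'love', 'Berlin']

    # `left_context += prev_tokens` would append instead,
    # yielding ['Berlin', 'I', 'love'], the wrong order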

"""
Instantiates a Dataset from a OCR-Json format.
The folder is structured with a "images" folder and a "tagged" folder.
Those folders contain respectively .jpg an .json files with matching file name.
Member:

*and

:param path_to_split_directory: base folder with the task data
:param label_type: the label_type to add the ocr labels to
:param encoding: the encoding to load the .json files with
:param normalize_coords_to_thousands: if True, the coordinates will be ranged from 0 to 1000
Member:

Is normalizing to thousands usual? If it was just chosen at random, why not make the normalization factor an optional int and only normalize if a factor is provided?

Member Author:

Normalizing to thousands is very common; it is done by LayoutLM (& v2/v3), DocFormer, LAMBERT, ...
I haven't seen an implementation so far that does it differently. (See the sketch below.)
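
For illustration, a minimal sketch of this normalization (not the PR's exact code; page_width/page_height are the page's pixel dimensions):

    def normalize_bbox(bbox, page_width, page_height):
        # scale pixel coordinates into the 0-1000 range expected by LayoutLM-style models
        left, top, right, bottom = bbox
        return (
            int(1000 * left / page_width),
            int(1000 * top / page_height),
            int(1000 * right / page_width),
            int(1000 * bottom / page_height),
        )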

)

return tensor_args
# randomly check some tokens to save performance.
Member:

Why not check the entire dataset, or discard inputs that don't have all the required metadata?

Member Author:

As noted in the comment, running the checks on a broader scale would hurt speed; this is rather to ensure that the user gets a clear warning if there are no bounding boxes at all, without impacting performance (a sketch of the idea follows below).

And discarding inputs would silently hide errors, so I am against that.
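
A minimal sketch of the spot-check idea (hypothetical names, not the PR's exact code):

    import logging
    import random

    log = logging.getLogger("flair")

    # sample a handful of tokens and warn once if none of them carry a "bbox"
    sample = random.sample(all_tokens, k=min(5, len(all_tokens)))  # all_tokens: hypothetical list of Tokens
    if not any(token.has_metadata("bbox") for token in sample):
        log.warning("No bounding boxes found on sampled tokens; OCR embeddings require 'bbox' metadata.")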

if self.tokenizer_needs_ocr_boxes:
tokenizer_kwargs["boxes"] = [[t.get_metadata("bbox") for t in tokens] for tokens in flair_tokens]
else:

Member:

Running the formatter again should remove the empty line.

if "bbox" in batch_encoding:
model_kwargs["bbox"] = batch_encoding["bbox"].to(device, non_blocking=True)

if self.token_embedding or self.needs_manual_ocr:
Member:

We check self.token_embedding (line 547) and self.needs_manual_ocr (line 549) separately, and jointly in line 516. The part required by both is the word_ids. Can we also move everything after line 526 into the condition for self.token_embedding? It looks like self.needs_manual_ocr only needs word_ids_list.

@whoisjones (Member):

👍

@alanakbik merged commit dde4847 into flairNLP:master on Nov 15, 2022
@alanakbik (Collaborator):

@helpmefindaname thanks for adding this!

@helpmefindaname deleted the add_layoutlm_layoutxlm_support branch on November 28, 2022 10:45