Add layoutlm layoutxlm support #2980
Conversation
Looks good. I couldn't really check the I/O operations on the images in the dataset, but I have some minor things you might want to look at before merging.
left_context = left_context[-context_length:]
break
return left_context
left_context = sentence.tokens + left_context
left_context += sentence.tokens
to be consistent with the right context
Notice that addition on lists is not commutative, so your suggestion would lead to wrong results.
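A minimal, self-contained illustration of why the order matters when building the left context (the token values are made up for the example):

```python
# New tokens must be prepended to the left context, so
# `left_context = sentence.tokens + left_context` is correct, while the
# suggested `left_context += sentence.tokens` would append them to the end.
left_context = ["earlier", "tokens"]
tokens = ["new", "sentence"]

prepended = tokens + left_context  # ["new", "sentence", "earlier", "tokens"]
appended = left_context + tokens   # ["earlier", "tokens", "new", "sentence"]

assert prepended != appended  # list concatenation is not commutative
```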
flair/datasets/ocr.py
Outdated
""" | ||
Instantiates a Dataset from a OCR-Json format. | ||
The folder is structured with a "images" folder and a "tagged" folder. | ||
Those folders contain respectively .jpg an .json files with matching file name. |
*and
:param path_to_split_directory: base folder with the task data
:param label_type: the label_type to add the ocr labels to
:param encoding: the encoding to load the .json files with
:param normalize_coords_to_thousands: if True, the coordinates will be ranged from 0 to 1000
Is normalizing to thousands usual? If it was just chosen at random, why not make the normalization factor an optional int and only normalize the images if a factor is provided?
Normalizing to thousands is very common; it is done by LayoutLM (& v2/v3), DocFormer, LAMBERT, ...
I haven't seen an implementation so far that did it differently.
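For reference, a minimal sketch of this 0 to 1000 normalization as used by LayoutLM-style models; the (x0, y0, x1, y1) pixel-coordinate format and the function name are illustrative assumptions, not this PR's code:

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel bbox (x0, y0, x1, y1) into the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# e.g. a box on a 1240x1754 pixel page (roughly A4 at 150 dpi)
print(normalize_bbox((124, 175, 224, 215), 1240, 1754))  # (100, 99, 180, 122)
```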
)

return tensor_args
# random check some tokens to save performance.
Why not check the entire dataset, or discard inputs that don't have all the required metadata?
As noted in the comment, running the checks on a broader scale would impact the speed. This check rather ensures that the user gets a good warning if there are no bounding boxes at all, without hurting performance.
And discarding inputs would silently hide errors, so I am against that.
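A hedged sketch of the spot-check idea described above (the function, the sample size, and the helper usage are illustrative, not the PR's actual code):

```python
import random
import warnings

def spot_check_bboxes(tokens, sample_size=16):
    """Warn, rather than fail or silently discard, if sampled tokens lack bbox metadata.

    Checking only a small random sample keeps the cost constant
    regardless of dataset size.
    """
    sample = random.sample(tokens, min(sample_size, len(tokens)))
    if not any(t.has_metadata("bbox") for t in sample):
        warnings.warn(
            "No 'bbox' metadata found on sampled tokens; "
            "OCR-based embeddings will not receive bounding boxes."
        )
```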
flair/embeddings/transformer.py
Outdated
if self.tokenizer_needs_ocr_boxes:
    tokenizer_kwargs["boxes"] = [[t.get_metadata("bbox") for t in tokens] for tokens in flair_tokens]
else:

Running the formatter again should remove the empty line.
if "bbox" in batch_encoding: | ||
model_kwargs["bbox"] = batch_encoding["bbox"].to(device, non_blocking=True) | ||
|
||
if self.token_embedding or self.needs_manual_ocr: |
We check twice for self.token_embedding (line 547) and self.needs_manual_ocr (line 549), and jointly in line 516. The part required by both is the word_ids. Can we move everything after line 526 also into the condition for self.token_embedding? It looks like self.needs_manual_ocr only needs word_ids_list.
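A rough sketch of the restructuring being proposed, assuming HuggingFace's BatchEncoding.word_ids() and the attribute names from this discussion; the surrounding code is an assumption, not the actual PR code:

```python
if self.token_embedding or self.needs_manual_ocr:
    # the word ids are the only part both paths need, so compute them once
    word_ids_list = [batch_encoding.word_ids(i) for i in range(len(flair_tokens))]

    if self.needs_manual_ocr:
        ...  # only consumes word_ids_list

    if self.token_embedding:
        ...  # everything else that currently runs unconditionally
```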
👍
@helpmefindaname thanks for adding this!
This PR adds the following: add_metadata, get_metadata and has_metadata for each datapoint (Token, Sentence, Span).

I tested the training, using 100 epochs, batch_size=16, lr=5e-5, train_with_dev=True for the following embeddings:
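As an illustration of the metadata accessors named above, a minimal usage sketch; the method names come from this PR's description, while the exact signatures and the bbox value are assumptions:

```python
from flair.data import Sentence

sentence = Sentence("Invoice total 42")
token = sentence[0]

# attach an OCR bounding box to a single token
token.add_metadata("bbox", (100, 99, 180, 122))

if token.has_metadata("bbox"):
    print(token.get_metadata("bbox"))  # (100, 99, 180, 122)
```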