Add layoutlm layoutxlm support #2980
Conversation
Looks good. I couldn't really check the I/O operations on the images in the dataset, but I have some minor things you might want to look at before merging.
left_context = left_context[-context_length:]
break
return left_context
left_context = sentence.tokens + left_context
left_context += sentence.tokens
to be consistent with the right context
Notice that addition on lists is not commutative, so your suggestion would lead to wrong results.
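A minimal, self-contained illustration of why the order matters when building the left context (the token values are made up for the example):

```python
# New tokens must be prepended to the left context, so
# `left_context = sentence.tokens + left_context` is correct, while the
# suggested `left_context += sentence.tokens` would append them to the end.
left_context = ["earlier", "tokens"]
tokens = ["new", "sentence"]

prepended = tokens + left_context  # ["new", "sentence", "earlier", "tokens"]
appended = left_context + tokens   # ["earlier", "tokens", "new", "sentence"]

assert prepended != appended  # list concatenation is not commutative
```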
flair/datasets/ocr.py
Outdated
""" | ||
Instantiates a Dataset from a OCR-Json format. | ||
The folder is structured with a "images" folder and a "tagged" folder. | ||
Those folders contain respectively .jpg an .json files with matching file name. |
*and
:param path_to_split_directory: base folder with the task data
:param label_type: the label_type to add the ocr labels to
:param encoding: the encoding to load the .json files with
:param normalize_coords_to_thousands: if True, the coordinates will be ranged from 0 to 1000
Is normalizing to thousands usual? If it was just chosen at random, why not make the normalization factor an optional int and only normalize the images if a factor is provided?
Normalizing to thousands is very common; it is done by LayoutLM (& v2/v3), DocFormer, LAMBERT, ...
I haven't seen an implementation so far that did it differently.
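For reference, a minimal sketch of this 0 to 1000 normalization as used by LayoutLM-style models; the (x0, y0, x1, y1) pixel-coordinate format and the function name are illustrative assumptions, not this PR's code:

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel bbox (x0, y0, x1, y1) into the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# e.g. a box on a 1240x1754 pixel page (roughly A4 at 150 dpi)
print(normalize_bbox((124, 175, 224, 215), 1240, 1754))  # (100, 99, 180, 122)
```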
)

return tensor_args
# random check some tokens to save performance.
Why not check the entire dataset, or discard inputs that don't have all the required metadata?
As noted in the comment, running the checks on a broader scale would impact the speed. This check rather ensures that the user gets a good warning if there are no bounding boxes at all, without hurting performance.
And discarding inputs would silently hide errors, so I am against that.
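A hedged sketch of the spot-check idea described above (the function, the sample size, and the helper usage are illustrative, not the PR's actual code):

```python
import random
import warnings

def spot_check_bboxes(tokens, sample_size=16):
    """Warn, rather than fail or silently discard, if sampled tokens lack bbox metadata.

    Checking only a small random sample keeps the cost constant
    regardless of dataset size.
    """
    sample = random.sample(tokens, min(sample_size, len(tokens)))
    if not any(t.has_metadata("bbox") for t in sample):
        warnings.warn(
            "No 'bbox' metadata found on sampled tokens; "
            "OCR-based embeddings will not receive bounding boxes."
        )
```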
flair/embeddings/transformer.py
Outdated
if self.tokenizer_needs_ocr_boxes:
    tokenizer_kwargs["boxes"] = [[t.get_metadata("bbox") for t in tokens] for tokens in flair_tokens]
else:

Running the formatter again should remove the empty line.
if "bbox" in batch_encoding: | ||
model_kwargs["bbox"] = batch_encoding["bbox"].to(device, non_blocking=True) | ||
|
||
if self.token_embedding or self.needs_manual_ocr: |
We check twice for self.token_embedding (line 547) and self.needs_manual_ocr (line 549), and jointly in line 516. The part required by both is the word_ids. Can we move everything after line 526 also into the condition for self.token_embedding? It looks like self.needs_manual_ocr only needs word_ids_list.
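A rough sketch of the restructuring being proposed, assuming HuggingFace's BatchEncoding.word_ids() and the attribute names from this discussion; the surrounding code is an assumption, not the actual PR code:

```python
if self.token_embedding or self.needs_manual_ocr:
    # the word ids are the only part both paths need, so compute them once
    word_ids_list = [batch_encoding.word_ids(i) for i in range(len(flair_tokens))]

    if self.needs_manual_ocr:
        ...  # only consumes word_ids_list

    if self.token_embedding:
        ...  # everything else that currently runs unconditionally
```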
👍
@helpmefindaname thanks for adding this!
This PR adds the following: add_metadata, get_metadata and has_metadata for each datapoint (Token, Sentence, Span).

I tested the training, using 100 epochs, batch_size=16, lr=5e-5, train_with_dev=True for the following embeddings:
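As an illustration of the metadata accessors named above, a minimal usage sketch; the method names come from this PR's description, while the exact signatures and the bbox value are assumptions:

```python
from flair.data import Sentence

sentence = Sentence("Invoice total 42")
token = sentence[0]

# attach an OCR bounding box to a single token
token.add_metadata("bbox", (100, 99, 180, 122))

if token.has_metadata("bbox"):
    print(token.get_metadata("bbox"))  # (100, 99, 180, 122)
```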