Annotating formulas, listings and figures #1100
Comments
@Schroedi regarding point 1, the documentation refers to one model, while the example you are linking belongs to another model, which I'm not sure is even used (its last update was in 2017 😅).
Regarding point 3, if figures are missing, you should examine the training data generated by the upstream models. See Fig. 2 in https://grobid.readthedocs.io/en/latest/Principles/ for more information on what is upstream and downstream. I recommend working in batches of documents: check one model's data for the whole batch, then move to the next model. It usually takes time to get familiar with each model's structure, so working on one model before moving to the next may be more efficient. It's just a recommendation, though. I do my best to explain what I have been doing; feel free to point me to the unclear parts. 😅
Like the training process, this explanation can proceed iteratively. Let me know if there are points that are not clear.
Thank you for taking the time and for your detailed answer! It really helped me.
You're right. I think this one should be fixed, though: #1107. The last open point is 2, listings. Is there any special handling, or should they just be figures?
Thanks for the PR #1107. We might merge it at the next iteration on the models (which might happen in a few months) so that we don't forget about it. For the listings, I don't really know; I quickly checked but did not find any training data.
Hi, thanks for your awesome work!
I have some annotation questions:
Formula labeling
https://grobid.readthedocs.io/en/latest/training/fulltext/#formulas
advises against including the brackets in the label. The training data includes them, though. One of multiple samples: https://github.com/kermitt2/grobid/blob/be9e6523d71518544e1394f5be56bda0e55819ef/grobid-trainer/resources/dataset/shorttext/corpus/tei/submission_106.training.shorttext.tei.xml#L10C177-L10C177
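To get a feeling for how often the brackets end up inside the label, a small script along these lines could scan a training file. This is only a minimal sketch: it assumes the labels sit in `<label>` elements nested inside `<formula>` elements, and the function names, the regular expression and the sample string are all made up for illustration (the sample is not copied from the GROBID corpus).

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a label counts as "bracketed" when it is wrapped in (), [] or {}.
BRACKETED = re.compile(r"^\s*[\(\[\{].*[\)\]\}]\s*$", re.DOTALL)


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def bracketed_formula_labels(tei_xml: str) -> list[str]:
    """Return the text of every <label> inside a <formula> that is wrapped in brackets."""
    root = ET.fromstring(tei_xml)
    hits = []
    for formula in root.iter():
        if local_name(formula.tag) != "formula":
            continue
        for child in formula.iter():
            if local_name(child.tag) == "label":
                text = "".join(child.itertext())
                if BRACKETED.match(text):
                    hits.append(text.strip())
    return hits


# Hypothetical sample, not taken from the GROBID training data.
SAMPLE = """<text>
  <formula>a = b + c <label>(1)</label></formula>
  <formula>d = e * f <label>2</label></formula>
</text>"""

print(bracketed_formula_labels(SAMPLE))  # ['(1)']
```

To run it on a real file, the file content can be read with `pathlib.Path(...).read_text(encoding="utf-8")` and passed to `bracketed_formula_labels`.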
Listings
How should I annotate listings like Algorithm 1 in [1]?
Are they figures? If so, what would be the label?
I assume I should add missing figures to the figure.tei.xml file?
They probably should follow the order in which they appear within the fulltext?
The following is obsolete: I found the `trash` tag in the training data
Should they contain all text+tags from the fulltext and additionally annotate the relevant parts (head, label, figDesc)? Here is an example from [1] again:
```xml
Random Saccades 50 100 150 200 250 300 160 180 200 220 240 260 280 300 Smooth Pursuit 140 160 180 200 220 240 260 0 50 100 150 200 250 300 Pixel Coordinate Pixel Coordinate Pupil in Camera Space Gaze Point in Screen Space Gaze Point in Screen Space 20°6 3°2 0°6 3°4 0°9 5°9 5°9 5°P ixel Coordinate Pixel Coordinate Fig. 6 . Fitted pupil locations and gaze point estimates for smooth pursuit motion and random saccadic motion are shown for four different users in different colors. The figure is organized into grids; the first row plots smooth pursuit data and the second row plots random saccadic data.
```
Should I keep the first part or remove it?

[1] arXiv:2004.03577v3