-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid duplicated body part in the abstract #486
Conversation
Looks good! 👍 I have a couple of minor suggestions. SortedSet<DocumentPiece> documentParts = new TreeSet<>();
for (List<LayoutToken> chunk : tokenChunks) {
documentParts.addAll(collectPiecesFromLayoutTokens(doc, chunk));
} See the changes: |
I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation). In the next example EPJ0290369-CC.pdf the first line is without reformatting:
I'm not sure where these differences are done... |
Ah yes I completely forgot that I've already written exactly the same piece of program before :D The idea of The change would be then simply something like that:
That's it, I think. |
A problem in building the full text result I guess... I don't think it's related (there is a special mechanism to keep track of trailing spaces for each |
1 similar comment
A problem in building the full text result I guess... I don't think it's related (there is a special mechanism to keep track of trailing spaces for each |
Ok I see, although in Tomorrow I will test it and push the final version on this branch, now that is more clear 😃 |
Yes the new version is actually better, it avoids these extra chunks of layout tokens, so better to move it to |
OK! |
… texts like the abstract
some loss with the benchmarking, the last changes need to be re-worked |
// structure the abstract using the fulltext model | ||
if ( (resHeader.getAbstract() != null) && (resHeader.getAbstract().length() > 0) ) { | ||
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT); | ||
if ( (abstractTokens != null) && (abstractTokens.size()>0) ) { | ||
if ( (abstractTokens != null) && (abstractTokens.size()>0) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a duplicated if
, I would also rewrite this part directly with CollectionUtils and StringUtils to improve readability:
// structure the abstract using the fulltext model
if ( isNotBlank(resHeader.getAbstract())) {
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
if (CollectionUtils.isNotEmpty(abstractTokens)) {
Pair<String, List<LayoutToken>> abstractProcessed = processShort(abstractTokens, doc);
if (abstractProcessed != null) {
resHeader.setLabeledAbstract(abstractProcessed.getLeft());
resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT);
}
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle the end-2-end on the papers I was working looks good, however I have minor questions. See my comments in the code.
I've pushed an updated version where I've done the following:
Hope this can be accepted soon |
I fixed several bugs related to labeled abstracts (the problems of issues #424 are solved), still one thing to check. The labeled abstract part was a bit too rushed, lacking tests, but really nice addition I think for getting citations in the abstract - it's working fine.
This score is erroneous - it was too good to be true :D
We are at:
which is for the f-score +3 than v. 0.5.5, +6 than v. 0.5.4, +1.2 than v. 0.5.3 (no more regression with labeled abstract, on the contrary progress). |
From my side I didn't face any problems or regressions. Feel free to cleanup what's not used or obsolete. |
Avoid duplicated body part in the abstract Former-commit-id: f184546
This is a fix for one of the issue raised in #476.
The problem is the selection of the abstract to be given to the fulltext model for further structuring the abstract (paragraphs, citations, ..). The current approach supposes incorrectly that the abstract was a continuous segment of the document, while it can consist of several non continuous parts.
In the new version, we create several DocumentPiece according to the continuous segments.
EPL0580589-CC.pdf
Before:
After: