Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid duplicated body part in the abstract #486

Merged
merged 8 commits into from
Sep 12, 2019

Conversation

kermitt2
Copy link
Owner

@kermitt2 kermitt2 commented Aug 20, 2019

This is a fix for one of the issue raised in #476.

The problem is the selection of the abstract to be given to the fulltext model for further structuring the abstract (paragraphs, citations, ..). The current approach supposes incorrectly that the abstract was a continuous segment of the document, while it can consist of several non continuous parts.

In the new version, we create several DocumentPiece according to the continuous segments.

  1. Processing of this PDF is now working (without heuristics):

EPL0580589-CC.pdf

  1. Test on the 1942 PMC dataset improves abstract identification quite significantly for the fuzzy matches, which is nice:

Before:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             94.11        78.41        72.21        75.18 

After:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             95.84        87.93        80.43        84.01  

@kermitt2 kermitt2 requested a review from lfoppiano August 20, 2019 12:41
@coveralls
Copy link

coveralls commented Aug 20, 2019

Coverage Status

Coverage increased (+0.4%) to 37.02% when pulling 06da47f on duplicated-body-parts-476 into bb8cf62 on master.

@lfoppiano
Copy link
Collaborator

lfoppiano commented Aug 21, 2019

Looks good! 👍

I have a couple of minor suggestions.
I've noticed the same (old version) code was used in the method processShort(). I suggest to move the code in a separate method and reuse it there to be on a safe side:

     SortedSet<DocumentPiece> documentParts = new TreeSet<>();

     for (List<LayoutToken> chunk : tokenChunks) {
         documentParts.addAll(collectPiecesFromLayoutTokens(doc, chunk));
     }

See the changes:
Minor_suggestions.patch.zip

@lfoppiano
Copy link
Collaborator

I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation).

In the next example EPJ0290369-CC.pdf the first line is without reformatting:

<p>A large positive magnetoresistivity (up to tens of percents) is observed in both underdoped and overdoped superconducting La 2-x Sr x CuO 4 epitaxial thin films at temperatures far above the superconducting critical temperature T c. For the underdoped samples, this magnetoresistance far above T c cannot be described by the Kohler rule and we believe it is to be attributed to the influence of superconducting fluctuations. In the underdoped regime, the large magnetoresistance is only present when at low temperatures superconductivity occurs. T he strong magnetoresistivity, which persists even at temperatures far above T c , can be related to the pairs forming eventually the superconducting state below T c. Our observations support the idea of a close relation between the pseudogap and the superconducting gap and provide new indications for the presence of pairs above T c .</p>
<p>A large positive magnetoresistivity (up to tens of percents) is observed in both underdoped and overdoped superconducting La 2-x Sr x CuO 4 epitaxial thin films at temperatures far above the superconducting critical temperature T c . For the underdoped samples, this magnetoresistance far above T c cannot be described by the Kohler rule and we believe it is to be attributed to the influence of superconducting fluctuations. In the underdoped regime, the large magnetoresistance is only present when at low temperatures superconductivity occurs. T he strong magnetoresistivity, which persists even at temperatures far above T c , can be related to the pairs forming eventually the superconducting state below T c . Our observations support the idea of a close relation between the pseudogap and the superconducting gap and provide new indications for the presence of pairs above T c .</p>

I'm not sure where these differences are done...

@kermitt2
Copy link
Owner Author

kermitt2 commented Aug 22, 2019

I've noticed the same (old version) code was used in the method processShort().

Ah yes I completely forgot that I've already written exactly the same piece of program before :D

The idea of processShort() was indeed to apply the full text model to any sequence of LayoutToken in a document, it was used for figure and table captions - so we should reuse that for the abstract. I don't remember why I've not reused it when adding the processing of abstract (probably forgot).

The change would be then simply something like that:

if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {
      Pair<String, List<LayoutToken>> abstractProcessed = processShort(abstractTokens, doc);
      resHeader.setLabeledAbstract(abstractProcessed.getLeft());
      resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT);
}

That's it, I think.

@kermitt2
Copy link
Owner Author

I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation).

A problem in building the full text result I guess... I don't think it's related (there is a special mechanism to keep track of trailing spaces for each TaggingTokenCluster, but it's a bit complicated) so you could open an issue specific to that.

1 similar comment
@kermitt2
Copy link
Owner Author

I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation).

A problem in building the full text result I guess... I don't think it's related (there is a special mechanism to keep track of trailing spaces for each TaggingTokenCluster, but it's a bit complicated) so you could open an issue specific to that.

@lfoppiano
Copy link
Collaborator

lfoppiano commented Aug 22, 2019

Ok I see, although in processShort you create chunks of layout token and then create document pieces based on that. Yes is the same result...

Tomorrow I will test it and push the final version on this branch, now that is more clear 😃

@kermitt2
Copy link
Owner Author

Yes the new version is actually better, it avoids these extra chunks of layout tokens, so better to move it to processShort() I think.

@lfoppiano
Copy link
Collaborator

OK!

@kermitt2
Copy link
Owner Author

some loss with the benchmarking, the last changes need to be re-worked

// structure the abstract using the fulltext model
if ( (resHeader.getAbstract() != null) && (resHeader.getAbstract().length() > 0) ) {
List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {
if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a duplicated if, I would also rewrite this part directly with CollectionUtils and StringUtils to improve readability:

            // structure the abstract using the fulltext model
            if ( isNotBlank(resHeader.getAbstract())) {
                List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
                if (CollectionUtils.isNotEmpty(abstractTokens)) {
                    Pair<String, List<LayoutToken>> abstractProcessed = processShort(abstractTokens, doc);
                    if (abstractProcessed != null) {
                        resHeader.setLabeledAbstract(abstractProcessed.getLeft());
                        resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT);
                    }
                }
            }

Copy link
Collaborator

@lfoppiano lfoppiano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle the end-2-end on the papers I was working looks good, however I have minor questions. See my comments in the code.

@lfoppiano
Copy link
Collaborator

I've pushed an updated version where I've done the following:

  1. move the code to collect the document pieces in a separate method called collectPiecesFromLayoutToken()
  2. refined the methods
  • processShort() / old version, working first on the layout token and creating document pieces from it and
  • processShortNew() / new version using the layout token list as it is and apply collectPiecesFromLayoutToken() to extract subsequent document pieces only from subsequent parts
  1. added one unit test - not enough time unfortunately
  2. I've corrected processShort index as commented in a previous review

Hope this can be accepted soon

@kermitt2
Copy link
Owner Author

kermitt2 commented Sep 12, 2019

I fixed several bugs related to labeled abstracts (the problems of issues #424 are solved), still one thing to check. The labeled abstract part was a bit too rushed, lacking tests, but really nice addition I think for getting citations in the abstract - it's working fine.

processShort() is partly rewritten and processShortNew() is useless I think, it is more complicated and should be removed.
collectPiecesFromLayoutTokens() has disappeared because I have trouble to follow and review a process when it is sliced, but we could re-introduced it for the unit tests at some point.

This score is erroneous - it was too good to be true :D

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             95.84        87.93        80.43        84.01 

We are at:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             94.88        82.11        75.88        78.87 

which is for the f-score +3 than v. 0.5.5, +6 than v. 0.5.4, +1.2 than v. 0.5.3 (no more regression with labeled abstract, on the contrary progress).

@lfoppiano
Copy link
Collaborator

From my side I didn't face any problems or regressions. Feel free to cleanup what's not used or obsolete.

@lfoppiano lfoppiano deleted the duplicated-body-parts-476 branch October 18, 2019 08:17
tantikristanti pushed a commit that referenced this pull request Nov 15, 2019
Avoid duplicated body part in the abstract

Former-commit-id: f184546
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants