Avoid duplicated body part in the abstract #486

kermitt2 · 2019-08-20T12:41:01Z

This is a fix for one of the issues raised in #476.

The problem lies in how the abstract is selected before it is given to the fulltext model for further structuring (paragraphs, citations, ...). The current approach incorrectly assumes that the abstract is one continuous segment of the document, while it can consist of several non-contiguous parts.

In the new version, we create several DocumentPiece objects, one per continuous segment.
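As an illustration of the idea only, here is a minimal sketch of splitting a sorted list of token positions into continuous runs. The class name `ContiguousSegments`, the `split` method and the `int[]{start, end}` representation are hypothetical and are not GROBID's actual `DocumentPiece` API; the sketch assumes the positions are already sorted in document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of GROBID: groups sorted token positions
// into continuous runs, each run standing in for one document piece.
final class ContiguousSegments {

    // Returns one {start, end} pair (inclusive) per continuous run.
    // Assumes positions are sorted in ascending order.
    static List<int[]> split(List<Integer> positions) {
        List<int[]> segments = new ArrayList<>();
        if (positions.isEmpty()) {
            return segments;
        }
        int start = positions.get(0);
        int previous = start;
        for (int i = 1; i < positions.size(); i++) {
            int current = positions.get(i);
            if (current != previous + 1) {
                // Gap found: close the current run and open a new one.
                segments.add(new int[] {start, previous});
                start = current;
            }
            previous = current;
        }
        // Close the last run.
        segments.add(new int[] {start, previous});
        return segments;
    }

    public static void main(String[] args) {
        // An "abstract" made of three separate runs of token positions.
        for (int[] segment : split(Arrays.asList(10, 11, 12, 40, 41, 90))) {
            System.out.println(segment[0] + ".." + segment[1]);
        }
    }
}
```

A single piece spanning the first to the last position would also cover everything in the gaps, which is how unrelated body text can end up inside the abstract.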

Processing of this PDF is now working (without heuristics):

EPL0580589-CC.pdf

Test on the 1942 PMC dataset improves abstract identification quite significantly for the fuzzy matches, which is nice:

Before:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             94.11        78.41        72.21        75.18

After:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             95.84        87.93        80.43        84.01

coveralls · 2019-08-20T12:49:40Z

Coverage increased (+0.4%) to 37.02% when pulling 06da47f on duplicated-body-parts-476 into bb8cf62 on master.

lfoppiano · 2019-08-21T00:21:32Z

Looks good! 👍

I have a couple of minor suggestions.
I've noticed that the same (old version) code is used in the method processShort(). I suggest moving the code into a separate method and reusing it there, to be on the safe side:

     SortedSet<DocumentPiece> documentParts = new TreeSet<>();

     for (List<LayoutToken> chunk : tokenChunks) {
         documentParts.addAll(collectPiecesFromLayoutTokens(doc, chunk));
     }

See the changes:
Minor_suggestions.patch.zip

lfoppiano · 2019-08-21T01:14:43Z

I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation).

In the following example (EPJ0290369-CC.pdf), the first line is the output without reformatting and the second line is the output with it:

<p>A large positive magnetoresistivity (up to tens of percents) is observed in both underdoped and overdoped superconducting La 2-x Sr x CuO 4 epitaxial thin films at temperatures far above the superconducting critical temperature T c. For the underdoped samples, this magnetoresistance far above T c cannot be described by the Kohler rule and we believe it is to be attributed to the influence of superconducting fluctuations. In the underdoped regime, the large magnetoresistance is only present when at low temperatures superconductivity occurs. T he strong magnetoresistivity, which persists even at temperatures far above T c , can be related to the pairs forming eventually the superconducting state below T c. Our observations support the idea of a close relation between the pseudogap and the superconducting gap and provide new indications for the presence of pairs above T c .</p>
<p>A large positive magnetoresistivity (up to tens of percents) is observed in both underdoped and overdoped superconducting La 2-x Sr x CuO 4 epitaxial thin films at temperatures far above the superconducting critical temperature T c . For the underdoped samples, this magnetoresistance far above T c cannot be described by the Kohler rule and we believe it is to be attributed to the influence of superconducting fluctuations. In the underdoped regime, the large magnetoresistance is only present when at low temperatures superconductivity occurs. T he strong magnetoresistivity, which persists even at temperatures far above T c , can be related to the pairs forming eventually the superconducting state below T c . Our observations support the idea of a close relation between the pseudogap and the superconducting gap and provide new indications for the presence of pairs above T c .</p>

I'm not sure where these differences are introduced...
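To check that the two outputs differ only in spacing around punctuation, a small diagnostic helper along these lines could be used. This is a sketch under assumptions: the class `PunctuationSpacing` and its methods are hypothetical and have nothing to do with how GROBID tracks trailing spaces internally.

```java
// Hypothetical diagnostic helper, not part of GROBID.
final class PunctuationSpacing {

    // Collapses whitespace runs to one space and removes the space that
    // precedes common punctuation marks, so "T c ." becomes "T c.".
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ")
                   .replaceAll(" ([.,;:!?])", "$1")
                   .trim();
    }

    // True when the two texts are identical once punctuation spacing is ignored.
    static boolean sameIgnoringPunctuationSpacing(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String without = "critical temperature T c. For the underdoped samples";
        String with = "critical temperature T c . For the underdoped samples";
        System.out.println(sameIgnoringPunctuationSpacing(without, with));
    }
}
```

If the comparison holds for the full paragraphs, the difference is purely a spacing issue and not a change in the selected tokens.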

kermitt2 · 2019-08-22T10:59:28Z

I've noticed the same (old version) code was used in the method processShort().

Ah yes I completely forgot that I've already written exactly the same piece of program before :D

The idea of processShort() was indeed to apply the full text model to any sequence of LayoutToken in a document. It was used for figure and table captions, so we should reuse it for the abstract. I don't remember why I did not reuse it when adding the processing of the abstract (I probably forgot).

The change would then simply be something like this:

if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {
      Pair<String, List<LayoutToken>> abstractProcessed = processShort(abstractTokens, doc);
      resHeader.setLabeledAbstract(abstractProcessed.getLeft());
      resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT);
}

That's it, I think.

kermitt2 · 2019-08-22T11:04:17Z

I notice that disabling the further abstract reformatting returns slightly different text (with the reformatting, there are many more spaces after punctuation).

A problem in building the full text result I guess... I don't think it's related (there is a special mechanism to keep track of trailing spaces for each TaggingTokenCluster, but it's a bit complicated) so you could open an issue specific to that.

lfoppiano · 2019-08-22T13:53:32Z

OK, I see, although in processShort() you create chunks of layout tokens and then create document pieces based on those. Yes, it is the same result...

Tomorrow I will test it and push the final version on this branch, now that it is clearer 😃

kermitt2 · 2019-08-22T13:56:11Z

Yes, the new version is actually better: it avoids these extra chunks of layout tokens, so it is better to move it to processShort(), I think.

lfoppiano · 2019-08-22T14:07:01Z

OK!

kermitt2 · 2019-08-22T17:21:34Z

Some loss in the benchmarking; the last changes need to be reworked.

lfoppiano · 2019-08-26T02:12:44Z

grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java

            // structure the abstract using the fulltext model
            if ( (resHeader.getAbstract() != null) && (resHeader.getAbstract().length() > 0) ) {
+                List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
+                if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {
+                    if ( (abstractTokens != null) && (abstractTokens.size()>0) ) {


there is a duplicated if, I would also rewrite this part directly with CollectionUtils and StringUtils to improve readability:

     // structure the abstract using the fulltext model
     if (isNotBlank(resHeader.getAbstract())) {
         List<LayoutToken> abstractTokens = resHeader.getLayoutTokens(TaggingLabels.HEADER_ABSTRACT);
         if (CollectionUtils.isNotEmpty(abstractTokens)) {
             Pair<String, List<LayoutToken>> abstractProcessed = processShort(abstractTokens, doc);
             if (abstractProcessed != null) {
                 resHeader.setLabeledAbstract(abstractProcessed.getLeft());
                 resHeader.setLayoutTokensForLabel(abstractProcessed.getRight(), TaggingLabels.HEADER_ABSTRACT);
             }
         }
     }

lfoppiano

In principle, the end-to-end results on the papers I was working on look good; however, I have minor questions. See my comments in the code.

lfoppiano · 2019-09-07T09:23:07Z

I've pushed an updated version where I've done the following:

moved the code that collects the document pieces into a separate method called collectPiecesFromLayoutToken()
refined the methods:
  processShort(): the old version, which works first on the layout tokens and creates document pieces from them
  processShortNew(): the new version, which uses the layout token list as it is and applies collectPiecesFromLayoutToken() to extract document pieces only from consecutive parts
added one unit test (not enough time for more, unfortunately)
corrected the processShort() index, as commented in a previous review

Hope this can be accepted soon

kermitt2 · 2019-09-12T12:47:30Z

I fixed several bugs related to labeled abstracts (the problems of issue #424 are solved); there is still one thing to check. The labeled abstract part was a bit too rushed and lacked tests, but it is a really nice addition, I think, for getting citations in the abstract, and it's working fine.

processShort() is partly rewritten, and processShortNew() is useless, I think: it is more complicated and should be removed.
collectPiecesFromLayoutTokens() has disappeared because I have trouble following and reviewing a process when it is sliced, but we could reintroduce it for the unit tests at some point.

This score is erroneous - it was too good to be true :D

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             95.84        87.93        80.43        84.01

We are at:

==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)

===== Field-level results =====

label                accuracy     precision    recall       f1     

abstract             94.88        82.11        75.88        78.87

which, for the f-score, is +3 over v. 0.5.5, +6 over v. 0.5.4 and +1.2 over v. 0.5.3 (no more regression with the labeled abstract; on the contrary, progress).
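For reference, the f1 column in the tables of this thread is the harmonic mean of precision and recall, and the reported values can be recomputed from the other two columns. The sketch below is a standalone check, not the GROBID evaluation code; the class name `F1Check` is hypothetical.

```java
import java.util.Locale;

// Standalone check of the reported f1 values; not GROBID's evaluation code.
final class F1Check {

    // Harmonic mean of precision and recall (both on the same scale, here percentages).
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // (precision, recall) pairs copied from the three result tables in this thread.
        double[][] rows = {{78.41, 72.21}, {87.93, 80.43}, {82.11, 75.88}};
        for (double[] row : rows) {
            System.out.println(String.format(Locale.ROOT, "%.2f", f1(row[0], row[1])));
        }
    }
}
```

Rounded to two decimals, the three rows give 75.18, 84.01 and 78.87, matching the f1 values reported in the tables.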

lfoppiano · 2019-09-13T01:48:45Z

From my side, I didn't face any problems or regressions. Feel free to clean up what's not used or obsolete.

Avoid duplicated body part in the abstract Former-commit-id: f184546

create valid DocumentPiece for further structuring abstract

c6d2930

kermitt2 requested a review from lfoppiano August 20, 2019 12:41

update processShort for applying the fulltext model to short piece of texts like the abstract

626ad60

kermitt2 added 2 commits August 22, 2019 22:35

rollback

377ad90

use previous processShort for all short texts

2087e78

lfoppiano reviewed Aug 26, 2019

grobid-core/src/main/java/org/grobid/core/engines/FullTextParser.java

lfoppiano reviewed Aug 26, 2019

Implementing suggestions and move code into methods + adding some unit tests

02612ff

kermitt2 added 2 commits September 12, 2019 09:34

review processShort; fix bug for DocumentPiece handling in feature generation

345c6ae

fix #424, fix labeled abstract mapping

6a9e167

Add a cleaning method for abstract working with layout tokens

06da47f

kermitt2 merged commit f184546 into master Sep 12, 2019

This was referenced Sep 12, 2019

GROBID 0.5.4 abstract extraction regression #424

Closed

make labeled abstract configurable #501

Closed

lfoppiano deleted the duplicated-body-parts-476 branch October 18, 2019 08:17

tantikristanti pushed a commit that referenced this pull request Nov 15, 2019

Merge pull request #486 from kermitt2/duplicated-body-parts-476

ef8a54e
