Improved dehypenisation #498

lfoppiano · 2019-09-06T06:08:22Z

I tried to improve the dehyphenisation (#180).

Changes:

deprecated all dehyphenisation methods in TextUtilities (they are "redirected" to LayoutTokensUtil)
improved recognition by using coordinates to check for a "real" line break
added more unit tests
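
To illustrate the coordinate check mentioned above, here is a minimal sketch and not the code from this PR. The `Tok` class, its `x`/`y` fields, and the rule "the next token is lower on the page and further left" are all assumptions made for the example; the actual layout token type and heuristics in GROBID may differ.

```java
import java.util.List;

public class CoordinateDehyphenSketch {

    /** Hypothetical stand-in for a layout token: text plus top-left position (y grows downwards). */
    public static final class Tok {
        public final String text;
        public final double x;
        public final double y;

        public Tok(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Assumed rule: a "real" line break means the next token sits lower on the page and further left. */
    public static boolean isRealLineBreak(Tok hyphen, Tok next) {
        return next.y > hyphen.y && next.x < hyphen.x;
    }

    /** Joins token texts, dropping a hyphen (and the whitespace after it) only at a real line break. */
    public static String dehyphenize(List<Tok> tokens) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.size()) {
            Tok current = tokens.get(i);
            if (current.text.equals("-")) {
                // Look ahead for the next non-whitespace token without running past the end of the list.
                int j = i + 1;
                while (j < tokens.size() && tokens.get(j).text.trim().isEmpty()) {
                    j++;
                }
                if (j < tokens.size() && isRealLineBreak(current, tokens.get(j))) {
                    i = j; // skip the hyphen and the whitespace, resume at the continuation token
                    continue;
                }
            }
            out.append(current.text);
            i++;
        }
        return out.toString();
    }
}
```

With this rule, a hyphen inside a line (as in "well-known") is kept because the following token has the same `y`, while a hyphen at the end of a line is removed together with the line break.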

TODO?: use a dictionary to check whether previousToken + followingToken form a valid word (from dehypeniseHard)
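
One possible shape for that dictionary check, as a sketch only: the class name, the `join` method, and the use of a plain `Set<String>` as the dictionary are assumptions for illustration and are not taken from the GROBID code base.

```java
import java.util.Locale;
import java.util.Set;

public class DictionaryDehyphenSketch {

    /**
     * Returns the merged word when previousToken + followingToken is a known word
     * (case-insensitive lookup); otherwise keeps the hyphen between the two tokens.
     */
    public static String join(String previousToken, String followingToken, Set<String> dictionary) {
        String merged = previousToken + followingToken;
        if (dictionary.contains(merged.toLowerCase(Locale.ROOT))) {
            return merged;
        }
        return previousToken + "-" + followingToken;
    }
}
```

The lookup is only as good as the dictionary, so such a check would probably complement the coordinate-based test rather than replace it.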

coveralls · 2019-09-06T06:57:16Z

Coverage increased (+0.2%) to 37.312% when pulling 3212b61 on improved-dehypenisation into 24e6f0e on master.

kermitt2 · 2019-09-12T20:33:40Z

grobid-core/src/main/java/org/grobid/core/utilities/LayoutTokensUtil.java

@@ -21,6 +22,9 @@
 */
 public class LayoutTokensUtil {

+    private final static Pattern LOWERCASE_LETTERS = Pattern.compile("[a-z]+");
+    private final static Pattern UPPER_AND_LOWERCASE_LETTERS = Pattern.compile("^[A-Za-z]+$");
+


We should normally use Unicode character classes (\p{Lu} for uppercase letters, \p{Ll} for lowercase letters) and methods such as Character.isLowerCase(), because the ASCII ranges used here will miss many characters in non-English languages.
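
A small sketch of the difference, for illustration only. The class and method names are hypothetical, and the patterns are anchored variants written for this example rather than the ones committed in the PR.

```java
import java.util.regex.Pattern;

public class UnicodeLetterSketch {

    // ASCII-only range, similar in spirit to the patterns in the diff above.
    private static final Pattern ASCII_LOWERCASE = Pattern.compile("^[a-z]+$");

    // Unicode general categories: Ll = lowercase letter, Lu = uppercase letter.
    private static final Pattern UNICODE_LOWERCASE = Pattern.compile("^\\p{Ll}+$");
    private static final Pattern UNICODE_UPPER_AND_LOWERCASE = Pattern.compile("^[\\p{Lu}\\p{Ll}]+$");

    public static boolean isAsciiLowercase(String s) {
        return ASCII_LOWERCASE.matcher(s).matches();
    }

    public static boolean isUnicodeLowercase(String s) {
        return UNICODE_LOWERCASE.matcher(s).matches();
    }

    public static boolean isUnicodeLetters(String s) {
        return UNICODE_UPPER_AND_LOWERCASE.matcher(s).matches();
    }

    /** Character-level alternative: checks the first code point with Character.isLowerCase. */
    public static boolean startsWithLowercase(String s) {
        return !s.isEmpty() && Character.isLowerCase(s.codePointAt(0));
    }
}
```

For example, the French word "été" (written with the precomposed character U+00E9) fails the ASCII pattern but matches `\p{Ll}+`, and "Über" matches the combined uppercase/lowercase class.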

kermitt2 · 2019-09-28T19:06:22Z

The PDF case from #180 is working fine, and the PR was tested against the 1942 PMC PDFs without loss. Many thanks!

Improved dehypenisation Former-commit-id: 472324a

lfoppiano added 3 commits September 6, 2019 14:29

cleanup dehypenisation

fc0cafa

improving dehypenisation using coordinates to check breakline

3ccfd89

avoiding going out of bounds

f6b2434

lfoppiano requested a review from kermitt2 September 6, 2019 06:08

lfoppiano added 2 commits September 6, 2019 15:47

cosmetics

4b14c67

getting instance of GrobidProperties before running tests

d6b1d0e

kermitt2 added 7 commits September 13, 2019 07:01

fix merging issue with master

bc6bd9a

merge with master for benchmark

c71f879

cleaning

bf7e1de

Merge branch 'master' into improved-dehypenisation

5f4af22

do not use anymore deprecated dehyphenization methods in grobid core

883b3cb

support unicode strings

f395e21

fix test

3212b61

kermitt2 approved these changes Sep 28, 2019


kermitt2 merged commit 472324a into master Sep 28, 2019

lfoppiano deleted the improved-dehypenisation branch September 30, 2019 05:30

tantikristanti pushed a commit that referenced this pull request Nov 15, 2019

Merge pull request #498 from kermitt2/improved-dehypenisation

1adc351

Improved dehypenisation Former-commit-id: 472324a
