Improved dehypenisation #498

Merged
merged 12 commits into from
Sep 28, 2019

Collaborator

@lfoppiano lfoppiano commented Sep 6, 2019

I tried to improve the dehyphenisation (#180).

Changes:

  • deprecated all dehyphenisation methods in TextUtilities (they are "redirected" to LayoutTokensUtil)
  • improved recognition by using token coordinates to detect "real" line breaks
  • added more unit tests
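
A minimal sketch of the coordinate check mentioned in the list above, not the code from this PR. `Token`, `DehyphenationSketch`, and the rule "the next token sits lower on the page and further left than the hyphen" are all assumptions made for illustration, as is the convention that `y` grows downwards.

```java
import java.util.List;

/** Hypothetical stand-in for a layout token: its text plus the position of its top-left corner. */
final class Token {
    final String text;
    final double x;
    final double y;

    Token(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class DehyphenationSketch {

    /** Assumption: a "real" line break means the next token is lower on the page and left of the hyphen. */
    static boolean isRealLineBreak(Token hyphen, Token next) {
        return next.y > hyphen.y && next.x < hyphen.x;
    }

    /** Concatenates the token texts, dropping a hyphen only when a real line break follows it. */
    static String dehyphenize(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Token current = tokens.get(i);
            boolean isHyphen = current.text.equals("-");
            boolean hasNeighbours = i > 0 && i + 1 < tokens.size();
            if (isHyphen && hasNeighbours && isRealLineBreak(current, tokens.get(i + 1))) {
                continue; // end-of-line hyphen: skip it so the two word halves are joined
            }
            sb.append(current.text);
        }
        return sb.toString();
    }
}
```

With these assumptions, a hyphen followed by a token on the same line (for example in "state-of") is kept, while a hyphen at the end of a line is dropped.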

TODO?: use a dictionary to check whether previousToken + followingToken forms a valid word (idea from dehypeniseHard)
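
The dictionary idea in the TODO could look roughly like the sketch below. `DictionaryCheckSketch`, `join`, and the in-memory `Set` of lowercase words are hypothetical; nothing here says which dictionary or lookup API would actually be used.

```java
import java.util.Locale;
import java.util.Set;

final class DictionaryCheckSketch {

    /**
     * Joins the two halves without a hyphen only if the concatenation is a known word.
     * Assumption: the dictionary stores lowercase entries.
     */
    static String join(String previousToken, String followingToken, Set<String> dictionary) {
        String joined = previousToken + followingToken;
        if (dictionary.contains(joined.toLowerCase(Locale.ROOT))) {
            return joined; // e.g. "hyphen" + "ation" -> "hyphenation"
        }
        // Unknown concatenation: treat the hyphen as part of the word and keep it.
        return previousToken + "-" + followingToken;
    }
}
```

A check like this would only be a fallback for cases the coordinates cannot decide, since compound words such as "well-known" can also break across lines.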

/cc @de-code

@lfoppiano lfoppiano requested a review from kermitt2 September 6, 2019 06:08

coveralls commented Sep 6, 2019

Coverage increased (+0.2%) to 37.312% when pulling 3212b61 on improved-dehypenisation into 24e6f0e on master.

@@ -21,6 +22,9 @@
*/
public class LayoutTokensUtil {

private final static Pattern LOWERCASE_LETTERS = Pattern.compile("[a-z]+");
private final static Pattern UPPER_AND_LOWERCASE_LETTERS = Pattern.compile("^[A-Za-z]+$");

Owner

Normally we should use Unicode character classes (\p{Lu} for uppercase letters, \p{Ll} for lowercase letters) and methods like Character.isLowerCase(), because the ASCII ranges used here will miss many characters in non-English languages.
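
As an illustration of this suggestion, the two patterns could be rewritten along these lines. This is a sketch, not the change that was merged; the class name `UnicodeLetterPatterns` and the helper `isAllLowerCase` are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeLetterPatterns {

    // \p{Ll} matches any Unicode lowercase letter, \p{Lu} any Unicode uppercase letter.
    static final Pattern LOWERCASE_LETTERS = Pattern.compile("\\p{Ll}+");
    static final Pattern UPPER_AND_LOWERCASE_LETTERS = Pattern.compile("^[\\p{Lu}\\p{Ll}]+$");

    /** Regex-free variant based on Character.isLowerCase, applied per code point. */
    static boolean isAllLowerCase(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isLowerCase);
    }
}
```

With these patterns a token such as "Étude" or "über" is accepted, whereas `^[A-Za-z]+$` rejects both.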

@kermitt2
Owner

The PDF case from #180 is working fine, and the PR was tested against the 1942 PMC PDFs without loss. Many thanks!

@kermitt2 kermitt2 merged commit 472324a into master Sep 28, 2019
@lfoppiano lfoppiano deleted the improved-dehypenisation branch September 30, 2019 05:30
tantikristanti pushed a commit that referenced this pull request Nov 15, 2019
Improved dehypenisation

Former-commit-id: 472324a