Improved dehypenisation #498
@@ -21,6 +22,9 @@
 */
public class LayoutTokensUtil {

    private final static Pattern LOWERCASE_LETTERS = Pattern.compile("[a-z]+");
    private final static Pattern UPPER_AND_LOWERCASE_LETTERS = Pattern.compile("^[A-Za-z]+$");

Normally we should use Unicode character classes (\p{Lu} for uppercase letters, \p{Ll} for lowercase letters) and methods like Character.isLowerCase(), because the ASCII-only ranges used here might miss many characters in non-English languages.
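A minimal sketch of what that suggestion could look like, assuming the two constants keep their names and are simply switched to Unicode categories. The class name and the three helper methods are hypothetical and only illustrate the behaviour; they are not part of the PR, and how the real constants are applied (find vs. full match) is an assumption here.

```java
import java.util.regex.Pattern;

// Hypothetical holder class, used only for this illustration.
final class UnicodeLetterPatterns {

    // Unicode-aware variant of "[a-z]+": one or more lowercase letters in any script.
    static final Pattern LOWERCASE_LETTERS = Pattern.compile("\\p{Ll}+");

    // Unicode-aware variant of "^[A-Za-z]+$": only uppercase or lowercase letters.
    static final Pattern UPPER_AND_LOWERCASE_LETTERS = Pattern.compile("^[\\p{Lu}\\p{Ll}]+$");

    // True if the text contains at least one run of lowercase letters.
    static boolean containsLowercase(String text) {
        return LOWERCASE_LETTERS.matcher(text).find();
    }

    // True if the whole text consists of uppercase and lowercase letters.
    static boolean isLettersOnly(String text) {
        return UPPER_AND_LOWERCASE_LETTERS.matcher(text).matches();
    }

    // Character-based alternative: checks the first code point without a regex.
    static boolean startsWithLowercase(String text) {
        return !text.isEmpty() && Character.isLowerCase(text.codePointAt(0));
    }
}
```

With these patterns a token such as "Über" or "étude" counts as letters only, whereas "^[A-Za-z]+$" would reject both.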
The PDF case from #180 is working fine, and the PR has been tested against the 1942 PMC PDFs without loss. Many thanks!
Improved dehypenisation Former-commit-id: 472324a
I tried to improve the dehyphenisation (#180)
Changes:
TODO?: use a dictionary to check whether previousToken + followingToken are valid words (from dehypeniseHard)
/cc @de-code
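One possible reading of the TODO, sketched under stated assumptions: the dictionary is a plain set of lowercase words, and "valid" means the concatenation of the two tokens appears in it. The class and method names are hypothetical, and this is not the existing dehypeniseHard logic.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a dictionary-backed dehyphenation decision.
final class DictionaryDehyphenationSketch {

    // Joins the tokens around a line-end hyphen. The hyphen is dropped only
    // when the concatenated form is a known word (assumed lowercase entries);
    // otherwise the hyphen is kept.
    static String join(String previousToken, String followingToken, Set<String> dictionary) {
        String candidate = previousToken + followingToken;
        if (dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
            return candidate;
        }
        return previousToken + "-" + followingToken;
    }
}
```

For example, with a dictionary containing "implementation", the tokens "implemen" and "tation" would be merged, while "well" and "known" would keep their hyphen. A real dictionary would likely need to be language-specific and handle inflected forms.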