Abbreviations: remove dot if word is not an abbreviation #5

Merged (3 commits) on Mar 20, 2024


Terrtia (Contributor) commented Mar 7, 2024

  • Add an abbreviation set
  • Remove the dot (.) if the word is not an abbreviation

pierotofy (Member) commented Mar 7, 2024

Thanks for the PR!

I'm not sure I'd want to start introducing such specific tokenization rules and additional pickle files. One of the goals of the project is to keep the logic simple (and fast). Could texts that need more specific tokenization rules be handled outside of lexilang, by performing the tokenization first and then calling lexilang individually on the tokens?
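For illustration, the "tokenize outside, then detect per token" idea could be sketched roughly as follows. This is only a sketch under assumptions: `detect_by_tokens`, the punctuation-stripping rule, and the confidence-weighted vote are hypothetical, and the detector is passed in as a callable that is assumed to return a `(language, confidence)` pair. A toy detector stands in for the real one so the snippet runs on its own.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

# Assumed shape of a detector: token -> (language code or None, confidence).
Detector = Callable[[str], Tuple[Optional[str], float]]


def detect_by_tokens(text: str, detect: Detector) -> Optional[str]:
    """Tokenize outside the detector, then vote over per-token results."""
    # Hypothetical client-side tokenization rule: split on whitespace and
    # strip surrounding punctuation (including trailing dots).
    tokens = [t.strip(".,!?;:") for t in text.split()]
    votes: Counter = Counter()
    for token in tokens:
        if not token:
            continue  # the token was punctuation only
        lang, confidence = detect(token)
        if lang is not None:
            votes[lang] += confidence  # confidence-weighted vote
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def toy_detect(token: str) -> Tuple[Optional[str], float]:
    """Stand-in detector with a tiny hard-coded vocabulary (illustration only)."""
    vocabulary = {"bonjour": ("fr", 0.5), "merci": ("fr", 0.4), "hello": ("en", 0.5)}
    return vocabulary.get(token.lower(), (None, 0.0))


print(detect_by_tokens("Bonjour. hello merci!", toy_detect))
```

In a client application the toy detector would be replaced by whatever detection function is actually in use; the tokenization rules then live entirely in the client.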

Terrtia (Contributor Author) commented Mar 8, 2024

How should we handle abbreviations?

Punctuation needs to be removed to check whether a word is in a dictionary. However, lexilang uses punctuation to distinguish the Bulgarian and Hungarian languages, which is why I didn't remove dot characters in my previous PR.

I think it's interesting to detect languages through abbreviations. I use lexilang to identify languages in chat messages, which frequently include language-specific abbreviations.

pierotofy (Member) commented Mar 10, 2024

One possible approach: client applications could replace abbreviations with their full-word equivalents before sending the text to lexilang for detection?
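A minimal sketch of that client-side preprocessing step, assuming the application keeps its own abbreviation table. The table contents and the `expand_abbreviations` helper are hypothetical and not part of lexilang; the expanded text would then be passed to the detector as usual.

```python
import re

# Hypothetical, application-maintained table: abbreviation -> full form.
ABBREVIATIONS = {
    "stp": "s'il te plaît",
    "mdr": "mort de rire",
    "asap": "as soon as possible",
}


def expand_abbreviations(text: str, table: dict = ABBREVIATIONS) -> str:
    """Replace known abbreviations (with or without a trailing dot)."""

    def replace(match: re.Match) -> str:
        full = table.get(match.group(1).lower())
        # Unknown words are returned untouched, trailing dot included.
        return full if full is not None else match.group(0)

    # A word, optionally followed by a dot that belongs to the abbreviation.
    return re.sub(r"(\w+)\.?", replace, text)


print(expand_abbreviations("tu viens stp. mdr"))
```

Keeping the table in the client means the detector itself needs no abbreviation-specific rules or extra data files.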

Terrtia (Contributor Author) commented Mar 20, 2024

I've removed all abbreviations and dot characters.

pierotofy (Member) commented

Looks good, thanks!!

pierotofy merged commit ee46d67 into LibreTranslate:main on Mar 20, 2024

Terrtia (Contributor Author) commented Mar 20, 2024

Thanks!
Could you please create a new release on PyPI?
