HF Tokenizer: get charspans #1584
This looks great @andreabrduque!
It looks like you are getting caught on our checkstyle:
Error: [checkstyle] [ERROR] /home/runner/work/djl/djl/extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/jni/CharSpan.java:17: Missing a Javadoc comment. [MissingJavadocType]
https://github.com/deepjavalibrary/djl/runs/6082365027?check_suite_focus=true#step:11:3779
Change-Id: Ic72c4bd89be325d4ade3a4f86c35846a24f9b08c
Codecov Report
@@ Coverage Diff @@
## master #1584 +/- ##
============================================
- Coverage 72.08% 70.86% -1.23%
- Complexity 5126 5397 +271
============================================
Files 473 504 +31
Lines 21970 23606 +1636
Branches 2351 2567 +216
============================================
+ Hits 15838 16729 +891
- Misses 4925 5595 +670
- Partials 1207 1282 +75
Description
This PR extends the Hugging Face Java/Rust interface. I added a method that returns the original character span of each token, alongside the encoding object.
This is useful in Named Entity Recognition (NER) tasks, specifically for reassembling multi-token entities, since the tokens returned by the tokenizer lose the whitespace information of the input text.
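To illustrate the NER use case, here is a minimal, self-contained sketch of how character spans let you recover an entity's exact surface text, whitespace included. It does not call the DJL or Hugging Face APIs: `Span`, `CharSpanSketch`, and `entities` are hypothetical stand-ins, the offsets are written out by hand, and the "merge adjacent tokens with the same label" rule is a simplification of real BIO-style decoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CharSpanSketch {

    /** Hypothetical stand-in for a token's character span: [start, end) offsets into the input. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Groups consecutive tokens that share the same non-"O" label and recovers
     * each entity by slicing the original text from the first token's start to
     * the last token's end, so inner whitespace is preserved.
     */
    static List<String> entities(String text, List<Span> spans, List<String> labels) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < spans.size()) {
            String label = labels.get(i);
            if ("O".equals(label)) {
                i++;
                continue;
            }
            int start = spans.get(i).start;
            int end = spans.get(i).end;
            int j = i + 1;
            // Extend the entity while the following tokens carry the same label.
            while (j < spans.size() && label.equals(labels.get(j))) {
                end = spans.get(j).end;
                j++;
            }
            result.add(label + ": " + text.substring(start, end));
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Angela Merkel visited New York";
        // Hand-written spans for the tokens: Angela, Merkel, visited, New, York.
        List<Span> spans = Arrays.asList(
                new Span(0, 6), new Span(7, 13), new Span(14, 21),
                new Span(22, 25), new Span(26, 30));
        // Assumed per-token labels from some NER model.
        List<String> labels = Arrays.asList("PER", "PER", "O", "LOC", "LOC");
        System.out.println(entities(text, spans, labels));
    }
}
```

Joining the token strings directly would require guessing where spaces belong; slicing the input by span avoids that guess.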