HF Tokenizer: get charspans #1584

Merged: 6 commits into deepjavalibrary:master on Apr 20, 2022

@andreabrduque andreabrduque commented Apr 19, 2022

Description

This PR extends the Hugging Face Java/Rust interface. I added a method that returns the original char span of each token, together with the encoding object.

This function is useful in Named Entity Recognition (NER) tasks, specifically for reassembling multi-token entities, since the tokens returned by tokenization lose whitespace information.
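
As a rough illustration of the NER use case, here is a minimal sketch of how per-token char spans could be used to rebuild entity strings from the original text. The `Span` class, the `extractEntities` helper, and the plain `"O"` / entity-type labels are hypothetical stand-ins invented for this example; they are not the API added in this PR, and a real pipeline would more likely use a BIO-style tagging scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EntitySpanSketch {

    /** Hypothetical stand-in for a token's char span: [start, end) offsets into the input text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Merges runs of consecutive tokens that share the same non-"O" label and
     * slices the original text, so whitespace between the tokens is preserved.
     */
    static List<String> extractEntities(String text, List<Span> spans, List<String> labels) {
        List<String> entities = new ArrayList<>();
        int i = 0;
        while (i < spans.size()) {
            String label = labels.get(i);
            if ("O".equals(label)) {
                i++;
                continue;
            }
            int start = spans.get(i).start;
            int end = spans.get(i).end;
            int j = i + 1;
            // Extend the entity while the following tokens carry the same label.
            while (j < spans.size() && label.equals(labels.get(j))) {
                end = spans.get(j).end;
                j++;
            }
            entities.add(text.substring(start, end));
            i = j;
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "Barack Obama visited New York";
        // Char spans for the tokens: Barack, Obama, visited, New, York
        List<Span> spans = Arrays.asList(
                new Span(0, 6), new Span(7, 12), new Span(13, 20),
                new Span(21, 24), new Span(25, 29));
        List<String> labels = Arrays.asList("PER", "PER", "O", "LOC", "LOC");
        System.out.println(extractEntities(text, spans, labels));
    }
}
```

Because the entity text is cut out of the original string using the first token's start and the last token's end, the space inside "Barack Obama" survives even though the individual tokens no longer carry it.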

@zachgk zachgk left a comment

This looks great @andreabrduque!

It looks like you are getting caught on our checkstyle:

[ant:checkstyle] [ERROR] /home/runner/work/djl/djl/extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/jni/CharSpan.java:17: Missing a Javadoc comment. [MissingJavadocType]

https://github.com/deepjavalibrary/djl/runs/6082365027?check_suite_focus=true#step:11:3779

Change-Id: Ic72c4bd89be325d4ade3a4f86c35846a24f9b08c

codecov-commenter commented Apr 20, 2022

Codecov Report

Merging #1584 (9d035f5) into master (bb5073f) will decrease coverage by 1.22%.
The diff coverage is 61.59%.

@@             Coverage Diff              @@
##             master    #1584      +/-   ##
============================================
- Coverage     72.08%   70.86%   -1.23%     
- Complexity     5126     5397     +271     
============================================
  Files           473      504      +31     
  Lines         21970    23606    +1636     
  Branches       2351     2567     +216     
============================================
+ Hits          15838    16729     +891     
- Misses         4925     5595     +670     
- Partials       1207     1282      +75     
| Impacted Files | Coverage Δ |
|---|---|
| api/src/main/java/ai/djl/modality/cv/Image.java | 69.23% <ø> (-4.11%) ⬇️ |
| ...i/djl/modality/cv/translator/BigGANTranslator.java | 21.42% <ø> (-5.24%) ⬇️ |
| ...odality/cv/translator/BigGANTranslatorFactory.java | 33.33% <0.00%> (+8.33%) ⬆️ |
| ...nslator/InstanceSegmentationTranslatorFactory.java | 14.28% <0.00%> (-3.90%) ⬇️ |
| .../modality/cv/translator/YoloTranslatorFactory.java | 8.33% <0.00%> (-1.67%) ⬇️ |
| ...i/djl/modality/cv/translator/YoloV5Translator.java | 5.69% <0.00%> (ø) |
| ...odality/cv/translator/YoloV5TranslatorFactory.java | 8.33% <0.00%> (-1.67%) ⬇️ |
| ...pi/src/main/java/ai/djl/ndarray/BytesSupplier.java | 54.54% <0.00%> (-12.13%) ⬇️ |
| ...ain/java/ai/djl/ndarray/index/dim/NDIndexPick.java | 100.00% <ø> (ø) |
| api/src/main/java/ai/djl/nn/Blocks.java | 75.00% <0.00%> (-25.00%) ⬇️ |
... and 230 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 06997ce...9d035f5.

@frankfliu frankfliu merged commit 077eb40 into deepjavalibrary:master Apr 20, 2022