Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Derive Clone on Tokenizer, add Encoding.into_tokens() method #1381

Merged
merged 3 commits into from
Nov 20, 2023

Conversation

epwalsh
Copy link
Collaborator

@epwalsh epwalsh commented Nov 1, 2023

Hey team,

For the What's In My Big Data? (WIMBD) project I found that we needed a couple additional features in the Rust library here. Currently we're depending on my fork of tokenizers but I was hoping we could get this merged and released so I can publish our wimbd crate.

The changes proposed are:

  • Deriving the Clone trait for the Tokenizer struct. This proved super useful so we don't need to re-instantiate the same tokenizer over and over when we run worker jobs in threads.
  • Adding an .into_tokens() -> Vec<String> method on the Encoding class so we can easily get an owned Vec of the raw tokens without having to create a new one.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Nov 1, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link

@meg-huggingface meg-huggingface left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me, but I am probably not the best person to review. Adding other reviewers.
In particular for me, I'm not sure where Clone is being called, so I can't picture offhand how it's being used. Any chance this is missing another file or two that has Clone in it?

@epwalsh
Copy link
Collaborator Author

epwalsh commented Nov 2, 2023

I'm not sure where Clone is being called, so I can't picture offhand how it's being used. Any chance this is missing another file or two that has Clone in it?

There isn't any use of Tokenizer.clone() in this repo, but it's useful as part of the public API. In particular we are doing something like this:

let tokenizer = Tokenizer.from_pretrained(name, None)?;

for path in paths {
    let tokenizer = tokenizer.clone();
    thread_pool::execute(move || {
        // tokenize file at `path` ...
    })
}

Copy link
Collaborator

@Narsil Narsil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi sorry for the delay.

I don't think we need into_tokens().

get_tokens().to_vec() should be preferred IMHO.

If you're not using the Encoding further then the compiler should be able to infer the move semantics (To be proven I admit).

The problem with into_ is that actually into_ids() would also be very valid (it's actually much superior to tokens, tokens being really poor informationwise compared to ids).

I'll go ahead and accept the Clone thing asap. There's indeed no need to not have it clone.
For the into_ I think we not include this particular one. Maybe a single method that split all the pieces if needed but I think the get_...to_vec() is a bit more natural IMO).
To motivate the split_into() thing, I think it would be nice to have a simple example showcasing how it's superior because of performance (or code simplicity).

tokenizers/src/tokenizer/encoding.rs Outdated Show resolved Hide resolved
@Narsil Narsil merged commit daf3616 into huggingface:main Nov 20, 2023
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants