Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method #1381

epwalsh · 2023-11-01T15:45:23Z

Hey team,

For the What's In My Big Data? (WIMBD) project I found that we needed a couple additional features in the Rust library here. Currently we're depending on my fork of tokenizers but I was hoping we could get this merged and released so I can publish our wimbd crate.

The changes proposed are:

Deriving the Clone trait for the Tokenizer struct. This proved super useful so we don't need to re-instantiate the same tokenizer over and over when we run worker jobs in threads.
Adding an .into_tokens() -> Vec<String> method on the Encoding class so we can easily get an owned Vec of the raw tokens without having to create a new one.

HuggingFaceDocBuilderDev · 2023-11-01T15:54:40Z

The documentation is not available anymore as the PR was closed or merged.

meg-huggingface

This looks fine to me, but I am probably not the best person to review. Adding other reviewers.
In particular for me, I'm not sure where Clone is being called, so I can't picture offhand how it's being used. Any chance this is missing another file or two that has Clone in it?

epwalsh · 2023-11-02T00:11:33Z

I'm not sure where Clone is being called, so I can't picture offhand how it's being used. Any chance this is missing another file or two that has Clone in it?

There isn't any use of Tokenizer.clone() in this repo, but it's useful as part of the public API. In particular we are doing something like this:

let tokenizer = Tokenizer.from_pretrained(name, None)?;

for path in paths {
    let tokenizer = tokenizer.clone();
    thread_pool::execute(move || {
        // tokenize file at `path` ...
    })
}

Narsil

Hi sorry for the delay.

I don't think we need into_tokens().

get_tokens().to_vec() should be preferred IMHO.

If you're not using the Encoding further then the compiler should be able to infer the move semantics (To be proven I admit).

The problem with into_ is that actually into_ids() would also be very valid (it's actually much superior to tokens, tokens being really poor informationwise compared to ids).

I'll go ahead and accept the Clone thing asap. There's indeed no need to not have it clone.
For the into_ I think we not include this particular one. Maybe a single method that split all the pieces if needed but I think the get_...to_vec() is a bit more natural IMO).
To motivate the split_into() thing, I think it would be nice to have a simple example showcasing how it's superior because of performance (or code simplicity).

tokenizers/src/tokenizer/encoding.rs

epwalsh added 2 commits November 9, 2022 13:36

Add into_tokens() method

cb5e8a5

derive clone

eb3d37f

meg-huggingface requested review from Narsil and abhishekkrthakur November 2, 2023 00:01

meg-huggingface reviewed Nov 2, 2023

View reviewed changes

Narsil reviewed Nov 20, 2023

View reviewed changes

tokenizers/src/tokenizer/encoding.rs Outdated Show resolved Hide resolved

Update tokenizers/src/tokenizer/encoding.rs

6a36601

Narsil approved these changes Nov 20, 2023

View reviewed changes

Narsil merged commit daf3616 into huggingface:main Nov 20, 2023
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method #1381

Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method #1381

epwalsh commented Nov 1, 2023

HuggingFaceDocBuilderDev commented Nov 1, 2023 •

edited

Loading

meg-huggingface left a comment

epwalsh commented Nov 2, 2023

Narsil left a comment

Derive Clone on Tokenizer, add Encoding.into_tokens() method #1381

Derive Clone on Tokenizer, add Encoding.into_tokens() method #1381

Conversation

epwalsh commented Nov 1, 2023

HuggingFaceDocBuilderDev commented Nov 1, 2023 • edited Loading

meg-huggingface left a comment

Choose a reason for hiding this comment

epwalsh commented Nov 2, 2023

Narsil left a comment

Choose a reason for hiding this comment

Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method #1381

Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method #1381

HuggingFaceDocBuilderDev commented Nov 1, 2023 •

edited

Loading