[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer #5868

davidwendt · 2020-08-06T18:52:45Z

Is your feature request related to a problem? Please describe.
The current subword_tokenize API in cudf accepts the vocabulary hash-file as a parameter. Profiling shows processing this file into GPU memory is about 14% of the overall execution time for 1M strings 15 bytes each using the CLX BERT vocab hash file (755KB).

Time(%)	Avg (ns)	Range Name
78.3	20994753.8	libcudf:subword_tokenize
14.2	3811397.6	libcudf:load_vocabulary_file
7.3	318007.7	libcudf:make_numeric_column

From some previous discussions with @VibhuJawa @raykallen @BartleyR, multiple subword_tokenize calls over batches of data are likely using the same vocabulary so loading the hash-file for each tokenize call would be wasteful.

Note that the load time is fixed so for a smaller set of strings, the cost against the overall time is higher. Here is a run with only 100K strings with same vocab file:

Time(%)	Avg (ns)	Range Name
53.5	3806982.7	libcudf:load_vocabulary_file
42.2	3007159.7	libcudf:subword_tokenize
4.1	47691.5	libcudf:make_numeric_column

Here the vocab file load is more than half the overall time.

Describe the solution you'd like
Create a separate API to load the hash-table for the tokenize. The resulting object could then be passed on each tokenize call that used the same vocabulary thereby speeding up the processing of multiple batches of data.

This may also align better with the hugging face tokenizer which at least stores the vocabulary file names as an attribute if not the preprocessed hash data somehow internally.

The libcudf API layer already has APIs that separate the file load from the tokenize:

nvtext::load_vocabulary_file
nvtext::subword_tokenize accepts a nvtext::hashed_vocabulary struct

So this feature request is to design and implement an equivalent cudf Python API appropriately.
@randerzander

The text was updated successfully, but these errors were encountered:

VibhuJawa · 2020-08-06T20:41:01Z

Create a separate API to load the hash-table for the tokenize. The resulting object could then be passed on each tokenize call that used the same vocabulary thereby speeding up the processing of multiple batches of data.

I have a slightly unrelated question, How hard will it be to support vocab.txt too (instead of just the hash-table file) in a similar API ?

The reasons for this ask are:

a. HuggingFace users often just read the vocab.txt for tokenizing if we can help users skip the process of creating the hash-table it might make porting stuff easier for users and with the proposed feature here this is gonna be anyways be a one time cost so it should be fine .

b. It paves the road for the decode feature (essentially a dictionary lookup) which we don't have in our tokenizer right now ( And we have to rely on cpus to do)

c. The current process of creating hash-table may have some bugs in it (see:#5760 and #5765 ). It is also not under any kind of tests or CI (AFAIK), (feel free to correct me @raykallen / @BartleyR ), so anyways it might be worthwhile adding it to add it to a location in cudf so that it's easier to keep it updated under a CI with tests.

Also, note that the suggestion here is to have both options (vocab.txt as well as vocab-hash-table) which the user can provide as he wishes.

Please feel to shoot this idea down if it's expensive/difficult to implement, This more of a nice to have ask :-)

edit: Added point b on decode .

davidwendt · 2020-08-06T21:01:09Z

I have a slightly unrelated question, How hard will it be to support vocab.txt too (instead of just the hash-table file) in a similar API ?

This would involve incorporating this python file inside the cudf Python package and having someone maintain it.
https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py

@kkraus14

github-actions · 2021-02-16T21:18:54Z

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

kkraus14 · 2021-02-22T18:38:44Z

We had a call offline to discuss a plan on this and this is what we came up with as an initial plan of action:

Change load_vocabulary_file function to return shared_ptr
Expose load_vocabulary_file function in Python / Cython
Create opaque Python wrapper around hashed_vocabulary shared_ptr
Expose subword_tokenize call into Python / Cython as appropriate
Rework the hash_vocab_utils.py into Python as appropriate in order to generate the files necessary for the load_vocabulary_file function.

Follow up work after the above is finished:

Expose subword_tokenize call that takes in a raw vocab.txt file (pipelines through above)
Optimize hash_vocab_utils.py?
Move hash_vocab_utils functionality to libcudf
Implement 'hash_vocab_utils' functionality to the GPU

VibhuJawa · 2021-02-22T19:03:39Z

On the python end-user side, the requested API can loosely follow BertTokenizerFast and we will provide a subset of the functionality as first pass.

class SubwordTokenizer(hash_file: str, max_length: int = 64, stride: int = 48, do_lower: bool = True, do_truncate: bool = False, max_rows_tensor: int = 500):

The function to include in this class is encode:

def encode(Convert a cudf string series into sequence of ids)

## cudf series -> Tuple[cupy.core.core.ndarray, cupy.core.core.ndarray, cupy.core.core.ndarray]

The end user will call it like below:

ser = cudf.Series(['this is the', 'best book'])

tokenizer = SubwordTokenizer('voc_hash.txt', max_length=max_length, stride=stride)

tokens, masks, metadata = tokenizer.encode(ser)

CC: @BartleyR / @raykallen for inputs.

@davidwendt

Reference #5868 This PR changes the `nvtext::load_vocabulary_file` to return a unique-pointer to make it easier to manage in Python/Cython class object. The original signature returned a flat structure that contained unique-pointers which would make it difficult to copy and manage. The corresponding gtests and gbenchmarks were updated for this API change. Authors: - David (@davidwendt) Approvers: - Conor Hoekstra (@codereport) - Karthikeyan (@karthikeyann) URL: #7424

@davidwendt

This PR closes #5868 by adding a new tokenizer API. We are seeing speedups even at low batch sizes (10/100) so this should potentially unlock some inference/training use cases for us. ## Benchmarks: (Thanks to @davidwendt for writing the super fast tokenizer 💥 ) | Batch Size | HuggingFace | Rapids Old API | Rapids Tokenizer API | Tokenizer API Speed up vs HuggingFace | Rapids New API Speedup | |- |- |- |- |- |- | | 1 | 0.000242 | 0.006890 | 0.000497 | 0.487 | 13.863 | | 10 | 0.002800 | 0.007030 | 0.000516 | 5.426 | 13.624 | | 100 | 0.016200 | 0.007140 | 0.000537 | 30.168 | 13.296 | | 1000 | 0.149000 | 0.007150 | 0.000517 | 288.201 | 13.830 | ## API Comparision to HuggingFace: The goal of this PR is to ensure our API matches up HuggingFace as much as possible to help with ease of porting. Proposed API in this PR: ```python from cudf.core.subword_tokenizer import SubwordTokenizer tokenizer = SubwordTokenizer('bert-base-cased-vocab-hash.txt',do_lower_case=False) output = tokenizer(str_series, max_num_rows=len(str_series), truncation=True, max_length=seq_len, padding='max_length', add_special_tokens=False, return_tensors='pt') ``` HuggingFace API: ```python from transformers import BertTokenizerFast tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False) output = tokenizer(input_sentence_ls, truncation=True, max_length=seq_len, padding='max_length', add_special_tokens=False, return_tensors = 'pt') output_d = {k:v.cuda() for k,v in output.items()} ``` ## TODO: - [x] Add tests - [x] Throw appropriate warnings for HuggingFace discrepancies - [x] API checks - [X] [ Benchmark/Example Notebook ](https://nbviewer.jupyter.org/gist/VibhuJawa/350a8479b10be3591dd9c4d5da3cfc3b) CC: @raykallen, @BartleyR (from the cyber team) CC: @randerzander , @beckernick (from the workflows team) Authors: - Vibhu Jawa (https://github.com/VibhuJawa) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - AJ Schmidt (https://github.com/ajschmidt8) - Keith Kraus (https://github.com/kkraus14) URL: #7942

davidwendt added feature request New feature or request Needs Triage Need team to review and classify labels Aug 6, 2020

kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Aug 7, 2020

davidwendt added the strings strings issues (C++ and Python) label Aug 7, 2020

VibhuJawa mentioned this issue Aug 25, 2020

[FEA] Support packing to a max input sequence length with cudf-subword tokenizer #6089

Open

VibhuJawa mentioned this issue Oct 9, 2020

[DOC]Documentation for subword_tokenize #5799

Closed

github-actions bot added the rotten label Feb 16, 2021

github-actions bot removed the inactive-90d label Feb 22, 2021

kkraus14 assigned VibhuJawa Feb 22, 2021

davidwendt mentioned this issue Feb 22, 2021

Change nvtext::load_vocabulary_file to return a unique ptr #7424

Merged

VibhuJawa mentioned this issue Apr 12, 2021

Subword Tokenizer HuggingFace like API #7942

Merged

4 tasks

rapids-bot bot closed this as completed in #7942 May 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer #5868

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer #5868

davidwendt commented Aug 6, 2020 •

edited

Loading

VibhuJawa commented Aug 6, 2020 •

edited

Loading

davidwendt commented Aug 6, 2020

github-actions bot commented Feb 16, 2021

kkraus14 commented Feb 22, 2021

VibhuJawa commented Feb 22, 2021 •

edited

Loading

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer #5868

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer #5868

Comments

davidwendt commented Aug 6, 2020 • edited Loading

VibhuJawa commented Aug 6, 2020 • edited Loading

davidwendt commented Aug 6, 2020

github-actions bot commented Feb 16, 2021

kkraus14 commented Feb 22, 2021

VibhuJawa commented Feb 22, 2021 • edited Loading

davidwendt commented Aug 6, 2020 •

edited

Loading

VibhuJawa commented Aug 6, 2020 •

edited

Loading

VibhuJawa commented Feb 22, 2021 •

edited

Loading