[QST] How do we handle special tokens in the subword_tokenize function? #5765

Closed
VibhuJawa opened this issue Jul 24, 2020 · 3 comments
Labels
doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. question Further information is requested strings strings issues (C++ and Python)

Comments

@VibhuJawa
Member

What is your question?

Currently, our subword_tokenize seems to be tokenizing special tokens like [PAD], [UNK], [SEP], etc. This behavior is inconsistent with Hugging Face. As a user, what is the best way to encode them?

Create Vocab File

import cudf
!wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py

with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\njenna\nis\na\nfrom\nlondon\n'
    f.write(string)

!python3 perfect_hash.py  --vocab 'test_vocab.txt' --output 'test_hash.txt'

Logic

text = "[CLS] Jenna [SEP]"
from transformers import BertTokenizer, BertModel
hugging_face_tokenizer = BertTokenizer(vocab_file=f"test_vocab.txt")
d = hugging_face_tokenizer(text)
tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']
for token_id in tokens:
    if token_id!=0:
        print(f"hugging-token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")
        
        
cudf_ser = cudf.Series([text])
tokens, masks, metadata = cudf_ser.str.subword_tokenize("test_hash.txt")
for token_id in tokens:
    if token_id!=0:
        print(f"cudf-token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")
hugging-token_id = 2 token = [CLS]
hugging-token_id = 2 token = [CLS]
hugging-token_id = 5 token = jenna
hugging-token_id = 3 token = [SEP]
hugging-token_id = 3 token = [SEP]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 5 token = jenna
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]

For example, cuDF currently encodes [SEP] as [1, 1, 1] (three [UNK] tokens), while Hugging Face correctly encodes it as [3].
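As a sanity check (a minimal sketch, assuming the toy test_vocab.txt above is in the working directory), the expected ids can be read straight off the Hugging Face tokenizer, since ids follow the line order of the vocab file:

from transformers import BertTokenizer

# Ids follow line order in test_vocab.txt:
# [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4
tok = BertTokenizer(vocab_file="test_vocab.txt")
for special in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(special, tok.convert_tokens_to_ids(special))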

All of this may stem from a discrepancy in the hashing file and may be related to #5760.

CC: @davidwendt / @efajardo-nv .

@VibhuJawa VibhuJawa added question Further information is requested Needs Triage Need team to review and classify labels Jul 24, 2020
@kkraus14 kkraus14 added doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Aug 5, 2020
@davidwendt
Contributor

@VibhuJawa Was this resolved with the @raykallen solution in #5760 ?

@VibhuJawa
Member Author

VibhuJawa commented Aug 6, 2020

@VibhuJawa Was this resolved with the @raykallen solution in #5760 ?

Nope, see below for a cleaner reproducer:

#!rm -rf *.txt
#!wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
#!wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt 
#!python3 perfect_hash.py  --vocab 'vocab.txt' --output 'vocab-hash.txt' --compact

import cudf
import numpy as np
from transformers import BertTokenizer, BertModel

text = "[UNK]" # True for [MASK],[SEP] , [CLS] etc
cudf_ser = cudf.Series([text])
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",do_lower=False,add_special_tokens=False)
hugging_face_tokenizer = BertTokenizer(vocab_file=f"vocab.txt",do_lower_case=False, add_special_tokens=False)

d = hugging_face_tokenizer(text,add_special_tokens=False)
h_tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']

print("cudf_tokens", cudf_tokens[cudf_tokens!=0])
print("h_tokens", h_tokens)
cudf_tokens [ 164 7414 2428  166]
h_tokens [100]
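Mapping the cuDF ids back through the same vocab is a quick way to see what subword_tokenize actually produced (a diagnostic sketch; the ids are copied from the output above):

# Decode the ids cuDF produced for the single string "[UNK]".
print(hugging_face_tokenizer.convert_ids_to_tokens([164, 7414, 2428, 166]))

Four ids for one special token suggests the string is being split on the brackets and sub-tokenized, rather than matched as a single vocabulary entry.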

I am not sure where this is stemming from, though. In the perfect_hash.py code we seem to special-case some of these tokens, but I am not sure how that is related.
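Until the hashing is fixed, one possible host-side workaround (a rough sketch, not an official API; it assumes a single input row, the toy vocab ids above, and that dropping zero ids only removes padding) is to tokenize without special tokens and splice the known [CLS]/[SEP] ids in afterwards:

import cudf

CLS_ID, SEP_ID = 2, 3  # ids from the toy test_vocab.txt above

ser = cudf.Series(["jenna is from london"])
tokens, masks, metadata = ser.str.subword_tokenize(
    "test_hash.txt", do_lower=False, add_special_tokens=False)

# Strip padding on the host and frame the sequence manually.
body = tokens[tokens != 0].tolist()
input_ids = [CLS_ID] + body + [SEP_ID]
print(input_ids)

The attention mask and token-type ids would need the same manual adjustment before being fed to a model.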

@VibhuJawa
Member Author

Closing in favor of #6937
