[QST] How do we handle special tokens in the subword_tokenize function? #5765

Closed
VibhuJawa opened this issue Jul 24, 2020 · 3 comments
Labels
doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. question Further information is requested strings strings issues (C++ and Python)

Comments

@VibhuJawa
Member

What is your question?

Currently, our subword_tokenize seems to be tokenizing special tokens like [PAD], [UNK], [SEP], etc. This behavior is inconsistent with Hugging Face. As a user, what is the best way to encode them?

Create Vocab File

import cudf
!wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py

with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\njenna\nis\na\nfrom\nlondon\n'
    f.write(string)

!python3 perfect_hash.py  --vocab 'test_vocab.txt' --output 'test_hash.txt'

Logic

text = "[CLS] Jenna [SEP]"
from transformers import BertTokenizer, BertModel
hugging_face_tokenizer = BertTokenizer(vocab_file=f"test_vocab.txt")
d = hugging_face_tokenizer(text)
tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']
for token_id in tokens:
    if token_id!=0:
        print(f"hugging-token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")
        
        
cudf_ser = cudf.Series([text])
tokens, masks, metadata = cudf_ser.str.subword_tokenize("test_hash.txt")
for token_id in tokens:
    if token_id!=0:
        print(f"cudf-token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")
hugging-token_id = 2 token = [CLS]
hugging-token_id = 2 token = [CLS]
hugging-token_id = 5 token = jenna
hugging-token_id = 3 token = [SEP]
hugging-token_id = 3 token = [SEP]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 5 token = jenna
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]
cudf-token_id = 1 token = [UNK]

For example, cuDF currently encodes [SEP] as [1, 1, 1] (three [UNK] tokens), while Hugging Face correctly encodes it as [3].
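As a sanity check (a minimal sketch, assuming the toy test_vocab.txt above is in the working directory), the expected ids can be read straight off the Hugging Face tokenizer, since ids follow the line order of the vocab file:

from transformers import BertTokenizer

# Ids follow line order in test_vocab.txt:
# [PAD]=0, [UNK]=1, [CLS]=2, [SEP]=3, [MASK]=4
tok = BertTokenizer(vocab_file="test_vocab.txt")
for special in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]:
    print(special, tok.convert_tokens_to_ids(special))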

All of this may stem from a discrepancy in the hashing file and may be related to #5760.

CC: @davidwendt / @efajardo-nv .

@VibhuJawa VibhuJawa added question Further information is requested Needs Triage Need team to review and classify labels Jul 24, 2020
@kkraus14 kkraus14 added doc Documentation libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Aug 5, 2020
@davidwendt
Contributor

@VibhuJawa Was this resolved with the @raykallen solution in #5760 ?

@VibhuJawa
Member Author

VibhuJawa commented Aug 6, 2020

@VibhuJawa Was this resolved with the @raykallen solution in #5760 ?

Nope, see below for a cleaner reproducer:

#!rm -rf *.txt
#!wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
#!wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt 
#!python3 perfect_hash.py  --vocab 'vocab.txt' --output 'vocab-hash.txt' --compact

import cudf
import numpy as np
from transformers import BertTokenizer, BertModel

text = "[UNK]" # True for [MASK],[SEP] , [CLS] etc
cudf_ser = cudf.Series([text])
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",do_lower=False,add_special_tokens=False)
hugging_face_tokenizer = BertTokenizer(vocab_file=f"vocab.txt",do_lower_case=False, add_special_tokens=False)

d = hugging_face_tokenizer(text,add_special_tokens=False)
h_tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']

print("cudf_tokens", cudf_tokens[cudf_tokens!=0])
print("h_tokens", h_tokens)
cudf_tokens [ 164 7414 2428  166]
h_tokens [100]
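Mapping the cuDF ids back through the same vocab is a quick way to see what subword_tokenize actually produced (a diagnostic sketch; the ids are copied from the output above):

# Decode the ids cuDF produced for the single string "[UNK]".
print(hugging_face_tokenizer.convert_ids_to_tokens([164, 7414, 2428, 166]))

Four ids for one special token suggests the string is being split on the brackets and sub-tokenized, rather than matched as a single vocabulary entry.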

I am not sure where this is stemming from, though. In the perfect_hash.py code we seem to special-case some of these tokens, but I am not sure how that is related.
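Until the hashing is fixed, one possible host-side workaround (a rough sketch, not an official API; it assumes a single input row, the toy vocab ids above, and that dropping zero ids only removes padding) is to tokenize without special tokens and splice the known [CLS]/[SEP] ids in afterwards:

import cudf

CLS_ID, SEP_ID = 2, 3  # ids from the toy test_vocab.txt above

ser = cudf.Series(["jenna is from london"])
tokens, masks, metadata = ser.str.subword_tokenize(
    "test_hash.txt", do_lower=False, add_special_tokens=False)

# Strip padding on the host and frame the sequence manually.
body = tokens[tokens != 0].tolist()
input_ids = [CLS_ID] + body + [SEP_ID]
print(input_ids)

The attention mask and token-type ids would need the same manual adjustment before being fed to a model.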

@VibhuJawa
Member Author

Closing in favor of #6937
