VibhuJawa opened this issue on Jul 24, 2020 · 3 comments
Labels: doc (Documentation), libcudf (Affects libcudf (C++/CUDA) code), Python (Affects Python cuDF API), question (Further information is requested), strings (strings issues (C++ and Python))
What is your question?
Currently, our subword tokenize seems to tokenize special tokens like [PAD], [UNK], [SEP], etc. as ordinary text. This behavior is inconsistent with Hugging Face. What is the best way, as a user, to encode them?

Create Vocab File Logic
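As a minimal sketch of what such a vocab file might look like (assuming the standard BERT-style layout where each line holds one token and the 0-based line number is the token id; the tokens and ids below are illustrative, not taken from this issue):

```python
# Sketch: build a BERT-style vocab file where special tokens get fixed ids.
# Layout assumption: one token per line, token id = 0-based line number.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # ids 0-4 here
wordpieces = ["the", "quick", "brown", "fox", "##es"]            # ids 5-9

with open("vocab.txt", "w") as f:
    for token in special_tokens + wordpieces:
        f.write(token + "\n")

# vocab.txt is then converted into the hashed-vocabulary file that cudf's
# subword_tokenize consumes (e.g. via the perfect_hash.py script mentioned
# below); the exact invocation is omitted here since it may vary by release.
```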
Currently, [SEP], as an example, is encoded as [1, 1, 1] (three [UNK] ids) by cudf, while Hugging Face correctly encodes it as [3]. I am not sure where this behavior is coming from; in the perfect_hash.py code we seem to special-case some of these tokens, but I am not sure how that is related. All of this may stem from a discrepancy in the hashing file and may be related to #5760.
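A hedged repro sketch of the mismatch (the hash-file name, the subword_tokenize signature, and the printed ids are assumptions based on the cuDF 0.15-era Python API and the toy vocab above, not verbatim from this issue):

```python
import cudf
from transformers import BertTokenizer

ser = cudf.Series(["[SEP]"])

# cuDF path: consumes the hashed vocab file; with the behavior described
# above, the special token is split and comes back as [UNK] ids (e.g. 1, 1, 1).
tokens, masks, metadata = ser.str.subword_tokenize(
    "vocab_hash.txt", max_length=8, stride=8, do_lower=False, do_truncate=True
)
print(tokens)

# Hugging Face path: reads the plain vocab file and maps [SEP] to a single
# id (e.g. [3] with the vocab layout sketched earlier).
hf = BertTokenizer("vocab.txt", do_lower_case=False)
print(hf.encode("[SEP]", add_special_tokens=False))
```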
CC: @davidwendt / @efajardo-nv.