We cannot rely on out-of-context tokenization to calculate tokenized offset lengths #31

Closed
aalok-sathe opened this issue Apr 28, 2022 · 1 comment
Labels: bug, triage

Comments

@aalok-sathe

The tokenized length of a word can differ depending on its context, as the example below shows.

This is a possible source of inconsistency in our ANN encode method: because a word's tokenization is not uniform across contexts, we can't rely on the length of a word tokenized in isolation to calculate offsets.

In [1]: from transformers import AutoModel, AutoTokenizer

In [2]: t = AutoTokenizer.from_pretrained('distilgpt2')

In [3]: t.decode([30119, 9015, 354, 5973])
Out[3]: 'past chickenchicken'

In [4]: t.decode(1)
Out[4]: '"'

In [5]: t.decode([354])
Out[5]: 'ch'

In [6]: t.decode([354, 5973])
Out[6]: 'chicken'

In [7]: [*map(t.decode, [354, 5973])]
Out[7]: ['ch', 'icken']

In [8]: t('chicken')
Out[8]: {'input_ids': [354, 5973], 'attention_mask': [1, 1]}

In [9]: t('tasty chicken')
Out[9]: {'input_ids': [83, 7833, 9015], 'attention_mask': [1, 1, 1]}

In [10]: t.decode([9015])
Out[10]: ' chicken'

We need to determine in which situations this is actually a problem, and whether it affects the output of encode.
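To make the failure mode concrete without downloading a model, here is a minimal sketch using a toy greedy tokenizer. Everything in it (`VOCAB`, `tokenize`, `naive_token_count`, `word_token_spans`) is hypothetical and only mimics the distilgpt2 pieces from the session above; it is not the code in encode, and not necessarily how the fix was done. The idea is to tokenize the full string once and map words to tokens through character offsets, rather than summing per-word token counts. With a real Hugging Face fast tokenizer, character offsets are available via `return_offsets_mapping=True`.

```python
# Toy vocabulary mimicking the distilgpt2 pieces seen above (an assumption,
# not the real vocabulary): 'chicken' alone splits into 'ch' + 'icken',
# but after a space it is the single piece ' chicken'.
VOCAB = [" chicken", "icken", "asty", "ch", "t"]


def tokenize(text):
    """Greedy longest-match tokenizer returning (piece, start, end) triples."""
    pieces, i = [], 0
    while i < len(text):
        # Longest vocabulary entry starting at position i; fall back to one char.
        match = max((v for v in VOCAB if text.startswith(v, i)),
                    key=len, default=text[i])
        pieces.append((match, i, i + len(match)))
        i += len(match)
    return pieces


def naive_token_count(text):
    """Out-of-context estimate: tokenize each word alone and sum the lengths."""
    return sum(len(tokenize(word)) for word in text.split(" "))


def word_token_spans(text):
    """In-context mapping: (word, first_token, one_past_last_token) per word."""
    tokens = tokenize(text)
    spans, pos = [], 0
    for word in text.split(" "):
        start, end = pos, pos + len(word)
        # A token belongs to the word if its characters overlap the word's.
        idx = [k for k, (_, s, e) in enumerate(tokens) if s < end and e > start]
        spans.append((word, idx[0], idx[-1] + 1))
        pos = end + 1  # skip the separating space
    return spans


text = "tasty chicken"
print(len(tokenize(text)), naive_token_count(text))
print(word_token_spans(text))
```

On "tasty chicken" the naive sum and the in-context token count disagree, just as in the distilgpt2 session, while the offset-based spans stay aligned with the tokens the model would actually see.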

@aalok-sathe

Fixed, see #19.
