Decoding Issue for Latin Characters in added_tokens #1424

Closed
44670 opened this issue Jan 4, 2024 · 2 comments

44670 commented Jan 4, 2024

Hello,

I'm encountering a decoding issue in the tokenizers library with certain Latin characters that appear in added_tokens. I observed it with the DeepSeek-coder model, whose tokenizer contains the following token definition:

"added_tokens": [
  {
    "id": 32000,
    "content": "õ",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  }......
]

While encoding this character works as expected, the decoding process does not produce the correct result. Here's an example illustrating the issue:

# `tok` is the DeepSeek-coder tokenizer, loaded beforehand
tok.encode('õ', add_special_tokens=False)
# Output: [32000]  (correct)

tok.decode([32000])
# Output: '�'  (incorrect; expected 'õ')

Decoding token ID 32000 should return 'õ', but it returns '�' (U+FFFD, the Unicode replacement character) instead. The problem appears to be specific to decoding, since encoding yields the expected ID.

Could you please investigate this problem? Any assistance in resolving this would be greatly appreciated.

Thank you for your help.

DOGEwbx commented Jan 22, 2024

Hi @44670, thanks for your interest in DeepSeek models. The problem is explained in #1392. It cannot be resolved for the time being; we will update our tokenizer in subsequent model releases.
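
For readers without #1392 at hand, here is a minimal sketch of one plausible mechanism. It is an assumption, not a confirmed diagnosis: it supposes the tokenizer uses a GPT-2 style byte-level decoder and that the added token's content is passed through that decoder unchanged. The helper names (`bytes_to_unicode`, `byte_level_encode`, `byte_level_decode`) are hypothetical and are not part of the tokenizers API.

```python
def bytes_to_unicode():
    """GPT-2 style table mapping each byte value (0-255) to a printable character."""
    # Bytes that are already printable map to the character with the same code point.
    bs = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    cs = bs[:]
    n = 0
    # Remaining bytes are shifted to code points starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode(text: str) -> str:
    """Spell out the UTF-8 bytes of `text` using the byte-level alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def byte_level_decode(token_text: str) -> str:
    """Map each character back to one byte, then interpret the bytes as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token_text)
    return raw.decode("utf-8", errors="replace")


# Assumption: the added token stores the literal 'õ' rather than its byte-level form.
print(repr(byte_level_decode("õ")))                     # lone byte 0xF5 is not valid UTF-8
print(repr(byte_level_encode("õ")))                     # byte-level spelling of 'õ'
print(repr(byte_level_decode(byte_level_encode("õ"))))  # round trip
```

Under these assumptions, the literal 'õ' (U+00F5) is read as the single byte 0xF5, which is not valid UTF-8 on its own and therefore decodes to '�'. The byte-level spelling of the same character would be 'Ãµ' (bytes 0xC3 0xB5), which round-trips correctly.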

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Feb 22, 2024
github-actions bot closed this as not planned on Feb 28, 2024