
Support for CLIP tokenizers from Hugging Face #173

Open
dkalinowski opened this issue Jun 5, 2023 · 0 comments
dkalinowski commented Jun 5, 2023

Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP out of the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main

I tried the same approach given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta
However, the export_vocab script expects a Ġ prefix in the vocabulary, while CLIP's vocabulary uses </w> as a suffix rather than a prefix.
I modified the script to detect a trailing </w> instead of the Ġ prefix when appending 0x2581: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91
but this gives slightly different results than the Hugging Face tokenizer when dealing with punctuation:

Input string: "a photo of a really, functistaner big cat."

Hugging Face:
[49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:
[320, 1125, 539, 320, 1414, 11, 1499, 66, 555, 2203, 517, 1205, 2368, 13]
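In essence, the change replaces the Ġ-prefix check with a check for the trailing `</w>` marker, roughly like this (a simplified sketch, not the exact code from my branch):

```python
# Simplified sketch of the vocab-conversion change (illustrative only).
# GPT-2/RoBERTa vocabularies mark a word boundary with a leading "Ġ";
# CLIP marks the end of a word with a trailing "</w>" instead.
WORD_BOUNDARY = "\u2581"  # the 0x2581 marker appended by the export script

def convert_clip_token(token: str) -> str:
    """Map a CLIP BPE token to the boundary convention export_vocab expects."""
    if token.endswith("</w>"):
        # word-final piece: strip the CLIP suffix and append the marker
        return token[: -len("</w>")] + WORD_BOUNDARY
    # word-internal piece: keep unchanged
    return token
```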

Is there some way to make BlingFire support the CLIP version of the tokenizer?

My current scripts and reproduction steps:
https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip
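
For reference, the two outputs above can be reproduced with a small script along these lines (simplified; it assumes the blingfire Python bindings' load_model / text_to_ids API and an illustrative ./clip.bin model path):

```python
# Side-by-side comparison of Hugging Face and BlingFire token ids (sketch).
from transformers import CLIPTokenizer
import blingfire

text = "a photo of a really, functistaner big cat."

# Hugging Face reference tokenization (adds the 49406/49407 BOS/EOS ids)
hf_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print("Hugging Face:", hf_tok.encode(text))

# BlingFire with the experimental CLIP model (path is illustrative)
h = blingfire.load_model("./clip.bin")
ids = blingfire.text_to_ids(h, text, 128, 0)             # max_len=128, unk_id=0
print("BlingFire:   ", [int(i) for i in ids if i != 0])  # drop 0-padding
blingfire.free_model(h)
```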
