
Support for CLIP tokenizers from Hugging Face #173

Open
dkalinowski opened this issue Jun 5, 2023 · 0 comments
dkalinowski commented Jun 5, 2023

Hello, I'm trying to use the BlingFire tools to build a tokenization model for CLIP out of the existing vocab.json/merges.txt files available here: https://huggingface.co/openai/clip-vit-base-patch32/tree/main

I tried the same approach given for RoBERTa: https://github.com/microsoft/BlingFire/tree/master/ldbsrc/roberta
However, the export_vocab script expects a Ġ prefix in the vocabulary, while CLIP's vocabulary uses </w> as a suffix rather than a prefix.
I modified the script to detect a trailing </w> instead of the Ġ prefix when appending 0x2581: https://github.com/microsoft/BlingFire/blob/master/ldbsrc/gpt2/export_vocab.py#L91
but this gives slightly different results than the Hugging Face tokenizer when dealing with punctuation:

Input string: "a photo of a really, functistaner big cat."

Hugging Face:
[49406, 320, 1125, 539, 320, 1414, 267, 8679, 555, 2203, 528, 1205, 2368, 269, 49407]
BlingFire:
[320, 1125, 539, 320, 1414, 11, 1499, 66, 555, 2203, 517, 1205, 2368, 13]
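In essence, the change replaces the Ġ-prefix check with a check for the trailing `</w>` marker, roughly like this (a simplified sketch, not the exact code from my branch):

```python
# Simplified sketch of the vocab-conversion change (illustrative only).
# GPT-2/RoBERTa vocabularies mark a word boundary with a leading "Ġ";
# CLIP marks the end of a word with a trailing "</w>" instead.
WORD_BOUNDARY = "\u2581"  # the 0x2581 marker appended by the export script

def convert_clip_token(token: str) -> str:
    """Map a CLIP BPE token to the boundary convention export_vocab expects."""
    if token.endswith("</w>"):
        # word-final piece: strip the CLIP suffix and append the marker
        return token[: -len("</w>")] + WORD_BOUNDARY
    # word-internal piece: keep unchanged
    return token
```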

Is there some way to make BlingFire support the CLIP version of the tokenizer?

My current scripts and reproduction steps:
https://github.com/dkalinowski/BlingFire/tree/clip/ldbsrc/clip
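
For reference, the two outputs above can be reproduced with a small script along these lines (simplified; it assumes the blingfire Python bindings' load_model / text_to_ids API and an illustrative ./clip.bin model path):

```python
# Side-by-side comparison of Hugging Face and BlingFire token ids (sketch).
from transformers import CLIPTokenizer
import blingfire

text = "a photo of a really, functistaner big cat."

# Hugging Face reference tokenization (adds the 49406/49407 BOS/EOS ids)
hf_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print("Hugging Face:", hf_tok.encode(text))

# BlingFire with the experimental CLIP model (path is illustrative)
h = blingfire.load_model("./clip.bin")
ids = blingfire.text_to_ids(h, text, 128, 0)             # max_len=128, unk_id=0
print("BlingFire:   ", [int(i) for i in ids if i != 0])  # drop 0-padding
blingfire.free_model(h)
```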
