Files for BioBERT tokenizer #11
Comments
I would be interested in this question, too. Did you ever find out more about it?
I had a deadline, so I used BERT, but I will delve into it again.
Hi, sorry for the inconvenience. The BERT tokenizer is exactly the same as the BioBERT tokenizer. The files you are mentioning seem to come from a newer version of BERT's vocabulary, which will be incompatible unless you modify the code. You can just use BioBERT's vocabulary, which is provided along with the pre-trained BioBERT files.
Thank you, @jhyuklee! Yeah, that's what I did in the end, and it seems to be working OK.
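A minimal sketch of this kind of workaround (not the exact code from this thread), assuming the transformers library, a local directory containing the released BioBERT weights and its vocab.txt, and a cased vocabulary; the directory name below is a placeholder:

from transformers import BertTokenizer

# BioBERT reuses BERT's WordPiece tokenization, so the stock BertTokenizer
# can be pointed at the vocab.txt shipped with the pre-trained BioBERT weights.
# 'biobert_v1.1_pubmed' is a placeholder for the local weights directory.
tokenizer = BertTokenizer.from_pretrained(
    'biobert_v1.1_pubmed',   # directory containing vocab.txt
    do_lower_case=False,     # assumption: BioBERT's vocabulary is cased
)

tokens = tokenizer.tokenize("The patient was given 50 mg of propranolol.")
print(tokens)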
In order to use the tokenizer from BioBERT, the program requires the BioBERT tokenizer files:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('BioBERT_DIR/BioBERT_tokenizer_files')
These files are generated when one saves a tokenizer using the following command:
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')
This should save a set of tokenizer files. However, I am not able to find these files in the pre-trained BioBERT weights directory.
From this post, I understand that this is linked to issue #1. Does this mean one needs to use the tokenizer from BERT and not BioBERT? Which BERT tokenizer will be compatible with BioBERT?
I will be grateful for your response.
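For context, a small sketch of the save/load round trip described above, assuming the transformers library; the model identifier and directory names are placeholders, and the exact set of files written by save_pretrained depends on the library version:

from transformers import BertTokenizer

# Standard BERT WordPiece tokenizer; BioBERT uses the same scheme, so this
# tokenizer is compatible with BioBERT once it is given BioBERT's vocab.txt.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# Saving typically writes vocab.txt, special_tokens_map.json and
# tokenizer_config.json (plus added_tokens.json if tokens were added).
tokenizer.save_pretrained('./my_saved_biobert_model_directory/')

# The saved directory can then be loaded back in the usual way.
tokenizer = BertTokenizer.from_pretrained('./my_saved_biobert_model_directory/')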