Vocabulary for the pre-trained model is not updated? Any reason why? #31
You are correct that the clinicalBERT models use the exact same vocabulary as the original BERT models. This is because we first initialized the models with the BERT base parameters and then further trained the masked LM & next sentence prediction heads on MIMIC data. While training BERT from scratch on clinical data with a clinical vocabulary would certainly be better, training from scratch is very expensive (i.e. it requires extensive GPU resources & time).

That being said, BERT uses word pieces for its vocabulary rather than just whole words. Traditionally in NLP, any word not found in the vocabulary is represented as an UNKNOWN token, which makes it difficult to generalize to new domains. Because BERT uses word pieces, this problem is not as severe: if a word does not appear in the BERT vocabulary during preprocessing, it is broken down into its word pieces. For example, penicillin may not be in the BERT vocabulary, but perhaps the word pieces "pen", "i", and "cillin" are present. In that case, the word piece "pen" would likely have a very different contextual embedding in clinicalBERT compared to general-domain BERT, because it is frequently found in the context of a drug.

In the paper, we show that the nearest neighbors of embeddings of disease- and operations-related words make more sense when the words are embedded by clinicalBERT compared to bioBERT & general BERT.

Unfortunately, we don't have an uncased version of the model at this time. Hope this helps!
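As a concrete illustration of the word-piece fallback (a minimal sketch using the Hugging Face transformers tokenizer; the "emilyalsentzer/Bio_ClinicalBERT" hub name is an assumption here, so substitute whichever checkpoint you actually load):

```python
# Sketch: how a word missing from the (unchanged) BERT vocabulary gets broken
# down into word pieces. Assumes the Hugging Face `transformers` library; the
# hub name below is an assumption -- point it at your own checkpoint if needed.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
vocab = tokenizer.get_vocab()  # token -> id mapping, same vocabulary as original BERT

for word in ["penicillin", "antibiotics", "hospital"]:
    # If the word is not a whole token in the vocab, WordPiece splits it into pieces.
    print(word, word in vocab, tokenizer.tokenize(word))
```

The tokens themselves are shared with general-domain BERT; what changes after further pre-training on MIMIC is the weights and contextual embeddings attached to them.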
I think BioBERT updated their model recently (or at least after clinicalBERT was published). The model we compared to in our paper had the same vocabulary as BERT. Check out the issue on their GitHub where someone had a similar question to yours. I do agree with you that a custom vocabulary would likely be better. I don't currently have the bandwidth to train it, but if you end up doing so, let us know!
Thanks for making such a comprehensive BERT model.
I am worried about the actual words that I find in the model, though.
The author mentions that "The Bio_ClinicalBERT model was trained on all notes from MIMIC III, a database containing electronic health records from ICU patients at the Beth Israel Hospital in Boston, MA. For more details on MIMIC". I assumed this would mean that the vocab was also updated.
But when I look at the vocabulary words, I don't see medical concepts:
['Cafe', 'locomotive', 'sob', 'Emilio', 'Amazing', '##ired', 'Lai', 'NSA', 'counts', '##nius', 'assumes', 'talked', 'ク', 'rumor', 'Lund', 'Right', 'Pleasant', 'Aquino', 'Synod', 'scroll', '##cope', 'guitarist', 'AB', '##phere', 'resulted', 'relocation', 'ṣ', 'electors', '##tinuum', 'shuddered', 'Josephine', '"', 'nineteenth', 'hydroelectric', '##genic', '68', '1000', 'offensive', 'Activities', '##ito', 'excluded', 'dictatorship', 'protruding', '1832', 'perpetual', 'cu', '##36', 'outlet', 'elaborate', '##aft', 'yesterday', '##ope', 'rockets', 'Eduard', 'straining', '510', 'passion', 'Too', 'conferred', 'geography', '38', 'Got', 'snail', 'cellular', '##cation', 'blinked', 'transmitted', 'Pasadena', 'escort', 'bombings', 'Philips', '##cky', 'sacks', '##Ñ', 'jumps', 'Advertising', 'Officer', '##ulp', 'potatoes', 'concentration', 'existed', '##rrigan', '##ier', 'Far', 'models', 'strengthen', 'mechanics'...]
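For reference, a rough sketch of how I listed it (the path below is illustrative; I'm reading the standard BERT-format vocab.txt that ships with the checkpoint):

```python
# Sketch: listing the released vocabulary directly from vocab.txt.
# The path is illustrative -- adjust it to wherever the checkpoint was downloaded.
with open("Bio_ClinicalBERT/vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

print(len(vocab))        # matches the original BERT vocabulary size
print(vocab[2000:2085])  # a slice of tokens like the sample above
```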
Am I missing something here?
Also, is there an uncased version of this model?