Extending IndicBART or IndicBERT #57
Comments
Or, if I can continue pre-training the IndicBART and IndicBERT models on more Indian languages, preferably with additional language codes, that would be even better. I have a vague idea of how to do it, but not of how to do it with the toolkit code. Any help or pointers would be appreciated.
There is already an IndicBERT v2, which supports 24 Indic languages: https://huggingface.co/ai4bharat/IndicBERTv2-SS As for IndicBARTv2, one is in the works and should be out soon. Regarding what you want to do, you will need to figure out the following:
This will involve no change to the codebase.
The URL you included gives a 404 error. Is this described somewhere else?
I should be able to do this.
For this, can I simply load the pre-trained model and call the train script or function again on more data, or should I keep some other points in mind? And do the language codes matter, or can one simply reuse existing codes? In the paper, I remember reading that languages are not distinguished during training, to allow zero-shot learning. But I guess the code will matter when actually trying to translate. BTW, I also want to fine-tune on some basic NLP tasks such as POS tagging and NER. Will the pre-training differ for bilingual and monolingual tasks? There are, I think, two different scripts for monolingual and translation tasks.
The URL is working now. I have data for three languages: two of them are not in the list of 24, and for the third I may have some more data. For IndicBERT, I have posted an issue on their repository. For IndicBART, I am not clear about how to use the language codes when continuing pre-training or when fine-tuning. Should the pre-training be different for multilingual parallel corpora and multilingual monolingual corpora? Or will only the fine-tuning be different? In either case, how should I proceed? I have no prior experience working with BERT.
You will have to look into how to resize the embedding layer. You will have to hack YANMTT for this. Take a look here for hints: https://discuss.huggingface.co/t/adding-new-tokens-while-preserving-tokenization-of-adjacent-tokens/12604/3
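To make "resizing the embedding layer" concrete, here is a minimal pure-Python sketch of the idea, not YANMTT's or IndicBART's actual code. The codes `<2xx>` and `<2yy>` are hypothetical placeholders for new languages, the toy vocabulary and vectors are made up, and initializing new rows near the mean of the old rows is just one common heuristic that I am assuming here. In Hugging Face transformers the corresponding calls are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random


def add_language_tokens(vocab, new_tokens):
    """Append unseen tokens to a token->id dict and return the ids they got."""
    assigned = {}
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)  # next free id
        assigned[token] = vocab[token]
    return assigned


def resize_embeddings(matrix, new_vocab_size, seed=0):
    """Return a copy of `matrix` grown to `new_vocab_size` rows.

    Existing rows are copied unchanged. Each new row starts at the mean of
    the old rows plus small Gaussian noise (a heuristic assumed for this
    sketch, not necessarily what any particular toolkit does).
    """
    old_vocab_size = len(matrix)
    if new_vocab_size < old_vocab_size:
        raise ValueError("this sketch only grows the vocabulary")
    dim = len(matrix[0])
    rng = random.Random(seed)
    mean = [sum(row[j] for row in matrix) / old_vocab_size for j in range(dim)]
    resized = [list(row) for row in matrix]
    for _ in range(new_vocab_size - old_vocab_size):
        resized.append([m + rng.gauss(0.0, 0.01) for m in mean])
    return resized


# Toy vocabulary and 2-dimensional embeddings (illustrative values only).
vocab = {"<s>": 0, "</s>": 1, "<2hi>": 2, "<2bn>": 3}
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Hypothetical codes for two new languages.
new_ids = add_language_tokens(vocab, ["<2xx>", "<2yy>"])
embeddings = resize_embeddings(embeddings, len(vocab))
```

The point to keep in mind is that the old rows must stay untouched, so that everything the model already learned is preserved; only the new rows are trained from (near) scratch during continued pre-training.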
You can reuse them, but it's a hacky solution.
YANMTT is not designed for basic NLP tasks; it's designed for NLG tasks. However, you can treat the NLP task as an NLG task and see what happens. One script is for pre-training; however, the fine-tuning script can also be used indirectly for pre-training. I'm planning to retire the pre-training script, since the fine-tuning script can already do everything.
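As a rough illustration of treating a tagging task as an NLG task, the sketch below turns a POS-tagged sentence into a source/target text pair and maps generated output back to per-token tags. The function names, the `UNK` fallback tag, and the one-tag-per-token target format are all assumptions made for this example; none of them come from YANMTT.

```python
def tagging_to_seq2seq(tokens, tags):
    """Turn a tagged sentence into a (source, target) pair of strings."""
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one tag")
    return " ".join(tokens), " ".join(tags)


def seq2seq_to_tagging(source, generated, fallback="UNK"):
    """Align generated tags back to source tokens.

    A generation model may emit too few or too many tags, so the output is
    padded with `fallback` or truncated to the number of source tokens.
    """
    tokens = source.split()
    tags = generated.split()
    tags = (tags + [fallback] * len(tokens))[: len(tokens)]
    return list(zip(tokens, tags))


# One training example: the source goes in the source-side file and the
# target in the target-side file, one sentence per line.
source, target = tagging_to_seq2seq(
    ["राम", "घर", "गया"], ["PROPN", "NOUN", "VERB"]
)

# Decoding a (hypothetical) model output that is one tag short.
pairs = seq2seq_to_tagging(source, "PROPN NOUN")
```

With data in this shape, the tagging task looks like any other parallel corpus, so it can in principle be fed to a fine-tuning script that expects source and target text files. Whether the quality is competitive with a dedicated tagger is something you would have to check empirically.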
For fine-tuning you can take a look here: https://github.com/AI4Bharat/indic-bart
Rather than jumping into continued fine-tuning, I recommend that you first get used to pre-training models from scratch with YANMTT. Once you are familiar with this, things will get easier. Look into the examples folder for help.
I basically want to pre-train models from scratch, including the tokenizer, for the languages included in IndicBART and IndicBERT plus some more languages, so as to build something like IndicBARTExt and IndicBERTExt.
While going through some related issues, I noticed that there are some conventions about language codes. It is possible to use pre-existing language codes for new languages.
Is there some way to add new language codes, say, for IndicBART/IndicBERT without much change in Python code which is called from the pre-train shell script? Or will it require considerable changes?