Custom model with pretokenized input including multiword #56

Status: Open (2 commits, base: master)

ziqianPeng

Hello!
I'm trying to train a custom parser with trankit, using pretokenized input extracted from CoNLL-U files.

I may not be doing this the intended way, but with my approach I ran into bugs for French (multiword tokens) and for Chinese (a "KeyError UD-Japanese-Like" when I parse my test file right after training finishes), so I modified the source code to fix them. I also changed the path of the xlm_roberta model in file_utils.py so that it is downloaded only once when training multiple models of the same type, such as 'customized'.
The file train_pred_trainkit.py is an example of how to apply these modifications; see especially the function pred_trankit.
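
To make the multiword issue concrete, here is a minimal sketch of how pretokenized sentences might be extracted from CoNLL-U text. It is not the code from this PR: the function name `read_pretokenized`, the `keep_multiword` flag, and the sample sentence are all assumptions made for illustration. It relies only on the CoNLL-U convention that a multiword token has a range ID such as `3-4`, followed by its syntactic words `3` and `4`.

```python
def read_pretokenized(conllu_text, keep_multiword=True):
    """Return a list of sentences, each a list of token strings.

    Hypothetical helper, not part of trankit. If keep_multiword is True,
    a range line such as "3-4 du" contributes the surface form "du" and
    the syntactic words it covers ("de", "le") are skipped. Otherwise the
    range line is ignored and the syntactic words are kept.
    """
    sentences, tokens = [], []
    skip_until = 0  # highest word ID covered by the current multiword range
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if tokens:
                sentences.append(tokens)
            tokens, skip_until = [], 0
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "." in tok_id:  # empty node such as "8.1"
            continue
        if "-" in tok_id:  # multiword token range such as "3-4"
            if keep_multiword:
                tokens.append(form)
                skip_until = int(tok_id.split("-")[1])
            continue
        if int(tok_id) <= skip_until:  # word already covered by a range
            continue
        tokens.append(form)
    if tokens:  # file may not end with a blank line
        sentences.append(tokens)
    return sentences


# Made-up French sample in which "du" is the multiword token for "de" + "le".
SAMPLE = "\n".join([
    "# text = Il parle du chat.",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tparle\tparler\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tchat\tchat\tNOUN\t_\t_\t2\tobl\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
    "",
])

surface_tokens = read_pretokenized(SAMPLE, keep_multiword=True)
syntactic_words = read_pretokenized(SAMPLE, keep_multiword=False)
```

With the sample above, `surface_tokens` holds one sentence with five tokens (including "du"), while `syntactic_words` holds six (with "de" and "le" instead). Which of the two views a custom pipeline should receive as pretokenized input depends on how it was trained, so the flag is left to the caller.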

I hope this is helpful, and thanks a lot for developing trankit!
