What do you mean by vocabulary size? #5
Comments
@signalarun: It is the number of unique characters in your dataset. The vocabulary is generated by the code; the vocabulary size is just an upper bound on the number of unique characters.
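For a character-level model like this one, a quick way to pick a sensible upper bound for `en_vocab_size` / `hn_vocab_size` is to count the unique characters in the training file. This is a sketch of that idea, not code from this repo; the space-separated file format matches the examples later in this thread.

```python
# Count unique characters in a transliteration data file to pick a
# vocabulary-size upper bound. Assumes one word per line with spaces
# between characters, as in the training format used in this thread.

def count_unique_chars(lines):
    chars = set()
    for line in lines:
        chars.update(line.split())
    return len(chars)

# Example with an in-memory sample instead of a real file:
sample = ["a m m a", "a c h a n"]
print(count_unique_chars(sample))  # 5 unique characters: a, m, c, h, n
```

To read from a real file instead, pass `open(path, encoding="utf-8")` as `lines`.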
That means we must use custom values for en_vocab_size and hn_vocab_size.
@signalarun: Ideally, for transliteration you should choose 100 for both. But you can experiment with different values.
Tried the above for Malayalam, e.g. (a m m a -> അ മ ് മ). But after training it doesn't give correct output.
@signalarun: Could you give me some more details about how you trained it? It would be easier to track the issue if you could share your hyperparameters, dataset size, the output you are getting, how many iterations you trained it for, or maybe the dataset itself. If you can send me the dataset, I can do some debugging myself and explain what could have gone wrong.
Parameters are the same as given in the documentation, and the dataset size is around 70,000.
@signalarun: The default hyperparameters are for a huge network which requires a lot of data. I would suggest training with the following params.
It would be easier for me to debug if you could share a part of your dataset. It works for me on my English to Hindi dataset, and theoretically it should work on any other. If you could share the following with me, it would help too.
I had tried with `--size=1024 --num_layers=5 --steps_per_checkpoint=1000` for about 300,000 iterations.
@signalarun: Please share the bucket perplexity too. I also wanted to know whether you have a separate test set, as mentioned in the doc?
Ok |
With `--size=256`: global step 554000, learning rate 0.1237, step-time 0.08, perplexity 1.01. Test result (wrong). Source : `
@signalarun: The model seems to have converged, but it looks strangely overfitted to me. I tried training my English to Hindi model and it works perfectly fine. I had 150,000+ words in my dataset. Could you share the latest trained model with me so that I can debug it myself? That would be a lot easier.
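For context on why a perplexity of 1.01 suggests overfitting: perplexity is exp(average per-token cross-entropy loss), so 1.01 means the loss is nearly zero, i.e. the model is almost certain of every training character. A minimal sketch of the relationship (the loss value here is illustrative, not taken from the run):

```python
import math

def perplexity(avg_cross_entropy_loss):
    # Perplexity is the exponential of the average per-token loss.
    return math.exp(avg_cross_entropy_loss)

# Perplexity 1.01 corresponds to an average loss of about 0.00995,
# i.e. near-perfect confidence on the training buckets.
loss = math.log(1.01)
print(round(perplexity(loss), 2))  # 1.01
```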
Is the format that I had given correct? |
@signalarun: The source and target look perfect. What I doubt is:
I am not sure if you are giving the input in the right way in decode mode. In decode mode, after you see '>' you should just type a word, without spaces between the characters, and hit enter. Could you tell me which version of TensorFlow you are using? Are you using a separate evaluation set too? Because this is highly unexpected and weird. I do not have a sample dataset to train and test, and neither do I have a pretrained model that I can use to debug. I am just shooting in the dark here.
No, it's this way.
You should not be putting spaces between characters while decoding. But I am not sure if that is the issue. You can definitely try, though.
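To make the two conventions discussed above concrete: the training files list one word per line with spaces between characters, while the decode prompt expects the raw word with no spaces. A small sketch of the conversion (my reading of the thread, not code from this repo):

```python
# Training-file format vs. decode-prompt format, as described in the thread.

def to_training_format(word):
    """'amma' -> 'a m m a' (space-separated characters for the data files)."""
    return " ".join(word)

def to_decode_input(chars_line):
    """'a m m a' -> 'amma' (what you type after the '>' prompt)."""
    return chars_line.replace(" ", "")

print(to_training_format("amma"))  # a m m a
print(to_decode_input("a m m a"))  # amma
```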
Is it the number of words we provide, or the number of characters in the vocabulary file that is generated by the code?