
What do you mean by vocabulary size? #5

signalarun opened this issue Dec 24, 2016 · 16 comments

@signalarun

Is it the number of words we provide, or the number of characters in the vocabulary file that is generated by the code?

@dashayushman
Owner

dashayushman commented Dec 25, 2016

@signalarun : It is the number of unique characters in your dataset. The vocabulary is generated by the code. The vocabulary size is just an upper bound on the number of unique characters.
I hope that helps.
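
As an illustration (this is not the repository's code, and the file name is made up), you can count the unique characters in a space-separated training file to see the smallest vocabulary size that would work; en_vocab_size / hn_vocab_size just need to be at least that large, plus a few slots for special symbols such as _PAD:

```python
# Illustrative sketch only: count the unique characters in a training file
# where each line is one word with its characters separated by spaces.
from collections import Counter

def count_characters(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

vocab = count_characters("train.ml")  # hypothetical file name
print(len(vocab), "unique characters")
```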

@signalarun
Author

That means we must use custom values for en_vocab_size and hn_vocab_size.

@dashayushman
Owner

@signalarun : Ideally, for transliteration you should choose 100 for both. But you can experiment with different values.
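
In other words, something like --en_vocab_size=100 --hn_vocab_size=100, which leaves plenty of headroom above the actual number of unique characters in either script.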

@signalarun
Author

signalarun commented Dec 27, 2016

Tried the above for Malayalam, e.g. (a m m a -> അ മ ് മ). But after training it isn't giving the correct output.

@dashayushman
Owner

@signalarun: Could you give me some more details about how you trained it?

It would be easier to track the issue if you could share your hyperparameters, dataset size, the output you are getting, how many iterations you trained for, or maybe the dataset itself.

If you can send me the dataset, I can do some debugging myself and explain what could have gone wrong.

@signalarun
Author

The parameters are the same as given in the documentation, and the dataset size is around 70000.

@dashayushman
Owner

@signalarun : The default hyperparameters are for a huge network which requires a lot of data. I would suggest training with the following params:

  1. Size: 256
  2. Layers: 2
  3. Number of iterations: 150000 (minimum)
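
In flag form (using the same flags that appear elsewhere in this thread), that corresponds to --size=256 --num_layers=2 and letting training run for at least 150000 global steps.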

It would be easier for me to debug if you could share a part of your dataset. It works for me on my English-to-Hindi dataset, and theoretically it should work on any other. It would also help if you could share the following:

  1. The number of iterations you trained for
  2. The output you are getting (a few examples, maybe)

@signalarun
Author

I had tried with --size=1024 --num_layers=5 --steps_per_checkpoint=1000 for about 300000 iterations.
I shall reply after trying with the above parameters.

@dashayushman
Owner

dashayushman commented Dec 27, 2016

@signalarun : Please share the bucket perplexities too. I also wanted to know whether you have a separate test set, as mentioned in the doc.
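
For reference (this is the general definition, not code taken from this repository), each reported perplexity is the exponential of the average per-character cross-entropy loss on that bucket, so a value close to 1.0 means the model is almost certain about every character it predicts:

```python
# General definition of the perplexities printed during training/eval:
# perplexity = exp(average cross-entropy loss per predicted character).
import math

def perplexity(avg_cross_entropy):
    return math.exp(avg_cross_entropy)

print(perplexity(0.01))  # ~1.01, i.e. the model is rarely surprised
```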

@signalarun
Author

Ok

@signalarun
Author

--size=256
--num_layers=2
--learning_rate=0.5
--learning_rate_decay_factor=0.99
--en_vocab_size=31
--hn_vocab_size=77
--steps_per_checkpoint=1000
dataset size: 700000

global step 554000 learning rate 0.1237 step-time 0.08 perplexity 1.01
eval: empty bucket 0
eval: bucket 1 perplexity 2.25
eval: bucket 2 perplexity 1.52
eval: bucket 3 perplexity 1.16
eval: bucket 4 perplexity 1.12
eval: bucket 5 perplexity 1.26

test result (wrong)
t h a n d i l _PAD_PADഒഒഒഒഒഒഒഒലലലഒഒ
Some sample training data.

Target:

(most of the single-character target lines were lost in the paste; only ി is still visible)
‌ എ ന ് ന ു
എ ന ് ന
ഞ ാ ൻ
എ ന ് റ െ
സ ം
എ ന ് ന ്
പ ര ി
അ വ
ബ െ ർ ള ി
പ ് ര ത ി
ക ൊ ണ ് ട ്
ന ി ന ് ന ്
പ ോ ല െ
ഉ പ
ന ട
അ ന ു
ഇ വ ി ട െ
പ ഴ ഞ ് ച ൊ ല ് ല ു ക ൾ
അ വ ർ
ന ി ന ് ന ു
മ ാ ത ് ര ം
ര ണ ് ട ു
ത ോ മ സ ്
മ ന
പ ു ത ി യ
പ ോ ല ു ം
എ ന ് ന െ
ക ഴ ി
ന മ ് മ ു ട െ
ഗ ൈ ഡ ്

Source:
m
a
a
a a
i
e e
u
o o
r u
e
e
a i
o
o
a u
k a
k h a
g a
g h a
n g a
c h a
c h h a
j a
j h a
n j a
t a
d t a
d a
d d a
n a
t h a
t h a
d a
d h a
n a
p a
p h a
b a
b h a
m a
y a
r a
r a
l a
l a
z h a
v a
s h a
s h a
s a
h a
a a
i
e e
u
o o
r u
e
e
y
o
o
o u

a u
n
n
r
l
l
‌ e n n u
e n n a
n j a a n
e n t e
s a m
e n n u
p a r i
a v a
b e r l i
p r a t h i
k o n d u
n i n n u
p o l e
u p a
n a t a
a n u
i v i t e
p a z h a n c h o l l u k a l
a v a r
n i n n u
m a a t h r a m
r a n d u
t h o m a s
m a n a
p u t h i y a
p o l u m
e n n e
k a z h i
n a m m u t e
g y d


@dashayushman
Owner

dashayushman commented Dec 28, 2016

@signalarun : The model seems to have converged, but it looks weirdly overfitted to me. I tried training my English-to-Hindi model and it works perfectly fine; I had 150000+ words in my dataset. Could you share the latest trained model with me so that I can debug it myself? That would be a lot easier.

@signalarun
Author

Is the format that I had given correct?

@dashayushman
Owner

@signalarun : The source and target look perfect. What I doubt is this:

t h a n d i l _PAD_PAD

I am not sure whether you are giving the input the right way in decode mode.

In decode mode, after you see '>', you should just type a word without spaces between the characters and hit enter.

Could you tell me which version of TensorFlow you are using? Are you using a separate evaluation set too? This is highly unexpected and weird. I do not have a sample dataset to train and test with, and neither do I have a pretrained model that I can use to debug, so I am just shooting in the dark here.
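
That is, at the '>' prompt the word would be entered as `thandil` rather than `t h a n d i l`.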

@signalarun
Author

No, it's in this way:
t h a n d i l -> _PAD_PADഒഒഒഒഒഒഒഒലലലഒഒ

@sigvoiced

You should not be putting spaces between characters while decoding. But I am not sure whether that is the issue. You can definitely try, though.
