
What do you mean by vocabulary size? #5

signalarun opened this issue Dec 24, 2016 · 16 comments

@signalarun

Is it the number of words we provide, or the number of characters in the vocabulary file that is generated by the code?

@dashayushman
Owner

dashayushman commented Dec 25, 2016

@signalarun : It is the number of unique characters in your dataset. The vocabulary is generated by the code. The vocabulary size is just an upper bound on the number of unique characters.
I hope that helps.
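
As an illustration (this is not the repository's code, and the file name is made up), you can count the unique characters in a space-separated training file to see the smallest vocabulary size that would work; en_vocab_size / hn_vocab_size just need to be at least that large, plus a few slots for special symbols such as _PAD:

```python
# Illustrative sketch only: count the unique characters in a training file
# where each line is one word with its characters separated by spaces.
from collections import Counter

def count_characters(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

vocab = count_characters("train.ml")  # hypothetical file name
print(len(vocab), "unique characters")
```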

@signalarun
Author

That means we must use custom values for en_vocab_size and hn_vocab_size.

@dashayushman
Owner

@signalarun : Ideally, for transliteration you should choose 100 for both. But you can experiment with different values.
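
In other words, something like --en_vocab_size=100 --hn_vocab_size=100, which leaves plenty of headroom above the actual number of unique characters in either script.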

@signalarun
Author

signalarun commented Dec 27, 2016

Tried the above for Malayalam, e.g. (a m m a -> അ മ ് മ). But after training it isn't giving the correct output.

@dashayushman
Owner

@signalarun: Could you give me some more details about how you trained it?

It would be easier to track the issue if you could share your hyperparameters, dataset size, the output you are getting, how many iterations you trained for, or maybe the dataset itself.

If you can send me the dataset, I can do some debugging myself and explain what could have gone wrong.

@signalarun
Author

The parameters are the same as given in the documentation, and the dataset size is around 70000.

@dashayushman
Owner

@signalarun : The default hyperparameters are for a huge network which requires a lot of data. I would suggest training with the following params:

  1. Size: 256
  2. Layers: 2
  3. Number of iterations: 150000 (minimum)
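
In flag form (using the same flags that appear elsewhere in this thread), that corresponds to --size=256 --num_layers=2 and letting training run for at least 150000 global steps.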

It would be easier for me to debug if you could share a part of your dataset. It works for me on my English-to-Hindi dataset, and theoretically it should work on any other. It would also help if you could share the following:

  1. The number of iterations you trained for
  2. The output you are getting (a few examples, maybe)

@signalarun
Author

I had tried with --size=1024 --num_layers=5 --steps_per_checkpoint=1000 for about 300000 iterations.
I shall reply after trying with the above parameters.

@dashayushman
Owner

dashayushman commented Dec 27, 2016

@signalarun : Please share the bucket perplexities too. I also wanted to know whether you have a separate test set, as mentioned in the doc.
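
For reference (this is the general definition, not code taken from this repository), each reported perplexity is the exponential of the average per-character cross-entropy loss on that bucket, so a value close to 1.0 means the model is almost certain about every character it predicts:

```python
# General definition of the perplexities printed during training/eval:
# perplexity = exp(average cross-entropy loss per predicted character).
import math

def perplexity(avg_cross_entropy):
    return math.exp(avg_cross_entropy)

print(perplexity(0.01))  # ~1.01, i.e. the model is rarely surprised
```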

@signalarun
Author

Ok

@signalarun
Author

--size=256
--num_layers=2
--learning_rate=0.5
--learning_rate_decay_factor=0.99
--en_vocab_size=31
--hn_vocab_size=77
--steps_per_checkpoint=1000
dataset size: 700000

global step 554000 learning rate 0.1237 step-time 0.08 perplexity 1.01
eval: empty bucket 0
eval: bucket 1 perplexity 2.25
eval: bucket 2 perplexity 1.52
eval: bucket 3 perplexity 1.16
eval: bucket 4 perplexity 1.12
eval: bucket 5 perplexity 1.26

test result (wrong)
t h a n d i l _PAD_PADഒഒഒഒഒഒഒഒലലലഒഒ
Some sample training data.

Target:

(most of the single-character target lines were lost in the paste; only ി is still visible)
‌ എ ന ് ന ു
എ ന ് ന
ഞ ാ ൻ
എ ന ് റ െ
സ ം
എ ന ് ന ്
പ ര ി
അ വ
ബ െ ർ ള ി
പ ് ര ത ി
ക ൊ ണ ് ട ്
ന ി ന ് ന ്
പ ോ ല െ
ഉ പ
ന ട
അ ന ു
ഇ വ ി ട െ
പ ഴ ഞ ് ച ൊ ല ് ല ു ക ൾ
അ വ ർ
ന ി ന ് ന ു
മ ാ ത ് ര ം
ര ണ ് ട ു
ത ോ മ സ ്
മ ന
പ ു ത ി യ
പ ോ ല ു ം
എ ന ് ന െ
ക ഴ ി
ന മ ് മ ു ട െ
ഗ ൈ ഡ ്

Source:
m
a
a
a a
i
e e
u
o o
r u
e
e
a i
o
o
a u
k a
k h a
g a
g h a
n g a
c h a
c h h a
j a
j h a
n j a
t a
d t a
d a
d d a
n a
t h a
t h a
d a
d h a
n a
p a
p h a
b a
b h a
m a
y a
r a
r a
l a
l a
z h a
v a
s h a
s h a
s a
h a
a a
i
e e
u
o o
r u
e
e
y
o
o
o u

a u
n
n
r
l
l
‌ e n n u
e n n a
n j a a n
e n t e
s a m
e n n u
p a r i
a v a
b e r l i
p r a t h i
k o n d u
n i n n u
p o l e
u p a
n a t a
a n u
i v i t e
p a z h a n c h o l l u k a l
a v a r
n i n n u
m a a t h r a m
r a n d u
t h o m a s
m a n a
p u t h i y a
p o l u m
e n n e
k a z h i
n a m m u t e
g y d


@dashayushman
Owner

dashayushman commented Dec 28, 2016

@signalarun : The model seems to have converged, but it looks weirdly overfitted to me. I tried training my English-to-Hindi model and it works perfectly fine; I had 150000+ words in my dataset. Could you share the latest trained model with me so that I can debug it myself? That would be a lot easier.

@signalarun
Author

Is the format that I had given correct?

@dashayushman
Owner

@signalarun : The source and target look perfect. What I doubt is this:

t h a n d i l _PAD_PAD

I am not sure whether you are giving the input the right way in decode mode.

In decode mode, after you see '>', you should just type a word without spaces between the characters and hit enter.

Could you tell me which version of TensorFlow you are using? Are you using a separate evaluation set too? This is highly unexpected and weird. I do not have a sample dataset to train and test with, and neither do I have a pretrained model that I can use to debug, so I am just shooting in the dark here.
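
That is, at the '>' prompt the word would be entered as `thandil` rather than `t h a n d i l`.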

@signalarun
Author

No, it's in this way:
t h a n d i l -> _PAD_PADഒഒഒഒഒഒഒഒലലലഒഒ

@sigvoiced

You should not be putting spaces between characters while decoding. But I am not sure whether that is the issue. You can definitely try, though.
