==========================================================================================
Introduction to Natural Language Processing Assignment 1
==========================================================================================
1. In this assignment, you will be asked to:
- generate batches for the skip-gram model (word2vec_basic.py)
- implement two loss functions to train word embeddings (loss_func.py)
- tune the parameters for word embeddings
- apply the learned word embeddings to a word analogy task (word_analogy.py)
2. Generating batch
You will generate a small subset of the training data, which is called a batch.
For skip-gram model, you will slide a window
and sample training instances from the data inside the window.
[Example]
Suppose that we have a text: "The quick brown fox jumps over the lazy dog."
And batch_size = 8, window_size = 3
"[The quick brown] fox jumps over the lazy dog"
The context word would be 'quick' and the predicted words are 'The' and 'brown'.
This will generate training examples:
context(x), predicted_word(y)
(quick , The)
(quick , brown)
And then move the sliding window.
"The [quick brown fox] jumps over the lazy dog"
In the same way, we have two more examples:
(brown, quick)
(brown, fox)
Moving the window again:
"The quick [brown fox jumps] over the lazy dog"
We get,
(fox, brown)
(fox, jumps)
Finally we get two more instances from the moved window,
"The quick brown [fox jumps over] the lazy dog"
(jumps, fox)
(jumps, over)
Since we now have 8 training instances, which is the batch size,
we stop generating and return the batch data.
data_index is the index of a word. You can access a word using data[data_index].
batch_size is the number of instances in one batch.
num_skips is the number of samples you want to draw in a window (in the example, it was 2).
skip_windows decides how many words to consider to the left and right of a context word (so, skip_windows*2+1 = window_size).
batch will contain word ids for context words. Its dimension is [batch_size].
labels will contain word ids for predicted words. Its dimension is [batch_size, 1].
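[Sketch] Below is a minimal, self-contained Python sketch of the batch generation described above. It follows the names used in this README (data, data_index, batch_size, num_skips, skip_windows); the exact function signature expected in word2vec_basic.py may differ, and a real implementation may sample the num_skips targets randomly instead of left-to-right.

    import numpy as np

    def generate_batch(data, data_index, batch_size, num_skips, skip_windows):
        # data         : list/array of word ids for the whole corpus
        # data_index   : position of the first word of the current window
        # batch_size   : number of (context, predicted_word) pairs to return
        # num_skips    : pairs drawn per window
        # skip_windows : words considered on each side (window_size = skip_windows*2 + 1)
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_windows

        batch = np.zeros(batch_size, dtype=np.int32)        # context word ids, [batch_size]
        labels = np.zeros((batch_size, 1), dtype=np.int32)  # predicted word ids, [batch_size, 1]
        span = 2 * skip_windows + 1                         # full window length

        for i in range(batch_size // num_skips):
            window = [data[(data_index + j) % len(data)] for j in range(span)]
            center = window[skip_windows]                   # the context word in the middle
            targets = [w for j, w in enumerate(window) if j != skip_windows]
            for k in range(num_skips):
                batch[i * num_skips + k] = center
                labels[i * num_skips + k, 0] = targets[k]
            data_index = (data_index + 1) % len(data)       # slide the window by one word
        return batch, labels, data_index

With the example above ("The quick brown fox jumps over the lazy dog", batch_size=8, num_skips=2, skip_windows=1), this produces exactly the 8 (context, predicted_word) pairs listed.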
3. Analogies using word vectors
You will use the word vectors you learned from both approaches (cross entropy and NCE) in the following word analogy task.
Each question/task is in the following form.
-------------------------------------------------------------------------------------
Consider the following word pairs that share the same relation, R:
pilgrim:shrine, hunter:quarry, assassin:victim, climber:peak
Among these word pairs,
(1) pig:mud
(2) politician:votes
(3) dog:bone
(4) bird:worm
Q1. Which word pair is the MOST illustrative (similar) example of the relation R?
Q2. Which word pair is the LEAST illustrative (similar) example of the relation R?
-------------------------------------------------------------------------------------
One simple method to answer these questions is to measure the similarities of difference vectors.
[Difference Vector]
Recall that vectors represent directions in space.
If the pairs (a, b) and (c, d) are analogous, then the transformation from a to b (i.e., some vector x that, when added to a, gives b: a + x = b)
should be highly similar to the transformation from c to d (i.e., some vector y that, when added to c, gives d: c + y = d).
In other words, the difference vector (b - a) should be similar to the difference vector (d - c).
This difference vector can be thought to represent the relation between the two words.
* Due to the noisy annotation data, the expected accuracy is not high.
The pre-trained model scores 31.4% overall accuracy. Improving this score by 1~3% is your goal.
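[Sketch] A minimal Python illustration of this difference-vector approach. This is not the required word_analogy.py interface; the embeddings matrix and word_to_id lookup are assumed to come from your trained word2vec model, and the helper names are hypothetical.

    import numpy as np

    def pair_vector(pair, embeddings, word_to_id):
        # Difference vector (b - a) for a word pair written as "a:b".
        a, b = pair.split(':')
        return embeddings[word_to_id[b]] - embeddings[word_to_id[a]]

    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

    def rank_choices(example_pairs, choice_pairs, embeddings, word_to_id):
        # Score each choice pair by its average cosine similarity to the example
        # difference vectors; return (least_illustrative, most_illustrative).
        example_vecs = [pair_vector(p, embeddings, word_to_id) for p in example_pairs]
        scores = []
        for choice in choice_pairs:
            v = pair_vector(choice, embeddings, word_to_id)
            scores.append(np.mean([cosine(v, e) for e in example_vecs]))
        least = choice_pairs[int(np.argmin(scores))]
        most = choice_pairs[int(np.argmax(scores))]
        return least, most

Averaging the similarity over all example pairs is only one reasonable choice; averaging the example difference vectors first, or using other similarity measures, are equally valid starting points.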
4. This package contains several files:
For word2vec
- word2vec_basic.py:
This file is the main script for training the word2vec model.
You should fill out the part that generates batch from training data.
Your trained model will be saved as word2vec.model
Usage:
python word2vec_basic.py [cross_entropy | nce]
- loss_func.py
This file has two loss functions that you must complete (a conceptual sketch of the cross-entropy form follows this 'For word2vec' file group).
1. cross_entropy_loss - The loss we used in class. See lecture notes on word representations for details.
2. nce_loss - Noise contrastive estimation loss described in the assignment pdf.
- pretrained/word2vec
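[Sketch] A conceptual NumPy illustration of the skip-gram cross-entropy form mentioned above for loss_func.py. This is not the required implementation; names and shapes here are hypothetical, and the real functions operate on whatever tensors word2vec_basic.py passes in. For a context embedding v_c, candidate output embeddings u_w, and true predicted word o, the loss is log(sum_w exp(u_w . v_c)) - u_o . v_c.

    import numpy as np

    def cross_entropy_loss_sketch(center_vec, output_embeddings, true_idx):
        # center_vec        : [dim] embedding of the context word, v_c
        # output_embeddings : [num_candidates, dim] embeddings u_w of candidate output words
        # true_idx          : row index of the true predicted word o
        scores = output_embeddings @ center_vec          # u_w . v_c for every candidate
        log_partition = np.log(np.sum(np.exp(scores)))   # log(sum_w exp(u_w . v_c))
        true_score = scores[true_idx]                    # u_o . v_c
        return log_partition - true_score                # -log of the softmax probability of o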
For analogy task
- word_analogy.py
You will write code in this file to evaluate relations between pairs of words -- a so-called MaxDiff question.
(https://en.wikipedia.org/wiki/MaxDiff)
You will generate a file with your predictions following the format of "word_analogy_sample_predictions.txt"
- score_maxdiff.pl
A Perl script to evaluate YOUR PREDICTIONS on the development data.
Usage:
./score_maxdiff.pl word_analogy_dev_mturk_answers.txt word_analogy_test_predictions_cross_entropy.txt o1.txt
./score_maxdiff.pl word_analogy_dev_mturk_answers.txt word_analogy_test_predictions_nce.txt o2.txt
- word_analogy_dev.txt
Data for development.
Each line of this file is divided into "examples" and "choices" by "||".
[examples]||[choices]
"Examples" and "choices" are delimited by a comma.
ex) "tailor:suit","oracle:prophesy","baker:flour"
- word_analogy_dev_sample_predictions.txt
A sample prediction file. Pay attention to the format of this file.
Your prediction file should follow this to use "score_maxdiff.pl" script.
Each row is in this format:
<pair1> <pair2> <pair3> <pair4> <least_illustrative_pair> <most_illustrative_pair>
The order of word pairs should match their original order found in "word_analogy_dev.txt".
- word_analogy_dev_mturk_answers.txt
These are the answers collected using Amazon Mechanical Turk for "word_analogy_dev.txt".
The answers in this file are used as the correct answers to evaluate your analogy predictions (using "score_maxdiff.pl").
For your information, the answers here are a little bit noisy.
- word_analogy_test.txt
Test data file. When you are done experimenting with your models, you will generate predictions for this test data using your best models (NCE/cross entropy).
5. What you should do:
1. Implement batch generation
2. Implement Cross Entropy Loss
3. Implement NCE Loss
4. Try different parameters to train your word2vec model
5. Write code to evaluate relations between word pairs
6. Find the least illustrative pair and most illustrative pair among four pairs of words.
6. What you should submit:
** PAY ATTENTION TO THE FILENAMES!!!
I will use scripts to organize your code/files.
Create a single zip file ( <SBUID>.zip ) containing the following files:
Done 1. word2vec_basic.py
Done 2. loss_func.py
3. your best word2vec models for Cross Entropy loss and NCE loss
- word2vec_cross_entropy.model
- word2vec_nce.model
Done 4. word_analogy.py
5. Predictions generated for word_analogy_test.txt
- word_analogy_test_predictions_cross_entropy.txt
- word_analogy_test_predictions_nce.txt
Done 6. README file with explanation of your implementation
Done 7. Report as detailed in the assignment pdf
- report_<SBUID>.pdf