DPTSG
January 2010, Matt Post
--
The code contained in this project directory was used to obtain the
results in our ACL 2009 short paper:
"Bayesian learning of a tree substitution grammar"
Matt Post & Daniel Gildea
ACL 2009
If you want to learn how the code works, along with all the caveats,
please see the file README.code. This file describes the project
files, along with the steps needed to prepare all of the data to do
Gibbs sampling.
-- INSTALLATION AND WALK-THROUGH -------------------------------------
All files and code herein assume a base directory of
$ENV{DPTSG}. You should set this in your profile, e.g., in bash,
$ export DPTSG=$HOME/code/dptsg
You will also need to put this directory in your Perl search path:
$ export PERL5LIB+=:$DPTSG
* List of programs in scripts/
** oneline_treebank.pl
Takes (on STDIN) a multi-line treebank tree and condenses it to a
single line (on STDOUT).
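For example, a hypothetical multi-line tree (the exact whitespace in
the Treebank's .mrg files varies):
  ( (S
      (NP (DT The) (NN boy))
      (VP (VBD was))
      (. .)) )
becomes the single line
  ( (S (NP (DT The) (NN boy)) (VP (VBD was)) (. .)) )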
** clean.pl
Removes empty nodes and Treebank v2 information from node labels.
** build_lex.pl
Builds the lexicon. This is used by most programs to determine
which words should be converted to one of the ~ 60 unknown word
classes.
** print_leaves.pl
Takes a oneline tree (on STDIN) and prints its frontier, i.e., the
words at its leaves (on STDOUT).
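For example, the oneline tree
  (S (NP (DT the) (NN boy)) (VP (VBD was)) (. .))
should yield the frontier
  the boy was .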
** extract_lines.pl
Takes a list of line numbers and a file, and prints out only lines
from the file corresponding to those in the list of line numbers.
** annotate_spinal_grammar.pl
Takes a treebank (one tree per line) and annotates it with the
derivations from the spinal grammar, using the Magerman head
selection rules.
** rule_probs.pl
Produces a grammar from a combination of rule counts in the form
<count> <rule>
and oneline trees (these can be interspersed).
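For example, a counts file might contain lines like the following
(hypothetical counts):
  1234 (NP (DT the) NN)
    56 (VBD was)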
** extract_bod01_grammar.pl
Samples random trees at various heights from a treebank,
reproducing the "minimal subset" grammar of Bod.
* Preparing the data
** Process the trees into a nice format.
You need a copy of the WSJ portion of the Penn treebank.
This gets rid of Treebank v2 information and removes -NONE- nodes
and traces.
$ cd $DPTSG
$ mkdir data
$ (cd /p/mt/corpora/wsj; cat {02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21}/*.mrg) | \
  ./scripts/oneline_treebank.pl > data/wsj.trees.02-21
$ cat data/wsj.trees.02-21 | ./scripts/clean.pl > data/wsj.trees.02-21.clean
** Generate the lexicon
This determines which words get transformed into one of the ~ 60
unknown word bins. It writes (on STDOUT) a vocabulary file with
lines of the form "ID WORD COUNT", for all words found in the
training trees.
$ ./scripts/build_lex.pl < data/wsj.trees.02-21.clean > data/lex.02-21
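For example, the first few lines might look like this (hypothetical
IDs and counts):
  1 the 39832
  2 of 18542
  3 to 17123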
** Annotate the training data with spinal annotations
A TSG derivation tree is represented by prepending a '*' to the
labels of nodes that are internal to a TSG rule, e.g., the tree
(S (*NP (*DT the) (NN boy)) (*VP (VBD was)) (*. .))
would yield the rules
(S (NP (DT the) NN) (VP VBD) (. .))
(NN boy)
(VBD was)
Produce the spinal annotation of the training data with the
following command:
$ ./scripts/annotate_spinal_grammar.pl data/lex.02-21 < data/wsj.trees.02-21.clean \
    > data/wsj.trees.02-21.clean.annotated
You need this representation if you want to initialize the Gibbs
sampler from the spinal derivations instead of the PCFG
derivations, and also to extract the spinal baseline grammar. This
initialization is recommended since it converges more quickly to
better grammars.
** Generate the development and test sets.
$ cat /p/mt/corpora/wsj/22/*.mrg | ./scripts/oneline_treebank.pl > data/wsj.trees.22
$ cat data/wsj.trees.22 | ./scripts/clean.pl > data/wsj.trees.22.clean
$ cat data/wsj.trees.22.clean | ./scripts/print_leaves.pl > data/wsj.22.words
$ cat /p/mt/corpora/wsj/23/*.mrg | ./scripts/oneline_treebank.pl > data/wsj.trees.23
$ cat data/wsj.trees.23 | ./scripts/clean.pl > data/wsj.trees.23.clean
$ cat data/wsj.trees.23.clean | ./scripts/print_leaves.pl > data/wsj.23.words
Restrict the data to sentences with at most forty words:
$ awk 'NF <= 40 { print NR }' data/wsj.23.words > data/wsj.23.words.lines-max40
$ ./scripts/extract_lines.pl -l data/wsj.23.words.lines-max40 -f data/wsj.23.words > data/wsj.23.words.max40
$ awk 'NF <= 40 { print NR }' data/wsj.22.words > data/wsj.22.words.lines-max40
$ ./scripts/extract_lines.pl -l data/wsj.22.words.lines-max40 -f data/wsj.22.words > data/wsj.22.words.max40
** Generate the PCFG rule probs used for the base measure.
$ cat data/wsj.trees.02-21.clean | ./scripts/rule_probs.pl -counts > data/rule_counts
$ cat data/wsj.trees.02-21.clean | ./scripts/rule_probs.pl > data/rule_probs
(Alternatively, the grammar can be produced directly from the counts:
$ cat data/rule_counts | ./scripts/rule_probs.pl > data/rule_probs
)
** Generate the spinal grammar
$ cat data/wsj.trees.02-21.clean.annotated | ./scripts/rule_probs.pl -counts > data/spinal_counts
$ cat data/rule_counts data/spinal_counts | ./scripts/rule_probs.pl > data/spinal_probs
** Generate Bod's grammar
$ cat data/wsj.trees.02-21.clean | ./scripts/extract_bod01_grammar.pl > data/bod01.rules
The file 'bod01.rules' prepends each line with the height of that
rule. To build the grammar, remove that field, add in the PCFG
counts, and normalize:
$ (perl -pe 's/^\d+ //' data/bod01.rules; cat data/rule_counts) | ./scripts/rule_probs.pl > data/bod01.prb
* Gibbs sampling
The Gibbs sampler comprises tsg.pl, the generic sampler code in
Sampler.pm, and the TSG-specific sampling code in Sampler/TSG.pm. As
input, it takes a treebank with arbitrary TSG derivations and a PCFG
used to score the base measure. Each iteration's output is written
to a subdirectory whose name is the iteration number, and whose
contents are the corpus state at the end of that iteration and the
counts of TSG rules extracted from that corpus derivation state.
The sampler can be interrupted at any time, and will restart from
the last completed iteration (it is safest to remove the last
iteration's directory if it is incomplete).
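For example, a minimal sketch of resuming safely (assuming the
iteration subdirectories are bare numbers under the run directory,
as described above; 'rundir' is a placeholder name):
$ cd rundir
$ last=$(ls -d [0-9]* 2>/dev/null | sort -n | tail -1)
$ test -n "$last" && rm -rf "$last"   # drop the possibly incomplete final iteration
$ tsg.pl -rundir . [other options as before]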
Usage:
tsg.pl
-iters number of iterations
-corpus corpus to initialize counts from (unless picking up
from an interrupted run)
-alpha the value of alpha for the DP prior
-stop the stop probability for the DP base measure
-pcfg the PCFG grammar used for the DP base measure
-rundir the run directory (default = pwd)
-two sample two nodes at a time (default = sample one)
[many more options are available]
e.g.,
$ tsg.pl -corpus $DPTSG/data/wsj.trees.02-21.clean -iters 500
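To initialize the sampler from the spinal derivations instead (the
initialization recommended above), point -corpus at the annotated
file:
$ tsg.pl -corpus $DPTSG/data/wsj.trees.02-21.clean.annotated -iters 500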
To extract a Gibbs-sampled grammar, use one of the following:
- extract a summed grammar (from the counts of the first $i iterations):
  $ (for num in $(seq 1 $i); do
      bzcat $num/counts.bz2
    done) | $DPTSG/scripts/rule_probs.pl > 1-$i.prb
- extract a point grammar (from a single iteration $i):
  $ bzcat $i/counts.bz2 | $DPTSG/scripts/rule_probs.pl > $i.prb
* Parsing
You can now parse with these grammars using your favorite CKY
parser. I have modified Mark Johnson's parser to work with these
grammars; please feel free to email me for a copy if I haven't
already posted the source to my website.
* Evaluation
Use evalb with the COLLINS.prm file.
http://nlp.cs.nyu.edu/evalb/
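A typical invocation looks like the following (a sketch; the parser
output file name is hypothetical, and flag details may vary across
evalb versions):
$ evalb -p COLLINS.prm data/wsj.trees.23.clean wsj.23.parsed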