To generate article spans:
(NOTE: The date stamp 'jun-10-616pm' appears throughout this file. Use search/replace to change it everywhere to the stamp of your own run, as appropriate.)
First, you need the following:
1. Text of pages 100-199 of all volumes of War of the Rebellion (WOTR), e.g. in
'../wotr-text-100'. This should contain only the '*.txt.joined' files
(use symlinks if necessary).
2. Text of all pages of all volumes of War of the Rebellion, e.g. in
'../wotr-text'. This should contain only the '*.txt.joined' files
(use symlinks if necessary).
3. Files containing docgeo spans, e.g. in 'docgeo-spans-jun-10-616pm'.
The procedure generates the following result directory:
1. 'wotr-spans.jun-10-616pm' or similar, containing the text of WOTR with
the spans marked.
It also generates the following intermediate files/directories:
1. 'articleseg_feats.txt.jun-10-616pm': Features for training the Mallet
sequence model to predict the spans.
2. 'mallet.model.jun-10-616pm': Mallet sequence model trained from the
above file.
3. 'mallet-predict.jun-10-616pm': Directory containing features for predicting
the spans of the WOTR text, one file per volume.
4. 'mallet-predictions.jun-10-616pm': Directory containing Mallet predictions
(in B/I/O/L format for beginning/inside/outside/last) of the above
feature files.
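Because every name above carries the same date stamp, one way to avoid the search/replace step is to derive all the names from a single variable. The following is a minimal sketch and not part of the repository: the function name artifact_names and its dictionary keys are hypothetical, while the name patterns are copied from the lists above.

```python
def artifact_names(stamp):
    """Return the file/directory names used in this procedure for one date stamp.

    The keys are hypothetical labels; the name patterns come from this file.
    """
    return {
        "docgeo_spans": "docgeo-spans-" + stamp,            # input: docgeo spans
        "training_file": "articleseg_feats.txt." + stamp,   # Mallet training features
        "model": "mallet.model." + stamp,                   # trained sequence model
        "predict_dir": "mallet-predict." + stamp,           # per-volume feature files
        "predictions_dir": "mallet-predictions." + stamp,   # per-volume B/I/O/L output
        "spans_dir": "wotr-spans." + stamp,                 # final result directory
    }


names = artifact_names("jun-10-616pm")
print(names["model"])
```

The returned names can then be substituted into the commands in the steps that follow.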
Then:
1. Generate Mallet training data.
python/generate-span-features.py --spans docgeo-spans-jun-10-616pm --text ../wotr-text-100 --training-file articleseg_feats.txt.jun-10-616pm --train
This generates a file 'articleseg_feats.txt.jun-10-616pm' containing Mallet
training data.
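For orientation, Mallet's SimpleTagger reads training data with one item per line, written as space-separated features followed by the label, and with a blank line between sequences. The sketch below only illustrates that layout. The feature names (ALLCAPS, HAS_DATE, and so on) and the helper functions are hypothetical; the features actually emitted by generate-span-features.py are not listed in this file.

```python
def format_sequence(rows):
    """Format one sequence: each row becomes 'feat1 feat2 ... label'."""
    lines = []
    for features, label in rows:
        lines.append(" ".join(list(features) + [label]))
    return "\n".join(lines) + "\n"


def format_training_data(sequences):
    """Join sequences with a blank line between them."""
    return "\n".join(format_sequence(seq) for seq in sequences)


# Hypothetical features; labels follow the B/I/O/L scheme described above.
example = [
    (["ALLCAPS", "SHORT"], "B"),
    (["HAS_DATE"], "I"),
    (["SIGNATURE"], "L"),
    (["PAGE_HEADER"], "O"),
]
print(format_sequence(example), end="")
```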
2. Train Mallet.
java -cp /Users/benwing/devel/mallet-2.0.7/class:/Users/benwing/devel/mallet-2.0.7/lib/mallet-deps.jar cc.mallet.fst.SimpleTagger --train true --model-file mallet.model.jun-10-616pm articleseg_feats.txt.jun-10-616pm
This uses 'articleseg_feats.txt.jun-10-616pm' to train a model, stored in
'mallet.model.jun-10-616pm'.
3. Generate features for the full War of the Rebellion text (this may
take a while).
python/generate-span-features.py --text ../wotr-text --predict-dir mallet-predict.jun-10-616pm --predict
4. Run Mallet in prediction mode.
mkdir mallet-predictions.jun-10-616pm
cd mallet-predict.jun-10-616pm
for x in *.feats; do echo "$x"; java -cp /Users/benwing/devel/mallet-2.0.7/class:/Users/benwing/devel/mallet-2.0.7/lib/mallet-deps.jar cc.mallet.fst.SimpleTagger --model-file ../mallet.model.jun-10-616pm "$x" >| ../mallet-predictions.jun-10-616pm/"$(basename "$x" .feats)".predictions; done
cd ..
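The loop above depends on shell-specific redirection and on a hard-coded Mallet location. As a hedged alternative, the same step can be sketched in Python. The function names and the mallet_home parameter are hypothetical; the classpath layout, class name, and '--model-file' option are taken from the commands above.

```python
import os
import subprocess
from pathlib import Path


def mallet_classpath(mallet_home):
    """Classpath laid out as in the commands above (class dir plus deps jar)."""
    home = Path(mallet_home)
    return os.pathsep.join([str(home / "class"), str(home / "lib" / "mallet-deps.jar")])


def prediction_job(feats_path, predictions_dir, model_file, mallet_home):
    """Return (command, output_path) for one '*.feats' file."""
    feats = Path(feats_path)
    command = ["java", "-cp", mallet_classpath(mallet_home),
               "cc.mallet.fst.SimpleTagger",
               "--model-file", str(model_file), str(feats)]
    # Same naming as `basename $x .feats`.predictions in the shell loop.
    output = Path(predictions_dir) / (feats.stem + ".predictions")
    return command, output


def run_predictions(predict_dir, predictions_dir, model_file, mallet_home):
    """Run SimpleTagger on every feature file, one output file per volume."""
    Path(predictions_dir).mkdir(parents=True, exist_ok=True)
    for feats in sorted(Path(predict_dir).glob("*.feats")):
        command, output = prediction_job(feats, predictions_dir, model_file, mallet_home)
        print(feats.name)
        with open(output, "w") as out:
            subprocess.run(command, stdout=out, check=True)
```

Calling run_predictions("mallet-predict.jun-10-616pm", "mallet-predictions.jun-10-616pm", "mallet.model.jun-10-616pm", mallet_home) with mallet_home set to your Mallet 2.0.7 directory would then stand in for the mkdir/cd/for sequence; unlike the shell loop, it stops at the first failing volume because of check=True.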
5. Combine the predictions and text to generate predicted spans.
python/generate-span-features.py --text ../wotr-text --predictions mallet-predictions.jun-10-616pm --predicted-spans-dir wotr-spans.jun-10-616pm --predicted-spans
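To make the B/I/O/L format concrete, the sketch below turns a sequence of predicted tags into index ranges. It is an illustration only and not the logic of generate-span-features.py: the function names are hypothetical, and the handling of malformed sequences (a span that never reaches its 'L', or an 'I' or 'L' with no preceding 'B') is an assumption. It also assumes the prediction files hold one tag per line, possibly with trailing whitespace.

```python
def tags_from_text(text):
    """Read one tag per non-blank line, ignoring surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def tags_to_spans(tags):
    """Convert B/I/O/L tags to (start, end) index pairs, end inclusive."""
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:        # previous span never saw its L
                spans.append((start, i - 1))
            start = i
        elif tag == "I":
            if start is None:            # assumption: a stray I opens a span
                start = i
        elif tag == "L":
            if start is None:            # assumption: a lone L is a one-item span
                start = i
            spans.append((start, i))
            start = None
        elif tag == "O":
            if start is not None:        # close an unterminated span
                spans.append((start, i - 1))
                start = None
    if start is not None:                # input ended inside a span
        spans.append((start, len(tags) - 1))
    return spans


tags = tags_from_text("O \nB \nI \nI \nL \nO \nB \nL \nO \n")
print(tags_to_spans(tags))
```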