DPTSG
January 2010, Matt Post
--
The code contained in this project directory was used to obtain the
results in our ACL 2009 short paper:
"Bayesian learning of a tree substitution grammar"
Matt Post & Daniel Gildea
ACL 2009
If you want to learn how the code works, along with all the caveats,
please see the file README.code. This file describes the project
files, along with the steps needed to prepare all of the data to do
Gibbs sampling.
-- INSTALLATION AND WALK-THROUGH -------------------------------------
All files and code herein assume a base directory of
$ENV{DPTSG}. You should set this in your profile, e.g., in bash,
$ export DPTSG=$HOME/code/dptsg
You will also need to put this directory in your Perl search path:
$ export PERL5LIB+=:$DPTSG
* List of programs in scripts/
** oneline_treebank.pl
Takes (on STDIN) a multi-line treebank tree and condenses it to a
single line (on STDOUT).
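For example, a hypothetical multi-line tree (the exact whitespace in
the Treebank's .mrg files varies):
  ( (S
      (NP (DT The) (NN boy))
      (VP (VBD was))
      (. .)) )
becomes the single line
  ( (S (NP (DT The) (NN boy)) (VP (VBD was)) (. .)) )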
** clean.pl
Removes empty nodes and Treebank v2 information from node labels.
** build_lex.pl
Builds the lexicon. This is used by most programs to determine
which words should be converted to one of the ~ 60 unknown word
classes.
** print_leaves.pl
Takes a oneline tree (on STDIN) and prints its frontier, i.e., the
words at its leaves (on STDOUT).
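For example, the oneline tree
  (S (NP (DT the) (NN boy)) (VP (VBD was)) (. .))
should yield the frontier
  the boy was .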
** extract_lines.pl
Takes a list of line numbers and a file, and prints out only lines
from the file corresponding to those in the list of line numbers.
** annotate_spinal_grammar.pl
Takes a treebank (one tree per line) and annotates it with the
derivations from the spinal grammar, using the Magerman head
selection rules.
** rule_probs.pl
Produces a grammar from a combination of rule counts in the form
<count> <rule>
and oneline trees (these can be interspersed).
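For example, a counts file might contain lines like the following
(hypothetical counts):
  1234 (NP (DT the) NN)
    56 (VBD was)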
** extract_bod01_grammar.pl
Samples random trees at various heights from a treebank,
reproducing the "minimal subset" grammar of Bod.
* Preparing the data
** Process the trees into a nice format.
You need a copy of the WSJ portion of the Penn treebank.
This gets rid of Treebank v2 information and removes -NONE- nodes
and traces.
$ cd $DPTSG
$ mkdir data
$ (cd /p/mt/corpora/wsj; cat {02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21}/*.mrg) | \
  ./scripts/oneline_treebank.pl > data/wsj.trees.02-21
$ cat data/wsj.trees.02-21 | ./scripts/clean.pl > data/wsj.trees.02-21.clean
** Generate the lexicon
This determines which words get transformed into one of the ~ 60
unknown word bins. It writes (on STDOUT) a vocabulary file with
lines of the form "ID WORD COUNT", for all words found in the
training trees.
$ ./scripts/build_lex.pl < data/wsj.trees.02-21.clean > data/lex.02-21
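For example, the first few lines might look like this (hypothetical
IDs and counts):
  1 the 39832
  2 of 18542
  3 to 17123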
** Annotate the training data with spinal annotations
A TSG derivation tree is represented by prepending a '*' to the
labels of nodes that are internal to a TSG rule, e.g., the tree
(S (*NP (*DT the) (NN boy)) (*VP (VBD was)) (*. .))
would yield the rules
(S (NP (DT the) NN) (VP VBD) (. .))
(NN boy)
(VBD was)
Produce the spinal annotation of the training data with the
following command:
$ ./scripts/annotate_spinal_grammar.pl data/lex.02-21 < data/wsj.trees.02-21.clean \
    > data/wsj.trees.02-21.clean.annotated
You need this representation if you want to initialize the Gibbs
sampler from the spinal derivations instead of the PCFG
derivations, and also to extract the spinal baseline grammar. This
initialization is recommended since it converges more quickly to
better grammars.
** Generate the development and test sets.
$ cat /p/mt/corpora/wsj/22/*.mrg | ./scripts/oneline_treebank.pl > data/wsj.trees.22
$ cat data/wsj.trees.22 | ./scripts/clean.pl > data/wsj.trees.22.clean
$ cat data/wsj.trees.22.clean | ./scripts/print_leaves.pl > data/wsj.22.words
$ cat /p/mt/corpora/wsj/23/*.mrg | ./scripts/oneline_treebank.pl > data/wsj.trees.23
$ cat data/wsj.trees.23 | ./scripts/clean.pl > data/wsj.trees.23.clean
$ cat data/wsj.trees.23.clean | ./scripts/print_leaves.pl > data/wsj.23.words
Restrict the data to sentences with at most forty words:
$ awk 'NF <= 40 { print NR }' data/wsj.23.words > data/wsj.23.words.lines-max40
$ ./scripts/extract_lines.pl -l data/wsj.23.words.lines-max40 -f data/wsj.23.words > data/wsj.23.words.max40
$ awk 'NF <= 40 { print NR }' data/wsj.22.words > data/wsj.22.words.lines-max40
$ ./scripts/extract_lines.pl -l data/wsj.22.words.lines-max40 -f data/wsj.22.words > data/wsj.22.words.max40
** Generate the PCFG rule probs used for the base measure.
$ cat data/wsj.trees.02-21.clean | ./scripts/rule_probs.pl -counts > data/rule_counts
$ cat data/wsj.trees.02-21.clean | ./scripts/rule_probs.pl > data/rule_probs
(Alternatively, the grammar can be produced directly from the counts:
$ cat data/rule_counts | ./scripts/rule_probs.pl > data/rule_probs
)
** Generate the spinal grammar
$ cat data/wsj.trees.02-21.clean.annotated | ./scripts/rule_probs.pl -counts > data/spinal_counts
$ cat data/rule_counts data/spinal_counts | ./scripts/rule_probs.pl > data/spinal_probs
** Generate Bod's grammar
$ cat data/wsj.trees.02-21.clean | ./scripts/extract_bod01_grammar.pl > data/bod01.rules
The file 'bod01.rules' prepends each line with the height of that
rule. To build the grammar, remove that field, add in the PCFG
counts, and normalize:
$ (perl -pe 's/^\d+ //' data/bod01.rules; cat data/rule_counts) | ./scripts/rule_probs.pl > data/bod01.prb
* Gibbs sampling
The Gibbs sampler comprises tsg.pl, the generic sampler code in
Sampler.pm, and the TSG-specific sampling code in Sampler/TSG.pm. As
input, it takes a treebank with arbitrary TSG derivations and a PCFG
used to score the base measure. Each iteration's output is written
to a subdirectory whose name is the iteration number, and whose
contents are the corpus state at the end of that iteration and the
counts of TSG rules extracted from that corpus derivation state.
The sampler can be interrupted at any time, and will restart from
the last completed iteration (it is safest to remove the last
iteration's directory if it is incomplete).
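For example, a minimal sketch of resuming safely (assuming the
iteration subdirectories are bare numbers under the run directory,
as described above; 'rundir' is a placeholder name):
$ cd rundir
$ last=$(ls -d [0-9]* 2>/dev/null | sort -n | tail -1)
$ test -n "$last" && rm -rf "$last"   # drop the possibly incomplete final iteration
$ tsg.pl -rundir . [other options as before]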
Usage:
tsg.pl
-iters number of iterations
-corpus corpus to initialize counts from (unless picking up
from an interrupted run)
-alpha the value of alpha for the DP prior
-stop the stop probability for the DP base measure
-pcfg the PCFG grammar used for the DP base measure
-rundir the run directory (default = pwd)
-two sample two nodes at a time (default = sample one)
[many more options are available]
e.g.,
$ tsg.pl -corpus $DPTSG/data/wsj.trees.02-21.clean -iters 500
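To initialize the sampler from the spinal derivations instead (the
initialization recommended above), point -corpus at the annotated
file:
$ tsg.pl -corpus $DPTSG/data/wsj.trees.02-21.clean.annotated -iters 500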
To extract a Gibbs-sampled grammar, use one of the following:
- extract a summed grammar (from the counts of the first $i iterations):
  $ (for num in $(seq 1 $i); do
      bzcat $num/counts.bz2
    done) | $DPTSG/scripts/rule_probs.pl > 1-$i.prb
- extract a point grammar (from a single iteration $i):
  $ bzcat $i/counts.bz2 | $DPTSG/scripts/rule_probs.pl > $i.prb
* Parsing
You can now parse with these grammars using your favorite CKY
parser. I have modified Mark Johnson's parser to work with these
grammars; please feel free to email me for a copy if I haven't
already posted the source to my website.
* Evaluation
Use evalb with the COLLINS.prm file.
http://nlp.cs.nyu.edu/evalb/
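A typical invocation looks like the following (a sketch; the parser
output file name is hypothetical, and flag details may vary across
evalb versions):
$ evalb -p COLLINS.prm data/wsj.trees.23.clean wsj.23.parsed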