lfgalign is a program for aligning corresponding f-structures and
c-structures of LFG-analysed parallel sentences. The analyses should
be in the XLE format, and preferably manually disambiguated, from
grammars that have been written using common analysis principles (see
the Xpar project description). One may optionally supply word
translations (e.g. from word alignments or translational dictionaries)
in order to improve the predicate alignment.
There is an article about lfgalign that describes the method; see also the master's thesis (in Norwegian).
Prerequisites:
- asdf, which is bundled with SBCL as well as the less common Common Lisps.
- lisp-unit (optional, for regression tests)
Make a symlink from your systems directory to lfgalign.asd in this
directory (you can do the same for lisp-unit). Since I installed SBCL
using clbuild, my systems directory was at /path/to/clbuild/systems,
but you can find the path by evaluating asdf:*central-registry* in
your interpreter after requiring 'asdf.
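If symlinking is inconvenient, the registry can also be extended from the REPL. The following is a minimal sketch, assuming the checkout lives at the hypothetical path /path/to/lfgalign/ (replace it with the directory that actually contains lfgalign.asd):

```lisp
(require :asdf)

;; Show the directories ASDF currently searches for .asd files.
(print asdf:*central-registry*)

;; Hypothetical location of this checkout; the trailing slash matters,
;; since ASDF expects a directory pathname here.
(pushnew #p"/path/to/lfgalign/" asdf:*central-registry* :test #'equal)
```

This only affects the running image, so it has to be repeated (or placed in an init file) for each new session.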
Load the package in your interpreter with
(asdf:operate 'asdf:load-op 'lfgalign)
Switch to the lfgalign package:
(in-package :lfgalign)
(If you're working through Emacs with Slime, you can load from the REPL
with , load RET lfgalign RET and switch with , in RET lfgalign RET.)
You can then run the regression tests with
(lisp-unit:run-tests)
The function evaluate in the file eval.lisp shows how to load two
Prolog files into analysis tables, create an empty LPT table, run
f-alignment, ranking and c-alignment, and finally give some
not-very-formatted output.
A very preliminary command-line interface using SBCL is available. You should be able to align two Prolog files by simply saying
./align.sh source.pl target.pl
although it assumes SBCL is installed in usr; you can set the
correct paths to SBCL and your asdf systems directory (where you
symlinked to lfgalign.asd) first by doing e.g.:
export LISP=/l/c/clbuild/target/bin/sbcl
export LISPCORE=/l/c/clbuild/target/lib/sbcl/sbcl.core
export ASDFSYSTEMS=/l/c/clbuild/systems/
Common Lisp command-line interfaces are unfortunately not very standardised.
The global variable *pro-affects-c-linking* controls whether
unlinked pro-elements may hinder linking c-structure nodes of two
predicates. Setting this to t or nil selects between two alternative
ways of linking c-structures in the cases where one language has
pro-elements and the other does not, and the pro-element is linked on
the f-structure level.
lfgalign.lisp currently does the following:
- collect c-structure trees: maketree
- find the topmost c-node in an f-domain: topnode
- find a c-node referenced by f-structure variable: treefind
- find f-structure predicate from variable, traversing equivalent f-vars: get-pred and unravel
- find arguments, adjuncts, lemma and lexical expression of a predicate/f-var: get-args, get-adjs, lemma, L
- keep tables of LPT correspondences (lookup with LPT? ensures a "pro" is an LPT of a noun as defined by noun?)
- find all set-unique combinations of links of source arguments with target args/adjuncts, and target arguments with source args/adjuncts (excluding adj-adj links): argalign (if given LPT tables, this removes combinations where at least one link is non-LPT)
- outer-pred creates a fake "sentence pred" with id -1, that has 0 as an argument and, as adjuncts: any unreferenced preds in the f-structure (preds that are not arguments/adjuncts reachable through 0)
- f-align combines the above and recursively tries to align all arguments in all permutations of argument-argument/adjunct pairs, creating a decision tree of sorts; flatten spreads this out into several simple lists.
- rank uses rank-helper and rank-branch to turn the output from f-align into a single flat, ranked list of links for input into c-align.
- add-links takes a flat f-alignment and a tree, and creates a table of type LL-splits. Each node n is added to a list in the table, where the index of the list is the set of alignments of pre-terminal nodes dominated by node n (so several nodes may have the same index).
- c-align takes a flat f-alignment and finds the LL-splits of source and target trees, intersecting that on the keys to find which nodes can be aligned.
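The idea behind add-links and c-align can be illustrated with a small, self-contained sketch. It is not the code in lfgalign.lisp: the function names, the nested-list tree format (id child*) and the alist from pre-terminal ids to link labels are all assumptions made for the example. Each node is filed under the sorted set of link labels found on the pre-terminals it dominates, and nodes of the two trees are candidates for alignment when they share such a key.

```lisp
(defun example-splits (tree links table)
  "Walk TREE, a nested list (id child*), and file every node id in TABLE
under the sorted list of link labels of the pre-terminals it dominates.
LINKS is an alist from pre-terminal id to link label. Returns the key."
  (destructuring-bind (id &rest children) tree
    (let ((key (if children
                   ;; Inner node: union of the children's keys.
                   ;; COPY-LIST protects keys already stored in TABLE
                   ;; from the destructive SORT.
                   (sort (copy-list
                          (remove-duplicates
                           (loop for child in children
                                 append (example-splits child links table))))
                         #'<)
                   ;; Pre-terminal: its own link label, if it has one.
                   (let ((link (cdr (assoc id links))))
                     (and link (list link))))))
      (push id (gethash key table))
      key)))

(defun example-c-align (source-tree target-tree source-links target-links)
  "Return a list of (key source-nodes target-nodes) for every non-empty
key that occurs in the tables of both trees."
  (let ((source-table (make-hash-table :test #'equal))
        (target-table (make-hash-table :test #'equal))
        (result '()))
    (example-splits source-tree source-links source-table)
    (example-splits target-tree target-links target-table)
    (maphash (lambda (key source-nodes)
               (let ((target-nodes (gethash key target-table)))
                 ;; Skip the NIL key (nodes dominating no linked pre-terminal).
                 (when (and key target-nodes)
                   (push (list key source-nodes target-nodes) result))))
             source-table)
    result))

;; Hypothetical data: link 100 joins pre-terminals 1 and 7,
;; link 101 joins pre-terminals 2 and 5; 3 and 6 are unlinked.
(defparameter *example-alignment*
  (example-c-align '(10 (11 (1)) (12 (2) (3)))
                   '(20 (21 (5) (6)) (22 (7)))
                   '((1 . 100) (2 . 101))
                   '((5 . 101) (7 . 100))))
```

In this example the two root nodes share the key (100 101), while node 11 ends up grouped with node 22 and node 12 with node 21, even though their left-to-right order differs between the trees.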
prolog-import.lisp parses an XLE Prolog file and puts everything into a hash table. For the f-structure, the keys are f-structure variable numbers, while the c-structure parts are keyed by the names of the parts (subtree, terminal, phi, cproj, fspan, semform_data, surfaceform), the values being alists with unique id keys. If we turn it all back into an assoc-list, we get e.g.:
((0 ("VFORM" . "fin") ("CLAUSE-TYPE" . "decl") ("TNS-ASP" . 10)
("POLARITY" . 5) ("CHECK" . 1) ("SUBJ" . 3) ("PRED" "qePa" 2 (3) NIL))
(3 ("PERS" . "3") ("NUM" . "sg") ("CASE" . "erg") ("ANIM" . "+") ("NTYPE" . 6)
("CHECK" . 4) ("PRED" "Abrams" 0 NIL NIL))
(4 ("_AGR-POS" . "left") ("_POLARITY" . 5)) (6 ("NSYN" . "proper"))
(1 ("_TENSEGROUP" . "aor") ("_TENSE" . "aor") ("_PERIOD" . "+")
("_MAIN-CL" . "+") ("_AGR" . "both") ("_MORPH-SYNT" . 7) ("_IN-SITU" . 2))
(|in_set| ("NO-PV" . 22) (3 . 2))
(7 ("_SYNTAX" . "unerg") ("_PERF-PV" . "-") ("_LEXID" . "V2746-3")
("_CLASS" . "MV") ("_AGR" . 8))
(8 ("_OBJ" . 9)) (9 ("PERS" . "3") ("NUM" . "sg"))
(10 ("TENSE" . "past") ("MOOD" . "indicative") ("ASPECT" . "perf"))
(21 ("o::" . 22))
(|subtree| (2 "PROP" NIL 1) (4 "V_SUFF_BASE" NIL 5) (6 "V_SUFF_BASE" NIL 7)
(8 "V_SUFF_BASE" NIL 9) (10 "V_SUFF_BASE" NIL 11) (12 "V_SUFF_BASE" NIL 13)
(14 "V_SUFF_BASE" NIL 15) (18 "V_BASE" NIL 17) (28 "PERIOD" NIL 22)
(118 "PROPP" NIL 2) (141 "IPfoc[main,-]" NIL 118) (281 "V" NIL 18)
(283 "V" 281 14) (284 "V" 283 12) (285 "V" 284 10) (286 "V" 285 8)
(287 "V" 286 6) (288 "V" 287 4) (293 "I[main,-]" NIL 288)
(398 "Ibar[main,-]" NIL 293) (401 "IPfoc[main,-]" 141 398)
(454 "ROOT" NIL 401) (457 "ROOT" 454 28))
(|phi| (1 . 3) (2 . 3) (4 . 0) (5 . 0) (6 . 0) (7 . 0) (8 . 0) (9 . 0)
(10 . 0) (11 . 0) (12 . 0) (13 . 0) (14 . 0) (15 . 0) (17 . 23) (18 . 0)
(22 . 0) (28 . 0) (118 . 3) (141 . 0) (281 . 0) (283 . 0) (284 . 0) (285 . 0)
(286 . 0) (287 . 0) (288 . 0) (293 . 0) (398 . 0) (401 . 0) (454 . 0)
(457 . 0))
(|terminal| (1 "abramsma" (1)) (5 "+Obj3" (3)) (7 "+Subj3Sg" (3))
(9 "+Aor" (3)) (11 "+Base" (3)) (13 "+Unerg" (3)) (15 "+V" (3))
(17 "qePa-2746-3" (3)) (22 "." (22)))
(|cproj| (17 . 21)) (|semform_data| (2 18 10 14) (0 2 1 9))
(|fspan| (3 1 9) (0 1 16))
(|surfaceform| (22 "." 15 16) (3 "iqePa" 10 15) (1 "abramsma" 1 9)))
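To show how such a structure can be read, here is a minimal sketch that works on a small excerpt of the assoc-list above. The variable and function names are made up for the example, and the real program works on the hash table rather than on this list form.

```lisp
;; A small excerpt of the structure shown above (hypothetical variable name).
(defparameter *example-table*
  '((0 ("VFORM" . "fin") ("SUBJ" . 3) ("PRED" "qePa" 2 (3) NIL))
    (3 ("PERS" . "3") ("NUM" . "sg") ("CASE" . "erg")
       ("PRED" "Abrams" 0 NIL NIL))))

(defun example-attribute (var attribute table)
  "Return the value stored under ATTRIBUTE for f-structure variable VAR,
or NIL if there is none."
  (cdr (assoc attribute (cdr (assoc var table)) :test #'equal)))

(defun example-pred-lemma (var table)
  "Return the first element of VAR's PRED value (the lemma string),
or NIL if VAR has no PRED."
  (first (example-attribute var "PRED" table)))

;; Follow SUBJ from variable 0 and read the lemma of the subject's PRED.
(defparameter *example-subject-lemma*
  (example-pred-lemma (example-attribute 0 "SUBJ" *example-table*)
                      *example-table*))
```

Here *example-subject-lemma* is "Abrams": the SUBJ of variable 0 is variable 3, whose PRED value starts with that string.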
We collect the eq-vars (equivalent variables) into a doubly-linked circular list (so we can easily look up a member and get all equivalents).
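A doubly-linked circular list of this kind could look roughly as follows. This is a sketch under assumptions, not the data structure used in prolog-import.lisp: the struct, the function names and the hash-table index from variables to ring nodes are all hypothetical.

```lisp
(defstruct (ring-node (:constructor make-ring-node (value)))
  value next prev)

(defmethod print-object ((node ring-node) stream)
  ;; Print only the value, so printing never follows the circular links.
  (print-unreadable-object (node stream :type t)
    (prin1 (ring-node-value node) stream)))

(defun make-singleton (value)
  "Create a ring containing only VALUE, linked to itself in both directions."
  (let ((node (make-ring-node value)))
    (setf (ring-node-next node) node
          (ring-node-prev node) node)
    node))

(defun ring-values (node)
  "Collect all values in NODE's ring, starting from NODE."
  (loop for current = node then (ring-node-next current)
        collect (ring-node-value current)
        until (eq (ring-node-next current) node)))

(defun ring-splice (a b)
  "Merge the rings of nodes A and B, which must be two different rings."
  (let ((a-next (ring-node-next a))
        (b-prev (ring-node-prev b)))
    (setf (ring-node-next a) b
          (ring-node-prev b) a
          (ring-node-next b-prev) a-next
          (ring-node-prev a-next) b-prev)
    a))

(defun add-equivalence (x y index)
  "Record that variables X and Y are equivalent.
INDEX is a hash table from variable to its ring node."
  (flet ((node-for (var)
           (or (gethash var index)
               (setf (gethash var index) (make-singleton var)))))
    (let ((a (node-for x))
          (b (node-for y)))
      ;; Splicing two nodes of the same ring would split it, so check first.
      (unless (member y (ring-values a))
        (ring-splice a b)))))

(defun equivalents (var index)
  "Return VAR together with all variables recorded as equivalent to it."
  (let ((node (gethash var index)))
    (if node (ring-values node) (list var))))

;; Hypothetical usage: 5 = 7 and 7 = 9, so all three are equivalent.
(defparameter *example-index* (make-hash-table))
(add-equivalence 5 7 *example-index*)
(add-equivalence 7 9 *example-index*)
```

Looking up any of 5, 7 or 9 with equivalents then yields all three variables, starting from the one that was asked for.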
We signal an error if the file is not disambiguated (as indicated by
the select and choice fields in the Prolog file). Otherwise, we
filter out non-selected parses from the file, keeping only the ones
equivalent to the selected parse (see filter-equiv, in-disjunction
and disambiguated?).
- Use LPT-check as a k-best ranking criterion rather than a binary cut-off.
- SPEC and POSS features may lead to PREDs that are not arguments or adjuncts of anything else (e.g. determiners, possessors); we need some principled method of aligning these.
- The program only uses dset3 of the dsets; rename it (make a class?) and deprecate the others.
- Could perhaps make argument calls a bit more concise by making a class alignment, containing constants tab_s, tab_t, creating constants tree_s and tree_t on init, and storing LPT and f-alignments.