How to change linguistic resources
- git clone Bling-Fire-Git-Path
- cd BlingFire
- cmake .
- ls -l Makefile
-rwxrwxrwx 1 sergeio sergeio 425666 Mar 29 18:58 Makefile
- make all
This will take a few minutes.
Now you need either to install the tools into a location that is already in your PATH, or to add the BlingFire directory containing the tools to your PATH. For the latter, run this command from the BlingFire directory:
- . ./scripts/set_env
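If you would rather set the PATH by hand, the following is a minimal sketch of doing so. It is not a copy of scripts/set_env; the helper name add_to_path is hypothetical, and it assumes the built tools sit in the BlingFire directory, as described above.

```shell
#!/bin/sh
# add_to_path DIR: prepend DIR to PATH unless it is already listed there.
# (Hypothetical helper for illustration; not part of BlingFire.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                       # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;; # prepend and export
    esac
}

# Run from the BlingFire directory, where the tools were built.
add_to_path "$(pwd)"
```

Like set_env, this has to run in the current shell (source it with a leading dot) for the change to outlast the script.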
Let's make sure that the tools are actually in the PATH. Type:
fa_nfa2dfa --help
All tools respond to --help, so you should see something like:
Usage: fa_nfa2dfa [OPTION] [< input.txt] [> output.txt]

This program converts non-deterministic finite-state machine into deterministic one.

  --in=input-file - reads input from the input-file, if omited stdin is used
  --out=output-file - writes output to the output-file, if omited stdout is used
  --out2=output-file - writes output to the output-file, if omited stdout is used
  --pos-nfa=input-file - reads reversed position NFA from input-file, needed for --fsm=pos-rs-nfa to store only ambiguous positions, if omited stores all positions
  --fsm=rs-nfa - makes convertion from Rabin-Scott NFA (is used by default)
  --fsm=pos-rs-nfa - makes convertion from Rabin-Scott position NFA, builds Moore Multi Dfa
  --fsm=mealy-nfa - makes convertion from Mealy NFA into a cascade of two Mealy Dfa (general case) or a single Mealy DFA (trivial case)
  --spec-any=N - treats input weight N as a special any symbol, if specified produces Dfa with the same symbol on arcs, which must be interpreted as any other
  --bi-machine - uses bi-machine for Mealy NFA determinization
  --no-output - does not do any output
  --verbose - prints out debug information, if supported
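To check several tools at once instead of calling --help on each, something like the sketch below can be used. The function name check_tools is hypothetical; the tool names are those mentioned on this page, and the check relies only on the standard `command -v` shell builtin.

```shell
#!/bin/sh
# check_tools NAME...: report whether each NAME resolves on PATH.
# Returns 0 if every tool was found, 1 otherwise.
# (Hypothetical helper for illustration; not part of BlingFire.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names taken from this page.
check_tools fa_nfa2dfa fa_build_conf fa_build_lex fa_fsm2fsm_pack fa_merge_dumps \
    || echo "Some tools are not in the PATH; re-run '. ./scripts/set_env' from the BlingFire directory."
```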
Let's change the working directory into the root for linguistic sources:
cd ldbsrc
Note: we will add separate documentation on the different formats of the linguistic resources. For the moment, we will modify only the tokenization logic, like this:
touch wbd/wbd.lex.utf8
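The touch command changes only the file's modification time, not its contents. Assuming Makefile.gnu uses make's ordinary timestamp comparison (an assumption, not verified here), that is enough to make the outputs built from this file count as out of date. The sketch below illustrates the comparison on temporary files; is_stale and the file names under the temporary directory are hypothetical stand-ins.

```shell
#!/bin/sh
# is_stale TARGET SOURCE: succeeds when TARGET is missing or older than SOURCE,
# which is the same test make applies to a target and its prerequisite.
# (Hypothetical helper for illustration; not part of BlingFire.)
is_stale() {
    [ ! -e "$1" ] || [ "$2" -nt "$1" ]
}

tmp=$(mktemp -d)
touch -t 202001010000 "$tmp/wbd.lex.utf8"   # stand-in source, older
touch -t 202001020000 "$tmp/wbd.bin"        # stand-in output, built later

if is_stale "$tmp/wbd.bin" "$tmp/wbd.lex.utf8"; then echo "needs rebuild"; else echo "up to date"; fi
# prints: up to date

touch "$tmp/wbd.lex.utf8"                   # bump the source's mtime to now

if is_stale "$tmp/wbd.bin" "$tmp/wbd.lex.utf8"; then echo "needs rebuild"; else echo "up to date"; fi
# prints: needs rebuild

rm -rf "$tmp"
```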
Now recompile the wbd directory. The name stands for word boundary disambiguation; the word-breaking, or tokenization, logic is defined in this directory. Simply type:
make -f Makefile.gnu lang=wbd all
You should see something like this on the screen:
fa_build_conf \
  --in=wbd/ldb.conf.small \
  --out=wbd/tmp/ldb.mmap.small.txt
fa_fsm2fsm_pack --type=mmap \
  --in=wbd/tmp/ldb.mmap.small.txt \
  --out=wbd/tmp/ldb.conf.small.dump \
  --auto-test
fa_build_lex --dict-root=. --full-unicode --in=wbd/wbd.lex.utf8 \
  --tagset=wbd/wbd.tagset.txt --out-fsa=wbd/tmp/wbd.rules.fsa.txt \
  --out-fsa-iwmap=wbd/tmp/wbd.rules.fsa.iwmap.txt \
  --out-map=wbd/tmp/wbd.rules.map.txt
fa_fsm2fsm_pack --alg=triv --type=moore-dfa --remap-iws --use-iwia --in=wbd/tmp/wbd.rules.fsa.txt --iw-map=wbd/tmp/wbd.rules.fsa.iwmap.txt --out=wbd/tmp/wbd.fsa.small.dump
fa_fsm2fsm_pack --alg=triv --type=mmap --in=wbd/tmp/wbd.rules.map.txt --out=wbd/tmp/wbd.mmap.small.dump --auto-test
fa_merge_dumps --out=ldb/wbd.bin wbd/tmp/ldb.conf.small.dump wbd/tmp/wbd.fsa.small.dump wbd/tmp/wbd.mmap.small.dump
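The last command in this output, fa_merge_dumps, writes ldb/wbd.bin. A quick way to confirm that the build produced it is sketched below. The helper check_output is hypothetical, and the script assumes it is run from the ldbsrc directory.

```shell
#!/bin/sh
# check_output FILE: succeeds and prints the size if FILE exists and is non-empty.
# (Hypothetical helper for illustration; not part of BlingFire.)
check_output() {
    if [ -s "$1" ]; then
        echo "ok: $1 ($(wc -c < "$1" | tr -d ' ') bytes)"
    else
        echo "missing or empty: $1"
        return 1
    fi
}

# Path taken from the --out option of fa_merge_dumps above.
check_output ldb/wbd.bin || echo "Build output not found; check the make log for errors."
```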