How to change linguistic resources

SergeiAlonichau edited this page Mar 29, 2019 · 35 revisions

Make the tools ready (this only needs to be done once)

Build the library and tools

  • git clone Bling-Fire-Git-Path
  • cd BlingFire
  • cmake .
  • ls -l Makefile

-rwxrwxrwx 1 sergeio sergeio 425666 Mar 29 18:58 Makefile

  • make all

The build will take a few minutes.

Make sure the tools are in the path

Now you need either to install the tools into a location that is already in the PATH, or to extend the PATH so that it includes the BlingFire directory with the tools. For the latter, run this command from the BlingFire directory:

  • . ./scripts/set_env
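If you would rather adjust the PATH by hand, the following is a minimal sketch of one way to do it. It assumes the built tools sit directly in the BlingFire directory, as the text above suggests; what scripts/set_env does internally is not shown here, and prepend_path is a hypothetical helper name, not part of BlingFire.

```shell
# Hypothetical helper (not part of BlingFire): prepend a directory to PATH
# unless it is already listed, so that repeated calls do not grow PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;      # otherwise put the directory first
    esac
    export PATH
}

# Assumption: the current directory is the BlingFire directory with the tools.
prepend_path "$(pwd)"
```

Like set_env, this only affects the current shell session, so it has to be repeated (or added to a shell startup file) for new terminals.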

To make sure that the tools are actually in the PATH, type:

fa_nfa2dfa --help

All tools respond to --help, so you should see something like:

Usage: fa_nfa2dfa [OPTION] [< input.txt] [> output.txt]

This program converts non-deterministic finite-state machine into
deterministic one.

  --in=input-file  - reads input from the input-file,
    if omited stdin is used

  --out=output-file - writes output to the output-file,
    if omited stdout is used

  --out2=output-file - writes output to the output-file,
    if omited stdout is used

  --pos-nfa=input-file - reads reversed position NFA from input-file,
    needed for --fsm=pos-rs-nfa to store only ambiguous positions, if omited
    stores all positions

  --fsm=rs-nfa - makes convertion from Rabin-Scott NFA (is used by default)
  --fsm=pos-rs-nfa - makes convertion from Rabin-Scott position NFA,
    builds Moore Multi Dfa
  --fsm=mealy-nfa - makes convertion from Mealy NFA into a cascade of
    two Mealy Dfa (general case) or a single Mealy DFA (trivial case)

  --spec-any=N - treats input weight N as a special any symbol,
    if specified produces Dfa with the same symbol on arcs,
    which must be interpreted as any other

  --bi-machine - uses bi-machine for Mealy NFA determinization

  --no-output - does not do any output

  --verbose - prints out debug information, if supported
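Reading the help text is one way to confirm the setup; a scripted check can be handy too. The sketch below uses the standard `command -v` lookup to report whether each named tool is reachable through the PATH. check_tools is a hypothetical helper name, and fa_nfa2dfa is the only tool name taken from this page; add others as needed.

```shell
# Hypothetical helper: report which of the given commands can be found on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# fa_nfa2dfa is the tool used in the check above.
check_tools fa_nfa2dfa || echo "Run '. ./scripts/set_env' from the BlingFire directory and retry."
```

If a tool is reported as missing, the PATH was most likely set in a different shell session than the one you are using now.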

Edit linguistic sources and compile them into automata

Change the working directory to the root of the linguistic sources:

cd ldbsrc

Note: separate documentation on the different formats of the linguistic resources will be added later. For the moment, we will modify only the tokenization logic, like this:

touch wbd/wbd.lex.utf8