pypolibox is a database-to-text generation (NLG) software built on Python 2.7, NLTK and Nicholas FitzGerald's pydocplanner.
Using a database of technical books and some user input, pypolibox generates sentence descriptions. These descriptions are then used by the OpenCCG surface realiser to generate written sentences in German.
In order to generate sentences (instead of abstract sentence descriptions), you will need to install OpenCCG (tested with version 0.9.5). Make sure that you can call tccg from the command line, e.g. by adding the openccg/bin directory to your $PATH.
Under Linux, you'd have to add something like this to your .bashrc:
export PATH=/home/username/bin/openccg/bin:$PATH
export OPENCCG_HOME=/home/username/bin/openccg
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
Under Windows, you'll have to set the environment variables OPENCCG_HOME and JAVA_HOME, and add the full path of your openccg/bin directory to the PATH variable.
pywin32 also needs to be installed under Windows.
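Before running pypolibox, it can help to confirm that tccg is actually reachable. The following is a minimal sketch and not part of pypolibox: the helper name find_on_path is hypothetical, and it only searches the directories listed in PATH the way a Linux shell would (on Windows, executable suffixes such as .bat would also have to be tried).

```python
import os


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None.

    `path` is a PATH-style string; it defaults to the PATH environment
    variable of the current process.
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    location = find_on_path("tccg")
    if location is None:
        print("tccg not found; check PATH and OPENCCG_HOME")
    else:
        print("tccg found at " + location)
```

If the script reports that tccg was not found, the PATH entry for the openccg/bin directory is the first thing to check.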
pip install pypolibox
Under Linux, you might have to prepend that command with sudo or execute it as root. Under Windows, you'll need to run this command in a console with administrator rights.
To install from the source repository instead, run the commands below. You might also need superuser/admin rights for this (see above).
git clone https://github.com/arne-cl/pypolibox.git
cd pypolibox
python setup.py install
pypolibox can be used from the command line or from within a Python interpreter. To see all the available options, enter:
pypolibox -h
To find books that are written in German and use the programming language Prolog, type:
pypolibox --language German --proglang Prolog
or, if you prefer short but cryptic commands:
pypolibox -l German -p Prolog
You can choose between several output formats using the -o or --output-format argument.
- openccg generates sentences using OpenCCG (default option)
- textplan-xml generates an XML representation of the textplans
- textplan-featstruct generates a feature structure representation (nltk.featstruct)
- hlds generates an HLDS XML representation of all the sentences
In future versions, you will be able to choose between several natural languages for the output using the -d or --output-language argument (currently only German is supported).
The following example query will generate HLDS XML snippets describing books about Prolog written in German:
pypolibox --language German --proglang Prolog --output-format hlds
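If you want to script such command line calls, the argument list can be assembled in Python. This is only a sketch: build_command and run_query are hypothetical helper names, only the options and output formats named in this README are used, and run_query assumes that pypolibox writes its result to standard output.

```python
import subprocess

# Output formats listed in this README.
OUTPUT_FORMATS = ("openccg", "textplan-xml", "textplan-featstruct", "hlds")


def build_command(language, proglang, output_format="openccg"):
    """Build the argument list for a pypolibox command line call."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("unknown output format: %r" % (output_format,))
    return [
        "pypolibox",
        "--language", language,
        "--proglang", proglang,
        "--output-format", output_format,
    ]


def run_query(language, proglang, output_format="openccg"):
    """Run pypolibox and return whatever it wrote to standard output.

    Assumption: the pypolibox executable is on PATH and prints its result.
    """
    command = build_command(language, proglang, output_format)
    return subprocess.check_output(command)
```

Calling build_command("German", "Prolog", "hlds") produces the same arguments as the HLDS example query above, and an unsupported format name is rejected before any process is started.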
Further usage examples can be found in the pypolibox.database.Query class documentation.
If you'd like to access pypolibox from within a Python interpreter, you can simply use the same arguments. Instead of a string like -l German -p Prolog, you will have to provide your arguments as a list of strings:
Query(["-l", "German", "-p", "Prolog"])
This query would be equivalent to the command line queries above.
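If you already have the arguments as a single command line string, the standard library's shlex module can split it into the list form that Query expects. This is a small sketch; to_query_args is a hypothetical helper name, and the commented-out import path is an assumption based on the pypolibox.database.Query class mentioned above.

```python
import shlex


def to_query_args(command_line_args):
    """Split a command line string into a list of argument strings."""
    return shlex.split(command_line_args)


args = to_query_args("-l German -p Prolog")
print(args)  # ['-l', 'German', '-p', 'Prolog']

# Assumed usage with pypolibox installed:
# from pypolibox.database import Query
# query = Query(args)
```

Unlike a plain str.split(), shlex.split() keeps quoted multi-word values together as one argument.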
pypolibox is built as a pipeline, where each important step is represented by a class. An instance of each class serves as the input of the next class in the pipeline, e.g.:
query = Query(["-l", "German", "-p", "Prolog"])
Results(query)
Books(Results(query))
...
TextPlans(AllMessages(AllPropositions(AllFacts(Books(Results(query))))))
If you instantiate a Query with your query arguments, you can use this Query instance as the input of a Results instance (which contains the data that the database provided for your query), which in turn can be used as the input of a Books instance etc.
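The chaining pattern itself can be sketched independently of pypolibox. In the example below, QueryStub, ResultsStub and BooksStub are hypothetical stand-ins that only remember which instance they were built from; they do not reproduce what the real classes compute.

```python
from functools import reduce


class Stage(object):
    """Hypothetical stand-in for a pipeline class such as Results or Books."""

    def __init__(self, source):
        # Each stage keeps a reference to the object it was built from.
        self.source = source

    def lineage(self):
        """Return the class names from the first stage down to this one."""
        names = []
        stage = self
        while isinstance(stage, Stage):
            names.append(type(stage).__name__)
            stage = stage.source
        return list(reversed(names))


class QueryStub(Stage):
    pass


class ResultsStub(Stage):
    pass


class BooksStub(Stage):
    pass


def build_pipeline(seed, stage_classes):
    """Feed `seed` into the first class, that instance into the second, etc."""
    return reduce(lambda previous, cls: cls(previous), stage_classes, seed)


books = build_pipeline(["-l", "German", "-p", "Prolog"],
                       [QueryStub, ResultsStub, BooksStub])
print(books.lineage())  # ['QueryStub', 'ResultsStub', 'BooksStub']
```

The reduce call is equivalent to writing BooksStub(ResultsStub(QueryStub(args))) by hand, which mirrors the nested calls shown above.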
Of course, you wouldn't want to chain all those classes just to retrieve textplans. To do so, simply use one of the functions provided in the debug module, either by running the debug.py file in the interpreter or by importing it:
import debug
debug.gen_textplans(["-l", "German", "-p", "Prolog"])
This function call would return the same results as the aforementioned command line calls. For further testing, try debug.testqueries and debug.error_testqueries, which basically are lists of predefined valid and invalid query arguments and which can be used to query the database (and see how errors are handled).
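One way to work through such lists is a small loop that records which argument lists succeed and which fail. The sketch below is not part of pypolibox: run_queries is a hypothetical helper, and catching SystemExit rests on the assumption that invalid arguments may be rejected the way argparse rejects them.

```python
def run_queries(generate, queries):
    """Call `generate` on each argument list; collect results and failures.

    Returns two lists of (args, outcome) pairs: one for calls that
    returned normally and one for calls that raised.
    """
    succeeded, failed = [], []
    for args in queries:
        try:
            succeeded.append((args, generate(args)))
        except (Exception, SystemExit) as error:
            # SystemExit is caught as well, on the assumption that bad
            # arguments might terminate parsing instead of raising.
            failed.append((args, error))
    return succeeded, failed
```

With pypolibox's debug module imported, a call such as run_queries(debug.gen_textplans, debug.testqueries) would then be a natural starting point, assuming that debug.testqueries really is a plain list of argument lists.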
The documentation is available online, but you can always get an up-to-date local copy using Sphinx.
You can generate an HTML or PDF version by running these commands in pypolibox's docs directory:
make latexpdf
to produce a PDF (docs/_build/latex/pypolibox.pdf) and
make html
to produce a set of HTML files (docs/_build/html/index.html).
The pypolibox package contains the following modules:
- The pypolibox module is the main module, which is invoked from the command line.
- The database module handles the user input, queries the database and returns the results.
- The facts module converts those results into attribute value matrices.
- The propositions module evaluates those facts (positive, negative, neutral).
- The textplan module takes those propositions and turns them into messages. In contrast to propositions, messages do not contain duplicates and add comparative information. Rules are used to combine those messages into constituent sets and ultimately into one text plan. The textplan module also allows exporting those text plans in XML format.
- The rules module contains the rules used by the textplan module to combine messages into constituent sets and textplans, respectively.
- The messages module generates messages from propositions, which will be used by the textplan module.
- The lexicalize_messageblocks module is the "main" module of the lexicalization. For each message block in a textplan, it generates one or more possible lexicalizations, which are then realized by the realization module.
- The lexicalization module generates lexicalizations (in HLDS-XML format) for each message, which are used by the lexicalize_messageblocks module to form lexicalizations of complete message blocks.
- A note on terminology: A message block in pypolibox is basically an instance of the Message class, e.g. an "id" message block. This "id" message block in turn consists of several messages, e.g. an "authors" message and a "title" message.
- The realization module takes a lexicalized phrase or sentence (in HLDS-XML format) and converts it into a surface realization (with the help of OpenCCG's tccg executable).
- The hlds module converts textplans from an nltk.featstruct-based format to HLDS-XML and vice versa. In addition, the module can produce attribute-value matrices of these textplans as LaTeX/PDF files.
The code is licensed under GPL Version 3. The grammar fragment is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Arne Neumann (original author), Pablo Duboue
This software reimplements parts of the Java-based JPolibox text-generation software written by Alexandra Strelakova, Felix Dombek, Mathias Langer and Till Kolter. pypolibox also includes a heavily modified version of Nicholas FitzGerald's pydocplanner, which he released under a Creative Commons license (not specified further). The German OpenCCG grammar fragment that comes with pypolibox was written by Martin Oltmann.