jbjorne edited this page Feb 23, 2013 · 52 revisions

Turku Event Extraction System (TEES) is a free and open source natural language processing system developed for the extraction of events and relations from biomedical text. It is written mostly in Python, and should work in generic Unix/Linux environments.

TEES has been evaluated in the following Shared Tasks. Models for predicting their targets are provided with TEES and can be used on any unannotated text.

Biomedical event extraction refers to the automatic detection of molecular interactions from research articles. Events can accurately represent complex statements, e.g. the sentence “Protein A causes protein B to bind protein C” produces the event CAUSE(A, BIND(B, C)). Such formal structures can be processed with computational methods, allowing large-scale analysis of the literature, as well as applications such as pathway construction.
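The nesting in CAUSE(A, BIND(B, C)) can be illustrated with a small recursive data structure. This is only a sketch in Python 3: the `Event` class and its fields are hypothetical names chosen for illustration and are not part of TEES, which stores events in interaction XML.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Event:
    """Hypothetical event type: arguments are entity names or nested events."""
    type: str
    args: Tuple[Union[str, "Event"], ...]

    def __str__(self) -> str:
        # Render nested events recursively, e.g. CAUSE(A, BIND(B, C)).
        return f"{self.type}({', '.join(str(a) for a in self.args)})"


# "Protein A causes protein B to bind protein C"
binding = Event("BIND", ("B", "C"))
causation = Event("CAUSE", ("A", binding))
print(causation)
```

Because an argument may itself be an event, statements of arbitrary depth can be represented with the same structure.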

TEES event extraction process

TEES event extraction proceeds in a pipeline of external components and TEES modules. The processing steps are linked together by the interaction XML file format.

If TEES is used to process completely unannotated text, the Preprocessor uses a number of external tools to detect protein names and parse the sentences (A-C). The result is an interaction XML file comparable to the shared task corpora provided with TEES. An Event Detector can then be used to first detect trigger words such as verbs (D), followed by detection of the interactions (E), after which complex events need to be constructed in the unmerging step (F), and modifiers such as negation and speculation can be detected (G). The result is a set of events in the interaction XML format, which can be further converted to the BioNLP Shared Task flat-file format (H), or used as is.

The central concept in how TEES does event and relation extraction is a graph representation for both syntactic and semantic information. The parse tree is a graph where the tokens are nodes and the dependencies are edges. Events or relations form a graph where named entities and event triggers are entity nodes, and relations and event arguments are interaction edges.

TEES has no intrinsic concept of an event. A relation is a single interaction edge between two entity nodes. An event is simply the entity node denoting the trigger word and the set of its outgoing interaction edges. With this approach, both relation and event extraction, including all the varied BioNLP Shared Task domains, can be represented with a simple graph of word nodes and interaction edges. This annotation, as well as parses and other data, is stored in the interaction XML format. Understanding the interaction XML format is central to using TEES, as all TEES components work with this shared file format.
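The graph view described above can be sketched in a few lines of Python 3. The class names, field names, identifiers (T1 to T5) and type labels below are assumptions made for illustration; they are not the TEES classes or the interaction XML schema. The sketch only shows that an "event" falls out of the graph as a trigger node plus its outgoing edges.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    id: str
    text: str
    type: str
    is_trigger: bool = False  # False: named entity, True: event trigger word


@dataclass
class Interaction:
    type: str  # argument or relation type, e.g. "Theme" or "Cause"
    e1: str    # id of the source entity node
    e2: str    # id of the target entity node


@dataclass
class SentenceGraph:
    entities: Dict[str, Entity] = field(default_factory=dict)
    interactions: List[Interaction] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.id] = entity

    def add_interaction(self, type: str, e1: str, e2: str) -> None:
        self.interactions.append(Interaction(type, e1, e2))

    def outgoing(self, entity_id: str) -> List[Interaction]:
        return [i for i in self.interactions if i.e1 == entity_id]

    def events(self) -> Dict[str, List[Interaction]]:
        """An event is a trigger node together with its outgoing edges."""
        return {eid: self.outgoing(eid)
                for eid, e in self.entities.items() if e.is_trigger}


# "Protein A causes protein B to bind protein C"
graph = SentenceGraph()
for eid, text, etype, trigger in [
    ("T1", "A", "Protein", False),
    ("T2", "B", "Protein", False),
    ("T3", "C", "Protein", False),
    ("T4", "causes", "CAUSE", True),
    ("T5", "bind", "BIND", True),
]:
    graph.add_entity(Entity(eid, text, etype, trigger))

graph.add_interaction("Cause", "T4", "T1")
graph.add_interaction("Theme", "T4", "T5")  # edge to another trigger nests the event
graph.add_interaction("Theme", "T5", "T2")
graph.add_interaction("Theme", "T5", "T3")

events = graph.events()
```

A plain relation needs no extra machinery in this representation: it is a single `Interaction` between two non-trigger entities.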

Training, Classification and Models

The interface of TEES is similar to that of many popular machine learning tools: a training program produces a model file, and a classification program uses the model file to predict events in unlabeled text. The training and classification programs are explained in their own sections.

The model file is simply a zip archive (or an uncompressed directory) containing all the information required to classify unlabeled text. It can be accessed using the Model class, through which files and name/value pairs can be added or removed. The model file stores the parameters for the individual TEES components, the machine learning models produced by SVMs, and the class and feature id numbers required to build new examples to be classified with an existing SVM model.
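The idea of a zip archive that holds both files and name/value pairs can be sketched with Python's standard `zipfile` module. This does not use or reproduce the TEES Model class: the function names, the member name `settings.tsv`, the file names and the example values are all hypothetical, chosen only to show the general layout.

```python
import io
import zipfile

SETTINGS_MEMBER = "settings.tsv"  # hypothetical member holding name/value pairs


def write_model(target, files, values):
    """Store named files plus name/value pairs in one zip archive."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
        lines = [f"{key}\t{value}" for key, value in sorted(values.items())]
        archive.writestr(SETTINGS_MEMBER, "\n".join(lines))


def read_values(source):
    """Read the name/value pairs back from the archive."""
    with zipfile.ZipFile(source) as archive:
        text = archive.read(SETTINGS_MEMBER).decode("utf-8")
    return dict(line.split("\t", 1) for line in text.splitlines())


def list_files(source):
    """List the stored files, excluding the settings member."""
    with zipfile.ZipFile(source) as archive:
        return sorted(n for n in archive.namelist() if n != SETTINGS_MEMBER)


# An in-memory buffer stands in for a model file on disk.
buffer = io.BytesIO()
write_model(
    buffer,
    files={"feature-ids.txt": "word_bind\t1\n", "class-ids.txt": "neg\t1\n"},
    values={"classifier-parameter": "c=1000", "example-style": "style1"},
)
stored_values = read_values(buffer)
stored_files = list_files(buffer)
```

Keeping everything in one archive means a trained system can be copied or shared as a single file.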

TEES provides precomputed models for all the BioNLP'11, BioNLP'09 and DDI Shared Task corpora. These models come in pairs, with a devel model using the training set for learning and the devel set for parameter optimization, and a test model, using both training and devel sets for learning and the parameters from the devel model. A devel model is used when adapting TEES for a new task, with the test model left for determining the final performance of the system on a usually hidden test set. When TEES is used for text mining, the test model should be used, as it has been trained on a larger set of data.

Log Files

Most TEES programs save all on-screen output into a log file, to help keep a record of the experiments done. These log files are usually named "*log.txt". When a log file already exists, the default behaviour of TEES is to append to it, so when an experiment is re-run, the latest record is at the end of the file.

TEES logging is based on the StreamModifier class, which wraps Python's stdout and stderr streams. This also allows saving the output of external programs run as subprocesses. The most important TEES programs that write logs are train and classify. Logging can be turned off by passing None as the log argument of these programs/functions, and more detailed control of logging is available via the functions in the Utils.Stream module.
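The general technique of wrapping stdout so that output reaches both the screen and a log can be sketched as follows. This is not the StreamModifier implementation: `TeeStream` and `run_logged` are hypothetical names, and the sketch only captures writes made from Python code (it does not show how subprocess output is collected). A real log file would be opened in append mode; an in-memory buffer is used here to keep the example self-contained.

```python
import io
import sys


class TeeStream:
    """Hypothetical wrapper that forwards writes to a stream and a log."""

    def __init__(self, stream, log):
        self.stream = stream
        self.log = log

    def write(self, text):
        self.stream.write(text)
        self.log.write(text)
        return len(text)

    def flush(self):
        self.stream.flush()
        self.log.flush()


def run_logged(func, log):
    """Run func with stdout duplicated into log; skip wrapping if log is None."""
    if log is None:
        return func()
    original = sys.stdout
    sys.stdout = TeeStream(original, log)
    try:
        return func()
    finally:
        sys.stdout = original  # always restore the real stream


log = io.StringIO()  # stands in for open("log.txt", "a")
stdout_before = sys.stdout
run_logged(lambda: print("training started"), log)
run_logged(lambda: print("not recorded"), None)
```

Restoring the original stream in a `finally` block keeps later output working even if the wrapped function raises an exception.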

The log files for the production of the default TEES models and converted corpora can be found in the data directory where the model and corpus files are installed. These logs give an idea of what the output of such processes should look like when you run your own work with TEES.

Parameter Sets

Many TEES components, such as the ExampleBuilder classes, have a great number of parameters that define their behaviour. In order to keep the number of method and command line arguments manageable, as well as provide consistent parameter validation, these components take some arguments as parameter sets. These parameter sets are processed with the Parameters module, and can be passed as parameter/value dictionaries or strings. When giving parameters as strings (such as when using the command line interface) the following format is used:

parameter1:parameter2=value:parameter4=value1,value2:parameter5

Parameters are separated by colons ":". If a parameter has a value, it is assigned with an equals sign "=", and multiple values are separated by commas ",". In some special cases where only a single parameter is defined, the parameter name can be omitted and only a value or a list of values given. For example, for classifier parameters given to train.py, value and value1,value2 are equivalent to c=value and c=value1,value2.
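The string format described above can be parsed with a few lines of Python 3. This is a minimal sketch and not the TEES Parameters module: the function name `parse_parameters`, the `default_name` argument and the choice to represent a valueless parameter as None are assumptions made for illustration, and no parameter validation is performed.

```python
def _split_values(value):
    """Return a single value as a string, several values as a list."""
    values = value.split(",")
    return values if len(values) > 1 else values[0]


def parse_parameters(text, default_name=None):
    """Parse 'p1:p2=value:p4=v1,v2:p5' into a dictionary.

    Parameters without a value map to None. If default_name is given and
    the string contains neither ':' nor '=', the whole string is treated
    as the value (or value list) of that single parameter.
    """
    if not text:
        return {}
    if default_name is not None and ":" not in text and "=" not in text:
        return {default_name: _split_values(text)}
    params = {}
    for item in text.split(":"):
        if not item:
            continue  # tolerate stray or doubled colons
        name, separator, value = item.partition("=")
        params[name] = _split_values(value) if separator else None
    return params


parsed = parse_parameters(
    "parameter1:parameter2=value:parameter4=value1,value2:parameter5")
# Omitting the name of the single classifier parameter "c":
short_form = parse_parameters("value1,value2", default_name="c")
long_form = parse_parameters("c=value1,value2")
```

Here `short_form` and `long_form` hold the same dictionary, which mirrors the shorthand described for the classifier parameters of train.py.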
