Tutorial
- Familiarity with Java.
- JDK 1.6, Eclipse (Helios+), Maven (desirable), UIMA SDK.
Other pages
- General explanation of YAML
- Migrating your UIMA project to CSE
(from cse-tutorial...)
- What is UIMA? According to the Apache UIMA project page:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
- CSE & ECD
- What is ECD? See: uima-ecd project wiki.
- What is CSE? See: cse-framework project wiki. ![Three Phase Pipeline][threephasepipeline]
See: Configuring an Eclipse project
Extended configuration descriptor (ECD) extends UIMA and uimaFIT to balance ease of use and flexibility. To this end, ECD provides three major features: (1) YAML-based descriptors, (2) a driver that supports multiple options for a component specified in the descriptor, and (3) a driver that supports declarative options for a component.
Similar to the collection processing engine (CPE) descriptor in the UIMA SDK, the key element of an ECD is a YAML associative array of three elements:
- collection-reader
- the actual pipeline, and
- the optional post-process pipeline
Similar to the UIMA SDK component descriptors (collection reader, pipeline annotator, and CAS consumer), the building blocks of an ECD are component descriptors, each a YAML associative array that defines:
- The Java class that implements the component,
- Parameters and values the component requires.
YAML Ain't Markup Language (YAML) is a human-readable data serialization format that provides a simple but rich syntax to represent a data structure, like our component descriptors as well as engine descriptors.
One important aspect of YAML is that indentation matters. Lines indented under a key form a nested block that belongs to that key, much as in other indentation-sensitive languages such as Python. A value is assigned to a key using a colon followed by a space. For example,
configuration:
  name: oaqa-tutorial
  author: oaqa
Here the parameters name and author are set to oaqa-tutorial and oaqa respectively, and are passed as a block to configuration.
When writing an ECD for a UIMA annotator, the first line specifies whether it is a subclass of another annotator descriptor, or whether it refers to a UIMA annotator class directly. The first line in an ECD component can contain one of the following:
- inherit: looks for a resource file on the class-path at the path given by the dotted syntax a.b.c => a/b/c.yaml. Inherited parameters can be overridden directly in the body of the resource file.
- class: looks for the specified class on the class-path, and is intended as a shorthand for classes that do not have configurable parameters.
For example, if bar.yaml refers to a concrete class Bar.java that resides in the package bar with some fixed parameter fixed-param set to a, then the YAML descriptor will look like:
# bar.yaml
class: bar.Bar
fixed-param: a
If a descriptor foo.yaml is a subclass of bar.yaml, the YAML descriptor will look like:
# foo.yaml
inherit: bar
var: [x, y]
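The inherit mechanism described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class InheritSketch and its methods toResourcePath and merge are hypothetical names, not part of the ECD API, and the actual ECD loader may resolve and merge descriptors differently. It shows the dotted-name mapping a.b.c => a/b/c.yaml and the idea that parameters declared in foo.yaml are layered on top of those inherited from bar.yaml.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the ECD API. */
final class InheritSketch {

    /** Maps a dotted resource name to a class-path location, e.g. a.b.c => a/b/c.yaml. */
    static String toResourcePath(String dottedName) {
        return dottedName.replace('.', '/') + ".yaml";
    }

    /** Starts from the parent's parameters and lets the child's entries override or extend them. */
    static Map<String, Object> merge(Map<String, Object> parent, Map<String, Object> child) {
        Map<String, Object> merged = new LinkedHashMap<String, Object>(parent);
        merged.putAll(child);
        // Assumption: the inherit key only points at the parent and is not a parameter itself.
        merged.remove("inherit");
        return merged;
    }

    public static void main(String[] args) {
        // Contents of bar.yaml, written out by hand as a map.
        Map<String, Object> bar = new LinkedHashMap<String, Object>();
        bar.put("class", "bar.Bar");
        bar.put("fixed-param", "a");

        // Contents of foo.yaml, written out by hand as a map.
        Map<String, Object> foo = new LinkedHashMap<String, Object>();
        foo.put("inherit", "bar");
        foo.put("var", Arrays.asList("x", "y"));

        System.out.println(toResourcePath("a.b.c")); // prints a/b/c.yaml
        System.out.println(merge(bar, foo));
    }
}
```

In this sketch the merged descriptor for foo keeps class and fixed-param from bar and adds var; a parameter repeated in foo.yaml would replace the inherited value.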
![resources][resources]
Resources on a descriptor are configured using named parameters; any dash-separated string is a valid parameter name except for the reserved keywords inherit, options, class, and pipeline. The value of a parameter is either a Java primitive wrapper (Integer, Float, Long, Double, Boolean) or a String. For nested resources, compound parameters are passed as Strings that are further parsed within the resource. For example, passing a regular expression as a String parameter to a RoomAnnotator will look like:
class: annotators.RoomAnnotator
pattern: \\b[0-4]\\d[0-2]\\d\\d\\b
Combinatorial parameters are specified using the cross-opts mapping, declaring the desired values as elements of a list. For example, consider the following annotator descriptor:
class: annotators.RoomAnnotator
foo: bar
cross-opts:
  parameter-a: [value100,value200]
  parameter-b: [value300,value400]
The configuration above will result in the 2x2 cross-product of configurations of the component:
[foo:bar, parameter-a: value100, parameter-b: value300]
[foo:bar, parameter-a: value200, parameter-b: value300]
[foo:bar, parameter-a: value100, parameter-b: value400]
[foo:bar, parameter-a: value200, parameter-b: value400]
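The expansion of cross-opts into individual configurations can be sketched as follows. This is a minimal illustration, not the ECD implementation: the class CrossOptsSketch and the method expand are hypothetical names, and the descriptor is written out by hand as Java maps rather than parsed from YAML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the ECD API. */
final class CrossOptsSketch {

    /** Combines the fixed parameters with every combination of the cross-opts values. */
    static List<Map<String, Object>> expand(Map<String, Object> fixed,
                                            Map<String, List<?>> crossOpts) {
        List<Map<String, Object>> configs = new ArrayList<Map<String, Object>>();
        configs.add(new LinkedHashMap<String, Object>(fixed));
        // Each cross-opts entry multiplies the number of configurations by its value count.
        for (Map.Entry<String, List<?>> opt : crossOpts.entrySet()) {
            List<Map<String, Object>> next = new ArrayList<Map<String, Object>>();
            for (Object value : opt.getValue()) {
                for (Map<String, Object> partial : configs) {
                    Map<String, Object> extended = new LinkedHashMap<String, Object>(partial);
                    extended.put(opt.getKey(), value);
                    next.add(extended);
                }
            }
            configs = next;
        }
        return configs;
    }

    public static void main(String[] args) {
        Map<String, Object> fixed = new LinkedHashMap<String, Object>();
        fixed.put("foo", "bar");

        Map<String, List<?>> crossOpts = new LinkedHashMap<String, List<?>>();
        crossOpts.put("parameter-a", Arrays.asList("value100", "value200"));
        crossOpts.put("parameter-b", Arrays.asList("value300", "value400"));

        // Prints one line per generated configuration (four in total).
        for (Map<String, Object> config : expand(fixed, crossOpts)) {
            System.out.println(config);
        }
    }
}
```

With n cross-opts parameters the number of configurations is the product of the list lengths, so adding a third parameter with three values to this descriptor would yield 2x2x3 = 12 configurations.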
![In-phase pipeline][inphasepipeline]
![Execution path][executionpathinphasepipeline]
These examples are based on the UIMA SDK tutorial (USDK). To see how they were migrated, see the section Migrating your UIMA project to CSE.
For this example we will use only one type, RoomNumber. You can follow the steps from (Writing a Type System) to add this type.
<?xml version="1.0" encoding="UTF-8" ?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>TutorialTypeSystem</name>
  <description>Type System Definition for the tutorial examples -
    as of Exercise 1</description>
  <vendor>Apache Software Foundation</vendor>
  <version>1.0</version>
  <types>
    <typeDescription>
      <name>org.apache.uima.tutorial.RoomNumber</name>
      <description></description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>building</name>
          <description>Building containing this room</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
Don't forget to add the path of the type system descriptor file to your META-INF/org.uimafit/types.txt file!
Next, we write the main YAML descriptor for the example. Notice that we use the same collection reader and CAS consumer as before; the only addition is the phase RoomNumberAnnotator.
configuration:
  name: oaqa-tutorial
  author: oaqa
collection-reader:
  inherit: collection_reader.filesystem-collection-reader
  InputDirectory: data/
pipeline:
  - inherit: ecd.phase
    name: RoomNumberAnnotator
    options: |
      - inherit: tutorial.ex1.RoomNumberAnnotator
  - inherit: cas_consumer.AnnotationPrinter
Now we define the descriptor for the RoomNumberAnnotator by just specifying where the class is located.
class: org.apache.uima.tutorial.ex1.RoomNumberAnnotator
Finally, we write the code for the annotator, taken directly from the UIMA SDK tutorial (USDK).
package org.apache.uima.tutorial.ex1;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;

/**
 * Example annotator that detects room numbers using Java 1.4 regular expressions.
 */
public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern mYorktownPattern = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  private Pattern mHawthornePattern = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");

  /**
   * @see JCasAnnotator_ImplBase#process(JCas)
   */
  public void process(JCas aJCas) {
    // get document text
    String docText = aJCas.getDocumentText();
    // search for Yorktown room numbers
    Matcher matcher = mYorktownPattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Yorktown");
      annotation.addToIndexes();
    }
    // search for Hawthorne room numbers
    matcher = mHawthornePattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Hawthorne");
      annotation.addToIndexes();
    }
  }
}
To run, simply specify the path of the main descriptor as an argument to the launch configuration.
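Before launching the full pipeline, it can help to check the two regular expressions on their own. The sketch below reuses the patterns from RoomNumberAnnotator but has no UIMA dependency; the class RoomPatternSketch, the method findRooms, and the sample sentence are made up for illustration and are not part of the tutorial code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-alone check of the annotator's patterns; not part of the tutorial code. */
final class RoomPatternSketch {
    // Same expressions as in RoomNumberAnnotator.
    static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
    static final Pattern HAWTHORNE = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");

    /** Returns one "building:roomNumber" entry per match, Yorktown matches first. */
    static List<String> findRooms(String text) {
        List<String> found = new ArrayList<String>();
        collect(YORKTOWN, "Yorktown", text, found);
        collect(HAWTHORNE, "Hawthorne", text, found);
        return found;
    }

    private static void collect(Pattern pattern, String building, String text, List<String> out) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            out.add(building + ":" + matcher.group());
        }
    }

    public static void main(String[] args) {
        // Made-up sample sentence containing one room number of each kind.
        for (String room : findRooms("Meet in 20-001 or GN-K35.")) {
            System.out.println(room);
        }
    }
}
```

In the annotator itself, each match becomes a RoomNumber annotation whose begin and end offsets come from matcher.start() and matcher.end(); the sketch only collects the matched text.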
- Have to use the uimaFIT style of declaring the type system in META-INF/org.uimafit/types.txt (see slides)
- Have to use the uimaFIT hierarchy of cas_consumerimpl in order for it to work in the pipeline
- initialize() -> initialize(UimaContext context)
  - initialize takes the argument UimaContext context
- processCas() -> process(CAS aCas)
- Collection reader has to extend AbstractCollectionReader
  - cannot use the out-of-the-box UIMA collection readers
  - implement the method getNextElement
    - returns the next data element in the pipeline
    - this is used instead of getNext(jcas)
  - override initialize and call super
- Can we provide a reference to the original homework documents?
- Organization for the wiki: should it be decentralized? (i.e., each wiki is responsible for its own content). Or centralized? (i.e., one big wiki for all the repositories used in the tutorial with references to those repositories).
Thanks to Zi Yang (@ziy), Elmer Garduno (@elmerg), and OAQA team members!
[1] (Session 12: Configuration Space Exploration (SE class)) - (SE)
[2] (Elmer's technical report) - (E)
[3] (Zi's Slides) - (Z)
[4] Zi HW0 - (hw0)
[5] Zi HW1 - (hw1)
[6] Zi HW2 - (hw2)
[7] UIMA SDK Tutorial (USDK)
[resources]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/resources.png "resources"
[inphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/inphasepipeline.png "inphasepipeline"
[executionpathinphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/executionpathinphasepipeline.png "executionpathinphasepipeline"
[threephasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/threephasepipeline.png "threephasepipeline"