amaiberg edited this page Jan 15, 2013 · 61 revisions


# Introduction

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

# Getting Started

## Setting up

### Prerequisites

#### Installing JDK

If you have the latest JDK 6 installed, you can skip this task.

  1. Visit the Java SE Downloads page and download JDK 6 SE35 for your platform.

  2. Install the JDK from the executable file if one is available, and set PATH manually if you are using a Windows machine. The Java installation page at http://www.oracle.com/technetwork/java/javase/index-137561.html might be useful to you.

#### Installing Git

If you have Git installed, you can skip this task.

In most cases you will not need standalone Git (that is, running Git commands from the command line), because we will have EGit (the Git plug-in for Eclipse) installed. Unfortunately, there is one case in which you must install Git: the Maven plug-in for Eclipse (m2e) needs Git itself (not EGit), together with the SCM URI you specified, to execute Git commands. (Don’t know what SCM is? You probably need to go back to the previous task.)

  1. Visit http://git-scm.com/downloads to download Git.

  2. You can refer to the Pro Git book for how to install Git for different platforms (at http://git-scm.com/book/en/Getting-Started-Installing-Git), and how to set up Git (at http://git-scm.com/book/en/Getting-Started-First-Time-Git-Setup).

#### Maven

If you have Maven installed, you can skip this task. As with the Git installation, you will not need a standalone Maven (as opposed to the m2e plug-in) most of the time, since m2e has an embedded Maven runtime by default. However, we (and some other developers) found that in some environments m2e could not find the correct installation path of the embedded runtime to execute certain goals (e.g., deploy, release:prepare, release:perform). If you find a feasible way around this, please let us know.

  1. Download Maven 3.0.4 (Binary, either tar.gz or zip) from http://maven.apache.org/download.html.

  2. Follow the installation instructions (for different platforms) at the bottom of the page to install Maven. The first note in the instructions is:

Maven is a Java tool, so you must have Java installed in order to proceed. More precisely, you need a Java Development Kit (JDK), the Java Runtime Environment (JRE) is not sufficient.

#### Eclipse (Git, Maven plug-ins integrated)

If you have an Eclipse IDE for Java Developers with version 3.7, you could probably skip this task. But if you are stuck in a situation where you were told Eclipse is missing a plug-in, you might want to return to this section. If you have other packages (Eclipse Classic or Eclipse for Java EE Developers), please do not skip this task.

  1. Download Eclipse IDE for Java Developers 4.2 at http://www.eclipse.org/downloads/packages/eclipse-ide-java-developers/junor.

  2. Install Eclipse by simply uncompressing the downloaded package.

### Installing UIMA SDK

Similar to the installation processes you have gone through for Git and Maven, you will install the UIMA binaries and a UIMA Eclipse plug-in.

  1. Follow the instructions in subsection 3.1.2 of the Overview & Setup section of the UIMA Manuals and Guides, adding the UIMA Eclipse Plug-in update site (http://www.apache.org/dist/uima/eclipse-update-site).

  2. You might encounter a “Cannot satisfy dependency” issue like the one shown in Figure 3.1. As a workaround for this year-long bug, select only “Apache UIMA Eclipse tooling and runtime support” and deselect “Apache UIMA-AS (Asynchronous Scaleout) Eclipse tooling” (as shown in Figure 3.2); the latter tool will not be used for this course.

  3. After installation, you are able to create or edit UIMA descriptors with a GUI.

Optional (recommended):

### Importing Apache UIMA code style template

To develop software as a team, members should always adopt the same code conventions to improve the readability and maintainability of the project. We suggest you review the Code Conventions for the Java Programming Language at http://www.oracle.com/technetwork/java/codeconv-138413.html, published by Oracle. For our course homework, you are required to adopt a more specific set of coding conventions from the Apache UIMA project. Details can be found at http://uima.apache.org/codeConventions.html. At the bottom of that page, you can find a link to download the Eclipse code style template.

  1. Download the template and save it in your local filesystem.

  2. Click Window > Preferences, then go to Java > Code Style > Formatter, and click Import.

Before you finish editing a Java file, remember to press Ctrl+Shift+F to format the code automatically. Another optional but useful tool for checking your code style is the Eclipse Checkstyle plug-in. You can learn how to download and install the plug-in at http://eclipse-cs.sourceforge.net/.

### Directory structure

Note: need to create generic archetype for CSE project separate from hw1.

The directory structure should look like this:

```
myproject
 |- pom.xml
 '- src
    '- main
      |- java
      |  '- **/*.java 
      '- resources
         |- mypipeline.yaml /* the entry point for your pipeline */
         |- **/*.yaml /* all your descriptors go into the resources folder */
         '- META-INF
            '- org.uimafit
               '- types.txt
```

### (Optional) Persistence/Database

### (Optional) UI

## Writing Extended Configuration Descriptors (ECD)
[Extended configuration descriptor (ECD)](https://github.com/oaqa/uima-ecd) extends UIMA and uimaFIT to balance ease of use and flexibility. To this end, ECD provides three major features: (1) YAML-based descriptors, (2) a driver that supports multiple options for a component specified in the descriptor, and (3) a driver that supports declarative options for a component.

Similar to the collection processing engine (CPE) descriptor in the UIMA SDK, the key element of an ECD is a YAML associative array of three elements:
* the collection-reader,
* the actual pipeline, and
* the optional post-process pipeline.

Similar to the UIMA SDK component descriptors (collection reader, pipeline annotator, and CAS consumer), the building blocks of an ECD are component descriptors. Each is a YAML associative array that defines:
* the Java class that implements the component, and
* the parameters and values the component requires.

### Basics
#### YAML Format
[YAML Ain't Markup Language (YAML)](http://yaml.org) is a human-readable data serialization format that provides a simple but rich syntax for representing data structures such as our component descriptors and engine descriptors.

One important aspect of YAML is that indentation matters. Lines indented more deeply than the preceding key form a nested block, similar to other indentation-sensitive languages such as Python. A value is assigned to a key using a colon followed by a space. For example,
```yaml
configuration: 
  name: oaqa-tutorial
  author: oaqa
```

Here the parameters `name` and `author` are set to `oaqa-tutorial` and `oaqa` respectively, and are passed as a block to `configuration`.

#### Components
When writing an ECD for a UIMA annotator, the first line specifies whether it is a subclass of another annotator descriptor, or whether it refers to a UIMA annotator class directly. The first line in an ECD component can contain one of the following:

* inherit: will look for a resource file on the class-path at the path given by the dotted syntax a.b.c => a/b/c.yaml. Inherited parameters can be overridden directly in the body of the resource file.

* class: will look for the specified class within the class-path, and is intended to be used as a shorthand for classes that do not have configurable parameters.

For example, if `bar.yaml` refers to a concrete class `Bar.java` that resides in the package bar with some fixed parameter `fixed-param` set to `a`, then the YAML descriptor will look like:

```yaml
# bar.yaml
class: bar.Bar
fixed-param: a
```

If a descriptor `foo.yaml` inherits from `bar.yaml`, the YAML descriptor will look like:

```yaml
# foo.yaml
inherit: bar
var: [x, y]
```
![resources][resources]
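
The lookup and override rules described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class `InheritSketch` and its methods `toResourcePath` and `merge` are hypothetical names invented for this example and are not part of the ECD API, and the real driver may resolve and merge descriptors differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the inherit rules; not the actual ECD implementation. */
class InheritSketch {

  /** Maps a dotted descriptor name to a class-path resource path: a.b.c => a/b/c.yaml. */
  static String toResourcePath(String dottedName) {
    return dottedName.replace('.', '/') + ".yaml";
  }

  /** Starts from the parent's parameters and lets the child's entries override or extend them. */
  static Map<String, Object> merge(Map<String, Object> parent, Map<String, Object> child) {
    Map<String, Object> merged = new LinkedHashMap<>(parent);
    for (Map.Entry<String, Object> entry : child.entrySet()) {
      // The inherit key only names the parent, so it is not copied into the result.
      if (!entry.getKey().equals("inherit")) {
        merged.put(entry.getKey(), entry.getValue());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Contents of bar.yaml from the example above.
    Map<String, Object> bar = new LinkedHashMap<>();
    bar.put("class", "bar.Bar");
    bar.put("fixed-param", "a");

    // Contents of foo.yaml from the example above.
    Map<String, Object> foo = new LinkedHashMap<>();
    foo.put("inherit", "bar");
    foo.put("var", List.of("x", "y"));

    System.out.println(toResourcePath("tutorial.ex1.RoomNumberAnnotator"));
    System.out.println(merge(bar, foo));
  }
}
```

In this sketch `foo` ends up with the class and `fixed-param` of `bar` plus its own `var`; a `fixed-param` entry in `foo.yaml` would replace the inherited value.
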
#### Parameters
Resources on a descriptor are configured using named parameters; any dash-separated string is a valid parameter name except for the reserved keywords: inherit, options, class and pipeline. The actual value of a parameter is either a Java primitive wrapper (Integer, Float, Long, Double, Boolean) or a String. For nested resources, compound parameters are passed as Strings that are further parsed within the resource. For example, passing a RegEx pattern as a String parameter to a RoomAnnotator will look like:
```yaml
class: annotators.RoomAnnotator

pattern: \\b[0-4]\\d[0-2]\\d\\d\\b
```
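
The naming and typing rules above can be expressed as a small check. This is a minimal sketch, assuming only what the paragraph states: the reserved keywords and the list of allowed value types. The class `ParameterSketch` and its methods are hypothetical and do not correspond to any ECD class.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper that mirrors the parameter rules stated in the text. */
class ParameterSketch {

  /** Keywords the text lists as reserved, so they cannot be used as parameter names. */
  static final Set<String> RESERVED = Set.of("inherit", "options", "class", "pipeline");

  static boolean isReserved(String name) {
    return RESERVED.contains(name);
  }

  /** True if the value is one of the primitive wrappers named in the text, or a String. */
  static boolean isSupportedValue(Object value) {
    return value instanceof Integer
        || value instanceof Float
        || value instanceof Long
        || value instanceof Double
        || value instanceof Boolean
        || value instanceof String;
  }

  public static void main(String[] args) {
    System.out.println("class reserved?   " + isReserved("class"));
    System.out.println("pattern reserved? " + isReserved("pattern"));
    System.out.println("String ok?        " + isSupportedValue("a regex as text"));
    // A list is a compound value; per the text it would be passed as a String instead.
    System.out.println("List ok?          " + isSupportedValue(List.of(1, 2)));
  }
}
```
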
#### Cross-opts
Combinatorial parameters are specified using the cross-opts mapping, declaring the desired values as elements of a list. For example, consider the following annotator descriptor:

```yaml
class: annotators.RoomAnnotator

foo: bar

cross-opts:
    parameter-a: [value100,value200]
    parameter-b: [value300,value400] 
```

The configuration above will result in the 2x2 cross-product of configurations of the component:
```text
[foo:bar, parameter-a: value100, parameter-b: value300] 
[foo:bar, parameter-a: value200, parameter-b: value300]
[foo:bar, parameter-a: value100, parameter-b: value400] 
[foo:bar, parameter-a: value200, parameter-b: value400]
```
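
The expansion can be sketched as a plain cross-product over the option lists. This is an illustrative sketch only: `CrossOptsSketch` and `expand` are hypothetical names, values are treated as Strings for simplicity, and the order in which the real ECD driver enumerates configurations is not specified here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of cross-opts expansion; not the actual ECD driver. */
class CrossOptsSketch {

  /** Returns one configuration per combination of cross-opts values, each including the fixed parameters. */
  static List<Map<String, String>> expand(
      Map<String, String> fixed, Map<String, List<String>> crossOpts) {
    List<Map<String, String>> configs = new ArrayList<>();
    configs.add(new LinkedHashMap<>(fixed)); // start with the fixed parameters only
    for (Map.Entry<String, List<String>> option : crossOpts.entrySet()) {
      List<Map<String, String>> next = new ArrayList<>();
      for (String value : option.getValue()) {
        for (Map<String, String> partial : configs) {
          // Copy each partial configuration and add one value of the current option.
          Map<String, String> extended = new LinkedHashMap<>(partial);
          extended.put(option.getKey(), value);
          next.add(extended);
        }
      }
      configs = next;
    }
    return configs;
  }

  public static void main(String[] args) {
    Map<String, String> fixed = new LinkedHashMap<>();
    fixed.put("foo", "bar");

    Map<String, List<String>> crossOpts = new LinkedHashMap<>();
    crossOpts.put("parameter-a", List.of("value100", "value200"));
    crossOpts.put("parameter-b", List.of("value300", "value400"));

    for (Map<String, String> config : expand(fixed, crossOpts)) {
      System.out.println(config);
    }
  }
}
```

With two options of two values each, the sketch yields four configurations; adding a third option with three values would yield twelve.
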

### CSE pipeline descriptors

#### Configuration
#### Phases

#### Pipeline

#### Options
![In-phase pipeline][inphasepipeline]

![Execution path][executionpathinphasepipeline] 
#### (Optional) Post-processing

## Pipeline code

### Writing collection reader code

### Writing annotator code

### Writing cas consumer code

## Creating a type system


## Examples
These examples are based on the UIMA SDK tutorial (USDK). To see how they were migrated, see the section "Migrating your existing UIMA pipeline to CSE".


### Example 1 (simple RoomNumberAnnotator)
For this example we will use only one type: RoomNumber. You can use the existing `TutorialTypeSystem.xml` under `src/main/resources/types/`.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
  <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
    <name>TutorialTypeSystem</name>
    <description>Type System Definition for the tutorial examples - 
        as of Exercise 1</description>
    <vendor>Apache Software Foundation</vendor>
    <version>1.0</version>
    <types>
      <typeDescription>
        <name>org.apache.uima.tutorial.RoomNumber</name>
        <description></description>
        <supertypeName>uima.tcas.Annotation</supertypeName>
        <features>
          <featureDescription>
            <name>building</name>
            <description>Building containing this room</description>
            <rangeTypeName>uima.cas.String</rangeTypeName>
          </featureDescription>
        </features>
      </typeDescription>
    </types>
  </typeSystemDescription>
```

Simply add the `TutorialTypeSystem.xml` file to your `META-INF/org.uimafit/types.txt` file. Your `types.txt` file should look like:

```text
classpath*:types/TutorialTypeSystem.xml
classpath*:types/SourceDocumentInformation.xml
```

Next, we will write the main YAML descriptor for the example. Notice that we are using the same collection-reader and CAS consumer as before; we have only added the phase RoomNumberAnnotator.

```YAML
configuration: 
  name: oaqa-tutorial
  author: oaqa

collection-reader:
  inherit: collection_reader.filesystem-collection-reader
  InputDirectory: data/
pipeline:
  - inherit: ecd.phase  
    name: RoomNumberAnnotator
    options: |
      - inherit: tutorial.ex1.RoomNumberAnnotator 

  - inherit: cas_consumer.AnnotationPrinter
```

Now we define the descriptor for the RoomNumberAnnotator by just specifying where the class is located.

```yaml
class: org.apache.uima.tutorial.ex1.RoomNumberAnnotator
```
Finally, we write the code for the annotator, taken directly from the UIMA SDK tutorial (USDK).

```java
package org.apache.uima.tutorial.ex1;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;

/**
 * Example annotator that detects room numbers using Java 1.4 regular expressions.
 */
public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern mYorktownPattern = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
  private Pattern mHawthornePattern = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");
  /**
   * @see JCasAnnotator_ImplBase#process(JCas)
   */
  public void process(JCas aJCas) {
    // get document text
    String docText = aJCas.getDocumentText();
    // search for Yorktown room numbers
    Matcher matcher = mYorktownPattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Yorktown");
      annotation.addToIndexes();
    }
    // search for Hawthorne room numbers
    matcher = mHawthornePattern.matcher(docText);
    while (matcher.find()) {
      // found one - create annotation
      RoomNumber annotation = new RoomNumber(aJCas);
      annotation.setBegin(matcher.start());
      annotation.setEnd(matcher.end());
      annotation.setBuilding("Hawthorne");
      annotation.addToIndexes();
    }
  }
}
```
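
Before running the pipeline, it can help to try the two regular expressions on their own, without any UIMA classes. The sketch below copies the two patterns from the annotator; the class `RoomPatternCheck`, the helper `findAll`, and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Stand-alone check of the room number patterns; hypothetical helper, no UIMA required. */
class RoomPatternCheck {
  // Same expressions as mYorktownPattern and mHawthornePattern in the annotator.
  static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
  static final Pattern HAWTHORNE = Pattern.compile("\\b[JG1-4][1-2NS]-[A-Z]\\d\\d\\b");

  /** Collects every substring of the text that the pattern matches, in order of appearance. */
  static List<String> findAll(Pattern pattern, String text) {
    List<String> hits = new ArrayList<>();
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
      hits.add(matcher.group());
    }
    return hits;
  }

  public static void main(String[] args) {
    // Invented sample text containing one room number of each style.
    String text = "The talk is in Yorktown 20-001, the follow-up in Hawthorne GN-K35.";
    System.out.println("Yorktown:  " + findAll(YORKTOWN, text));
    System.out.println("Hawthorne: " + findAll(HAWTHORNE, text));
  }
}
```

In the annotator, each such match becomes a `RoomNumber` annotation whose begin and end offsets come from `matcher.start()` and `matcher.end()`.
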
To run the example, right-click `ex1.launch` and select Run As > ex1.

You should now be getting this output:

### Example 2 (passing parameters to an annotator)

```yaml
#oaqa-tutorial-ex2.yaml
configuration: 
  name: oaqa-tutorial
  author: oaqa

collection-reader:
  inherit: collection_reader.fs-collection-reader
  file: /data/UIMA_Seminars.txt
pipeline:
  - inherit: ecd.phase  
    name: RoomNumberAnnotator
    options: |
      - inherit: tutorial.ex2.RoomNumberAnnotator
  - inherit: cas_consumer.XmiWriterCasConsumer
```

```java
package org.apache.uima.tutorial.ex2;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.tutorial.RoomNumber;
import org.apache.uima.util.Level;

public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  private Pattern[] mPatterns;
  private String[] mLocations;
 
  public void initialize(UimaContext aContext) throws ResourceInitializationException {
    super.initialize(aContext);
    // Get config. parameter values from oaqa-tutorial-ex2.yaml
    String[] patternStrings = (String[]) aContext.getConfigParameterValue("Patterns"); 
    mLocations = (String[]) aContext.getConfigParameterValue("Locations");

    // compile regular expressions
    mPatterns = new Pattern[patternStrings.length];
    for (int i = 0; i < patternStrings.length; i++) {
      mPatterns[i] = Pattern.compile(patternStrings[i]);
    }
  }

  /**
   * @see JCasAnnotator_ImplBase#process(JCas)
   */
  public void process(JCas aJCas) throws AnalysisEngineProcessException {
    // get document text
    String docText = aJCas.getDocumentText();
    // loop over patterns
    for (int i = 0; i < mPatterns.length; i++) {
      Matcher matcher = mPatterns[i].matcher(docText);
      while (matcher.find()) {
        // found one - create annotation
        RoomNumber annotation = new RoomNumber(aJCas);
        annotation.setBegin(matcher.start());
        annotation.setEnd(matcher.end());
        annotation.setBuilding(mLocations[i]);    
        annotation.addToIndexes();
        getContext().getLogger().log(Level.FINEST, "Found: " + annotation);
      }
    }
  }
}
```
To run the example, right-click `ex2.launch` and select Run As > ex2.

You should now be getting this output:

# Migrating your existing UIMA pipeline to CSE

* Declare the type system the UIMA-fit way, in `META-INF/org.uimafit/types.txt` (see slides).

* CAS consumers have to use the UIMA-fit hierarchy of cas_consumerimpl in order to work in the pipeline:
    * `initialize()` becomes `initialize(UimaContext context)`, which takes a `UimaContext` argument.
    * `processCas()` becomes `process(CAS aCas)`.

* The collection reader has to extend `AbstractCollectionReader`:
    * The out-of-the-box UIMA collection readers cannot be used.
    * Implement the method `getNextElement`, which returns the next data element in the pipeline. This replaces `getNext(jcas)`.
    * Override `initialize` and call the superclass implementation.


# FAQ


# Acknowledgements 
Thanks to Zi Yang (@ziy), Elmer Garduno (@elmerg), and OAQA team members! 

# TODO
1. Take UIMA GUI screenshots of the output of each example.
2. How to create your own launch configuration using a yaml descriptor (ECD-Driver).

# References

[1] (Session 12: Configuration Space Exploration (SE class)) - (SE)  
[2] (Elmer's technical report) - (E)  
[3] (Zi's Slides) - (Z)  
[4] Zi HW0 - (hw0)  
[5] Zi HW1 - (hw1)  
[6] Zi HW2 - (hw2)  
[7] [UIMA SDK Tutorial](http://uima.apache.org/d/uimaj-2.4.0/index.html) (USDK)  

[resources]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/resources.png "resources"
[inphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/inphasepipeline.png "inphasepipeline"
[executionpathinphasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/executionpathinphasepipeline.png "executionpathinphasepipeline"
[threephasepipeline]: https://github.com/oaqa/oaqa-tutorial/raw/master/resources/imgs/threephasepipeline.png "threephasepipeline"