James Baker edited this page Oct 9, 2015 · 6 revisions

Below is a sample pipeline configuration that includes the most common annotators. The pipeline reads from a folder called C:\baleen\data and writes to a Mongo database and an Elasticsearch index. The configuration for these two persistence stores is shown for completeness; because it uses the default parameters, it can be omitted. The configuration assumes that the relevant OpenNLP models have been downloaded and placed in the models directory.

mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
- language.OpenNLP
- class: misc.DocumentTypeByLocation
  baseDirectory: C:\baleen\data
- gazetteer.Country
- class: gazetteer.Mongo
  type: Buzzword
  collection: buzzwords
- class: gazetteer.Mongo
  type: Location
  collection: location
- class: gazetteer.Mongo
  type: Organisation
  collection: organisations
- class: gazetteer.Mongo
  type: Person
  collection: people
- regex.Area
- regex.BritishArmyUnits
- regex.Callsign
- regex.Date
- regex.DateTime
- regex.Distance
- regex.Dtg
- regex.Email
- regex.FlightNumber
- regex.Frequency
- regex.IpV4
- regex.LatLon
- regex.Mgrs
- regex.Money
- regex.Nationality
- regex.Osgb
- regex.Postcode
- regex.TaskForce
- regex.Telephone
- regex.Time
- regex.TimeQuantity
- regex.Url
- regex.Volume
- regex.Weight
- class: stats.OpenNLP
  model: models/en-ner-location.bin
  type: Location
- class: stats.OpenNLP
  model: models/en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: models/en-ner-person.bin
  type: Person
- cleaners.MergeAdjacentQuantities
- grammatical.NPTitleEntity
- grammatical.QuantityNPEntity
- grammatical.TOLocationEntity
- cleaners.RemoveLowConfidenceEntities
- cleaners.RemoveNestedDateTimes
- cleaners.RemoveNestedEntities
- cleaners.RemoveNestedLocations
- cleaners.NormalizeWhitespace
- cleaners.CleanDates
- cleaners.CleanPunctuations
- cleaners.AddTimeSpans
- cleaners.CorefCapitalisationAndApostrophe
- cleaners.CorefLocationCoordinate

consumers:
- Mongo
- Elasticsearch
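
Because the configuration above assumes that the OpenNLP models are already in the models directory, it can help to check for them before starting the pipeline. The following is a minimal sketch in Python, not part of Baleen itself. The model paths are copied from the stats.OpenNLP entries above. The function name `missing_models` is hypothetical, and the sketch assumes that the relative paths are resolved against the directory the pipeline is started from. The language.OpenNLP annotator may need further models that this configuration does not list, so they are not checked here.

```python
from pathlib import Path

# Model paths copied from the stats.OpenNLP entries in the sample configuration.
REQUIRED_MODELS = [
    "models/en-ner-location.bin",
    "models/en-ner-organization.bin",
    "models/en-ner-person.bin",
]


def missing_models(base_dir=".", models=REQUIRED_MODELS):
    """Return the model paths that do not exist as files under base_dir.

    base_dir is assumed to be the directory the pipeline is started from.
    """
    base = Path(base_dir)
    return [m for m in models if not (base / m).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing OpenNLP models:")
        for model in missing:
            print("  " + model)
    else:
        print("All OpenNLP models listed in the configuration were found.")
```

Run the script from the directory that contains the models folder. An empty result means all three named entity models referenced by the configuration are present.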

For a full list of all the annotators, collection readers and consumers available, see the Wiki documentation, the included Javadoc, or the REST API.