Sample Pipeline
James Baker edited this page Oct 9, 2015 · 6 revisions
Below is a sample pipeline configuration that includes the most common annotators. The pipeline reads from the folder C:\baleen\data and writes its output to a Mongo database and an Elasticsearch index. The configuration for these two persistence stores is included below, although it is optional because only the default parameters are used. It is assumed that the relevant OpenNLP models have been downloaded and placed in the models directory.
mongo:
  db: baleen
  host: localhost
elasticsearch:
  cluster: elasticsearch
  host: localhost
collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data
annotators:
  - language.OpenNLP
  - class: misc.DocumentTypeByLocation
    baseDirectory: C:\baleen\data
  - gazetteer.Country
  - class: gazetteer.Mongo
    type: Buzzword
    collection: buzzwords
  - class: gazetteer.Mongo
    type: Location
    collection: location
  - class: gazetteer.Mongo
    type: Organisation
    collection: organisations
  - class: gazetteer.Mongo
    type: Person
    collection: people
  - regex.Area
  - regex.BritishArmyUnits
  - regex.Callsign
  - regex.Date
  - regex.DateTime
  - regex.Distance
  - regex.Dtg
  - regex.Email
  - regex.FlightNumber
  - regex.Frequency
  - regex.IpV4
  - regex.LatLon
  - regex.Mgrs
  - regex.Money
  - regex.Nationality
  - regex.Osgb
  - regex.Postcode
  - regex.TaskForce
  - regex.Telephone
  - regex.Time
  - regex.TimeQuantity
  - regex.Url
  - regex.Volume
  - regex.Weight
  - class: stats.OpenNLP
    model: models/en-ner-location.bin
    type: Location
  - class: stats.OpenNLP
    model: models/en-ner-organization.bin
    type: Organisation
  - class: stats.OpenNLP
    model: models/en-ner-person.bin
    type: Person
  - cleaners.MergeAdjacentQuantities
  - grammatical.NPTitleEntity
  - grammatical.QuantityNPEntity
  - grammatical.TOLocationEntity
  - cleaners.RemoveLowConfidenceEntities
  - cleaners.RemoveNestedDateTimes
  - cleaners.RemoveNestedEntities
  - cleaners.RemoveNestedLocations
  - cleaners.NormalizeWhitespace
  - cleaners.CleanDates
  - cleaners.CleanPunctuations
  - cleaners.AddTimeSpans
  - cleaners.CorefCapitalisationAndApostrophe
  - cleaners.CorefLocationCoordinate
consumers:
  - Mongo
  - Elasticsearch
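Because the sample uses only the default Mongo and Elasticsearch parameters, the two persistence blocks can be left out. The fragment below is a minimal sketch of a cut-down variant of the configuration above. The choice of which annotators to keep is an illustrative assumption, not a recommended set, and it has not been tested against a particular Baleen release.

```yaml
# Sketch only: a reduced variant of the sample pipeline.
# The mongo: and elasticsearch: blocks are omitted, so the consumers
# are assumed to fall back to their default connection parameters.
collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
  # Annotator names are taken from the sample above; this subset is arbitrary.
  - language.OpenNLP
  - regex.Email
  - regex.Url
  - cleaners.RemoveNestedEntities

consumers:
  - Mongo
  - Elasticsearch
```

Starting from a small pipeline like this and adding annotators one at a time can make it easier to see which annotator produced a given entity.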
For a full list of all the annotators, collection readers and consumers available, see the Wiki documentation, the included Javadoc, or the REST API.