Sample Pipeline
Below is a sample pipeline configuration that includes some of the most common annotators. The pipeline reads from a folder called C:\baleen\data and writes its output to a Mongo database and an Elasticsearch index. The configuration for these two persistence stores is included below, but because it only uses the default parameters it is optional. It is assumed that the relevant OpenNLP models have been downloaded and placed in the models directory.
This pipeline should be considered a sample only, and it is strongly advised that pipelines are tailored to individual corpora to achieve the best results. This can be done either by manually creating a pipeline configuration or by using the Plankton tool that is built into Baleen.
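As a rough illustration of tailoring, a corpus where only contact details are of interest might need just a handful of the annotators from the full sample further down. The selection here is an assumption made for the sake of the example, not a recommended set. Every class name is taken from the full sample.

```yaml
# Hypothetical trimmed pipeline: the choice of annotators is illustrative only.
collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
- regex.Email
- regex.Telephone
- regex.Url
- regex.SocialMediaUsername
- cleaners.RemoveOverlappingEntities

consumers:
- Mongo
```

A shorter annotator list like this is usually easier to evaluate against a corpus than the full sample, and annotators can be added back one at a time as needed.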
This pipeline is written for Baleen 2.4. Be aware that Baleen 2.4 orders the pipeline automatically, so the order in which the annotators are listed below is not necessarily the order in which they will run.
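To make the ordering point concrete, the fragment below lists a cleaner before the annotators whose output it would tidy up. Under the automatic ordering described above, the listing order should not need to match the run order. The actual run order is decided by Baleen and is not asserted here.

```yaml
annotators:
# Listed first, although it operates on entities created by other annotators.
- cleaners.RemoveNestedEntities
# Listed afterwards, although these create the entities in the first place.
- regex.Email
- regex.Url
```

The full sample configuration follows.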
mongo:
  db: baleen
  host: localhost

elasticsearch:
  cluster: elasticsearch
  host: localhost

collectionreader:
  class: FolderReader
  folders:
  - C:\baleen\data

annotators:
- cleaners.AddGenderToPerson
- cleaners.AddTitleToPerson
- cleaners.CleanPunctuation
- cleaners.CleanTemporal
- cleaners.CollapseLocations
- cleaners.CorefBrackets
- cleaners.CorefCapitalisationAndApostrophe
- cleaners.CurrencyDetection
- cleaners.EntityInitials
- cleaners.ExpandLocationToDescription
- cleaners.MergeAdjacent
- cleaners.MergeAdjacentQuantities
- cleaners.MergeNationalityIntoEntity
- cleaners.NaiveMergeRelations
- cleaners.NormalizeOSGB
- cleaners.NormalizeTemporal
- cleaners.NormalizeWhitespace
- cleaners.ReferentToEntity
- cleaners.RelationTypeFilter
- cleaners.RemoveLowConfidenceEntities
- cleaners.RemoveNestedEntities
- cleaners.RemoveNestedLocations
- cleaners.RemoveOverlappingEntities
- cleaners.SplitBrackets
- cleaners.Surname
- coreference.SieveCoreference
- gazetteer.Country
- gazetteer.File
- class: gazetteer.Mongo
  type: Buzzword
  collection: buzzwords
- class: gazetteer.Mongo
  type: Location
  collection: location
- class: gazetteer.Mongo
  type: Organisation
  collection: organisations
- class: gazetteer.Mongo
  type: Person
  collection: people
- grammatical.NPAtCoordinate
- grammatical.NPElement
- grammatical.NPLocation
- grammatical.NPOrganisation
- grammatical.NPTitleEntity
- grammatical.QuantityNPEntity
- grammatical.TOLocationEntity
- language.OpenNLP
- class: misc.DocumentTypeByLocation
  baseDirectory: C:\baleen\data
- misc.GenericMilitaryPlatform
- misc.GenericVehicle
- misc.GenericWeapon
- misc.MentionedAgain
- misc.NationalityToLocation
- misc.OrganisationPersonRole
- misc.People
- misc.Pronouns
- regex.Area
- regex.BritishArmyUnits
- regex.Callsign
- regex.CasRegistryNumber
- regex.Date
- regex.DateTime
- regex.Distance
- regex.DocumentNumber
- regex.Dtg
- regex.Email
- regex.FlightNumber
- regex.Frequency
- regex.Hms
- regex.IpV4
- regex.LatLon
- regex.Mgrs
- regex.Money
- regex.Nationality
- regex.Osgb
- regex.Postcode
- regex.RelativeDate
- regex.SocialMediaUsername
- regex.TaskForce
- regex.Telephone
- regex.Time
- regex.TimeQuantity
- regex.USTelephone
- regex.UnqualifiedDate
- regex.Url
- regex.Volume
- regex.Weight
- class: relations.NPVNP
  onlyExisting: true
- stats.DocumentLanguage
- class: stats.OpenNLP
  model: models/en-ner-location.bin
  type: Location
- class: stats.OpenNLP
  model: models/en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: models/en-ner-person.bin
  type: Person

consumers:
- Mongo
- Elasticsearch
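The mongo and elasticsearch blocks at the top of the sample only restate default values, so they start to matter when the stores are not running locally. A minimal sketch of that case is shown below. It reuses the same keys as the sample, and the host names, database name and cluster name are placeholders invented for this example.

```yaml
# Hypothetical values: replace with the details of your own deployment.
mongo:
  db: baleen_reports
  host: mongo.example.org

elasticsearch:
  cluster: reports-cluster
  host: search.example.org
```

The rest of the pipeline (collection reader, annotators and consumers) would stay as it is in the sample.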
For a full list of all the annotators, collection readers and consumers available, see the Wiki documentation, the included Javadoc, or the REST API.