Triage
Baleen 2.6 contains a number of tools to extend Baleen's functionality for document triage.
The Mallet project has been integrated for document classification and topic model generation.
Functionality for document summarisation has been added with the triage.WordDistributionDocumentSummary annotator.
Functionality for document prioritisation is introduced with the triage.ShannonEntropyAnnotator, which uses Shannon entropy as a measure of information content to prioritise documents with more information in fewer words.
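To illustrate the idea behind entropy-based prioritisation, here is a minimal Python sketch that computes Shannon entropy over the word distribution of a text. It is not the code used by triage.ShannonEntropyAnnotator: the function name, the whitespace tokenisation and the choice of words (rather than characters) as the unit are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy, in bits per word, of the word distribution in text.

    Hypothetical helper for illustration only; not part of Baleen.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = -sum(p * log2(p)) over the relative frequency p of each distinct word
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = "the cat the cat the cat the cat"
varied = "the cat sat on a warm sunny mat"

print(shannon_entropy(repetitive))  # two equally likely words
print(shannon_entropy(varied))      # eight distinct words
```

Both example texts contain eight words, but the second uses eight distinct words and so scores higher, which is the sense in which a document can carry "more information in fewer words".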
To use the categorisation annotators, a model must first be created. Three new Baleen jobs have been created for this purpose:
- triage.TopicModelTrainer, for when there are no prior labels.
- triage.MaxEntClassifierTrainer, for when the classes are provided by the user and identified by a list of keywords that suggest the class.
- triage.MalletClassifierTrainer, to be used when there is existing classification data to use for training.
The trainer tasks read their corpus from Mongo. The collection and content field are configurable, but by default the trainers use the standard Mongo consumer's 'documents' collection and 'content' field. A corpus can therefore be produced simply by running a Baleen pipeline with an appropriate collection reader and the Mongo consumer; no other annotators are required.
The topic model can be generated using the following Baleen job, where the job is configured in topictraining.yml:

    java -jar baleen.jar -j topictraining.yml

    mongo:
      db: baleen
      host: localhost

    tasks:
      - class: triage.TopicModelTrainer
        modelFile: ./models/topic.mallet
        numTopics: 10
        # numIterations: 1000
        # numThreads: 2
        # collection: documents
        # field: content

The maximum entropy model is trained by providing user-defined classes along with a set of keywords that define them. For example, "positive" and "negative" classes could be trained using the maxent.yml job, based on the text file labels.txt, which contains each label followed by a number of keywords defining it.

    mongo:
      db: baleen
      host: localhost

    tasks:
      - class: triage.MaxEntClassifierTrainer
        labelsFile: labels.txt
        modelFile: ./models/maxent.mallet
        # numIterations: 1000
        # variance: 1.0
        # collection: documents
        # field: content

    positive good love amazing best awesome
    negative not can't enemy horrible ain't

Clearly this is a very brief labels file and so will not produce a robust model.
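Before training, it can be useful to check how many keywords each label has. The following Python sketch parses text in the layout of the labels.txt example, assuming one label per line with the label as the first whitespace-separated token and the keywords after it. That layout is inferred from the example above rather than from a documented format, and parse_labels is a hypothetical helper, not part of Baleen.

```python
def parse_labels(text: str) -> dict[str, list[str]]:
    """Map each label (first token on a line) to its keywords (remaining tokens).

    Assumes one label per line; this layout is an assumption, not a documented format.
    """
    labels = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        labels[tokens[0]] = tokens[1:]
    return labels


example = """positive good love amazing best awesome
negative not can't enemy horrible ain't
"""

# Report the number of keywords per label so that thin labels stand out
for label, keywords in parse_labels(example).items():
    print(label, len(keywords), keywords)
```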
The model can be trained as a Baleen job:

    java -jar baleen.jar -j maxent.yml

If you have labelled data, there are further Mallet classifiers that can be trained on it. The trainer allows multiple classifiers to be trained in the same job, and can output an assessment of accuracy based on randomly partitioning the data into training and testing sets. The best performing model can then be taken forward. We have added an example job to train multiple classifiers, but labelled data must be loaded into Mongo to use it. The collection and label field within Mongo are configurable, but default to 'documents' and 'labels' respectively.
The following job can be used to train the model on this data:

    java -jar baleen.jar -j classify.yml

It is configured by the classify.yml pipeline file:

    mongo:
      db: baleen
      host: localhost

    tasks:
      - class: triage.MalletClassifierTrainer
        trainer:
          - RandomAssignmentTrainer
          - NaiveBayes
          - DecisionTree,maxDepth=10
          - DecisionTree,maxDepth=20
          - DecisionTree,maxDepth=40
          - BalancedWinnow
          - MaxEnt
        forTesting: 0.2
        resultFile: ./models/classifyTrials.csv
        modelFile: ./models/classify
        # collection: documents
        # labelField: label

Note that this trainer produces a number of Mallet models whose file names are prefixed with "classify". The pipeline below assumes that one of them has been selected and renamed "classify.mallet".
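The forTesting: 0.2 setting in classify.yml presumably corresponds to the random train/test partition described earlier, with 20% of the labelled documents held out for assessing accuracy. The Python sketch below shows what such a hold-out split looks like in general. It is an illustration under that assumption, not the trainer's own code, and split_for_testing is a hypothetical name.

```python
import random


def split_for_testing(items, for_testing=0.2, seed=None):
    """Randomly partition items into (training, testing) lists.

    for_testing is the fraction of items held out for testing.
    Hypothetical helper for illustration only; not part of Baleen or Mallet.
    """
    rng = random.Random(seed)
    shuffled = list(items)  # copy so the caller's sequence is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * for_testing))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"doc-{i}" for i in range(10)]
train, test = split_for_testing(docs, for_testing=0.2, seed=42)
print(len(train), len(test))
```

Because every trainer listed in the job is assessed on held-out documents, the accuracy figures in the result file can be compared across classifiers when choosing which model to take forward.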
Given suitably trained models, the full set of triage annotators can be run using the following Baleen pipeline:

    mongo:
      db: baleen
      host: localhost

    collectionreader:
      - class: FolderReader
        folders:
          - ./files/

    annotators:
      - language.OpenNLP
      - class: triage.CommonKeywords
        stemming: ENGLISH
      - class: triage.RakeKeywords
        stemming: ENGLISH
      - class: triage.ShannonEntropyAnnotator
      - class: triage.WordDistributionDocumentSummary
        # summaryCharacterCount: 100
      - class: triage.TopicModel
        modelFile: ./models/topic.mallet
      - class: triage.MalletClassifier
        modelFile: ./models/classify.mallet
      - class: triage.MalletClassifier
        modelFile: ./models/maxent.mallet

    consumers:
      - Mongo