Add Precision, Recall, F-measure, Confusion Matrix to Taggers #2862
Hello!
Pull request overview

- New evaluation methods for the `ConfusionMatrix` class and for the taggers in the `tag` package.
- Updated `tag.doctest` and `metrics.doctest` that show these additions.

Method overview
Every Tagger in NLTK subclasses the `TaggerI` interface, which used to provide the following methods:

- `tag(tokens)`
- `tag_sents(sentences)`
- `evaluate(gold)`

After this PR, it also provides:

- `confusion(gold)`
- `recall(gold)`
- `precision(gold)`
- `f_measure(gold, alpha=0.5)`
- `evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)`
Beyond that, `nltk/metrics/confusionmatrix` provides a `ConfusionMatrix` class, to which this PR adds the following methods:

- `recall(value)`
- `precision(value)`
- `f_measure(value, alpha=0.5)`
- `evaluate(alpha=0.5, truncate=None, sort_by_count=False)`
Reasoning
In my experience of working with the NLTK taggers, the evaluation that can easily be done is very minimal. You can call `tagger.evaluate(gold)` to compute an accuracy, but that gives no information on which tokens are actually being tagged correctly, or whether we're over- or under-fitting certain tags. Accuracy alone simply isn't enough.

So, I went looking for recall, precision and f-measures in the codebase. We've implemented these in `nltk/metrics/scores.py`, but they're very much written for IR tasks. They take sets, and use set intersections to compute the values. This doesn't work for taggers, as tags need to be able to occur multiple times without being removed due to the defining set property.
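To make that concrete, here is a small sketch (the tag sequences are made up for this example) of how the set-based functions in `nltk/metrics/scores.py` lose information as soon as a tag occurs more than once:

```python
from nltk.metrics.scores import precision, recall

# Two quite different tag sequences...
reference = "NN NN NN VB".split()
test = "NN VB VB VB".split()

# ...collapse to the same two-element sets, so the set-based metrics
# report perfect scores even though 2 of the 4 tags are wrong.
print(precision(set(reference), set(test)))  # 1.0
print(recall(set(reference), set(test)))     # 1.0
```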
Changes for ConfusionMatrix
For `ConfusionMatrix`, `recall(value)`, `precision(value)` and `f_measure(value, alpha=0.5)` are all very similar, and they simply return the float for the corresponding metric, for that value. E.g., in the following `ConfusionMatrix`:
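For instance, a minimal sketch with made-up reference and predicted tags, chosen so that `VB` has one true positive, one false positive and no false negatives:

```python
from nltk.metrics import ConfusionMatrix

# Gold tags vs. predicted tags: 'VB' is predicted correctly once
# and incorrectly once (the gold 'NN' is tagged as 'VB').
reference = "VB NN".split()
test = "VB VB".split()
cm = ConfusionMatrix(reference, test)
print(cm)
```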
The recall for `VB` will be 1.0 (True positive is 1, False negative is 0), and the precision for `VB` will be 0.5 (True positive is 1, False positive is 1).

Furthermore, the new method `evaluate` will output all of this information concisely in a tabular format:
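Continuing the sketch above, and assuming `evaluate()` returns the formatted table as a string:

```python
# Tabulate precision, recall and f-measure for every value in the matrix.
print(cm.evaluate())

# The individual metrics are also available as plain floats:
print(cm.recall("VB"))     # 1.0
print(cm.precision("VB"))  # 0.5
```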
This method `evaluate` uses `recall`, `precision` and `f_measure` internally. These 3 methods can also be normally called to get the float result.

Changes for TaggerI
These new changes for `ConfusionMatrix` have very interesting consequences, for example for taggers. I've introduced methods for `TaggerI`, which mostly speak for themselves, especially when I provide some examples:

Set up a (pretrained) tagger
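For instance, a minimal setup along these lines (it assumes the `averaged_perceptron_tagger` and `treebank` data packages are installed; the 10-sentence gold sample mirrors the description below):

```python
from nltk.corpus import treebank
from nltk.tag import PerceptronTagger

# The default pre-trained English tagger, plus a small gold-standard sample.
tagger = PerceptronTagger()
gold = treebank.tagged_sents()[:10]
```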
Evaluate with accuracy
This method already existed!
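For example, using the `tagger` and `gold` sample from the setup sketch above:

```python
# Overall accuracy over the gold sample, as a single float.
print(tagger.evaluate(gold))
```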
Evaluate with recall
This method, and the next two, return the per-tag metrics, so developers have a machine-readable way to use these metrics however they see fit.
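A sketch, assuming the return value is a mapping from each tag to its recall score:

```python
# Per-tag recall, e.g. to feed into further analysis or reporting.
tag_recall = tagger.recall(gold)
print(tag_recall)
```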
Evaluate with precision
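Likewise a per-tag mapping, under the same assumption as above:

```python
print(tagger.precision(gold))
```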
Evaluate with f_measure
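And the same again for the F-measure, where `alpha` weights precision against recall (the default `alpha=0.5` gives the balanced harmonic mean):

```python
print(tagger.f_measure(gold, alpha=0.5))
```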
Evaluate with evaluate_per_tag
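A sketch, assuming the method returns the tabulated per-tag metrics as a string:

```python
# Tabulated per-tag precision, recall and f-measure,
# optionally sorted by tag frequency with sort_by_count=True.
print(tagger.evaluate_per_tag(gold))
```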
This method provides the human-readable form of the `recall`, `precision` and `f_measure` methods, allowing developers of taggers to inspect where their taggers are still performing suboptimally. Immediately upon looking at this output, you can see that the default NLTK pre-trained `PerceptronTagger` has a really high recall for `CD`, while it has a low precision there. This indicates that too many tokens are tagged as `CD`, and is something that the developer could look into.

This is only for 10 sentences, but there's a lot of interesting information to be gleaned when you use the entire treebank section that NLTK has access to. (For example, `JJ` has a precision of 0.6163 with a recall of 0.9131!)

Evaluate with confusion
This method goes perfectly with the previous one: A mismatch in precision/recall doesn't always give all the information that a developer would need to find out what truly is the issue at hand. Being able to quickly show a confusion matrix like this can ease understanding significantly.
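For example, continuing the sketch above (assuming `confusion(gold)` returns a `ConfusionMatrix`, whose string form is the usual table):

```python
# Rows are the gold ("reference") tags, columns the predicted tags.
print(tagger.confusion(gold))
```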
I would recommend having a look at the updated `nltk/test/tag.doctest`, which shows some more examples of how these methods can be very useful in the development process of taggers.

Implementation details
The implementation on the `ConfusionMatrix` side is very simple: it's a case of recognising TP, FP, FN and TN, and using them to compute the precision, recall, f_measure and the evaluation table.

And for the `TaggerI` side it's also fairly simple: the `gold` parameter (i.e. the known correct list of tagged sentences) is used as the "reference", while the sentences from this `gold` are tagged by the tagger to produce the "predicted" tags. Together, these are the two dimensions for a `ConfusionMatrix`. Then, `recall`, `precision`, `f_measure`, `confusion` and `evaluate_per_tag` all simply use the `ConfusionMatrix` methods.

The only bit of implementation magic is that the `confusion(gold)` method delegates to another method, `self._confusion_cached`, after first converting `gold` to a tuple of tuples rather than a list of lists. This is because tuples are hashable, while lists aren't. So, with the input to `self._confusion_cached` being a tuple, we can (as the name suggests) cache this method call. I've set the maxsize of the cache to 1, so only one confusion matrix is ever cached. That should most likely be fine.
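As an illustrative sketch of that caching pattern (not the actual NLTK source; the class and helper names here are hypothetical), assuming the tagger re-tags the gold sentences to build the matrix:

```python
from functools import lru_cache

from nltk.metrics import ConfusionMatrix


class CachedConfusionSketch:
    """Hypothetical helper showing the caching idea described above."""

    def __init__(self, tagger):
        self.tagger = tagger

    def confusion(self, gold):
        # Lists aren't hashable, so convert gold to a tuple of tuples first.
        return self._confusion_cached(tuple(tuple(sent) for sent in gold))

    @lru_cache(maxsize=1)
    def _confusion_cached(self, gold):
        # Re-tag the gold sentences to get the "predicted" tags...
        tagged_sents = self.tagger.tag_sents([[word for (word, _) in sent] for sent in gold])
        # ...and line them up against the "reference" (gold) tags.
        reference = [tag for sent in gold for (_, tag) in sent]
        predicted = [tag for sent in tagged_sents for (_, tag) in sent]
        return ConfusionMatrix(reference, predicted)
```

Usage would simply be `CachedConfusionSketch(tagger).confusion(gold)`; repeated calls with the same `gold` reuse the cached matrix.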
In short, despite the fact that every method calls `self.confusion()`, the tagging and the setting up of the `ConfusionMatrix` are only done once.

Doctest changes
As you might be able to see in the PR, I've added `# doctest: +NORMALIZE_WHITESPACE` in a few places. Previously, doctest would fail here, as the predicted output lists were spread out over multiple lines.
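As a generic illustration of what the directive allows (not an excerpt from the PR), the expected output in a doctest can be wrapped across several lines and still match the single-line actual output:

```python
>>> [1, 2, 3]  # doctest: +NORMALIZE_WHITESPACE
[1,
 2,
 3]
```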
Beyond that, `nltk/test/metrics.doctest` has 3 more tests, and `nltk/test/tag.doctest` has been improved significantly.