Add Precision, Recall, F-measure, Confusion Matrix to Taggers #2862
Hello!
Pull request overview

- New evaluation methods for the `ConfusionMatrix` class and for the taggers in the `tag` package.
- Updated `tag.doctest` and `metrics.doctest` that show these additions.

Method overview
Every Tagger in NLTK subclasses the `TaggerI` interface, which used to provide the following methods:

- `tag(tokens)`
- `tag_sents(sentences)`
- `evaluate(gold)`

After this PR, it also provides:

- `confusion(gold)`
- `recall(gold)`
- `precision(gold)`
- `f_measure(gold, alpha=0.5)`
- `evaluate_per_tag(gold, alpha=0.5, truncate=None, sort_by_count=False)`
Beyond that, `nltk/metrics/confusionmatrix` provides a `ConfusionMatrix` class, to which this PR adds the following methods:

- `recall(value)`
- `precision(value)`
- `f_measure(value, alpha=0.5)`
- `evaluate(alpha=0.5, truncate=None, sort_by_count=False)`
Reasoning
In my experience of working with the NLTK taggers, the evaluation that can easily be done is very minimal. You can call `tagger.evaluate(gold)` to compute an accuracy, but that gives no information on which tokens are actually being tagged correctly, or whether we're over- or under-fitting certain tags. Accuracy alone simply isn't enough.

So, I went looking for recall, precision and f-measures in the codebase. We've implemented these in `nltk/metrics/scores.py`, but they're very much written for IR tasks. They take sets, and use set intersections to compute the values. This doesn't work for taggers, as tags need to be able to occur multiple times without being removed due to the defining set property.
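To make that concrete, here is a small sketch (the tag sequences are made up for this example) of how the set-based functions in `nltk/metrics/scores.py` lose information as soon as a tag occurs more than once:

```python
from nltk.metrics.scores import precision, recall

# Two quite different tag sequences...
reference = "NN NN NN VB".split()
test = "NN VB VB VB".split()

# ...collapse to the same two-element sets, so the set-based metrics
# report perfect scores even though 2 of the 4 tags are wrong.
print(precision(set(reference), set(test)))  # 1.0
print(recall(set(reference), set(test)))     # 1.0
```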
Changes for ConfusionMatrix
For `ConfusionMatrix`, `recall(value)`, `precision(value)` and `f_measure(value, alpha=0.5)` are all very similar, and they simply return the float for the corresponding metric, for that value. E.g., in the following `ConfusionMatrix`:
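For instance, a minimal sketch with made-up reference and predicted tags, chosen so that `VB` has one true positive, one false positive and no false negatives:

```python
from nltk.metrics import ConfusionMatrix

# Gold tags vs. predicted tags: 'VB' is predicted correctly once
# and incorrectly once (the gold 'NN' is tagged as 'VB').
reference = "VB NN".split()
test = "VB VB".split()
cm = ConfusionMatrix(reference, test)
print(cm)
```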
The recall for `VB` will be 1.0 (True positive is 1, False negative is 0), and the precision for `VB` will be 0.5 (True positive is 1, False positive is 1).

Furthermore, the new method `evaluate` will output all of this information concisely in a tabular format:
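Continuing the sketch above, and assuming `evaluate()` returns the formatted table as a string:

```python
# Tabulate precision, recall and f-measure for every value in the matrix.
print(cm.evaluate())

# The individual metrics are also available as plain floats:
print(cm.recall("VB"))     # 1.0
print(cm.precision("VB"))  # 0.5
```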
This method `evaluate` uses `recall`, `precision` and `f_measure` internally. These 3 methods can also be normally called to get the float result.

Changes for TaggerI
These new changes for `ConfusionMatrix` have very interesting consequences, for example for taggers. I've introduced methods for `TaggerI`, which mostly speak for themselves, especially when I provide some examples:

Set up a (pretrained) tagger
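For instance, a minimal setup along these lines (it assumes the `averaged_perceptron_tagger` and `treebank` data packages are installed; the 10-sentence gold sample mirrors the description below):

```python
from nltk.corpus import treebank
from nltk.tag import PerceptronTagger

# The default pre-trained English tagger, plus a small gold-standard sample.
tagger = PerceptronTagger()
gold = treebank.tagged_sents()[:10]
```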
Evaluate with accuracy
This method already existed!
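For example, using the `tagger` and `gold` sample from the setup sketch above:

```python
# Overall accuracy over the gold sample, as a single float.
print(tagger.evaluate(gold))
```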
Evaluate with recall
This method, and the next two, return the per-tag metrics, so developers have a machine-readable way to use these metrics however they see fit.
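A sketch, assuming the return value is a mapping from each tag to its recall score:

```python
# Per-tag recall, e.g. to feed into further analysis or reporting.
tag_recall = tagger.recall(gold)
print(tag_recall)
```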
Evaluate with precision
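Likewise a per-tag mapping, under the same assumption as above:

```python
print(tagger.precision(gold))
```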
Evaluate with f_measure
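And the same again for the F-measure, where `alpha` weights precision against recall (the default `alpha=0.5` gives the balanced harmonic mean):

```python
print(tagger.f_measure(gold, alpha=0.5))
```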
Evaluate with evaluate_per_tag
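A sketch, assuming the method returns the tabulated per-tag metrics as a string:

```python
# Tabulated per-tag precision, recall and f-measure,
# optionally sorted by tag frequency with sort_by_count=True.
print(tagger.evaluate_per_tag(gold))
```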
This method provides the human-readable form of the `recall`, `precision` and `f_measure` methods, allowing developers of taggers to inspect where their taggers are still performing suboptimally. Immediately upon looking at this output, you can see that the default NLTK pre-trained `PerceptronTagger` has a really high recall for `CD`, while it has a low precision there. This indicates that too many tokens are tagged as `CD`, and is something that the developer could look into.

This is only for 10 sentences, but there's a lot of interesting information to be gleaned when you use the entire treebank section that NLTK has access to. (For example, `JJ` has a precision of 0.6163 with a recall of 0.9131!)

Evaluate with confusion
This method goes perfectly with the previous one: A mismatch in precision/recall doesn't always give all the information that a developer would need to find out what truly is the issue at hand. Being able to quickly show a confusion matrix like this can ease understanding significantly.
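For example, continuing the sketch above (assuming `confusion(gold)` returns a `ConfusionMatrix`, whose string form is the usual table):

```python
# Rows are the gold ("reference") tags, columns the predicted tags.
print(tagger.confusion(gold))
```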
I would recommend having a look at the updated `nltk/test/tag.doctest`, which shows some more examples of how these methods can be very useful in the development process of taggers.

Implementation details
The implementation on the `ConfusionMatrix` side is very simple: it's a case of recognising TP, FP, FN and TN, and using them to compute the precision, recall, f_measure and the evaluation table.

And for the `TaggerI` side it's also fairly simple: the `gold` parameter (i.e. the known correct list of tagged sentences) is used as the "reference", while the sentences from this `gold` are tagged by the tagger to produce the "predicted" tags. Together, these are the two dimensions for a `ConfusionMatrix`. Then, `recall`, `precision`, `f_measure`, `confusion` and `evaluate_per_tag` all simply use the `ConfusionMatrix` methods.

The only bit of implementation magic is that the `confusion(gold)` method delegates to another method, `self._confusion_cached`, after first converting `gold` to a tuple of tuples rather than a list of lists. This is because tuples are hashable, while lists aren't. So, with the input to `self._confusion_cached` being a tuple, we can (as the name suggests) cache this method call. I've set the maxsize of the cache to 1, so only one confusion matrix is ever cached. That should most likely be fine.
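As an illustrative sketch of that caching pattern (not the actual NLTK source; the class and helper names here are hypothetical), assuming the tagger re-tags the gold sentences to build the matrix:

```python
from functools import lru_cache

from nltk.metrics import ConfusionMatrix


class CachedConfusionSketch:
    """Hypothetical helper showing the caching idea described above."""

    def __init__(self, tagger):
        self.tagger = tagger

    def confusion(self, gold):
        # Lists aren't hashable, so convert gold to a tuple of tuples first.
        return self._confusion_cached(tuple(tuple(sent) for sent in gold))

    @lru_cache(maxsize=1)
    def _confusion_cached(self, gold):
        # Re-tag the gold sentences to get the "predicted" tags...
        tagged_sents = self.tagger.tag_sents([[word for (word, _) in sent] for sent in gold])
        # ...and line them up against the "reference" (gold) tags.
        reference = [tag for sent in gold for (_, tag) in sent]
        predicted = [tag for sent in tagged_sents for (_, tag) in sent]
        return ConfusionMatrix(reference, predicted)
```

Usage would simply be `CachedConfusionSketch(tagger).confusion(gold)`; repeated calls with the same `gold` reuse the cached matrix.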
In short, despite the fact that every method calls `self.confusion()`, the tagging and the setting up of the `ConfusionMatrix` are only done once.

Doctest changes
As you might be able to see in the PR, I've added `# doctest: +NORMALIZE_WHITESPACE` in a few places. Previously, doctest would fail here, as the predicted output lists were spread out over multiple lines.
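As a generic illustration of what the directive allows (not an excerpt from the PR), the expected output in a doctest can be wrapped across several lines and still match the single-line actual output:

```python
>>> [1, 2, 3]  # doctest: +NORMALIZE_WHITESPACE
[1,
 2,
 3]
```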
Beyond that, `nltk/test/metrics.doctest` has 3 more tests, and `nltk/test/tag.doctest` has been improved significantly.