Fix logging of conversation level core metrics. #8030

Merged

kedz merged 18 commits into main from improve_test_core_logging on Mar 5, 2021

Conversation

Contributor

@kedz kedz commented Feb 23, 2021

Fix logging of conversation level core metrics to only compute and log conversation level accuracy. Formerly, F1-score and precision were also printed to the console, but upon investigation these metrics were not very meaningful (precision was always 1.0 if any story was completely correct and 0 otherwise). Additionally, the in-data fraction was printed twice. The new output of rasa test looks like this:

2021-02-23 15:54:43 INFO     rasa.core.test  - Finished collecting predictions.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          6 / 7
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.857
2021-02-23 15:54:43 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          33 / 35
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.960
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        0.970
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.961
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886

Formerly the output looked like this:

2021-02-23 15:54:43 INFO     rasa.core.test  - Finished collecting predictions.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          6 / 7
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.923   # <-- harmonic mean of 1 and recall 
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        1.0     # <-- this is always 1 if correct > 0
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.857
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886   # <-- printed again below in ACTION level results
2021-02-23 15:54:43 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          33 / 35
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.960
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        0.970
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.961
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886
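
The inline comments above can be verified with a quick sklearn check. This is a minimal sketch, assuming the old conversation-level metrics were computed by scoring correct_dialogues (1 per fully correct story, 0 otherwise) against an all-ones reference, as the comments suggest; the exact values are hypothetical and just match the 6 / 7 example from the logs:

from sklearn import metrics

# One entry per test story: 1 if every action in the story was predicted
# correctly, 0 otherwise (hypothetical values matching the 6 / 7 example above).
correct_dialogues = [1, 1, 1, 0, 1, 1, 1]
reference = [1] * len(correct_dialogues)  # the "true" labels are all ones

precision = metrics.precision_score(reference, correct_dialogues)  # 1.0 whenever any story is correct
recall = metrics.recall_score(reference, correct_dialogues)        # 6/7, identical to accuracy here
f1 = metrics.f1_score(reference, correct_dialogues)                # harmonic mean of 1.0 and recall, ~0.923
accuracy = metrics.accuracy_score(reference, correct_dialogues)    # 6/7, ~0.857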

Conversation level accuracy is also logged to file (i.e., results/story_report.json) as an additional field in the JSON dictionary:

{
...
 "conversation_accuracy": {"accuracy": 0.857, "correct": 6, "total": 7}
...
}

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

…ion level accuracy. Add conversation level accuracy to story_report.json
@kedz kedz linked an issue Feb 23, 2021 that may be closed by this pull request
@kedz kedz marked this pull request as ready for review February 23, 2021 21:12
@kedz kedz requested review from dakshvar22, koernerfelicia and joejuzl and removed request for dakshvar22 February 23, 2021 21:12
num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
Contributor

This seems like a typo? num_correct/num_correct == 1

Contributor

Also GH won't let me comment on the particular line because it's too far out of scope, but while you're at it would you remove the unused import on line 670? (from rasa.test import get_evaluation_metrics)

Contributor Author

Oof that's embarrassing! Thank you for catching! Will fix!

@koernerfelicia
Contributor

@akelad would we need to update the docs for this? I couldn't find any reference to the old conversation-level (CL) output, or the old story_report.json, in https://rasa.com/docs/rasa/testing-your-assistant

There wouldn't be anywhere else, right?

include_report=False,
num_convs = len(correct_dialogues)
num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0
Contributor

Optional:
Could do metrics.accuracy_score([1] * len(completed_trackers), correct_dialogues)

num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0

logger.info(
Contributor

Could we make _log_evaluation_table handle None (and skip those) so we can reuse that method?

Contributor Author

Yeah that sounds good! Will do.
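
For illustration, here is a rough sketch of what "handle None and skip those" could look like. The signature is simplified and hypothetical; the real _log_evaluation_table in rasa.core.test takes different parameters:

from typing import Optional
import logging

logger = logging.getLogger(__name__)

# Hypothetical, simplified sketch: metrics passed as None are simply not logged.
def _log_evaluation_table(
    name: str,
    accuracy: float,
    f1: Optional[float] = None,
    precision: Optional[float] = None,
    in_training_data_fraction: Optional[float] = None,
) -> None:
    logger.info(f"Evaluation Results on {name} level:")
    logger.info(f"\tAccuracy:         {accuracy:.3f}")
    if f1 is not None:
        logger.info(f"\tF1-Score:         {f1:.3f}")
    if precision is not None:
        logger.info(f"\tPrecision:        {precision:.3f}")
    if in_training_data_fraction is not None:
        logger.info(f"\tIn-data fraction: {in_training_data_fraction:.3f}")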

num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
Contributor

Does this differ from the accuracy above?

Contributor Author

Yeah, the accuracy above is the percentage of correctly predicted next actions. Should I rename it to action_accuracy to make the distinction clearer?

Contributor Author

I went ahead and renamed the previous accuracy to action_accuracy to make the distinction clear.

@akelad
Contributor

akelad commented Feb 24, 2021

no need to update docs, no

kedz added 2 commits February 24, 2021 11:59
…e optional arguments and reuse that function. Use sklearn.metrics.accuracy_score instead of computing manually. Remove unnecessary get_evaluation_metrics import.
Contributor

@koernerfelicia koernerfelicia left a comment

Looks good! It still needs a changelog entry, though

Contributor

@joejuzl joejuzl left a comment

Changes look good 🚀 - but I think we should definitely test the additions to the report file. We could also test the actual log output, as it's part of the UX.

num_convs = num_failed + num_correct
if num_convs:
    conv_accuracy = num_correct / num_convs
    report["conversation_accuracy"] = {
Contributor

We should test that this new data is there and correct in a unit test.

Contributor Author

Totally! I'll create a unit test in tests/core/test_test.py
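
For reference, a minimal sketch of the kind of assertion such a test could make, assuming the core evaluation has already written results/story_report.json as in the PR description (the path and keys are taken from the examples above):

import json
from pathlib import Path

# Read the report produced by the core evaluation and check the new field.
report = json.loads(Path("results/story_report.json").read_text())
conv_acc = report["conversation_accuracy"]

assert set(conv_acc) == {"accuracy", "correct", "total"}
assert conv_acc["correct"] <= conv_acc["total"]
assert abs(conv_acc["accuracy"] - conv_acc["correct"] / conv_acc["total"]) < 1e-3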

_log_evaluation_table(
    [1] * len(completed_trackers),
    [1] * len(correct_dialogues),
Contributor

I'm a bit lost on why this parameter changed. Was it wrong before? It doesn't look like the logic that uses it changed

Contributor Author

This was unintentional, but yes, it does not affect the logic at all. I had deleted the use of the whole _log_evaluation_table method here previously. When I put it back in, I just needed a list of 1s with length equal to the number of test stories, and since I had just used correct_dialogues above, I used that list (I lexically entrained myself :)). Will change it back!

Contributor

Thank you!

@kedz
Contributor Author

kedz commented Feb 25, 2021

@koernerfelicia I don't know the best practice for making a changelog entry, or where it should go. Is there somewhere I can find that?

@kedz
Contributor Author

kedz commented Feb 25, 2021

@joejuzl @koernerfelicia I'm working on a unit test, but I'm realizing I need an Agent with a core policy. I really just need one with a rule policy to test this. Can either of you point me to an example of setting up a simple agent in a unit test?

…o _log_evaluation_table to preserve continuity with previous code version.
@koernerfelicia
Contributor

koernerfelicia commented Feb 26, 2021

@kedz ah I didn't realise this was your first time doing a changelog entry, otherwise I would've said! It's here, and linked in the "to do" item under "Status" in the PR template if you ever need to refer to it again.

@koernerfelicia
Contributor

koernerfelicia commented Feb 26, 2021

I need an Agent with a core policy.

I think you should find this in one of the conftest files. Or test_agent.py might have some other ideas?

@kedz
Contributor Author

kedz commented Mar 1, 2021

@koernerfelicia @joejuzl OK, I added a changelog file and I added some tests for rasa.core.test, testing both that the results of rasa.core.test.test get logged to file and that _log_evaluation_table logs output to the console as expected. Let me know if there is anything else. Otherwise I think it's good to merge?

@kedz kedz requested a review from joejuzl March 1, 2021 21:51
    expected_results: Dict[Text, Dict[Text, Any]],
) -> None:

    stories_path = tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml")
Contributor

@koernerfelicia koernerfelicia Mar 2, 2021

Correct me if I'm wrong, but I think you want to clean up this file after the test is done? Or is that something that happens automatically?

Contributor Author

tmpdir_factory and tmpdir test fixtures get deleted automatically, with the caveat that the three most recent tmpdirs are kept alive in case you need to inspect the contents due to a test failing. I did clean this up slightly and moved the creation of stories_path to a test fixture so that the args to test_test are clearer.
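
A minimal sketch of the fixture described here, assuming pytest's built-in tmpdir_factory (the fixture and temp-directory names are the ones quoted above):

import pytest

@pytest.fixture
def stories_path(tmpdir_factory) -> str:
    # pytest cleans these directories up automatically, keeping only the most
    # recent few runs around for debugging failed tests.
    return str(tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml"))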

Contributor

@joejuzl joejuzl left a comment

Thanks for adding tests! A couple of comments on the test code.



@pytest.fixture(scope="function")
def out_directory(tmpdir: pathlib.Path):
Contributor

Is there any need for this fixture?

Contributor Author

I guess not. I just thought it looked cleaner to have the args to test_test be more informative, e.g., out_directory instead of tmpdir, but I can change it.

Contributor

Makes sense, and I'm generally all for more informative names. tmpdir is pretty widely used around the code base for this purpose, though.

    is_rule_tracker=True,
)

policy.train([rt1, rt2], domain, RegexInterpreter())
Contributor

would it be possible to use one of the already trained model fixtures? These are session scope so save a lot of time in the CI.

Contributor Author

Oh I did not know about those. Where can I find them?

Contributor

Could you use the agent below? You'd have to change the expected values, but I think that could work
https://github.com/RasaHQ/rasa/blob/main/tests/conftest.py#L127

Contributor

There are lots in rasa/tests/conftest.py, for example default_agent (which is used in tests/core/test_evaluation.py too).

Contributor Author

Thank you both!
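
A rough illustration of the session-scope point made above: an expensive fixture, such as a trained agent, is built once per test session rather than once per test. The fixture name and body here are hypothetical:

import pytest

@pytest.fixture(scope="session")
def trained_core_agent():
    # Hypothetical: build or load an agent once and share it across all tests
    # in the session, instead of retraining it in every test function.
    agent = ...  # e.g. load a pre-trained model fixture
    return agent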

},
},
],
["", {}],
Contributor

Optional: IMO this could just be a separate test to make it more readable.

Contributor Author

Will do!

from rasa.shared.core.events import ActionExecuted, UserUttered
from rasa.shared.core.generator import TrackerWithCachedStates
from rasa.shared.nlu.interpreter import RegexInterpreter

Contributor

These tests could be added to tests/core/test_evaluation.py where (for some reason) rasa.core.test is imported as evaluate_stories

@kedz
Contributor Author

kedz commented Mar 4, 2021

OK, I made the following changes to the tests:

  1. Used an agent test fixture.
  2. Moved the tests from tests/core/test_test.py to tests/core/test_evaluation.py.
  3. Split test_test into two tests: one for a stories.yml that contains stories, and one for the edge case where the stories file is empty.
  4. Removed the file-creation test fixtures; the tests now just use tmpdir and create the necessary files themselves.
  5. Renamed test_test to test_story_report and test_story_report_with_empty_stories, since the name test_test is confusing in test_evaluation (where test is imported as evaluate_stories), and the tests are really about the creation and contents of story_report.json anyway.

@kedz kedz requested review from joejuzl and koernerfelicia March 4, 2021 19:28
Contributor

@koernerfelicia koernerfelicia left a comment

This looks good to me! The tests look so much neater with the core_agent (and faster, of course). I won't approve yet because I'd like to see what @joejuzl has to say about the tests.

Contributor

@joejuzl joejuzl left a comment

Great testing! 🚀

@koernerfelicia
Contributor

Actually @joejuzl what's good with the failing test? Does Chris need to flag the test as being a "training" test or something like that? I'm not in the loop on the splitting of CI tests

@joejuzl
Contributor

joejuzl commented Mar 5, 2021

Actually @joejuzl what's good with the failing test? Does Chris need to flag the test as being a "training" test or something like that? I'm not in the loop on the splitting of CI tests

haha yes good spot, this was my change too 🤦
You need to use the decorator @pytest.mark.trains_model on any test that advertently or inadvertently trains a model.
(see: https://rasa-hq.slack.com/archives/C36SS4N8M/p1614682343228200)
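
A small usage sketch of the marker mentioned above (the test name is the one from this PR; the body is omitted):

import pytest

# Hypothetical sketch: the marker flags tests that train a model so CI can group them.
@pytest.mark.trains_model
def test_story_report():
    ...  # test body that (directly or indirectly) trains a model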

@kedz
Contributor Author

kedz commented Mar 5, 2021

Ah I was also not aware of marking tests that train! Added the marks. I'm still seeing some errors on the CI but I don't know what they relate to. Any ideas?

@kedz kedz requested review from joejuzl and koernerfelicia March 5, 2021 14:54
@koernerfelicia
Contributor

@kedz no idea, they don't look related! I just restarted the jobs, maybe it was just a fluke?

@kedz
Contributor Author

kedz commented Mar 5, 2021

Looks like it worked this time! Should I go ahead and merge?

@koernerfelicia
Contributor

@kedz go for it!

@kedz kedz merged commit c78fc50 into main Mar 5, 2021
@kedz kedz deleted the improve_test_core_logging branch March 5, 2021 19:39
@kedz
Contributor Author

kedz commented Mar 5, 2021

Oh wow, my first merge into Rasa OS, thanks @koernerfelicia and @joejuzl for reviewing/pointing me to all the things!

@koernerfelicia
Contributor

Congratulations!


Successfully merging this pull request may close these issues.

Clean up and log to file Conversation level performance measures.
4 participants