Fix logging of conversation level core metrics. #8030

Merged

kedz merged 18 commits into main from improve_test_core_logging on Mar 5, 2021

Conversation

Contributor

@kedz kedz commented Feb 23, 2021

Fix logging of conversation level core metrics to only compute and log conversation level accuracy. Formerly, F1-score and precision were also printed to the console, but upon investigation these metrics were not very meaningful (precision was always 1.0 if any story was completely correct and 0 otherwise). Additionally, the in-data fraction was printed twice. The new output of rasa test looks like this:

2021-02-23 15:54:43 INFO     rasa.core.test  - Finished collecting predictions.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          6 / 7
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.857
2021-02-23 15:54:43 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          33 / 35
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.960
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        0.970
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.961
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886

Formerly the output looked like this:

2021-02-23 15:54:43 INFO     rasa.core.test  - Finished collecting predictions.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on END-TO-END level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          6 / 7
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.923   # <-- harmonic mean of 1 and recall 
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        1.0     # <-- this is always 1 if correct > 0
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.857
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886   # <-- printed again below in ACTION level results
2021-02-23 15:54:43 INFO     rasa.core.test  - Stories report saved to results/story_report.json.
2021-02-23 15:54:43 INFO     rasa.core.test  - Evaluation Results on ACTION level:
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Correct:          33 / 35
2021-02-23 15:54:43 INFO     rasa.core.test  - 	F1-Score:         0.960
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Precision:        0.970
2021-02-23 15:54:43 INFO     rasa.core.test  - 	Accuracy:         0.961
2021-02-23 15:54:43 INFO     rasa.core.test  - 	In-data fraction: 0.886
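
The inline comments above can be verified with a quick sklearn check. This is a minimal sketch, assuming the old conversation-level metrics were computed by scoring correct_dialogues (1 per fully correct story, 0 otherwise) against an all-ones reference, as the comments suggest; the exact values are hypothetical and just match the 6 / 7 example from the logs:

from sklearn import metrics

# One entry per test story: 1 if every action in the story was predicted
# correctly, 0 otherwise (hypothetical values matching the 6 / 7 example above).
correct_dialogues = [1, 1, 1, 0, 1, 1, 1]
reference = [1] * len(correct_dialogues)  # the "true" labels are all ones

precision = metrics.precision_score(reference, correct_dialogues)  # 1.0 whenever any story is correct
recall = metrics.recall_score(reference, correct_dialogues)        # 6/7, identical to accuracy here
f1 = metrics.f1_score(reference, correct_dialogues)                # harmonic mean of 1.0 and recall, ~0.923
accuracy = metrics.accuracy_score(reference, correct_dialogues)    # 6/7, ~0.857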

Conversation level accuracy is also logged to file (i.e., results/story_report.json) as an additional field in the JSON dictionary:

{
...
 "conversation_accuracy": {"accuracy": 0.857, "correct": 6, "total": 7}
...
}

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

…ion level accuracy. Add conversation level accuracy to story_report.json
@kedz kedz linked an issue Feb 23, 2021 that may be closed by this pull request
@kedz kedz marked this pull request as ready for review February 23, 2021 21:12
@kedz kedz requested review from dakshvar22, koernerfelicia and joejuzl and removed request for dakshvar22 February 23, 2021 21:12
num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
Contributor

This seems like a typo? num_correct/num_correct == 1

Contributor

Also GH won't let me comment on the particular line because it's too far out of scope, but while you're at it would you remove the unused import on line 670? (from rasa.test import get_evaluation_metrics)

Contributor Author

Oof that's embarrassing! Thank you for catching! Will fix!

@koernerfelicia
Contributor

@akelad would we need to update the docs for this? I couldn't find any reference to the old conversation-level (CL) output, or the old story_report.json, in https://rasa.com/docs/rasa/testing-your-assistant

There wouldn't be anywhere else, right?

include_report=False,
num_convs = len(correct_dialogues)
num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0
Contributor

Optional:
Could do metrics.accuracy_score([1] * len(completed_trackers), correct_dialogues)

num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0

logger.info(
Contributor

Could we make _log_evaluation_table handle None (and skip those) so we can reuse that method?

Contributor Author

Yeah that sounds good! Will do.
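
For illustration, here is a rough sketch of what "handle None and skip those" could look like. The signature is simplified and hypothetical; the real _log_evaluation_table in rasa.core.test takes different parameters:

from typing import Optional
import logging

logger = logging.getLogger(__name__)

# Hypothetical, simplified sketch: metrics passed as None are simply not logged.
def _log_evaluation_table(
    name: str,
    accuracy: float,
    f1: Optional[float] = None,
    precision: Optional[float] = None,
    in_training_data_fraction: Optional[float] = None,
) -> None:
    logger.info(f"Evaluation Results on {name} level:")
    logger.info(f"\tAccuracy:         {accuracy:.3f}")
    if f1 is not None:
        logger.info(f"\tF1-Score:         {f1:.3f}")
    if precision is not None:
        logger.info(f"\tPrecision:        {precision:.3f}")
    if in_training_data_fraction is not None:
        logger.info(f"\tIn-data fraction: {in_training_data_fraction:.3f}")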

num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
Contributor

Does this differ from the accuracy above?

Contributor Author

Yeah, the accuracy above is the percentage of correctly predicted next actions. Should I rename it to action_accuracy to make the distinction clearer?

Contributor Author

I went ahead and renamed the previous accuracy to action_accuracy to make the distinction clear.

@akelad
Contributor

akelad commented Feb 24, 2021

no need to update docs, no

kedz added 2 commits February 24, 2021 11:59
…e optional arguments and reuse that function. Use sklearn.metrics.accuracy_score instead of computing manually. Remove unnecessary get_evaluation_metrics import.
Contributor

@koernerfelicia koernerfelicia left a comment

Looks good! It still needs a changelog entry, though

Contributor

@joejuzl joejuzl left a comment

Changes look good 🚀 - but I think we should definitely test the additions to the report file. We could also test the actual log output, as it's part of the UX.

num_convs = num_failed + num_correct
if num_convs:
    conv_accuracy = num_correct / num_convs
    report["conversation_accuracy"] = {
Contributor

We should test that this new data is there and correct in a unit test.

Contributor Author

Totally! I'll create a unit test in tests/core/test_test.py
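
For reference, a minimal sketch of the kind of assertion such a test could make, assuming the core evaluation has already written results/story_report.json as in the PR description (the path and keys are taken from the examples above):

import json
from pathlib import Path

# Read the report produced by the core evaluation and check the new field.
report = json.loads(Path("results/story_report.json").read_text())
conv_acc = report["conversation_accuracy"]

assert set(conv_acc) == {"accuracy", "correct", "total"}
assert conv_acc["correct"] <= conv_acc["total"]
assert abs(conv_acc["accuracy"] - conv_acc["correct"] / conv_acc["total"]) < 1e-3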

_log_evaluation_table(
    [1] * len(completed_trackers),
    [1] * len(correct_dialogues),
Contributor

I'm a bit lost on why this parameter changed. Was it wrong before? It doesn't look like the logic that uses it changed

Contributor Author

This was unintentional, but yes, it does not affect the logic at all. I had deleted the use of the whole _log_evaluation_table method here previously. When I put it back in, I just needed a list of 1s with length equal to the number of test stories, and since I had just used correct_dialogues above, I used that list (I lexically entrained myself :)). Will change it back!

Contributor

Thank you!

@kedz
Contributor Author

kedz commented Feb 25, 2021

@koernerfelicia I don't know the best practice for making a changelog entry, or where it should go. Is there somewhere I can find that?

@kedz
Contributor Author

kedz commented Feb 25, 2021

@joejuzl @koernerfelicia I'm working on a unit test, but I'm realizing I need an Agent with a core policy. I really just need one with a rule policy to test this. Can either of you point me to an example of setting up a simple agent in a unit test?

…o _log_evaluation_table to preserve continuity with previous code version.
@koernerfelicia
Contributor

koernerfelicia commented Feb 26, 2021

@kedz ah I didn't realise this was your first time doing a changelog entry, otherwise I would've said! It's here, and linked in the "to do" item under "Status" in the PR template if you ever need to refer to it again.

@koernerfelicia
Contributor

koernerfelicia commented Feb 26, 2021

I need an Agent with a core policy.

I think you should find this in one of the conftest files. Or test_agent.py might have some other ideas?

@kedz
Contributor Author

kedz commented Mar 1, 2021

@koernerfelicia @joejuzl OK, I added a changelog file and I added some tests for rasa.core.test, testing both that the results of rasa.core.test.test get logged to file and that _log_evaluation_table logs output to the console as expected. Let me know if there is anything else. Otherwise I think it's good to merge?

@kedz kedz requested a review from joejuzl March 1, 2021 21:51
    expected_results: Dict[Text, Dict[Text, Any]],
) -> None:

    stories_path = tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml")
Contributor

@koernerfelicia koernerfelicia Mar 2, 2021

Correct me if I'm wrong, but I think you want to clean up this file after the test is done? Or is that something that happens automatically?

Contributor Author

tmpdir_factory and tmpdir test fixtures get deleted automatically, with the caveat that the three most recent tmpdirs are kept alive in case you need to inspect the contents due to a test failing. I did clean this up slightly and moved the creation of stories_path to a test fixture so that the args to test_test are clearer.
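
A minimal sketch of the fixture described here, assuming pytest's built-in tmpdir_factory (the fixture and temp-directory names are the ones quoted above):

import pytest

@pytest.fixture
def stories_path(tmpdir_factory) -> str:
    # pytest cleans these directories up automatically, keeping only the most
    # recent few runs around for debugging failed tests.
    return str(tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml"))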

Contributor

@joejuzl joejuzl left a comment

Thanks for adding tests! A couple of comments on the test code.



@pytest.fixture(scope="function")
def out_directory(tmpdir: pathlib.Path):
Contributor

Is there any need for this fixture?

Contributor Author

I guess not. I just thought it looked cleaner to have the args to test_test be more informative, e.g., out_directory instead of tmpdir, but I can change it.

Contributor

Makes sense, and I'm generally all for more informative names. tmpdir is pretty widely used around the code base for this purpose, though.

    is_rule_tracker=True,
)

policy.train([rt1, rt2], domain, RegexInterpreter())
Contributor

would it be possible to use one of the already trained model fixtures? These are session scope so save a lot of time in the CI.

Contributor Author

Oh I did not know about those. Where can I find them?

Contributor

Could you use the agent below? You'd have to change the expected values, but I think that could work
https://github.com/RasaHQ/rasa/blob/main/tests/conftest.py#L127

Contributor

There are lots in rasa/tests/conftest.py, for example default_agent (which is used in tests/core/test_evaluation.py too).

Contributor Author

Thank you both!
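
A rough illustration of the session-scope point made above: an expensive fixture, such as a trained agent, is built once per test session rather than once per test. The fixture name and body here are hypothetical:

import pytest

@pytest.fixture(scope="session")
def trained_core_agent():
    # Hypothetical: build or load an agent once and share it across all tests
    # in the session, instead of retraining it in every test function.
    agent = ...  # e.g. load a pre-trained model fixture
    return agent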

},
},
],
["", {}],
Contributor

Optional: IMO this could just be a separate test to make it more readable.

Contributor Author

Will do!

from rasa.shared.core.events import ActionExecuted, UserUttered
from rasa.shared.core.generator import TrackerWithCachedStates
from rasa.shared.nlu.interpreter import RegexInterpreter

Contributor

These tests could be added to tests/core/test_evaluation.py where (for some reason) rasa.core.test is imported as evaluate_stories

@kedz
Contributor Author

kedz commented Mar 4, 2021

OK, I made the following changes to the tests:

  1. Used an agent test fixture.
  2. Moved the tests from tests/core/test_test.py to tests/core/test_evaluation.py.
  3. Split test_test into two tests: one for a stories.yml that contains stories, and one for the edge case where the stories file is empty.
  4. Removed the file-creation test fixtures; the tests now just use tmpdir and create the necessary files themselves.
  5. Renamed test_test to test_story_report and test_story_report_with_empty_stories, since the name test_test is confusing in test_evaluation (where test is imported as evaluate_stories), and the tests are really about the creation and contents of story_report.json anyway.

@kedz kedz requested review from joejuzl and koernerfelicia March 4, 2021 19:28
Contributor

@koernerfelicia koernerfelicia left a comment

This looks good to me! The tests look so much neater with the core_agent (and faster, of course). I won't approve yet because I'd like to see what @joejuzl has to say about the tests.

Contributor

@joejuzl joejuzl left a comment

Great testing! 🚀

@koernerfelicia
Contributor

Actually @joejuzl what's good with the failing test? Does Chris need to flag the test as being a "training" test or something like that? I'm not in the loop on the splitting of CI tests

@joejuzl
Contributor

joejuzl commented Mar 5, 2021

Actually @joejuzl what's good with the failing test? Does Chris need to flag the test as being a "training" test or something like that? I'm not in the loop on the splitting of CI tests

haha yes good spot, this was my change too 🤦
You need to use the decorator @pytest.mark.trains_model on any test that advertently or inadvertently trains a model.
(see: https://rasa-hq.slack.com/archives/C36SS4N8M/p1614682343228200)
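
A small usage sketch of the marker mentioned above (the test name is the one from this PR; the body is omitted):

import pytest

# Hypothetical sketch: the marker flags tests that train a model so CI can group them.
@pytest.mark.trains_model
def test_story_report():
    ...  # test body that (directly or indirectly) trains a model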

@kedz
Contributor Author

kedz commented Mar 5, 2021

Ah I was also not aware of marking tests that train! Added the marks. I'm still seeing some errors on the CI but I don't know what they relate to. Any ideas?

@kedz kedz requested review from joejuzl and koernerfelicia March 5, 2021 14:54
@koernerfelicia
Contributor

@kedz no idea, they don't look related! I just restarted the jobs, maybe it was just a fluke?

@kedz
Contributor Author

kedz commented Mar 5, 2021

Looks like it worked this time! Should I go ahead and merge?

@koernerfelicia
Contributor

@kedz go for it!

@kedz kedz merged commit c78fc50 into main Mar 5, 2021
@kedz kedz deleted the improve_test_core_logging branch March 5, 2021 19:39
@kedz
Contributor Author

kedz commented Mar 5, 2021

Oh wow, my first merge into Rasa OS, thanks @koernerfelicia and @joejuzl for reviewing/pointing me to all the things!

@koernerfelicia
Contributor

Congratulations!


Successfully merging this pull request may close these issues.

Clean up and log to file Conversation level performance measures.
4 participants