Fix logging of conversation level core metrics. #8030
Conversation
…ion level accuracy. Add conversation level accuracy to story_report.json
rasa/core/test.py
Outdated
num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
This seems like a typo? num_correct/num_correct == 1
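For reference, a minimal sketch of what the line presumably intends, dividing by the total number of conversations rather than by num_correct (this mirrors the guard used elsewhere in the PR; story_evaluation comes from the surrounding code in rasa/core/test.py):

```python
num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
# Divide by the total, not by num_correct; fall back to 0.0 for an empty test set.
conv_acc = num_correct / num_convs if num_convs else 0.0
```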
Also, GH won't let me comment on the particular line because it's too far out of scope, but while you're at it would you remove the unused import on line 670? (from rasa.test import get_evaluation_metrics)
Oof that's embarrassing! Thank you for catching! Will fix!
@akelad would we need to update the docs for this? I couldn't find any reference to the old CL output, or the old … There wouldn't be anywhere else, right?
rasa/core/test.py
Outdated
include_report=False,
num_convs = len(correct_dialogues)
num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0
Optional:
Could do metrics.accuracy_score([1] * len(completed_trackers), correct_dialogues)
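For context, a small standalone illustration of that suggestion; the list below is made-up data standing in for correct_dialogues, which in the diff holds one 0/1 flag per evaluated test story:

```python
from sklearn import metrics

# One flag per test conversation: 1 if every action in it was predicted correctly.
correct_dialogues = [1, 0, 1, 1]

# Comparing against an all-ones "ground truth" makes accuracy_score return the
# fraction of fully correct conversations: 3/4 = 0.75 here.
accuracy = metrics.accuracy_score([1] * len(correct_dialogues), correct_dialogues)
print(accuracy)
```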
rasa/core/test.py
Outdated
num_correct = sum(correct_dialogues)
accuracy = num_correct / num_convs if num_convs else 0.0

logger.info(
Could we make _log_evaluation_table handle None (and skip those) so we can reuse that method?
Yeah that sounds good! Will do.
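A rough sketch of that idea, purely for illustration: the real _log_evaluation_table in rasa/core/test.py takes different arguments, so the signature and metric names below are assumptions, not the merged code.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def _log_evaluation_table(
    name: str,
    accuracy: float,
    f1: Optional[float] = None,          # hypothetical optional metrics;
    precision: Optional[float] = None,   # pass None to skip logging them
    in_training_data_fraction: Optional[float] = None,
) -> None:
    """Log evaluation results, skipping any metric that was not computed."""
    logger.info(f"Evaluation Results on {name} level:")
    logger.info(f"Accuracy:         {accuracy:.3f}")
    if f1 is not None:
        logger.info(f"F1-Score:         {f1:.3f}")
    if precision is not None:
        logger.info(f"Precision:        {precision:.3f}")
    if in_training_data_fraction is not None:
        logger.info(f"In-data fraction: {in_training_data_fraction:.3f}")
```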
rasa/core/test.py
Outdated
num_failed = len(story_evaluation.failed_stories)
num_correct = len(story_evaluation.successful_stories)
num_convs = num_failed + num_correct
conv_acc = num_correct / num_correct if num_correct else 0.0
Does this differ from the accuracy above?
Yeah, the accuracy above is the percentage of correctly predicted next actions. Should I rename it to action_accuracy to make it clearer?
I went ahead and renamed the previous accuracy to action_accuracy to make the distinction clear.
no need to update docs, no
…st_core_logging
…e optional arguments and reuse that function. Use sklearn.metrics.accuracy_score instead of computing manually. Remove unnecessary get_evaluation_metrics import.
Looks good! It still needs a changelog entry, though
Changes look good 🚀, but I think we should definitely test the additions to the report file. We could also test the actual log output, as it's part of the UX.
num_convs = num_failed + num_correct
if num_convs:
    conv_accuracy = num_correct / num_convs
    report["conversation_accuracy"] = {
We should test in a unit test that this new data is there and correct.
Totally! I'll create a unit test in tests/core/test_test.py
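A rough outline of what such a test might look like; the exact signature of rasa.core.test.test, the default_agent fixture, and the stories path below are assumptions rather than the final test that was merged.

```python
import json
from pathlib import Path

import pytest

import rasa.core.test


@pytest.mark.asyncio
async def test_story_report_contains_conversation_accuracy(
    tmp_path: Path, default_agent  # assumed pre-trained agent fixture from conftest.py
) -> None:
    # Assumed call signature: evaluate test stories and write results to tmp_path.
    await rasa.core.test.test(
        "data/test_stories/stories.yml",  # assumed path to evaluation stories
        default_agent,
        out_directory=str(tmp_path),
    )

    report = json.loads((tmp_path / "story_report.json").read_text())

    # The new field added in this PR should be present in the report.
    assert "conversation_accuracy" in report
```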
rasa/core/test.py
Outdated
_log_evaluation_table(
    [1] * len(completed_trackers),
    [1] * len(correct_dialogues),
I'm a bit lost on why this parameter changed. Was it wrong before? It doesn't look like the logic that uses it changed.
This was unintentional, but yes, it does not affect the logic at all. I had previously deleted the use of the whole _log_evaluation_table method here. When I put it back in, I just needed a list of 1s with length equal to the number of test stories, and since I had just used correct_dialogues above, I used that list (I lexically entrained myself :)). Will change it back!
Thank you!
@koernerfelicia I don't know what to do, where it goes, or what the best practices are for making a changelog entry. Is there somewhere I can find that?
@joejuzl @koernerfelicia Working on a unit test, but I realize I need an Agent with a core policy. I really just need one with a rule policy to test this. Can either of you point me to an example of setting up a simple agent in a unit test?
…o _log_evaluation_table to preserve continuity with previous code version.
I think you should find this in one of the …
@koernerfelicia @joejuzl OK, I added a changelog file and I added some tests for rasa.core.test, testing both that the results of rasa.core.test.test get logged to file and that _log_evaluation_table logs output to the console as expected. Let me know if there is anything else. Otherwise I think it's good to merge?
tests/core/test_test.py
Outdated
    expected_results: Dict[Text, Dict[Text, Any]],
) -> None:

    stories_path = tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml")
Correct me if I'm wrong, but I think you want to clean up this file after the test is done? Or is that something that happens automatically?
tmpdir_factory and tmpdir test fixtures get deleted automatically, with the caveat that the three most recent tmpdirs are kept alive in case you need to inspect the contents due to a test failing. I did clean this up slightly and moved the creation of stories_path to a test fixture so that the args to test_test are clearer.
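For illustration, a minimal sketch of that kind of fixture; the story content below is a placeholder, not the stories used in the PR:

```python
import pytest


@pytest.fixture
def stories_path(tmpdir_factory) -> str:
    """Write evaluation stories to a temp file that pytest cleans up automatically."""
    path = tmpdir_factory.mktemp("test_rasa_core_test").join("eval_stories.yml")
    path.write(
        "stories:\n"
        "- story: greet\n"
        "  steps:\n"
        "  - intent: greet\n"
        "  - action: utter_greet\n"
    )
    return str(path)
```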
Thanks for adding tests! A couple of comments on the test code.
tests/core/test_test.py
Outdated
@pytest.fixture(scope="function")
def out_directory(tmpdir: pathlib.Path):
Is there any need for this fixture?
I guess not. I just thought it looked cleaner to have the args to test_test be more informative, e.g., out_directory instead of tmpdir, but I can change it.
Makes sense, and I'm generally all for more informative names. tmpdir is pretty widely used around the code base for this purpose, though.
tests/core/test_test.py
Outdated
    is_rule_tracker=True,
)

policy.train([rt1, rt2], domain, RegexInterpreter())
Would it be possible to use one of the already trained model fixtures? These are session-scoped, so they save a lot of time in the CI.
Oh I did not know about those. Where can I find them?
Could you use the agent below? You'd have to change the expected values, but I think that could work
https://github.com/RasaHQ/rasa/blob/main/tests/conftest.py#L127
There are lots in rasa/tests/conftest.py, for example default_agent (which is used in tests/core/test_evaluation.py too).
Thank you both!
tests/core/test_test.py
Outdated
},
},
],
["", {}],
Optional: IMO this could just be a separate test to make it more readable.
Will do!
tests/core/test_test.py
Outdated
from rasa.shared.core.events import ActionExecuted, UserUttered
from rasa.shared.core.generator import TrackerWithCachedStates
from rasa.shared.nlu.interpreter import RegexInterpreter
These tests could be added to tests/core/test_evaluation.py, where (for some reason) rasa.core.test is imported as evaluate_stories.
OK, I made the following changes to the tests:
This looks good to me! The tests look so much neater with the core_agent (and faster, ofc). I won't approve yet because I'd like to see what @joejuzl has to say about the tests.
Great testing! 🚀
Actually @joejuzl, what's up with the failing test? Does Chris need to flag the test as a "training" test or something like that? I'm not in the loop on the splitting of CI tests.
haha yes, good spot, this was my change too 🤦
Ah, I was also not aware of marking tests that train! Added the marks. I'm still seeing some errors on the CI, but I don't know what they relate to. Any ideas?
@kedz no idea, they don't look related! I just restarted the jobs, maybe it was just a fluke?
Looks like it worked this time! Should I go ahead and merge?
@kedz go for it!
Oh wow, my first merge into Rasa OS. Thanks @koernerfelicia and @joejuzl for reviewing and pointing me to all the things!
Congratulations!
Fix logging of conversation-level core metrics so that only conversation-level accuracy is computed and logged. Formerly, F1-score and precision were printed to the console, but upon investigation these metrics were not very meaningful (precision was always 1.0 if any story was completely correct and 0 otherwise). Additionally, the in-data fraction was printed twice. The new output of rasa test looks like this:
Formerly the output looked like this:
Additionally, conversation-level accuracy is logged to file (i.e., results/story_report.json) as an additional field in a JSON dictionary.

Status (please check what you already did):
black (please check Readme for instructions)