Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data split cli #6604

Merged
merged 22 commits into from
Sep 14, 2020
Merged

Fix data split cli #6604

merged 22 commits into from
Sep 14, 2020

Conversation

m-vdb
Copy link
Collaborator

@m-vdb m-vdb commented Sep 8, 2020

Proposed changes:

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

nlg_test_data.yml test_data.json
nlg_training_data.yml training_data.json
nlg_test_data.yml test_data.yml
nlg_training_data.yml training_data.yml
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@akelad after reading the docs page on rasa data split nlu, this seemed like the only change required. Did you have something else in mind in your description of #6588 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, since the response selector training data format changed, i wasn't actually sure that the nlg stuff is still relevant. but @dakshvar22 PR isn't merged yet, so it may be up to him!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will still be relevant for the time being because response templates for responses still need to be defined outside the regular nlu, stories, rules yml files. See #6566 cc @degiz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait so responses: separately and responses: under domain function differently? that seems error prone

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wait for sasha to answer that since he has been working on it.

@m-vdb m-vdb requested a review from degiz September 9, 2020 07:38
Copy link
Contributor

@wochinge wochinge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for taking this over 🙏

@degiz degiz self-requested a review September 11, 2020 10:09
Copy link
Contributor

@degiz degiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for taking this over! 🚀

I've tried this:

rasa data split nlu -u ./examples/responseselectorbot/data --out ~/Desktop/split

And it created 4 files with the same content. Then I've tried the same with a moodbot, and it gave me 4 empty files :(

Also one file is named nlg_training_data..yml, with two dots in the middle.

So I don't think that existing test is complete until it verifies the file's content.

@@ -19,6 +20,7 @@

logger = logging.getLogger(__name__)

DEFAULT_FILE_FORMAT = "yml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we re-use YAML_FILE_EXTENSIONS?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well it's a set(), so I don't know if we want to define the default file format there or not :/

@m-vdb m-vdb force-pushed the fix-data-split-cli branch from 587d787 to 86bde37 Compare September 11, 2020 13:59
@m-vdb m-vdb requested a review from degiz September 11, 2020 14:06
@@ -311,10 +311,10 @@ def persist_nlu(self, filename: Text = DEFAULT_TRAINING_DATA_OUTPUT_PATH) -> Non
elif rasa.shared.data.is_likely_markdown_file(filename):
rasa.shared.utils.io.write_text_file(self.nlu_as_markdown(), filename)
elif rasa.shared.data.is_likely_yaml_file(filename):
rasa.shared.utils.io.write_text_file(self.nlg_as_yaml(), filename)
rasa.shared.utils.io.write_text_file(self.nlu_as_yaml(), filename)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the bad guy

# Add nlg_ as prefix and change extension to .md
filename = (
Path(nlu_filename)
.with_name("nlg_" + Path(nlu_filename).name)
.with_suffix("." + extension)
.with_suffix(extension)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is another bad guy

Copy link
Contributor

@degiz degiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing that!
Existing tests didn't catch that, so I still think we need to extend them to be able to verify the content of the generated files.

@m-vdb
Copy link
Collaborator Author

m-vdb commented Sep 11, 2020

@degiz
Copy link
Contributor

degiz commented Sep 11, 2020

@m-vdb I have! I just think that it won't catch a situation when all four generated files have the same content.

@m-vdb
Copy link
Collaborator Author

m-vdb commented Sep 11, 2020

🤔 hum OK. I wasn't able to reproduce this one. I am not sure I'll be able to finish it before leaving. Can you take it over?

@degiz degiz self-assigned this Sep 11, 2020
@degiz degiz force-pushed the fix-data-split-cli branch from 3451cac to 40155c7 Compare September 14, 2020 18:55
@degiz degiz self-requested a review September 14, 2020 19:01
Copy link
Contributor

@degiz degiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed failing tests and accidentally fixed a bug with --persist-nlu-data flag for rasa train

@degiz degiz merged commit 4c8654c into master Sep 14, 2020
@degiz degiz deleted the fix-data-split-cli branch September 14, 2020 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rasa data split nlu no longer works Re-add test for test_data_split_nlu after YAML writer is ready
6 participants