Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow single token to be annotated by multiple entities in test stories #10773

Merged
merged 43 commits into from
Feb 8, 2022

Conversation

JEM-Mosig
Copy link

@JEM-Mosig JEM-Mosig commented Jan 28, 2022

This continues the work from here: #10394

Proposed changes:

  • ...

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

raoulvm and others added 23 commits November 25, 2021 12:09
and removed debug output
to pass `poetry run flake8 rasa tests --extend-ignore D`
"D415 First line should end with a period, question mark,
or exclamation point"
With best regards to @erohmensing ...
to work with overlapping entities.
though I do not have changed that code block
Copy link
Author

@JEM-Mosig JEM-Mosig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first few changes...

@github-actions
Copy link
Contributor

github-actions bot commented Feb 3, 2022

The model regression tests have started. It might take a while, please be patient.
As soon as results are ready you'll see a new comment with the results.

Used configuration can be found in the comment.

@raoulvm
Copy link
Contributor

raoulvm commented Feb 3, 2022

Great work, thanks for your effort!

@github-actions
Copy link
Contributor

github-actions bot commented Feb 4, 2022

@raoulvm
Copy link
Contributor

raoulvm commented Feb 4, 2022

Hi @JEM-Mosig I had a look at the regression tests... The data for those tests seem to include invalid data, such as

UserWarning: Misaligned entity annotation in message 'set the time zone to g. m. t.' with intent 'datetime_convert'. Make sure the start and end values of entities ([(21, 29, 'g. m. t.')]) in the training data match the token boundaries ([(0, 3, 'set'), (4, 7, 'the'), (8, 12, 'time'), (13, 17, 'zone'), (18, 20, 'to'), (21, 22, 'g'), (24, 25, 'm'), (27, 28, 't')]). Common causes:

  1. entities include trailing whitespaces or punctuation
  2. the tokenizer gives an unexpected result, due to languages such as Chinese that don't use whitespace for word separation
    More info at https://rasa.com/docs/rasa/nlu-training-data

I searches the repo for those test strings but didn't see them in rasa. Wanted to help but could not. :-(

@JEM-Mosig
Copy link
Author

@raoulvm Thanks! Yes, that's mainly why I ran those regression tests. It's so easy to overlook something and this way we have a better chance of catching such problems. I'll look into this today.

@rctatman Can you please review the docs changes of this PR in the meantime? It's not a big change.

@JEM-Mosig
Copy link
Author

JEM-Mosig commented Feb 4, 2022

So the warnings come from here, but these are related to other issues that also appear in the released versions. The tests fail due to out of memory problems, however. 🤦‍♂️

@@ -323,6 +344,20 @@ your trainable entity extractors are trained to recognize.
Only trainable entity extractors, such as the `DIETClassifier` and `CRFEntityExtractor` are
evaluated by `rasa test`. Pretrained extractors like the `DucklingHTTPExtractor` are not evaluated.

If you have multiple entity extractors in your pipeline, or use some custom extractors,
it is possible that multiple entities are associated with the same token. In this case,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"multiple entities can be associated"?

Copy link

@rctatman rctatman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs changes LGTM, left on comment on a bit of phrasing that seemed awkward but isn't obligatory to change. Otherwise very clear.

@JEM-Mosig
Copy link
Author

Ok, looks like there is one last test failing. I'll try to fix that today. Afterwards, I'd like another review from @tayfun. In particular, I had to add two docstrings for classes that are not in the research codebase, eventhough I didn't touch them (the linter complains otherwise).

@raoulvm
Copy link
Contributor

raoulvm commented Feb 7, 2022

I am not sure you'll be able to fix this worker crash by changing the code...
Continuous Integration / Run Tests (test-nlu-featurizers, windows-latest, 3.7) (pull_request)

================================== FAILURES ===================================
[55](https://github.com/RasaHQ/rasa/runs/5069805251?check_suite_focus=true#step:14:55)
_________________ tests/nlu/featurizers/test_lm_featurizer.py _________________
[56](https://github.com/RasaHQ/rasa/runs/5069805251?check_suite_focus=true#step:14:56)
[gw1] win32 -- Python 3.7.9 d:\a\rasa\rasa\.venv\scripts\python.exe
[57](https://github.com/RasaHQ/rasa/runs/5069805251?check_suite_focus=true#step:14:57)
worker 'gw1' crashed while running 'tests/nlu/featurizers/test_lm_featurizer.py::TestShapeValuesTrainAndProcess::test_lm_featurizer_shape_values_process[bert-bert-base-uncased-texts1-expected_shape1-expected_sequence_vec1-expected_cls_vec1]'

And it only crashed with Python 3.7 environment.

@JEM-Mosig JEM-Mosig requested a review from tayfun February 7, 2022 09:13
@JEM-Mosig JEM-Mosig marked this pull request as ready for review February 7, 2022 09:19
@JEM-Mosig JEM-Mosig requested review from a team and tttthomasssss and removed request for a team February 7, 2022 09:19
@JEM-Mosig
Copy link
Author

JEM-Mosig commented Feb 7, 2022

GitHub is broken and we can't open the Files tab for reviews (it yields an Error 500). I reached out to GitHub. But if it's not fixed by tomorrow morning I'll create a separate PR to work around this.

Copy link
Contributor

@tayfun tayfun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You broke Github but 💯

@raoulvm
Copy link
Contributor

raoulvm commented Feb 8, 2022

I'd never expected this to be such a braking change.

@JEM-Mosig
Copy link
Author

JEM-Mosig commented Feb 8, 2022

It works again! (I notified Tayfun and Thomas on Slack)

Copy link
Contributor

@tttthomasssss tttthomasssss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only have a small (and optional) suggestion, otherwise LGTM.

@JEM-Mosig JEM-Mosig enabled auto-merge (squash) February 8, 2022 09:55
@JEM-Mosig JEM-Mosig merged commit b506c70 into 2.8.x Feb 8, 2022
@JEM-Mosig JEM-Mosig deleted the 2.8.18-multi-entities-johannes branch February 8, 2022 10:24
@JEM-Mosig JEM-Mosig mentioned this pull request Feb 10, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants