Move token_pattern to tokenizers #6073

Merged 16 commits on Jul 7, 2020

@tabergma (Contributor) commented Jun 26, 2020

Proposed changes:
Remove the token_pattern option from CountVectorsFeaturizer.
Instead, all tokenizers now have a token_pattern option.
If a regular expression is set, every token detected by the tokenizer is further split into tokens according to that regular expression.
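
To illustrate the behaviour described above, here is a minimal sketch of the idea, not the code from this PR. The helper name `apply_token_pattern`, the use of plain strings instead of token objects, and the choice to keep a token unchanged when the pattern matches nothing in it are all assumptions made for the example.

```python
import re
from typing import List, Optional


def apply_token_pattern(tokens: List[str], token_pattern: Optional[str]) -> List[str]:
    """Split each token further using a regular expression (hypothetical helper)."""
    if not token_pattern:
        # No pattern configured: keep the tokenizer's output as-is.
        return tokens

    regex = re.compile(token_pattern)
    result: List[str] = []
    for token in tokens:
        # findall returns every non-overlapping match inside this token.
        parts = [part for part in regex.findall(token) if part]
        # Assumption: a token with no match is kept unchanged, not dropped.
        result.extend(parts if parts else [token])
    return result


# Whitespace tokenization first, then the pattern is applied per token.
tokens = "check order-1234 please".split()
print(apply_token_pattern(tokens, r"[A-Za-z]+|\d+"))
# ['check', 'order', '1234', 'please']
```

Because the pattern runs on the tokenizer's output rather than inside the featurizer, the same splitting would apply to any text that passes through the tokenizer. The example pattern avoids capturing groups, since `re.findall` returns tuples instead of strings when the pattern contains more than one group.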

closes #5905

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@tabergma tabergma requested a review from dakshvar22 June 26, 2020 13:36
@dakshvar22 (Contributor) left a comment

Looks good 🚀 One small edit suggested

Review threads (all outdated and resolved):
rasa/nlu/tokenizers/tokenizer.py
changelog/5905.bugfix.rst
rasa/nlu/tokenizers/convert_tokenizer.py
@rasabot (Collaborator) commented Jul 6, 2020

Could not update branch. Most likely this is due to a merge conflict. Please update the branch manually and fix any issues.

@tabergma tabergma merged commit 87f2e9d into master Jul 7, 2020
@tabergma tabergma deleted the token-pattern branch July 7, 2020 06:49
Issue this pull request may close: Token_pattern is incorrectly applied on incoming text during inference