ConvertTokenizer, CLS token & features as sequence #4996

Merged on Jan 13, 2020 (57 commits)

@tabergma commented Dec 19, 2019

Proposed changes:

  • Add a ConveRT tokenizer.
  • Remove the option use_cls_token from all tokenizers. Tokenizers now add the CLS token by default.
  • Remove the option return_sequence from featurizers. Featurizers now return a sequence by default. The feature vector of the CLS token contains the features for the complete utterance.
  • Implement train and process for tokenizers in the Tokenizer class. Subclasses only need to implement tokenize.
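
For readers who want a picture of the resulting design, here is a minimal sketch of the pattern described in the list above. It is not the code from this PR: the `Token` dataclass, the `_with_cls` helper, the `"__CLS__"` string, the dict-based messages, the choice to append the CLS token at the end, and the toy `featurize` function (whose CLS row is simply the sum of the other rows) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List

CLS_TOKEN = "__CLS__"  # assumed placeholder string for the CLS token


@dataclass
class Token:
    text: str
    start: int  # character offset of the token in the original text


class Tokenizer(ABC):
    """Base class that owns train/process; subclasses only implement tokenize."""

    @abstractmethod
    def tokenize(self, text: str) -> List[Token]:
        ...

    def _with_cls(self, text: str) -> List[Token]:
        # Hypothetical helper: always append the CLS token (no opt-out flag).
        tokens = self.tokenize(text)
        tokens.append(Token(CLS_TOKEN, len(text) + 1))
        return tokens

    def train(self, examples: List[Dict]) -> None:
        # Tokenize every training example in place.
        for example in examples:
            example["tokens"] = self._with_cls(example["text"])

    def process(self, message: Dict) -> None:
        # Tokenize a single incoming message in place.
        message["tokens"] = self._with_cls(message["text"])


class WhitespaceTokenizer(Tokenizer):
    def tokenize(self, text: str) -> List[Token]:
        tokens, offset = [], 0
        for word in text.split():
            start = text.index(word, offset)
            tokens.append(Token(word, start))
            offset = start + len(word)
        return tokens


def featurize(tokens: List[Token], vocab: List[str]) -> List[List[int]]:
    """Toy featurizer: one one-hot row per token, returned as a sequence.

    The last row belongs to the CLS token and is the column-wise sum of the
    other rows, so it describes the complete utterance (an assumption here).
    """
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for token in tokens[:-1]:  # every token except the trailing CLS token
        row = [0] * len(vocab)
        if token.text in index:
            row[index[token.text]] = 1
        rows.append(row)
    cls_row = [sum(col) for col in zip(*rows)] if rows else [0] * len(vocab)
    rows.append(cls_row)
    return rows


message = {"text": "hello big world"}
WhitespaceTokenizer().process(message)
features = featurize(message["tokens"], vocab=["hello", "world"])
```

In this sketch a new tokenizer only overrides `tokenize`, and a consumer that wants one vector per utterance reads the last row of `features` instead of toggling an option.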

closes #4978

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

@tabergma requested a review from Ghostvv January 2, 2020 15:18
@tabergma changed the title from "Convert tokenizer" to "ConvertTokenizer, CLS token & features as sequence" Jan 3, 2020
@Ghostvv left a comment

Looks good! Added a couple of minor comments.

@dakshvar22 left a comment

Just some minor suggestions. Great work! 🚀

tabergma and others added 6 commits January 13, 2020 10:53
Co-Authored-By: Daksh Varshneya <d.varshneya@rasa.com>
Co-Authored-By: Daksh Varshneya <d.varshneya@rasa.com>
Co-Authored-By: Daksh Varshneya <d.varshneya@rasa.com>
@tabergma merged commit 71c1228 into master Jan 13, 2020
@tabergma deleted the convert-tokenizer branch January 13, 2020 19:38
Linked issue: Tokenizer for ConveRT
3 participants