[Diet Classifier] ValueError: Number of examples should be the same for all data. #5508

Closed
robinsongh381 opened this issue Mar 27, 2020 · 17 comments
Labels
type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@robinsongh381

Rasa version: 1.9.2

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version:
3.6
Operating system (windows, osx, ...):
linux
Issue:
When training Rasa NLU (i.e. rasa train nlu), an error is raised from rasa/utils/tensorflow/model_data.py, line 107.

Error (including full traceback):

2020-03-27 13:13:06 INFO     rasa.nlu.model  - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.

Command or request that led to error:

rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models

Content of configuration file (config.yml) (if relevant):

language: "xx"
pipeline:
  - name: "component.KoreanTokenizer"
  - name: "intent_entity_featurizer_regex"
  - name: "intent_featurizer_count_vectors"
    "token_pattern": '(?u)\b\w+\b' # regex changed so that single characters are also recognized
  - name: DIETClassifier
    intent_classification: True
    entity_recognition: False
    use_masked_language_model: False
    BILOU_flag: False
    number_of_transformer_layers: 0
    epochs: 100

Content of domain file (domain.yml) (if relevant):

@robinsongh381 robinsongh381 added the type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. label Mar 27, 2020
@robinsongh381 robinsongh381 changed the title from "[Diet Classifier]" to "[Diet Classifier] ValueError: Number of examples should be the same for all data." Mar 27, 2020
@sara-tagger
Collaborator

Thanks for the issue, @rgstephens will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@Ghostvv
Contributor

Ghostvv commented Mar 27, 2020

It looks like some examples don't have intent labels.

@robinsongh381
Author

@Ghostvv

Hi, thanks for the reply.

I had a look at my nlu.md file and didn't find any issues.

I trained Rasa NLU with the same nlu.md on an older version of rasa-nlu (0.14.1) and the training was successful, so I don't think the problem is in nlu.md.

@Ghostvv
Contributor

Ghostvv commented Apr 1, 2020

Otherwise, it could be that some examples couldn't be featurized for some reason.

Version 0.14.1 didn't have this check.
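
The kind of consistency check that fails here can be illustrated with a minimal sketch. This is not Rasa's actual implementation: the function name `check_example_counts` and the toy feature lists are assumptions made for illustration, and only the error text is taken from the traceback above. It shows how a single example that yields no label (or no text features) makes the counts per key disagree.

```python
from typing import Dict, List, Sequence


def check_example_counts(data: Dict[str, List[Sequence]]) -> int:
    """Return the shared number of examples, or raise if the keys disagree.

    Hypothetical stand-in for the check named in the traceback;
    this is NOT Rasa's real code.
    """
    # Collect the number of examples found in every feature array.
    counts = {len(features) for values in data.values() for features in values}
    if not counts:
        return 0
    if len(counts) > 1:
        raise ValueError(
            f"Number of examples differs for keys '{data.keys()}'. "
            "Number of examples should be the same for all data."
        )
    return counts.pop()


# Toy data: three utterances were featurized, but only two produced a label.
data = {
    "text_features": [[[1, 0], [0, 1], [1, 1]]],
    "label_features": [[[1], [0]]],
}

try:
    check_example_counts(data)
except ValueError as error:
    print(error)
```

An older version without such a check would simply continue training on the mismatched data, which would be consistent with the same nlu.md training without complaint on 0.14.1.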

@stale

stale bot commented Jul 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 2, 2020
@JoaoVFelipe

Any update on this? I'm getting the same issue here, using rasa 1.10.5.

@stale stale bot removed the stale label Jul 5, 2020
@shfshf

shfshf commented Jul 6, 2020

Me too, using rasa 1.10.5:
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.

@JoaoVFelipe

Hi!
As a temporary workaround, I managed to train my bot by downgrading to rasa 1.10.1. At least here, it issues some warnings, but it finishes the training and works correctly.

@shfshf

shfshf commented Jul 6, 2020

Hi!
When I use rasa 1.10.1, the same error is still reported.

@tabergma
Contributor

tabergma commented Jul 6, 2020

@shfshf @JoaoVFelipe Is one of you able to share your NLU data + config.yml so that I can take a closer look at the problem? Without the data to reproduce the issue, it is hard to tell what is going wrong. Thanks.

@howl-anderson
Contributor

howl-anderson commented Jul 7, 2020

@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am a colleague of @shfshf and I provide the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:

  1. The same issue as "Rasa_nlu returns intent as null for training samples using tensorflow classifier(Chinese)" (#1515). I gave a very detailed explanation there, and I think it affects all East Asian languages (Chinese, Korean and more).
  2. An out-of-date custom tokenizer: the tokenizer I provide isn't compatible with current Rasa (1.10.5). Rasa changed the tokenizer protocol in 1.7.0 (https://github.com/RasaHQ/rasa/releases/tag/1.7.0):

    "By default all tokenizer add a special token (CLS) to the end of the list of tokens. This token will be used to capture the features of the whole utterance."
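
To make the protocol change in the release note quoted above concrete, here is a minimal sketch of a tokenizer that appends a special token to the end of its token list. It deliberately imports nothing from Rasa: the `Token` class, the `CLS_TOKEN` text, the offset chosen for the extra token and both helper functions are hypothetical stand-ins, so an official tokenizer remains the reference for the real interface.

```python
from dataclasses import dataclass
from typing import List

# Placeholder text for the special token (an assumption for this sketch).
CLS_TOKEN = "__CLS__"


@dataclass
class Token:
    """Hypothetical token type: the token text and its start offset."""

    text: str
    start: int


def whitespace_tokens(text: str) -> List[Token]:
    """Split on whitespace and record where each token starts."""
    tokens, offset = [], 0
    for word in text.split():
        start = text.index(word, offset)
        tokens.append(Token(word, start))
        offset = start + len(word)
    return tokens


def add_cls_token(tokens: List[Token], text: str) -> List[Token]:
    """Append the special token after all regular tokens.

    The offset just past the end of the text is an arbitrary choice here.
    """
    return tokens + [Token(CLS_TOKEN, len(text) + 1)]


utterance = "hello world"
tokens = add_cls_token(whitespace_tokens(utterance), utterance)
print([token.text for token in tokens])
```

A tokenizer written before this change would return only the regular tokens, so any later component that expects the extra utterance-level token would see one token fewer than it assumes.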

Solutions:

  1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to make a PR to make this the default for East Asian language settings).
  2. If you are using a custom tokenizer, check whether it supports the new tokenizer protocol. If it does not, rewrite it following one of the official tokenizers; the jieba tokenizer is a good example.
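
The effect of the token_pattern change in solution 1 can be checked with Python's `re` module alone. The sketch below assumes the featurizer's default pattern matches scikit-learn's `CountVectorizer` default, `(?u)\b\w\w+\b`, which only accepts tokens of two or more characters; the sample sentences are made up for illustration.

```python
import re

# scikit-learn's CountVectorizer default: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern suggested in this thread: a single word character is enough.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def extract(pattern: str, text: str) -> list:
    """Return every substring of `text` that the pattern accepts as a token."""
    return re.findall(pattern, text)


# Already-tokenized text, joined by spaces.
sentence = "我 是 学生"
print(extract(DEFAULT_PATTERN, sentence))      # ['学生']
print(extract(SINGLE_CHAR_PATTERN, sentence))  # ['我', '是', '学生']

# With the default pattern, an utterance made only of single-character
# tokens produces no tokens at all, so it cannot be featurized.
print(extract(DEFAULT_PATTERN, "你 好"))        # []
```

Single-character tokens are common in Chinese, Japanese and Korean text, which would explain why some examples end up with no text features under the default pattern.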

@tabergma
Contributor

tabergma commented Jul 7, 2020

Thanks @howl-anderson for the comment. We actually already tackled problem 1 in #5905, which is merged into master.

Just to be sure: if you update your custom tokenizer and solve the token_pattern issue, is the problem gone?

@howl-anderson
Contributor

@tabergma It's good to see that the official team has already taken action on problem 1. For problem 2, I am still working on rewriting the tokenizer, but since all problems are gone when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed on whether updating the custom tokenizer works.

@JoaoVFelipe

Thanks @tabergma and @howl-anderson for the help; setting the token_pattern for CountVectorsFeaturizer solved the problem. I'm actually not training a bot in any Asian language, but some of my training data for recognizing out-of-scope languages contains Chinese, Japanese and Korean characters, and I hadn't noticed.

By the way, sorry for not sharing the NLU data before. It is pretty big, and I was instructed not to share it since some of it is enterprise sensitive. Thank you very much.
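
For anyone who wants to check whether their own training data contains such characters, a small scan along these lines may help. It is only a sketch: the function name `find_cjk_lines` and the sample lines are invented for illustration, and the pattern covers just the main Hiragana, Katakana, CJK Unified Ideographs and Hangul syllable blocks.

```python
import re
from typing import Iterable, List, Tuple

# Main Unicode blocks for Japanese, Chinese and Korean text.
CJK_PATTERN = re.compile(
    "["
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7a3"  # Hangul syllables
    "]"
)


def find_cjk_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, line) for every line containing a CJK character."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if CJK_PATTERN.search(line)
    ]


# Made-up sample in the style of an NLU markdown file.
sample = [
    "## intent:out_of_scope",
    "- hello there",
    "- 你好",
    "- こんにちは",
    "- 안녕하세요",
]

for number, line in find_cjk_lines(sample):
    print(number, line)
```

To scan a real file, the same function can be given an open file handle, since iterating over it yields lines.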

@howl-anderson
Contributor

@tabergma @shfshf has confirmed that updating the custom tokenizer does work! So I think at least part of @robinsongh381's issue is related to the custom tokenizer too, since his tokenizer works in v0.14.1 but doesn't work in v1.9.2. I hope this message helps him. If @robinsongh381 has trouble rewriting his custom tokenizer, I will do my best to help.

@shfshf

shfshf commented Jul 8, 2020

Thanks to my colleague @howl-anderson.
@robinsongh381 @JoaoVFelipe @tabergma, I successfully solved this bug with his solutions, using Chinese and the updated custom tokenizer.

@tabergma
Contributor

tabergma commented Jul 8, 2020

Great, glad to hear that it works for you! I will close the issue as there is nothing code-wise we can do. If you have trouble rewriting your tokenizers, feel free to ask a question on our forum. We are happy to help.
