[Diet Classifier] ValueError: Number of examples should be the same for all data. #5508

robinsongh381 · 2020-03-27T04:21:03Z

Rasa version: 1.9.2

Rasa SDK version (if used & relevant):

Rasa X version (if used & relevant):

Python version:
3.6
Operating system (windows, osx, ...):
linux
Issue:
When training Rasa NLU (i.e. rasa train nlu), a ValueError is raised from rasa/utils/tensorflow/model_data.py, line 107.

Error (including full traceback):

2020-03-27 13:13:06 INFO     rasa.nlu.model  - Starting to train component tokenizer_whitespace
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component RegexFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component CountVectorsFeaturizer
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Finished training component.
2020-03-27 13:13:09 INFO     rasa.nlu.model  - Starting to train component DIETClassifier
Traceback (most recent call last):
  File "/home/gunsu/diet/bin/rasa", line 8, in <module>
    sys.exit(main())
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/__main__.py", line 91, in main
    cmdline_arguments.func(cmdline_arguments)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/cli/train.py", line 140, in train_nlu
    persist_nlu_training_data=args.persist_nlu_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 414, in train_nlu
    persist_nlu_training_data,
  File "uvloop/loop.pyx", line 1456, in uvloop.loop.Loop.run_until_complete
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 445, in _train_nlu_async
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/train.py", line 474, in _train_nlu_with_validated_data
    persist_nlu_training_data=persist_nlu_training_data,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/train.py", line 86, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/model.py", line 191, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 622, in train
    model_data = self.preprocess_train_data(training_data)
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 601, in preprocess_train_data
    label_attribute=label_attribute,
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/nlu/classifiers/diet_classifier.py", line 549, in _create_model_data
    model_data.add_features(LABEL_FEATURES, [Y_sparse, Y_dense])
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 145, in add_features
    self.num_examples = self.number_of_examples()
  File "/home/gunsu/diet/lib/python3.6/site-packages/rasa/utils/tensorflow/model_data.py", line 107, in number_of_examples
    f"Number of examples differs for keys '{data.keys()}'. Number of "
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.

Command or request that led to error:

rasa train nlu -c ./bots/lib/config.yml -u ./bots/nlu_train.md --out ./models

Content of configuration file (config.yml) (if relevant):

language: "xx"
pipeline:
  - name: "component.KoreanTokenizer"
  - name: "intent_entity_featurizer_regex"
  - name: "intent_featurizer_count_vectors"
    "token_pattern": '(?u)\b\w+\b' # changed the regex so that single-character tokens are also recognized
  - name: DIETClassifier
    intent_classification: True
    entity_recognition: False
    use_masked_language_model: False
    BILOU_flag: False
    number_of_transformer_layers: 0
    epochs: 100

Content of domain file (domain.yml) (if relevant):


sara-tagger · 2020-03-27T07:21:45Z

Thanks for the issue, @rgstephens will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

Ghostvv · 2020-03-27T19:02:34Z

It looks like some examples don't have intent labels

robinsongh381 · 2020-03-28T06:13:11Z

@Ghostvv

Hi, thanks for the reply.

I had a look at my nlu.md file and didn't find any issues.

I trained Rasa NLU with the same nlu.md on an older version of rasa-nlu (0.14.1) and the training was successful, so I don't think it has to do with nlu.md.

Ghostvv · 2020-04-01T09:08:52Z

Otherwise, it could be that some examples couldn't be featurized for some reason.

Version 0.14.1 didn't have this check.
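
For readers wondering what this check looks like: the traceback shows `number_of_examples` in `model_data.py` raising when the feature arrays stored under different keys do not hold the same number of examples. The following is a minimal sketch of such a check, not Rasa's actual implementation. The function name mirrors the traceback, but the plain-dict data layout and the sample values are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def number_of_examples(data: Dict[str, List[Sequence]]) -> int:
    """Return the shared example count, or raise if the keys disagree.

    `data` maps a key such as "text_features" to a list of feature
    arrays; every array is expected to hold one row per training example.
    """
    # Collect the number of rows of every feature array under every key.
    counts = [len(features) for arrays in data.values() for features in arrays]
    if not counts:
        return 0

    # All arrays must agree; otherwise some example lost its features or label.
    if any(count != counts[0] for count in counts):
        raise ValueError(
            f"Number of examples differs for keys '{data.keys()}'. Number of "
            f"examples should be the same for all data."
        )
    return counts[0]


# Hypothetical data: three examples were featurized, but only two labels exist.
mismatched = {
    "text_features": [[[1, 0], [0, 1], [1, 1]]],
    "label_features": [[[1], [0]]],
}

try:
    number_of_examples(mismatched)
except ValueError as error:
    print(error)
```

In this sketch the mismatch is triggered by a missing label row, but a message whose text produced no features would cause the same kind of disagreement from the other side.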

stale · 2020-07-02T07:21:43Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

JoaoVFelipe · 2020-07-05T14:20:57Z

Any update on that? I'm getting the same issue here, using rasa 1.10.5.

shfshf · 2020-07-06T09:09:14Z

Me too, using Rasa 1.10.5:
ValueError: Number of examples differs for keys 'dict_keys(['text_features', 'label_features'])'. Number of examples should be the same for all data.

JoaoVFelipe · 2020-07-06T11:39:42Z

Hi!
As a temporary solution, I managed to do my bot training by downgrading to rasa 1.10.1. At least here, it issues some warnings, but finishes the training and works correctly.

shfshf · 2020-07-06T12:07:53Z

Hi!
When I use rasa 1.10.1, the same error is still reported.

tabergma · 2020-07-06T15:00:54Z

@shfshf @JoaoVFelipe Is one of you able to share his NLU data + config.yml so that I can take a closer look at the problem? Without the data to reproduce the issue it is hard to tell what is going wrong. Thanks.

howl-anderson · 2020-07-07T12:41:24Z

@robinsongh381 @JoaoVFelipe @tabergma @Ghostvv I am a colleague of @shfshf and provided the custom tokenizer component for his pipeline. I finally found that there are two root causes of this issue:

1. The same issue as "Rasa_nlu returns intent as null for training samples using tensorflow classifier(Chinese)" #1515. I gave a very detailed explanation there, and I think it affects all East Asian languages (Chinese, Korean and more).
2. An out-of-date custom tokenizer: the tokenizer I provided isn't compatible with current Rasa (1.10.5). Rasa changed the tokenizer protocol in 1.7.0 (https://github.com/RasaHQ/rasa/releases/tag/1.7.0):

"By default all tokenizer add a special token (CLS) to the end of the list of tokens. This token will be used to capture the features of the whole utterance."

Solutions:

1. Set token_pattern to "(?u)\b\w+\b" for CountVectorsFeaturizer if you are using an East Asian language (I will try to make a PR to make it the default option for East Asian language settings).
2. If you are using a custom tokenizer, check whether it supports the new tokenizer protocol. If it does not, try rewriting it based on one of the official tokenizers; the jieba tokenizer, for example, is a good one.
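
To see why the token_pattern matters, here is a minimal sketch using only Python's `re` module. It compares the two-or-more-character pattern `(?u)\b\w\w+\b` (scikit-learn's `CountVectorizer` default, and as far as I know the featurizer's default at the time) with the pattern suggested above. The example sentences and the `kept_tokens` helper are hypothetical, and the sketch assumes the tokenizer output is joined with spaces before the pattern is applied.

```python
import re
from typing import List

# scikit-learn's CountVectorizer default: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern suggested in this thread: a single word character is enough.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def kept_tokens(tokenized_text: str, token_pattern: str) -> List[str]:
    """Return the tokens a regex-based count vectorizer would keep."""
    return re.findall(token_pattern, tokenized_text)


# Hypothetical examples, already tokenized and joined with spaces.
examples = ["我 爱 你", "안녕 하세요", "hi there"]

for text in examples:
    print(
        text,
        kept_tokens(text, DEFAULT_PATTERN),
        kept_tokens(text, SINGLE_CHAR_PATTERN),
    )

# Examples that would end up without any tokens under the default pattern.
unfeaturized = [text for text in examples if not kept_tokens(text, DEFAULT_PATTERN)]
print(unfeaturized)
```

An example made up entirely of single-character tokens keeps nothing under the default pattern, which is one way an example can end up without text features.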

tabergma · 2020-07-07T12:47:37Z

Thanks @howl-anderson for the comment. We actually already tackled problem 1 in #5905. It is already merged into master.

Just to be sure: if you update your custom tokenizer and solve the token_pattern issue, is the problem gone?

howl-anderson · 2020-07-07T13:04:19Z

@tabergma It's good to see that the official team has already taken action on problem 1. For problem 2, I am still rewriting the tokenizer, but since every problem disappears when we use jieba as the tokenizer, there is definitely something wrong with the custom tokenizer. I will keep you informed about whether updating the custom tokenizer works.

JoaoVFelipe · 2020-07-07T13:45:15Z

Thanks @tabergma and @howl-anderson for the help, setting the token_pattern for CountVectorsFeaturizer solved the problem. I am actually not training a bot in any Asian language, but some of my training data for recognizing out-of-scope languages contains Chinese, Japanese and Korean characters, and I hadn't noticed.

By the way, sorry for not sharing the NLU data before. It is pretty big, and I was instructed to not share it since some of it is enterprise sensitive. Thank you very much.

howl-anderson · 2020-07-08T03:03:01Z

@tabergma @shfshf has confirmed that updating the custom tokenizer indeed works! So I think at least part of @robinsongh381's issue is related to the custom tokenizer too, since his tokenizer works in v0.14.1 but doesn't work in v1.9.2. I hope this message can help him. If @robinsongh381 has trouble rewriting his custom tokenizer, I will do my best to help him.
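
The protocol change quoted above can be pictured with a minimal, framework-free sketch: a tokenizer that appends one extra CLS token after the regular tokens. This does not use Rasa's classes. The `Token` dataclass, the `tokenize_with_cls` helper, and the `__CLS__` string are assumptions for illustration, so check one of the official tokenizers for the real interface.

```python
from dataclasses import dataclass
from typing import List

# Assumed placeholder text for the special token.
CLS_TOKEN = "__CLS__"


@dataclass
class Token:
    text: str
    start: int  # character offset of the token in the original message


def tokenize_with_cls(text: str) -> List[Token]:
    """Split on whitespace and append a CLS token at the end."""
    tokens: List[Token] = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        tokens.append(Token(word, start))
        offset = start + len(word)

    # Place the CLS token one position past the end of the last token.
    cls_start = tokens[-1].start + len(tokens[-1].text) + 1 if tokens else 0
    tokens.append(Token(CLS_TOKEN, cls_start))
    return tokens


print([token.text for token in tokenize_with_cls("我 爱 你")])
```

A custom tokenizer written before this change would return only the regular tokens, without the trailing CLS entry.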

shfshf · 2020-07-08T03:15:47Z

Thanks @howl-anderson, my colleague.
@robinsongh381 @JoaoVFelipe @tabergma, I successfully fixed this bug with his solutions, using the updated custom tokenizer for Chinese.

tabergma · 2020-07-08T06:13:51Z

Great, glad to hear that it works for you! I will close the issue as there is nothing we can do code-wise. If you have trouble rewriting your tokenizers, feel free to ask a question on our forum. We are happy to help.

robinsongh381 added the type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. label Mar 27, 2020

robinsongh381 changed the title ~~[Diet Classifier]~~ [Diet Classifier] ValueError: Number of examples should be the same for all data. Mar 27, 2020

stale bot added the stale label Jul 2, 2020

stale bot removed the stale label Jul 5, 2020

tabergma closed this as completed Jul 8, 2020

howl-anderson mentioned this issue Dec 3, 2020

Add use_word_boundaries option for RegexFeaturizer and RegexEntityExtractor #7422

Merged

4 tasks
