
Flexible NLU pipeline #5863

Merged Jun 5, 2020 · 57 commits (changes shown from 37 commits)

Commits
d0c173b
create Features class
tabergma May 4, 2020
d0a22a8
Draft message.get_features
tabergma May 4, 2020
20bb3b5
fix get_sparse/dense_features
tabergma May 5, 2020
9bd4e8b
padding lower dim features
tabergma May 5, 2020
a9efca9
update DIETClassifier
tabergma May 6, 2020
313478d
training works
tabergma May 7, 2020
4b64f10
prediction works
tabergma May 7, 2020
c89d8af
refactoring
tabergma May 7, 2020
b4f2a01
Merge branch 'master' into nlu-configuration
tabergma May 7, 2020
5a7c97f
convert featurizer is independent from tokenizer
tabergma May 7, 2020
0bafdfd
set eager mode to False again
tabergma May 7, 2020
857f10f
naming
tabergma May 11, 2020
8775bab
naming
tabergma May 11, 2020
ca4a653
check if additional ffn is needed
tabergma May 11, 2020
57280e6
Merge branch 'master' into nlu-configuration
tabergma May 11, 2020
6f9685b
fix concat bug
tabergma May 12, 2020
062af61
use sparse dense dim
tabergma May 12, 2020
707e1cc
remove convert tokenizer
tabergma May 13, 2020
05fda3f
remove not needed constants
tabergma May 13, 2020
b14464c
start fixing tests
tabergma May 13, 2020
096cb29
use dense dim
tabergma May 13, 2020
3f9df89
fix more tests
tabergma May 13, 2020
71c8c41
fix more tests
tabergma May 13, 2020
892a122
fix testing
tabergma May 13, 2020
c602b26
Merge branch 'master' into nlu-configuration
tabergma May 14, 2020
5154a85
revert changes in classifiers
tabergma May 18, 2020
4796b00
update featurizers
tabergma May 18, 2020
7f33431
update tests
tabergma May 19, 2020
ec7b9c4
clean up
tabergma May 19, 2020
f2d630e
fix tests
tabergma May 19, 2020
0ea9f7c
add changelog
tabergma May 19, 2020
78498f3
update docs
tabergma May 19, 2020
812f5f1
Merge branch 'master' into flexible-nlu-pipeline
tabergma May 19, 2020
ee72fd6
increase version to 1.11.0a2
tabergma May 19, 2020
2f3e95f
fix crf entity extractor
tabergma May 19, 2020
344a66a
Fix dense features in CRFEntityExtractor
tabergma May 20, 2020
3ab7b4f
Merge branch 'master' into flexible-nlu-pipeline
tabergma May 20, 2020
fd1d2d5
Create method 'features_present'
tabergma May 20, 2020
f603454
update tests
tabergma May 20, 2020
aabb4f6
address deepsource issues
tabergma May 20, 2020
ad4d8ce
fix types
tabergma May 20, 2020
f128a6f
set alias name automatically if not present
tabergma May 20, 2020
dc756b4
Update docs
tabergma May 20, 2020
6620987
Add docstrings
tabergma May 20, 2020
ae09a1b
update the changelog
tabergma May 20, 2020
9080b7e
fix changelog entry
tabergma May 20, 2020
45e4a37
fix tests
tabergma May 20, 2020
b9ddc79
Merge branch 'master' into flexible-nlu-pipeline
Ghostvv May 25, 2020
5ed1d5f
Merge branch 'master' into flexible-nlu-pipeline
tabergma Jun 4, 2020
1303df4
update docs
tabergma Jun 4, 2020
624d9df
rename ALIAS to FEATURIZER_CLASS_ALIAS
tabergma Jun 4, 2020
47cb08e
update docstrings
tabergma Jun 4, 2020
743f9fc
review comments
tabergma Jun 4, 2020
9f0e8bf
fix incorrect import
tabergma Jun 4, 2020
754a345
fix incorrect import
tabergma Jun 4, 2020
ad4a26b
add test
tabergma Jun 4, 2020
1766a53
fix issue in convert featurizer process
tabergma Jun 5, 2020
34 changes: 34 additions & 0 deletions changelog/5510.feature.rst
@@ -0,0 +1,34 @@
You can now define which features should be used by which component.

You can set an alias for every featurizer in your pipeline.
You can then specify, for example on the :ref:`diet-classifier`, which featurizers' features should be used.
If you don't set the ``featurizers`` option, all available features are used; this is also the default behaviour.

Here is an example pipeline that shows the new option:

.. code-block:: yaml

    pipeline:
    - name: ConveRTTokenizer
    - name: ConveRTFeaturizer
      alias: "convert"
    - name: CountVectorsFeaturizer
      alias: "cvf_word"
    - name: CountVectorsFeaturizer
      alias: "cvf_char"
      analyzer: char_wb
      min_ngram: 1
      max_ngram: 4
    - name: RegexFeaturizer
      alias: "regex"
    - name: LexicalSyntacticFeaturizer
      alias: "lsf"
    - name: DIETClassifier
      featurizers: ["convert", "cvf_word", "cvf_char", "regex", "lsf"]
    - name: ResponseSelector
      epochs: 50
      featurizers: ["convert", "cvf_word"]
    - name: EntitySynonymMapper

.. warning::
    This change breaks model compatibility. Please retrain your models.
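To illustrate the mechanics behind the ``featurizers`` option, here is a small, self-contained Python sketch. It is not the actual Rasa API: the ``Feature`` class and ``filter_features`` helper below are illustrative stand-ins for how features can be filtered by the alias of the featurizer that produced them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    origin: str   # alias of the featurizer that created the features
    matrix: list  # the feature values (stand-in for a numpy/scipy matrix)


def filter_features(features: List[Feature], featurizers: List[str]) -> List[Feature]:
    """Keep only features whose origin alias is listed; an empty list means 'use all'."""
    if not featurizers:
        return features
    return [f for f in features if f.origin in featurizers]


all_features = [
    Feature("convert", [[0.1, 0.2]]),
    Feature("cvf_word", [[1, 0]]),
    Feature("regex", [[0, 1]]),
]

# A component configured with featurizers: ["convert", "cvf_word"] keeps two of three.
print([f.origin for f in filter_features(all_features, ["convert", "cvf_word"])])
# -> ['convert', 'cvf_word']

# An empty featurizers list keeps everything (the default behaviour).
print([f.origin for f in filter_features(all_features, [])])
# -> ['convert', 'cvf_word', 'regex']
```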
117 changes: 72 additions & 45 deletions docs/nlu/components.rst
@@ -359,6 +359,8 @@ MitieFeaturizer
# Specify what pooling operation should be used to calculate the vector of
# the __CLS__ token. Available options: 'mean' and 'max'.
"pooling": "mean"
# alias name of the featurizer
"alias": "mitie_featurizer"


.. _SpacyFeaturizer:
@@ -386,6 +388,8 @@ SpacyFeaturizer
# Specify what pooling operation should be used to calculate the vector of
# the __CLS__ token. Available options: 'mean' and 'max'.
"pooling": "mean"
# alias name of the featurizer
"alias": "spacy_featurizer"


.. _ConveRTFeaturizer:
@@ -417,6 +421,8 @@ ConveRTFeaturizer

pipeline:
- name: "ConveRTFeaturizer"
# alias name of the featurizer
"alias": "convert_featurizer"


.. _LanguageModelFeaturizer:
@@ -447,6 +453,8 @@ LanguageModelFeaturizer

pipeline:
- name: "LanguageModelFeaturizer"
# alias name of the featurizer
"alias": "language_model_featurizer"


.. _RegexFeaturizer:
@@ -474,6 +482,8 @@ RegexFeaturizer

pipeline:
- name: "RegexFeaturizer"
# alias name of the featurizer
"alias": "regex_featurizer"

.. _CountVectorsFeaturizer:

@@ -560,6 +570,8 @@ CountVectorsFeaturizer
"OOV_token": "_oov_"
# Whether to use a shared vocab
"use_shared_vocab": False
# alias name of the featurizer
"alias": "count_vector_featurizer"

.. container:: toggle

@@ -570,51 +582,53 @@ CountVectorsFeaturizer

.. code-block:: none

+-------------------+-------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===================+===================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+-------------------+-------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option ‘char_wb’ creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+-------------------+-------------------+--------------------------------------------------------------+
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression used to detect tokens. |
| | | Only used if 'analyzer' is set to 'word'. |
+-------------------+-------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+-------------------+-------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+-------------------+-------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+-------------------+-------------------+--------------------------------------------------------------+
+-------------------+-------------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===================+=========================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+-------------------+-------------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option ‘char_wb’ creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression used to detect tokens. |
| | | Only used if 'analyzer' is set to 'word'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+-------------------+-------------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+-------------------+-------------------------+--------------------------------------------------------------+
| alias             | count_vector_featurizer | Alias name of the featurizer.                                |
+-------------------+-------------------------+--------------------------------------------------------------+


.. _LexicalSyntacticFeaturizer:
@@ -672,6 +686,8 @@ LexicalSyntacticFeaturizer
["BOS", "EOS", "low", "upper", "title", "digit"],
["low", "title", "upper"],
]
# alias name of the featurizer
"alias": "lexical_syntactic_featurizer"

This configuration is also the default configuration.

@@ -1225,6 +1241,9 @@ CRFEntityExtractor
"L1_c": 0.1
# weight of the L2 regularization
"L2_c": 0.1
# Names (aliases) of the dense featurizers to use.
# If the list is empty, all available dense features are used.
"featurizers": []

.. note::
If POS features are used (``pos`` or ``pos2``), you need to have ``SpacyTokenizer`` in your pipeline.
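As a usage sketch of the new option, the pipeline below limits ``CRFEntityExtractor`` to the dense features of a single featurizer via its alias. The alias names here are illustrative, mirroring the changelog example; they are not defaults.

```yaml
pipeline:
- name: ConveRTTokenizer
- name: ConveRTFeaturizer
  alias: "convert"
- name: LexicalSyntacticFeaturizer
  alias: "lsf"
- name: CRFEntityExtractor
  # use only the dense features produced by the featurizer aliased "convert"
  featurizers: ["convert"]
```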
@@ -1513,6 +1532,10 @@ ResponseSelector
| | | logged. Either after every epoch ("epoch") or for every |
| | | training step ("minibatch"). |
+---------------------------------+-------------------+--------------------------------------------------------------+
| featurizers                     | []                | List of featurizer names (alias names). Only features        |
|                                 |                   | coming from the listed names are used. If the list is        |
|                                 |                   | empty, all available features are used.                      |
+---------------------------------+-------------------+--------------------------------------------------------------+

.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
@@ -1749,6 +1772,10 @@ DIETClassifier
| | | logged. Either after every epoch ('epoch') or for every |
| | | training step ('minibatch'). |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers                     | []               | List of featurizer names (alias names). Only features        |
|                                 |                  | coming from the listed names are used. If the list is        |
|                                 |                  | empty, all available features are used.                      |
+---------------------------------+------------------+--------------------------------------------------------------+

.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -9,7 +9,7 @@ exclude = "((.eggs | .git | .pytype | .pytest_cache | build | dist))"

[tool.poetry]
name = "rasa"
version = "1.11.0a1"
version = "1.11.0a2"
description = "Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants"
authors = [ "Rasa Technologies GmbH <hi@rasa.com>",]
maintainers = [ "Tom Bocklisch <tom@rasa.com>",]
2 changes: 1 addition & 1 deletion rasa/constants.py
@@ -53,7 +53,7 @@
CONFIG_MANDATORY_KEYS_NLU = ["language", "pipeline"]
CONFIG_MANDATORY_KEYS = CONFIG_MANDATORY_KEYS_CORE + CONFIG_MANDATORY_KEYS_NLU

MINIMUM_COMPATIBLE_VERSION = "1.11.0a1"
MINIMUM_COMPATIBLE_VERSION = "1.11.0a2"

GLOBAL_USER_CONFIG_PATH = os.path.expanduser("~/.config/rasa/global.yml")
