Flexible NLU pipeline #5863

Merged on Jun 5, 2020 (57 commits)

Commits
d0c173b
create Features class
tabergma May 4, 2020
d0a22a8
Draft message.get_features
tabergma May 4, 2020
20bb3b5
fix get_sparse/dense_features
tabergma May 5, 2020
9bd4e8b
padding lower dim features
tabergma May 5, 2020
a9efca9
update DIETClassifier
tabergma May 6, 2020
313478d
training works
tabergma May 7, 2020
4b64f10
prediction works
tabergma May 7, 2020
c89d8af
refactoring
tabergma May 7, 2020
b4f2a01
Merge branch 'master' into nlu-configuration
tabergma May 7, 2020
5a7c97f
convert featurizer is independent from tokenizer
tabergma May 7, 2020
0bafdfd
set eager mode to False again
tabergma May 7, 2020
857f10f
naming
tabergma May 11, 2020
8775bab
naming
tabergma May 11, 2020
ca4a653
check if additional ffn is needed
tabergma May 11, 2020
57280e6
Merge branch 'master' into nlu-configuration
tabergma May 11, 2020
6f9685b
fix concat bug
tabergma May 12, 2020
062af61
use sparse dense dim
tabergma May 12, 2020
707e1cc
remove convert tokenizer
tabergma May 13, 2020
05fda3f
remove not needed constants
tabergma May 13, 2020
b14464c
start fixing tests
tabergma May 13, 2020
096cb29
use dense dim
tabergma May 13, 2020
3f9df89
fix more tests
tabergma May 13, 2020
71c8c41
fix more tests
tabergma May 13, 2020
892a122
fix testing
tabergma May 13, 2020
c602b26
Merge branch 'master' into nlu-configuration
tabergma May 14, 2020
5154a85
revert changes in classifiers
tabergma May 18, 2020
4796b00
update featurizers
tabergma May 18, 2020
7f33431
update tests
tabergma May 19, 2020
ec7b9c4
clean up
tabergma May 19, 2020
f2d630e
fix tests
tabergma May 19, 2020
0ea9f7c
add changelog
tabergma May 19, 2020
78498f3
update docs
tabergma May 19, 2020
812f5f1
Merge branch 'master' into flexible-nlu-pipeline
tabergma May 19, 2020
ee72fd6
increase version to 1.11.0a2
tabergma May 19, 2020
2f3e95f
fix crf entity extractor
tabergma May 19, 2020
344a66a
Fix dense features in CRFEntityExtractor
tabergma May 20, 2020
3ab7b4f
Merge branch 'master' into flexible-nlu-pipeline
tabergma May 20, 2020
fd1d2d5
Create method 'features_present'
tabergma May 20, 2020
f603454
update tests
tabergma May 20, 2020
aabb4f6
address deepsource issues
tabergma May 20, 2020
ad4d8ce
fix types
tabergma May 20, 2020
f128a6f
set alias name automatically if not present
tabergma May 20, 2020
dc756b4
Update docs
tabergma May 20, 2020
6620987
Add docstrings
tabergma May 20, 2020
ae09a1b
update the changelog
tabergma May 20, 2020
9080b7e
fix changelog entry
tabergma May 20, 2020
45e4a37
fix tests
tabergma May 20, 2020
b9ddc79
Merge branch 'master' into flexible-nlu-pipeline
Ghostvv May 25, 2020
5ed1d5f
Merge branch 'master' into flexible-nlu-pipeline
tabergma Jun 4, 2020
1303df4
update docs
tabergma Jun 4, 2020
624d9df
rename ALIAS to FEATURIZER_CLASS_ALIAS
tabergma Jun 4, 2020
47cb08e
update docstrings
tabergma Jun 4, 2020
743f9fc
review comments
tabergma Jun 4, 2020
9f0e8bf
fix incorrect import
tabergma Jun 4, 2020
754a345
fix incorrect import
tabergma Jun 4, 2020
ad4a26b
add test
tabergma Jun 4, 2020
1766a53
fix issue in convert featurizer process
tabergma Jun 5, 2020
40 changes: 40 additions & 0 deletions changelog/5510.feature.rst
@@ -0,0 +1,40 @@
You can now define which features should be used by which component (see :ref:`choosing-a-pipeline`).

You can set an alias via the option ``alias`` for every featurizer in your pipeline.
The ``alias`` can be anything; by default it is set to the full featurizer class name.
You can then specify, for example, on the :ref:`diet-classifier` which features from which featurizers should be used.
If you don't set the option ``featurizers``, all available features are used; this is the default behavior.
Check :ref:`components` to see what components have the option ``featurizers`` available.
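
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ``resolve_alias`` and ``select_features`` are hypothetical helpers, not part of Rasa's API, and features are represented as plain lists keyed by alias.

```python
from typing import Dict, List, Optional


def resolve_alias(component: Dict[str, object]) -> str:
    """Return the configured alias, falling back to the component (class) name."""
    return str(component.get("alias", component["name"]))


def select_features(
    features_by_alias: Dict[str, List[float]],
    featurizers: Optional[List[str]] = None,
) -> Dict[str, List[float]]:
    """Keep only features whose alias is listed; an empty or missing list keeps all."""
    if not featurizers:
        return dict(features_by_alias)
    return {
        alias: feats
        for alias, feats in features_by_alias.items()
        if alias in featurizers
    }


# Hypothetical features, keyed by the alias of the featurizer that produced them.
features = {"convert": [0.1, 0.2], "cvf_word": [1.0, 0.0], "regex": [0.0]}

diet_features = select_features(features)  # no restriction: everything is used
selector_features = select_features(features, ["convert", "cvf_word"])
```

With no ``featurizers`` list the component sees every feature; with a list it sees only the features whose alias matches.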

Here is an example pipeline that shows the new option.
We define an alias for all featurizers in the pipeline.
All features will be used in the ``DIETClassifier``.
However, the ``ResponseSelector`` only takes the features from the ``ConveRTFeaturizer`` and the
``CountVectorsFeaturizer`` (word level).

.. code-block:: none

    pipeline:
      - name: ConveRTTokenizer
      - name: ConveRTFeaturizer
        alias: "convert"
      - name: CountVectorsFeaturizer
        alias: "cvf_word"
      - name: CountVectorsFeaturizer
        alias: "cvf_char"
        analyzer: char_wb
        min_ngram: 1
        max_ngram: 4
      - name: RegexFeaturizer
        alias: "regex"
      - name: LexicalSyntacticFeaturizer
        alias: "lsf"
      - name: DIETClassifier
      - name: ResponseSelector
        epochs: 50
        featurizers: ["convert", "cvf_word"]
      - name: EntitySynonymMapper

.. warning::
    This change is model-breaking. Please retrain your models.
23 changes: 23 additions & 0 deletions data/configs_for_docs/config_featurizers.yml
@@ -0,0 +1,23 @@
language: "en"

pipeline:
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
    alias: "convert"
  - name: RegexFeaturizer
    alias: "regex"
  - name: LexicalSyntacticFeaturizer
    alias: "lexical-syntactic"
  - name: CountVectorsFeaturizer
    alias: "cvf-word"
  - name: CountVectorsFeaturizer
    alias: "cvf-char"
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100
  - name: EntitySynonymMapper
  - name: ResponseSelector
    featurizers: ["convert", "cvf-word"]
    epochs: 100
18 changes: 17 additions & 1 deletion docs/nlu/choosing-a-pipeline.rst
@@ -181,7 +181,6 @@ You should only use featurizers from the category :ref:`sparse featurizers <text
:ref:`CountVectorsFeaturizer`, :ref:`RegexFeaturizer` or :ref:`LexicalSyntacticFeaturizer`, if you don't want to use
pre-trained word embeddings.


Entity Recognition / Intent Classification / Response Selectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -191,6 +190,23 @@ We support several components for each of the tasks. All of them are listed in :
We recommend using :ref:`diet-classifier` for intent classification and entity recognition
and :ref:`response-selector` for response selection.

By default, all of these components consume all available features produced in the pipeline.
However, sometimes it makes sense to restrict the features that are used by a specific component.
For example, :ref:`response-selector` is likely to perform better if no features from the
:ref:`RegexFeaturizer` or :ref:`LexicalSyntacticFeaturizer` are used.
To achieve that, set an alias for every featurizer in your pipeline via the option ``alias``.
By default, the alias is set to the full featurizer class name, for example, ``RegexFeaturizer``.
You can then specify, for example, on the :ref:`response-selector` via the option ``featurizers``, which features from
which featurizers should be used.
If you don't set the option ``featurizers``, all available features will be used.
To check which components have the option ``featurizers`` available, see :ref:`components`.

Here is an example configuration file where the ``DIETClassifier`` is using all available features and the
``ResponseSelector`` is just using the features from the ``ConveRTFeaturizer`` and the ``CountVectorsFeaturizer``.

.. literalinclude:: ../../data/configs_for_docs/config_featurizers.yml
    :language: yaml
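
A configuration like the one above can be sanity-checked before training. The following is a minimal sketch, not Rasa's own validation: ``unknown_featurizers`` is a hypothetical helper, the pipeline is given as plain dictionaries, and it assumes that a featurizer is any component whose name ends in ``Featurizer`` and that it must appear before the component that references its alias.

```python
from typing import Any, Dict, List


def unknown_featurizers(pipeline: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map component name -> requested aliases that no earlier featurizer defines."""
    known_aliases = set()
    problems: Dict[str, List[str]] = {}
    for component in pipeline:
        name = str(component["name"])
        requested = component.get("featurizers", [])
        missing = [alias for alias in requested if alias not in known_aliases]
        if missing:
            problems[name] = missing
        # Assumption: featurizers are recognised by their class-name suffix,
        # and the alias defaults to the class name when none is configured.
        if name.endswith("Featurizer"):
            known_aliases.add(str(component.get("alias", name)))
    return problems


# Shortened version of the example configuration, as Python dictionaries.
pipeline = [
    {"name": "ConveRTTokenizer"},
    {"name": "ConveRTFeaturizer", "alias": "convert"},
    {"name": "RegexFeaturizer", "alias": "regex"},
    {"name": "CountVectorsFeaturizer", "alias": "cvf-word"},
    {"name": "DIETClassifier", "epochs": 100},
    {"name": "ResponseSelector", "featurizers": ["convert", "cvf-word"]},
]

problems = unknown_featurizers(pipeline)
```

An empty result means every alias listed under ``featurizers`` refers to a featurizer defined earlier in the pipeline; a misspelled alias such as ``cvf_word`` would be reported under ``ResponseSelector``.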

Multi-Intent Classification
***************************
104 changes: 59 additions & 45 deletions docs/nlu/components.rst
@@ -328,6 +328,7 @@ This feature vector can be used in any bag-of-words model.
The corresponding classifier can therefore decide what kind of features to use.



.. _MitieFeaturizer:

MitieFeaturizer
@@ -570,51 +571,53 @@ CountVectorsFeaturizer

.. code-block:: none

+-------------------+-------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===================+===================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+-------------------+-------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option ‘char_wb’ creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+-------------------+-------------------+--------------------------------------------------------------+
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression used to detect tokens. |
| | | Only used if 'analyzer' is set to 'word'. |
+-------------------+-------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+-------------------+-------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+-------------------+-------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+-------------------+-------------------+--------------------------------------------------------------+
+-------------------+-------------------------+--------------------------------------------------------------+
| Parameter         | Default Value           | Description                                                  |
+===================+=========================+==============================================================+
| use_shared_vocab  | False                   | If set to 'True' a common vocabulary is used for labels      |
|                   |                         | and user message.                                            |
+-------------------+-------------------------+--------------------------------------------------------------+
| analyzer          | word                    | Whether the features should be made of word n-grams or       |
|                   |                         | character n-grams. Option 'char_wb' creates character        |
|                   |                         | n-grams only from text inside word boundaries;               |
|                   |                         | n-grams at the edges of words are padded with space.         |
|                   |                         | Valid values: 'word', 'char', 'char_wb'.                     |
+-------------------+-------------------------+--------------------------------------------------------------+
| token_pattern     | r"(?u)\b\w\w+\b"        | Regular expression used to detect tokens.                    |
|                   |                         | Only used if 'analyzer' is set to 'word'.                    |
+-------------------+-------------------------+--------------------------------------------------------------+
| strip_accents     | None                    | Remove accents during the pre-processing step.               |
|                   |                         | Valid values: 'ascii', 'unicode', 'None'.                    |
+-------------------+-------------------------+--------------------------------------------------------------+
| stop_words        | None                    | A list of stop words to use.                                 |
|                   |                         | Valid values: 'english' (uses an internal list of            |
|                   |                         | English stop words), a list of custom stop words, or         |
|                   |                         | 'None'.                                                      |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_df            | 1                       | When building the vocabulary ignore terms that have a        |
|                   |                         | document frequency strictly lower than the given threshold.  |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_df            | 1                       | When building the vocabulary ignore terms that have a        |
|                   |                         | document frequency strictly higher than the given threshold  |
|                   |                         | (corpus-specific stop words).                                |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_ngram         | 1                       | The lower boundary of the range of n-values for different    |
|                   |                         | word n-grams or char n-grams to be extracted.                |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_ngram         | 1                       | The upper boundary of the range of n-values for different    |
|                   |                         | word n-grams or char n-grams to be extracted.                |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_features      | None                    | If not 'None', build a vocabulary that only considers the    |
|                   |                         | top max_features ordered by term frequency across the        |
|                   |                         | corpus.                                                      |
+-------------------+-------------------------+--------------------------------------------------------------+
| lowercase         | True                    | Convert all characters to lowercase before tokenizing.       |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_token         | None                    | Keyword for unseen words.                                    |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_words         | []                      | List of words to be treated as 'OOV_token' during training.  |
+-------------------+-------------------------+--------------------------------------------------------------+
| alias             | CountVectorsFeaturizer  | Alias name of featurizer.                                    |
+-------------------+-------------------------+--------------------------------------------------------------+
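
To make the ``char_wb`` description above concrete, here is a simplified pure-Python sketch of character n-grams built only inside word boundaries, with each word padded by a space on both sides. ``char_wb_ngrams`` is a hypothetical helper written for illustration; it is not the code the featurizer runs, and it simply skips n-values longer than the padded word.

```python
from typing import List


def char_wb_ngrams(text: str, min_ngram: int = 1, max_ngram: int = 1) -> List[str]:
    """Character n-grams per word; each word is padded with one space per side."""
    ngrams: List[str] = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_ngram, max_ngram + 1):
            if n > len(padded):
                # Simplification: ignore n-values longer than the padded word.
                continue
            # Slide a window of width n over the padded word.
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


example = char_wb_ngrams("hi", min_ngram=2, max_ngram=3)
# example == [" h", "hi", "i ", " hi", "hi "]
```

Because every word is padded separately, no n-gram ever spans two words, which is the property the table attributes to ``char_wb``.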


.. _LexicalSyntacticFeaturizer:
@@ -1038,6 +1041,9 @@ CRFEntityExtractor
"L1_c": 0.1
# weight of the L2 regularization
"L2_c": 0.1
# Name of dense featurizers to use.
# If list is empty all available dense features are used.
"featurizers": []

.. note::
    If POS features are used (``pos`` or ``pos2``), you need to have ``SpacyTokenizer`` in your pipeline.
@@ -1326,6 +1332,10 @@ ResponseSelector
|                                 |                   | logged. Either after every epoch ("epoch") or for every      |
|                                 |                   | training step ("minibatch").                                 |
+---------------------------------+-------------------+--------------------------------------------------------------+
| featurizers                     | []                | List of featurizer names (alias names). Only features        |
|                                 |                   | coming from the listed names are used. If list is empty      |
|                                 |                   | all available features are used.                             |
+---------------------------------+-------------------+--------------------------------------------------------------+

.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
@@ -1562,6 +1572,10 @@ DIETClassifier
|                                 |                  | logged. Either after every epoch ('epoch') or for every      |
|                                 |                  | training step ('minibatch').                                 |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers                     | []               | List of featurizer names (alias names). Only features        |
|                                 |                  | coming from the listed names are used. If list is empty      |
|                                 |                  | all available features are used.                             |
+---------------------------------+------------------+--------------------------------------------------------------+

.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -9,7 +9,7 @@ exclude = "((.eggs | .git | .pytype | .pytest_cache | build | dist))"

[tool.poetry]
name = "rasa"
version = "1.11.0a1"
version = "1.11.0a2"
description = "Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants"
authors = [ "Rasa Technologies GmbH <hi@rasa.com>",]
maintainers = [ "Tom Bocklisch <tom@rasa.com>",]
2 changes: 1 addition & 1 deletion rasa/constants.py
@@ -53,7 +53,7 @@
CONFIG_MANDATORY_KEYS_NLU = ["language", "pipeline"]
CONFIG_MANDATORY_KEYS = CONFIG_MANDATORY_KEYS_CORE + CONFIG_MANDATORY_KEYS_NLU

MINIMUM_COMPATIBLE_VERSION = "1.11.0a1"
MINIMUM_COMPATIBLE_VERSION = "1.11.0a2"

GLOBAL_USER_CONFIG_PATH = os.path.expanduser("~/.config/rasa/global.yml")
