Flexible NLU pipeline (#5863)
tabergma authored Jun 5, 2020
1 parent aa4c97d commit 815a9af
Showing 41 changed files with 830 additions and 491 deletions.
40 changes: 40 additions & 0 deletions changelog/5510.feature.rst
@@ -0,0 +1,40 @@
You can now define which features should be used by which component (see :ref:`choosing-a-pipeline`).

You can set an alias via the option ``alias`` for every featurizer in your pipeline.
The ``alias`` can be anything; by default it is set to the full featurizer class name.
You can then specify, for example, on the :ref:`diet-classifier`, which featurizers' features should be used.
If you don't set the option ``featurizers``, all available features will be used; this is also the default behavior.
Check :ref:`components` to see which components have the option ``featurizers`` available.

Here is an example pipeline that shows the new option.
We define an alias for all featurizers in the pipeline.
All features will be used in the ``DIETClassifier``.
However, the ``ResponseSelector`` only takes the features from the ``ConveRTFeaturizer`` and the
``CountVectorsFeaturizer`` (word level).

.. code-block:: yaml

    pipeline:
    - name: ConveRTTokenizer
    - name: ConveRTFeaturizer
      alias: "convert"
    - name: CountVectorsFeaturizer
      alias: "cvf_word"
    - name: CountVectorsFeaturizer
      alias: "cvf_char"
      analyzer: char_wb
      min_ngram: 1
      max_ngram: 4
    - name: RegexFeaturizer
      alias: "regex"
    - name: LexicalSyntacticFeaturizer
      alias: "lsf"
    - name: DIETClassifier
    - name: ResponseSelector
      epochs: 50
      featurizers: ["convert", "cvf_word"]
    - name: EntitySynonymMapper

.. warning::
    This change is model-breaking. Please retrain your models.
23 changes: 23 additions & 0 deletions data/configs_for_docs/config_featurizers.yml
@@ -0,0 +1,23 @@
language: "en"

pipeline:
- name: ConveRTTokenizer
- name: ConveRTFeaturizer
  alias: "convert"
- name: RegexFeaturizer
  alias: "regex"
- name: LexicalSyntacticFeaturizer
  alias: "lexical-syntactic"
- name: CountVectorsFeaturizer
  alias: "cvf-word"
- name: CountVectorsFeaturizer
  alias: "cvf-char"
  analyzer: "char_wb"
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 100
- name: EntitySynonymMapper
- name: ResponseSelector
  featurizers: ["convert", "cvf-word"]
  epochs: 100
18 changes: 17 additions & 1 deletion docs/nlu/choosing-a-pipeline.rst
@@ -181,7 +181,6 @@ You should only use featurizers from the category :ref:`sparse featurizers <text
:ref:`CountVectorsFeaturizer`, :ref:`RegexFeaturizer` or :ref:`LexicalSyntacticFeaturizer`, if you don't want to use
pre-trained word embeddings.


Entity Recognition / Intent Classification / Response Selectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -191,6 +190,23 @@ We support several components for each of the tasks. All of them are listed in :
We recommend using :ref:`diet-classifier` for intent classification and entity recognition
and :ref:`response-selector` for response selection.

By default all of these components consume all available features produced in the pipeline.
However, sometimes it makes sense to restrict the features that are used by a specific component.
For example, :ref:`response-selector` is likely to perform better if no features from the
:ref:`RegexFeaturizer` or :ref:`LexicalSyntacticFeaturizer` are used.
To achieve that, you can do the following:
Set an alias for every featurizer in your pipeline via the option ``alias``.
By default, the alias is set to the full featurizer class name, for example, ``RegexFeaturizer``.
You can then specify, for example, on the :ref:`response-selector` via the option ``featurizers``
which featurizers' features should be used.
If you don't set the option ``featurizers``, all available features will be used.
To check which components have the option ``featurizers`` available, see :ref:`components`.

Here is an example configuration file where the ``DIETClassifier`` uses all available features and the
``ResponseSelector`` uses only the features from the ``ConveRTFeaturizer`` and the ``CountVectorsFeaturizer``.

.. literalinclude:: ../../data/configs_for_docs/config_featurizers.yml
:language: yaml

Multi-Intent Classification
***************************
104 changes: 59 additions & 45 deletions docs/nlu/components.rst
@@ -328,6 +328,7 @@ This feature vector can be used in any bag-of-words model.
The corresponding classifier can therefore decide what kind of features to use.



.. _MitieFeaturizer:

MitieFeaturizer
@@ -570,51 +571,53 @@ CountVectorsFeaturizer

.. code-block:: none
+-------------------+-------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===================+===================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+-------------------+-------------------+--------------------------------------------------------------+
| analyzer          | word              | Whether the features should be made of word n-grams or      |
|                   |                   | character n-grams. Option 'char_wb' creates character       |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+-------------------+-------------------+--------------------------------------------------------------+
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression used to detect tokens. |
| | | Only used if 'analyzer' is set to 'word'. |
+-------------------+-------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+-------------------+-------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+-------------------+-------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------+--------------------------------------------------------------+
| max_features      | None              | If not 'None', build a vocabulary that only considers the top|
| | | max_features ordered by term frequency across the corpus. |
+-------------------+-------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+-------------------+-------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+-------------------+-------------------+--------------------------------------------------------------+
+-------------------+-------------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===================+=========================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+-------------------+-------------------------+--------------------------------------------------------------+
| analyzer          | word                    | Whether the features should be made of word n-grams or      |
|                   |                         | character n-grams. Option 'char_wb' creates character       |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| token_pattern | r"(?u)\b\w\w+\b" | Regular expression used to detect tokens. |
| | | Only used if 'analyzer' is set to 'word'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+-------------------+-------------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+-------------------+-------------------------+--------------------------------------------------------------+
| max_features      | None                    | If not 'None', build a vocabulary that only considers the top|
| | | max_features ordered by term frequency across the corpus. |
+-------------------+-------------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+-------------------+-------------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+-------------------+-------------------------+--------------------------------------------------------------+
| alias             | CountVectorsFeaturizer  | Alias name of featurizer.                                    |
+-------------------+-------------------------+--------------------------------------------------------------+
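For illustration, here is how two ``CountVectorsFeaturizer`` instances might be given distinct aliases; a minimal sketch based on the example config added in this change, with ``WhitespaceTokenizer`` as just one possible tokenizer choice:

.. code-block:: yaml

    pipeline:
    - name: WhitespaceTokenizer
    # word-level count vectors
    - name: CountVectorsFeaturizer
      alias: "cvf-word"
    # character-level count vectors, distinguishable by alias
    - name: CountVectorsFeaturizer
      alias: "cvf-char"
      analyzer: "char_wb"
      min_ngram: 1
      max_ngram: 4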
.. _LexicalSyntacticFeaturizer:
@@ -1038,6 +1041,9 @@ CRFEntityExtractor
"L1_c": 0.1
# weight of the L2 regularization
"L2_c": 0.1
# Names of dense featurizers to use.
# If the list is empty, all available dense features are used.
"featurizers": []
.. note::
If POS features are used (``pos`` or ``pos2``), you need to have ``SpacyTokenizer`` in your pipeline.
@@ -1326,6 +1332,10 @@ ResponseSelector
| | | logged. Either after every epoch ("epoch") or for every |
| | | training step ("minibatch"). |
+---------------------------------+-------------------+--------------------------------------------------------------+
| featurizers                     | []                | List of featurizer names (alias names). Only features       |
|                                 |                   | coming from the listed names are used. If the list is empty,|
|                                 |                   | all available features are used.                            |
+---------------------------------+-------------------+--------------------------------------------------------------+
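For example, restricting the ``ResponseSelector`` to two featurizers looks as follows (mirroring the example config added in this change; the alias names must match those defined on featurizers earlier in the pipeline):

.. code-block:: yaml

    # only use features from the featurizers aliased "convert" and "cvf-word"
    - name: ResponseSelector
      featurizers: ["convert", "cvf-word"]
      epochs: 100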
.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
@@ -1562,6 +1572,10 @@ DIETClassifier
| | | logged. Either after every epoch ('epoch') or for every |
| | | training step ('minibatch'). |
+---------------------------------+------------------+--------------------------------------------------------------+
| featurizers                     | []               | List of featurizer names (alias names). Only features       |
|                                 |                  | coming from the listed names are used. If the list is empty,|
|                                 |                  | all available features are used.                            |
+---------------------------------+------------------+--------------------------------------------------------------+
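Similarly for the ``DIETClassifier``; an illustrative sketch, where the alias names are assumptions that must match the featurizers in your pipeline:

.. code-block:: yaml

    # restrict DIET to the listed featurizers instead of all available features
    - name: DIETClassifier
      featurizers: ["convert", "cvf-word"]
      epochs: 100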
.. note:: For ``cosine`` similarity ``maximum_positive_similarity`` and ``maximum_negative_similarity`` should
be between ``-1`` and ``1``.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -9,7 +9,7 @@ exclude = "((.eggs | .git | .pytype | .pytest_cache | build | dist))"

[tool.poetry]
name = "rasa"
version = "1.11.0a1"
version = "1.11.0a2"
description = "Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants"
authors = [ "Rasa Technologies GmbH <hi@rasa.com>",]
maintainers = [ "Tom Bocklisch <tom@rasa.com>",]
2 changes: 1 addition & 1 deletion rasa/constants.py
@@ -53,7 +53,7 @@
CONFIG_MANDATORY_KEYS_NLU = ["language", "pipeline"]
CONFIG_MANDATORY_KEYS = CONFIG_MANDATORY_KEYS_CORE + CONFIG_MANDATORY_KEYS_NLU

MINIMUM_COMPATIBLE_VERSION = "1.11.0a1"
MINIMUM_COMPATIBLE_VERSION = "1.11.0a2"

GLOBAL_USER_CONFIG_PATH = os.path.expanduser("~/.config/rasa/global.yml")
