Updated featurizers #4935

Merged
merged 278 commits into from
Dec 17, 2019
d496b43
Add changelog entry.
tabergma Oct 18, 2019
291a24e
move code from init to own file
tabergma Oct 18, 2019
5986a0d
update changelog entry.
tabergma Oct 18, 2019
54b5f3a
make use_cls_token a class variable of tokenizer
tabergma Oct 18, 2019
c939387
tokenizer inherits from component
tabergma Oct 18, 2019
944b716
remove not needed init methods
tabergma Oct 18, 2019
f1ed7d7
review comment
tabergma Oct 18, 2019
9112022
Add use_cls_token to default dict.
tabergma Oct 18, 2019
31dd425
throw key error if use_cls_token is not set as default value.
tabergma Oct 18, 2019
e652d84
Disable cls token use in default pipeline.
tabergma Oct 20, 2019
1d77554
correct type
tabergma Oct 20, 2019
3d8a2e4
fix tests
tabergma Oct 21, 2019
2985938
Merge branch 'combined-entity-intent-model' into adapt-featurizers
tabergma Oct 21, 2019
50f68b2
spacy featurizer returns sequence
tabergma Oct 21, 2019
603d065
fix tests for count vectors featurizer
tabergma Oct 21, 2019
d1a19dc
mitie featurizer returns sequence
tabergma Oct 21, 2019
bd2ceb3
regex featurizer returns sequence
tabergma Oct 21, 2019
7e46fe8
clean up
tabergma Oct 21, 2019
a4b8b0e
Add changelog entry
tabergma Oct 21, 2019
f02b9c2
helper method to convert seq features back
tabergma Oct 21, 2019
d3a5dd5
remove print statement
tabergma Oct 21, 2019
46ab485
fix imports
tabergma Oct 22, 2019
076f33d
remove ner_features from restaurantbot
tabergma Oct 22, 2019
905f2d6
change default value
tabergma Oct 22, 2019
1941a25
fix imports
tabergma Oct 22, 2019
6483379
handle cls token in featurizers
tabergma Oct 22, 2019
952e95a
Remove ngram featurizer from registry
tabergma Oct 22, 2019
6faa44b
review comments
tabergma Oct 23, 2019
20a92ca
count vectors featurizer requires tokens
tabergma Oct 23, 2019
810cae5
remove not needed vocab check
tabergma Oct 23, 2019
b6ad85c
Add cls token to whitespace tokenizer.
tabergma Oct 18, 2019
fb24e35
Add cls token to spacy tokenizer.
tabergma Oct 18, 2019
ad64e50
Add cls token to mitie tokenizer.
tabergma Oct 18, 2019
2ce36d9
Add cls token to jieba tokenizer.
tabergma Oct 18, 2019
3f85199
Add changelog entry.
tabergma Oct 18, 2019
88964a0
move code from init to own file
tabergma Oct 18, 2019
acb7503
update changelog entry.
tabergma Oct 18, 2019
3d89a66
make use_cls_token a class variable of tokenizer
tabergma Oct 18, 2019
7ed1f27
tokenizer inherits from component
tabergma Oct 18, 2019
b9e3188
remove not needed init methods
tabergma Oct 18, 2019
787e047
review comment
tabergma Oct 18, 2019
6fe28f0
Add use_cls_token to default dict.
tabergma Oct 18, 2019
172c0e5
throw key error if use_cls_token is not set as default value.
tabergma Oct 18, 2019
45a5868
Disable cls token use in default pipeline.
tabergma Oct 20, 2019
7c9c679
correct type
tabergma Oct 20, 2019
dfeca3e
fix tests
tabergma Oct 21, 2019
d031f14
Merge branch 'combined-entity-intent-model' into adapt-featurizers
tabergma Oct 23, 2019
ce91597
Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurize…
tabergma Oct 23, 2019
f69673a
Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurize…
tabergma Oct 23, 2019
78d4d51
review comments
tabergma Oct 23, 2019
411d328
test regex featurizer on response
tabergma Oct 23, 2019
884e2b3
review comments
tabergma Oct 23, 2019
2075338
Merge pull request #4648 from RasaHQ/adapt-featurizers
tabergma Oct 23, 2019
e857c35
switch from ner to sparse features
tabergma Oct 23, 2019
01b4de6
add seq to sentence embedding method
tabergma Oct 23, 2019
f21bbd7
update crf entity extractor
tabergma Oct 23, 2019
d08b2d7
use constants
tabergma Oct 23, 2019
a5e3382
fix imports
tabergma Oct 23, 2019
b469bd6
Fix crf entity extractor.
tabergma Oct 23, 2019
d97ce14
Remove empty file.
tabergma Oct 23, 2019
3275f5a
add changelog entry.
tabergma Oct 23, 2019
f85fbe2
Remove case_sensitive option from WhitespaceTokenizer
tabergma Oct 23, 2019
a4f5e8e
Update docstring.
tabergma Oct 23, 2019
a6d93fb
review comments
tabergma Oct 23, 2019
ae8faf6
undo removing case sensitive from whitespace tokenizer
tabergma Oct 24, 2019
8791d40
Adapt tests.
tabergma Oct 24, 2019
8d86968
rename word_embeddings to text_dense_features
tabergma Oct 24, 2019
a8a5abf
combine correct features in regex featurizer
tabergma Oct 24, 2019
b1d371b
keep sparse sparse
tabergma Oct 25, 2019
5ded8c9
fix changelog
tabergma Oct 25, 2019
bbf4d43
update sequence to sentence
tabergma Oct 25, 2019
ecbf157
update sequence to sentence
tabergma Oct 25, 2019
74ec4c4
Merge pull request #4663 from RasaHQ/adapt-extractors-classifiers
tabergma Oct 25, 2019
934ae5e
Merge branch 'master' into combined-entity-intent-model
tabergma Oct 25, 2019
a62445c
Merge branch 'master' into combined-entity-intent-model
tabergma Oct 29, 2019
3cb90b7
update session data
tabergma Oct 23, 2019
3faf171
use dict for session data.
tabergma Oct 28, 2019
834d265
adapt classifiers
tabergma Oct 28, 2019
3f0750b
fix classifier
tabergma Oct 28, 2019
4c8f811
add more tests
tabergma Oct 28, 2019
e05a9d6
use sparse in tests
tabergma Oct 28, 2019
b110720
fix shapes
tabergma Oct 29, 2019
219d9dd
fix tests.
tabergma Oct 29, 2019
bca0b85
review comments
tabergma Oct 29, 2019
adc84fe
use label_key
tabergma Oct 29, 2019
1dbaa5c
intent classifier makes use of sparse and dense features.
tabergma Oct 29, 2019
f2a8599
remove default value for label_key
tabergma Oct 29, 2019
9c25095
clean up
tabergma Oct 29, 2019
dafbaf9
review comments
tabergma Oct 30, 2019
3c78a86
add more tests
tabergma Oct 30, 2019
31196cf
add test for balance session data
tabergma Oct 30, 2019
9f4ed63
use given attribute in create session data
tabergma Oct 30, 2019
7545745
gen_batch can handle sequence
tabergma Oct 31, 2019
d3b48ea
session data is simple dict
tabergma Oct 31, 2019
a65f397
use sparse tensors
tabergma Nov 1, 2019
60bdec2
wrap tf.layers.dense with dense_layer function
tabergma Nov 4, 2019
c448fe7
get feature_dim from session data instead of sparse tensor
tabergma Nov 4, 2019
093024f
Update rasa/utils/train_utils.py
Ghostvv Nov 7, 2019
b600c26
pass last dim of sparse tensor into the SparseTensor directly, separa…
Ghostvv Nov 7, 2019
d53ffb9
rephrase todo
Ghostvv Nov 7, 2019
086ee13
rephrase todo
Ghostvv Nov 8, 2019
e3f8a63
keep _encoded_all_label_ids scipy.sparse.csr_matrix.
tabergma Nov 8, 2019
98829a9
session data values are list of np.ndarray
tabergma Nov 8, 2019
f97f6df
fix encoded all label ids
tabergma Nov 8, 2019
5d53eb1
fix train utils methods
tabergma Nov 8, 2019
f79ed36
convert encoded_all_labels into a list of sparse,dense
Ghostvv Nov 8, 2019
0a06ff6
Merge branch 'adapt-session-data' of https://github.com/RasaHQ/rasa i…
Ghostvv Nov 8, 2019
c2447f3
Merge branch 'master' into combined-entity-intent-model
tabergma Nov 8, 2019
2a3966b
Merge branch 'combined-entity-intent-model' into adapt-session-data
tabergma Nov 8, 2019
b208db7
create sparse matrices, if no intent features provided
Ghostvv Nov 8, 2019
ee37852
embedding intent classifier is training.
tabergma Nov 11, 2019
b9256bd
create session data during prediction.
tabergma Nov 11, 2019
112f065
prediction of embedding intent classifier works.
tabergma Nov 11, 2019
2635464
clean up code
tabergma Nov 11, 2019
4104574
convert encoded all labels the same way as session data
Ghostvv Nov 11, 2019
87bb01a
merge
Ghostvv Nov 11, 2019
9d66575
add mask
tabergma Nov 11, 2019
1c4591e
check if tokens are present
tabergma Nov 11, 2019
7889033
add TODO
Ghostvv Nov 11, 2019
6f20dbc
fix wrong embed layer
Ghostvv Nov 11, 2019
6cf4385
more consistent var naming
Ghostvv Nov 11, 2019
bcc52c1
fix balance session data
tabergma Nov 12, 2019
c161a43
add comments
tabergma Nov 12, 2019
c7e3251
extract dense_dim from dense features
tabergma Nov 12, 2019
b98bab6
Fix test_train test.
tabergma Nov 12, 2019
615bb62
_compute_default_label_features works as expected
tabergma Nov 12, 2019
2ef8744
fix len error'
Ghostvv Nov 12, 2019
13a3550
Merge branch 'adapt-session-data' of https://github.com/RasaHQ/rasa i…
Ghostvv Nov 12, 2019
c558e0d
Merge branch 'master' into combined-entity-intent-model
tabergma Nov 12, 2019
42ba88e
Merge branch 'combined-entity-intent-model' into adapt-session-data
tabergma Nov 12, 2019
718aff0
use default label features if not present
tabergma Nov 12, 2019
220d6d0
correct use of session data in policy
tabergma Nov 12, 2019
18fe94f
Use coo_matrix.
tabergma Nov 12, 2019
c56db96
Update Changelog
tabergma Nov 12, 2019
d22055c
clean up
tabergma Nov 12, 2019
ad8695a
Fix imports.
tabergma Nov 12, 2019
1c835c5
add masks, update prediction batch creation
Ghostvv Nov 12, 2019
ff0c707
fix types
Ghostvv Nov 12, 2019
597265b
merge helper methods
Ghostvv Nov 12, 2019
29a9c7f
some refactoring
tabergma Nov 13, 2019
4c03841
add test for get number of features
tabergma Nov 13, 2019
fa7b50c
set initial tuple size to zero
Ghostvv Nov 13, 2019
fb7a6ed
Merge branch 'adapt-session-data' of https://github.com/RasaHQ/rasa i…
Ghostvv Nov 13, 2019
35e2bdb
rename the variable
Ghostvv Nov 13, 2019
9bbe1c1
formatting
tabergma Nov 13, 2019
f383cf0
Update cli startup test
tabergma Nov 13, 2019
7ab1f97
fix test.
tabergma Nov 13, 2019
2a54286
fix different sequence lengths in sparse and dense features
Ghostvv Nov 13, 2019
307e064
cosmetic changes
Ghostvv Nov 13, 2019
7c46caa
black
Ghostvv Nov 13, 2019
468ef3c
fix default Y features
Ghostvv Nov 13, 2019
b2391cf
use f strings
tabergma Nov 13, 2019
ed2b72f
formatting
tabergma Nov 13, 2019
38d83c3
fix types
tabergma Nov 14, 2019
5fdd251
store tuple sizes correctly
tabergma Nov 14, 2019
75b0e69
use float32 everywhere
tabergma Nov 14, 2019
38e8b81
fix docstrings
Ghostvv Nov 14, 2019
6e472a9
fix label_ids in core
Ghostvv Nov 14, 2019
5fdf957
use helper method
Ghostvv Nov 14, 2019
6aaf3ce
fix dynamic seq in label_id
Ghostvv Nov 14, 2019
4f9ecf7
raise if unsupported label_id dims
Ghostvv Nov 14, 2019
015c4d9
black
Ghostvv Nov 14, 2019
6004d65
fix import
Ghostvv Nov 14, 2019
5c050da
fix split session data tests
Ghostvv Nov 14, 2019
c08fe55
Merge pull request #4686 from RasaHQ/adapt-session-data
tabergma Nov 14, 2019
6fc18e5
Merge branch 'master' into combined-entity-intent-model
tabergma Nov 14, 2019
b40d6f4
slightly cleaner sparse to indices code
Ghostvv Nov 15, 2019
af54fbc
use extend
Ghostvv Nov 15, 2019
5d435a3
remove else
Ghostvv Nov 15, 2019
8d41e5e
use numpy stack
Ghostvv Nov 15, 2019
f29415c
Merge pull request #4777 from RasaHQ/sparse-batch
Ghostvv Nov 15, 2019
308b487
fix split train val
tabergma Nov 15, 2019
62d9e60
mask combined input before averaging
Ghostvv Nov 20, 2019
b26cfac
Merge branch 'master' into updated-featurizers
tabergma Nov 21, 2019
931d5eb
fix oov token warning
Ghostvv Nov 25, 2019
afeeaf1
Merge branch 'master' into updated-featurizers
tabergma Nov 25, 2019
df104f3
Merge branch 'master' into updated-featurizers
tabergma Nov 27, 2019
232176c
move convert featurizer to dense featurizers
tabergma Nov 27, 2019
7c87d60
add future warning to ngram featurizer
tabergma Nov 27, 2019
a81d0a8
set default value of use_cls_token to false
tabergma Nov 27, 2019
b2d1ad2
Merge branch 'master' into updated-featurizers
tabergma Nov 28, 2019
4471dee
fix import (add root)
tabergma Nov 28, 2019
e48d7a5
add return_sequence flag
tabergma Nov 28, 2019
bcfb0ad
convert featurizer returns seq of 1
tabergma Nov 28, 2019
f2b9e4f
fix return_sequence not found in config
tabergma Nov 29, 2019
a9360e3
convert featurizer return seq of 1
tabergma Nov 29, 2019
2850813
add more tests
tabergma Nov 29, 2019
32586f3
add test for convert featurizer
tabergma Nov 29, 2019
5121edc
fix default pipeline test
tabergma Nov 29, 2019
3c20e33
refactor mitie featurizer
tabergma Nov 29, 2019
4d42a22
Merge branch 'master' into updated-featurizers
tabergma Nov 29, 2019
961b912
Merge branch 'updated-featurizers' into add-sequence-flag
tabergma Nov 29, 2019
aac64a8
fix import
tabergma Nov 29, 2019
a4b454b
Add warning to convert featurizer.
tabergma Dec 2, 2019
303ef4c
update warning in crf entity extractor
tabergma Dec 2, 2019
b4e1e04
Add empty documentation page.
tabergma Dec 2, 2019
2e32d7e
update documentation
tabergma Dec 2, 2019
24e92b6
raise value error if seq dimension does not match
tabergma Dec 3, 2019
388fb6e
take mean vec for cls token in mitie
tabergma Dec 4, 2019
d5579a3
fix bug in count vector featurizer
tabergma Dec 4, 2019
0a37a61
review comments
tabergma Dec 4, 2019
afec4a9
add comment to count vectors about input to vectorizer
tabergma Dec 9, 2019
f6507ca
throw error if return_seq is true for convert featurizer
tabergma Dec 9, 2019
de9a5ed
update warnings
tabergma Dec 9, 2019
c01673c
update warning
tabergma Dec 9, 2019
3ae7626
fix tests
tabergma Dec 9, 2019
8156a4e
Merge pull request #4880 from RasaHQ/add-sequence-flag
tabergma Dec 9, 2019
ea57e20
Merge branch 'master' into updated-featurizers
tabergma Dec 10, 2019
fcf0474
remove default values from example configs
tabergma Dec 10, 2019
b8b4c2c
Merge branch 'updated-featurizers' into nlu-featurizer-documentation
tabergma Dec 10, 2019
79e0ceb
fix import
tabergma Dec 10, 2019
e47176a
update documentation
tabergma Dec 10, 2019
d39c322
Merge branch 'master' into updated-featurizers
tabergma Dec 10, 2019
66bdd62
Merge branch 'updated-featurizers' into nlu-featurizer-documentation
tabergma Dec 10, 2019
e3ed14f
fix links
tabergma Dec 10, 2019
225f1e4
reduce complexity
tabergma Dec 10, 2019
d434f04
update featurization link
tabergma Dec 11, 2019
a422dbd
Merge branch 'master' into updated-featurizers
tabergma Dec 11, 2019
d270cba
Merge branch 'updated-featurizers' into nlu-featurizer-documentation
tabergma Dec 11, 2019
8fdb9cf
review comment
tabergma Dec 11, 2019
dc47c40
Merge pull request #4934 from RasaHQ/nlu-featurizer-documentation
tabergma Dec 11, 2019
4c631e6
remove MESSAGE_ from nlu constants
tabergma Dec 11, 2019
50d54e3
rename spacy_featurizable_attributes to dense_featurizable_attributes
tabergma Dec 11, 2019
80e483b
Merge pull request #4944 from RasaHQ/rename-nlu-constants
tabergma Dec 11, 2019
f9b4f82
update changelog entry
tabergma Dec 12, 2019
1c1d95e
Merge branch 'master' into updated-featurizers
tabergma Dec 12, 2019
832755e
update docs around convert featurizer
tabergma Dec 12, 2019
4e4cef6
add description to public methods in embedding intent classifier
tabergma Dec 12, 2019
bb231b1
update train utils
tabergma Dec 12, 2019
aa3bf9d
update changelog entry
tabergma Dec 12, 2019
1125e11
Update nlu component documentation.
tabergma Dec 12, 2019
9628eb2
fix spelling mistakes
tabergma Dec 12, 2019
56e7f86
Merge branch 'master' into updated-featurizers
tabergma Dec 12, 2019
e4529c0
refactoring count vectors featurizer
tabergma Dec 12, 2019
a366b77
compute default intent features as dense features
tabergma Dec 12, 2019
47095d1
use different dense dim default value for intents
tabergma Dec 12, 2019
b8b4bec
Merge branch 'master' into updated-featurizers
tabergma Dec 12, 2019
cd58a51
Merge branch 'master' into updated-featurizers
tabergma Dec 16, 2019
2df3b36
update model version
tabergma Dec 16, 2019
ec2cb58
update changelog
tabergma Dec 16, 2019
2f148f3
increase version to 1.6.0a2
tabergma Dec 16, 2019
8ba153a
update documentation
tabergma Dec 16, 2019
a79916c
review comments
tabergma Dec 16, 2019
5ef7b80
Update rasa/nlu/featurizers/sparse_featurizer/ngram_featurizer.py
tabergma Dec 16, 2019
bb44fd6
add missing types
tabergma Dec 16, 2019
3032fc4
Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurize…
tabergma Dec 16, 2019
e1eade1
Update rasa/nlu/featurizers/sparse_featurizer/count_vectors_featurize…
tabergma Dec 16, 2019
ad30827
fix types
tabergma Dec 16, 2019
b83ee6f
Merge branch 'master' into updated-featurizers
tabergma Dec 16, 2019
2253200
Merge branch 'master' into updated-featurizers
tabergma Dec 17, 2019
15 changes: 15 additions & 0 deletions changelog/4935.feature.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
Preparation for an upcoming change in the ``EmbeddingIntentClassifier``:

Add option ``use_cls_token`` to all tokenizers. If it is set to ``True``, the token ``__CLS__`` will be added to
the end of the list of tokens. Default is set to ``False``. No need to change the default value for now.

Add option ``return_sequence`` to all featurizers. By default all featurizers return a matrix of size
(1 x feature-dimension). If the option ``return_sequence`` is set to ``True``, the corresponding featurizer will return
a matrix of size (token-length x feature-dimension). See https://rasa.com/docs/rasa/nlu/components/#featurizers.
Default value is set to ``False``. However, you might want to set it to ``True`` if you want to use custom features
in the ``CRFEntityExtractor``.
See https://rasa.com/docs/rasa/nlu/entity-extraction/#passing-custom-features-to-crfentityextractor.

.. warning::

These changes break model compatibility. You will need to retrain your old models!
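For illustration, the two new options could be enabled in a pipeline like this (a minimal sketch; values are examples only, both options default to ``False``):

```yaml
pipeline:
  - name: "WhitespaceTokenizer"
    use_cls_token: true        # append __CLS__ to the end of the token list
  - name: "CountVectorsFeaturizer"
    return_sequence: true      # (token-length x feature-dimension) instead of (1 x feature-dimension)
```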
12 changes: 12 additions & 0 deletions changelog/4935.removal.rst
@@ -0,0 +1,12 @@
Removed ``ner_features`` as a feature name from ``CRFEntityExtractor``; use ``text_dense_features`` instead. If
``text_dense_features`` are present in the feature set, ``CRFEntityExtractor`` will automatically make use of them.

The following settings match the previous ``NGramFeaturizer``:

.. code-block:: yaml

- name: 'CountVectorsFeaturizer'
analyzer: 'char_wb'
min_ngram: 3
max_ngram: 17
max_features: 10
min_df: 5
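The ``char_wb`` analyzer in the settings above comes from scikit-learn's ``CountVectorizer``, which the ``CountVectorsFeaturizer`` wraps: character n-grams are extracted only inside word boundaries, with each word padded by a space on both sides. A pure-Python sketch of that behavior for a single n (the function name is illustrative, not part of Rasa, and edge cases for words shorter than n are simplified):

```python
def char_wb_ngrams(text, n):
    """Character n-grams per word, padded with spaces so that
    n-grams never span word boundaries (sklearn's 'char_wb' idea)."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_wb_ngrams("hi you", 3))  # [' hi', 'hi ', ' yo', 'you', 'ou ']
```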
5 changes: 5 additions & 0 deletions changelog/4957.removal.rst
@@ -0,0 +1,5 @@
To use custom features in the ``CRFEntityExtractor`` use ``text_dense_features`` instead of ``ner_features``. If
``text_dense_features`` are present in the feature set, the ``CRFEntityExtractor`` will automatically make use of
them. Just make sure to add a dense featurizer in front of the ``CRFEntityExtractor`` in your pipeline and set the
flag ``return_sequence`` to ``True`` for that featurizer.
See https://rasa.com/docs/rasa/nlu/entity-extraction/#passing-custom-features-to-crfentityextractor.
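The shape requirement described above — one dense feature vector per token — can be pictured as follows (a hypothetical sketch, not the actual ``CRFEntityExtractor`` code; all names are illustrative):

```python
import numpy as np

def dense_features_usable(tokens, text_dense_features):
    """Custom dense features are only usable when there is exactly one
    feature vector per token; otherwise the extractor warns and trains
    without the additional custom features."""
    if text_dense_features is None:
        return False
    if len(text_dense_features) != len(tokens):
        return False  # this is the case the warning in the docs refers to
    return all(np.asarray(vec).ndim == 1 for vec in text_dense_features)

tokens = ["book", "a", "table"]
features = np.random.rand(len(tokens), 300)  # e.g. one 300-d word vector per token
print(dense_features_usable(tokens, features))  # True
```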
@@ -1,10 +1,10 @@
:desc: Find out how to apply machine learning algorithms to conversational AI
using vector representations of conversations with Rasa.

.. _featurization:
.. _featurization_conversations:

Featurization
==============
Featurization of Conversations
==============================

.. edit-link::

6 changes: 3 additions & 3 deletions docs/core/policies.rst
@@ -70,7 +70,7 @@ in the policy configuration yaml file.

Only the ``MaxHistoryTrackerFeaturizer`` uses a max history,
whereas the ``FullDialogueTrackerFeaturizer`` always looks at
the full conversation history. See :ref:`featurization` for details.
the full conversation history. See :ref:`featurization_conversations` for details.

As an example, let's say you have an ``out_of_scope`` intent which
describes off-topic user messages. If your bot sees this intent multiple
@@ -218,7 +218,7 @@ following steps:

It is recommended to use
``state_featurizer=LabelTokenizerSingleStateFeaturizer(...)``
(see :ref:`featurization` for details).
(see :ref:`featurization_conversations` for details).

**Configuration:**

@@ -308,7 +308,7 @@ It is recommended to use
Default ``max_history`` for this policy is ``None`` which means it'll use
the ``FullDialogueTrackerFeaturizer``. We recommend to set ``max_history`` to
some finite value in order to use ``MaxHistoryTrackerFeaturizer``
for **faster training**. See :ref:`featurization` for details.
for **faster training**. See :ref:`featurization_conversations` for details.
We recommend to increase ``batch_size`` for ``MaxHistoryTrackerFeaturizer``
(e.g. ``"batch_size": [32, 64]``)

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -91,7 +91,7 @@ Understand messages, hold conversations, and connect to messaging channels and A
api/event-brokers
api/lock-stores
api/training-data-importers
api/featurization
api/core-featurization
migration-guide
changelog

2 changes: 1 addition & 1 deletion docs/migration-guide.rst
@@ -37,7 +37,7 @@ General
- Default ``max_history`` for ``EmbeddingPolicy`` is ``None`` which means it'll use
the ``FullDialogueTrackerFeaturizer``. We recommend to set ``max_history`` to
some finite value in order to use ``MaxHistoryTrackerFeaturizer``
for **faster training**. See :ref:`featurization` for details.
for **faster training**. See :ref:`featurization_conversations` for details.
We recommend to increase ``batch_size`` for ``MaxHistoryTrackerFeaturizer``
(e.g. ``"batch_size": [32, 64]``)
- **Compare** mode of ``rasa train core`` allows the whole core config comparison.
156 changes: 91 additions & 65 deletions docs/nlu/components.rst


15 changes: 11 additions & 4 deletions docs/nlu/entity-extraction.rst
@@ -151,10 +151,17 @@ If you just want to match regular expressions exactly, you can do this in your c
as a postprocessing step after receiving the response from Rasa NLU.


.. _entity-extraction-custom-features:

Passing Custom Features to ``CRFEntityExtractor``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you want to pass custom features to ``CRFEntityExtractor``, you can create a ``Featurizer`` that provides ``ner_features``.
If you do, ``ner_features`` should be an iterable of ``len(tokens)``, where each entry is a vector.
If ``CRFEntityExtractor`` finds ``"ner_features"`` in one of the arrays in ``features`` in the config, it will pass the ``ner_features`` vectors to ``sklearn_crfsuite``.
The simplest example of this is to pass word vectors as features, which you can do using :ref:``SpacyFeaturizer``.
If you want to pass custom features, such as pre-trained word embeddings, to ``CRFEntityExtractor``, you can
add any dense featurizer (except ``ConveRTFeaturizer``) to the pipeline before the ``CRFEntityExtractor``.
Make sure to set ``"return_sequence"`` to ``True`` for the corresponding dense featurizer.
``CRFEntityExtractor`` automatically finds the additional dense features and checks if the dense features are an
iterable of ``len(tokens)``, where each entry is a vector.
A warning will be shown in case the check fails.
However, ``CRFEntityExtractor`` will continue to train just without the additional custom features.
In case dense features are present, ``CRFEntityExtractor`` will pass the dense features to ``sklearn_crfsuite``
and use them for training.
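Putting the steps above together, a pipeline that passes word vectors to the CRF might look like this (a sketch; the CRF ``features`` list is abridged):

```yaml
pipeline:
  - name: "SpacyTokenizer"
  - name: "SpacyFeaturizer"
    return_sequence: true      # one dense vector per token, not one per message
  - name: "CRFEntityExtractor"
    features: [["low"], ["low", "text_dense_features"], ["low"]]
```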
31 changes: 16 additions & 15 deletions examples/restaurantbot/config.yml
@@ -6,23 +6,24 @@ pipeline:
- name: "SpacyFeaturizer"
- name: "SklearnIntentClassifier"
- name: "CRFEntityExtractor"
features: [ ["low", "title", "upper"],
features: [
["low", "title", "upper"],
[
"bias",
"low",
"prefix5",
"prefix2",
"suffix5",
"suffix3",
"suffix2",
"upper",
"title",
"digit",
"pattern",
"ner_features",
"bias",
"low",
"prefix5",
"prefix2",
"suffix5",
"suffix3",
"suffix2",
"upper",
"title",
"digit",
"pattern",
"text_dense_features"
],
["low", "title", "upper"]]

["low", "title", "upper"],
]
- name: "EntitySynonymMapper"

policies:
2 changes: 1 addition & 1 deletion rasa/constants.py
@@ -33,7 +33,7 @@
CONFIG_MANDATORY_KEYS_NLU = ["language", "pipeline"]
CONFIG_MANDATORY_KEYS = CONFIG_MANDATORY_KEYS_CORE + CONFIG_MANDATORY_KEYS_NLU

MINIMUM_COMPATIBLE_VERSION = "1.3.0a2"
MINIMUM_COMPATIBLE_VERSION = "1.6.0a2"

GLOBAL_USER_CONFIG_PATH = os.path.expanduser("~/.config/rasa/global.yml")

4 changes: 2 additions & 2 deletions rasa/core/actions/action.py
@@ -19,7 +19,7 @@
from rasa.nlu.constants import (
DEFAULT_OPEN_UTTERANCE_TYPE,
OPEN_UTTERANCE_PREDICTION_KEY,
MESSAGE_SELECTOR_PROPERTY_NAME,
RESPONSE_SELECTOR_PROPERTY_NAME,
)

from rasa.core.events import (
@@ -201,7 +201,7 @@ async def run(
"""Query the appropriate response and create a bot utterance with that."""

response_selector_properties = tracker.latest_message.parse_data[
MESSAGE_SELECTOR_PROPERTY_NAME
RESPONSE_SELECTOR_PROPERTY_NAME
]

if self.intent_name_from_action() in response_selector_properties:
51 changes: 31 additions & 20 deletions rasa/core/policies/embedding_policy.py
@@ -252,25 +252,25 @@ def _label_features_for_Y(self, label_ids: "np.ndarray") -> "np.ndarray":
# noinspection PyPep8Naming
def _create_session_data(
self, data_X: "np.ndarray", data_Y: Optional["np.ndarray"] = None
) -> "train_utils.SessionData":
"""Combine all tf session related data into a named tuple"""

) -> "train_utils.SessionDataType":
"""Combine all tf session related data into dict."""
if data_Y is not None:
# training time
label_ids = self._label_ids_for_Y(data_Y)
Y = self._label_features_for_Y(label_ids)

# idea taken from sklearn's stratify split
if label_ids.ndim == 2:
# for multi-label y, map each distinct row to a string repr
# using join because str(row) uses an ellipsis if len(row) > 1000
label_ids = np.array([" ".join(row.astype("str")) for row in label_ids])
# explicitly add last dimension to label_ids
# to track correctly dynamic sequences
label_ids = np.expand_dims(label_ids, -1)
else:
# prediction time
label_ids = None
Y = None

return train_utils.SessionData(X=data_X, Y=Y, label_ids=label_ids)
return {
"dialogue_features": [data_X],
"bot_features": [Y],
"action_ids": [label_ids],
}

def _create_tf_bot_embed(self, b_in: "tf.Tensor") -> "tf.Tensor":
"""Create embedding bot vector."""
@@ -331,9 +331,9 @@ def _create_tf_dial(self, a_in) -> Tuple["tf.Tensor", "tf.Tensor"]:

def _build_tf_train_graph(self) -> Tuple["tf.Tensor", "tf.Tensor"]:
"""Build train graph using iterator."""
# iterator returns a_in, b_in, action_ids
self.a_in, self.b_in, _ = self._iterator.get_next()

# session data are int counts but we need float tensors
self.a_in, self.b_in = self._iterator.get_next()
if isinstance(self.featurizer, MaxHistoryTrackerFeaturizer):
# add time dimension if max history featurizer is used
self.b_in = self.b_in[:, tf.newaxis, :]
@@ -364,23 +364,25 @@ def _build_tf_train_graph(self) -> Tuple["tf.Tensor", "tf.Tensor"]:
)

# prepare for prediction
def _create_tf_placeholders(self, session_data: "train_utils.SessionData") -> None:
def _create_tf_placeholders(
self, session_data: "train_utils.SessionDataType"
) -> None:
"""Create placeholders for prediction."""

dialogue_len = None # use dynamic time
self.a_in = tf.placeholder(
dtype=tf.float32,
shape=(None, dialogue_len, session_data.X.shape[-1]),
shape=(None, dialogue_len, session_data["dialogue_features"][0].shape[-1]),
name="a",
)
self.b_in = tf.placeholder(
dtype=tf.float32,
shape=(None, dialogue_len, None, session_data.Y.shape[-1]),
shape=(None, dialogue_len, None, session_data["bot_features"][0].shape[-1]),
name="b",
)

def _build_tf_pred_graph(
self, session_data: "train_utils.SessionData"
self, session_data: "train_utils.SessionDataType"
) -> "tf.Tensor":
"""Rebuild tf graph for prediction."""

@@ -440,7 +442,10 @@ def train(

if self.evaluate_on_num_examples:
session_data, eval_session_data = train_utils.train_val_split(
session_data, self.evaluate_on_num_examples, self.random_seed
session_data,
self.evaluate_on_num_examples,
self.random_seed,
label_key="action_ids",
)
else:
eval_session_data = None
@@ -458,7 +463,11 @@
train_init_op,
eval_init_op,
) = train_utils.create_iterator_init_datasets(
session_data, eval_session_data, batch_size_in, self.batch_strategy
session_data,
eval_session_data,
batch_size_in,
self.batch_strategy,
label_key="action_ids",
)

self._is_training = tf.placeholder_with_default(False, shape=())
@@ -512,7 +521,9 @@ def continue_training(
session_data = self._create_session_data(
training_data.X, training_data.y
)
train_dataset = train_utils.create_tf_dataset(session_data, batch_size)
train_dataset = train_utils.create_tf_dataset(
session_data, batch_size, label_key="action_ids"
)
train_init_op = self._iterator.make_initializer(train_dataset)
self.session.run(train_init_op)

@@ -535,7 +546,7 @@ def tf_feed_dict_for_prediction(
data_X = self.featurizer.create_X([tracker], domain)
session_data = self._create_session_data(data_X)

return {self.a_in: session_data.X}
return {self.a_in: session_data["dialogue_features"][0]}

def predict_action_probabilities(
self, tracker: "DialogueStateTracker", domain: "Domain"
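The embedding-policy diff above replaces the ``SessionData`` named tuple with a plain dict mapping feature names to lists of numpy arrays (``SessionDataType``), and explicitly adds a last dimension to ``label_ids`` so dynamic sequences are tracked correctly. A minimal sketch of the resulting structure (key names follow the diff; shapes are illustrative):

```python
import numpy as np

# dialogue features: (batch, dialogue_len, feature_dim)
data_X = np.random.rand(2, 5, 10)

label_ids = np.array([3, 7])
# explicitly add the last dimension, as in the diff
label_ids = np.expand_dims(label_ids, -1)

session_data = {
    "dialogue_features": [data_X],
    "action_ids": [label_ids],
}
print(label_ids.shape)       # (2, 1)
print(sorted(session_data))  # ['action_ids', 'dialogue_features']
```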