Flexible handling of non-existent message_attribute for given training sample #4445

JulianGerhard21 · 2019-09-12T13:12:20Z

Description of Problem:

Currently, the SpacyNLP component provides the following:

provides = ["spacy_doc", "spacy_nlp", "intent_spacy_doc", "response_spacy_doc"]

which caused the necessity to handle non-existent / None-valued attributes for a given training sample. Currently this is realized by converting None values to empty strings since spaCy can't handle None values while creating its Doc-objects upon them.

Since simply filtering out those training samples and therefore disobey their order would cause consecutive problems, we need to find a more flexible solution.

Overview of the Solution:
I am going to think about a robust solution and update this issue likewise.

Examples:
If there are no samples for the response-attribute, currently this results in a list of empty Doc-objects while calling pipe on:

docs = [doc for doc in self.nlp.pipe(texts, batch_size=50)]

[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]

since an empty string is valid for Doc-objects but in fact is a problem for e.g. libraries like spacy-pytorch-transformers or other custom-components which can't handle this cases properly.

The coresponding forum entry to this conversation can be found here @dakshvar22

The text was updated successfully, but these errors were encountered:

sara-tagger · 2019-09-13T06:00:10Z

Thanks for submitting this feature request 🚀@dakshvar22 will get back to you about it soon!✨

JulianGerhard21 · 2019-09-13T09:06:03Z

Hi @dakshvar22 ,

I broke my head about a solution at this point. I tried several scenarios and I am now able to comprehend your problems with the architectural decision in this situation.

At least for now I'd say that it is more or less impossible to change things on the spaCy-pytorch-transformers side without the help of spaCy. There are several problems with the pipe-method of that library, one of them is, that they extended the normal spaCy library by ._.pytt_word_pieces which relies on the fact that those transformers use masking.

The simple conclusion is: They are doing things with their Doc-object that simply can't be done with an empty Doc - at the moment, e.g. because actually there are no sents or Token objects in this Doc.

I then came back to the idea to question this part:

        if text is None:
            # converted to empty string so that it can still be passed to spacy.
            # Another option could be to neglect tokenization of the attribute of this example, but since we are
            # processing in batch mode, it would get complex to collect all processed and neglected examples.
            text = ""

The more I thought about that the more I could feel with your struggle. On idea was to change "" into " " - which would at least pass in a regular Doc-object that actually has a one-lengthed-sentence Span object and that solution worked until:

ids = torch.as_tensor(ids, dtype=torch.float64)

in pytt, which brought up the error of pytorch's inability to handle zero entries in tensors.

I understand your thoughts about obeying the order of the attribute_docs entries but I think it at least would be possible to filter out None/"" entries, handle them properly and merging this back into the former order. The question then would be:

What should/could we do with them?

I am running out of ideas.

dakshvar22 · 2019-09-13T09:28:32Z

Hey @JulianGerhard21 , thanks for giving this a detailed look. I think, going by your observations, we can't rely on spacy or spacy-pytorch-transformers to help us out here.
So, as you said maybe we should just find an elegant solution to filter the None/"" entries, pass other entries to pipe and merge the empty documents back in the correct order. I think it isn't extremely nasty, so we could do something like -

convert each element of texts list inside docs_for_training_data as a tuple of (index,text)
Get two lists from this - one without empty texts and one with empty texts.
pass non-empty texts list to nlp.pipe, get a docs list and convert them back to a tuple of (index,Doc). Here index would be the original index. A simple zip would do it.
Create empty docs for empty text list. Also create a tuple of (index,Doc)
Merge the two lists, sort them and get the docs list.

I think it makes sense to do this because people can build custom components based on pre-trained BERT using spacy docs or integrating any other library that comes up and relies on spacy docs. Since we already have a SpacyNLP component, it would make sense to adapt to their Doc object functionality.

What do you think?

JulianGerhard21 · 2019-09-13T09:56:03Z

Hi @dakshvar22,

allright - I agree with you and I am going to start to work on this this afternoon. I will get back to you with a code proposal as soon as it is ready.

Thanks for your help!

Regards
Julian

stale · 2020-02-03T21:36:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

joejuzl · 2021-01-28T13:46:05Z

Closing as this is in a minor release around 1.3.x.

JulianGerhard21 added the type:enhancement ✨ Additions of new features or changes to existing ones, should be doable in a single PR label Sep 12, 2019

dakshvar22 added the area:rasa-oss 🎡 Anything related to the open source Rasa framework label Sep 12, 2019

JulianGerhard21 mentioned this issue Sep 16, 2019

Flexible handling of non-existent message_attribute for given training sample #4466

Closed

5 tasks

stale bot added the status:stale label Feb 3, 2020

tmbo removed the status:stale label Feb 3, 2020

joejuzl closed this as completed Jan 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flexible handling of non-existent message_attribute for given training sample #4445

Flexible handling of non-existent message_attribute for given training sample #4445

JulianGerhard21 commented Sep 12, 2019

sara-tagger commented Sep 13, 2019 •

edited by TyDunn

Loading

JulianGerhard21 commented Sep 13, 2019

dakshvar22 commented Sep 13, 2019

JulianGerhard21 commented Sep 13, 2019

stale bot commented Feb 3, 2020

joejuzl commented Jan 28, 2021

Flexible handling of non-existent message_attribute for given training sample #4445

Flexible handling of non-existent message_attribute for given training sample #4445

Comments

JulianGerhard21 commented Sep 12, 2019

sara-tagger commented Sep 13, 2019 • edited by TyDunn Loading

JulianGerhard21 commented Sep 13, 2019

dakshvar22 commented Sep 13, 2019

JulianGerhard21 commented Sep 13, 2019

stale bot commented Feb 3, 2020

joejuzl commented Jan 28, 2021

sara-tagger commented Sep 13, 2019 •

edited by TyDunn

Loading