I have a two-fold question, conceptual and technical.
Conceptual: In Spark NLP, the way to build BERT embeddings is to create a Pipeline whose stages are a DocumentAssembler, a SentenceDetector, a Tokenizer, a pretrained BertEmbeddings model, and an EmbeddingsFinisher. Building embeddings for every word in a document takes a lot of time and resources. If I wanted to build embeddings for just SOME of the words in the document, for example the most salient words by TF-IDF, would that be sensible? I think the answer to this question will take one of two forms:
No, that is not sensible. You must build embeddings for every word in the sentence to get the embedding of any particular word: without first building the embeddings for the other words in its context, BERT can't build the embedding for the word in question. Each embedding influences the others, so it's not sensible to build only a few in isolation.
OR
Yes, that is sensible. You can use the pretrained embeddings already available in the BERT model and build your particular word's embedding from those standard embeddings for that word and the words surrounding it in its context. You don't have to refine the pretrained embedding for every word; you can still get a reasonable amount of context information from those pretrained embeddings.
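For what it's worth, the conceptual question hinges on self-attention: each output vector is a weighted combination of every token in the sentence, so the embedding of one word changes when its context changes. Here is a toy sketch of that dependence (plain NumPy, random vectors standing in for real token representations; this is an illustration of the mechanism, not Spark NLP or BERT code):

```python
import numpy as np

def toy_self_attention(x):
    """One self-attention pass: each output row is a weighted
    average of ALL input rows, so every token's output depends
    on every other token in the sentence."""
    scores = x @ x.T                                # token-to-token similarities
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over the sentence
    return weights @ x

rng = np.random.default_rng(0)
sent = rng.normal(size=(5, 4))                      # 5 tokens, 4-dim vectors

full = toy_self_attention(sent)                     # run the whole sentence
short = toy_self_attention(sent[:3])                # same sentence, last 2 tokens dropped

# Token 0's embedding changes when its context changes:
print(np.allclose(full[0], short[0]))               # False
```

This suggests the first answer is closer to the truth: you can't compute a contextual embedding for one word without feeding the model its whole sentence, though you can discard the outputs you don't need afterwards.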
Technical: Given that the pipeline is set up the way it is and we don't see the intermediate steps (for example, the pipeline doesn't output the list of tokens found by the Tokenizer before moving on to the next stage), if I wanted to build embeddings for just SOME of the words in a document, how would I go about doing that? Maybe someone has already done this somehow and an example exists somewhere.
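On the technical side, one thing worth noting: the intermediate steps are not actually hidden. Each Spark NLP annotator writes its own output column on the DataFrame (e.g. "token", "embeddings"), so after `pipeline.fit(df).transform(df)` you can inspect the tokens directly, for example via `result.select("token.result")`. Since the model must still see the whole sentence, the simplest route is to compute all embeddings and keep only the ones for your salient words. A plain-Python sketch of that post-filter, assuming you have already collected (token, vector) pairs from the output DataFrame (the sample data and the `salient` set are hypothetical placeholders):

```python
# Hypothetical collected output: one (token, vector) pair per token, e.g.
# built by zipping result.select("token.result") with the finished embeddings.
rows = [
    ("the",   [0.1, 0.2]),
    ("quick", [0.3, 0.1]),
    ("fox",   [0.5, 0.9]),
    ("jumps", [0.2, 0.4]),
]

# Words chosen beforehand, e.g. the top-scoring words by TF-IDF
salient = {"quick", "fox"}

# Keep only the embeddings for the salient words, discard the rest
kept = {tok: vec for tok, vec in rows if tok in salient}
print(sorted(kept))                                 # ['fox', 'quick']
```

The filtering could equally be done on the DataFrame itself before collecting, which avoids pulling unneeded vectors to the driver; either way, the saving is in storage and downstream work, not in the BERT forward pass.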
Thanks for your input!