The CBoW model architecture tries to predict the current target word (the center word) based on
the source context words (surrounding words). Consider the simple sentence
"the quick brown fox jumps over the lazy dog". With a context window of size 2
(one word on each side of the target), we can form (context_window, target_word)
pairs such as ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on.
Thus, the model tries to predict the target_word based on the context_window words.
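
As a quick illustration, here is a minimal sketch (not the notebook's own code) of how such (context_window, target_word) pairs can be generated from the example sentence:

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 1  # one context word on each side, i.e. a context window of size 2

pairs = []
for i, target in enumerate(sentence):
    # gather the words within `window` positions to the left and right of the target
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[:3])
# [(['quick'], 'the'), (['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown')]
```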
We will first introduce the Continuous Bag of Words (CBoW) model in this notebook, then implement it on a small dataset of text from Shakespeare's works and create word embeddings for a few words.
We will then use pre-trained word embeddings from Google's standard word2vec implementation and show how to perform PCA (Principal Component Analysis) on the embeddings. We also show how to perform logical comparisons and language translation using word embeddings.
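
As a rough sketch of what loading the pre-trained vectors and projecting them with PCA can look like (the model file name and the gensim/scikit-learn calls are assumptions, not the notebook's exact code):

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

# Assumes Google's pre-trained word2vec vectors have been downloaded locally;
# the file name below is the usual distribution name.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

words = ["king", "queen", "man", "woman"]
embeddings = np.array([vectors[w] for w in words])  # shape (4, 300)

# project the 300-dimensional embeddings down to 2 components for plotting
reduced = PCA(n_components=2).fit_transform(embeddings)
for word, point in zip(words, reduced):
    print(word, point)
```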
The overall pipeline consists of the following steps: