Vowpal Wabbit LDA
If you try to run the commands below, you may run into issues with your dev environment if you don't have some Boost libraries installed.
If you already have Homebrew, installing the following packages in advance might save you trouble.
brew install libtool
brew install automake
brew install boost
Then we can get the Vowpal Wabbit source and install it.
git clone git://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make install
Download the train.tsv file into your src/lesson15 dir.
cd GADS7/src/lesson15
wget https://www.dropbox.com/s/wuagevrmu2jzq2h/train.tsv
We need to transform the raw text into something vw can understand.
python parse_to_vw.py train.tsv
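For reference, VW's LDA mode reads unlabeled examples, one document per line, in the form "| word:count word:count ...". The snippet below is only a minimal sketch of such a converter, not the provided parse_to_vw.py; it assumes the document text sits in the last tab-separated column of train.tsv (check your file's layout) and writes a file named data, matching the -d flag used further down.

import re
import sys
from collections import Counter

def to_vw_line(text):
    # crude lowercase tokenizer; vw feature names must not contain '|' or ':'
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return "| " + " ".join("%s:%d" % (w, c) for w, c in counts.items())

with open(sys.argv[1]) as f_in, open("data", "w") as f_out:
    next(f_in)  # skip the header row, assuming train.tsv has one
    for row in f_in:
        # ASSUMPTION: the document text is the last tab-separated column
        text = row.rstrip("\n").split("\t")[-1]
        f_out.write(to_vw_line(text) + "\n")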
Vowpal Wabbit has implementations of logistic and linear regression, but also a very fast online LDA implementation. The output is difficult to parse, but the model produces good topics very quickly.
# lda: number of topics, required for LDA mode; it was missing from the
#   original command, and 20 here is an arbitrary choice
# lda_alpha, lda_rho: hyperparameters for the Dirichlet priors
# lda_D: total number of documents in the corpus
# b 16: hash features into 2^16 buckets
# p: topic-document distributions; readable_model: topic-word distributions
vw -d data \
  --lda 20 \
  --lda_alpha 0.1 \
  --lda_rho 0.1 \
  --lda_D 1980686 \
  --minibatch 256 \
  --power_t 0.5 \
  --initial_t 1 \
  -b 16 \
  -p vw-predictions.dat \
  --readable_model vw-topics.dat
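To inspect the per-document topic mixtures, here is a small sketch. It assumes each line of vw-predictions.dat carries one unnormalized weight per topic for the corresponding input document (verify against your vw version's output), and normalizes each row into topic proportions.

n_topics = 20  # must match the --lda value used above

with open("vw-predictions.dat") as f:
    for doc_id, line in enumerate(f):
        # take the first n_topics fields; some vw versions append a tag
        weights = [float(x) for x in line.split()[:n_topics]]
        total = sum(weights) or 1.0
        props = [w / total for w in weights]
        top = max(range(n_topics), key=lambda k: props[k])
        print("doc %d: top topic %d (%.2f)" % (doc_id, top, props[top]))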
In vw-topics.dat we have the unnormalized topic-word distributions. Each row represents a word and each column a topic. To find the top words in a topic, we simply sort a column by the highest values.
One way to evaluate a topic model is to determine if the top words in a topic form a coherent collection.
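As a sketch of that sorting step: assuming each non-header row of vw-topics.dat is a feature index followed by one weight per topic (the exact layout varies between vw versions), the code below prints the ten highest-weighted feature ids per topic. The ids are hashed, so to see actual words either re-hash your vocabulary the same way vw does, or rerun vw with --invert_hash in place of --readable_model, which writes feature names directly.

n_topics = 20  # must match the --lda value used above

rows = []
with open("vw-topics.dat") as f:
    for line in f:
        parts = line.replace(":", " ").split()
        # keep only rows that look like: index w0 w1 ... w(K-1)
        if len(parts) == n_topics + 1 and parts[0].isdigit():
            rows.append((int(parts[0]), [float(x) for x in parts[1:]]))

for k in range(n_topics):
    top = sorted(rows, key=lambda r: r[1][k], reverse=True)[:10]
    print("topic %d:" % k, [idx for idx, _ in top])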