This repository contains a Bokeh app for exploring topic models generated with the Anchored Correlation Explanation (CorEx) package. The dataset is a collection of scientific abstracts related to energy storage; see here for more information.
The app is hosted on Heroku here. It runs on Heroku's free tier, so the server may take a moment to start up.
The preferred way to install the required packages is with Anaconda. In an Anaconda prompt, run conda env create -f environment.yml followed by conda activate corex-dashboard.
You can also install the requirements with pip by running python -m pip install -r requirements_local.txt. The file requirements.txt omits some packages because it is used when building for the Heroku page.
The dashboard can be used with the example dataset by default. Follow these instructions to use your own text data.
The folder example_data has example files that are used with the dashboard. All of the files in this folder are generated from input_data.csv, except for anchor_default.txt, which is an optional list of default anchor words to be used in generating models.
To start, create a folder called data and generate (e.g. by writing a script) an input_data.csv file in this folder. The first column of the csv file, called ID, is a unique integer index for each document. The additional columns are:
- title (required) - title of each document
- processed_text (required) - space separated list of strings corresponding to the text of each document
- url (optional) - A url to generate a hyperlink to a given text
- prob (optional) - In my case this field was generated by Microsoft Academic as a metric of how highly ranked a paper is in terms of citations, and was a number < 1 in exponential notation. It is used in one of the displays to show highly ranked papers in a given topic
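As an illustration of the column layout described above, here is a minimal sketch of a script that writes input_data.csv using only the Python standard library. The function name write_input_csv and the sample documents are hypothetical, and the column order after ID is an assumption; check it against example_data/input_data.csv before relying on it.

```python
import csv
from pathlib import Path

# Hypothetical sample documents; replace these with your own corpus.
# "url" and "prob" are optional and may be left out of any document.
documents = [
    {
        "title": "A study of lithium ion cathodes",
        "processed_text": "lithium ion cathode capacity cycling",
        "url": "https://example.org/paper-1",
        "prob": 3.2e-05,
    },
    {
        "title": "Solid electrolytes for batteries",
        "processed_text": "solid electrolyte conductivity interface",
    },
]


def write_input_csv(docs, path):
    """Write docs to a csv file with a unique integer ID as the first column."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed column order: ID first, then the required and optional columns.
    fieldnames = ["ID", "title", "processed_text", "url", "prob"]
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for doc_id, doc in enumerate(docs):
            # Missing optional fields are written as empty cells.
            writer.writerow({"ID": doc_id, **doc})
    return path
```

Calling write_input_csv(documents, "data/input_data.csv") would create the data folder if needed and write one row per document.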
Run python gendata.py to process the input texts into a form ready for the dashboard. This involves bigram generation with gensim, vectorization with CountVectorizer, and generating display text.
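To give a feel for what the bigram and vectorization steps produce, here is a standard-library-only sketch of the two ideas: joining frequent adjacent word pairs into single tokens, and building a document-term count matrix. This is not the code in gendata.py and it does not use gensim or CountVectorizer; the function names, the min_count threshold, and the sample texts are all assumptions made for illustration, and gensim scores candidate phrases differently from the simple count used here.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent word pairs that occur at least min_count times with '_'."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(zip(tokens, tokens[1:]))

    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        while i < len(tokens):
            is_pair = i + 1 < len(tokens)
            if is_pair and pair_counts[(tokens[i], tokens[i + 1])] >= min_count:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2  # skip both words of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


def count_matrix(docs):
    """Return a sorted vocabulary and one row of token counts per document."""
    vocab = sorted({tok for tokens in docs for tok in tokens})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for tokens in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(tokens).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows


# Hypothetical processed_text values (space-separated tokens).
texts = [
    "lithium ion battery cathode",
    "lithium ion transport in solid electrolyte",
    "solid electrolyte interface",
]
tokenized = [t.split() for t in texts]
merged = merge_frequent_bigrams(tokenized, min_count=2)
vocab, matrix = count_matrix(merged)
```

Here "lithium ion" and "solid electrolyte" each appear twice, so they become the single tokens lithium_ion and solid_electrolyte before counting.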
Run bokeh serve dashboard.py --show and a browser window should open showing the dashboard.
- Select a combination of unsupervised topics and anchor words, define a model name, and press 'Generate/Overwrite Model' to generate a model. A text display will indicate when the model has finished fitting. Refresh the page to populate the model selection dropdown with the new model.
  Anchor words for each topic should be separated by spaces, with each topic on its own line. See example_data/anchor_default.txt for how anchor words should be formatted. The anchor strength slider determines how tightly each topic is constrained to its anchor words. You can check whether a word is in your vocabulary with the 'Check word' input (press Enter and it will turn green or red).
- Press 'Generate graph' with a model selected. Note that this also populates the model generation controls with that model's data so you can tweak it.
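The anchor word format described above (one topic per line, words separated by spaces) can be parsed in a few lines. The sketch below is an illustration only, not the dashboard's own code: the function names, the anchor words, and the vocabulary are hypothetical and are not taken from example_data/anchor_default.txt.

```python
def parse_anchors(text):
    """Return one list of anchor words per non-empty line of text."""
    return [line.split() for line in text.splitlines() if line.strip()]


def missing_from_vocabulary(anchors, vocabulary):
    """For each topic, list the anchor words that are not in the vocabulary."""
    vocab = set(vocabulary)
    return [[word for word in topic if word not in vocab] for topic in anchors]


# Hypothetical anchor text: two topics, one per line.
anchor_text = "battery lithium cathode\nsolar photovoltaic\n"
anchors = parse_anchors(anchor_text)

# Hypothetical vocabulary; "photovoltaic" is deliberately absent.
vocabulary = ["battery", "lithium", "cathode", "solar"]
missing = missing_from_vocabulary(anchors, vocabulary)
```

Checking anchors against the vocabulary before fitting serves the same purpose as the 'Check word' input: it flags anchor words that do not appear in the vocabulary.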
The Jupyter notebook graph_plots.ipynb can also be used to explore and make plots from the generated models.