aspitarl/corex_dashboard

Corex Dashboard

This repository contains a Bokeh app for exploring topic models generated with the Anchored Correlation Explanation (CorEx) package. The dataset is a collection of scientific abstracts related to energy storage; see here for more information.

The app is hosted on Heroku here. It runs on Heroku's free tier, so the server may take a moment to start up.

Using locally with your own dataset

Installation

The preferred way to install the required packages is with Anaconda. In an Anaconda prompt, run conda env create -f environment.yml followed by conda activate corex-dashboard.

You can also install the requirements with pip using python -m pip install -r requirements_local.txt. Note that requirements.txt omits some packages because it is used when building for the Heroku page.

Preparing input data

The dashboard can be used with the example dataset by default. Follow these instructions to use your own text data.

The folder example_data has example files that are used with the dashboard. All of the files in this folder are generated from input_data.csv, except for anchor_default.txt which is an optional list of default anchor words to be used in generating models.

To start, create a folder called data and generate an input_data.csv file in this folder (e.g. by writing a script). The first column of the csv file, called ID, holds a unique integer index for each document. The additional columns are:

  • title (required) - title of each document
  • processed_text (required) - space separated list of strings corresponding to the text of each document
  • url (optional) - A url to generate a hyperlink to a given text
  • prob (optional) - In my case this field was generated by Microsoft Academic as a metric of how highly ranked a paper is in terms of citations, and was a number < 1 in exponential notation. It is used in one of the displays to show highly ranked papers in a given topic.
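
As a rough illustration of the file layout described above, here is a minimal sketch that writes an input_data.csv using only the Python standard library. The function name write_input_csv, the shape of the input dictionaries, and the example documents are all made up for illustration. Leaving the optional columns blank (rather than omitting them) is an assumption that has not been checked against gendata.py.

```python
import csv
from pathlib import Path

# Column names taken from the list above; ID comes first.
COLUMNS = ["ID", "title", "processed_text", "url", "prob"]


def write_input_csv(documents, out_path="data/input_data.csv"):
    """Write a list of document dicts to a csv with the columns listed above.

    Hypothetical helper: each dict needs "title" and "tokens" (a list of
    strings); "url" and "prob" are optional.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for doc_id, doc in enumerate(documents):
            writer.writerow({
                "ID": doc_id,  # unique integer index, one per document
                "title": doc["title"],
                # processed_text is a space separated list of strings
                "processed_text": " ".join(doc["tokens"]),
                # Assumption: optional fields are left blank when absent
                "url": doc.get("url", ""),
                "prob": doc.get("prob", ""),
            })
    return out_path


# Made-up documents, for illustration only
example_documents = [
    {
        "title": "Lithium ion battery degradation",
        "tokens": ["lithium", "ion", "battery", "degradation"],
        "url": "https://example.com/paper-1",
        "prob": 1.2e-5,
    },
    {
        "title": "Pumped hydro storage review",
        "tokens": ["pumped", "hydro", "storage", "review"],
    },
]
```

Calling write_input_csv(example_documents) from the repository root would create data/input_data.csv with one row per document.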

Processing the data

Run python gendata.py to process the input texts into a form ready for the dashboard. This involves bigram generation with gensim, vectorization with CountVectorizer, and generating display text.

Running the dashboard

Run bokeh serve dashboard.py --show and a browser window should open showing the dashboard.

  1. Select a combination of unsupervised topics and anchor words, define a model name, and press 'Generate/Overwrite Model' to generate a model. A text display will indicate when the model is done being fit. Then refresh the page to populate the model selection dropdown with the new model.

    Anchor words for each topic should be separated by spaces, and each topic separated by a newline. See example_data/anchor_default.txt for how anchor words should be formatted. The anchor strength slider determines how tightly each topic is constrained to the anchor words. You can check if a word is in your vocabulary with the 'Check word' input (press enter and it will turn green or red).

  2. Select a model and press 'Generate graph'. Note that this also populates the model generation controls with that model's settings so they can be tweaked.
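
To make the anchor word format from step 1 concrete (words separated by spaces, one topic per line), here is a minimal sketch of reading and writing that format. The helper names and the anchor words are made up for illustration and are not taken from example_data/anchor_default.txt. Whether the dashboard parses the text this way internally, including how it treats blank lines, is an assumption.

```python
def parse_anchor_text(text):
    """Turn anchor text into a list of topics, each a list of anchor words."""
    topics = []
    for line in text.splitlines():
        words = line.split()  # words within a topic are separated by spaces
        if words:  # assumption: blank lines are ignored rather than kept as empty topics
            topics.append(words)
    return topics


def format_anchor_text(topics):
    """Inverse of parse_anchor_text: one topic per line, words joined by spaces."""
    return "\n".join(" ".join(words) for words in topics)


# Made-up anchor words: three topics, two words each
example_text = "battery lithium\nhydrogen fuel_cell\nsolar thermal"
anchors = parse_anchor_text(example_text)
print(anchors)
# [['battery', 'lithium'], ['hydrogen', 'fuel_cell'], ['solar', 'thermal']]
```

A list of word lists like this is also the shape that the corextopic package accepts for the anchors argument when fitting a model, so the parsed result maps directly onto one list of anchors per topic.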

The Jupyter notebook graph_plots.ipynb can also be used to explore and make plots from the generated models.