NYT-Based News Tagger

A labeller for news articles, trained on the NYT Annotated Corpus by Jasmin Rubinovitz as part of the MIT Media Lab SuperGlue project. Give it the clean text of a story (i.e. no HTML content), and it returns various descriptors and taxonomic classifiers based on models trained on the tagging in the NYT corpus.

We use it in the Media Cloud project to automatically tag every news story with the themes we think it is about.
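Because the tagger expects clean text rather than markup, a story scraped as HTML needs its tags removed first. The following is a minimal sketch using only the Python standard library. The names `strip_html` and `_TextExtractor` are hypothetical helpers for illustration, not part of this project.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left behind by tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw_html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

This is deliberately simple: it does not try to separate the article body from navigation or ads, so a dedicated content extractor may be a better fit for full web pages.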

Installation

This is built with Python.

On OSX I had to install hdf5 first with brew: brew install hdf5.

Install the Python dependencies in a virtualenv:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

You also need the pre-trained word2vec Google News vectors and the NYTLabels model. Run download_models.py to get them.

Lastly, you'll need the punkt dataset from NLTK data:

python -m nltk.downloader -d /usr/local/share/nltk_data punkt

Usage

Run run.sh, or gunicorn app:app -t 900, and then visit localhost:8000/ to try it out.

Note: the service uses about 5GB of memory while running, because it keeps all the models loaded.
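To confirm from a script that the server is answering on localhost:8000/, a small check like the one below may help. It is a sketch using only the standard library. The function name `service_is_up` is hypothetical, and it only tests that the root page responds with HTTP 200, not that the models have loaded correctly.

```python
import urllib.error
import urllib.request


def service_is_up(url="http://localhost:8000/", timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `service_is_up()` with no arguments checks the default address given above.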

Deploying

This is built to deploy on Dokku. Set the WORKERS environment variable to set how many workers gunicorn starts with.
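One way a WORKERS variable could be turned into a gunicorn worker count is through a gunicorn configuration file, sketched below. This is an assumption for illustration: the repository may instead pass the value on the command line in run.sh or the Procfile. The file name gunicorn.conf.py and the helper `workers_from_env` are hypothetical, while `workers` and `timeout` are standard gunicorn settings.

```python
import os


def workers_from_env(default=1):
    """Read the worker count from WORKERS, falling back to `default`."""
    raw = os.environ.get("WORKERS", "").strip()
    try:
        count = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric values fall back to the default.
        return default
    return count if count > 0 else default


# gunicorn reads these module-level names when the file is passed with -c.
workers = workers_from_env()
timeout = 900  # same value as the -t 900 flag shown under Usage
```

With such a file, the server would be started as `gunicorn -c gunicorn.conf.py app:app`. On Dokku the variable can be set with `dokku config:set <app> WORKERS=2`, where `<app>` stands for your app name. If each worker loads its own copy of the models, memory use may grow roughly in proportion to the worker count.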
