Parallelize nltk CoreNLP parser in simple way #2

Closed
ajratner opened this issue Feb 26, 2016 · 7 comments

@ajratner
Contributor

Emphasis on simple: this is not going to be an optimal preprocessing setup either way; we just want to make it a bit better through simple means that don't require any additional installs, configs, etc.

@ajratner
Contributor Author

See branch multicore

@chrismre

Why not use Python's multiprocessing or something? Also, we could potentially put in a small Hadoop/Spark connector (if people wanted to hit AWS?). @netj @raphaelhoffmann @alldefector
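
For reference, a minimal sketch of what the multiprocessing route could look like: one NLTK Stanford parser (and its Java subprocess) per worker, fanned out with a Pool. The `parse_doc` helper and the no-argument `StanfordParser()` construction (jar locations assumed to come from the usual environment variables) are illustrative assumptions, not code from this repo.

```python
import multiprocessing as mp

from nltk.parse.stanford import StanfordParser

def init_worker():
    # Each worker holds its own parser, so each gets its own Java process.
    # Jar/model locations are assumed to be set via CLASSPATH/STANFORD_PARSER.
    global _parser
    _parser = StanfordParser()

def parse_doc(text):
    # Hypothetical helper: parse one document (one sentence per line)
    # and return the list of parse trees for each sentence.
    return [list(trees) for trees in _parser.raw_parse_sents(text.splitlines())]

if __name__ == '__main__':
    docs = ["A first document.\nIt has two sentences.", "A second document."]
    pool = mp.Pool(processes=mp.cpu_count(), initializer=init_worker)
    parsed = pool.map(parse_doc, docs)
    pool.close()
    pool.join()
```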

@alldefector
Contributor

With the current NN/SR model and the notebook docs, the Java process seems to take 200-900 MB of memory during parsing (as opposed to 4 GB with an old model), so spawning one process per core should be fine for a typical laptop.

A few months ago we also had a simple HTTP service wrapper for the parser:
https://github.com/HazyResearch/bazaar/blob/master/parser/src/main/scala/com/clearcut/nlp/Server.scala
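
(If that service route were revived, the client side would stay tiny. The port, path, and response format below are guesses for illustration, not the actual Server.scala API.)

```python
import requests  # third-party HTTP client

def parse_via_http(text, url="http://localhost:8080/parse"):
    # Hypothetical endpoint and payload: POST raw document text,
    # get back the parser's per-sentence output as JSON.
    resp = requests.post(url, data=text.encode("utf-8"))
    resp.raise_for_status()
    return resp.json()
```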

@alldefector
Contributor

http://www.nltk.org/_modules/nltk/parse/stanford.html

Looks like the NLTK wrapper is actually using some old model that probably has a throughput of about 1 sentence/sec, as opposed to the SR/NN models that do around 100 sentences/sec...

The StanfordNeuralDependencyParser class ought to address that.
http://nlp.stanford.edu/software/nndep.shtml
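
If someone wants to try it, here is a sketch against an NLTK release from that era; the jar names, memory setting, and example sentence are placeholders, not a tested configuration.

```python
from nltk.parse.stanford import StanfordNeuralDependencyParser

# Placeholder paths; NLTK can also locate the jars via the CLASSPATH env var.
parser = StanfordNeuralDependencyParser(
    path_to_jar='stanford-corenlp-3.6.0.jar',
    path_to_models_jar='stanford-corenlp-3.6.0-models.jar',
    java_options='-mx2g',
)

# raw_parse yields DependencyGraph objects for the input sentence.
graph = next(parser.raw_parse('The switch to the NN model should speed things up.'))
print(graph.to_conll(4))
```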

@ajratner
Contributor Author

Yeah, I actually did set up a simple queue-based multiprocessing version on the 'multicore' branch (see comment on the issue), but I think there's still a bug; either way, someone could work from that.

The switch to SR/NN was what made the huge difference in our normal pipeline when we did that, so this is probably the easiest gain to get...
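
(Not the multicore branch code itself, just a sketch of the queue-based pattern being described, with hypothetical helper names: each worker owns one parser/Java process and pulls documents off a shared queue.)

```python
import multiprocessing as mp

from nltk.parse.stanford import StanfordParser

SENTINEL = None  # shutdown marker

def worker(in_q, out_q):
    parser = StanfordParser()  # one parser (and Java process) per worker
    for doc_id, text in iter(in_q.get, SENTINEL):
        trees = [list(t) for t in parser.raw_parse_sents(text.splitlines())]
        out_q.put((doc_id, trees))

def parse_corpus(docs, n_procs=None):
    n_procs = n_procs or mp.cpu_count()
    in_q, out_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(n_procs)]
    for w in workers:
        w.start()
    for item in enumerate(docs):
        in_q.put(item)
    for _ in workers:
        in_q.put(SENTINEL)  # one shutdown marker per worker
    results = dict(out_q.get() for _ in docs)
    for w in workers:
        w.join()
    return results
```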

@chrismre

Let's definitely make that change!

@henryre henryre modified the milestones: DeepDive Lite 0.3, DeepDive Lite 0.1 Mar 29, 2016
@ajratner ajratner removed this from the DeepDive Lite 0.3 milestone Jun 6, 2016
@ajratner
Contributor Author

Subset of #228
