Running the tokeniser in parallel does not gain a lot from more cores? #10306

Multiprocessing introduces a lot of overhead, and it's likely that the overhead outweighs the gains when scaling from 16 to 64 processes.

In general, I've found that tokenizing with `nlp.pipe` and `n_process > 1` is slow. I haven't done detailed profiling (famous last words), but I strongly suspect it's due to the `Doc` serialization that happens under the hood. (Peter has been profiling the serialization, and this may get at least a little bit faster in the next release: #10250.)
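
For reference, here is a minimal sketch of the multiprocessing route being discussed; the blank pipeline, worker count, and batch size are illustrative assumptions, not values from this thread:

```python
import spacy

nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
texts = ["This is a sentence."] * 100_000

# With n_process > 1, spaCy forks worker processes, and every resulting Doc
# is serialized in the worker and deserialized in the parent. That round trip
# is a large fixed cost when tokenization is the only work per text.
docs = list(nlp.pipe(texts, n_process=4, batch_size=1000))
```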

But just for tokenization, the `Doc` serialization is a lot of overhead. I've found that it's much faster to use `nlp.pipe(n_process=1)` with `multiprocessing.Pool` and just return the space-separated text rather than a `Doc` (or retu…
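
A minimal sketch of that `Pool`-based approach, assuming a blank English pipeline and one contiguous chunk of texts per worker; the worker count and chunking strategy are my own choices, not from the original answer:

```python
import multiprocessing

import spacy


def tokenize_chunk(texts):
    # Load a fresh pipeline inside each worker process; a blank pipeline
    # is enough when tokenization is the only step needed.
    nlp = spacy.blank("en")
    # n_process=1: the Pool already provides the parallelism, so spaCy
    # runs single-process inside each worker and no Doc objects are
    # serialized back to the parent, only plain strings.
    return [" ".join(tok.text for tok in doc) for doc in nlp.pipe(texts, n_process=1)]


if __name__ == "__main__":
    texts = ["This is a sentence.", "And another one."] * 50_000
    n_workers = 4
    # One contiguous chunk per worker keeps the output order easy to restore.
    size = len(texts) // n_workers + 1
    chunks = [texts[i : i + size] for i in range(0, len(texts), size)]
    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    tokenized = [line for chunk in results for line in chunk]
    print(tokenized[0])  # -> "This is a sentence ."
```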

Answer selected by KennethEnevoldsen

Labels: feat / tokenizer (Feature: Tokenizer) · perf / speed (Performance: speed) · scaling (Scaling, serving and parallelizing spaCy)