Running the tokeniser in parallel does not gain a lot from more cores? #10306

Multiprocessing introduces a lot of overhead, and it's likely that the overhead outweighs the gains when scaling from 16 to 64 processes.

In general, I've found that tokenizing with `nlp.pipe` and `n_process > 1` is slow. I haven't done detailed profiling (famous last words), but I strongly suspect it's due to the `Doc` serialization that happens under the hood. (Peter has been profiling the serialization, and this may get at least a little bit faster in the next release: #10250.)
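
For reference, here is a minimal sketch of the multiprocessing route being discussed; the blank pipeline, worker count, and batch size are illustrative assumptions, not values from this thread:

```python
import spacy

nlp = spacy.blank("en")  # blank English pipeline: tokenizer only
texts = ["This is a sentence."] * 100_000

# With n_process > 1, spaCy forks worker processes, and every resulting Doc
# is serialized in the worker and deserialized in the parent. That round trip
# is a large fixed cost when tokenization is the only work per text.
docs = list(nlp.pipe(texts, n_process=4, batch_size=1000))
```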

But just for tokenization, the `Doc` serialization is a lot of overhead. I've found that it's much faster to use `nlp.pipe(n_process=1)` with `multiprocessing.Pool` and just return the space-separated text rather than a `Doc` (or retu…
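
A minimal sketch of that `Pool`-based approach, assuming a blank English pipeline and one contiguous chunk of texts per worker; the worker count and chunking strategy are my own choices, not from the original answer:

```python
import multiprocessing

import spacy


def tokenize_chunk(texts):
    # Load a fresh pipeline inside each worker process; a blank pipeline
    # is enough when tokenization is the only step needed.
    nlp = spacy.blank("en")
    # n_process=1: the Pool already provides the parallelism, so spaCy
    # runs single-process inside each worker and no Doc objects are
    # serialized back to the parent, only plain strings.
    return [" ".join(tok.text for tok in doc) for doc in nlp.pipe(texts, n_process=1)]


if __name__ == "__main__":
    texts = ["This is a sentence.", "And another one."] * 50_000
    n_workers = 4
    # One contiguous chunk per worker keeps the output order easy to restore.
    size = len(texts) // n_workers + 1
    chunks = [texts[i : i + size] for i in range(0, len(texts), size)]
    with multiprocessing.Pool(n_workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    tokenized = [line for chunk in results for line in chunk]
    print(tokenized[0])  # -> "This is a sentence ."
```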

Answer selected by KennethEnevoldsen

Labels: feat / tokenizer (Feature: Tokenizer) · perf / speed (Performance: speed) · scaling (Scaling, serving and parallelizing spaCy)