PArallelized SHared memory map operations on large data sets

philsmt/pasha

EXtra-pasha

pasha (parallelized shared memory) provides tools to process data in parallel, with an emphasis on shared memory and zero copy. It uses the map pattern, similar to Python's built-in map() function, where a callable is applied to potentially many elements in a collection. To avoid the high cost of IPC or other communication schemes, results are meant to be written directly to memory shared between all workers and the calling site. The current implementations distribute work across threads and processes on a single node.

Quick guide

To use it, simply import it, define your kernel function of choice and map away!

import numpy as np
import pasha as psh

# Get some random input data
inp = np.random.rand(100)

# Allocate output array via EXtra-pasha.
outp = psh.array(100)

# Define a kernel function multiplying each value by 3.
def triple_it(worker_id, index, value):
    outp[index] = 3 * value

# Map the kernel function.
psh.map(triple_it, inp)

# Check the result
np.testing.assert_allclose(outp, inp*3)

The runtime environment is controlled via a so-called map context. The default context object is ProcessContext, which uses multiprocessing.Pool to distribute the work across several processes. With this context, the output array returned by array() resides in shared memory, so the worker processes can modify it without copying anything around. This context only works on *nix systems supporting the fork() system call, as it expects any input data to be shared with the workers.
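To illustrate why fork() and shared memory matter here, the following is a minimal standard-library sketch of the same idea, not pasha's actual implementation. It swaps multiprocessing.Pool for bare Process objects to stay short, and the names `worker` and `forked_map` are hypothetical helpers introduced only for this example. The output buffer is an anonymous mmap, which forked children share with the parent, while the input list is simply inherited through fork().

```python
import mmap
import multiprocessing as mp

N = 100
inp = [0.5 * i for i in range(N)]

# An anonymous mmap is shared (not copy-on-write) across fork() on Unix,
# so writes made by child processes are visible to the parent.
shared = mmap.mmap(-1, N * 8)            # room for N float64 values
outp = memoryview(shared).cast('d')      # view the bytes as a flat array of doubles


def triple_it(worker_id, index, value):
    # Kernel with the same signature as in the quick guide.
    outp[index] = 3 * value


def worker(worker_id, num_workers):
    # Hypothetical work split: each worker takes every num_workers-th element.
    for index in range(worker_id, N, num_workers):
        triple_it(worker_id, index, inp[index])


def forked_map(num_workers=4):
    # The 'fork' start method is unavailable on Windows.
    ctx = mp.get_context('fork')
    procs = [ctx.Process(target=worker, args=(w, num_workers))
             for w in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


forked_map()
```

Nothing is pickled or sent back from the workers: the parent reads the results straight out of `outp` once the children have exited.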

You may either create an explicit context object and use it directly or change the default context, e.g.

psh.set_default_context('threads', num_workers=4)

Three context types are built in: serial, threads and processes.
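Because the kernel only ever sees a worker ID, an index and a value, the same kernel can run unchanged under different contexts. The sketch below shows this with two stand-in map functions written against the standard library; `serial_map` and `threaded_map` are hypothetical names for illustration and are not part of pasha.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_map(kernel, values):
    # Everything runs in the calling thread as worker 0.
    for index, value in enumerate(values):
        kernel(0, index, value)


def threaded_map(kernel, values, num_workers=4):
    def run_worker(worker_id):
        # Each worker takes every num_workers-th element.
        for index in range(worker_id, len(values), num_workers):
            kernel(worker_id, index, values[index])

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Consuming the iterator re-raises any exception from a worker.
        list(pool.map(run_worker, range(num_workers)))


inp = [float(i) for i in range(100)]
outp = [0.0] * len(inp)   # threads share the process memory, a plain list suffices


def triple_it(worker_id, index, value):
    outp[index] = 3 * value


threaded_map(triple_it, inp)
```

With threads, any ordinary object is already shared between workers; only the process-based context needs an output buffer placed in shared memory.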

The input array passed to map() is called a functor and is automatically wrapped in a suitable Functor object, here SequenceFunctor. This works for a number of common array and collection types, but you may also implement your own Functor object to wrap anything else. For example, there is built-in support for DataCollection and KeyData objects from the EXtra-data toolkit accessing run files from the European XFEL facility:

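import extra_data
import pasha as psh

def analysis_kernel(worker_id, index, train_id, data):
    # Do something with the data and save it to shared memory.
    pass

run = extra_data.open_run(proposal=700000, run=1)
# source and key are placeholders for the source and key names to read.
psh.map(analysis_kernel, run[source, key])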
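The role of a functor can be sketched as an object that knows how to divide its contents between workers and how to turn each piece into kernel arguments. The example below is a self-contained illustration under that assumption. The class `DictFunctor`, its methods `split` and `iterate`, and the driver `serial_map` are hypothetical and may not match pasha's real Functor interface.

```python
class DictFunctor:
    """Hypothetical functor wrapping a dict; not pasha's actual interface."""

    def __init__(self, mapping):
        self.items = list(mapping.items())

    def split(self, num_workers):
        # Divide the item positions into at most num_workers contiguous shares.
        n = len(self.items)
        step = max(1, -(-n // num_workers))   # ceiling division, at least 1
        return [range(start, min(start + step, n))
                for start in range(0, n, step)]

    def iterate(self, share):
        # Yield the arguments passed to the kernel after worker_id.
        for index in share:
            key, value = self.items[index]
            yield index, key, value


def serial_map(kernel, functor, num_workers=2):
    # Stand-in driver: processes each share in turn, as if by separate workers.
    for worker_id, share in enumerate(functor.split(num_workers)):
        for args in functor.iterate(share):
            kernel(worker_id, *args)


results = {}


def double_it(worker_id, index, key, value):
    results[key] = (index, 2 * value)


serial_map(double_it, DictFunctor({'a': 1, 'b': 2, 'c': 3}))
```

The kernel here receives an extra argument per element (the key), in the same way the EXtra-data kernel above receives both a train ID and the data.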
