support shared-memory parallelism (multithreading) #1790

Closed
stevengj opened this issue Dec 20, 2012 · 104 comments
Labels
domain:multithreading (Base.Threads and related functionality), domain:parallelism (Parallel or distributed computation)

Comments

@stevengj
Member

I realize that this would require major effort, but I'm hoping that you will keep this on your radar screen: the lack of native support for shared-memory parallelism is a huge shortcoming of Julia.

The reason to support shared-memory parallelism is that it is usually far, far easier to program than distributed-memory parallelism. Especially if you adopt the same infrastructure (already in gcc) as OpenMP or Cilk+, in which the programmer indicates which function calls and loops can be executed in parallel and the runtime system decides how many threads to spawn and how to distribute the load (e.g. by work stealing to parallelize irregular problems, as in Cilk).

Julia is attractive because it combines ease of programming with performance near that of C, but the latter advantage evaporates when you can simply put "#pragma parallel for" on your C loop and get a 10x speedup on your 12-core machine. Distributed arrays cannot compete with this in ease of use, especially for problems that require extensive inter-processor communication.
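
To make the ease-of-use comparison concrete, here is a rough sketch of what such an annotation could look like in Julia; the @threads macro and the Base.Threads module are hypothetical here (nothing like them exists in Julia today), but the loop body is ordinary serial code:

```julia
using Base.Threads   # hypothetical threading module; illustrative only

# Annotate the loop and let the runtime split iterations across cores;
# the arrays stay ordinary Arrays, with no data distribution to set up.
function axpy!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @threads for i in 1:length(x)
        y[i] += a * x[i]
    end
    return y
end
```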

(Using the OpenMP runtime system is also nice in that this way the runtime system will share the same worker threads with C libraries like OpenBLAS and FFTW, so that Julia and external C code won't fight over the CPUs.)

@JeffBezanson
Sponsor Member

You certainly have a point here. Composability with parallel libraries is especially interesting. Work is underway to provide OpenMP support in LLVM, and if that happens it will be easier to interface with the openmp runtime. Otherwise, would it be sufficient to target a particular popular implementation like libgomp?

At a higher level, it is true that programming distributed memory is much harder, but our thinking has been that once you have a distributed-memory implementation it is relatively easy to make it take advantage of shared memory for performance without changing the programming model. Then if you can somehow make distributed-memory programming easier, everything will be fine.

It would be very valuable to have shared-memory parallelism available, but it would not be 100% satisfying just to tack on another parallel programming model. We could take the approach of adding runtime support, then punting the rest to users or future API designs, though.

@ViralBShah
Member

It is fairly easy to support the equivalent of pragma parallel for in julia, and we already have @parallel (reduce) for that. It could certainly be made faster on shared memory. The current inconvenience with @parallel is that it works with darrays and multiple julia processes, and our darray implementation still has a ways to go. Making @parallel work on regular sequential arrays would give a lot of immediate parallelism to a lot of folks, without modifying the rest of their code. Of course, that would mean having some kind of threading model within julia.
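
For reference, the existing reduction form looks roughly like this (a sketch; it splits the range across worker processes started with addprocs, not across threads):

```julia
addprocs(4)   # worker processes; @parallel divides the range among them

# Sum a million random draws, combining per-worker partial sums with (+).
total = @parallel (+) for i = 1:1_000_000
    rand()
end
```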

@ViralBShah
Member

Is the LLVM blocks runtime (used by Apple in Grand Central Dispatch) worth considering as an alternative to openmp?

http://libdispatch.macosforge.org
http://compiler-rt.llvm.org
http://llvm.org/svn/llvm-project/compiler-rt/trunk/BlocksRuntime/

@stevengj
Member Author

Yes, the requirement to use distributed arrays and multiple processes is the whole difficulty -- the key advantage of shared memory is that you have only a single kind of memory to work with. This becomes even more important when you go beyond simple data-parallel circumstances. e.g. parallelizing a sparse-matrix multiply, or a finite-difference time-domain simulation, or an FFT, are all fairly easy in shared memory [at least, before you get into heavy optimization], but are a lot more complicated in distributed memory. And the history of computer science is littered with the corpses of languages that tried to make distributed-memory programming as transparent as shared-memory.
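
To make the gap concrete: in shared memory a finite-difference update is just an annotated loop over the grid (sketched below, again assuming some parallel-for construct that Julia doesn't have yet), whereas in distributed memory the same kernel needs a partitioning decision plus explicit ghost-cell exchange.

```julia
using Base.Threads   # assumes a parallel-for construct becomes available

# One explicit time step of the 1-D heat equation: each iteration reads only
# its neighbors, so the grid can be split across cores with no communication code.
function heat_step!(unew::Vector{Float64}, u::Vector{Float64}, r::Float64)
    @threads for i in 2:length(u)-1
        unew[i] = u[i] + r * (u[i-1] - 2u[i] + u[i+1])
    end
    return unew
end
```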

OpenMP seems to be by far the most popular technique for parallelizing compiled C/C++/Fortran code on shared-memory systems, so there is a lot to be said for playing nicely with the OpenMP runtime. But the runtime system may be something of an implementation detail that in principle could be switched at compile time (or even at runtime). [This may be easier said than done; if you have to pick one, I would go with OpenMP.]

As I understand it, the main issue right now is that Julia's garbage collection is not multithreaded. As a first step, it wouldn't be so terrible if garbage collection were serialized in a multithreaded program, with the understanding that people writing multithreaded code should try to optimize things so that garbage collection can be deferred to happen infrequently.

As Jeff says, I think the key thing is to support shared memory in the runtime with some kind of OpenMP-like semantics via low-level primitives. Julia's macro system is flexible enough that people can then experiment with different syntaxes (although you will probably eventually want to standardize on one syntax to build into Base).
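
For instance, a package could prototype its own surface syntax as a thin macro over whatever low-level primitive the runtime ends up exposing; in this sketch the primitive name Base.Threads.@threads is purely an assumption about what Base might provide, and @parfor is an invented name:

```julia
# User-level syntax experiment: @parfor just forwards the loop to an assumed
# low-level runtime primitive, so the surface syntax can evolve independently.
macro parfor(loop)
    return esc(:(Base.Threads.@threads $loop))
end

a = zeros(100)
@parfor for i in 1:100
    a[i] = sqrt(i)
end
```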

@timholy
Sponsor Member

timholy commented Dec 20, 2012

Here's an old issue on a related topic:
#1002

For the record, I'd love this too, but not if it creates disasters.

@ViralBShah
Member

Intel recently open-sourced its OpenMP library under the BSD.

http://www.openmprtl.org

@ViralBShah
Member

OpenMP support is forthcoming in clang - although the approach taken is to do it in the clang frontend rather than in the LLVM IR. See the presentation linked to in this article.

http://www.phoronix.com/scan.php?page=news_item&px=MTM2NjE

@stevengj
Member Author

Since Julia is unlikely to adopt the OpenMP #pragma syntax (I hope), the Clang OpenMP support should not really be relevant; what matters is access to the runtime library.

@ViralBShah
Member

Right, Julia is unlikely to adopt the OpenMP syntax, but clang having OpenMP capability makes it possible for us to compile FFTW and OpenBLAS with OpenMP on Mac, where we use clang as the default compiler. This is nice to have, when it happens.

@stevengj
Member Author

Last I checked, it was a bad idea to compile FFTW with clang; gcc gave significantly better performance. I don't know about OpenBLAS. For your precompiled Mac binaries, you might want to look into this.

(Fortunately, the Intel OpenMP library is supposedly ABI-compatible with libgomp, so you can hopefully compile FFTW with gcc and --enable-openmp and still use the Intel runtime.)

@bsilbaugh
Contributor

It seems that perhaps there are really three distinct (but related) issues being raised here:

  1. Leveraging shared-memory communication on multi-core machines.
  2. Supporting a thread-based programming model.
  3. Compatibility with OpenMP libraries.

In regard to issue 1, I don't see why Julia would need to change its programming model to support shared-memory communication. (Perhaps I've just revealed my own ignorance regarding Julia.) For example, MPI was designed with distributed computing in mind; however, many MPI implementations (e.g. OpenMPI) are still able to pass messages directly in memory when the sender and receiver are on the same machine. Hopefully, Julia will be able to (if it doesn't already) optimize its communication strategy in a similar manner.

In regard to issue 2, I don't think it's unanimous that thread-based programming models are always the best way to write high-performance parallel algorithms. In fact, I believe the designers of the D programming language decided to implement a message-passing model to avoid many of the pitfalls of thread-based programming (see http://www.informit.com/articles/article.aspx?p=1609144 for details). Furthermore, with the exception of very simple programs, getting good performance out of OpenMP usually requires more effort than adding a few pragmas. As far as the simple cases are concerned, I think implementing parallel loop macros might be sufficient to make Julia as convenient as OpenMP.

Unfortunately, I don't have much to offer regarding issue 3.

I hope this helps.

@cognociente

It seems that there has been progress made with OpenMP and CLANG. See below article:

http://www.phoronix.com/scan.php?page=news_item&px=MTQ0NjQ

and

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-August/065169.html

I have actually been watching the progress of Julia and this issue in particular before jumping into learning a whole new toolset. I genuinely believe this is a must-have for a language/tool like Julia and second the reasoning of stevengj above that this is just so much easier to code for than thinking of distributed memory.

@stevengj
Member Author

stevengj commented Sep 9, 2013

@bsilbaugh, no one is arguing that "thread-based programming models are always the best way to write high-performance parallel algorithms," just that they are significantly simpler in many common cases. The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task), you can often get good performance (even if not optimal) just by adding a few pragmas. (Even the difficulty of load-balancing irregular problems has been essentially solved.) These cases come up extremely often in practice, in my experience.
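
The recursive case is the same story: spawn one half as a task, recurse on the other, and let the runtime balance the tree by work stealing. A minimal sketch, assuming Julia grows a task-spawning primitive along the lines of Cilk's spawn (the Threads.@spawn name here is only illustrative):

```julia
# Divide-and-conquer sum: the left half becomes a task, and the scheduler
# load-balances the resulting tree of tasks across cores.
function psum(v::AbstractVector, lo::Int = 1, hi::Int = length(v))
    hi - lo < 10_000 && return sum(view(v, lo:hi))
    mid = (lo + hi) >>> 1
    left = Threads.@spawn psum(v, lo, mid)   # assumed spawn primitive
    right = psum(v, mid + 1, hi)
    return fetch(left) + right
end
```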

And "implementing parallel loop macros" is not sufficient, precisely because it doesn't deal with the issue of memory. The problem is not parallelizing the loop, the problem is that in distributed-memory systems you need to decide in advance where things are stored and deal with explicit communication. This is what shared-memory avoids, and this is where automated language tools have been an abject failure in distributed-memory systems for 20 years. This is not a simple problem.

@bsilbaugh
Contributor

@stevengj Breathe. Maybe have some chamomile tea. Look, a cute puppy.

First, I think we agree on a few things:

  1. Julia should do whatever she needs to do under the hood (or skirt?) to optimize inter-process communication. This includes exploiting shared memory and/or system threads when two (or more) processes are running on the same machine.
  2. Simple tasks should be kept simple, and hard things doable.
  3. Julia needs to be competitive with existing technologies such as OpenMP, CUDA, and our beloved (or hated) grandpa MPI.
  4. This is not a simple problem.
  5. That was a darn cute puppy.

Now, let's consider the main point:

The whole point is that in the simple cases where one wants to parallelize an expensive loop (#pragma for) or a recursive tree of calls (#pragma task)...

For simple cases (loops), being able to use OpenMP is nice. No argument. But is it worth exposing another parallel programming model to the user? Another model that the user needs to think about when composing parallel libraries? (Developer A may decide he will only use threads for his library, and developer B will decide that he will only use one-sided comm for his library. But someday, developer C is going to need to put library A together with B, and will have to try to reason about their interaction.) I think this is perhaps where we disagree. We disagree about the idea of bolting on yet another parallel programming model to the language itself. And that's okay.

Out of curiosity, have you tried automatic parallelism for some of the use cases you're talking about? Maybe compare a few test cases between OpenMP and Intel's automatic parallelism just to get a sense of what is possible with automatic parallelism. (I would be willing to do some of the legwork if you would be willing to dig up some good test cases.) If the present state of the art in automatic parallelism is good enough for the kind of problems you're talking about, then this might be a way to get shared-memory support into Julia without actually requiring Julia users to learn multiple programming models. Obviously, automatic parallelism will take some effort to implement, but then again, the parallel programming model of the future is probably no model; i.e. I suspect multi-threading and message passing will eventually go the way of assembly programming.

@stevengj
Member Author

@bsilbaugh, Intel's auto-parallelization still requires the language to support shared-memory parallelism (and in fact, Intel's parallelizer is built on top of OpenMP). They probably didn't build it on top of distributed-memory parallelism precisely because it is so difficult to automate memory distribution.

@bsilbaugh
Contributor

@stevengj Automatic parallelization requires compiler support (which may use OpenMP libraries internally), but it does not require language support. (Otherwise, it wouldn't be automatic.)

You're right that Intel does not support automatic parallelism for distributed memory architectures (nor does OpenMP). But, for the simple cases you alluded to, perhaps it's enough.

The point of my earlier post (which I think was in line with @ViralBShah's and @JeffBezanson's original comments) was that you may not need to change the current parallel model (i.e. the one-sided communication API supported by the standard library) simply for performance reasons. For example, calling fetch could just as easily dereference a pointer to a block of shared memory instead of pulling data over the network. Depending on julia's JIT capabilities, perhaps some of these calls (fetch, remotecall, etc.) could even get optimized out.
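
Concretely, the code a user writes would not change at all; only the transport underneath fetch would, whenever the two processes happen to share a machine. A sketch in the current 0.x-era API:

```julia
addprocs(2)

# One-sided style: ask worker 2 to build a matrix, then fetch the result.
# Whether fetch copies bytes over a pipe or dereferences a block of shared
# memory is an implementation detail the caller never sees.
r = remotecall(2, rand, 1000, 1000)
A = fetch(r)
```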

@stevengj
Member Author

@bsilbaugh, by "language support", I mean support in the Julia runtime, which is separate from the compiler. The main obstacle in this whole discussion is that the Julia runtime is not thread-safe. (And the simple cases I alluded to are no longer so simple in distributed-memory situations.)

Yes, you can certainly implement distributed-memory primitives on top of shared memory, but that doesn't address the difference in ease of programming between shared and distributed programming models (regardless of implementation).

@ArchRobison
Contributor

For shared memory parallelism, it would be worth looking at the Cilk model too. There is ongoing work to add it to LLVM (http://cilkplus.github.io/). Cilk avoids some of the composition problems that OpenMP has (notably nested parallelism). Though there's no free lunch -- OpenMP also has certain advantages. Another candidate worth understanding is Deterministic Parallel Java (http://dpj.cs.uiuc.edu/DPJ/Home.html). Maybe some of its techniques can be applied in Julia. I think the important thing is to understand the tradeoffs.

@stevengj
Member Author

@ArchRobison, the semantics of OpenMP have been converging towards those of Cilk for some time now. OpenMP now has #pragma omp task, similar to Cilk's spawn model, and it has #pragma omp for schedule(guided), similar to the divide-and-conquer load-balancing technique for loop parallelism in Cilk+. Of course, the syntax is quite different, but no one is proposing to adopt OpenMP syntax in Julia.

So, while I agree that Cilk has a good conceptual model for shared-memory parallelism, that question is somewhat orthogonal to what runtime threading library we use. (LLVM support is apparently somewhat irrelevant here since our syntax is not implemented in LLVM; we just need the runtime library.)

But again, the stumbling block is thread safety in the Julia runtime library.

@JeffBezanson
Sponsor Member

That is true, but I'm just as worried about thread safety in every other Julia library that might be out there.

@timholy
Sponsor Member

timholy commented Sep 19, 2013

I'm not sure I understand the latter concern fully. For example, are there really that many functions in base that make use of non-constant global variables? I'm not saying there aren't any---I've written some of them myself---but I don't tend to think of it as a major feature of our code base. Of course with packages there are additional possibilities for conflict, but at least in my own packages I think it's pretty typical of what's in base---a small percentage might need some redesign for thread-safety.

@ArchRobison
Contributor

Though OpenMP has adopted tasking, there are fundamental semantic differences with Cilk that impact composability and performance. Tasking in OpenMP is tied to parallel regions. The big advantage and big disadvantage (depending on context) is that the number of threads executing a parallel region must be bound when the region is entered, before the amount of work or potential parallelism is known. (I work with the Cilk/OpenMP/TBB groups at Intel. We've considered lazy schemes to try to circumvent the issue in OpenMP, but the OpenMP standard has pesky features that get in the way.)

I agree that the big stumbling block is the run-time library and existing Julia libraries. Maybe a lint-like tool could inspect Julia libraries for "okay", "definitely not okay", or "a human should take a look"? From my beginner's experience, Julia seems to have much less of the alias analysis misery that stymies such tools for C/C++.

@JeffBezanson
Sponsor Member

I am indeed worried about packages, and the various native libraries they might call. Thread safety is a pretty significant demand to put on all those things, especially since the failure mode is not an error message (or poor performance) but random horrible behavior.

Julia code is designed to compose together quite promiscuously, so it is hard to say "my code is threaded, so I need to make sure I only use thread-safe libraries" --- one could easily pass a function from one package to an implicitly parallel higher-order function from another package.

@ArchRobison great to have somebody from the Cilk group here.

@stevengj
Member Author

@ArchRobison, thanks for the clarification, that's very helpful.

@ArchRobison
Contributor

Another issue to consider is the generality of the threading model versus ability to detect races. Automatic race detectors can reduce the pain of dealing with threading bugs. Examples are Helgrind, Intel Inspector, and Cilk screen. (See http://supertech.csail.mit.edu/papers/tushara-meng-thesis.pdf for the theory behind one for Cilk.) The efficiency of a race detector is somewhat inverse to the generality of the parallelism model, so it's something to consider in choosing the parallelism model. JIT compilation may be able to reduce the cost somewhat since it can instrument only the memory accesses that might constitute races. (In the jungle of C/C++ binaries that Inspector deals with, we have to instrument just about every access since binary stack frames don't give us much info to do thread escape analysis.)

@staticfloat mentioned this issue Nov 19, 2013
@rodrigob

@timholy what is done?

@timholy
Sponsor Member

timholy commented Sep 29, 2014

Most of the Halide syntax, specifically loop reordering, hoisting constants, and tiling. Performance on a couple of test examples actually beats (single-threaded) Halide by a bit, surprisingly.

@rodrigob

That sounds fantastic. Adding the "threading bit" is precisely the topic of this issue. What are the showstoppers?
(And where are those tests? I see only blur at https://github.com/timholy/HalideCall.jl/tree/master/examples )

@timholy
Sponsor Member

timholy commented Sep 29, 2014

Looks like I forgot to commit the accum example (it's in the Makefile, though). Just fixed.

The main showstopper is splitting the code for each tile into an independently-callable function, so that threads can be dispatched. However, I planned ahead, and at least the code to allocate temporary memory for each tile is annotated, which might make this easier.
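
Roughly, the transformation still needed is the one sketched below (hypothetical code, not what HalideCall.jl currently generates): hoist each tile's body into its own function so a tile becomes an independent unit of work, after which dispatching tiles is just a loop that some threading construct can parallelize (Threads.@threads stands in for whatever ends up being available).

```julia
# Per-tile body hoisted into its own function (stand-in kernel shown).
function tile_body!(out, a, irange, jrange)
    for j in jrange, i in irange
        out[i, j] = 2 * a[i, j] + 1
    end
end

# Tiles are now independently callable, so they can be handed to threads.
function run_tiled!(out, a; tile = 64)
    m, n = size(a)
    Threads.@threads for tj in 1:cld(n, tile)
        j0 = (tj - 1) * tile + 1
        for i0 in 1:tile:m
            tile_body!(out, a, i0:min(i0 + tile - 1, m), j0:min(j0 + tile - 1, n))
        end
    end
    return out
end
```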

@eaglgenes101

So where are we now? Have we decided on a model yet?

@ViralBShah
Member

See the threads branch.

I think we can close this issue and have more specific issues related to the implementation.

@ViralBShah added the domain:multithreading (Base.Threads and related functionality) label on Mar 6, 2015
@stevengj
Member Author

stevengj commented Sep 2, 2015

@ViralBShah, I think you mean the jn/threading branch? I looked for the threads branch and accidentally created a branch of this name (and promptly deleted it) since no such branch exists.

I don't think this issue should be closed until threading is merged in some form. You can track an issue much more easily than you can track a branch.

@ViralBShah
Member

There used to be a threads branch, and a newer threading branch, and eventually the threads branch was removed, IIRC. I believe @JeffBezanson and @kpamnany are getting this up to speed this week. Perhaps worth connecting with them.

@JeffBezanson
Sponsor Member

We got a fair amount accomplished this week. Here's the status.

  • The latest branch is jn/threading, and it is up to date with master.
  • Threading works with the new GC. Still using a barrier, but @vtjnash has some good progress towards stopping threads with a signal.
  • Progress on signal handling by @vtjnash .
  • Several more things in the runtime are thread safe.
  • @kpamnany has started work on I/O. I believe the plan is to use a single I/O thread.
  • We have a JULIA_ENABLE_THREADING flag to allow inclusion of the threading code, but disabled.

The plan is to merge into master soon after 0.4 branches, with JULIA_ENABLE_THREADING off by default. We need to start making sure everything builds and works in this state on all platforms. We want to build and also run as much of the thread-related code as possible even when threads are disabled.

@staticfloat
Sponsor Member

I'm sensing an impending Travis build matrix addition. :)

I can hear the server fans spinning up from here. Good thing that Travis is roflscale.

@jakebolewski
Member

We will soon get dedicated hardware for you to play with. Let the build matrix grow with impunity.

@ViralBShah
Member

Should we close this in favour of more specific issues dealing with multi-threading at this point?

@tkelman
Contributor

tkelman commented Sep 6, 2015

You said that last October. Judging by

I don't think this issue should be closed until threading is merged in some form. You can track an issue much more easily than you can track a branch.

I think the answer there is "not until something gets merged."

@timholy
Sponsor Member

timholy commented Sep 6, 2015

Very exciting to hear about the progress here.

I can tell you're just trying to keep me following master rather than focusing on 0.4. Your devious strategy just might work.

@JeffBezanson
Sponsor Member

The jn/threading branch doesn't seem to build at the moment. I get

/home/jeff/src/julia/src/codegen.cpp: In function ‘void jl_extern_c(jl_function_t*, jl_value_t*, jl_value_t*, char*)’:
/home/jeff/src/julia/src/codegen.cpp:1108:90: error: no matching function for call to ‘llvm::GlobalAlias::create(llvm::Type*, unsigned int, llvm::GlobalValue::LinkageTypes, char*&, llvm::Function*&, llvm::Module*)’
                             GlobalValue::ExternalLinkage, name, llvmf, llvmf->getParent());

and several other errors. @vtjnash Know anything about this? Am I doing something wrong?

@kpamnany
Contributor

@Keno says in #13410: "Just needs a rebase of the threading patches on top of llvm svn."

I fixed the compile errors by removing the #ifdef LLVM38 and letting the #ifdef LLVM37 branch be compiled, wherever there were errors.

@vtjnash
Sponsor Member

vtjnash commented Oct 13, 2015

It sounds like someone needs to rebase the patches on top of the LLVM 3.7.0 release so that it can remain stable and buildable.

@JeffBezanson
Sponsor Member

I'm using the jn/tlsrebaseagainagain branch of JuliaLang LLVM. Why would that stop working? Maybe I'm missing a step in trying to clean my build state?

@tkelman
Contributor

tkelman commented Oct 31, 2015

Something got merged, so close? It's not enabled by default yet but seems to work at the moment. #9336 covers the prereq LLVM version, and other separate issues can be filed for the occasional bug or segfault that will get found in using the threading code.

@StefanKarpinski
Sponsor Member

It seems to me that this will be really fun to close when we do finally actually have a non-experimental threading interface.

@vtjnash
Sponsor Member

vtjnash commented Jul 2, 2016

We'll continue making progress at thread-safety with PRs like #17250, but I think the original issue is resolved and I don't see any sub-issues mentioned later (which should be opened as separate issues anyways).

@tkelman
Contributor

tkelman commented Jul 2, 2016

The programming model of what you can do with @threads is still pretty limited, but that can be discussed and developed in future issues and PRs.
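
For example, even a plain reduction currently has to be written with explicit per-thread accumulators (a sketch of the present idiom; there is nothing like OpenMP's reduction clause yet):

```julia
using Base.Threads

# Threaded sum with one accumulator slot per thread, combined serially at the
# end; @threads itself only understands a simple for loop over a range.
function tsum(x::Vector{Float64})
    partials = zeros(nthreads())
    @threads for i in 1:length(x)
        partials[threadid()] += x[i]
    end
    return sum(partials)
end
```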
