-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Removing assemblies from Spark. #2
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
# Replacing the Spark Assembly with good old jars | ||
|
||
Spark, since the 1.0 release (at least), uses assemblies (or “fat jars”) as the approach to deliver | ||
the shared Spark code to users. In this document I’ll discuss a few problems this approach causes, | ||
and how avoiding the use of the assemblies makes development and deployment of Spark easier for all. | ||
|
||
What does it solve? And at what cost? | ||
|
||
The first question to ask is: what problems is the assembly solving? | ||
|
||
The assembly provides a convenient package including all the classes needed to run a Spark | ||
application. Theoretically, you can easily move it from one place to the other; easily cache it | ||
somewhere like HDFS for reuse; and easily include it as a library in 3rd-party applications. | ||
|
||
But that’s a very shallow look at things, and ignores a lot of problems caused by the assembly. | ||
|
||
Spark has suffered in the past from problems caused by such a large archive with so many files | ||
(pyspark incompatibilities with large archives created by newer JDKs). That has been solved by | ||
moving to JDK 7, though. | ||
|
||
The assembly makes dependencies very opaque. When a user adds the Spark assembly to an application, | ||
what exactly is he pulling in? Is he inadvertently overriding classes needed by his application with | ||
ones included in the Spark assembly? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to be clear, the user was never supposed to add Spark assemblies to their application. They link to Spark by adding it as a Maven dependency, and then they run their app using spark-submit, which uses the Spark build on their machine. The assembly was mostly meant to be an easy and efficient way to package all the classes on worker nodes, so that you don't have to send around a ton of JARs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @mateiz,
I understand that point, but the fact is that quite a lot of people run Spark embedded in their applications too, without ever using spark-submit. re: distribution to worker nodes, that mostly applies to YARN, right? Maybe Mesos, I'm not familiar with how to use Spark on it. But on standalone, the Worker instance itself already needs all this code to run, and propagates that code to all executors it starts. Whether the code is a single jar or multiple, it doesn't really make a difference. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For embedded Spark, can't they just rely on the Maven dependencies? I don't see how removing assemblies would fix this, because you still need to add the code for your specific configuration of Spark on the classpath (i.e. the right Spark version, the right Hadoop version, Hive if you wanted that, etc). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They can, mostly. Perhaps citing the embedded case is kind of a red herring. The biggest issue someone will have in the embedded case is with YARN, since it currently requires some non-trivial setup to get things right (see Pig and Oozie bugs I mentioned later); I think that's the only change listed in this document that really would help the embedded case. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alright, I just want to make sure that we're not recommending the wrong thing to people. YARN is its own world, but if we make a change like this we should make sure the result works in all resource managers and is worth the change. I don't remember the details exactly, but we actually started with JARs and moved to an assembly later; it might've been to make it easier to push onto YARN and Mesos but I forget. It was not done to let users link to a giant JAR though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Absolutely; I have actually played with a proof of concept for this in YARN and the changes required are actually not that big. I'd have to familiarize myself a bit with the Mesos side, but I don't expect a lot of issues there either. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it would be great if we could simply depend on spark (as a maven dependency), and launch our app on a yarn cluster (without using any installed spark distro or even spark-submit). this would make things much easier. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For Mesos I don't think it will cause too much problems, most of the time it's about able to build a Spark distribution tar that holds the spark binaries and dependencies to run the executor via Mesos. Otherwise we don't upload dependencies for users and act much more like Standalone mode. |
||
|
||
The assembly makes development slower. Many tests need currently need an updated assembly to run | ||
correctly (although SPARK-9284 aims to solve that). Updating a remote cluster is slower than | ||
necessary - even rsyncing such a large archive is not terribly efficient. It slows down the build, | ||
because repackaging the assemblies when files change is not exactly fast. Even deciding that there | ||
is no need to rebuild the assembly takes time. | ||
|
||
And it also does not solve the one problem it was meant to solve: it does not include all | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There's another issue, which is conflict with the bundled artifacts (example: SLF4J) and anything on the classpath, especially when deployed on Hadoop clusters: warnings of conflict appear at the top of all logs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @steveloughran I'm not sure this change by itself would solve the multiple slf4j binding issue, but it definitely would make it easier for people to fix it by customizing their Spark installation. |
||
dependencies needed by Spark, because the Datanucleus libraries do not work when included in the | ||
assembly. | ||
|
||
From the point of view of someone trying to embed Spark into their application, things become | ||
trickier still. The assembly is not a published artifact, so what should the user pick up instead? | ||
The recommendation has been to use “provided” dependencies and somehow ship the appropriate Spark | ||
assembly with the user application. But that runs into all the issues above (dependency conflicts et | ||
al), aside from being a very unnatural way to use dependencies when compared to other maven-based | ||
projects. | ||
|
||
Finally, as someone whose work involves packaging Spark as part of a larger distribution, the | ||
assembly creates yet more problems. Because all dependencies are included in one big fat jar, it’s | ||
harder to share libraries that are shipped as part of the distribution. This means packages are | ||
unnecessarily bloated because of the code duplication, and patching becomes harder since now you | ||
have to patch multiple components that ship that code. | ||
|
||
Hacks were added to the Spark build to filter out such dependencies (all the *-provided profiles), | ||
but those are brittle, require constant maintenance and policing, and require non-trivial work to | ||
make sure Spark has all needed libraries at runtime. If you happen to miss a shared dependency, and | ||
you are unlucky enough to have to patch it later on, you just made your work more complicated | ||
because now there are two things to patch. | ||
|
||
## How to replace it? | ||
|
||
Ignoring potential backwards compatibility issues due to code that expects the current layout of a | ||
Spark distribution, getting rid of the Spark assembly should be rather easy. | ||
|
||
With a couple of exceptions that I’ll cover below, there is no code in Spark that actually depends | ||
on the assembly. Whether the code comes from one or two hundred jars, everything just works. So from | ||
the packaging side, all that is needed is, instead of having a single jar file, use maven’s built-in | ||
functionality (and I assume sbt would have something similar) to create a directory with the Spark | ||
jars and all needed dependencies. | ||
|
||
The two parts of the code base that depend on the assembly are: | ||
|
||
* The launcher library; fixing it to include all jars in a directory instead of the assembly is | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 for lib/*.jar. This addresses the problem today that you can't just drop in support for s3a:// or avs:// just by adding the JAR to the dir, but instead have to remember to explicitly add it on every run There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FYI for this you can also modify the spark-env.sh on each node to add it to the classpath. |
||
trivial. | ||
* YARN integration. | ||
|
||
The YARN backend assumes, by default, that there’s nothing Spark-related installed in the cluster. | ||
So when you submit an app to YARN, it will try to upload the jar containing the Spark classes | ||
(normally the assembly) to the cluster. There are config options that can be used to tell the YARN | ||
backend where to find the assembly (e.g. somewhere in HDFS or on the local filesystem of cluster | ||
nodes), but those configs assume that Spark is a single file. This is already an issue today when | ||
trying to run a Spark application that needs Datanucleus jars in cluster mode. | ||
|
||
Fixing this is not hard, it just requires a little more code. The YARN backend should be able to | ||
handle directories / globs as well as the current “single jar” approach to uploading (or | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest supporting it as a tarball if we go that way, more efficient then a bunch of jars in a directory. This could also allow you to include confs and other things in there. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's an interesting approach. Kinda similar to how we upload the configs to HDFS currently. For those who do not have the files cached on HDFS, though, creating the tgz locally on every run would be pretty expensive. Perhaps supporting both ways would be better? (Support multiple jars, but also a cached tgz via a different config option.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah I'm fine with that, was talking more about on the uploaded to hdfs side of things. |
||
referencing) dependencies. | ||
|
||
Spark has more than one assembly, though, so we need to look at how the other assemblies are used | ||
too. | ||
|
||
The examples assembly can receive a similar treatment. The run-examples script might need some | ||
tweaking to include the extra jars in the Spark command to run. And running the examples by using | ||
spark-submit directly might become a little bit more complicated - although even that is fixable, in | ||
certain cases. The dependencies can be added to the example jar’s manifest, and spark-submit could | ||
read the manifest and automatically include the jars in the running application. | ||
|
||
The streaming backend assemblies could potentially just be removed. With the ivy integration in | ||
Spark, the original artifacts can be used instead, and dependencies will be automatically handled. | ||
For those using maven to build their streaming applications, including the dependencies is also | ||
easy. To help with tidiness, the streaming backends should declare Spark dependencies such as | ||
spark-core and spark-streaming as provided. There might be some tweaking needed to get the pyspark | ||
streaming tests to work, since they currently depend on the backend assemblies being built. One last | ||
thing that needs to be covered are python unit tests; they use the assemblies to avoid having to | ||
deal with maven / sbt to build their classpath. This could also be easily supported by having the | ||
dependencies be copied to a known directory under the backend’s build directory - not much different | ||
from how things work today. | ||
|
||
That leaves the YARN shuffle service. This is the only module where I see an assembly really adding | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could still generate a small assembly for this shuffle service. In fact I think it has it's own assembly separate from the larger one, since it intentionally has a tiny number of dependencies. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that is what I meant. |
||
some benefit - deploying / updating the shuffle service on YARN is just a matter of copying / | ||
replacing a single file (aside from configuration). | ||
|
||
|
||
## Summary of benefits | ||
|
||
Removing the assembly brings forward the following benefits: | ||
|
||
* Builds are faster | ||
* Build code is simplified | ||
* Spark behaves more like usual maven-based applications w.r.t. building, packaging and deployment | ||
* Possibility of minor code cleanups in other parts of the code base (launch scripts, code that | ||
starts Spark processes) | ||
* More flexibility when embedding Spark into other applications | ||
|
||
The cons of such a move are: | ||
|
||
* Backwards compatibility, in case someone really depends on the assembly being there. We can have a | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I think it's worth fully understanding all possible compatibility issues. There might be things we aren't thinking of. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BC issues
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @vanzin my understanding was that we now shade everything in our core jar, so the uber jar doesn't introduce any new shading. Is that right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pwendell that's right, we shade on every module, so the uber jar doesn't add any shading. The shaded guava classes are currently embedded in the "network-common" artifact. @steveloughran re: (1), that's already the case today, since the assembly is not published to maven [1]. you could still just switch one variable with the Spark version for all artifacts. [1] http://repo1.maven.org/maven2/org/apache/spark/spark-assembly_2.10/ (last one is 1.1.1) |
||
dummy jar, but that only solves a trivial part of the compatibility problem. | ||
* Running examples via spark-submit directly might become a little more complicated. | ||
* Slightly more complicated code in the YARN backend, to deal with uploading all dependencies | ||
(instead of just a single file). | ||
|
||
|
||
## What about the “provided” profiles? | ||
|
||
This change would allow most of the “provided” profiles to become obsolete. The only profile that | ||
should be kept is “hadoop-provided”, since it allows users to easily deploy a Spark package on top | ||
of any Hadoop distribution. | ||
|
||
The other profiles mostly cover avoiding repackaging dependencies for examples (which is not | ||
crucial) and streaming backends (which would be handled by the suggestions made in the discussion | ||
above). | ||
|
||
## Links to Assembly-related issues | ||
|
||
* https://issues.apache.org/jira/browse/OOZIE-2277 | ||
|
||
Oozie needs to do a lot of gymnastics to get Spark deployed because of the way it needs to run apps. | ||
Since the Spark assembly is not a maven artifact, it’s unrealistic for Oozie to use it. | ||
|
||
* https://issues.apache.org/jira/browse/PIG-4667 | ||
|
||
Similar to Oozie. When not using an assembly, the YARN backend does the wrong thing (since it will | ||
just upload the spark-yarn jar). The result is to either depend on the non-existent assembly | ||
artifact, or do what Oozie does. | ||
|
||
* https://issues.apache.org/jira/browse/HIVE-7292 | ||
|
||
Link is to the umbrella tracking the project; but Hive-on-Spark solves the same problem in yet | ||
another different way. It would be much simpler if Hive could just depend on Spark directly instead | ||
of somehow having to embed a Spark installation in it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original motivation way, way back was that it's nicer when you run "ps -ef" - otherwise you get a huge output that is super long with the whole classpath and it's hard to see things like what arguments are passed to the executables. /cc @mateiz as this was a long time ago. At least bares mentioning here... not saying this justifies having it, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "ps -ef" problem could be solved by using the CLASSPATH environment variable instead of the JVM argument. One can still use the
jinfo
command to see the classpath a Java process is running with.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you could either use the env variable or include
libdir/*
in the classpath, since the JVM handles that fine.