
[SPARK-11327] [MESOS] Dispatcher does not respect all args from the Submit request #10370

Closed

jayv wants to merge 5 commits into apache:master from jayv:mesos_cluster_params

Conversation

@jayv
Contributor

jayv commented Dec 18, 2015

Supersedes #9752

@tnachen
Contributor

tnachen commented Dec 18, 2015

ok to test

@tnachen
Contributor

tnachen commented Dec 18, 2015

At first I was wondering whether, if we just copy all scheduler properties, they would clash with the ones we used above (e.g. --total-executor-cores), but it looks like the command-line flags take precedence in SparkSubmit, so we should be OK. How have you tested this?

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47989 has finished for PR 10370 at commit cdde93d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jayv
Contributor Author

jayv commented Dec 18, 2015

@tnachen I've tested that patch against branch-1.5 and I'm now able to specify options such as --conf spark.executor.userClassPathFirst=true --conf spark.driver.userClassPathFirst=true on spark-submit for a cluster deploy via Mesos.

@tnachen
Contributor

tnachen commented Dec 18, 2015

@jayv Looks like you need to fix the line length > 100 style rule.

@@ -440,6 +444,9 @@ private[spark] class MesosClusterScheduler(
.mkString(",")
options ++= Seq("--py-files", formattedFiles)
}
desc.schedulerProperties
.filter { case (key, _) => !replicatedOptionsBlacklist.contains(key) }
.foreach { case (key, value) => options ++= Seq("--conf", Seq(key, "=\"", value, "\"").mkString("")) }
Contributor

Are the quotes inside --conf key="value" valid? I know I tried --conf spark.mesos.executor.home="/var/spark" and that failed (the actual value of spark.mesos.executor.home contained the quotes).

Contributor

One more thing: better to use string interpolation instead of Seq.mkString: s"$key=$value".

Contributor Author

Yes, the quotes work for me.

In order to pass multiple system properties via e.g. spark.driver.extraJavaOptions, I passed the argument in quotes on the shell to spark-submit, but the value itself doesn't retain quotes around it, so without wrapping it ends up broken on the command line for the driver, unless I add escaped quotes inside the string.

So this fails when executing the driver, but spark-submit accepts the command:

bin/spark-submit [...] --conf spark.driver.extraJavaOptions="-Dcom.sun.jersey.server.impl.cdi.lookupExtensionInBeanManager=true -Dorg.jboss.weld.bootstrap.concurrentDeployment=false" [...]

This doesn't look very intuitive to me, but the following does work without wrapping values in quotes:

bin/spark-submit [...] --conf spark.driver.extraJavaOptions="\"-Dcom.sun.jersey.server.impl.cdi.lookupExtensionInBeanManager=true -Dorg.jboss.weld.bootstrap.concurrentDeployment=false\"" [...]

The same is true for --driver-java-options as opposed to the --conf route.

String interpolation doesn't support embedded double quotes, which is why I initially went with Seq.mkString, but I've since learned about triple-double-quoted string interpolation, which does...
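
For illustration, a minimal sketch of the triple-double-quoted form (the key and value here are hypothetical):

    // Triple-double-quoted interpolation allows literal double quotes inside,
    // so the value can be wrapped without resorting to Seq.mkString.
    val key = "spark.driver.extraJavaOptions"   // hypothetical example key
    val value = "-Dfoo=true -Dbar=false"        // hypothetical example value
    val arg = s"""$key="$value""""              // spark.driver.extraJavaOptions="-Dfoo=true -Dbar=false"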

@dragos
Contributor

dragos commented Dec 19, 2015

Thanks @jayv, I left one note; otherwise this looks good.

@SparkQA

SparkQA commented Dec 23, 2015

Test build #48229 has finished for PR 10370 at commit 8f0e1f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback
    • case class UnresolvedWindowExpression(
    • case class WindowExpression(
    • case class Lead(input: Expression, offset: Expression, default: Expression)
    • case class Lag(input: Expression, offset: Expression, default: Expression)
    • abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction
    • abstract class RowNumberLike extends AggregateWindowFunction
    • trait SizeBasedWindowFunction extends AggregateWindowFunction
    • case class RowNumber() extends RowNumberLike
    • case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction
    • case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction
    • abstract class RankLike extends AggregateWindowFunction
    • case class Rank(children: Seq[Expression]) extends RankLike
    • case class DenseRank(children: Seq[Expression]) extends RankLike
    • case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction

@dragos
Contributor

dragos commented Jan 18, 2016

This PR might help with testing #8872.

@dragos
Contributor

dragos commented Jan 18, 2016

OK, it seems you need to filter out a few more things, particularly spark.submit.deployMode="cluster". I see errors [1] in the driver stderr, since the driver tries to launch again in cluster mode on one of the machines in the cluster. If this worked for you, I'd like to know the exact sequence of steps (I'd really like to get this fixed, though). Here are mine:

$ sbin/start-mesos-dispatcher.sh --master mesos://lausanne1.local:5050
$ bin/spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 --conf spark.mesos.coarse=true --conf spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.6.0-SNAPSHOT-bin-pr10370.tgz" --class org.apache.spark.examples.SparkPi ftp://sagitarius.local/ftp/spark-examples-1.6.0-SNAPSHOT-hadoop2.2.0.jar

sagitarius is my machine, lausanne1 is the Mesos master in my local network.

[1]

16/01/18 16:56:31 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://lausanne1.local:5050.
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestProtocolException: Malformed response received from server
    at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:264)
    at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$postJson(RestSubmissionClient.scala:222)
    at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:87)
    at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$createSubmission$3.apply(RestSubmissionClient.scala:83)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.deploy.rest.RestSubmissionClient.createSubmission(RestSubmissionClient.scala:83)
    at org.apache.spark.deploy.rest.RestSubmissionClient$.run(RestSubmissionClient.scala:411)
    at org.apache.spark.deploy.rest.RestSubmissionClient$.main(RestSubmissionClient.scala:424)
    at org.apache.spark.deploy.rest.RestSubmissionClient.main(RestSubmissionClient.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestProtocolException: Server returned empty body
    at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$1.apply(RestSubmissionClient.scala:241)
    at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$1.apply(RestSubmissionClient.scala:232)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

@dragos
Contributor

dragos commented Jan 19, 2016

I opened jayv#2 with two more filters: spark.submit.deployMode and spark.master. I think it's OK to remove [WIP] from the title, and if you agree with my changes and they pass your tests (they passed mine), this is an LGTM.
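
For reference, a hedged sketch of the two new entries added to the existing replicatedOptionsBlacklist in MesosClusterScheduler (pre-existing entries are elided):

    // Keys that must not be replicated onto the launched driver's command line.
    private val replicatedOptionsBlacklist = Set(
      // ... pre-existing entries elided ...
      "spark.submit.deployMode", // the driver launched on the cluster must itself run in client mode
      "spark.master"             // points at the dispatcher; we pass --master explicitly, which takes precedence
    )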

jayv changed the title from [WIP] [SPARK-11327] [MESOS] Dispatcher does not respect all args from the Submit request to [SPARK-11327] [MESOS] Dispatcher does not respect all args from the Submit request on Jan 19, 2016
@jayv
Contributor Author

jayv commented Jan 19, 2016

@dragos I don't remember seeing these errors on the 1.5.x branch I tested against, but I may have missed them since I was focusing on the missing args. This looks totally reasonable to me.

@SparkQA

SparkQA commented Jan 19, 2016

Test build #49706 has finished for PR 10370 at commit 4b2b8b1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Jan 20, 2016

The failed test is in Spark Streaming; no doubt it's a flaky one.

LGTM. @andrewor14 can you please have a look?

@@ -440,6 +446,9 @@ private[spark] class MesosClusterScheduler(
.mkString(",")
options ++= Seq("--py-files", formattedFiles)
}
desc.schedulerProperties
.filter { case (key, _) => !replicatedOptionsBlacklist.contains(key) }
.foreach { case (key, value) => options ++= Seq("--conf", s"""$key="$value"""") }
Contributor

Are lines 437-443 now redundant? If setting --total-executor-cores is identical to --conf spark.cores.max, then we don't need to set both.

Contributor

Does this escape quotes, double quotes, whitespace, and backslashes correctly?

Contributor

That's a good point: Mesos' CommandInfo uses /bin/sh to launch the command. 😕

Spaces should be OK; everything else won't be correctly escaped. Skimming through the Spark properties, I think the only ones that could pose problems are spark.authenticate.secret and the other passwords (SSL, etc.). Still, this needs a solution.

Contributor

Mesos should also handle arguments with spaces and other characters correctly if you provide them as args in the CommandInfo.

Contributor

I guess Tim refers to this line, so we'd need to set shell to false. That's probably the simplest way.
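
A hedged sketch of that approach (setShell, setValue, and addAllArguments are the standard Mesos protobuf builder calls; the path and arguments are placeholders):

    import scala.collection.JavaConverters._
    import org.apache.mesos.Protos.CommandInfo

    // With shell=false Mesos execs the binary directly and passes the arguments
    // through as argv, so no /bin/sh quoting or escaping is involved.
    // Per mesos.proto, arguments(0) is treated as argv(0) when shell is false.
    val arguments = Seq("spark-submit", "--master", "mesos://host:5050")  // placeholder argv
    val command = CommandInfo.newBuilder()
      .setShell(false)
      .setValue("/opt/spark/bin/spark-submit")  // hypothetical absolute path to the launcher
      .addAllArguments(arguments.asJava)
      .build()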

Contributor Author

Ah OK, I can check that. I wrote a bunch of regexes and tests today to solve escaping for most Unix shells, and I was about to push a PR early in the morning.
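
As an illustration of the idea (not necessarily the exact code that ended up in the patch), such a regex-based escaping helper might look like this:

    // Quote a value iff it contains shell metacharacters, escaping the characters
    // that remain special inside double quotes: " ` $ and backslash.
    def shellEscape(value: String): String = {
      val WrappedInQuotes = """^(".+"|'.+')$""".r
      val ShellSpecialChars = (""".*([ '<>&|\?\*;!#\\(\)"$`]).*""").r
      value match {
        case WrappedInQuotes(_)   => value  // already quoted by the caller, pass through
        case ShellSpecialChars(_) => "\"" + value.replaceAll("""(["`\$\\])""", """\\$1""") + "\""
        case _                    => value  // plain token, no quoting needed
      }
    }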

Contributor

I think the Mesos way is more robust and requires less code. Hopefully :)

Contributor Author

The meat of it is only 5 lines :) but let's see if we can make the Mesos approach work without side effects and for my use case of args with spaces.

@andrewor14
Contributor

retest this please

@SparkQA

SparkQA commented Jan 20, 2016

Test build #49806 has finished for PR 10370 at commit 4b2b8b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Jan 22, 2016

@jayv do you have time to work on this, or should I?

@jayv
Contributor Author

jayv commented Jan 25, 2016

@dragos Today I will test the driver issue you mentioned (we also seem to have a rogue driver spinning up on workers) and work on better escaping. Isn't there code for this in spark-submit or something already?

@andrewor14
Contributor

@jayv the closest thing I can think of is Utils.splitCommandString, which does the unescaping. Did you end up getting Mesos to escape things properly? E.g. does something like spark.executor.extraJavaOptions work?

@andrewor14
Contributor

@dragos is the only thing that's left before this patch can be merged the escaping thing?

@jayv
Contributor Author

jayv commented Feb 1, 2016

I've got an implementation that escapes all the things under Linux shells, but I had to suspend work on this, so I was unable to test Mesos' native escaping. I'll try to get back to it this week.

@dragos
Contributor

dragos commented Feb 2, 2016

As far as I'm concerned, that's the only thing (I'll have to test again on a real Mesos cluster).

@dragos
Contributor

dragos commented Feb 16, 2016

@jayv did you have the chance to look at this again?

@dragos
Contributor

dragos commented Feb 16, 2016

I pushed this commit with the general idea, but I didn't get to test it much. I'll come back to it tomorrow.

@jayv
Contributor Author

jayv commented Feb 16, 2016

@dragos Unfortunately not, I'll come back to this issue when I've done some more testing.

@dragos
Contributor

dragos commented Feb 19, 2016

My commit is not working, BTW; I'll come back to this (the executable needs to be a complete path to spark-class; right now it's a bash-ism: cd spark-1.*; ./bin/spark-class).

@dragos
Contributor

dragos commented Feb 23, 2016

@jayv please let me know if you will look into this one. If you're busy, I'm happy to take over, and will happily start with your escaping algorithm ;-)

@jayv
Contributor Author

jayv commented Mar 2, 2016

Sorry for the delay @dragos, very busy indeed; maybe this can help: jayv@8023da3

@tnachen
Contributor

tnachen commented Mar 18, 2016

@jayv are you planning to update this PR with that commit?

@dragos
Contributor

dragos commented Mar 21, 2016

I can pick this up tomorrow.

@dragos
Contributor

dragos commented Mar 22, 2016

It works, but I found a couple of corner cases. I guess you need to escape \ as well. For instance, a string ending in \, or a string like \"? won't be quoted correctly. Otherwise it looks good and the manual tests are passing.

@jayv
Contributor Author

jayv commented Mar 22, 2016

Nice, will you pick this up @dragos, or are you waiting for me? I can't commit any time this week.

@dragos
Contributor

dragos commented Mar 23, 2016

Yep, I just opened a PR on your repo. If you merge it, it should be reflected here.

@SparkQA

SparkQA commented Mar 23, 2016

Test build #53977 has finished for PR 10370 at commit 11b070d.

  • This patch fails R style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Mar 30, 2016

@jayv can you please rebase?

@andrewor14 the last issue (escaping characters for the shell command) has been fixed. Please take a look.

jayv and others added 5 commits March 30, 2016 10:49
- the launched driver should be in client mode, but at submission time this setting is set to ‘cluster’.
- the `spark.master` setting is set to the dispatcher address (usually on port 7077). Passing it has no effect,
since we also pass `--master`, which takes precedence. But this is confusing, so we should filter it out.
(cherry picked from commit 8023da3)
jayv force-pushed the mesos_cluster_params branch from 11b070d to 82cd03d on March 30, 2016
@jayv
Contributor Author

jayv commented Mar 30, 2016

@dragos rebased!

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54541 has finished for PR 10370 at commit 82cd03d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

retest this please

@andrewor14
Contributor

Latest changes look pretty reasonable. Funny how Mesos cluster mode was basically unusable before. :)

@SparkQA

SparkQA commented Mar 31, 2016

Test build #54556 has finished for PR 10370 at commit 82cd03d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dragos
Contributor

dragos commented Mar 31, 2016

LGTM! @andrewor14 please have a look

@andrewor14
Contributor

Merged into master. Thanks for your patience and hard work everyone.

asfgit closed this in 10508f3 on Mar 31, 2016
@andrewor14
Contributor

@dragos @tnachen can you remind me of the status of Mesos cluster mode in 1.6? It was essentially unusable because of the external shuffle service, right? Did we ever resolve that? (Where is the patch?)

@mgummelt
Contributor

Mesos Cluster Mode works fine. It's dynamic allocation that was broken. I'm pretty sure this is the PR: #11272

@dragos @skonto can you verify that the Spark/Mesos dynamic allocation integration tests on master are passing?

@andrewor14
Contributor

Ah thanks, I forgot I merged that.

@andrewor14
Contributor

Then it seems that this patch is worth backporting into 1.6 since cluster mode does work there.

@dragos @mgummelt @jayv Could one of you submit a patch for 1.6? I think the changes here are pretty isolated so it should be safe.

asfgit pushed a commit that referenced this pull request Apr 4, 2016
Backport for #10370 andrewor14

Author: Jo Voordeckers <jo.voordeckers@gmail.com>

Closes #12101 from jayv/mesos_cluster_params_backport.
zzcclp pushed a commit to zzcclp/spark that referenced this pull request Apr 5, 2016
Backport for apache#10370 andrewor14

Author: Jo Voordeckers <jo.voordeckers@gmail.com>

Closes apache#12101 from jayv/mesos_cluster_params_backport.

(cherry picked from commit 91530b0)