[SPARK-2894][Core] Fixes spark-shell and pyspark CLI options #1864
Conversation
QA tests have started for PR 1864. This patch merges cleanly.
QA results for PR 1864:
@@ -46,11 +49,11 @@ function main(){
     # (see https://github.com/sbt/sbt/issues/562).
     stty -icanon min 1 -echo > /dev/null 2>&1
     export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
-    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main spark-shell "$@"
+    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]}
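To make the intent of the new invocation concrete, here is an illustrative sketch only (the example flags are assumptions, not taken from the PR): options that spark-submit itself understands are collected in SUBMISSION_OPTS and placed before the spark-shell primary resource, while everything else ends up in APPLICATION_OPTS and is passed through to the REPL application. The array expansions are written quoted here; whether the line in the diff above needs the same quoting is exactly what the review comments below discuss.

# Illustrative only -- roughly what the new invocation is meant to achieve.
# For a hypothetical command line:
#   ./bin/spark-shell --master local[4] --name demo -i init.scala
# the options would be split along these lines:
SUBMISSION_OPTS=(--master "local[4]" --name demo)   # understood by spark-submit
APPLICATION_OPTS=(-i init.scala)                    # forwarded to the shell application
"$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main \
  "${SUBMISSION_OPTS[@]}" spark-shell "${APPLICATION_OPTS[@]}"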
Does this handle quoted strings, e.g. --name "awesome app"? You may need to put double quotes around the argument lists.
Confirmed it doesn't; we need to add logic similar to the for loop in pyspark to handle quoted arguments.
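To see the problem concretely, here is a tiny standalone sketch (not code from this PR) showing how unquoted array expansion re-splits arguments that contain spaces:

#!/usr/bin/env bash
# Unquoted expansion word-splits "awesome app" into two separate words;
# quoting the expansion preserves it as a single argument.
APPLICATION_OPTS=(--name "awesome app")
printf 'unquoted: <%s>\n' ${APPLICATION_OPTS[@]}     # prints 3 lines: <--name> <awesome> <app>
printf 'quoted:   <%s>\n' "${APPLICATION_OPTS[@]}"   # prints 2 lines: <--name> <awesome app>

So the expansions in the diff above would need to be written as "${SUBMISSION_OPTS[@]}" and "${APPLICATION_OPTS[@]}" for an option such as --name "awesome app" to reach spark-submit intact.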
Hi @liancheng, this looks pretty good. We should merge this quickly because Spark shell for master is currently broken. However, I don't see a code path for handling application args for the pyspark shell. Do you intend to add that in this PR?
@andrewor14 Sorry, I was busy attending the 1st Spark Meetup in Beijing today. Thanks to @sarutak, his most recently updated PR (#1825) fixes all the issues you've pointed out here. I'll test it locally to make sure.
As sryza reported, spark-shell doesn't accept any flags. The root cause is wrong usage of spark-submit in spark-shell, and it came to the surface with #1801.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, closes #1864, and closes #1861.

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
…jobs should fail the streaming context (apache#1864)

### What changes were proposed in this pull request?

Currently Spark Streaming uses Scala's Try to catch non-fatal exceptions while generating streaming jobs in JobGenerator.generateJobs. But if any fatal error (e.g., OutOfMemoryError) happens, it won't be caught. Normally such an error should propagate up and eventually crash the Spark runtime, but Spark Streaming generates jobs in a separate thread. If the fatal error crashes the entire JVM, it can also bring Spark down. In the customer case, however, the OOM is thrown by ByteArrayOutputStream when it cannot grow to its new capacity (the new capacity overflows the Int range, see the Java [source code](https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L123)). So the JVM is not actually running out of memory and won't go down because of it. The OOM ends up being ignored by Spark Streaming, and subsequent batches continue to execute. Such behavior causes unnoticed data loss for the streaming application. We should propagate all failures, including fatal ones.

### Why are the changes needed?

To make Spark Streaming's behavior consistent between fatal and non-fatal errors.

### Does this PR introduce _any_ user-facing change?

If the user's streaming application hits a fatal error during streaming job generation, it was previously ignored; now it will bring down the application. Note that a non-fatal error already brought down the application before this change, which is why the previous behavior of letting fatal errors pass silently was incorrect.

### How was this patch tested?

Added a unit test.

### Was this patch authored or co-authored using generative AI tooling?

No
This PR tries to fix SPARK-2894 in a way different from #1825, namely by filtering out all spark-submit options from the user application options, since we have control over the option list of spark-submit. By extracting the filtering into a utility shell function (sketched below), fixing other shell scripts becomes much easier, and adding new spark-submit options won't require much duplicated work. In the mid term, we still need a cleaner solution such as what has been discussed in #1715.

All modes pyspark supports (Python, IPython, and the deprecated pyspark app.py style) should work. @JoshRosen @davies It would be great if you could help review the PySpark related code, thanks!

An open issue is that the Windows scripts also need to be updated.
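The utility shell function itself is not shown in this thread; the following is only a sketch of the general idea (the function name, the option list, and the variable names are illustrative assumptions, not the actual utils.sh from the merged fix). It walks the command line once and routes recognized spark-submit options into SUBMISSION_OPTS and everything else into APPLICATION_OPTS.

# Hypothetical sketch, not the PR's actual implementation.
gather_spark_submit_opts() {
  SUBMISSION_OPTS=()
  APPLICATION_OPTS=()
  while (($# > 0)); do
    case "$1" in
      --master | --deploy-mode | --name | --conf | --properties-file | --driver-memory)
        # Options that take a value (a real script would validate that "$2" exists).
        SUBMISSION_OPTS+=("$1" "$2")
        shift 2 || break
        ;;
      --verbose | -v)
        # Flags without a value.
        SUBMISSION_OPTS+=("$1")
        shift
        ;;
      *)
        # Everything else belongs to the user application.
        APPLICATION_OPTS+=("$1")
        shift
        ;;
    esac
  done
}

A launch script could then call gather_spark_submit_opts "$@" and insert "${SUBMISSION_OPTS[@]}" before the primary resource and "${APPLICATION_OPTS[@]}" after it, which is the shape of the spark-shell line shown in the diff above.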