
[SPARK-2894][Core] Fixes spark-shell and pyspark CLI options #1864

Closed
liancheng wants to merge 2 commits

Conversation

liancheng
Contributor

This PR tries to fix SPARK-2894 in a way different from #1825, namely by filtering spark-submit options out of the user application options, since we have control over the spark-submit option list. Extracting the filtering into a utility shell function makes fixing the other shell scripts much easier, and adding new spark-submit options won't require much duplicated work. In the medium term we still need a cleaner solution, such as what has been discussed in #1715.
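As a rough sketch of the filtering approach described above (the function name and the set of recognized options here are hypothetical, not the exact utility function added by this PR), the split could look something like this:

```bash
# Hypothetical sketch of the option-filtering helper described above; the real
# utility function in this PR may use different names and cover a different
# set of spark-submit options.
SUBMISSION_OPTS=()
APPLICATION_OPTS=()

gather_spark_submit_opts() {
  while (($#)); do
    case "$1" in
      # spark-submit options that take a value
      --master | --deploy-mode | --class | --name | --jars | --py-files | \
      --files | --conf | --properties-file | --driver-memory | --executor-memory)
        SUBMISSION_OPTS+=("$1" "$2")
        shift 2
        ;;
      # spark-submit flags without a value
      --verbose | -v)
        SUBMISSION_OPTS+=("$1")
        shift
        ;;
      # everything else is treated as a user application option
      *)
        APPLICATION_OPTS+=("$1")
        shift
        ;;
    esac
  done
}

gather_spark_submit_opts "$@"
```

With something like this in place, an invocation such as `./bin/spark-shell --master local[4] my-extra-arg` would route `--master local[4]` to spark-submit and pass the remaining arguments through to the application.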

All pyspark modes (Python, IPython, and the deprecated `pyspark app.py` style) should work. @JoshRosen @davies It would be great if you could help review the PySpark related code, thanks!

An open issue is that the Windows scripts also need to be updated.

@SparkQA

SparkQA commented Aug 9, 2014

QA tests have started for PR 1864. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18237/consoleFull

@SparkQA

SparkQA commented Aug 9, 2014

QA results for PR 1864:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]}
$FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]}

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18237/consoleFull

@@ -46,11 +49,11 @@ function main(){
     # (see https://github.com/sbt/sbt/issues/562).
     stty -icanon min 1 -echo > /dev/null 2>&1
     export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
-    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main spark-shell "$@"
+    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main ${SUBMISSION_OPTS[@]} spark-shell ${APPLICATION_OPTS[@]}
Contributor
Does this handle quoted strings, e.g. --name "awesome app"? You may need to put double quotes around the argument lists.
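For reference, here is a minimal bash illustration of the quoting issue being raised (not code from this PR): without double quotes, array elements that contain spaces get word-split again on expansion.

```bash
# Word splitting on unquoted vs. quoted array expansion.
APPLICATION_OPTS=(--name "awesome app")

printf '[%s] ' ${APPLICATION_OPTS[@]}    # unquoted: [--name] [awesome] [app]
echo
printf '[%s] ' "${APPLICATION_OPTS[@]}"  # quoted:   [--name] [awesome app]
echo
```

Quoting the expansions as `"${SUBMISSION_OPTS[@]}"` and `"${APPLICATION_OPTS[@]}"` preserves each element as a single argument.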

Contributor Author
Confirmed it doesn't; we need to add logic similar to the for loop in pyspark to handle quoted arguments.

@andrewor14
Contributor

Hi @liancheng, this looks pretty good. We should merge this quickly because the Spark shell on master is currently broken. However, I don't see a code path for handling application args for the pyspark shell. Do you intend to add that in this PR?

@liancheng
Contributor Author

@andrewor14 Sorry, I was busy attending the 1st Spark Meetup in Beijing today. Thanks to @sarutak, whose most recently updated PR (#1825) fixes all the issues you've pointed out here. I'll test it locally to make sure.

asfgit pushed a commit that referenced this pull request Aug 10, 2014
As sryza reported, spark-shell doesn't accept any flags.
The root cause is incorrect usage of spark-submit in spark-shell, and it was surfaced by #1801.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, Closes #1864, and Closes #1861

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
asfgit closed this in 4f4a988 Aug 10, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
liancheng deleted the spark-2894 branch September 24, 2014 00:08
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…jobs should fail the streaming context (apache#1864)

### What changes were proposed in this pull request?

Currently Spark Streaming uses Scala's `Try` to catch `NonFatal` exceptions while generating streaming jobs in `JobGenerator.generateJobs`. But if a fatal error (e.g., `OutOfMemoryError`) happens, it won't be caught. Normally such an error would just propagate up and eventually crash the Spark runtime, but Spark Streaming does this work in a separate thread.

If the fatal error crashes the entire JVM, it also brings Spark down. But in the customer's case, the OOM is thrown by `ByteArrayOutputStream` when it cannot grow to the new capacity (the new capacity overflows the Int range, see the Java [source code](https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/io/ByteArrayOutputStream.java#L123)). So the JVM is not actually running out of memory and won't be brought down by it. Eventually the OOM is ignored by Spark Streaming and subsequent batches continue to execute. Such behavior causes unnoticed data loss for the streaming application. We should propagate all failures, including fatal ones.

### Why are the changes needed?

Makes Spark Streaming's behavior consistent between fatal and non-fatal errors.

### Does this PR introduce _any_ user-facing change?

If a user's streaming application hits a fatal error during streaming job generation, it was previously ignored; now it will bring down the application. Note that a non-fatal error already brought down the application before this change, which is why the previous behavior of not failing on fatal errors was incorrect.

### How was this patch tested?

Added unit test

### Was this patch authored or co-authored using generative AI tooling?

No