
[SPARK-26015][K8S] Set a default UID for Spark on K8S Images #23017

Closed
wants to merge 8 commits

Conversation

rvesse
Member

@rvesse rvesse commented Nov 12, 2018

What changes were proposed in this pull request?

Adds USER directives to the Dockerfiles, configurable via a build argument (spark_uid) for easy customisation. A -u flag is added to bin/docker-image-tool.sh to make it easy to customise this, e.g.

```
> bin/docker-image-tool.sh -r rvesse -t uid -u 185 build
> bin/docker-image-tool.sh -r rvesse -t uid push
```

If no UID is explicitly specified it defaults to 185 - this is per @skonto's suggestion to align with the OpenShift standard reserved UID for Java apps (
https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html)

Notes:

  • We have to make the WORKDIR writable by the root group, otherwise jobs will fail with AccessDeniedException (see the Dockerfile sketch below)
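
For illustration, a rough sketch of the kind of Dockerfile changes involved (paths and instructions here are simplified; the actual project Dockerfiles differ in layout):

```
FROM openjdk:8-alpine

# Build argument controlling the runtime UID, defaulting to the OpenShift-style 185
ARG spark_uid=185

# ... install Spark under /opt/spark ...

WORKDIR /opt/spark/work-dir
# Make the work dir group-writable for the root group so jobs running as an
# arbitrary UID do not fail with AccessDeniedException
RUN chmod g+w /opt/spark/work-dir

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Run as the unprivileged UID supplied at build time
USER ${spark_uid}
```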

To Do:

  • Debug and resolve issue with client mode test
  • Consider whether to always propagate SPARK_USER_NAME to the environment of the driver and executor pods so that entrypoint.sh can insert it into the /etc/passwd entry (see the sketch after this list)
  • Rebase once PR [SPARK-25023] More detailed security guidance for K8S #23013 is merged and update documentation accordingly
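
As a sketch of the /etc/passwd handling referred to above (the real entrypoint.sh may differ slightly; SPARK_USER_NAME is the optional name being discussed):

```
# Sketch of the entrypoint.sh logic that gives an anonymous UID a passwd entry
myuid=$(id -u)
mygid=$(id -g)
uidentry=$(getent passwd "$myuid")

# If the runtime UID has no passwd entry, create one so that tools which
# resolve the home directory (e.g. Ivy) behave sensibly
if [ -z "$uidentry" ] && [ -w /etc/passwd ]; then
    echo "${SPARK_USER_NAME:-$myuid}:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" >> /etc/passwd
fi
```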

How was this patch tested?

Built the Docker images with the new Dockerfiles that include the USER directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode, which I am currently debugging further.

Also manually dropped myself into the resulting container images via docker run and checked the id -u output to see that the UID is as expected.

Tried customising the UID from the default via the new -u argument to docker-image-tool.sh and again checked the resulting image for the correct runtime UID.
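
Roughly along these lines (tags and output shown here are illustrative):

```
> docker run -it --entrypoint /bin/bash rvesse/spark:uid
bash-4.4$ id -u
185
> bin/docker-image-tool.sh -r rvesse -t uid456 -u 456 build
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
```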

cc @felixcheung @skonto @vanzin

@rvesse
Member Author

rvesse commented Nov 12, 2018

For those with more knowledge of client mode, here is the specific error seen in the integration tests:

```
Exception in thread "main" java.lang.IllegalArgumentException: basedir must be absolute: ?/.ivy2/local
  	at org.apache.ivy.util.Checks.checkAbsolute(Checks.java:48)
  	at org.apache.ivy.plugins.repository.file.FileRepository.setBaseDir(FileRepository.java:135)
  	at org.apache.ivy.plugins.repository.file.FileRepository.<init>(FileRepository.java:44)
  	at org.apache.spark.deploy.SparkSubmitUtils$.createRepoResolvers(SparkSubmit.scala:1063)
  	at org.apache.spark.deploy.SparkSubmitUtils$.buildIvySettings(SparkSubmit.scala:1149)
  	at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:51)
  	at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
  	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
  	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
  	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
  	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
  	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

This looks to me like Spark/Ivy is discovering the user's home directory incorrectly. Have done some digging into the code paths but have not spotted what exactly is wrong yet.

@felixcheung
Member

noted test issue. let's kick off test though

@felixcheung
Member

ok to test

@ifilonenko
Contributor

ifilonenko commented Nov 13, 2018

> This looks to me like Spark/Ivy is discovering the user's home directory incorrectly. Have done some digging into the code paths but have not spotted what exactly is wrong yet.

I stumbled upon this error as well and solved it with the following config addition:

```
--conf spark.driver.extraJavaOptions=-Divy.home=/root/.ivy2 \
```

On Mac, ?/.ivy2 resolves to ~/.ivy2, while on Alpine it seems not to resolve, so I added this hack to get around that.
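
That is, something along these lines on the spark-submit invocation (the Ivy home path is just whatever writable location you prefer):

```
spark-submit \
  --deploy-mode client \
  --conf spark.driver.extraJavaOptions=-Divy.home=/tmp/.ivy2 \
  ... rest of the usual arguments ...
```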

@SparkQA

SparkQA commented Nov 13, 2018

Test build #98787 has finished for PR 23017 at commit 92e22f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rvesse
Member Author

rvesse commented Nov 15, 2018

Resolved the issue with the client mode test. The test itself was badly written: it used the Spark images but overrode the entry point, which bypassed the logic that sets up the /etc/passwd entry for the container UID. Without that logic running there is no home directory for the user, so the Ivy setup fails. Changing the test to not override the entry point allows it to run successfully. Also made a couple of minor changes to this test to make it easier to debug.

@rvesse
Member Author

rvesse commented Nov 15, 2018

> noted test issue. let's kick off test though

@felixcheung This is now resolved, please kick off a retest when you get a chance.

Contributor

@vanzin vanzin left a comment


Do you want to remove WIP from the title now?

Not sure why integration tests didn't run. Will re-kick.

@vanzin
Contributor

vanzin commented Nov 15, 2018

retest this please

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98872 has finished for PR 23017 at commit a9e59b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2018

Test build #98882 has finished for PR 23017 at commit a9e59b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rvesse rvesse changed the title [WIP][SPARK-26015][K8S] Set a default UID for Spark on K8S Images [SPARK-26015][K8S] Set a default UID for Spark on K8S Images Nov 16, 2018
@SparkQA

SparkQA commented Nov 16, 2018

Test build #98911 has finished for PR 23017 at commit 8f4fd19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 16, 2018

LGTM but the integration tests still have not run.

@vanzin
Contributor

vanzin commented Nov 16, 2018

ok to test

@vanzin
Contributor

vanzin commented Nov 16, 2018

@shaneknapp halp!

@ifilonenko
Contributor

retest this please

@ifilonenko
Contributor

https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/5086/
seems to be hanging on the distribution build (45min+)

@shaneknapp
Contributor

@vanzin i'm really not sure what's going on w/this. i noticed it happening on research-jenkins-worker-07 yesterday, so i rebooted the box and that seemed to fix it.

now it's back, and happening on ALL of the workers. at least we know it's not something up w/the systems, but when it builds the dist.

i'll try and build the dist manually on the workers and see what happens as well.

@shaneknapp
Contributor

yep it gets to the same spot when i try and build manually, and fails:

```
[INFO] --- maven-source-plugin:3.0.1:test-jar-no-fork (create-source-jar) @ spark-mllib-local_2.12 ---
[INFO] Building jar: /home/eecs/sknapp/src/spark/mllib-local/target/spark-mllib-local_2.12-3.0.0-SNAPSHOT-test-sources.jar
```

@vanzin
Contributor

vanzin commented Nov 16, 2018

manually on a different machine or just on the workers?

@shaneknapp
Contributor

@vanzin @ifilonenko can you guys try and build the dist locally on your dev laptops? here's a little wrapper script to make it easier:

you'll need to update your PATH to have some version of python3 installed (tho i don't actually think it's necessary), as well as JAVA_HOME... might also need zinc in there as well.

```bash
#!/bin/bash

rm -f spark-*.tgz

export DATE=`date "+%Y%m%d"`
export REVISION=`git rev-parse --short HEAD`

export AMPLAB_JENKINS=1
export PATH="$PATH:/home/anaconda/envs/py3k/bin"

# Prepend JAVA_HOME/bin to fix issue where Zinc's embedded SBT incremental compiler seems to
# ignore our JAVA_HOME and use the system javac instead.
export PATH="$JAVA_HOME/bin:$PATH:/usr/local/bin"

# Generate random port for Zinc
export ZINC_PORT
ZINC_PORT=$(python -S -c "import random; print random.randrange(3030,4030)")

export SBT_OPTS="-Duser.home=$HOME -Dsbt.ivy.home=$HOME/.ivy2"
export SPARK_VERSIONS_SUITE_IVY_PATH="$HOME/.ivy2"

./dev/make-distribution.sh --name ${DATE}-${REVISION} --pip --tgz -DzincPort=${ZINC_PORT} \
     -Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
retcode=$?

exit $retcode
```

@shaneknapp
Contributor

> manually on a different machine or just on the workers?

on a different machine (your laptop, something local, etc).

i'm trying on a couple of different workers and it's always hanging @ that exact step. :\

nothing has been updated, system-wise, on ANY of the workers. i think this might be a problem w/making the dist and am not looking forward to any git archaeology to find the broken change. :(

@shaneknapp
Contributor

btw i wiped all of my .ivy2 and .m2 dirs before building, just in case we're looking at a poisoned artifact.

@vanzin
Contributor

vanzin commented Nov 16, 2018

yeah, it's hanging on my laptop too. seems stuck here and using all the CPU:

"BuilderThread 3" #97 prio=5 os_prio=0 tid=0x00007f76b8609800 nid=0x640d runnable [0x00007f7648f7d000]
   java.lang.Thread.State: RUNNABLE
        at org.jdom2.Element.isAncestor(Element.java:1052)
        at org.jdom2.ContentList.checkPreConditions(ContentList.java:222)
        at org.jdom2.ContentList.add(ContentList.java:244)
        at org.jdom2.Element.addContent(Element.java:950)
        at org.apache.maven.plugins.shade.pom.MavenJDOMWriter.insertAtPreferredLocation(MavenJDOMWriter.java:292)
        at org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateExclusion(MavenJDOMWriter.java:488)
        at org.apache.maven.plugins.shade.pom.MavenJDOMWriter.updateDependency(MavenJDOMWriter.java:1335)
        at org.apache.maven.plugins.shade.pom.MavenJDOMWriter.iterateDependency(MavenJDOMWriter.java:386)

I've seen this stuff before, will take a look at the pom and file a separate bug. No point in polluting this PR even more.
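
(For anyone reproducing this: a thread dump like the one above can be captured with the standard JDK tools against the hung Maven process, e.g.:)

```
jps -lm          # find the PID of the hung Maven/shade process
jstack <pid>     # dump its threads
```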

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98926 has finished for PR 23017 at commit 8f4fd19.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98928 has finished for PR 23017 at commit 8f4fd19.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 17, 2018

retest this please

@SparkQA

SparkQA commented Nov 20, 2018

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/5177/


@SparkQA

SparkQA commented Nov 20, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/5178/

@rvesse
Member Author

rvesse commented Nov 20, 2018

Have now added the doc updates to this, so I think this is ready for final review and merging.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99050 has finished for PR 23017 at commit 6dce8bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2018

Test build #99049 has finished for PR 23017 at commit 4fc40ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -19,9 +19,9 @@ Please see [Spark Security](security.html) and the specific advice below before

## User Identity

-Images built from the project provided Dockerfiles do not contain any [`USER`](https://docs.docker.com/engine/reference/builder/#user) directives. This means that the resulting images will be running the Spark processes as `root` inside the container. On unsecured clusters this may provide an attack vector for privilege escalation and container breakout. Therefore security conscious deployments should consider providing custom images with `USER` directives specifying an unprivileged UID and GID.
+Images built from the project provided Dockerfiles contain a default [`USER`](https://docs.docker.com/engine/reference/builder/#user) directive with a default UID of `185`. This means that the resulting images will be running the Spark processes as this UID inside the container. Security conscious deployments should consider providing custom images with `USER` directives specifying their desired unprivileged UID and GID. The resulting UID should include the root group in its supplementary groups in order to be able to run the Spark executables. Users building their own images with the provided `docker-image-tool.sh` script can use the `-u <uid>` option to specify the desired UID.
```
Contributor


Given the docs you quoted before, you can't override the container's GID, right?

Member Author


Yes you can: the USER directive allows you to specify both a UID and a GID by separating them with a :. However, as noted here, changing the GID is likely to have side effects that are difficult for Spark as a project to deal with, hence the note about ensuring your UID belongs to the root group. If end users need more complex setups then they are free to create their own custom images.
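
For illustration, the uid:gid form of the directive looks like the following; note this is not what the project Dockerfiles do by default, which set only the UID and rely on the root group being supplementary:

```
USER 185:0
```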

Contributor

@skonto skonto Dec 12, 2018


@rvesse I think using GID 0 is safe, I am using that on OpenShift.

Commit summaries:
- Adds USER directives to the Dockerfiles, configurable via a build argument for easy customisation. A -u flag is added to bin/docker-image-tool.sh to make it easy to customise this.
- The client mode test was incorrectly overriding the entry point of the image, so it didn't benefit from the logic that sets up the /etc/passwd entry for the container UID, resulting in no home directory and a failed Ivy setup.
- If SPARK_USER_NAME is set for the pod then use it as part of the /etc/passwd entry we create.
- Add line breaks for clarity
- Remove extra test tag
@rvesse
Member Author

rvesse commented Nov 29, 2018

Rebased to catch up with master and adapt for @vanzin's improvements to Docker build context from PR #23019


@SparkQA

SparkQA commented Nov 29, 2018

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/5518/

@SparkQA

SparkQA commented Nov 29, 2018

Test build #99445 has finished for PR 23017 at commit 8ab866b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor

vanzin commented Nov 29, 2018

Merging to master.

@asfgit asfgit closed this in 1144df3 Nov 29, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#23017 from rvesse/SPARK-26015.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
IceMimosa added a commit to growingio/spark that referenced this pull request Apr 12, 2019
…TA/pages/864879077/on+K8S

Fix ImplicitCastInputTypes

[SPARK-25222][K8S] Improve container status logging

[SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be tmpfs backed on K8S

[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S

[SPARK-25415][SQL] Make plan change log in RuleExecutor configurable by SQLConf

In RuleExecutor, after applying a rule, if the plan has changed, the before and after plan will be logged using level "trace". At times, however, such information can be very helpful for debugging. Hence, making the log level configurable in SQLConf would allow users to turn on the plan change log independently and save the trouble of tweaking log4j settings. Meanwhile, filtering plan change log for specific rules can also be very useful.
So this PR adds two SQL configurations:
1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied.
2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas.

Added UT.

Closes apache#22406 from maryannxue/spark-25415.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

[SPARK-25338][TEST] Ensure to call super.beforeAll() and super.afterAll() in test cases

This PR ensures to call `super.afterAll()` in `override afterAll()` method for test suites.

* Some suites did not call `super.afterAll()`
* Some suites may call `super.afterAll()` only under certain condition
* Others never call `super.afterAll()`.

This PR also ensures to call `super.beforeAll()` in `override beforeAll()` for test suites.

Existing UTs

Closes apache#22337 from kiszk/SPARK-25338.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

[SPARK-25415][SQL][FOLLOW-UP] Add Locale.ROOT when toUpperCase

Add `Locale.ROOT` when `toUpperCase`.

manual tests

Closes apache#22531 from wangyum/SPARK-25415.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

[SPARK-25514][SQL] Generating pretty JSON by to_json

The PR introduces new JSON option `pretty` which allows to turn on `DefaultPrettyPrinter` of `Jackson`'s Json generator. New option is useful in exploring of deep nested columns and in converting of JSON columns in more readable representation (look at the added test).

Added rount trip test which convert an JSON string to pretty representation via `from_json()` and `to_json()`.

Closes apache#22534 from MaxGekk/pretty-json.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

[SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag

This adds a missing end markup tag. This should go `master` branch only.

This is a doc-only change. Manual via `SKIP_API=1 jekyll build`.

Closes apache#22584 from dongjoon-hyun/SPARK-25262.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

[SPARK-23257][K8S] Kerberos Support for Spark on K8S

[SPARK-25682][K8S] Package example jars in same target for dev and distro images.

This way the image generated from both environments has the same layout,
with just a difference in contents that should not affect functionality.

Also added some minor error checking to the image script.

Closes apache#22681 from vanzin/SPARK-25682.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-25745][K8S] Improve docker-image-tool.sh script

Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors.  This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong.

Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this.  The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first.

- Tested with a `mvn clean` working copy and verified that the script now terminates early
- Tested with bad `Dockerfiles` that fail to build to see that early termination occurred

Closes apache#22748 from rvesse/SPARK-25745.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-25730][K8S] Delete executor pods from kubernetes after figuring out why they died

`removeExecutorFromSpark` tries to fetch the reason the executor exited from Kubernetes, which may be useful if the pod was OOMKilled. However, the code previously deleted the pod from Kubernetes first which made retrieving this status impossible. This fixes the ordering.

On a separate but related note, it would be nice to wait some time before removing the pod - to let the operator examine logs and such.

Running on my local cluster.

Author: Mike Kaplinskiy <mike.kaplinskiy@gmail.com>

Closes apache#22720 from mikekap/patch-1.

[SPARK-25828][K8S] Bumping Kubernetes-Client version to 4.1.

[SPARK-24434][K8S] pod template files

[SPARK-25809][K8S][TEST] New K8S integration testing backends

[SPARK-25875][K8S] Merge code to set up driver command into a single step.

Right now there are 3 different classes dealing with building the driver
command to run inside the pod, one for each "binding" supported by Spark.
This has two main shortcomings:

- the code in the 3 classes is very similar; changing things in one place
  would probably mean making a similar change in the others.

- it gives the false impression that the step implementation is the only
  place where binding-specific logic is needed. That is not true; there
  was code in KubernetesConf that was binding-specific, and there's also
  code in the executor-specific config step. So the 3 classes weren't really
  working as a language-specific abstraction.

On top of that, the current code was propagating command line parameters in
a different way depending on the binding. That doesn't seem necessary, and
in fact using environment variables for command line parameters is in general
a really bad idea, since you can't handle special characters (e.g. spaces)
that way.

This change merges the 3 different code paths for Java, Python and R into
a single step, and also merges the 3 code paths to start the Spark driver
in the k8s entry point script. This increases the amount of shared code,
and also moves more feature logic into the step itself, so it doesn't live
in KubernetesConf.

Note that not all logic related to setting up the driver lives in that
step. For example, the memory overhead calculation still lives separately,
except it now happens in the driver config step instead of outside the
step hierarchy altogether.

Some of the noise in the diff is because of changes to KubernetesConf, which
will be addressed in a separate change.

Tested with new and updated unit tests + integration tests.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#22897 from vanzin/SPARK-25875.

[SPARK-25897][K8S] Hook up k8s integration tests to sbt build.

The integration tests can now be run in sbt if the right profile
is enabled, using the "test" task under the respective project.

This avoids having to fall back to maven to run the tests, which
invalidates all your compiled stuff when you go back to sbt, making
development way slower than it should.

There's also a task to run the tests directly without refreshing
the docker images, which is helpful if you just made a change to
the submission code which should not affect the code in the images.

The sbt tasks currently are not very customizable; there's some
very minor things you can set in the sbt shell itself, but otherwise
it's hardcoded to run on minikube.

I also had to make some slight adjustments to the IT code itself,
mostly to remove assumptions about the existing harness.

Tested on sbt and maven.

Closes apache#22909 from vanzin/SPARK-25897.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-25957][K8S] Make building alternate language binding docker images optional

bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark
and SparkR) by default. But not all spark distributions are built with
SparkR and hence this script will fail on such distros.

With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images.

Tested following scenarios.
*bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior)

*bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images

*bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images.

Author: Nagaram Prasad Addepally <ram@cloudera.com>

Closes apache#23053 from ramaddepally/SPARK-25957.

[SPARK-25960][K8S] Support subpath mounting with Kubernetes

This PR adds configurations to use subpaths with Spark on k8s. Subpaths (https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath) allow the user to specify a path within a volume to use instead of the volume's root.

Added unit tests. Ran SparkPi on a cluster with event logging pointed at a subpath-mount and verified the driver host created and used the subpath.

Closes apache#23026 from NiharS/k8s_subpath.

Authored-by: Nihar Sheth <niharrsheth@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26025][K8S] Speed up docker image build on dev repo.

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images


[SPARK-25876][K8S] Simplify kubernetes configuration types.

[SPARK-23781][CORE] Merge token renewer functionality into HadoopDelegationTokenManager.

[SPARK-25515][K8S] Adds a config option to keep executor pods for debugging

[SPARK-26083][K8S] Add Copy pyspark into corresponding dir cmd in pyspark Dockerfile

When I try to run `./bin/pyspark` cmd in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error:
```
$SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ...
Python 2.7.15 (default, Aug 22 2018, 13:24:18)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'
```
This is because `pyspark` folder doesn't exist under `/opt/spark/python/`

Added `COPY python/pyspark ${SPARK_HOME}/python/pyspark` to pyspark Dockerfile to resolve issue above.

Google Kubernetes Engine

Closes apache#23037 from AzureQ/master.

Authored-by: Qi Shao <qi.shao.nyu@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26194][K8S] Auto generate auth secret for k8s apps.

This change modifies the logic in the SecurityManager to do two
things:

- generate unique app secrets also when k8s is being used
- only store the secret in the user's UGI on YARN

The latter is needed so that k8s won't unnecessarily create
k8s secrets for the UGI credentials when only the auth token
is stored there.

On the k8s side, the secret is propagated to executors using
an environment variable instead. This ensures it works in both
client and cluster mode.

Security doc was updated to mention the feature and clarify that
proper access control in k8s should be enabled for it to be secure.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes apache#23174 from vanzin/SPARK-26194.

[SPARK-25877][K8S] Move all feature logic to feature classes.

[SPARK-25922][K8] Spark Driver/Executor "spark-app-selector" label mismatch

In K8S Cluster mode, the algorithm to generate spark-app-selector/spark.app.id of spark driver is different with spark executor.
This patch makes sure spark driver and executor to use the same spark-app-selector/spark.app.id if spark.app.id is set, otherwise it will use superclass applicationId.

In K8S Client mode, spark-app-selector/spark.app.id for executors will use superclass applicationId.

Manually run.

Closes apache#23322 from suxingfate/SPARK-25922.

Lead-authored-by: suxingfate <suxingfate@163.com>
Co-authored-by: xinglwang <xinglwang@ebay.com>
Signed-off-by: Yinan Li <ynli@google.com>

[SPARK-26642][K8S] Add --num-executors option to spark-submit for Spark on K8S.

[SPARK-25887][K8S] Configurable K8S context support

This enhancement allows for specifying the desired context to use for the initial K8S client auto-configuration.  This allows users to more easily access alternative K8S contexts without having to first
explicitly change their current context via kubectl.

Explicitly set my K8S context to a context pointing to a non-existent cluster, then launched Spark jobs with explicitly specified contexts via the new `spark.kubernetes.context` configuration property.

Example Output:

```
> kubectl config current-context
minikube
> minikube status
minikube: Stopped
cluster:
kubectl:
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.context=docker-for-desktop --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 11:57:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 11:57:51 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using context docker-for-desktop from users K8S config file
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: N/A
	 start time: N/A
	 phase: Pending
	 container status: N/A
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: N/A
	 phase: Pending
	 container status: N/A
...
18/10/31 11:58:03 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: 2018-10-31T11:57:52Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: rvesse/spark:debian
		 container state: terminated
		 container started at: 2018-10-31T11:57:54Z
		 container finished at: 2018-10-31T11:58:02Z
		 exit code: 0
		 termination reason: Completed
```

Without the `spark.kubernetes.context` setting this will fail because the current context - `minikube` - is pointing to a non-running cluster e.g.

```
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 12:02:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 12:02:30 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
18/10/31 12:02:31 WARN WatchConnectionManager: Exec Failure
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
Exception in thread "kubernetes-dispatcher-0" Exception in thread "main" java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask611a9c09 rejected from java.util.concurrent.ScheduledThreadPoolExecutor404819e4[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
	at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632)
	at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:678)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.scheduleReconnect(WatchConnectionManager.java:300)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$800(WatchConnectionManager.java:48)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:213)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:204)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	... 4 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
18/10/31 12:02:31 INFO ShutdownHookManager: Shutdown hook called
18/10/31 12:02:31 INFO ShutdownHookManager: Deleting directory /private/var/folders/6b/y1010qp107j9w2dhhy8csvz0000xq3/T/spark-5e649891-8a0f-4f17-bf3a-33b34082eba8
```

Suggested reviews: mccheah liyinan926 - this is the follow up fix to the bug discovered while working on SPARK-25809 (PR apache#22805)

Closes apache#22904 from rvesse/SPARK-25887.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26685][K8S] Correct placement of ARG declaration

Latest Docker releases are stricter in their enforcement of build argument scope.  The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID.

Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into

This commit moves the `ARG` declaration to just before the argument is used such that it is in scope.  It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant

Prior to the patch images are produced where the Python and R images ignore the default/configured UID:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4# whoami
root
bash-4.4# id -u
0
bash-4.4# exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope.

After the patch the correct UID is observed:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Closes apache#23611 from rvesse/SPARK-26685.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26687][K8S] Fix handling of custom Dockerfile paths

With the changes from vanzin's PR apache#23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times.  However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths.  This is because of the following code snippets:

```
(cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[]}" \
    -t $(image_ref spark) \
    -f "$BASEDOCKERFILE" .)
```

Since the script changes to the temporary build context directory and then runs `docker build` there any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script.  This is rather unintuitive and produces somewhat unhelpful errors e.g.

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
...
Successfully tagged rvesse/spark:badpath
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory
Failed to build PySpark Docker image, please refer to Docker build output for details.
```

Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory.

To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths

Validated that relative paths now work as expected:

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
Step 3/15 : RUN set -ex &&     apk upgrade --no-cache &&     apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/examples &&     mkdir -p /opt/spark/work-dir &&     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd
 ---> Using cache
 ---> eb0a568e032f
Step 4/15 : COPY jars /opt/spark/jars
...
Successfully tagged rvesse/spark:badpath
Sending build context to Docker daemon  6.599MB
Step 1/13 : ARG base_img
Step 2/13 : ARG spark_uid=185
Step 3/13 : FROM $base_img
 ---> 8f4fff16f903
Step 4/13 : WORKDIR /
 ---> Running in 25466e66f27f
Removing intermediate container 25466e66f27f
 ---> 1470b6efae61
Step 5/13 : USER 0
 ---> Running in b094b739df37
Removing intermediate container b094b739df37
 ---> 6a27eb4acad3
Step 6/13 : RUN mkdir ${SPARK_HOME}/python
 ---> Running in bc8002c5b17c
Removing intermediate container bc8002c5b17c
 ---> 19bb12f4286a
Step 7/13 : RUN apk add --no-cache python &&     apk add --no-cache python3 &&     python -m ensurepip &&     python3 -m ensurepip &&     rm -r /usr/lib/python*/ensurepip &&     pip install --upgrade pip setuptools &&     rm -r /root/.cache
 ---> Running in 12dcba5e527f
...
Successfully tagged rvesse/spark-py:badpath
```

Closes apache#23613 from rvesse/SPARK-26687.

Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26794][SQL] SparkSession enableHiveSupport does not point to hive but in-memory while the SparkContext exists

```java
public class SqlDemo {
    public static void main(final String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("spark-sql-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession ss = SparkSession.builder().enableHiveSupport().getOrCreate();
        ss.sql("show databases").show();
    }
}
```
Before https://issues.apache.org/jira/browse/SPARK-20946, the demo above point to the right hive metastore if the hive-site.xml is present. But now it can only point to the default in-memory one.

Catalog is now as a variable shared across SparkSessions, it is instantiated with SparkContext's conf. After https://issues.apache.org/jira/browse/SPARK-20946, Session level configs are not pass to SparkContext's conf anymore, so the enableHiveSupport API takes no affect on the catalog instance.

You can set spark.sql.catalogImplementation=hive application wide to solve the problem, or never create a sc before you call SparkSession.builder().enableHiveSupport().getOrCreate()

Here we respect the SparkSession level configuration at the first time to generate catalog within SharedState

1. add ut
2. manually
```scala
test("enableHiveSupport has right to determine the catalog while using an existing sc") {
    val conf = new SparkConf().setMaster("local").setAppName("SharedState Test")
    val sc = SparkContext.getOrCreate(conf)
    val ss = SparkSession.builder().enableHiveSupport().getOrCreate()
    assert(ss.sharedState.externalCatalog.unwrapped.isInstanceOf[HiveExternalCatalog],
      "The catalog should be hive ")

    val ss2 = SparkSession.builder().getOrCreate()
    assert(ss2.sharedState.externalCatalog.unwrapped.isInstanceOf[HiveExternalCatalog],
      "The catalog should be shared across sessions")
  }
```

Without this fix, the above test will fail.
You can apply it to `org.apache.spark.sql.hive.HiveSharedStateSuite`,
and run,
```sbt
./build/sbt  -Phadoop-2.7 -Phive  "hive/testOnly org.apache.spark.sql.hive.HiveSharedStateSuite"
```
to verify.

Closes apache#23709 from yaooqinn/SPARK-26794.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-24894][K8S] Make sure valid host names are created for executors.

Since the host name is derived from the app name, which can contain arbitrary
characters, it needs to be sanitized so that only valid characters are allowed.

On top of that, extra care is taken so that truncation does not leave behind characters that are valid in a host name but not at its start.
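As a rough sketch of the kind of sanitization described above (not the exact Spark implementation; the regexes, the helper name, and the 63-character DNS label limit are assumptions):

```scala
object HostNameSanitizer {
  // DNS-1123 labels are limited to 63 characters (assumed limit here).
  private val MaxLen = 63

  def sanitize(appName: String): String = {
    val cleaned = appName.toLowerCase
      .replaceAll("[^a-z0-9-]", "-") // keep only characters valid in a host name
      .replaceAll("^-+", "")         // a label may not start with '-'
    cleaned.take(MaxLen)
      .replaceAll("-+$", "")         // truncation must not leave a trailing '-'
  }
}

// Example: sanitize("My Spark App! (test)") == "my-spark-app---test"
```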

Closes apache#23781 from vanzin/SPARK-24894.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-25394][CORE] Add an application status metrics source

- Exposes several application status metrics as a metrics source, useful to scrape via JMX instead of mining the metrics REST API. Example use case: Prometheus + JMX exporter.
- Metrics are gathered when a job ends on the AppStatusListener side. This could be more fine-grained, but most metrics, such as completed tasks, are also counted by executors. More metrics could be exposed in the future to avoid scraping executors in some scenarios.
- A config option `spark.app.status.metrics.enabled` is added to enable/disable these metrics; they are disabled by default (see the sketch after this list).
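A minimal sketch of enabling this, assuming the config key from the description above and Spark's standard `spark.metrics.conf.*` mechanism for configuring sinks via SparkConf (sinks are more commonly configured through metrics.properties):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object AppStatusMetricsDemo {
  def main(args: Array[String]): Unit = {
    // Enable the application status metrics source and publish all metrics via JMX,
    // so a JMX exporter (e.g. for Prometheus) can scrape them.
    val conf = new SparkConf()
      .setAppName("app-status-metrics-demo") // placeholder app name
      .set("spark.app.status.metrics.enabled", "true")
      .set("spark.metrics.conf.*.sink.jmx.class", "org.apache.spark.metrics.sink.JmxSink")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.range(1000).count() // run something so job metrics are recorded
    spark.stop()
  }
}
```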

This was manually tested with jmx source enabled and prometheus server on k8s:
![metrics](https://user-images.githubusercontent.com/7945591/45300945-63064d00-b518-11e8-812a-d9b4155ba0c0.png)
In the next pic the job delay is shown for repeated pi calculation (Spark action).
![pi](https://user-images.githubusercontent.com/7945591/45329927-89a1a380-b56b-11e8-9cc1-5e76cb83969f.png)

Closes apache#22381 from skonto/add_app_status_metrics.

Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-25926][CORE] Move config entries in core module to internal.config.

[SPARK-26489][CORE] Use ConfigEntry for hardcoded configs for python/r categories

[SPARK-26445][CORE] Use ConfigEntry for hardcoded configs for driver/executor categories.

[SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs

[SPARK-26239] File-based secret key loading for SASL.

[SPARK-26482][CORE] Use ConfigEntry for hardcoded configs for ui categories

[SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories.

[SPARK-24736][K8S] Let spark-submit handle dependency resolution.

[SPARK-26420][K8S] Generate more unique IDs when creating k8s resource names.

Using the current time as an ID is more prone to clashes than people generally
realize, so try to make things a bit more unique without necessarily using a
UUID, which would eat too much space in the names otherwise.

The implemented approach uses some bits from the current time, plus some random
bits, which should be more resistant to clashes.
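The general idea, sketched below; this is not the exact Spark implementation, and the bit widths, base-36 encoding, and helper name are assumptions:

```scala
import java.security.SecureRandom

object ResourceNameSuffix {
  private val random = new SecureRandom()

  // Mix some low-order bits of the current time with random bits and render them
  // compactly, so resource-name suffixes stay short but rarely collide.
  def uniqueId(): String = {
    val timeBits = System.currentTimeMillis() & 0xFFFFFFL    // ~24 bits of time (assumption)
    val randBits = random.nextInt(1 << 20).toLong            // ~20 random bits (assumption)
    java.lang.Long.toString((timeBits << 20) | randBits, 36) // base-36 keeps it compact
  }
}
```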

Closes apache#23805 from vanzin/SPARK-26420.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

[K8S][MINOR] Log minikube version when running integration tests.

Closes apache#23893 from vanzin/minikube-version.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26995][K8S] Make ld-linux-x86-64.so.2 visible to snappy native library under /lib in docker image with Alpine Linux

[SPARK-27023][K8S] Make k8s client timeouts configurable

Makes the Kubernetes client timeouts configurable. No test suite exists for the client factory class; happy to add one if needed.
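A hedged sketch of what this looks like from the submitting application; the config key names below follow my reading of the change and should be checked against the released documentation (values are in milliseconds):

```scala
import org.apache.spark.SparkConf

object K8sClientTimeouts {
  // Raise the Kubernetes client timeouts for slow or heavily loaded API servers.
  def withRelaxedTimeouts(conf: SparkConf): SparkConf =
    conf
      .set("spark.kubernetes.submission.connectionTimeout", "30000") // assumed key
      .set("spark.kubernetes.submission.requestTimeout", "30000")    // assumed key
      .set("spark.kubernetes.driver.connectionTimeout", "30000")     // assumed key
      .set("spark.kubernetes.driver.requestTimeout", "30000")        // assumed key
}
```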

Closes apache#23928 from onursatici/os/k8s-client-timeouts.

Lead-authored-by: Onur Satici <osatici@palantir.com>
Co-authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

[SPARK-27061][K8S] Expose Driver UI port on driver service to access …

Exposes the Spark UI port on the driver service so that the UI and the logs it serves can be accessed via the service.

The patch was tested using unit tests contributed as part of the PR.

Closes apache#23990 from chandulal/SPARK-27061.

Authored-by: chandulal.kavar <cckavar@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26343][K8S] Try to speed up running local k8s integration tests

Speeds up running the k8s integration tests locally by allowing folks to skip the tgz distribution build and extraction.

Tests can be run locally without a full Spark distribution, using just a local build.

Closes apache#23380 from holdenk/SPARK-26343-Speed-up-running-the-kubernetes-integration-tests-locally.

Authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-26729][K8S] Make image names under test configurable

[SPARK-24793][K8S] Enhance spark-submit for app management

- Supports the `--kill` & `--status` flags.
- Supports globs, which is useful in general; see this long-standing [issue](kubernetes/kubernetes#17144 (comment)) for kubectl.

Tested manually against running apps. Example output:

Submission Id reported at launch time:

```
2019-01-20 23:47:56 INFO  Client:58 - Waiting for application spark-pi with submissionId spark:spark-pi-1548020873671-driver to finish...
```

Killing the app:

```
./bin/spark-submit --kill spark:spark-pi-1548020873671-driver --master  k8s://https://192.168.2.8:8443
2019-01-20 23:48:07 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 23:48:07 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address

```

App terminates with 143 (SIGTERM; since we use tini, this should lead to a [graceful shutdown](https://cloud.google.com/solutions/best-practices-for-building-containers)):

```
2019-01-20 23:48:08 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state:
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T21:48:00Z
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state:
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Failed
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T21:48:00Z
		 container finished at: 2019-01-20T21:48:08Z
		 exit code: 143
		 termination reason: Error
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: skonto/spark:k8s-3.0.0
	 container state: terminated
	 container started at: 2019-01-20T21:48:00Z
	 container finished at: 2019-01-20T21:48:08Z
	 exit code: 143
	 termination reason: Error
2019-01-20 23:48:09 INFO  Client:58 - Application spark-pi with submissionId spark:spark-pi-1548020873671-driver finished.
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Shutdown hook called
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Deleting directory /tmp/spark-f114b2e0-5605-4083-9203-a4b1c1f6059e

```

Glob scenario:

```
./bin/spark-submit --status spark:spark-pi* --master  k8s://https://192.168.2.8:8443
2019-01-20 22:27:44 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 22:27:44 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address
Application status (driver):
	 pod name: spark-pi-1547948600328-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-f13f01702f0b4503975ce98252d59b94, spark-role -> driver
	 pod uid: c576e1c6-1c54-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:43:22Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:43:22Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T01:43:27Z
Application status (driver):
	 pod name: spark-pi-1547948792539-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-006d252db9b24f25b5069df357c30264, spark-role -> driver
	 pod uid: 38375b4b-1c55-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:46:35Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:46:35Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T01:46:39Z
		 container finished at: 2019-01-20T01:46:56Z
		 exit code: 0
		 termination reason: Completed

```

Closes apache#23599 from skonto/submit_ops_extension.

Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

[SPARK-24902][K8S] Add PV integration tests

- Adds persistent volume integration tests.
- Adds a custom tag to the test to exclude it when run against a cloud backend.
- Assumes the default filesystem type for the host, which AFAIK is ext4.

Manually run the tests against minikube as usual:
```
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 192 milliseconds.
Run starting. Expected test count is: 16
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Test PVs with local storage
```

Closes apache#23514 from skonto/pvctests.

Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Signed-off-by: shane knapp <incomplete@gmail.com>

[SPARK-27216][CORE][BACKPORT-2.4] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

Fix ImplicitCastInputTypes
pan3793 pushed a commit to apache/kyuubi that referenced this pull request Sep 23, 2022
…deployment

### _Why are the changes needed?_

When Spark runs on Kubernetes, it uses the env `SPARK_USER_NAME` as the user name.
So Kyuubi should build the Spark engine with this env set when using a proxy user or a keytab.
This conf only affects the Kubernetes case (see the sketch below).
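A minimal sketch of propagating that user name into the driver and executor pods from the submitting side, assuming the standard `spark.kubernetes.driverEnv.*` and `spark.executorEnv.*` conf prefixes (the helper name is hypothetical):

```scala
import org.apache.spark.SparkConf

object SparkUserNameEnv {
  // Propagate the session/proxy user into the containers so the image's
  // entrypoint can create a matching /etc/passwd entry for the non-root UID.
  def withSparkUserName(conf: SparkConf, userName: String): SparkConf =
    conf
      .set("spark.kubernetes.driverEnv.SPARK_USER_NAME", userName)
      .set("spark.executorEnv.SPARK_USER_NAME", userName)
}
```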

Ref: apache/spark#23017

### _How was this patch tested?_
- [x] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [ ] [Run test](https://kyuubi.apache.org/docs/latest/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #3527 from zwangsheng/feature/add_spark_user_name.

Closes #3527

9596372 [zwangsheng] only k8s case
ddd713f [zwangsheng] fix
48b9b22 [zwangsheng] add

Authored-by: zwangsheng <2213335496@qq.com>
Signed-off-by: Cheng Pan <chengpan@apache.org>