
[SPARK-5493] [core] Add option to impersonate user. #4405

Closed
vanzin wants to merge 7 commits

Conversation

vanzin
Contributor

@vanzin vanzin commented Feb 5, 2015

Hadoop has a feature that allows users to impersonate other users
when submitting applications or talking to HDFS, for example. These
impersonated users are generally referred to as "proxy users".

Services such as Oozie or Hive use this feature to run applications
as the requesting user.

This change makes SparkSubmit accept a new command line option to
run the application as a proxy user. It also fixes the plumbing
of the user name through the UI (and a couple of other places) to
refer to the correct user running the application, which can be
different than sys.props("user.name") even without proxies (e.g.
when using kerberos).

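For illustration, a submission using the new option could look roughly like the following (the jar, class, and user names here are hypothetical; the submitting user must also be allowed to impersonate the proxy user via Hadoop's hadoop.proxyuser.* settings, otherwise the cluster rejects the request):

    kinit                                 # only needed on a kerberized cluster; logs in the submitting (service) user
    spark-submit --master yarn-cluster \
      --proxy-user alice \
      --class com.example.MyApp \
      my-app.jar
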
@vanzin
Contributor Author

vanzin commented Feb 5, 2015

Some description of the testing I did:

  • normal submission without kerberos
  • impersonated submission without kerberos (got expected "unauthorized" error from Yarn since I did not have impersonation configured)
  • normal submission as system user ("oozie" - id = 480) with kerberos (denied by Yarn since id < 1000)
  • impersonated submission as same system user with kerberos, app runs as proxy user (and proxy user shows up in history as owner of the app).

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26861 has finished for PR 4405 at commit 0540d38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@SparkQA

SparkQA commented Feb 6, 2015

Test build #26930 has finished for PR 4405 at commit b6c947d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "public class " + className + extendsText + " implements java.io.Serializable
    • case class RegisterExecutor(

@harishreedharan
Contributor

In this case, you are only running SparkSubmit as the proxy user. Should we not have the executor code also run as the proxy user, so that any writes from the app to HDFS show the proxy user - or is that not the intent?

@vanzin
Contributor Author

vanzin commented Feb 6, 2015

That should already be handled.

  • CoarseGrainedExecutorBackend does run the executor inside a doAs block, running as SPARK_USER (see the sketch after this list)
  • Yarn actually runs the underlying container processes as the requesting user too (unlike standalone)
  • HDFS authorization is done through the delegation tokens generated by the driver (see Client.scala:obtainTokensForNamenodes).
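
To illustrate the first bullet, here is a rough Scala sketch of the run-as-SPARK_USER pattern. This is not the actual CoarseGrainedExecutorBackend code; the object and method names are made up for illustration, and it only shows the general shape: build a UGI for SPARK_USER, carry over the delegation tokens, and run the work inside doAs.

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    object RunAsSparkUserSketch {
      // Runs `body` as the user named by SPARK_USER (falling back to the OS user),
      // carrying over any delegation tokens attached to the current user.
      def runAsSparkUser(body: () => Unit): Unit = {
        val sparkUser = Option(System.getenv("SPARK_USER"))
          .getOrElse(System.getProperty("user.name"))
        val ugi = UserGroupInformation.createRemoteUser(sparkUser)
        // Hand over the delegation tokens obtained by the driver so HDFS access
        // is authorized even without a kerberos login in this process.
        ugi.addCredentials(UserGroupInformation.getCurrentUser().getCredentials())
        ugi.doAs(new PrivilegedExceptionAction[Unit] {
          override def run(): Unit = body()
        })
      }
    }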

@harishreedharan
Contributor

Oh, I didn't know YARN already takes care of running the container as the requesting user.

+1.

Marcelo Vanzin added 2 commits February 6, 2015 17:24
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@SparkQA

SparkQA commented Feb 7, 2015

Test build #26979 has finished for PR 4405 at commit 8af06ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Feb 9, 2015

Hi @chesterxgchen.

Assumptions 1 and 3 are wrong: the code doesn't assume that. Proxy users work just fine without kerberos as long as the configuration allows it (in fact I just tested it). You may argue that it's useless in non-kerberos mode (where you can just use UserGroupInformation.createRemoteUser() and achieve the same), but that's beside the point of this patch.

Errors, just like any other error when submitting the app to Yarn, are reported as exceptions thrown by the Yarn client code. I assume that standalone and mesos have no notion of proxy users, so they wouldn't ever complain about it.

For spark jobs, does one need to add a delegation token to the job's credentials?

Delegation tokens are only needed with kerberos, which is currently only supported by Yarn, and which, as you already noticed, already handles them.

@chesterxgchen

@vanzin

Did you test with a secured Hadoop cluster or just a normal cluster? If the Hadoop cluster is secured, I think these assumptions are required. I recently finished our Hadoop Kerberos authentication implementation with Pig, MapReduce, HDFS, Sqoop and Spark (for Spark in Yarn cluster mode). I don't think you can access a secured cluster without kerberos authentication (assumption 1). And if the UserGroupInformation uses SIMPLE mode to access a secured Hadoop cluster, you will get an exception at some point (assumption 3).

In our case, we did not use SparkSubmit, but used the Yarn Client directly. I don't understand why standalone mode or mesos mode won't need to have a job delegation token? Maybe you can elaborate on that a bit more.

If you look at Oozie's implementation, you can see that before the MR job is submitted, the job delegation token is added to the JobClient's credentials. This is true regardless of whether Yarn is used or not.

Another question related to the overall approach. This seems fine when calling SparkSubmit from the command line, since the current user can be authenticated with kinit and the proxy user can be impersonated via createProxyUser. The user who manages the Spark job submission is responsible for managing the kerberos TGT lifetime, renewal, etc. If the ticket expires, the user can re-run kinit or use a cron job to keep it from expiring. In this case, Spark merely creates the proxy user.

For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much. The application can call createProxyUser in its own code instead of letting Spark do it; the application already does the kerberos login (UserGroupInformation.loginUserFromKeytab), renews via UserGroupInformation.checkTGTAndReloginFromKeytab, handles ticket expiration, adds job tokens, etc.

So is the approach only intended for command line use? Does it make sense to push more logic into Spark? Or does this logic not belong in Spark?

Thanks,
Chester Chen

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Did you test with a secured Hadoop cluster or just a normal cluster?

Both. In kerberos mode you have to be logged in before you submit the app, but that was already true before this change. If you're not logged in, you can't submit. So this change is not changing any assumptions.

I don't understand why standalone mode or mesos mode won't need to have a job delegation token?

Because they don't need it. They don't work with kerberos, and you don't need delegation tokens without kerberos.

If you look at Oozie's implementation, you can see that before the MR job is submitted...

Not sure how that's related to Spark. Spark gets the needed delegation tokens, there's nothing else to be done.

For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much.

Both Oozie and Hive need to fork external processes to run Spark. When those processes fork, they'll be running with the kerberos credentials of the user running those services, not as the "proxy user". So the forked process needs to know which user to impersonate. loginUserFromKeytab is irrelevant here.

So is the approach only intended for command line use? Does it make sense to push more logic into Spark?

Yes, this is only for command line use (or, in other words, running Spark as a separate process). Anything else would be a lot more complicated and probably a much larger project, which is really not needed for the use case at hand (Hive and, eventually, Oozie).

@chesterxgchen

Thanks for the detailed reply.

Regarding the two points above (why standalone or mesos mode wouldn't need a job delegation token, and the comparison with Oozie's implementation): you answered them as though they were separate questions, but they are actually one question, and I am not sure you understood my original question.

Spark submits jobs just like other applications such as Oozie submit MR jobs. On a secured cluster you need to use a delegation token before submitting; I am just using Oozie as an example, it could be Oozie, Hue, Knox or Tajo. I am under the assumption that Spark would need a delegation token as well, even in standalone mode, similar to those other applications.

I did not get a clear picture from your answer: is it that Spark standalone mode does not support kerberos, or that Spark already has the delegation token?

As for forking the process: if you fork the process, of course, that's a different story. I am thinking about applications that run Spark jobs without forking a process. Our application works like this, and I am sure there are many other applications like it that don't need to run kinit from the command line.

Thanks for clarifying what the PR is intended for.

Chester

@@ -35,7 +35,8 @@ function gatherSparkSubmitOpts() {
     --master | --deploy-mode | --class | --name | --jars | --packages | --py-files | --files | \
     --conf | --repositories | --properties-file | --driver-memory | --driver-java-options | \
     --driver-library-path | --driver-class-path | --executor-memory | --driver-cores | \
-    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives)
+    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives |

does this need a \ similar to earlier lines?

@pwendell
Contributor

Hey @vanzin I made two comments but overall this looks good.

One question - in cluster mode, does this doAs propagate along to the yarn submission client? I'm assuming it does (and that this handles both cases) but I thought I would ask to be sure.

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Hi @chesterxgchen,

is it that Spark standalone mode does not support kerberos?

Correct. Currently, there's no point in talking about kerberos for anything but Yarn.

I am thinking about applications that run Spark jobs without forking a process.

That is obviously not covered by this patch. For that application, it would need to do the impersonation itself before creating a SparkContext (or doing whatever it does to launch Spark).

Well, it could potentially call SparkSubmit programmatically and pass this argument, but I don't think that's a supported use case.
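
As a rough sketch of what application-side impersonation before creating a SparkContext could look like (illustrative only: the user name, app name, and master value are placeholders, and the service is assumed to have already logged in with its own kerberos credentials where needed):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}

    object ImpersonatedLauncherSketch {
      def main(args: Array[String]): Unit = {
        // Impersonate the requesting user; the current (service) user must be
        // allowed to do so by the cluster's hadoop.proxyuser.* configuration.
        val proxyUgi = UserGroupInformation.createProxyUser(
          "alice", UserGroupInformation.getCurrentUser())

        val sc = proxyUgi.doAs(new PrivilegedExceptionAction[SparkContext] {
          override def run(): SparkContext =
            new SparkContext(new SparkConf()
              .setMaster("yarn-client")        // placeholder master setting
              .setAppName("impersonated-app"))
        })

        try {
          // Jobs submitted through this context run on behalf of the proxy user.
          println(sc.parallelize(1 to 10).count())
        } finally {
          sc.stop()
        }
      }
    }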

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Hi @pwendell ,

One question - in cluster mode, does this doAs propagate along to the yarn submission client?

The submission is run as the proxy user. So when you start the "cluster mode" app, it will be started as that proxy user.

Marcelo Vanzin added 3 commits February 10, 2015 13:38
Needed some extra code to handle some special exceptions that the new
code may hit, since they seem to confuse the JVM.
@SparkQA

SparkQA commented Feb 10, 2015

Test build #27233 has finished for PR 4405 at commit 05bfc08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27234 has finished for PR 4405 at commit df82427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

LGTM - thanks @vanzin I'll pull it in!

asfgit pushed a commit that referenced this pull request Feb 11, 2015
Hadoop has a feature that allows users to impersonate other users
when submitting applications or talking to HDFS, for example. These
impersonated users are referred generally as "proxy users".

Services such as Oozie or Hive use this feature to run applications
as the requesting user.

This change makes SparkSubmit accept a new command line option to
run the application as a proxy user. It also fixes the plumbing
of the user name through the UI (and a couple of other places) to
refer to the correct user running the application, which can be
different than `sys.props("user.name")` even without proxies (e.g.
when using kerberos).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4405 from vanzin/SPARK-5493 and squashes the following commits:

df82427 [Marcelo Vanzin] Clarify the reason for the special exception handling.
05bfc08 [Marcelo Vanzin] Remove unneeded annotation.
4840de9 [Marcelo Vanzin] Review feedback.
8af06ff [Marcelo Vanzin] Fix usage string.
2e4fa8f [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
b6c947d [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
0540d38 [Marcelo Vanzin] [SPARK-5493] [core] Add option to impersonate user.

(cherry picked from commit ed167e7)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
@asfgit asfgit closed this in ed167e7 Feb 11, 2015
@vanzin vanzin deleted the SPARK-5493 branch February 12, 2015 00:00
@hemshankar

What do we mean by "plumbing of the user name through the UI"?

@hemshankar

I have a few questions about running in client mode and cluster mode.
Currently I am using a Cloudera Hadoop single-node cluster (kerberos enabled).

In client mode I use the following commands:

kinit
spark-submit --master yarn-client --proxy-user cloudera examples/src/main/python/pi.py 

This works fine. In cluster mode I use the following command (no kinit done and no TGT present in the cache):

spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster examples/src/main/python/pi.py 

This also works fine. But when I use the following command in cluster mode (no kinit done and no TGT present in the cache):

   spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster --proxy-user cloudera examples/src/main/python/pi.py 

it throws the following error:

  No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

I guess in cluster mode spark-submit does not look for a TGT on the client machine; it transfers the keytab file to the cluster and then starts the Spark job. So why does specifying the "--proxy-user" option look for a TGT when submitting in "yarn-cluster" mode? Am I doing something wrong?

@vanzin
Contributor Author

vanzin commented Jan 13, 2016

@hemshankar Please don't use github to ask questions / point out possible issues. See http://spark.apache.org/community.html.
