
[SPARK-5493] [core] Add option to impersonate user. #4405

Closed
vanzin wants to merge 7 commits

Conversation

vanzin
Contributor

@vanzin vanzin commented Feb 5, 2015

Hadoop has a feature that allows users to impersonate other users
when submitting applications or talking to HDFS, for example. These
impersonated users are generally referred to as "proxy users".

Services such as Oozie or Hive use this feature to run applications
as the requesting user.

This change makes SparkSubmit accept a new command line option to
run the application as a proxy user. It also fixes the plumbing
of the user name through the UI (and a couple of other places) to
refer to the correct user running the application, which can be
different than sys.props("user.name") even without proxies (e.g.
when using kerberos).

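For illustration, a submission using the new option could look roughly like the following (the jar, class, and user names here are hypothetical; the submitting user must also be allowed to impersonate the proxy user via Hadoop's hadoop.proxyuser.* settings, otherwise the cluster rejects the request):

    kinit                                 # only needed on a kerberized cluster; logs in the submitting (service) user
    spark-submit --master yarn-cluster \
      --proxy-user alice \
      --class com.example.MyApp \
      my-app.jar
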
@vanzin
Contributor Author

vanzin commented Feb 5, 2015

Some description of the testing I did:

  • normal submission without kerberos
  • impersonated submission without kerberos (got expected "unauthorized" error from Yarn since I did not have impersonation configured)
  • normal submission as system user ("oozie" - id = 480) with kerberos (denied by Yarn since id < 1000)
  • impersonated submission as same system user with kerberos, app runs as proxy user (and proxy user shows up in history as owner of the app).

@SparkQA

SparkQA commented Feb 5, 2015

Test build #26861 has finished for PR 4405 at commit 0540d38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@SparkQA

SparkQA commented Feb 6, 2015

Test build #26930 has finished for PR 4405 at commit b6c947d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • "public class " + className + extendsText + " implements java.io.Serializable
    • case class RegisterExecutor(

@harishreedharan
Contributor

In this case, you are only running SparkSubmit as the proxy user. Should we not have the executor code also run as the proxy user, so that any writes from the app to HDFS show the proxy user - or is that not the intent?

@vanzin
Contributor Author

vanzin commented Feb 6, 2015

That should already be handled.

  • CoarseGrainedExecutorBackend does run the executor inside a doAs block, running as SPARK_USER (see the sketch after this list)
  • Yarn actually runs the underlying container processes as the requesting user too (unlike standalone)
  • HDFS authorization is done through the delegation tokens generated by the driver (see Client.scala:obtainTokensForNamenodes).
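
To illustrate the first bullet, here is a rough Scala sketch of the run-as-SPARK_USER pattern. This is not the actual CoarseGrainedExecutorBackend code; the object and method names are made up for illustration, and it only shows the general shape: build a UGI for SPARK_USER, carry over the delegation tokens, and run the work inside doAs.

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    object RunAsSparkUserSketch {
      // Runs `body` as the user named by SPARK_USER (falling back to the OS user),
      // carrying over any delegation tokens attached to the current user.
      def runAsSparkUser(body: () => Unit): Unit = {
        val sparkUser = Option(System.getenv("SPARK_USER"))
          .getOrElse(System.getProperty("user.name"))
        val ugi = UserGroupInformation.createRemoteUser(sparkUser)
        // Hand over the delegation tokens obtained by the driver so HDFS access
        // is authorized even without a kerberos login in this process.
        ugi.addCredentials(UserGroupInformation.getCurrentUser().getCredentials())
        ugi.doAs(new PrivilegedExceptionAction[Unit] {
          override def run(): Unit = body()
        })
      }
    }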

@harishreedharan
Contributor

Oh, I didn't know YARN already takes care of running the container as the requesting user.

+1.

Marcelo Vanzin added 2 commits February 6, 2015 17:24
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@SparkQA

SparkQA commented Feb 7, 2015

Test build #26979 has finished for PR 4405 at commit 8af06ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin
Contributor Author

vanzin commented Feb 9, 2015

Hi @chesterxgchen.

Assumptions 1 and 3 are wrong: the code doesn't assume that. Proxy users work just fine without kerberos as long as the configuration allows it (in fact I just tested it). You may argue that it's useless in non-kerberos mode (where you can just use UserGroupInformation.createRemoteUser() and achieve the same), but that's beside the point of this patch.

Errors, just like any other error when submitting the app to Yarn, are reported as exceptions thrown by the Yarn client code. I assume that standalone and mesos have no notion of proxy users, so they wouldn't ever complain about it.

For spark jobs, does one need to add a delegation token to the job's credentials?

Delegation tokens are only needed with kerberos, which is currently only supported by Yarn, and which, as you already noticed, already handles them.

@chesterxgchen

@vanzin

Did you test with a secured Hadoop cluster or just a normal cluster? If the Hadoop cluster is secured, I think these assumptions are required. I recently finished our Hadoop Kerberos authentication implementation with Pig, MapReduce, HDFS, Sqoop and Spark (for Spark in Yarn cluster mode). I don't think you can access a secured cluster without kerberos authentication (assumption 1). And if the UserGroupInformation uses SIMPLE mode to access a secured Hadoop cluster, you will get an exception at some point (assumption 3).

In our case, we did not use SparkSubmit, but used the Yarn Client directly. I don't understand why standalone mode or mesos mode won't need to have a job delegation token? Maybe you can elaborate on that a bit more.

If you look at Oozie's implementation, you can see that before the MR job is submitted, the job delegation token is added to the JobClient's credentials. This is true regardless of whether Yarn is used or not.

Another question related to the overall approach. This seems fine when calling SparkSubmit from the command line, since the current user can be authenticated with kinit and the proxy user can be impersonated via createProxyUser. The user who manages the Spark job submission is responsible for managing the kerberos TGT lifetime, renewal, etc. If the ticket expires, the user can re-run kinit or use a cron job to keep it from expiring. In this case, Spark merely creates the proxy user.

For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much. The application can call createProxyUser in its own code instead of letting Spark do it; the application already does the kerberos login (UserGroupInformation.loginUserFromKeytab), renews via UserGroupInformation.checkTGTAndReloginFromKeytab, handles ticket expiration, adds job tokens, etc.

So is the approach only intended for command line use? Does it make sense to push more logic into Spark? Or does this logic not belong in Spark?

Thanks,
Chester Chen

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Did you test with a secured Hadoop cluster or just a normal cluster?

Both. In kerberos mode you have to be logged in before you submit the app, but that was already true before this change. If you're not logged in, you can't submit. So this change is not changing any assumptions.

I don't understand why standalone mode or mesos mode won't need to have a job delegation token?

Because they don't need it. They don't work with kerberos, and you don't need delegation tokens without kerberos.

If you look at Oozie's implementation, you can see that before the MR job is submitted...

Not sure how that's related to Spark. Spark gets the needed delegation tokens, there's nothing else to be done.

For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much.

Both Oozie and Hive need to fork external processes to run Spark. When those processes fork, they'll be running with the kerberos credentials of the user running those services, not as the "proxy user". So the forked process needs to know which user to impersonate. loginUserFromKeytab is irrelevant here.

So is the approach only intended for command line use? Does it make sense to push more logic into Spark?

Yes, this is only for command line use (or, in other words, running Spark as a separate process). Anything else would be a lot more complicated and probably a much larger project, which is really not needed for the use case at hand (Hive and, eventually, Oozie).

@chesterxgchen

Thanks for the detailed reply.

Regarding the two points above (why standalone or mesos mode wouldn't need a job delegation token, and the comparison with Oozie's implementation): you answered them as though they were separate questions, but they are actually one question, and I am not sure you understood my original question.

Spark submits jobs just like other applications such as Oozie submit MR jobs. On a secured cluster you need to use a delegation token before submitting; I am just using Oozie as an example, it could be Oozie, Hue, Knox or Tajo. I am under the assumption that Spark would need a delegation token as well, even in standalone mode, similar to those other applications.

I did not get a clear picture from your answer: is it that Spark standalone mode does not support kerberos, or that Spark already has the delegation token?

As for forking the process: if you fork the process, of course, that's a different story. I am thinking about applications that run Spark jobs without forking a process. Our application works like this, and I am sure there are many other applications like it that don't need to run kinit from the command line.

Thanks for clarifying what the PR is intended for.

Chester

@@ -35,7 +35,8 @@ function gatherSparkSubmitOpts() {
     --master | --deploy-mode | --class | --name | --jars | --packages | --py-files | --files | \
     --conf | --repositories | --properties-file | --driver-memory | --driver-java-options | \
     --driver-library-path | --driver-class-path | --executor-memory | --driver-cores | \
-    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives)
+    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives |

does this need a \ similar to earlier lines?

@pwendell
Contributor

Hey @vanzin I made two comments but overall this looks good.

One question - in cluster mode, does this doAs propagate along to the yarn submission client? I'm assuming it does (and that this handles both cases) but I thought I would ask to be sure.

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Hi @chesterxgchen,

is it that Spark standalone mode does not support kerberos?

Correct. Currently, there's no point in talking about kerberos for anything but Yarn.

I am thinking about applications that run Spark jobs without forking a process.

That is obviously not covered by this patch. For that application, it would need to do the impersonation itself before creating a SparkContext (or doing whatever it does to launch Spark).

Well, it could potentially call SparkSubmit programmatically and pass this argument, but I don't think that's a supported use case.
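
As a rough sketch of what application-side impersonation before creating a SparkContext could look like (illustrative only: the user name, app name, and master value are placeholders, and the service is assumed to have already logged in with its own kerberos credentials where needed):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.spark.{SparkConf, SparkContext}

    object ImpersonatedLauncherSketch {
      def main(args: Array[String]): Unit = {
        // Impersonate the requesting user; the current (service) user must be
        // allowed to do so by the cluster's hadoop.proxyuser.* configuration.
        val proxyUgi = UserGroupInformation.createProxyUser(
          "alice", UserGroupInformation.getCurrentUser())

        val sc = proxyUgi.doAs(new PrivilegedExceptionAction[SparkContext] {
          override def run(): SparkContext =
            new SparkContext(new SparkConf()
              .setMaster("yarn-client")        // placeholder master setting
              .setAppName("impersonated-app"))
        })

        try {
          // Jobs submitted through this context run on behalf of the proxy user.
          println(sc.parallelize(1 to 10).count())
        } finally {
          sc.stop()
        }
      }
    }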

@vanzin
Contributor Author

vanzin commented Feb 10, 2015

Hi @pwendell ,

One question - in cluster mode, does this doAs propagate along to the yarn submission client?

The submission is run as the proxy user. So when you start the "cluster mode" app, it will be started as that proxy user.

Marcelo Vanzin added 3 commits February 10, 2015 13:38
Needed some extra code to handle some special exceptions that the new
code may hit, since they seem to confuse the JVM.
@SparkQA

SparkQA commented Feb 10, 2015

Test build #27233 has finished for PR 4405 at commit 05bfc08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27234 has finished for PR 4405 at commit df82427.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pwendell
Contributor

LGTM - thanks @vanzin I'll pull it in!

asfgit pushed a commit that referenced this pull request Feb 11, 2015
Hadoop has a feature that allows users to impersonate other users
when submitting applications or talking to HDFS, for example. These
impersonated users are referred generally as "proxy users".

Services such as Oozie or Hive use this feature to run applications
as the requesting user.

This change makes SparkSubmit accept a new command line option to
run the application as a proxy user. It also fixes the plumbing
of the user name through the UI (and a couple of other places) to
refer to the correct user running the application, which can be
different than `sys.props("user.name")` even without proxies (e.g.
when using kerberos).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4405 from vanzin/SPARK-5493 and squashes the following commits:

df82427 [Marcelo Vanzin] Clarify the reason for the special exception handling.
05bfc08 [Marcelo Vanzin] Remove unneeded annotation.
4840de9 [Marcelo Vanzin] Review feedback.
8af06ff [Marcelo Vanzin] Fix usage string.
2e4fa8f [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
b6c947d [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
0540d38 [Marcelo Vanzin] [SPARK-5493] [core] Add option to impersonate user.

(cherry picked from commit ed167e7)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
@asfgit asfgit closed this in ed167e7 Feb 11, 2015
@vanzin vanzin deleted the SPARK-5493 branch February 12, 2015 00:00
@hemshankar

What do we mean by "plumbing of the user name through the UI"?

@hemshankar

I have a few questions about running in client mode and cluster mode.
Currently I am using a Cloudera Hadoop single-node cluster (kerberos enabled).

In client mode I use the following commands:

kinit
spark-submit --master yarn-client --proxy-user cloudera examples/src/main/python/pi.py 

This works fine. In cluster mode I use the following command (no kinit done and no TGT present in the cache):

spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster examples/src/main/python/pi.py 

This also works fine. But when I use the following command in cluster mode (no kinit done and no TGT present in the cache):

   spark-submit --principal <myprinc> --keytab <KT location> --master yarn-cluster --proxy-user cloudera examples/src/main/python/pi.py 

it throws the following error:

  No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

I guess in cluster mode spark-submit does not look for a TGT on the client machine; it transfers the keytab file to the cluster and then starts the Spark job. So why does specifying the "--proxy-user" option look for a TGT when submitting in "yarn-cluster" mode? Am I doing something wrong?

@vanzin
Contributor Author

vanzin commented Jan 13, 2016

@hemshankar Please don't use github to ask questions / point out possible issues. See http://spark.apache.org/community.html.
