[SPARK-5493] [core] Add option to impersonate user. #4405
Conversation
Hadoop has a feature that allows users to impersonate other users when submitting applications or talking to HDFS, for example. These impersonated users are generally referred to as "proxy users". Services such as Oozie or Hive use this feature to run applications as the requesting user. This change makes SparkSubmit accept a new command line option to run the application as a proxy user. It also fixes the plumbing of the user name through the UI (and a couple of other places) to refer to the correct user running the application, which can be different than `sys.props("user.name")` even without proxies (e.g. when using kerberos).
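For readers unfamiliar with the Hadoop feature, here is a minimal Scala sketch of the proxy-user pattern the option builds on, using Hadoop's `UserGroupInformation` API. This is illustrative only, not the actual SparkSubmit code:

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object ProxyUserSketch {
  // Run `body` as the given proxy user, on top of the already
  // authenticated real user (the one holding the actual credentials).
  def runAsProxyUser[T](proxyUser: String)(body: => T): T = {
    val proxyUgi = UserGroupInformation.createProxyUser(
      proxyUser, UserGroupInformation.getCurrentUser)
    proxyUgi.doAs(new PrivilegedExceptionAction[T] {
      override def run(): T = body
    })
  }
}
```

The real user must be allowed to impersonate the proxy user by the cluster's `hadoop.proxyuser.*` configuration; otherwise the `doAs` work fails with an authorization error.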
Some description of the testing I did:
Test build #26861 has finished for PR 4405 at commit
Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
Test build #26930 has finished for PR 4405 at commit
In this case, you are only running SparkSubmit as the proxy user. Should we not have the executor code also run as the proxy user, so any writes from the app to HDFS show the proxy user - or is that not the intent?
That should already be handled.
Oh, I didn't know YARN already takes care of running the container as the requesting user. +1.
Conflicts: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
Test build #26979 has finished for PR 4405 at commit
Hi @chesterxgchen. Assumptions 1 and 3 are wrong: the code doesn't assume that. Proxy users work just fine without kerberos as long as the configuration allows it (in fact I just tested it). You may argue that it's useless in non-kerberos mode (where you can just use …). Errors, just like any other error when submitting the app to Yarn, are reported as exceptions thrown by the Yarn client code. I assume that standalone and mesos have no notion of proxy users, so they wouldn't ever complain about it.
Delegation tokens are only needed with kerberos, which is currently only supported by Yarn, which, as you already noticed, handles them.
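(For background, a rough Scala sketch of what "getting delegation tokens" involves on the submitting side under kerberos; the renewer name is an assumption for illustration, and this is not code from this PR:)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.Credentials

object TokenSketch {
  // Collect HDFS delegation tokens into a Credentials object so the
  // application can access HDFS later without the user's kerberos TGT.
  def fetchHdfsTokens(conf: Configuration): Credentials = {
    val creds = new Credentials()
    // "yarn" as the token renewer is a placeholder; real deployments use
    // the configured RM principal.
    FileSystem.get(conf).addDelegationTokens("yarn", creds)
    creds
  }
}
```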
In our case, we did not use SparkSubmit, but used the Yarn Client directly. I don't understand why standalone mode or mesos mode wouldn't need a job delegation token; maybe you can elaborate on that a bit more. If you look at Oozie's implementation, you can see that before the MR job is submitted, the job delegation token is added to the JobClient's credentials, regardless of whether Yarn is used.

Another question related to the overall approach: this seems fine when calling SparkSubmit from the command line. The current user can be authenticated with kinit, and the proxy user can be impersonated via createProxyUser. The user who manages the Spark job submission is responsible for managing the kerberos TGT lifetime, renewal, etc. If the ticket expires, the user can re-run kinit or use a cron job to keep it from expiring. In this case, Spark merely creates the proxy user.

For an application (for example, a program that submits the Spark job directly, not from the command line), this approach doesn't seem to help much. Such an application can call createProxyUser in its own code instead of letting Spark do it; it already does the kerberos login (UserGroupInformation.loginUserFromKeytab), renews via UserGroupInformation.checkTGTAndReloginFromKeytab, handles ticket expiration, adds job tokens, etc. So is this approach only intended for command line use? Does it make sense to push more logic into Spark, or does this logic not belong in Spark? Thanks
Both. In kerberos mode you have to be logged in before you submit the app, but that was true before this change as well. If you're not logged in, you can't submit. So this change is not changing any assumptions.
Because they don't need it. They don't work with kerberos, and you don't need delegation tokens without kerberos.
Not sure how that's related to Spark. Spark gets the needed delegation tokens; there's nothing else to be done.
Both Oozie and Hive need to fork external processes to run Spark. When those processes fork, they'll be running with the kerberos credentials of the user running those services, not as the "proxy user". So the forked process needs to know which user to impersonate.
Yes, this is only for command line use (or, in other words, running Spark as a separate process). Anything else would be a lot more complicated and probably a much larger project that is really not needed for the use case at hand (Hive and, eventually, Oozie).
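(To make the intended use concrete, a minimal sketch of a service forking `spark-submit` with the new flag; the user name, main class, and jar path are placeholders:)

```scala
import scala.sys.process._

object ForkSketch {
  // A service such as Hive or Oozie forks spark-submit, asking it to
  // impersonate the user who made the request. Returns the exit code.
  def submitAs(requestingUser: String): Int =
    Seq("spark-submit",
        "--proxy-user", requestingUser, // the new option from this PR
        "--class", "com.example.App",   // hypothetical main class
        "/path/to/app.jar").!           // hypothetical application jar
}
```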
Thanks for the detailed reply.
You answered the above as though they are separate questions; actually, Spark submitting a job is just like other applications, such as Oozie, submitting an MR job. I did not get a clear picture from your answer: is Spark standalone mode …
If you forked the process, of course, that's a different story. Thanks for clarifying what the PR is intended for. Chester
```diff
@@ -35,7 +35,8 @@ function gatherSparkSubmitOpts() {
     --master | --deploy-mode | --class | --name | --jars | --packages | --py-files | --files | \
     --conf | --repositories | --properties-file | --driver-memory | --driver-java-options | \
     --driver-library-path | --driver-class-path | --executor-memory | --driver-cores | \
-    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives)
+    --total-executor-cores | --executor-cores | --queue | --num-executors | --archives |
```
Does this need a `\` similar to earlier lines?
Hey @vanzin, I made two comments but overall this looks good. One question - in cluster mode, does this …
Hi @chesterxgchen,
Correct. Currently, there's no point in talking about kerberos for anything but Yarn.
That is obviously not covered by this patch. For that application, it would need to do the impersonation itself before creating a SparkContext (or doing whatever it does to launch Spark). Well, it could potentially call …
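(A minimal sketch of what such an embedding application would do on its own; the principal, keytab path, proxied user, and master are placeholders:)

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedSketch {
  def main(args: Array[String]): Unit = {
    // Log in from a keytab, then impersonate the requesting user and
    // create the SparkContext inside doAs so Spark runs as that user.
    UserGroupInformation.loginUserFromKeytab(
      "service/host@EXAMPLE.COM", "/etc/security/service.keytab")
    val ugi = UserGroupInformation.createProxyUser(
      "alice", UserGroupInformation.getLoginUser)
    val sc = ugi.doAs(new PrivilegedExceptionAction[SparkContext] {
      override def run(): SparkContext =
        new SparkContext(new SparkConf()
          .setAppName("proxy-user-demo")
          .setMaster("yarn-client")) // placeholder master
    })
    sc.stop()
  }
}
```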
Hi @pwendell,
The submission is run as the proxy user. So when you start the "cluster mode" app, it will be started as that proxy user.
Needed some extra code to handle some special exceptions that the new code may hit, since they seem to confuse the JVM.
Test build #27233 has finished for PR 4405 at commit
Test build #27234 has finished for PR 4405 at commit
LGTM - thanks @vanzin I'll pull it in!
Hadoop has a feature that allows users to impersonate other users when submitting applications or talking to HDFS, for example. These impersonated users are generally referred to as "proxy users". Services such as Oozie or Hive use this feature to run applications as the requesting user. This change makes SparkSubmit accept a new command line option to run the application as a proxy user. It also fixes the plumbing of the user name through the UI (and a couple of other places) to refer to the correct user running the application, which can be different than `sys.props("user.name")` even without proxies (e.g. when using kerberos).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #4405 from vanzin/SPARK-5493 and squashes the following commits:

df82427 [Marcelo Vanzin] Clarify the reason for the special exception handling.
05bfc08 [Marcelo Vanzin] Remove unneeded annotation.
4840de9 [Marcelo Vanzin] Review feedback.
8af06ff [Marcelo Vanzin] Fix usage string.
2e4fa8f [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
b6c947d [Marcelo Vanzin] Merge branch 'master' into SPARK-5493
0540d38 [Marcelo Vanzin] [SPARK-5493] [core] Add option to impersonate user.

(cherry picked from commit ed167e7)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
What do we mean by "plumbing of the user name through the UI"?
I have a few doubts about running in client mode and cluster mode. In client mode I use the following command:
This works fine. In cluster mode I use the following command (no kinit done and no TGT present in the cache):
That also works fine. But when I use the following command in cluster mode (no kinit done and no TGT present in the cache),
it throws the following error:
I guess in cluster mode spark-submit does not look for a TGT on the client machine... it transfers the "keytab" file to the cluster and then starts the Spark job. So why does specifying the "--proxy-user" option look for a TGT when submitting in "yarn-cluster" mode? Am I doing something wrong?
@hemshankar Please don't use github to ask questions / point out possible issues. See http://spark.apache.org/community.html.