[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs #2676

Closed

6 changes: 5 additions & 1 deletion core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
kClass: Class[K],
vClass: Class[V],
conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
// mapreduce.Job (NewHadoopJob) merges any credentials for you.
Contributor

This comment isn't very clear to me. What is it trying to say?

Contributor Author

This was just saying that the call to "new NewHadoopJob" adds the credentials to the conf passed in for you, so we don't need an explicit call to SparkHadoopUtil.get.addCredentials(jconf).

I can try to rephrase.

Contributor

How about this:

The call to "new NewHadoopJob" automatically adds security credentials to conf, so we don't need to explicitly add them ourselves.

Contributor Author

Sure, I can change the comment.
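
A minimal sketch of the behavior this thread is describing, assuming the standard Hadoop mapreduce.Job semantics (an assumption about Hadoop internals, not something shown in this diff): constructing the job wraps the passed-in conf and merges the calling user's credentials into it, which is why the newAPIHadoopFile path shown in the diff lines just below does not need an explicit SparkHadoopUtil.get.addCredentials call.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{Job => NewHadoopJob}

// Sketch only: assumes the Job constructor merges the current user's
// credentials into the job's configuration, as discussed above.
def confWithCredentials(conf: Configuration): Configuration = {
  val job = new NewHadoopJob(conf)
  job.getConfiguration // already carries the merged credentials
}
```
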

val job = new NewHadoopJob(conf)
NewFileInputFormat.addInputPath(job, new Path(path))
val updatedConf = job.getConfiguration
@@ -661,7 +662,10 @@ class SparkContext(config: SparkConf) extends Logging {
fClass: Class[F],
kClass: Class[K],
vClass: Class[V]): RDD[(K, V)] = {
new NewHadoopRDD(this, fClass, kClass, vClass, conf)
// Add necessary security credentials to the JobConf. Required to access secure HDFS.
val jconf = new JobConf(conf)
SparkHadoopUtil.get.addCredentials(jconf)
new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
Contributor

In hadoopFile we actually mutate the existing conf. Should we modify this one or that one to be more consistent? It does seem more sensible here to duplicate it and then modify the copy.

Contributor Author

Sorry, I don't see what you are referring to in hadoopFile. Do you mean hadoopRDD, which already takes a JobConf?

We need a JobConf for the addCredentials routine to work. I could look at adding the credentials another way, but the nice thing about creating another JobConf is that if the user passes in a JobConf instance, creating a new one transfers any credentials from that one as well.
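
A hedged usage sketch of the changed newAPIHadoopRDD path described above (the app name, variable names, and the omitted input-path setup are illustrative, not from this PR): the caller's conf, possibly a JobConf that already carries delegation tokens, is copied into a fresh JobConf, credentials are added to the copy, and the caller's object is left unmodified.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("secure-hdfs-read"))

// Caller-supplied JobConf; on a secure cluster it may already hold HDFS
// delegation tokens in callerConf.getCredentials.
val callerConf = new JobConf(sc.hadoopConfiguration)
// (input path setup omitted for brevity)

// With this change, newAPIHadoopRDD copies callerConf into a new JobConf,
// carries over any credentials it holds, and adds the current user's
// credentials, so callerConf itself is never mutated.
val rdd = sc.newAPIHadoopRDD(
  callerConf,
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text])
```
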

}

/** Get an RDD for a Hadoop SequenceFile with given key and value types.