
Update to latest Hadoop 3.3.6 #1937

Closed
1 task done
mikev opened this issue Jul 6, 2023 · 10 comments · Fixed by #2075
Labels
tag:Upstream A problem with one of the upstream packages installed in the docker images type:Bug A problem with the definition of one of the docker images maintained here

Comments

@mikev

mikev commented Jul 6, 2023

What docker image(s) are you using?

all-spark-notebook

Host OS system and architecture running docker image

Ubuntu 22.04

What Docker command are you running?

docker run -it -p 8888:8888 --user root -e GRANT_SUDO=yes -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook:spark-3.4.1

How to Reproduce the problem?

Visit localhost:8888

Open Terminal from Launcher

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
(base) jovyan@745e84c0ed21:/home$

Command output

No response

Expected behavior

Expect to see hadoop-client-api-3.3.6.jar. Hadoop should be updated to the latest release, which is 3.3.6 or greater.

Actual behavior

Although Spark is at version 3.4.1, the Hadoop libraries are still at 3.3.4:

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
(base) jovyan@745e84c0ed21:/home$

Anything else?

Our project uses AWS S3 and requires the requester-pays header on all S3 requests. This issue was described and fixed in Hadoop 3.3.5.

https://issues.apache.org/jira/browse/HADOOP-14661
The patch is here:
https://issues.apache.org/jira/secure/attachment/12877218/HADOOP-14661.patch

Per the patch, we're required to set "fs.s3a.requester-pays.enabled" to "true".
This fix shipped in hadoop-aws 3.3.5, released on Mar 27, 2023.
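For reference, once a hadoop-aws build containing HADOOP-14661 is on the classpath, the flag can be set via spark-defaults.conf; a minimal sketch (the property name comes from the patch above, the "spark.hadoop." prefix passes it through to Hadoop):

```
# spark-defaults.conf — Hadoop/S3A options are forwarded with the
# "spark.hadoop." prefix; requires hadoop-aws >= 3.3.5 on the classpath
spark.hadoop.fs.s3a.requester-pays.enabled  true
```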

I've tried to upgrade Hadoop in various ways, and it still doesn't work. I finally noticed that my Hadoop is fixed at version 3.3.4; somehow I can't upgrade to 3.3.5. However, Hadoop 3.3.5 was released very recently, so maybe something extra is needed to get the upgrade into Jupyter.

Latest Docker version

  • I've updated my Docker version to the latest available, and the issue still persists
@mikev mikev added the type:Bug A problem with the definition of one of the docker images maintained here label Jul 6, 2023
@mikev
Author

mikev commented Jul 6, 2023

Also, you are supposed to be able to specify the Hadoop version when building the image, per the image-specifics instructions:
https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

This also failed to set Hadoop to version 3.3.6

@mikev
Author

mikev commented Jul 6, 2023

It appears that Hadoop is bundled with Spark, so this is likely not a Jupyter build issue. In other words, Hadoop 3.3.4 is bundled with Spark 3.4.1:

michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ find . -name "hadoop*"
./jars/hadoop-client-api-3.3.4.jar
./jars/hadoop-client-runtime-3.3.4.jar
./jars/hadoop-shaded-guava-1.1.1.jar
./jars/hadoop-yarn-server-web-proxy-3.3.4.jar
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ cd ..
michael@PC:/mnt/c/Users/mvier/code/helium$ ls spark-3.4.1-bin-hadoop3.tgz
spark-3.4.1-bin-hadoop3.tgz
michael@PC:/mnt/c/Users/mvier/code/helium$ wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

@mikev
Author

mikev commented Jul 6, 2023

3.3.6 was added to the Spark build files last week:
https://github.com/apache/spark/blob/f6e0b3906d533ab719f2423bd136d79215bfa315/pom.xml#L125

It appears we just need to wait for the next Spark 3.4.2 release, which will include Hadoop 3.3.6.

@mikev
Author

mikev commented Jul 6, 2023

Before this issue is closed, I'm wondering why --build-arg hadoop_version=3.3.6 has no effect.

Per the specifics doc, you are supposed to be able to specify the Hadoop version when building the image:
https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

Is there a work-around to configure a different Hadoop version?

@mikev
Author

mikev commented Jul 7, 2023

Recap:

I attempted to dynamically update Hadoop to 3.3.6 via three methods.

One:
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.6"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()

Two:
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

Three:
Open a Jupyter terminal and run: pip3 install aws-hadoop

None of the methods worked.
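For anyone retrying method one, a sketch of the configuration involved (the helper function is hypothetical and untested against these images; note that hadoop-aws generally has to match the hadoop-client-* jars already bundled with Spark, which are 3.3.4 in spark-3.4.1-bin-hadoop3):

```python
# Hypothetical helper: assemble the session options needed for S3 requester-pays.
# The hadoop-aws version should match the bundled hadoop-client-* jars,
# otherwise class/version mismatches can occur at runtime.
def s3a_requester_pays_conf(hadoop_aws_version):
    return {
        # Pulls the S3A connector when the session starts
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Property introduced by HADOOP-14661
        "spark.hadoop.fs.s3a.requester-pays.enabled": "true",
    }

# Usage with a SparkSession builder (requires pyspark):
# builder = SparkSession.builder
# for key, value in s3a_requester_pays_conf("3.3.4").items():
#     builder = builder.config(key, value)
```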

@mathbunnyru
Member

> Also you are supposed to be able to specify the Hadoop version when launching the image per the image specifics instructions: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html
>
> docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
>
> This also failed to set Hadoop to version 3.3.6

You need to build jupyter/pyspark-notebook first; that's where Spark is actually installed.

@mathbunnyru
Member

Overall, you're right, and we're only using the bundled Hadoop.
So, we'll have to wait for an upstream release.

@mathbunnyru mathbunnyru added the tag:Upstream A problem with one of the upstream packages installed in the docker images label Jul 7, 2023
@bjornjorgensen
Contributor

Yes, Hadoop is bundled in Apache Spark.

Apache Spark 3.5.0 will soon start its RC process:
https://lists.apache.org/thread/z27z5nkzch66plpw88dkbmpt8gdlq044

@bjornjorgensen
Contributor

> docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
>
> This also failed to set Hadoop to version 3.3.6

That build arg was for choosing between Hadoop version 2 and version 3.

@bjornjorgensen
Contributor

There are some problems with Hadoop 3.3.6: apache/hadoop#5706

https://lists.apache.org/thread/o7ockmppo5yqk2cm7f1kvo7plfgx6xnc

@mathbunnyru mathbunnyru changed the title [BUG] - Update to latest Hadoop 3.3.6 Update to latest Hadoop 3.3.6 Sep 10, 2023