
Update to latest Hadoop 3.3.6 #1937

Closed
1 task done
mikev opened this issue Jul 6, 2023 · 10 comments · Fixed by #2075
Labels
tag:Upstream A problem with one of the upstream packages installed in the docker images type:Bug A problem with the definition of one of the docker images maintained here

Comments

@mikev

mikev commented Jul 6, 2023

What docker image(s) are you using?

all-spark-notebook

Host OS system and architecture running docker image

Ubuntu 22.04

What Docker command are you running?

docker run -it -p 8888:8888 --user root -e GRANT_SUDO=yes -v $(pwd):/home/jovyan/work jupyter/all-spark-notebook:spark-3.4.1

How to Reproduce the problem?

Visit localhost:8888

Open Terminal from Launcher

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
(base) jovyan@745e84c0ed21:/home$

Command output

No response

Expected behavior

Expect to see hadoop-client-api-3.3.6.jar. Hadoop should be updated to the latest release, which is 3.3.6 or greater.

Actual behavior

Although Spark is at version 3.4.1, the Hadoop libraries are still at 3.3.4:

(base) jovyan@745e84c0ed21:/home$ find /usr/local/spark-3.4.1-bin-hadoop3/ -name "hadoop*"
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-yarn-server-web-proxy-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-shaded-guava-1.1.1.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-runtime-3.3.4.jar
/usr/local/spark-3.4.1-bin-hadoop3/jars/hadoop-client-api-3.3.4.jar
(base) jovyan@745e84c0ed21:/home$

Anything else?

Our project uses AWS S3 and requires the requester-pays header on all S3 requests. This issue was described and fixed in Hadoop 3.3.5.

https://issues.apache.org/jira/browse/HADOOP-14661
The patch is here:
https://issues.apache.org/jira/secure/attachment/12877218/HADOOP-14661.patch

Per the patch, we're required to set "fs.s3a.requester-pays.enabled" to "true".
This fix shipped in hadoop-aws 3.3.5, released on Mar 27, 2023.
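For reference, once a hadoop-aws build containing HADOOP-14661 is on the classpath, the flag can be set via spark-defaults.conf; a minimal sketch (the property name comes from the patch above, the "spark.hadoop." prefix passes it through to Hadoop):

```
# spark-defaults.conf — Hadoop/S3A options are forwarded with the
# "spark.hadoop." prefix; requires hadoop-aws >= 3.3.5 on the classpath
spark.hadoop.fs.s3a.requester-pays.enabled  true
```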

I've tried to upgrade Hadoop in various ways, and it still doesn't work. I finally noticed that my Hadoop is fixed at version 3.3.4; somehow I can't upgrade to 3.3.5. However, Hadoop 3.3.5 was released very recently, so maybe something extra is needed to get the upgrade into Jupyter.

Latest Docker version

  • I've updated my Docker version to the latest available, and the issue still persists
@mikev mikev added the type:Bug A problem with the definition of one of the docker images maintained here label Jul 6, 2023
@mikev
Author

mikev commented Jul 6, 2023

Also, you are supposed to be able to specify the Hadoop version when building the image, per the image-specifics instructions:
https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

This also failed to set Hadoop to version 3.3.6

@mikev
Author

mikev commented Jul 6, 2023

It appears that Hadoop is bundled with Spark, so this is likely not a Jupyter build issue. In other words, Hadoop 3.3.4 is bundled with Spark 3.4.1:

michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ find . -name "hadoop*"
./jars/hadoop-client-api-3.3.4.jar
./jars/hadoop-client-runtime-3.3.4.jar
./jars/hadoop-shaded-guava-1.1.1.jar
./jars/hadoop-yarn-server-web-proxy-3.3.4.jar
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$
michael@PC:/mnt/c/Users/mvier/code/helium/spark-3.4.1-bin-hadoop3$ cd ..
michael@PC:/mnt/c/Users/mvier/code/helium$ ls spark-3.4.1-bin-hadoop3.tgz
spark-3.4.1-bin-hadoop3.tgz
michael@PC:/mnt/c/Users/mvier/code/helium$ wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

@mikev
Author

mikev commented Jul 6, 2023

3.3.6 was added to the Spark build files last week:
https://github.com/apache/spark/blob/f6e0b3906d533ab719f2423bd136d79215bfa315/pom.xml#L125

It appears we just need to wait for the next Spark 3.4.2 release, which will include Hadoop 3.3.6.

@mikev
Author

mikev commented Jul 6, 2023

Before this issue is closed, I'm wondering why --build-arg hadoop_version=3.3.6 has no effect.

Per the specifics doc, you are supposed to be able to specify the Hadoop version when building the image:
https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html

Is there a work-around to configure a different Hadoop version?

@mikev
Author

mikev commented Jul 7, 2023

Recap:

I attempted to dynamically update Hadoop to 3.3.6 via three methods.

One:
my_packages = ["org.apache.hadoop:hadoop-aws:3.3.6"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()

Two:
docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6

Three:
Open a Jupyter terminal and run: pip3 install aws-hadoop

None of the methods worked.
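For anyone retrying method one, a sketch of the configuration involved (the helper function is hypothetical and untested against these images; note that hadoop-aws generally has to match the hadoop-client-* jars already bundled with Spark, which are 3.3.4 in spark-3.4.1-bin-hadoop3):

```python
# Hypothetical helper: assemble the session options needed for S3 requester-pays.
# The hadoop-aws version should match the bundled hadoop-client-* jars,
# otherwise class/version mismatches can occur at runtime.
def s3a_requester_pays_conf(hadoop_aws_version):
    return {
        # Pulls the S3A connector when the session starts
        "spark.jars.packages": f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Property introduced by HADOOP-14661
        "spark.hadoop.fs.s3a.requester-pays.enabled": "true",
    }

# Usage with a SparkSession builder (requires pyspark):
# builder = SparkSession.builder
# for key, value in s3a_requester_pays_conf("3.3.4").items():
#     builder = builder.config(key, value)
```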

@mathbunnyru
Member

> Also you are supposed to be able to specify the Hadoop version when launching the image per the image specifics instructions: https://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html
>
> docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
>
> This also failed to set Hadoop to version 3.3.6

You need to build jupyter/pyspark-notebook first; that's where Spark is actually installed.

@mathbunnyru
Member

Overall, you're right, and we're only using the bundled Hadoop.
So, we'll have to wait for an upstream release.

@mathbunnyru mathbunnyru added the tag:Upstream A problem with one of the upstream packages installed in the docker images label Jul 7, 2023
@bjornjorgensen
Contributor

Yes, Hadoop is bundled in Apache Spark.

Apache Spark 3.5.0 will soon start its RC process:
https://lists.apache.org/thread/z27z5nkzch66plpw88dkbmpt8gdlq044

@bjornjorgensen
Contributor

> docker build --rm --force-rm -t jupyter/all-spark-notebook:spark-3.4.1 . --build-arg hadoop_version=3.3.6
>
> This also failed to set Hadoop to version 3.3.6

That build arg was for choosing between Hadoop version 2 and version 3.

@bjornjorgensen
Contributor

There are some problems with Hadoop 3.3.6: apache/hadoop#5706

https://lists.apache.org/thread/o7ockmppo5yqk2cm7f1kvo7plfgx6xnc

@mathbunnyru mathbunnyru changed the title [BUG] - Update to latest Hadoop 3.3.6 Update to latest Hadoop 3.3.6 Sep 10, 2023