Describe the bug
It seems that the File/Archive functionality in Pig (both gcloud dataproc jobs submit pig and DataprocPigOperator) doesn't work.

To Reproduce
Details can be found here, but pasting them below as well for easier access:
===
On 18-19.06.2019, Tomek and I (Szymon) were working on adding file/archive support to the Shell mapper.
It was supposed to be a simple task; however, it turned out to be anything but.
Approach 1
First of all, we run a shell command using the gcloud dataproc jobs submit pig ... --execute ... command.
We looked at the API and found the following flag:
--properties=[PROPERTY=VALUE,…]
A list of key value pairs to configure Pig.
Using it, we tried setting the appropriate configuration properties to pass the files/archives, based on this SO answer.
The result was something like:
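The exact invocation isn't preserved in this write-up; a sketch of the kind of command we tried looks like this, where the cluster name and HDFS path are placeholders and the property names are borrowed from the SET commands shown further below (so this is an assumption about the exact properties we passed):

gcloud dataproc jobs submit pig --cluster=<cluster-name> \
  --properties=mapred.create.symlink=yes,mapred.cache.file=hdfs:///path/to/myxml.xml#myxml.xml \
  --execute='sh cat myxml.xml'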
Unfortunately, it didn't work. The file myxml.xml was not present in the working directory of the job and the cat command was failing.

Approach 2
Knowing that we have a Pig mapper which produces a DAG with a DataprocPigOperator task running a Pig script, and that it already handles file/archive, we tried to explore this path.
We found that a .pig script can successfully run a bash command, e.g.:
script.pig
sh ls -al
The file/archive functionality in a Pig script is handled by modifying the script and adding a few SET commands on top, e.g.:
set mapred.create.symlink yes;
set mapred.cache.file hdfs:///user/szymon/examples/apps/pig/test_dir/test.txt#test_link.txt,hdfs:///user/szymon/examples/apps/pig/test_dir/test2.zip#test_link.zip;
set mapred.cache.archives hdfs:///user/szymon/examples/apps/pig/test_dir/test2.zip#test_zip_dir,hdfs:///user/szymon/examples/apps/pig/test_dir/test3.zip#test3_zip_dir,hdfs:///user/szymon/examples/apps/pig/test_dir/testcopy.zip#testcopy_zip_dir;
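Below those SET commands, the script exercised the cache with shell commands along these lines (a reconstruction, not a verbatim copy of what we ran; test_link.txt is the symlink name declared in mapred.cache.file above):

sh ls -al
sh cat test_link.txt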
We ran this on Dataproc and there are a few observations:
When the HDFS URI points to a non-existent archive, there is an error and the job fails
When the HDFS URI points to a non-existent file, no error is thrown and the job completes
If we try to sh cat ... the referenced file or a file from a referenced archive (which should be unarchived and present), there is an error - file not found
A sh ls -al only showed the script.pig file

Additional actions
When printing sh pwd we found out that the job is actually executed from a /tmp/{job-hash} directory.
It is removed after the job has completed, so it cannot be inspected afterwards. We placed an sh sleep 1000 inside the Pig script to inspect the directory at runtime, but didn't find the file/archive resources inside.

Conclusions
We didn't manage to find a way to add file/archive functionality to the Shell mapper.
Moreover, it seems that it doesn't work correctly for the Pig mapper either.
We've decided to abandon this problem for now and return to it later.
Things to resolve:
Find out where exactly the file/archive resources should be stored in the local cache (local filesystem)
Find out if they are stored there in the case of the gcloud dataproc jobs submit pig command as well as the DataprocPigOperator (which probably uses the same command underneath)
Check (somehow) if the file/archive functionality works for Pig at all - possibly by creating a Pig script which makes use of the symlinked file from the local cache (see the sketch below)
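For that last check, a possible sketch (untested; it assumes Pig's streaming DEFINE ... CACHE clause is the right way to consume the distributed-cache symlink, and it reuses the test paths from above):

-- cat the symlinked file, then pass stdin through, so the job fails loudly if the symlink is missing
DEFINE CHECK_CMD `cat test_link.txt -` CACHE('/user/szymon/examples/apps/pig/test_dir/test.txt#test_link.txt');
lines = LOAD '/user/szymon/examples/apps/pig/test_dir/test.txt' AS (line:chararray);
checked = STREAM lines THROUGH CHECK_CMD;
DUMP checked;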