
Efficiently Copying Files into HDFS


GATK Spark tools run faster when they read their input files from HDFS rather than directly from a Google Cloud Storage (GCS) bucket. Copying a file from GCS into HDFS is not free, however, so this optimization matters most for persistent clusters where the same data will be processed multiple times by GATK Spark tools. Under some conditions it can even be faster to download inputs into HDFS and run against them than to rely on the GCS adapter, as explored in this pull request. To copy an input into HDFS efficiently, the following command should suffice:

$ hadoop distcp gs://my/gcs/path.file hdfs:///my/hdfs/path.file

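distcp runs as a MapReduce job and copies each whole file with a single map task, so when staging a directory of several inputs you can raise the copy parallelism with the standard -m option. The paths below are placeholders for your own bucket and HDFS destination:

$ hadoop distcp -m 16 gs://my/gcs/dir hdfs:///my/hdfs/dir
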
The GATK tool ParallelCopyGCSDirectoryIntoHDFSSpark may perform even better, since it can split large files into blocks and copy them in parallel:

gatk-launch ParallelCopyGCSDirectoryIntoHDFSSpark \
    --inputGCSPath gs://my/gcs/path \
    --outputHDFSDirectory hdfs:///my/hdfs/path
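
Once the copy completes, you can confirm that the data landed with hdfs dfs -ls and then point the Spark tool's input and output arguments at hdfs:// paths instead of gs:// URIs. The tool and file names below are placeholders, shown only to sketch the idea; check the arguments of the tool you actually run:

$ hdfs dfs -ls hdfs:///my/hdfs/path
$ gatk-launch MarkDuplicatesSpark \
    -I hdfs:///my/hdfs/path/sample.bam \
    -O hdfs:///my/hdfs/path/sample.markdup.bam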