
How to read glob of multiple parquet Genotype #1179

Closed · jpdna opened this issue Sep 20, 2016 · 2 comments

@jpdna (Member) commented Sep 20, 2016

Is it possible to load the data of several Parquet files storing bdg-formats Genotype records into a single GenotypeRDD?
I want the equivalent of the sc.loadVcf("*.vcf") glob, but for ADAM Parquet files.
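
Roughly, I am after something like this (a sketch only: loadParquetGenotypes is the ADAMContext loader I would assume matches, and whether the glob argument works is the open question):

import org.bdgenomics.adam.rdd.ADAMContext._

// sc: the SparkContext, with the ADAMContext implicits in scope.
// Works today: glob a set of single-sample VCFs into one VariantContextRDD.
val vcs = sc.loadVcf("*.vcf")

// Desired: the same kind of glob over a set of ADAM Parquet Genotype directories.
val gts = sc.loadParquetGenotypes("*.adam")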

See discussions below from email:

On Tue, Sep 20, 2016 at 11:19 AM, Justin Paschall justinpaschalldna@gmail.com wrote:
For a set of single-sample VCFs, Michael pointed out that sc.loadVcf("*.vcf") can read a set of single-sample VCFs into the single RDD[VariantContext] of a VariantContextRDD.

Similarly, does it make sense to convert single-sample VCFs to single-sample Parquet files of Genotype records as a reference data source?

However, I am not sure there is a way to glob-load a set of many individual ADAM Parquet Genotype files, in the same way that we can load a bunch of VCFs with a glob.

It is tricky to pass the globs in correctly without having them expanded by the shell; it appears that even my usual tricks don't work for .adam files:

$ ./bin/adam-submit transform "*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching *.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching *.adam
at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)

$ ./bin/adam-submit transform "file:///pwd/*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)

Will have to dig into the new code changes (getFsAndFilesWithFilter) to have a look.

The VCF glob works, but it is slower than I hoped: about 10 minutes to load and count 100 samples of chr22 on my workstation (I will test on a cluster shortly). I want to make sure I am not missing out on a prior one-time, per-sample-file conversion to Parquet, if that would be faster.
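
For reference, a sketch of what that one-time conversion might look like programmatically (equivalent to the vcf2adam CLI used below; the loadGenotypes/saveAsParquet calls are my assumption of the matching ADAMContext API):

import org.bdgenomics.adam.rdd.ADAMContext._

// One-time, per-sample conversion: read each single-sample VCF as Genotypes
// and persist it as an ADAM Parquet directory (what vcf2adam does on the CLI).
val samples = Seq("sample1", "sample2")
samples.foreach { s =>
  sc.loadGenotypes(s + ".vcf").saveAsParquet(s + ".adam")
}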

@fnothaft (Member) commented Sep 20, 2016

Read the unit tests ;)

I jest. This works, but it's not obvious. Specifically, I created two Parquet files:

fnothaft$ ./bin/adam-submit vcf2adam adam-core/src/test/resources/small.vcf small.1.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:32:53 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:32:55 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
fnothaft$ ./bin/adam-submit vcf2adam adam-core/src/test/resources/small.vcf small.2.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:33:02 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:33:04 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
fnothaft$ 

Then I globbed them like so:


fnothaft$ ./bin/adam-submit adam2vcf "small.*.adam/*" small.vcf
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:36:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:36:49 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
Command body threw exception:
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
    at htsjdk.variant.vcf.VCFHeader.<init>(VCFHeader.java:142)
    at org.bdgenomics.adam.rdd.variation.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:145)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:86)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:62)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:132)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:72)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Sep 20, 2016 9:36:51 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
zIRISz:adam fnothaft$

This isn't a great example (since it correctly throws an error for the duplicate sample names), but it shows how to do the glob. Specifically, you need to glob inside the directories that you're globbing. Long story short, Hadoop treats globs on files and on directories differently; if you want to dig into the implementation details, grep for globStatus vs. listStatus.
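
For illustration, a minimal sketch of that difference using the Hadoop FileSystem API directly (the paths are the ones from the example above; the comments restate the explanation, not the ADAM internals):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Globbing the directory names matches the .adam directories themselves...
val dirs = fs.globStatus(new Path("small.*.adam"))

// ...while globbing inside them matches the Parquet part-files, which is
// what the loader actually needs to see.
val parts = fs.globStatus(new Path("small.*.adam/*"))
parts.foreach(status => println(status.getPath))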

That being said, since this isn't obvious, I'll update the exception to include a hint.

@heuermh (Member) commented Sep 20, 2016

Thanks for the clarification; with the example above, this now works for me:

$ ./bin/adam-submit transform "*.adam/*" -single combined.sam
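
In case it helps, the programmatic equivalent would be roughly this (a sketch; I am assuming the asSingleFile parameter of saveAsSam mirrors the -single flag):

import org.bdgenomics.adam.rdd.ADAMContext._

// Glob across the part-files inside each .adam directory, then write one SAM file.
val reads = sc.loadAlignments("*.adam/*")
reads.saveAsSam("combined.sam", asSingleFile = true)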
