
How to read glob of multiple parquet Genotype #1179

Closed · jpdna opened this issue Sep 20, 2016 · 2 comments

@jpdna (Member) commented Sep 20, 2016

Is it possible to load the data of several Parquet files storing bdg-formats Genotype records into a single GenotypeRDD?
I want the equivalent of the sc.loadVcf("*.vcf") glob, but for ADAM Parquet files.
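
Roughly, I am after something like this (a sketch only: loadParquetGenotypes is the ADAMContext loader I would assume matches, and whether the glob argument works is the open question):

import org.bdgenomics.adam.rdd.ADAMContext._

// sc: the SparkContext, with the ADAMContext implicits in scope.
// Works today: glob a set of single-sample VCFs into one VariantContextRDD.
val vcs = sc.loadVcf("*.vcf")

// Desired: the same kind of glob over a set of ADAM Parquet Genotype directories.
val gts = sc.loadParquetGenotypes("*.adam")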

See discussions below from email:

On Tue, Sep 20, 2016 at 11:19 AM, Justin Paschall justinpaschalldna@gmail.com wrote:
For a set of single-sample VCFs, Michael pointed out that sc.loadVcf("*.vcf") can read a set of single-sample VCFs into the single RDD[VariantContext] of a VariantContextRDD.

Similarly, does it make sense to convert single-sample VCFs to single-sample Parquet files of Genotype records as a reference data source?

However, I am not sure there is a way to glob-load a set of many individual ADAM Parquet Genotype files, in the same way that we can load a bunch of VCFs with a glob.

It is tricky to pass the globs in correctly without having them expanded by the shell; it appears that even my usual tricks don't work for .adam files:

$ ./bin/adam-submit transform "*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching *.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching *.adam
at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)

$ ./bin/adam-submit transform "file:///pwd/*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)

Will have to dig into the new code changes (getFsAndFilesWithFilter) to have a look.

The VCF glob works, but it is slower than I hoped: about 10 minutes to load and count 100 samples of chr22 on my workstation (I will test on a cluster shortly). I want to make sure I am not missing out on a prior one-time, per-sample-file conversion to Parquet, if that would be faster.
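
For reference, a sketch of what that one-time conversion might look like programmatically (equivalent to the vcf2adam CLI used below; the loadGenotypes/saveAsParquet calls are my assumption of the matching ADAMContext API):

import org.bdgenomics.adam.rdd.ADAMContext._

// One-time, per-sample conversion: read each single-sample VCF as Genotypes
// and persist it as an ADAM Parquet directory (what vcf2adam does on the CLI).
val samples = Seq("sample1", "sample2")
samples.foreach { s =>
  sc.loadGenotypes(s + ".vcf").saveAsParquet(s + ".adam")
}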

@fnothaft (Member) commented Sep 20, 2016

Read the unit tests ;)

I jest. This works, but it's not obvious. Specifically, I created two Parquet files:

fnothaft$ ./bin/adam-submit vcf2adam adam-core/src/test/resources/small.vcf small.1.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:32:53 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:32:55 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
fnothaft$ ./bin/adam-submit vcf2adam adam-core/src/test/resources/small.vcf small.2.adam
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:33:02 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:33:04 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
fnothaft$ 

Then I globbed them like so:


fnothaft$ ./bin/adam-submit adam2vcf "small.*.adam/*" small.vcf
Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-1.5.2//bin/spark-submit
2016-09-20 21:36:48 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-09-20 21:36:49 WARN  MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set.
Command body threw exception:
htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
    at htsjdk.variant.vcf.VCFHeader.<init>(VCFHeader.java:142)
    at org.bdgenomics.adam.rdd.variation.VariantContextRDD.saveAsVcf(VariantContextRDD.scala:145)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:86)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:55)
    at org.bdgenomics.adam.cli.ADAM2Vcf.run(ADAM2Vcf.scala:62)
    at org.bdgenomics.adam.cli.ADAMMain.apply(ADAMMain.scala:132)
    at org.bdgenomics.adam.cli.ADAMMain$.main(ADAMMain.scala:72)
    at org.bdgenomics.adam.cli.ADAMMain.main(ADAMMain.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Sep 20, 2016 9:36:51 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2
zIRISz:adam fnothaft$

This isn't a great example (since it correctly throws an error for the duplicate sample names), but it shows how to do the glob. Specifically, you need to glob inside the directories that you're globbing. Long story short, Hadoop treats globs on files and on directories differently; if you want to dig into the implementation details, grep for globStatus vs. listStatus.
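
For illustration, a minimal sketch of that difference using the Hadoop FileSystem API directly (the paths are the ones from the example above; the comments restate the explanation, not the ADAM internals):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())

// Globbing the directory names matches the .adam directories themselves...
val dirs = fs.globStatus(new Path("small.*.adam"))

// ...while globbing inside them matches the Parquet part-files, which is
// what the loader actually needs to see.
val parts = fs.globStatus(new Path("small.*.adam/*"))
parts.foreach(status => println(status.getPath))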

That being said, since this isn't obvious, I'll update the exception to include a hint.

@heuermh (Member) commented Sep 20, 2016

Thanks for the clarification; with the example above, this now works for me:

$ ./bin/adam-submit transform "*.adam/*" -single combined.sam
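
In case it helps, the programmatic equivalent would be roughly this (a sketch; I am assuming the asSingleFile parameter of saveAsSam mirrors the -single flag):

import org.bdgenomics.adam.rdd.ADAMContext._

// Glob across the part-files inside each .adam directory, then write one SAM file.
val reads = sc.loadAlignments("*.adam/*")
reads.saveAsSam("combined.sam", asSingleFile = true)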
