How to read glob of multiple parquet Genotype #1179
Read the unit tests ;) I jest. This works, but it's not obvious. Specifically, I created two Parquet files:
Then I globbed them like so:
This isn't a great example (since it correctly throws an error for the duplicate sample names), but it shows how to do the glob. Specifically, you need to glob inside the directories that you're globbing. Long story short, Hadoop treats globs on files and on directories differently; if you want to dig into the implementation details, grep the ADAM codebase. That being said, since this isn't obvious, I'll update the exception to include a hint.
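For illustration, the directory-vs-file glob behavior can be sketched with plain shell commands. The directory and file names below are hypothetical, mimicking how ADAM writes each `.adam` output as a directory of Parquet part files:

```shell
# Hypothetical layout: two ADAM Parquet genotype outputs, each a directory.
mkdir -p demo/sampleA.genotypes.adam demo/sampleB.genotypes.adam
touch demo/sampleA.genotypes.adam/part-r-00000.gz.parquet
touch demo/sampleB.genotypes.adam/part-r-00000.gz.parquet

# Globbing over the directories matches the directories themselves:
ls -d demo/*.adam
# Globbing *inside* the directories matches the Parquet part files:
ls demo/*.adam/*
```

The second pattern shape (`*.adam/*`) is the one that has to reach Hadoop when loading the Parquet data.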
Thanks for the clarification; with the example above, this now works for me.
Is it possible to load the data from several Parquet files storing bdg-formats Genotype records into a single GenotypeRDD?
I want something equivalent to the sc.loadVcf("*.vcf") glob, but for ADAM Parquet files.
See discussions below from email:
On Tue, Sep 20, 2016 at 11:19 AM, Justin Paschall justinpaschalldna@gmail.com wrote:
For a set of single-sample VCFs, Michael pointed out that sc.loadVcf("*.vcf") can read a bunch of single-sample VCFs into a single RDD[VariantContext] of a VariantContextRDD.
Similarly, does it make sense to convert single-sample VCFs to single-sample Parquet files of [Genotype] as a reference data source?
However, I am not sure there is a way to glob-load a set of many individual ADAM Parquet genotype files, in the same way that we can load a bunch of VCFs with a glob.
It is tricky to pass the globs in correctly without having them expanded by the shell; it appears that even my tricks don't work for .adam files:
```
$ ./bin/adam-submit transform "*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching *.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching *.adam
	at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
	at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
	at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)
```
```
$ ./bin/adam-submit transform "file:///`pwd`/*.adam" -single combined.sam
Command body threw exception:
java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
Exception in thread "main" java.io.FileNotFoundException: Couldn't find any files matching file:///Users/heuermh/working/adam/*.adam
	at org.bdgenomics.adam.rdd.ADAMContext.getFsAndFilesWithFilter(ADAMContext.scala:391)
	at org.bdgenomics.adam.rdd.ADAMContext.loadAvroSequences(ADAMContext.scala:207)
	at org.bdgenomics.adam.rdd.ADAMContext.loadParquetAlignments(ADAMContext.scala:637)
```
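The shell-quoting half of the pitfall can be demonstrated on its own (hypothetical file names; the quoting only controls whether the shell or Hadoop gets to expand the pattern, which is separate from how ADAM then matches it):

```shell
mkdir -p quoting-demo
cd quoting-demo
touch a.adam b.adam

# Unquoted, the shell expands the glob before the command ever runs:
printf '%s\n' *.adam
# Quoted, the literal pattern is passed through for Hadoop to expand:
printf '%s\n' "*.adam"
```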
Will have to dig into the new code changes (getFsAndFilesWithFilter) to look.
The VCF glob works, but it is slower than I hoped: 10 minutes to load and count 100 samples of chr22 on my workstation (testing on a cluster shortly). I want to make sure that I am not missing out on a one-time, per-sample-file conversion to Parquet, if that would be better.