How to filter genotype RDD with FeatureRDD #890
Hi @NeillGibson! You'll need to key each RDD with a ReferenceRegion, e.g.:
|
Hi @fnothaft. Thank you for the information. I needed to move some parentheses, but the command now starts to execute.
Tasks of the operation keep failing, though, and keep being resubmitted:
Is this by any chance a very resource-intensive operation? I am running this on 11 m3.xlarge machines. Or is there another likely cause for the tasks that keep failing and being resubmitted? In the end I just aborted the process. The annotation file that I am using is
(I removed the chr prefix from the chromosome names to match the chromosome names in the 1000 Genomes VCF file; the reference version should match.) |
It shouldn't be terribly expensive; however, you may want to try running the ShuffleRegionJoin instead of the BroadcastRegionJoin. I'll need to look over the GTF loading code, but I'm not sure off the top of my head whether that code maps each feature in a GTF to a |
Hi @fnothaft. Thank you for the information. I can try increasing the driver memory and using the ShuffleRegionJoin. The VCF file I am testing with is
Some searching online led me to this gene annotation file (the previous one mentioned is for an older genome build).
The basic thing I am trying to achieve is just to see how long it takes with ADAM to filter a large set of genotypes based on a gene model. It doesn't need to be this exact gene model, or a gene model this detailed. If you have a gene model which is compatible with the 1000 Genomes data, I would be happy to use that one.
I am only using chr22 because the full data set is much bigger. Queries with just chr22 already take some time, and I have other queries that I've tested on chr22. I'll also filter the gene model (GTF file) down to just the chr22 gene annotations before converting it to Feature format.
I will post my results here after testing with the correct genome build and the increased driver memory and/or the ShuffleRegionJoin method. |
Hi @fnothaft, increasing the driver memory from 8 to 14 GB (the maximum machine memory) did not help. I still get the same errors. I am now trying to use the
On what should I base the partitionSize? If I set this to 3 (an example that I found in the ADAM test for ShuffleRegionJoin) or 11 (the number of nodes), I get an out of memory error.
If I set partitionSize to 100, I get a lot of small tasks and processing takes a long time; I did not wait for it to finish. Do you maybe have an example somewhere, with a public VCF file and feature file of real sizes, that works, i.e. one that filters a genotypeRDD with BroadcastRegionJoin or ShuffleRegionJoin? I could then check whether I can run that example. Maybe the issue is just with my gene annotation file. |
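One possible reading of the partitionSize behaviour described above, offered as an assumption rather than something confirmed in this thread: if partitionSize is interpreted as the width in bases of each genomic bin (which would suit the tiny contigs used in unit tests), then values such as 3, 11, or 100 produce an enormous number of bins on a real chromosome. The sketch below only does the arithmetic. `BinCountSketch` and `binCount` are hypothetical names, not part of ADAM, and the chr22 length used is the GRCh37 value.

```scala
// Hypothetical helper, not part of ADAM: counts the fixed-width bins needed
// to tile a contig, ASSUMING partitionSize means "bases per bin".
object BinCountSketch {
  // Ceiling division: number of bins of width binWidth covering contigLength bases.
  def binCount(contigLength: Long, binWidth: Long): Long = {
    require(contigLength >= 0 && binWidth > 0, "length must be >= 0 and width > 0")
    (contigLength + binWidth - 1) / binWidth
  }

  def main(args: Array[String]): Unit = {
    val chr22Length = 51304566L // GRCh37 chr22 length, used for illustration only
    for (width <- Seq(3L, 11L, 100L, 1000000L))
      println(s"binWidth=$width -> ${binCount(chr22Length, width)} bins")
  }
}
```

Under that assumption, a bin width in the range of hundreds of kilobases to megabases keeps the bin count for chr22 in the tens or hundreds, which is closer to the number of tasks a small cluster can schedule comfortably.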
@devin-petersohn Can you take this as part of #1324? I think that all that is needed to resolve this issue is to:
I'm assigning it to you for now; let me know if you'd rather not. |
This will be part of the set difference implemented in #1561. |
Ping @devin-petersohn; can you put together a small doc snippet for this? I would like to close this in 0.23.0. |
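I describe something similar here
The gene feature ADAM file came from Ensembl and was filtered via
import org.bdgenomics.adam.rdd.ADAMContext._ // adds loadFeatures to the SparkContext
val features = sc.loadFeatures("Homo_sapiens.GRCh38.89.chr.gff3.gz") // load the Ensembl GFF3
val geneFeatures = features.transform(_.filter(f => f.featureType == "gene")) // keep only gene records
geneFeatures.saveAsParquet("Homo_sapiens.GRCh38.89.chr.geneFeatures.adam") // save as Parquet for reuse
|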
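As a rough illustration of what filtering genotypes by gene features amounts to once both sides are keyed by a region, here is a minimal sketch in plain Scala collections. It does not use Spark or ADAM. `Region`, `Call`, `Gene`, and `filterByGenes` are hypothetical stand-ins for the real record types and the region join, and half-open coordinates are assumed.

```scala
// Hypothetical stand-ins for a reference region, a genotype call, and a gene feature.
final case class Region(contig: String, start: Long, end: Long) {
  // Half-open intervals [start, end) overlap when they share a contig and intersect.
  def overlaps(other: Region): Boolean =
    contig == other.contig && start < other.end && other.start < end
}
final case class Call(sampleId: String, region: Region)
final case class Gene(name: String, region: Region)

object RegionFilterSketch {
  // Keep each call that overlaps at least one gene. This is the effect of an
  // inner region join followed by dropping the feature side of each pair.
  def filterByGenes(calls: Seq[Call], genes: Seq[Gene]): Seq[Call] = {
    // Group genes by contig so a call is only compared with genes on its own contig.
    val genesByContig: Map[String, Seq[Gene]] = genes.groupBy(_.region.contig)
    calls.filter { c =>
      genesByContig
        .getOrElse(c.region.contig, Seq.empty)
        .exists(g => g.region.overlaps(c.region))
    }
  }
}
```

A distributed region join does the same comparison, but partitions or broadcasts one side so that the per-contig scan does not run on a single machine. Note that contig names must match exactly on both sides (for example "22" versus "chr22"), which is why the chr prefix mattered earlier in this thread.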
I am adding multiple additional examples for various things one could do with joins. I should have a PR in tonight. It didn't seem right to simply have the one real-world example. |
Hi,
How can I filter a genotypeRDD with a FeatureRDD? I get the following error:
with this code:
Do I need to convert the genotypeRDD and FeatureRDD to a ReferenceRegionRDD ? Is this an implicit conversion done automatically by importing a certain class?
Thank you,
Neill