
How to filter genotype RDD with FeatureRDD #890

Closed
NeillGibson opened this issue Nov 29, 2015 · 10 comments

@NeillGibson
Contributor

Hi,

How can I filter a genotypeRDD with a FeatureRDD? I get the following error:

(BroadcastRegionJoin.partitionAndJoin(genesRDD, genotypesRDD).map(_._2))
<console>:39: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[org.bdgenomics.formats.avro.Feature]
 required: org.apache.spark.rdd.RDD[(org.bdgenomics.adam.models.ReferenceRegion, ?)]
              (BroadcastRegionJoin.partitionAndJoin(genesRDD, genotypesRDD).map(_._2))

with this code:

import org.bdgenomics.adam.rdd.ADAMContext
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.BroadcastRegionJoin
import org.bdgenomics.formats.avro._

// Load the genotype and feature RDDs using the ADAMContext
val ac = new ADAMContext(sc)
val genotypesRDD = ac.loadGenotypes("/user/ec2-user/1kg/chr22.adam")

val genesRDD = ac.loadFeatures("/user/ec2-user/1kg/gene_annotation.adam")

// based on an example from the Spark Analytics book (page 213, bottom)
val genotypesInGenesRdd = (BroadcastRegionJoin.partitionAndJoin(genesRDD, genotypesRDD).map(_._2))

Do I need to convert the genotypeRDD and FeatureRDD to a ReferenceRegionRDD? Is this an implicit conversion done automatically by importing a certain class?

Thank you,

Neill

@fnothaft
Member

Hi @NeillGibson!

You'll need to key each RDD with a ReferenceRegion, e.g.:

import org.bdgenomics.adam.models.ReferenceRegion
genotypesInGenesRdd = (BroadcastRegionJoin.partitionAndJoin(genesRDD.keyBy(ReferenceRegion(_) , genotypesRDD.keyBy(g => ReferenceRegion(g.getVariant.getContig.getContigName, g.getVariant.getStart, g.getVariant.getEnd))).map(_._2))

@NeillGibson
Contributor Author

Hi @fnothaft. Thank you for the information; I needed to move some parentheses, but the command starts to execute.

val genotypesInGenesRDD = BroadcastRegionJoin.partitionAndJoin(
  genesRDD.keyBy(ReferenceRegion(_)),
  genotypesRDD.keyBy(g => ReferenceRegion(g.getVariant.getContig.getContigName,
    g.getVariant.getStart, g.getVariant.getEnd))
).map(_._2)

Tasks of the operation keep failing though, and keep being resubmitted:

15/12/01 07:57:24 WARN TaskSetManager: Lost task 6.0 in stage 3.1 (TID 170, ip-10-1-1-205.eu-west-1.compute.internal): java.io.IOException: Failed to connect to ip-10-1-1-193.eu-west-1.compute.internal/10.1.1.193:40751
15/12/01 07:58:24 WARN TaskSetManager: Lost task 6.1 in stage 3.1 (TID 171, ip-10-1-1-209.eu-west-1.compute.internal): java.io.IOException: Failed to connect to ip-10-1-1-191.eu-west-1.compute.internal/10.1.1.191:57895
15/12/01 08:40:16 WARN TaskSetManager: Lost task 21.1 in stage 4.1 (TID 300, ip-10-1-1-209.eu-west-1.compute.internal): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=22, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
15/12/01 08:40:16 WARN TaskSetManager: Lost task 44.0 in stage 4.1 (TID 307, ip-10-1-1-209.eu-west-1.compute.internal): FetchFailed(null, shuffleId=1, mapId=-1, reduceId=45, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

15/12/01 09:04:43 INFO DAGScheduler: Resubmitting ShuffleMapStage 3 (flatMap at BroadcastRegionJoin.scala:117) because some of its tasks had failed: 0, 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 45, 47, 48, 49, 50, 51, 52, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83
15/12/01 09:04:43 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[13] at flatMap at BroadcastRegionJoin.scala:117), which has no missing parents

Is this by any chance a very resource-intensive operation? I am running this on 11 m3.xlarge machines. Or is there another likely cause for the tasks that keep failing / being resubmitted? In the end I just aborted the process.

The annotation file that I am using is:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_18/gencode.v18.annotation.gtf.gz

(I removed the "chr" prefix from the chromosome names to match the chromosome names in the 1000 Genomes VCF file; the reference versions should match.)
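For illustration, the contig-name normalization could also be done in Scala on the loaded features instead of with sed on the raw GTF. The `stripChr` helper below is hypothetical (not from ADAM or this thread); it just shows the renaming rule being applied:

```scala
// Hypothetical helper, not ADAM API: normalize GENCODE-style "chr22"
// contig names to the 1000 Genomes "22" convention before joining.
// It could be applied with a map over the loaded feature records.
def stripChr(contig: String): String = contig.stripPrefix("chr")

stripChr("chr22")  // "22"
stripChr("22")     // already-normalized names pass through unchanged
```

Both sides of the join must agree on contig names, since ReferenceRegion keys compare the contig name as a plain string.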

@fnothaft
Member

fnothaft commented Dec 1, 2015

It shouldn't be terribly expensive; however, you may want to try running the ShuffleRegionJoin instead of the BroadcastRegionJoin. I'll need to look over the GTF loading code, but I'm not sure off the top of my head whether that code maps each feature in a GTF to a Feature object, or just the Gene/Transcript. The reason I'm concerned is that if you have just genes as Features, you'll have an RDD of O(20k) Features, while if you have all GTF features (gene, transcript, exon, intron, CDS, etc.), it could be several orders of magnitude bigger and thus might not fit on the driver, which would cause the BroadcastRegionJoin to fail where the ShuffleRegionJoin would succeed. That said, I'm a bit skeptical that this is the problem. Are you joining against the whole 1000 Genomes variant calls? I can try to repro this on my side.
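The sizing argument above can be sketched as a rule of thumb. The helper and threshold here are illustrative only, not ADAM API: a broadcast join only pays off when the feature side is small enough to collect onto the driver.

```scala
// Hypothetical heuristic, not ADAM API: a gene-only annotation is
// O(20k) records and is cheap to broadcast; a full GTF (genes,
// transcripts, exons, CDS, ...) may be millions of records, which
// may not fit on the driver and favors a shuffle join instead.
def preferBroadcast(featureCount: Long, driverLimit: Long = 100000L): Boolean =
  featureCount < driverLimit

preferBroadcast(20000L)    // gene-only annotation: broadcast is fine
preferBroadcast(2500000L)  // full GTF: fall back to ShuffleRegionJoin
```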

@NeillGibson
Contributor Author

Hi @fnothaft. Thank you for the information. I can try increasing the driver memory and using the ShuffleRegionJoin.

The vcf file I am testing with is
s3://1000genomes/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
This is based on the b37 assembly.

Some searching online led me to this gene annotation file (the previous one mentioned is for an older genome build).
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
This is based on GRCh37, which is the same as b37, just with "chr" in front of the chromosome names (I fixed this with sed).

The basic thing I am trying to achieve is to see how long it takes ADAM to filter a large set of genotypes based on a gene model. It doesn't need to be this exact gene model, or one this detailed. If you have a gene model that is compatible with the 1000 Genomes data, I would be happy to use that one.

I am only using chr22 because the full data set is much bigger, queries on just chr22 already take some time, and I have other queries that I've tested on chr22. I'll also filter the gene model (GTF file) down to just the chr22 annotations before converting to Feature format.

I will post my results here after testing with the correct genome build and the increased driver memory and or ShuffleRegionJoin method.

@NeillGibson
Contributor Author

Hi @fnothaft,

Increasing the driver memory from 8 to 14 GB (the machine's maximum) did not help; I still get the same errors.

I am now trying to use the ShuffleRegionJoin to filter the genotypeRDD with the FeatureRDD.


import org.bdgenomics.formats.avro._
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ShuffleRegionJoin
import org.bdgenomics.adam.models.{
  ReferencePosition,
  ReferenceRegion,
  SequenceDictionary,
  SequenceRecord,
  VariantContext
}

// Load the genotype RDD using the ADAMContext
val genotypesRDD = sc.loadGenotypes("/user/ec2-user/1kg/chr22.adam")

// Load the GRCh37 / hg19 / b37 gene annotation
val genesRDD = sc.loadFeatures("/user/ec2-user/1kg/gene_annotation.adam")

// Create a sequence dictionary for chr22
val seqDict = SequenceDictionary(SequenceRecord("22", 51304566, url = "test://chrom1"))
val partitionSize = 20

val genotypesInGenesRDD = ShuffleRegionJoin(seqDict, partitionSize).partitionAndJoin(
  genesRDD.keyBy(ReferenceRegion(_)),
  genotypesRDD.keyBy(g => ReferenceRegion(g.getVariant.getContig.getContigName,
    g.getVariant.getStart, g.getVariant.getEnd))
).map(_._2).count

On what should I base the partitionSize? If I set it to 3 (an example that I found in an ADAM test for ShuffleRegionJoin) or 11 (the number of nodes), I get an out-of-memory error:

15/12/06 08:06:00 WARN TaskSetManager: Lost task 10.0 in stage 2.0 (TID 22, ip-10-1-1-204.eu-west-1.compute.internal): java.lang.OutOfMemoryError: Java heap space

If I set partitionSize to 100, I get a lot of small tasks and processing takes a long time; I did not wait for it to finish.
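If partitionSize is the genomic width in base pairs of each join bin (an assumption based on how the ShuffleRegionJoin test values read, not something confirmed in this thread), then the bin count a given value implies for chr22 can be sketched as:

```scala
// Sketch under the ASSUMPTION that partitionSize is a bin width in bp.
// Under that reading, tiny values like 3 or 11 create millions of bins,
// which would plausibly explain the OutOfMemoryError above.
val chr22Length = 51304566L

def binCount(seqLength: Long, partitionSize: Long): Long =
  (seqLength + partitionSize - 1) / partitionSize  // ceiling division

binCount(chr22Length, 3L)        // ~17.1 million tiny bins
binCount(chr22Length, 1000000L)  // 52 megabase-scale bins
```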

Do you maybe have an example somewhere with a public VCF file and feature file of realistic sizes that works, i.e. filtering a genotypeRDD with BroadcastRegionJoin or ShuffleRegionJoin? I could then check whether I can run that example.

Maybe the issue is just with my gene annotation file...

@fnothaft
Member

@devin-petersohn Can you take this as part of #1324? I think that all that is needed to resolve this issue is to:

  • Add a unit test demonstrating how to do this exact join.
  • Add docs demonstrating this join to the region join Pandocs.

I'm assigning it to you for now; let me know if you'd rather not.

@devin-petersohn
Member

This will be a part of set difference implemented in #1561.

@fnothaft
Member

fnothaft commented Jul 11, 2017

Ping @devin-petersohn; can you put together a small doc snippet for this? I would like to close this in 0.23.0.

@heuermh
Member

heuermh commented Jul 13, 2017

I describe something similar here:
https://github.com/SVAI/UTC-0500#interactive-genomic-region-joins-in-adam-shell

The gene feature ADAM file came from Ensembl and was filtered via

val features = sc.loadFeatures("Homo_sapiens.GRCh38.89.chr.gff3.gz")
val geneFeatures = features.transform(_.filter(f => f.featureType == "gene"))
geneFeatures.saveAsParquet("Homo_sapiens.GRCh38.89.chr.geneFeatures.adam")

@devin-petersohn
Member

I am adding multiple additional examples for various things one could do with joins. I should have a PR in tonight. It didn't seem right to simply have the one real-world example.
