Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added is* genotype methods from HTS-JDK Genotype to RichGenotype #895

Merged
merged 1 commit into from
Dec 10, 2015

Conversation

NeillGibson
Copy link
Contributor

I added the is* genotype methods from HTS-JDK Genotype to Adam RichGenotype.
This makes it easier to filter on a genotypeRDD.
Compare for instance
genotypeRDD.filter(_.isHom)
to
genotypeRDD.filter(gt => (gt.getType = GenotypeType.HOM_REF) || (gt.getType = GenotypeType.HOM_ALT) )

These methods are copied almost exactly from from
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/variant/variantcontext/Genotype.java#L230

I did not add isMixed and isAvailable since Adam seems to handle these Genotypes the same as GenotypeType.NO_CALL

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@heuermh
Copy link
Member

heuermh commented Dec 7, 2015

Jenkins, test this please

/**
* True if all observed alleles are ref; if any alleles are no-calls, this method will return false.
*/
def isHomRef: Boolean = { getType == GenotypeType.HOM_REF }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Professing some Scala ignorance here, is == ok for enum values across different JVMs and serialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure , == is used in this Scala Enum documentation example http://www.scala-lang.org/api/current/index.html#scala.Enumeration

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1028/

Build result: FAILURE

GitHub pull request #895 of commit 45d68e8 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/895/merge^{commit} # timeout=10 > git branch -a --contains 5bc300a5fc823ad8fed4efbfc9d5cc4e7bdbb6c6 # timeout=10 > git rev-parse remotes/origin/pr/895/merge^{commit} # timeout=10Checking out Revision 5bc300a5fc823ad8fed4efbfc9d5cc4e7bdbb6c6 (origin/pr/895/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 5bc300a5fc823ad8fed4efbfc9d5cc4e7bdbb6c6First time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh
Copy link
Member

heuermh commented Dec 7, 2015

Build errors from Jenkins, not sure what the scalariform ones mean

[INFO] --- scalariform-maven-plugin:0.1.4:format (default-cli) @ adam-parent_2.10 ---
[ERROR] Error formatting /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.6.0/SCALAVER/2.10/SPARK_VERSION/1.4.1/label/centos/adam-core/src/test/scala/org/bdgenomics/adam/rdd/read/MDTaggingSuite.scala: java.nio.charset.MalformedInputException: Input length = 1
[ERROR] Error formatting /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.6.0/SCALAVER/2.10/SPARK_VERSION/1.4.1/label/centos/adam-core/src/main/scala/org/bdgenomics/adam/rdd/read/AlignmentRecordRDDFunctions.scala: java.nio.charset.MalformedInputException: Input length = 1
[ERROR] Error formatting /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.6.0/SCALAVER/2.10/SPARK_VERSION/1.4.1/label/centos/adam-cli/src/main/scala/org/bdgenomics/adam/cli/PrintADAM.scala: java.nio.charset.MalformedInputException: Input length = 1
...
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ adam-core_2.10 ---
[INFO] Compiling 108 source files to /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.6.0/SCALAVER/2.10/SPARK_VERSION/1.4.1/label/centos/adam-core/target/scala-2.10.4/classes at 1449506972607
[ERROR] /home/jenkins/workspace/ADAM-prb/HADOOP_VERSION/2.6.0/SCALAVER/2.10/SPARK_VERSION/1.4.1/label/centos/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala:55: error: value HOM_VAR is not a member of object org.bdgenomics.formats.avro.GenotypeType
[ERROR]   def isHomVar: Boolean = { getType == GenotypeType.HOM_VAR }
[ERROR]                                                     ^
[ERROR] one error found

@NeillGibson
Copy link
Contributor Author

Mhm I should have checked the GenotypeType Enum definition. HOM_VAR is renamed HOM_ALT in BDG.

https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L486

And that is also what getType returns in the case of two equal alternative alleles.
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L35

Checking for equality with GenotypeType.HOM_ALT should fix this error. And the method can be renamed to isHomAlt

I am also not sure what the scalariform error means.

@fnothaft
Copy link
Member

fnothaft commented Dec 7, 2015

Code LGTM; will look at the Jenkins logs.

@NeillGibson
Copy link
Contributor Author

Thank you @fnothaft . I noticed one other related thing, which might be included in this Pull request or I can open a ticket for it. The getType method of RichGenotype does not seem to handle the case where the combination of alleles contains an OtherAlt allele.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L36

That means that for genotypes with an OtherAlt allele the getType method returns GenotypeType.NO_CALL which I guess is wrong.

Shouldn't the diploid case for GenotypeType.HET be:

 case List(GenotypeAllele.Ref, GenotypeAllele.Alt) |
List(GenotypeAllele.Alt, GenotypeAllele.Ref)  |
List(GenotypeAllele.Alt, GenotypeAllele.OtherAlt) | 
List(GenotypeAllele.OtherAlt, GenotypeAllele.Alt) => GenotypeType.HET

With the above case for GenotypeType.HET the isHetNonRef method from HTS-JDK Genotype can also be implemented.

https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/variant/variantcontext/Genotype.java#L250

@NeillGibson
Copy link
Contributor Author

Hi @fnothaft . Also I believe that the getType method in RichGenotype should handle ploidy levels other than 2.

It is currently limited to haploid or diploid:

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L32

The HTS-JDK methods getType / determineType handle ploidy levels other than 2:

https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/variant/variantcontext/Genotype.java#L198

And also the biological concepts homozygous reference, homozygous alternative and heterozygous are applicable to any ploidy level. It just means that all your alleles are reference, the same alternative, or a mix of different alleles.

Except that heterozygous (mix of different alleles) is impossible in a haploid.

I could take a shot at creating a getType method which should work for any ploidy level. In this pull request or another one.

@heuermh
Copy link
Member

heuermh commented Dec 8, 2015

Out of respect for a former colleague, the term ploidy should be reserved for differences in number at the entire chromosome level. More than two observed alleles at a single position is better called something else, like copy number variation.

Not sure why Jenkins hasn't retested this yet...

@heuermh
Copy link
Member

heuermh commented Dec 8, 2015

Jenkins, retest this please

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1029/

Build result: FAILURE

GitHub pull request #895 of commit 1220f66 automatically merged.Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'[EnvInject] - Loading node environment variables.Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb > git rev-parse --is-inside-work-tree # timeout=10Fetching changes from the remote Git repository > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > git --version # timeout=10 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ > git rev-parse origin/pr/895/merge^{commit} # timeout=10 > git branch -a --contains 5fb6aae089b8ed669c626532b8e36a3282757d0e # timeout=10 > git rev-parse remotes/origin/pr/895/merge^{commit} # timeout=10Checking out Revision 5fb6aae089b8ed669c626532b8e36a3282757d0e (origin/pr/895/merge) > git config core.sparsecheckout # timeout=10 > git checkout -f 5fb6aae089b8ed669c626532b8e36a3282757d0eFirst time build. Skipping changelog.Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centosTriggering ADAM-prb ? 2.6.0,2.10,1.4.1,centosTouchstone configurations resulted in FAILURE, so aborting...Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@NeillGibson
Copy link
Contributor Author

Hi Michael thank you for retesting.

With ploidy I mean difference in number at the entire chromosome, like in some animals, plants and fungi is always the case for every chromosome. With a tetraploid organism for example a variant caller would always output 4 alleles for a genotype. For example these allele combinations which also have genotypTypes:

tetraploid allele combination genotypeType
Ref,Ref,Ref,Ref HOM_REF
Alt,Alt,Alt,Alt HOM_ALT
Ref,Alt,Alt,Alt HET
Ref,Alt,Alt,OtherAlt HET
Alt,Alt,Alt,OtherAlt HET

For the above genotypes from a tetraploid organism this assertion would fail
https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala#L32
while I think it should just return the correct genotypeType like the same function in HTS-JDK does.

But maybe this should go into another issue / pull request. To not have to many things at the same time going on.

Don't know by the way what the alleles should be for

tetraploid allele combination genotypeType
OtherAlt,OtherAlt,OtherAlt,OtherAlt ????

I guess HOM_ALT or HET depending on if the OtherAlt alleles are different of the same? But there is no way to find this out?

@heuermh
Copy link
Member

heuermh commented Dec 8, 2015

Right, we're on the same page then.

Looks like Jenkins is complaining about source code formatting

...
+ test -n ' M adam-core/src/main/scala/org/bdgenomics/adam/rich/RichGenotype.scala'
+ echo 'Please run '\''./scripts/format-source'\'''
Please run './scripts/format-source'
+ exit 1
Build step 'Execute shell' marked build as failure
Recording test results
Finished: FAILURE

Could you try running that script and then squashing the commits in this pull request?

It might be best to move the additional changes into a separate issue/pull request, since there may be other places where we aren't handling e.g. tetraploid organisms correctly.

Thanks!

@NeillGibson
Copy link
Contributor Author

Ok thank you for the tip. I tried to squash my commits. Looks like it worked.

I will create an issue or a pull request for taking care of polyploid genotypes and genotypes containing OtherAlt in getType.

@heuermh
Copy link
Member

heuermh commented Dec 8, 2015

That looks good, thanks.

I wonder if I need to kick Jenkins again. @fnothaft do we need to add @NeillGibson to some ACL somewhere for Jenkins to auto-trigger on his pull requests?

@fnothaft
Copy link
Member

Hi @NeillGibson and @heuermh!

Sorry for the slow reply.

Out of respect for a former colleague, the term ploidy should be reserved for differences in number at the entire chromosome level. More than two observed alleles at a single position is better called something else, like copy number variation.

I agree.

I could take a shot at creating a getType method which should work for any ploidy level. In this pull request or another one.

That would be great! I agree RE: your comments about the concepts being valid for all copy numbers.

It might be best to move the additional changes into a separate issue/pull request, since there may be other places where we aren't handling e.g. tetraploid organisms correctly.

Agreed.

@fnothaft do we need to add @NeillGibson to some ACL somewhere for Jenkins to auto-trigger on his pull requests?

We shouldn't need to. I'll kick Jenkins again though. The PR builder hook is a bit flaky.

@fnothaft
Copy link
Member

Jenkins, test this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1030/
Test PASSed.

@fnothaft
Copy link
Member

@heuermh is this good to merge by you?

@heuermh
Copy link
Member

heuermh commented Dec 10, 2015

+1

@fnothaft fnothaft merged commit 5895386 into bigdatagenomics:master Dec 10, 2015
@fnothaft
Copy link
Member

Merged. Thanks @NeillGibson!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants