Add ADAMContext APIs to create genomic RDDs from dataframes #2000
Conversation
Can one of the admins verify this patch?
@@ -760,4 +759,91 @@ class ADAMContextSuite extends ADAMFunSuite {
    assert(secondPg.head.getCommandLine === "\"myProg 456\"")
    assert(secondPg.head.getVersion === "1.0.0")
  }

  sparkTest("load variant contexts from dataframe") {
@fnothaft This test (and its equivalent form where we reload using the loadVariantContexts(path: String) method) doesn't pass against my PR or master, which surprised me. Any ideas why?
I'll take a look into the failure.
Not sure why this is failing; I took the test output and pasted the dump of all of the records in each array, and they have the same textual values. You could either be getting bitten by some odd floating point comparison bug, or perhaps the Array == comparator does something wonky.
Thanks for the contribution, @henrydavidge!
Jenkins, test this please
Test FAILed. Build result: FAILURE. [Jenkins log truncated: fetched and checked out revision 82e03c9455cd8a8cdbe1599c119a7401e4856ade (origin/pr/2000/merge); first-time build, changelog skipped.] All four ADAM-prb matrix builds (2.7.3/2.11, 2.7.3/2.10, 2.6.2/2.10, 2.6.2/2.11 — each with Spark 2.2.1 on centos) completed with result FAILURE.
Some code style suggestions and a few questions.
  VCFHeaderLine,
  VCFInfoHeaderLine
}
import htsjdk.variant.vcf.{ VCFCompoundHeaderLine, VCFFormatHeaderLine, VCFHeader, VCFHeaderLine, VCFInfoHeaderLine }
Code style: we format this style of import to separate lines if there are more than three items or if the line gets too long. Same goes for the other import formatting changes.
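For illustration, the five-item import from the diff above, split across lines in that style, would look roughly like this (a formatting convention only, not a behavior change):

```scala
import htsjdk.variant.vcf.{
  VCFCompoundHeaderLine,
  VCFFormatHeaderLine,
  VCFHeader,
  VCFHeaderLine,
  VCFInfoHeaderLine
}
```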
@@ -2371,6 +2318,13 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
    }
  }

  def loadVariantContexts(df: DataFrame, metadataPath: String): VariantContextRDD = {
If we're calling df.as[FooProduct] to convert to a Dataset, shouldn't the API method accept Dataset[FooProduct] instead? This principle is described here: https://github.com/google/guice/wiki/InjectOnlyDirectDependencies
As a public API, I think it's more convenient to pass a dataframe since the initial transformation is likely to be expressed in SQL or the dataframe API. Since users may call this method interactively from the shell, I think it's important to minimize friction.
I think exposing this as a DataFrame is fine. We're not super opinionated between DataFrames and Datasets (wherever we expose a Dataset, we expose the underlying DataFrame as well), and in the Python/R APIs, we only expose DataFrames.
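A minimal sketch of the DataFrame-accepting shape being discussed, assuming the df.as[...] conversion stays an internal detail; the body is elided, and the names follow the diff rather than the final implementation:

```scala
def loadVariantContexts(df: DataFrame, metadataPath: String): VariantContextRDD = {
  import spark.implicits._
  // The typed conversion is kept inside the method, so interactive users can
  // pass the result of a SQL query or DataFrame transformation directly.
  val dataset = df.as[org.bdgenomics.adam.sql.VariantContext]
  // ...then attach metadata (sequence dictionary, samples) loaded from metadataPath
  ???
}
```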
import org.bdgenomics.adam.models._
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.util.PhredUtils._
import org.bdgenomics.adam.util.ADAMFunSuite
import org.bdgenomics.adam.sql.{ VariantContext => VCProduct }
Code style: don't add unnecessary abbreviations, especially in class names
+1
def loadAlignments(df: DataFrame, metadataPath: String): AlignmentRecordRDD = {
  val sd = loadAvroSequenceDictionary(metadataPath)
  val rgd = loadAvroRecordGroupDictionary(metadataPath)
  val process = loadAvroPrograms(metadataPath)
process → processingSteps
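Applying the suggested rename to the quoted lines (method body elided with a placeholder):

```scala
def loadAlignments(df: DataFrame, metadataPath: String): AlignmentRecordRDD = {
  val sd = loadAvroSequenceDictionary(metadataPath)
  val rgd = loadAvroRecordGroupDictionary(metadataPath)
  val processingSteps = loadAvroPrograms(metadataPath) // renamed from `process`
  ???
}
```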
@@ -1121,6 +1072,8 @@ private class NoPrefixFileFilter(private val prefix: String) extends PathFilter
 * @param sc The SparkContext to wrap.
 */
class ADAMContext(@transient val sc: SparkContext) extends Serializable with Logging {
  @transient val spark = SQLContext.getOrCreate(sc).sparkSession
This doesn't appear to be used anywhere, except for the import below. Why bring that up here instead of keeping it as import sqlContext.implicits._ within a method, as before?
Well, I thought it was a reasonable thing to have available, and it saved me from importing the implicits in each of the new functions I added. I'm fine with moving it if you prefer, though.
If it eliminates the need to repeatedly import the implicits, then I'd favor keeping it.
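A sketch of the placement being discussed: defining the session once at class level lets one import of the implicits cover every method, instead of repeating import sqlContext.implicits._ in each one. Member names follow the diff; the method body is elided:

```scala
class ADAMContext(@transient val sc: SparkContext) extends Serializable with Logging {
  @transient val spark = SQLContext.getOrCreate(sc).sparkSession
  // One class-level import of the encoders/implicits covers every method
  // below, avoiding a per-method `import sqlContext.implicits._`.
  import spark.implicits._

  def loadVariantContexts(df: DataFrame, metadataPath: String): VariantContextRDD = ???
}
```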
Generally LGTM, a few small comments. Would you be able to add this to the Python and R APIs as well?
@@ -18,15 +18,10 @@
package org.bdgenomics.adam.rdd

import java.io.{ File, FileNotFoundException, InputStream }
Nit: no space after import.
Closing in favor of #2158
In some cases, it's convenient to use vanilla Spark APIs to load genomic data and apply some basic transformations before creating an ADAM object. This PR adds methods to ADAMContext to load each type of genomic RDD from a Spark SQL DataFrame and a metadata path. We look for metadata such as the sequence and record group dictionaries in the metadata path.
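A hypothetical end-to-end use of the proposed API; the path, column name, and filter are illustrative only:

```scala
// Read Parquet-backed genomic data with vanilla Spark, apply a basic
// DataFrame transformation, then hand the result to the new ADAMContext
// method together with the path holding the Avro metadata.
val df = spark.read.parquet("sample.alignments.adam")
  .filter("contigName = 'chr1'")
val alignments = sc.loadAlignments(df, "sample.alignments.adam")
```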