Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1883] Python and R caching #1885

Closed
wants to merge 1 commit into from

Conversation

akmorrow13
Copy link
Contributor

No description provided.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.05%) to 82.728% when pulling b11d7c2 on akmorrow13:python-caching into c5b4ebc on bigdatagenomics:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.05%) to 82.728% when pulling b11d7c2 on akmorrow13:python-caching into c5b4ebc on bigdatagenomics:master.

@coveralls
Copy link

coveralls commented Jan 24, 2018

Coverage Status

Coverage decreased (-0.2%) to 82.561% when pulling 129f617 on akmorrow13:python-caching into eb3528c on bigdatagenomics:master.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2591/
Test PASSed.

@akmorrow13
Copy link
Contributor Author

Resolves #1883

@akmorrow13
Copy link
Contributor Author

akmorrow13 commented Jan 24, 2018

This is tested and works like so:

reads = ac.loadAlignments(readsPath, stringency=LENIENT)
reads.__jvmRdd.cache()
reads.toDF().count() # caching
reads.toDF().count() # is cached, will be fast
reads._jvmRdd.unpersist()

@heuermh
Copy link
Member

heuermh commented Jan 24, 2018

I'm not sure I'm comfortable reviewing Python stuff yet. Is it reasonable to expose reads.__jvmRdd. to the caller like this? And isn't toDF().rdd contradictory?

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2594/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 1c40800 # timeout=10Checking out Revision 1c40800 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1c40800b3b7b8e34bbc320651ff03f1dcc7c1e12First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13
Copy link
Contributor Author

@heuermh

  1. _jvmRDD has already been exposed, this PR does not expose that.
  2. I have added no code calling .toDF().rdd, so I am not sure what problem there is there. This is the most straightforward way to access the pyspark RDD currently. We could add a better method for accessing this but that is a different problem then the one we are addressing here.

@heuermh
Copy link
Member

heuermh commented Jan 24, 2018

re 2 I'm asking is the toDF() necessary in your comment above? Seems odd to convert to data frame and then back to rdd.

@akmorrow13
Copy link
Contributor Author

I agree @heuermh, but this is currently the only way to access a pyspark rdd from our API. The other rdd option is ._jvmRdd, but this is not a pyspark RDD, just an inaccessible Java Member.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2595/
Test PASSed.

@heuermh
Copy link
Member

heuermh commented Jan 24, 2018

I see, thanks

@akmorrow13
Copy link
Contributor Author

@heuermh I updated the count example to exclude .rdd so it isn't so confusing

Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this @akmorrow13! Two things:

  1. Can you add a persist(sl: StorageLevel) method to GenomicRDD?
  2. Can you add the persist/cache/unpersist methods on the Python (and R) GenomicRDD classes? cache and unpersist should be trivial but persist will take a tiny amount more finesse.

To wit:

_jvmRDD has already been exposed, this PR does not expose that.

_ is Python convention for "protected", so don't expect users to call that. Instead, to GenomicRDD add:

  def cache(self):

     self._jvmRdd.cache()

@@ -137,6 +137,26 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
*/
def replaceSequences(newSequences: SequenceDictionary): U

/**
* Caches underlying RDD in memory. This interfaced is used to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop comment about R/Python, this method is useful independent of that.


/**
* Unpersists underlying RDD from memory. This interfaced is used to
* access caching functionality from the python and R APIs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop comment about R/Python, this method is useful independent of that.

* @return type of RDD that was cached
*/
def cache() = {
rdd.cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking out of ignorance, does caching make any sense for GenomicDataset.dataset?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes... These should probably be overridden by the DatasetBound...RDD classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fnothaft @heuermh I am confused what this means. Do you want me to implement this function in every DatasetBound...RDD function? There seems to be a lot of them, while this approach just requires 1 implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @akmorrow13! Sorry for the slow reply here. Yes, that is what I want, but the simpler way to do this is to have RDDBoundGenomicRDD and DatasetBoundGenomicRDD traits that both implement the cache, persist, and unpersist methods. Then RDDBoundAlignmentRecordRDD would extend RDDBoundGenomicRDD, etc. We should do this for saveAsParquet as well.

Also, my preference is that unpersist returns Unit (no return value) and that cache/persist return either U or Unit.

@akmorrow13
Copy link
Contributor Author

akmorrow13 commented Jan 25, 2018

I added a persist function, but I have absolutely no idea to handle passing parameters from python to scala. I tried following the example for stringency, like this:

self._jvmRdd.persist(self.sc._jvm.org.apache.spark.storage.StorageLevel.apply(sl_dict["useDisk"], sl_dict["useMemory"],
                             sl_dict["deserialized"], sl_dict["replication"]))

but that didn't seem to work

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2599/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2600/
Test PASSed.

def cache(self):
self._jvmRdd.cache()

def persist(self, sl):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. sl should be of type StorageLevel
  2. Use the Java StorageLevels.create method
  3. Voila:
self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.StorageLevels.create(sl.useDisk, sl.useMemory, sl.useOffHeap, sl.useDeserialized, sl.replication)

Py4j lines Python types up with Java types, so you need to line Python bool up with Java boolean, not Scala.Boolean, Python int up with Java int, not Scala.Integer, etc.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2601/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2603/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 52c468a # timeout=10Checking out Revision 52c468a (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 52c468a > /home/jenkins/git2/bin/git rev-list 92d412d # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2604/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 52c468a # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13
Copy link
Contributor Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2605/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2606/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 01bd9f7 # timeout=10Checking out Revision 01bd9f7 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 01bd9f7 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2607/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b38a1c5 # timeout=10Checking out Revision b38a1c5 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b38a1c5 > /home/jenkins/git2/bin/git rev-list 01bd9f7 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13
Copy link
Contributor Author

Jenkins, retest this please.

Caches underlying RDD in memory.
"""

self._jvmRdd.cache()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

:param sl new StorageLevel
"""

self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

Unpersists underlying RDD from memory.
"""

self._jvmRdd.unpersist()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

setMethod("cache",
signature(ardd = "GenomicRDD"),
function(ardd) {
sparkR.callJMethod(ardd@jrdd, "cache")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

sl = "character"),
function(ardd, sl) {
storageLevel <- sparkR.callJStatic("org.apache.spark.storage.StorageLevel", "fromString", sl)
sparkR.callJMethod(ardd@jrdd, "persist", storageLevel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

setMethod("unpersist",
signature(ardd = "GenomicRDD"),
function(ardd) {
sparkR.callJMethod(ardd@jrdd, "unpersist")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, these should replace the RDD.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2648/
Test PASSed.

@akmorrow13 akmorrow13 added this to the 0.24.0 milestone Feb 7, 2018
Copy link
Member

@fnothaft fnothaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of stylistic nits. Looks good otherwise; thanks @akmorrow13!

JavaSaveArgs,
SAMHeaderWriter
}
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, JavaSaveArgs, SAMHeaderWriter }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please split import across multiple lines.

@@ -30,7 +30,7 @@ import org.bdgenomics.adam.models.{
SequenceDictionary
}
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.{ AvroRecordGroupGenomicRDD, JavaSaveArgs }
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroRecordGroupGenomicRDD, JavaSaveArgs }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split long import line across multiple lines.

AvroGenomicRDD,
VCFHeaderUtils
}
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, VCFHeaderUtils }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please split long import across multiple lines.

:param sl new StorageLevel
"""

return self._replaceRdd(self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is really long. Can you split it out as:

jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk,
  sl.useMemory,
  sl.useOffHeap,
  sl.deserialized,
  sl.replication)

return self._replaceRdd(self._jvmRdd.persist(jsl))

Indented parameters should align with the opening delimiter.

@@ -201,3 +202,33 @@ def test_filterByOverlappingRegions(self):

filtered = reads.filterByOverlappingRegions(querys)
self.assertEquals(filtered.toDF().count(), 2)

def test_caching(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 2 lines of space before each new function definition.

@akmorrow13 akmorrow13 force-pushed the python-caching branch 2 times, most recently from 863b906 to 0fdbdf7 Compare February 7, 2018 18:07
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2650/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2651/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2652/
Test PASSed.

Copy link
Member

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor doc fixes, and a question about GenomicDatset.

}

/**
* Persists underlying RDD in memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.

}

/**
* Unpersists underlying RDD from memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.

/**
* A trait describing a GenomicDataset that also supports the Spark SQL APIs.
*/
trait DatasetBoundGenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicDataset[T, U, V] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fnothaft I think you may have suggested this, doesn't GenomicDataset already imply the RDD has been bound to a dataset? Couldn't these be moved there?

I'm not very sure what GenomicDataset is for, it isn't used everywhere I would've thought
https://github.com/bigdatagenomics/adam/search?utf8=%E2%9C%93&q=GenomicDataset&type=

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, GenomicDataset is any GenomicRDD where the underlying genomic data can be viewed as both an RDD or a Dataset, while the base GenomicRDD trait only provides RDDs. This distinction will go away soon.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParquetUnbound/RDDBound/DatasetBound implies how the data is currently physically represented. Parquet unbound means that the data hasn't been materialized into memory yet and that the data is represented using a Parquet file, while RDD/DatasetBound means that the current (as of the last transformation) representation is an RDD/Dataset.

Since cache impacts the physical materialization of the data, we want the new DatasetBoundGenomicDataset abstract class to override cache, and not the GenomicDataset abstract class.

}

/**
* Persists underlying dataset in memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persists underlying dataset in memory. → Persists underlying dataset in memory or disk.

* Persists underlying dataset in memory.
*
* @param sl new StorageLevel
* @return Unit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@return Unit → @return V cached GenomicDataset

}

/**
* Unpersists underlying dataset from memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unpersists underlying dataset from memory. → Unpersists underlying dataset from memory or disk.

/**
* Unpersists underlying dataset from memory.
*
* @return Unit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@return Unit → @return V uncached GenomicDataset

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unpersisted, not uncached


def persist(self, sl):
"""
Persists underlying RDD in memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.


def unpersist(self):
"""
Unpersists underlying RDD from memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2654/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list e719c88 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13
Copy link
Contributor Author

Jenkins, retest this please.

1 similar comment
@akmorrow13
Copy link
Contributor Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2655/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

:return: Returns a new, persisted RDD.
"""

jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use \ to break lines, and please indent to the opening delimiter. I.e., every sl should be at the same character position on the line.


return self._replaceRdd(self._jvmRdd.persist(jsl))

def unpersist(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two spaces above.

@fnothaft
Copy link
Member

fnothaft commented Feb 9, 2018

From looking at Jenkins, needs a ./scripts/format-sources.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2656/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 89131e6 # timeout=10Checking out Revision 89131e6 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 89131e6 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13
Copy link
Contributor Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2659/
Test PASSed.

@akmorrow13
Copy link
Contributor Author

I think this is all set @fnothaft let me know if there is anything else that is needed

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2667/
Test PASSed.

@fnothaft
Copy link
Member

Thanks @akmorrow13! Merged manually as 6051321.

@fnothaft fnothaft closed this Feb 14, 2018
@heuermh heuermh changed the title Python and R caching [ADAM-1883] Python and R caching Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants