-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAM-1883] Python and R caching #1885
Conversation
1 similar comment
Test PASSed. |
Resolves #1883 |
This is tested and works like so:
|
I'm not sure I'm comfortable reviewing Python stuff yet. Is it reasonable to expose |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 1c40800 # timeout=10Checking out Revision 1c40800 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1c40800b3b7b8e34bbc320651ff03f1dcc7c1e12First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
931c44f
to
2e77fb1
Compare
|
re 2 I'm asking is the |
I agree @heuermh, but this is currently the only way to access a pyspark rdd from our API. The other rdd option is ._jvmRdd, but this is not a pyspark RDD, just an inaccessible Java Member. |
Test PASSed. |
I see, thanks |
@heuermh I updated the count example to exclude .rdd so it isn't so confusing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this @akmorrow13! Two things:
- Can you add a
persist(sl: StorageLevel)
method toGenomicRDD
? - Can you add the persist/cache/unpersist methods on the Python (and R)
GenomicRDD
classes? cache and unpersist should be trivial butpersist
will take a tiny amount more finesse.
To wit:
_jvmRDD has already been exposed, this PR does not expose that.
_
is Python convention for "protected", so don't expect users to call that. Instead, to GenomicRDD add:
def cache(self):
self._jvmRdd.cache()
@@ -137,6 +137,26 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging { | |||
*/ | |||
def replaceSequences(newSequences: SequenceDictionary): U | |||
|
|||
/** | |||
* Caches underlying RDD in memory. This interfaced is used to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop comment about R/Python, this method is useful independent of that.
|
||
/** | ||
* Unpersists underlying RDD from memory. This interfaced is used to | ||
* access caching functionality from the python and R APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop comment about R/Python, this method is useful independent of that.
* @return type of RDD that was cached | ||
*/ | ||
def cache() = { | ||
rdd.cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking out of ignorance, does caching make any sense for GenomicDataset.dataset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes... These should probably be overridden by the DatasetBound...RDD
classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @akmorrow13! Sorry for the slow reply here. Yes, that is what I want, but the simpler way to do this is to have RDDBoundGenomicRDD
and DatasetBoundGenomicRDD
traits that both implement the cache
, persist
, and unpersist
methods. Then RDDBoundAlignmentRecordRDD
would extend RDDBoundGenomicRDD
, etc. We should do this for saveAsParquet
as well.
Also, my preference is that unpersist
returns Unit
(no return value) and that cache
/persist
return either U
or Unit
.
I added a persist function, but I have absolutely no idea to handle passing parameters from python to scala. I tried following the example for stringency, like this:
but that didn't seem to work |
Test PASSed. |
Test PASSed. |
adam-python/bdgenomics/adam/rdd.py
Outdated
def cache(self): | ||
self._jvmRdd.cache() | ||
|
||
def persist(self, sl): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sl
should be of typeStorageLevel
- Use the Java StorageLevels.create method
- Voila:
self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.StorageLevels.create(sl.useDisk, sl.useMemory, sl.useOffHeap, sl.useDeserialized, sl.replication)
Py4j lines Python types up with Java types, so you need to line Python bool
up with Java boolean
, not Scala.Boolean
, Python int
up with Java int
, not Scala.Integer
, etc.
Test PASSed. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 52c468a # timeout=10Checking out Revision 52c468a (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 52c468a > /home/jenkins/git2/bin/git rev-list 92d412d # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 52c468a # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 01bd9f7 # timeout=10Checking out Revision 01bd9f7 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 01bd9f7 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b38a1c5 # timeout=10Checking out Revision b38a1c5 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b38a1c5 > /home/jenkins/git2/bin/git rev-list 01bd9f7 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please. |
adam-python/bdgenomics/adam/rdd.py
Outdated
Caches underlying RDD in memory. | ||
""" | ||
|
||
self._jvmRdd.cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
adam-python/bdgenomics/adam/rdd.py
Outdated
:param sl new StorageLevel | ||
""" | ||
|
||
self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
adam-python/bdgenomics/adam/rdd.py
Outdated
Unpersists underlying RDD from memory. | ||
""" | ||
|
||
self._jvmRdd.unpersist() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
adam-r/bdgenomics.adam/R/rdd.R
Outdated
setMethod("cache", | ||
signature(ardd = "GenomicRDD"), | ||
function(ardd) { | ||
sparkR.callJMethod(ardd@jrdd, "cache") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
adam-r/bdgenomics.adam/R/rdd.R
Outdated
sl = "character"), | ||
function(ardd, sl) { | ||
storageLevel <- sparkR.callJStatic("org.apache.spark.storage.StorageLevel", "fromString", sl) | ||
sparkR.callJMethod(ardd@jrdd, "persist", storageLevel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
adam-r/bdgenomics.adam/R/rdd.R
Outdated
setMethod("unpersist", | ||
signature(ardd = "GenomicRDD"), | ||
function(ardd) { | ||
sparkR.callJMethod(ardd@jrdd, "unpersist") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, these should replace the RDD.
Test PASSed. |
bff842b
to
164c067
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of stylistic nits. Looks good otherwise; thanks @akmorrow13!
JavaSaveArgs, | ||
SAMHeaderWriter | ||
} | ||
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, JavaSaveArgs, SAMHeaderWriter } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please split import across multiple lines.
@@ -30,7 +30,7 @@ import org.bdgenomics.adam.models.{ | |||
SequenceDictionary | |||
} | |||
import org.bdgenomics.adam.rdd.ADAMContext._ | |||
import org.bdgenomics.adam.rdd.{ AvroRecordGroupGenomicRDD, JavaSaveArgs } | |||
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroRecordGroupGenomicRDD, JavaSaveArgs } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split long import line across multiple lines.
AvroGenomicRDD, | ||
VCFHeaderUtils | ||
} | ||
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, VCFHeaderUtils } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: please split long import across multiple lines.
adam-python/bdgenomics/adam/rdd.py
Outdated
:param sl new StorageLevel | ||
""" | ||
|
||
return self._replaceRdd(self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is really long. Can you split it out as:
jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk,
sl.useMemory,
sl.useOffHeap,
sl.deserialized,
sl.replication)
return self._replaceRdd(self._jvmRdd.persist(jsl))
Indented parameters should align with the opening delimiter.
@@ -201,3 +202,33 @@ def test_filterByOverlappingRegions(self): | |||
|
|||
filtered = reads.filterByOverlappingRegions(querys) | |||
self.assertEquals(filtered.toDF().count(), 2) | |||
|
|||
def test_caching(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: 2 lines of space before each new function definition.
863b906
to
0fdbdf7
Compare
Test PASSed. |
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some minor doc fixes, and a question about GenomicDatset
.
} | ||
|
||
/** | ||
* Persists underlying RDD in memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.
} | ||
|
||
/** | ||
* Unpersists underlying RDD from memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.
/** | ||
* A trait describing a GenomicDataset that also supports the Spark SQL APIs. | ||
*/ | ||
trait DatasetBoundGenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicDataset[T, U, V] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fnothaft I think you may have suggested this, doesn't GenomicDataset
already imply the RDD has been bound to a dataset? Couldn't these be moved there?
I'm not very sure what GenomicDataset
is for, it isn't used everywhere I would've thought
https://github.com/bigdatagenomics/adam/search?utf8=%E2%9C%93&q=GenomicDataset&type=
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, GenomicDataset
is any GenomicRDD
where the underlying genomic data can be viewed as both an RDD or a Dataset, while the base GenomicRDD
trait only provides RDDs. This distinction will go away soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ParquetUnbound
/RDDBound
/DatasetBound
implies how the data is currently physically represented. Parquet unbound means that the data hasn't been materialized into memory yet and that the data is represented using a Parquet file, while RDD/DatasetBound means that the current (as of the last transformation) representation is an RDD/Dataset.
Since cache
impacts the physical materialization of the data, we want the new DatasetBoundGenomicDataset
abstract class to override cache
, and not the GenomicDataset
abstract class.
} | ||
|
||
/** | ||
* Persists underlying dataset in memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Persists underlying dataset in memory. → Persists underlying dataset in memory or disk.
* Persists underlying dataset in memory. | ||
* | ||
* @param sl new StorageLevel | ||
* @return Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
/** | ||
* Unpersists underlying dataset from memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unpersists underlying dataset from memory. → Unpersists underlying dataset from memory or disk.
/** | ||
* Unpersists underlying dataset from memory. | ||
* | ||
* @return Unit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unpersisted, not uncached
adam-python/bdgenomics/adam/rdd.py
Outdated
|
||
def persist(self, sl): | ||
""" | ||
Persists underlying RDD in memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.
adam-python/bdgenomics/adam/rdd.py
Outdated
|
||
def unpersist(self): | ||
""" | ||
Unpersists underlying RDD from memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.
0fdbdf7
to
026ffad
Compare
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list e719c88 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please. |
1 similar comment
Jenkins, retest this please. |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
adam-python/bdgenomics/adam/rdd.py
Outdated
:return: Returns a new, persisted RDD. | ||
""" | ||
|
||
jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use \
to break lines, and please indent to the opening delimiter. I.e., every sl
should be at the same character position on the line.
adam-python/bdgenomics/adam/rdd.py
Outdated
|
||
return self._replaceRdd(self._jvmRdd.persist(jsl)) | ||
|
||
def unpersist(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two spaces above.
From looking at Jenkins, needs a |
Test FAILed. Build result: FAILURE[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 89131e6 # timeout=10Checking out Revision 89131e6 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 89131e6 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'Test FAILed. |
Jenkins, retest this please. |
Test PASSed. |
I think this is all set @fnothaft let me know if there is anything else that is needed |
36d11d5
to
129f617
Compare
Test PASSed. |
Thanks @akmorrow13! Merged manually as 6051321. |
No description provided.