[SPARK-1991] Support custom storage levels for vertices and edges #946
Conversation
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15359/
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15361/
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
 * [[org.apache.spark.graphx.EdgeRDD#cache]] on the returned EdgeRDD.
 */
private[graphx] def withTargetStorageLevel(
    targetStorageLevel_ : StorageLevel): EdgeRDD[ED, VD] = {
I think it is ok to just shadow the class member targetStorageLevel rather than adding a weird _ at the end ...
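For illustration, here is a minimal sketch of the shadowing variant (the commit list at the end shows this is what the patch ultimately adopts: "Shadow members in withXYZ() methods rather than using underscores"), as it would read inside EdgeRDD:

```scala
// Sketch only: the parameter shadows the class member `targetStorageLevel`,
// so no trailing underscore is needed. `this.partitionsRDD` still refers to
// the member, while the bare name resolves to the argument.
private[graphx] def withTargetStorageLevel(
    targetStorageLevel: StorageLevel): EdgeRDD[ED, VD] = {
  new EdgeRDD(this.partitionsRDD, targetStorageLevel)
}
```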
Merged build triggered.
Merged build started.
@@ -52,25 +56,48 @@ object Analytics extends Logging {
  }
}

def pickStorageLevel(v: String): StorageLevel = {
perhaps move this into Spark's StorageLevel itself.
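As a rough sketch of that suggestion (the final commit list confirms pickStorageLevel did move to a StorageLevel.fromString), such a helper on StorageLevel's companion object might look like this; only a few of the predefined levels are spelled out here:

```scala
import org.apache.spark.storage.StorageLevel

// Hedged sketch of a name-to-level lookup; the real method would cover
// every predefined storage level, not just these four.
def fromString(s: String): StorageLevel = s match {
  case "NONE"            => StorageLevel.NONE
  case "DISK_ONLY"       => StorageLevel.DISK_ONLY
  case "MEMORY_ONLY"     => StorageLevel.MEMORY_ONLY
  case "MEMORY_AND_DISK" => StorageLevel.MEMORY_AND_DISK
  case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: $s")
}
```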
The changes look good to me, other than the minor thing on storage level.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
@@ -32,7 +33,8 @@ import org.apache.spark.graphx.impl.EdgePartition
  * `impl.ReplicatedVertexView`.
  */
 class EdgeRDD[@specialized ED: ClassTag, VD: ClassTag](
-    val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])])
+    val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])],
+    val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
   extends RDD[Edge[ED]](partitionsRDD.context, List(new OneToOneDependency(partitionsRDD))) {
Should EdgeRDD be marked @DeveloperApi? Or can users use it directly? This is technically a binary-compat breaking change (though it doesn't affect source compat).
BTW you could avoid the breakage by having separate 2-arg and 3-arg constructors but if this is an internal API it's fine to leave it. Just wanted to ask whether users call this directly.
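A hedged sketch of that alternative, going by the diff above (where the old constructor took only partitionsRDD): an explicit auxiliary constructor keeps the old bytecode-level signature alive, whereas a default parameter replaces it with a two-argument constructor plus a generated `$default$` accessor.

```scala
import scala.reflect.ClassTag
import org.apache.spark.OneToOneDependency
import org.apache.spark.graphx.{Edge, PartitionID}
import org.apache.spark.graphx.impl.EdgePartition
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

class EdgeRDD[@specialized ED: ClassTag, VD: ClassTag](
    val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])],
    val targetStorageLevel: StorageLevel)
  extends RDD[Edge[ED]](partitionsRDD.context,
    List(new OneToOneDependency(partitionsRDD))) {

  // Explicit auxiliary constructor matching the pre-patch signature, so
  // bytecode compiled against the old one-argument constructor still links.
  def this(partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])]) =
    this(partitionsRDD, StorageLevel.MEMORY_ONLY)

  // ... rest of EdgeRDD unchanged ...
}
```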
Users may manipulate it directly, because it's returned by Graph#edges, but they should never call the constructor. I actually wanted to make this constructor private, but that interfered with Scala specialization.
Ah, weird. Probably long-term the way to do it might be to create a trait EdgeRDD that users see, and an EdgeRDDImpl that is private[graphx].
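A minimal sketch of that split, assuming it lives in package org.apache.spark.graphx (GraphX did adopt an EdgeRDD/EdgeRDDImpl split in a later release, though the details here are illustrative only):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.PartitionID
import org.apache.spark.graphx.impl.EdgePartition
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Public-facing type: users program against this.
trait EdgeRDD[ED] {
  def cache(): this.type
}

// Concrete class, constructor and all, hidden inside the package.
private[graphx] class EdgeRDDImpl[ED: ClassTag, VD: ClassTag](
    val partitionsRDD: RDD[(PartitionID, EdgePartition[ED, VD])],
    val targetStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
  extends EdgeRDD[ED] {

  // Persist the underlying partitions at the user-chosen level.
  override def cache(): this.type = {
    partitionsRDD.persist(targetStorageLevel)
    this
  }
}
```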
Merged build finished. All automated tests passed.
All automated tests passed.
Hey, so one comment: overall this patch introduces default parameters in several places, including some public APIs (e.g. Graph.fromEdgeTuples). This will break binary compatibility but not source compatibility in Scala. Since GraphX is alpha, maybe that's okay, but we need to decide at what granularity of releases we can make such changes.
I think it's not essential to get this into 1.0.1 since that'll be a bugfix release, but I agree about default parameters. In a future PR, or maybe even this one, I can remove the default values. Unfortunately it won't work to use the builder pattern for this, because Graph.fromEdgeTuples calls cache() and therefore needs the storage levels immediately.
Thanks. I am merging this into master.
This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory by specifying MEMORY_AND_DISK and then repartitioning the graph into many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.

The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.

In order to propagate the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.

I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms failed with OutOfMemoryErrors; with this PR and the DISK_ONLY storage level, they succeed.

Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#946 from ankurdave/SPARK-1991 and squashes the following commits:

ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
Author: Ankur Dave <ankurdave@gmail.com>

Closes apache#970 from ankurdave/SPARK-1991_docfix and squashes the following commits:

6d07343 [Ankur Dave] Minor: Fix documentation error from apache#946
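A hedged usage sketch of the merged feature: parameter names follow the post-merge GraphLoader API (edgeStorageLevel, vertexStorageLevel), while the partition count and tolerance are arbitrary illustration values.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

// Load an edge list into many small partitions, each of which fits in
// memory, and let Spark spill the rest to disk while PageRank runs.
def diskBackedPageRank(sc: SparkContext, path: String): Long = {
  val graph = GraphLoader.edgeListFile(
    sc, path,
    numEdgePartitions = 256, // hypothetical: small partitions for a ~1 GB graph
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
  graph.pageRank(tol = 0.001).vertices.count()
}
```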