Unify GraphImpl RDDs + other graph load optimizations #137

ankurdave · 2014-04-14T19:23:51Z

This is work in progress. The PR also includes hotfixes apache/spark#367 and apache/spark#368. The main change is in 3c7a6bc.

This commit makes the following changes:

Unify RDDs to avoid zipPartitions. A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).
Avoid duplicate shuffle in graph building. We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.
Avoid no-op shuffle when joins are fully eliminated. This is a side effect of unifying the edges and the triplet view.
Join elimination for mapTriplets.
Ship only the needed vertex attributes when upgrading the triplet view. If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in ReplicatedVertexView#upgrade.

Empty edge partitions sometimes appear in the output of zipPartitions for unknown reasons, causing calls to Iterator#next to fail. This commit checks these cases, handles them by returning an empty iterator, and logs an error if this would cause GraphX to drop a corresponding non-empty partition. Resolves amplab#52.

Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition. GraphX currently creates one entry per vertex partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case. Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.

This should all work as expected with the current version of the tachyon tarball (0.4.1) Author: Nick Lanham <nick@afternight.org> Closes amplab#137 from nicklan/bundle-tachyon and squashes the following commits: 2eee15b [Nick Lanham] Put back in exec, start tachyon first 738ba23 [Nick Lanham] Move tachyon out of sbin f2f9bc6 [Nick Lanham] More checks for tachyon script 111e8e1 [Nick Lanham] Only try tachyon operations if tachyon script exists 0561574 [Nick Lanham] Copy over web resources so web interface can run 4dc9809 [Nick Lanham] Update to tachyon 0.4.1 0a1a20c [Nick Lanham] Add scripts using tachyon tarball

(cherry picked from commit 54cbd2b) Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala

(cherry picked from commit bcc79f2)

(cherry picked from commit 1993225) Conflicts: graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala

This commit makes the following changes: 1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices). 2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former. 3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view. 4. *Join elimination for mapTriplets.* 5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`.

ankurdave added 4 commits April 8, 2014 22:56

Merge branch 'fix-pid2vid-size' into unify-rdds

a142d14

Merge branch 'handle-empty-partitions' into unify-rdds

777654e

ankurdave added 8 commits April 20, 2014 22:47

Log current Pregel iteration

d419f00

(cherry picked from commit 54cbd2b) Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala

In Analytics, take PageRank numIter

3860a49

(cherry picked from commit bcc79f2)

In GraphLoader, coalesce to minEdgePartitions

dc5512a

(cherry picked from commit 1993225) Conflicts: graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala

Show RDD name in stage table for easier debugging

7c6ecfe

Custom serializer for RoutingTableMessage

d6c8a9e

Use *AggMsgSerializer for all RDD[(VertexId, VD)] shuffles

cccff33

Faster EdgePartition building

f84585c

ankurdave closed this Jun 6, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify GraphImpl RDDs + other graph load optimizations #137

Unify GraphImpl RDDs + other graph load optimizations #137

ankurdave commented Apr 14, 2014

Unify GraphImpl RDDs + other graph load optimizations #137

Unify GraphImpl RDDs + other graph load optimizations #137

Conversation

ankurdave commented Apr 14, 2014