Unify GraphImpl RDDs + other graph load optimizations #137

Closed
ankurdave wants to merge 12 commits

ankurdave
Member

This is a work in progress. The PR also includes hotfixes apache/spark#367 and apache/spark#368. The main change is in 3c7a6bc.

This commit makes the following changes:

  1. Unify RDDs to avoid zipPartitions. A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices).
  2. Avoid duplicate shuffle in graph building. We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former.
  3. Avoid no-op shuffle when joins are fully eliminated. This is a side effect of unifying the edges and the triplet view.
  4. Join elimination for mapTriplets.
  5. Ship only the needed vertex attributes when upgrading the triplet view. If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in ReplicatedVertexView#upgrade.
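
To make item 5 concrete, here is a minimal sketch of the upgrade decision in plain Scala. It is not the code in `ReplicatedVertexView#upgrade`; the names `ViewState`, `UpgradeSketch`, and the tuple return shape are hypothetical and chosen only for illustration.

```scala
// Hypothetical model of which vertex attributes the edge partitions already hold.
final case class ViewState(hasSrcAttrs: Boolean, hasDstAttrs: Boolean)

object UpgradeSketch {
  /**
   * Decide which attribute sides must be shipped to satisfy a request.
   * Returns (shipSrc, shipDst, stateAfterUpgrade).
   */
  def upgrade(
      current: ViewState,
      needSrc: Boolean,
      needDst: Boolean): (Boolean, Boolean, ViewState) = {
    // Ship a side only if it is requested and not already present.
    val shipSrc = needSrc && !current.hasSrcAttrs
    val shipDst = needDst && !current.hasDstAttrs
    // After the upgrade the view holds everything it had plus what was requested.
    val next = ViewState(
      current.hasSrcAttrs || needSrc,
      current.hasDstAttrs || needDst)
    (shipSrc, shipDst, next)
  }
}
```

With a view that already holds source attributes, a request for both sides ships only destination attributes. When both flags come back `false`, nothing needs to move at all, which is the situation item 3 describes.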

Empty edge partitions sometimes appear in the output of zipPartitions
for unknown reasons, causing calls to Iterator#next to fail. This commit
checks for these cases, handles them by returning an empty iterator, and
logs an error if this would cause GraphX to drop a corresponding
non-empty partition.

Resolves amplab#52.
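
The guard described in this commit message can be sketched in plain Scala as follows. This is an illustration only, not the GraphX code: `zipSafely` is a hypothetical helper, and it assumes each partition iterator yields at most one partition-level object, as is typical for the function passed to `zipPartitions`.

```scala
object EmptyPartitionSketch {
  /**
   * Combine the single elements of two partition iterators.
   * If either side is empty, return an empty iterator instead of
   * calling next() on it, and log when a non-empty side is dropped.
   */
  def zipSafely[A, B, C](left: Iterator[A], right: Iterator[B])(f: (A, B) => C): Iterator[C] = {
    if (left.hasNext && right.hasNext) {
      Iterator.single(f(left.next(), right.next()))
    } else {
      if (left.hasNext || right.hasNext) {
        // One side has data but its counterpart is empty: that data is lost.
        System.err.println("error: dropping a non-empty partition paired with an empty one")
      }
      Iterator.empty
    }
  }
}
```

Checking `hasNext` before `next()` is what avoids the failure; the log line only makes the data loss visible.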
Each vertex partition is co-located with a pid2vid array created in
RoutingTable.scala. This array maps edge partition IDs to the list of
vertices in the current vertex partition that are mentioned by edges in
that partition. Therefore the pid2vid array should have one entry per
edge partition.

GraphX currently creates one entry per vertex partition, which is a bug
that leads to an ArrayIndexOutOfBoundsException when there are more edge
partitions than vertex partitions. This commit fixes the bug and adds a
test for this case.

Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug.
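
As a rough illustration of the sizing rule above, the sketch below builds a pid2vid-style array for one vertex partition using plain Scala collections. It is not the code in RoutingTable.scala: `buildPid2Vid`, its parameters, and the in-memory edge representation are all hypothetical.

```scala
object RoutingSketch {
  type VertexId = Long

  /**
   * localVertices: vertex IDs stored in the current vertex partition.
   * edgesByPartition(pid): the (src, dst) pairs held by edge partition pid.
   * Returns one entry per EDGE partition, listing the local vertices it mentions.
   */
  def buildPid2Vid(
      localVertices: Set[VertexId],
      edgesByPartition: IndexedSeq[Seq[(VertexId, VertexId)]]): Array[Array[VertexId]] = {
    // Size by the number of edge partitions, not vertex partitions.
    val numEdgePartitions = edgesByPartition.length
    val pid2vid = Array.fill(numEdgePartitions)(Array.empty[VertexId])
    for (pid <- 0 until numEdgePartitions) {
      val mentioned = edgesByPartition(pid).flatMap { case (src, dst) => Seq(src, dst) }
      pid2vid(pid) = mentioned.filter(localVertices.contains).distinct.sorted.toArray
    }
    pid2vid
  }
}
```

If the array were instead sized by the number of vertex partitions, indexing it with an edge partition ID beyond that count would throw the `ArrayIndexOutOfBoundsException` the commit message describes.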
pombredanne pushed a commit to pombredanne/graphx that referenced this pull request Apr 14, 2014
This should all work as expected with the current version of the tachyon tarball (0.4.1)

Author: Nick Lanham <nick@afternight.org>

Closes amplab#137 from nicklan/bundle-tachyon and squashes the following commits:

2eee15b [Nick Lanham] Put back in exec, start tachyon first
738ba23 [Nick Lanham] Move tachyon out of sbin
f2f9bc6 [Nick Lanham] More checks for tachyon script
111e8e1 [Nick Lanham] Only try tachyon operations if tachyon script exists
0561574 [Nick Lanham] Copy over web resources so web interface can run
4dc9809 [Nick Lanham] Update to tachyon 0.4.1
0a1a20c [Nick Lanham] Add scripts using tachyon tarball
(cherry picked from commit 54cbd2b)

Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
(cherry picked from commit bcc79f2)
(cherry picked from commit 1993225)

Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
@ankurdave ankurdave closed this Jun 6, 2014