A set of shuffle map output related changes #587
Conversation
rxin commented Apr 30, 2013
- Added a BlockObjectWriter interface for writing out a series of jvm objects to disk.
- Added a getBlockWriter interface to DiskStore.
- Added a new method to BlockManager for writing out shuffle files. This method provides a short-circuited way to write shuffle data out directly to disk.
- Added ShuffleBlockManager and updated ShuffleMapTask to use the manager to write map outputs out to disk (ultimately via the BlockObjectWriter interface).
- Allow specifying a shuffle serializer on a per-shuffle basis.
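The description above mentions a BlockObjectWriter interface for streaming JVM objects to disk with support for reverting partial writes. As a rough sketch of that idea (the actual signatures in the patch may differ; the class and method names here beyond those quoted in the description are assumptions):

```scala
import java.io.{BufferedOutputStream, File, FileOutputStream, ObjectOutputStream}

// Hypothetical sketch of a BlockObjectWriter-style interface: writes a series
// of JVM objects to a file and can either commit the writes or revert back to
// the state the file was in when the writer was opened.
trait BlockObjectWriter {
  def write(value: Any): Unit
  def commit(): Long             // returns bytes written since open (assumed semantics)
  def revertPartialWrites(): Unit
  def close(): Unit
}

class FileBlockObjectWriter(file: File) extends BlockObjectWriter {
  private val initialLength = file.length()
  private val fileOut = new FileOutputStream(file, true)  // append mode
  private val objOut = new ObjectOutputStream(new BufferedOutputStream(fileOut))

  override def write(value: Any): Unit = objOut.writeObject(value)

  override def commit(): Long = {
    objOut.flush()
    file.length() - initialLength
  }

  // Discard everything written since open by truncating back to the
  // original length (the technique discussed later in this thread).
  override def revertPartialWrites(): Unit = {
    objOut.flush()
    fileOut.getChannel.truncate(initialLength)
  }

  override def close(): Unit = objOut.close()
}
```

This short-circuits the usual block-store path: shuffle data goes straight to a file instead of being buffered per bucket in memory first.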
- doesn't need to build up an array buffer for each shuffle bucket.
- local block reads).
- consolidate shuffle output files.
- size is 8KB in FastBufferedOutputStream, which is too small and would cause a lot of disk seeks.
All tests passed for this pull request. For more details, visit http://amplab.cs.berkeley.edu/jenkins/.
private[spark]
class ShuffleBlockManager(blockManager: BlockManager) {

  val shuffles = new ConcurrentHashMap[Int, Shuffle]
This map doesn't appear to be used in this class?
Ah let me remove it. It was for a separate design that I decided to pull out last minute.
Tests failed for this pull request. For more details, visit http://amplab.cs.berkeley.edu/jenkins/.
I think the above Jenkins test failure was due to a hang (fixed in #586, already merged), not caused by this pull request.
 * class name. If a previous instance of the serializer object has been created, the get
 * method returns that instead of creating a new one.
 */
object Serializer {
I'd prefer to avoid singleton objects so that we can eventually run multiple instances of Spark in the same JVM (e.g. for multithreaded test execution). Can you either make these methods be part of SparkEnv, or possibly add a "serializer manager" in SparkEnv?
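A "serializer manager" of the kind suggested here could look roughly like the following. This is a sketch under assumed names, not the code that was merged: the point is caching one serializer instance per class name inside an object owned by SparkEnv, rather than in a JVM-wide singleton.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical SerializerManager: one instance per SparkEnv, so multiple
// Spark instances can coexist in the same JVM. Returns a cached instance
// for a class name if one was already created, else instantiates it.
class SerializerManager {
  private val serializers = new ConcurrentHashMap[String, AnyRef]

  def get(clsName: String, loader: ClassLoader): AnyRef = {
    val cached = serializers.get(clsName)
    if (cached != null) cached
    else {
      val created = Class.forName(clsName, true, loader)
        .getDeclaredConstructor().newInstance().asInstanceOf[AnyRef]
      // putIfAbsent makes concurrent callers converge on one instance.
      val prev = serializers.putIfAbsent(clsName, created)
      if (prev != null) prev else created
    }
  }
}
```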
Done.
Hey Reynold, this looks good except for the singleton object and the semantics of revertPartialWrites on disk. For the latter, do you want to keep this interface to support merging shuffle files in the future? If so, we should at least document that it can cause previously committed writes to be deleted, but it might also be better to just make it resize the file to right before it started writing.
By the way, this is how you can truncate the file: http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/channels/FileChannel.html#truncate%28long%29. You can probably use FileOutputStream.getChannel.truncate(), for example.
Done. PTAL. I actually added the append feature fully back to DiskStore (although the higher-level shuffle block manager doesn't yet do consolidation).
Tests failed for this pull request. For more details, visit http://amplab.cs.berkeley.edu/jenkins/.
Again, I don't think the above test failures were caused by this PR. It is a little annoying that Jenkins just keeps saying PRs are failing tests.
Yeah, we should probably disable that for now.
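The truncation trick from the comment above can be sketched like this: record the file length before writing, then truncate back to it to discard a partial write. The file name is illustrative.

```scala
import java.io.{File, FileOutputStream}

// Remember where the committed data ends before writing anything new.
val file = File.createTempFile("shuffle", ".dat")
val committedLength = file.length()

val out = new FileOutputStream(file, true)  // append mode exposes a channel at EOF
out.write("partial data that we decide to revert".getBytes("UTF-8"))
out.flush()

// Revert: resize the file to right before we started writing.
out.getChannel.truncate(committedLength)
out.close()

assert(file.length() == committedLength)
```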
Thanks Reynold, I've merged this.