SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections #1707

Closed
Wants to merge 5 commits

Conversation

mateiz (Contributor) commented Aug 1, 2014

This tracks memory properly when there are multiple spilling collections in the same task (which was a problem before). It also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency we had before with small spills (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
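
For illustration, here is a minimal Scala sketch of the policy described above, assuming a fixed pool size; the names (SketchShuffleMemoryManager, threadMemory, maxMemory, release) are made up for this sketch, and it is not the code in this patch. Waking waiters when the number of threads grows, rather than only on release, is discussed in the review comments further down.

import scala.collection.mutable

// Sketch only: a fixed-size pool shared by all the spilling collections in one executor.
class SketchShuffleMemoryManager(maxMemory: Long) {
  private val threadMemory = new mutable.HashMap[Long, Long]()  // thread ID -> bytes held

  /** Try to acquire numBytes for the current thread; false means the caller should spill. */
  def tryToAcquire(numBytes: Long): Boolean = synchronized {
    val threadId = Thread.currentThread().getId
    threadMemory.getOrElseUpdate(threadId, 0L)
    while (true) {
      val n = threadMemory.size
      val curMem = threadMemory(threadId)
      val freeMemory = maxMemory - threadMemory.values.sum
      if (curMem + numBytes > maxMemory / n) {
        return false  // would exceed the 1 / N cap, so spill first
      } else if (numBytes <= freeMemory) {
        threadMemory(threadId) = curMem + numBytes  // fits under the cap and memory is free
        return true
      } else if (curMem >= maxMemory / (2 * n)) {
        return false  // already holds at least 1 / 2N of the pool, so asking it to spill is fair
      } else {
        wait()  // below 1 / 2N: block until another thread releases memory
      }
    }
    false  // not reached
  }

  /** Release numBytes held by the current thread and wake up any waiting threads. */
  def release(numBytes: Long): Unit = synchronized {
    val threadId = Thread.currentThread().getId
    threadMemory(threadId) = math.max(0L, threadMemory.getOrElse(threadId, 0L) - numBytes)
    notifyAll()
  }
}

A spilling collection would call tryToAcquire before growing its in-memory buffer and release once it spills or finishes, so the 1 / 2N floor and 1 / N cap adjust as tasks come and go.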

mateiz (Contributor, Author) commented Aug 1, 2014

CC @andrewor14 @aarondav

SparkQA commented Aug 1, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17630/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17630/consoleFull

andrewor14 (Contributor) commented:

test this please

SparkQA commented Aug 1, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1707:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull

threadMemory(threadId) = curMem + numBytes
// Notify other waiting threads because the # of active threads may have increased, so
// they may cancel their current waits
notifyAll()
Review comment (Contributor):
perhaps notifyAll unconditionally (or conditioned only on having increased the number of active threads)
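
Concretely, in terms of the sketch shown earlier (hypothetical code, not the committed patch), conditioning the wake-up on the active-thread count having grown would amount to a registration step of this shape:

// Hypothetical helper for the earlier sketch: wake waiters only when N actually grows,
// since that is the moment other threads' 1 / 2N and 1 / N thresholds change.
private def registerCurrentThread(): Unit = synchronized {
  val threadId = Thread.currentThread().getId
  if (!threadMemory.contains(threadId)) {
    threadMemory(threadId) = 0L
    notifyAll()
  }
}

tryToAcquire would call this before entering its loop; the other option in the comment is simply to call notifyAll after every successful grant.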

SparkQA commented Aug 3, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17818/consoleFull

SparkQA commented Aug 3, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17818/consoleFull

* some situations, to make sure each thread has a chance to ramp up to a reasonable share of
* the available memory before being forced to spill.
*/
def tryToAcquire(numBytes: Long): Boolean = synchronized {
Review comment (Contributor):
We talked offline about possibly having this allocate "as much as possible" rather than all-or-nothing. Did you decide one way or another?

Reply (Contributor, Author):
Yeah, I'm still working on that.

SparkQA commented Aug 3, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17831/consoleFull

mateiz (Contributor, Author) commented Aug 3, 2014

@aarondav alright, I've updated this to partially grant bytes now. Incidentally, this now seems to fail a test in ExternalSorterSuite due to the issue fixed in #1722. It works if I also merge that in.

SparkQA commented Aug 4, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17831/consoleFull

mateiz (Contributor, Author) commented Aug 4, 2014

BTW that failing test ^ is exactly the one that fails on my laptop due to the issue that #1722 fixes.

// All accesses should be manually synchronized
val shuffleMemoryMap = mutable.HashMap[Long, Long]()
// Manages the memory used by externally spilling collections in shuffle operations
val shuffleMemoryManager = new ShuffleMemoryManager(conf)
Review comment (Contributor):
Do we want to add this to the constructor, like we do for other *Managers?
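
For illustration, a hypothetical sketch of the constructor-injection style being asked about; every name here (ExampleEnv, ExampleShuffleMemoryManager, create) is made up, and the real wiring in SparkEnv may differ:

// Hypothetical names throughout; this only illustrates the "pass it in through the
// constructor" wiring the comment asks about.
class ExampleShuffleMemoryManager(val maxMemory: Long)

class ExampleEnv(
    val shuffleMemoryManager: ExampleShuffleMemoryManager)  // injected like the other *Manager fields

object ExampleEnv {
  def create(): ExampleEnv = {
    // Build the manager once at the construction site, then hand it to the env
    val shuffleMemoryManager = new ExampleShuffleMemoryManager(maxMemory = 64L * 1024 * 1024)
    new ExampleEnv(shuffleMemoryManager)
  }
}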

mateiz added 5 commits August 4, 2014 14:34
This tracks memory properly if there are multiple spilling collections
in the same task (which was a problem before), and also implements an
algorithm that lets each thread grow up to 1 / 2N of the memory pool
(where N is the number of threads) before spilling, which avoids an
inefficiency with small spills we had before (some threads would spill
many times at 0-1 MB because the pool was allocated elsewhere).
- Always notifyAll if a new thread was added in tryToAcquire
- Log when a thread blocks
Instead of returning false if we can't grant all the memory a caller
requested, we can now grant part of their request, while still keeping
the previous behavior of not forcing a thread to spill if it has less
than 1 / 2N, and not letting any thread get more than 1 / N. This should
better utilize the available shuffle memory pool.
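
To make the partial-grant behavior concrete, here is a rough sketch of what tryToAcquire could look like under these rules, reusing the threadMemory map and maxMemory pool from the earlier sketch; it illustrates the described behavior and is not the exact code in the commit:

// Returns how many bytes were actually granted; the caller decides whether to spill
// if it received less than it asked for. Hypothetical sketch, not the committed code.
def tryToAcquire(numBytes: Long): Long = synchronized {
  val threadId = Thread.currentThread().getId
  if (!threadMemory.contains(threadId)) {
    threadMemory(threadId) = 0L
    notifyAll()  // the set of active threads grew, so waiters should recheck their quotas
  }
  while (true) {
    val n = threadMemory.size
    val curMem = threadMemory(threadId)
    val freeMemory = maxMemory - threadMemory.values.sum
    // Never let one thread hold more than 1 / N of the pool in total
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemory / n - curMem))
    if (curMem < maxMemory / (2 * n)) {
      // Below the 1 / 2N floor: grant if we can at least reach the floor, otherwise wait
      if (freeMemory >= math.min(maxToGrant, maxMemory / (2 * n) - curMem)) {
        val toGrant = math.min(maxToGrant, freeMemory)
        threadMemory(threadId) = curMem + toGrant
        return toGrant
      } else {
        wait()  // block until another thread releases memory or more threads register
      }
    } else {
      // At or above 1 / 2N: grant whatever fits right now, possibly less than requested
      val toGrant = math.min(maxToGrant, freeMemory)
      threadMemory(threadId) = curMem + toGrant
      return toGrant
    }
  }
  0L  // not reached
}
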
mateiz (Contributor, Author) commented Aug 4, 2014

Thanks Andrew. I think I've addressed all the comments.

SparkQA commented Aug 4, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17879/consoleFull

andrewor14 (Contributor) commented:

LGTM

mateiz (Contributor, Author) commented Aug 5, 2014

Thanks for the review, going to merge this then.

mateiz (Contributor, Author) commented Aug 5, 2014

Actually let me retest it since the previous run was cancelled.

mateiz (Contributor, Author) commented Aug 5, 2014

Jenkins, test this please

SparkQA commented Aug 5, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17905/consoleFull

mateiz (Contributor, Author) commented Aug 5, 2014

test this please

SparkQA commented Aug 5, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17919/consoleFull

mateiz (Contributor, Author) commented Aug 5, 2014

Jenkins actually passed this (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17919/consoleFull) but a glitch in the reporting script made it not post here, so going to merge it.

mateiz (Contributor, Author) commented Aug 5, 2014

Thanks for the review.

asfgit pushed a commit that referenced this pull request Aug 5, 2014
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections

This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).

Author: Matei Zaharia <matei@databricks.com>

Closes #1707 from mateiz/spark-2711 and squashes the following commits:

debf75b [Matei Zaharia] Review comments
24f28f3 [Matei Zaharia] Small rename
c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
315e3a5 [Matei Zaharia] Some review comments
b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections

(cherry picked from commit 4fde28c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
asfgit closed this in 4fde28c Aug 5, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
… test in Spark rio (apache#1707)

We run Iceberg unit tests in Spark rio. During this, Iceberg's Hive and Spark versions are replaced with the latest versions from the checked-out Spark repo, to ensure that the latest Spark/Hive work with Iceberg.

Since Boson is included in both Iceberg and Spark, like Hive, we need to do the same for the Boson dependency.