SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections #1707

Closed
Wants to merge 5 commits

Conversation

mateiz (Contributor) commented Aug 1, 2014

This tracks memory properly when there are multiple spilling collections in the same task (which was a problem before). It also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency we had before with small spills (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).
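
For illustration, here is a minimal Scala sketch of the policy described above, assuming a fixed pool size; the names (SketchShuffleMemoryManager, threadMemory, maxMemory, release) are made up for this sketch, and it is not the code in this patch. Waking waiters when the number of threads grows, rather than only on release, is discussed in the review comments further down.

import scala.collection.mutable

// Sketch only: a fixed-size pool shared by all the spilling collections in one executor.
class SketchShuffleMemoryManager(maxMemory: Long) {
  private val threadMemory = new mutable.HashMap[Long, Long]()  // thread ID -> bytes held

  /** Try to acquire numBytes for the current thread; false means the caller should spill. */
  def tryToAcquire(numBytes: Long): Boolean = synchronized {
    val threadId = Thread.currentThread().getId
    threadMemory.getOrElseUpdate(threadId, 0L)
    while (true) {
      val n = threadMemory.size
      val curMem = threadMemory(threadId)
      val freeMemory = maxMemory - threadMemory.values.sum
      if (curMem + numBytes > maxMemory / n) {
        return false  // would exceed the 1 / N cap, so spill first
      } else if (numBytes <= freeMemory) {
        threadMemory(threadId) = curMem + numBytes  // fits under the cap and memory is free
        return true
      } else if (curMem >= maxMemory / (2 * n)) {
        return false  // already holds at least 1 / 2N of the pool, so asking it to spill is fair
      } else {
        wait()  // below 1 / 2N: block until another thread releases memory
      }
    }
    false  // not reached
  }

  /** Release numBytes held by the current thread and wake up any waiting threads. */
  def release(numBytes: Long): Unit = synchronized {
    val threadId = Thread.currentThread().getId
    threadMemory(threadId) = math.max(0L, threadMemory.getOrElse(threadId, 0L) - numBytes)
    notifyAll()
  }
}

A spilling collection would call tryToAcquire before growing its in-memory buffer and release once it spills or finishes, so the 1 / 2N floor and 1 / N cap adjust as tasks come and go.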

mateiz (Contributor, Author) commented Aug 1, 2014

CC @andrewor14 @aarondav

SparkQA commented Aug 1, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17630/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17630/consoleFull

andrewor14 (Contributor) commented:

test this please

SparkQA commented Aug 1, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull

SparkQA commented Aug 1, 2014

QA results for PR 1707:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17645/consoleFull

threadMemory(threadId) = curMem + numBytes
// Notify other waiting threads because the # of active threads may have increased, so
// they may cancel their current waits
notifyAll()
Review comment (Contributor):
perhaps notifyAll unconditionally (or conditioned only on having increased the number of active threads)
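
Concretely, in terms of the sketch shown earlier (hypothetical code, not the committed patch), conditioning the wake-up on the active-thread count having grown would amount to a registration step of this shape:

// Hypothetical helper for the earlier sketch: wake waiters only when N actually grows,
// since that is the moment other threads' 1 / 2N and 1 / N thresholds change.
private def registerCurrentThread(): Unit = synchronized {
  val threadId = Thread.currentThread().getId
  if (!threadMemory.contains(threadId)) {
    threadMemory(threadId) = 0L
    notifyAll()
  }
}

tryToAcquire would call this before entering its loop; the other option in the comment is simply to call notifyAll after every successful grant.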

SparkQA commented Aug 3, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17818/consoleFull

SparkQA commented Aug 3, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17818/consoleFull

* some situations, to make sure each thread has a chance to ramp up to a reasonable share of
* the available memory before being forced to spill.
*/
def tryToAcquire(numBytes: Long): Boolean = synchronized {
Review comment (Contributor):
We talked offline about possibly having this allocate "as much as possible" rather than all-or-nothing. Did you decide one way or another?

Reply (Contributor, Author):
Yeah, I'm still working on that.

SparkQA commented Aug 3, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17831/consoleFull

mateiz (Contributor, Author) commented Aug 3, 2014

@aarondav alright, I've updated this to partially grant bytes now. Incidentally, this now seems to fail a test in ExternalSorterSuite due to the issue fixed in #1722. It works if I also merge that in.

SparkQA commented Aug 4, 2014

QA results for PR 1707:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17831/consoleFull

mateiz (Contributor, Author) commented Aug 4, 2014

BTW that failing test ^ is exactly the one that fails on my laptop due to the issue that #1722 fixes.

// All accesses should be manually synchronized
val shuffleMemoryMap = mutable.HashMap[Long, Long]()
// Manages the memory used by externally spilling collections in shuffle operations
val shuffleMemoryManager = new ShuffleMemoryManager(conf)
Review comment (Contributor):
Do we want to add this to the constructor, like we do for other *Managers?
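
For illustration, a hypothetical sketch of the constructor-injection style being asked about; every name here (ExampleEnv, ExampleShuffleMemoryManager, create) is made up, and the real wiring in SparkEnv may differ:

// Hypothetical names throughout; this only illustrates the "pass it in through the
// constructor" wiring the comment asks about.
class ExampleShuffleMemoryManager(val maxMemory: Long)

class ExampleEnv(
    val shuffleMemoryManager: ExampleShuffleMemoryManager)  // injected like the other *Manager fields

object ExampleEnv {
  def create(): ExampleEnv = {
    // Build the manager once at the construction site, then hand it to the env
    val shuffleMemoryManager = new ExampleShuffleMemoryManager(maxMemory = 64L * 1024 * 1024)
    new ExampleEnv(shuffleMemoryManager)
  }
}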

mateiz added 5 commits August 4, 2014 14:34
This tracks memory properly if there are multiple spilling collections
in the same task (which was a problem before), and also implements an
algorithm that lets each thread grow up to 1 / 2N of the memory pool
(where N is the number of threads) before spilling, which avoids an
inefficiency with small spills we had before (some threads would spill
many times at 0-1 MB because the pool was allocated elsewhere).
- Always notifyAll if a new thread was added in tryToAcquire
- Log when a thread blocks
Instead of returning false if we can't grant all the memory a caller
requested, we can now grant part of their request, while still keeping
the previous behavior of not forcing a thread to spill if it has less
than 1 / 2N, and not letting any thread get more than 1 / N. This should
better utilize the available shuffle memory pool.
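
To make the partial-grant behavior concrete, here is a rough sketch of what tryToAcquire could look like under these rules, reusing the threadMemory map and maxMemory pool from the earlier sketch; it illustrates the described behavior and is not the exact code in the commit:

// Returns how many bytes were actually granted; the caller decides whether to spill
// if it received less than it asked for. Hypothetical sketch, not the committed code.
def tryToAcquire(numBytes: Long): Long = synchronized {
  val threadId = Thread.currentThread().getId
  if (!threadMemory.contains(threadId)) {
    threadMemory(threadId) = 0L
    notifyAll()  // the set of active threads grew, so waiters should recheck their quotas
  }
  while (true) {
    val n = threadMemory.size
    val curMem = threadMemory(threadId)
    val freeMemory = maxMemory - threadMemory.values.sum
    // Never let one thread hold more than 1 / N of the pool in total
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemory / n - curMem))
    if (curMem < maxMemory / (2 * n)) {
      // Below the 1 / 2N floor: grant if we can at least reach the floor, otherwise wait
      if (freeMemory >= math.min(maxToGrant, maxMemory / (2 * n) - curMem)) {
        val toGrant = math.min(maxToGrant, freeMemory)
        threadMemory(threadId) = curMem + toGrant
        return toGrant
      } else {
        wait()  // block until another thread releases memory or more threads register
      }
    } else {
      // At or above 1 / 2N: grant whatever fits right now, possibly less than requested
      val toGrant = math.min(maxToGrant, freeMemory)
      threadMemory(threadId) = curMem + toGrant
      return toGrant
    }
  }
  0L  // not reached
}
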
mateiz (Contributor, Author) commented Aug 4, 2014

Thanks Andrew. I think I've addressed all the comments.

SparkQA commented Aug 4, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17879/consoleFull

andrewor14 (Contributor) commented:

LGTM

mateiz (Contributor, Author) commented Aug 5, 2014

Thanks for the review, going to merge this then.

mateiz (Contributor, Author) commented Aug 5, 2014

Actually let me retest it since the previous run was cancelled.

mateiz (Contributor, Author) commented Aug 5, 2014

Jenkins, test this please

SparkQA commented Aug 5, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17905/consoleFull

mateiz (Contributor, Author) commented Aug 5, 2014

test this please

SparkQA commented Aug 5, 2014

QA tests have started for PR 1707. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17919/consoleFull

mateiz (Contributor, Author) commented Aug 5, 2014

Jenkins actually passed this (see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17919/consoleFull) but a glitch in the reporting script made it not post here, so going to merge it.

mateiz (Contributor, Author) commented Aug 5, 2014

Thanks for the review.

asfgit pushed a commit that referenced this pull request Aug 5, 2014
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections

This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).

Author: Matei Zaharia <matei@databricks.com>

Closes #1707 from mateiz/spark-2711 and squashes the following commits:

debf75b [Matei Zaharia] Review comments
24f28f3 [Matei Zaharia] Small rename
c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
315e3a5 [Matei Zaharia] Some review comments
b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections

(cherry picked from commit 4fde28c)
Signed-off-by: Matei Zaharia <matei@databricks.com>
asfgit closed this in 4fde28c Aug 5, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
… test in Spark rio (apache#1707)

We run Iceberg unit tests in Spark rio. During this, Iceberg's Hive and Spark versions are replaced with the latest versions from the checked-out Spark repo, to ensure that the latest Spark/Hive work with Iceberg.

Since Boson is included in both Iceberg and Spark, like Hive, we need to do the same for the Boson dependency.