Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Memory Limited Sort (Externalized / Spill) #1568

Closed
7 of 8 tasks
Tracked by #587
alamb opened this issue Jan 15, 2022 · 4 comments
Closed
7 of 8 tasks
Tracked by #587

[EPIC] Memory Limited Sort (Externalized / Spill) #1568

alamb opened this issue Jan 15, 2022 · 4 comments

Comments

@alamb
Copy link
Contributor

alamb commented Jan 15, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Support Sorting "arbitrarily" large inputs (that don't fit in RAM or within some user defined budget)

This ticket concerns the memory potentially used the Sort operator -- it doesn't cover other potential targets (e.g. sort based aggregation, for example). That will be covered by other tasks tracked by #587

Describe the solution you'd like

  1. Allow DataFusion users to specify a RAM budget (aka via the config introduced in Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526) and have their queries complete running without the sort exceeding that budget.
  2. Make sure there is a single production ready implementation of N-way Merge of Sorted Streams to build on top of

For the Sort operator, I think the best behavior would be:

  1. Sort all the input in RAM (as is done today in SortExec ), if the memory budget allows
  2. If all the input does not fit in the RAM budget, the in memory portion is sorted and written to temporary disk files
  3. Temporary disk files are read / merged to produce the final results

@yjshen 's PR #1526 has most of the individual pieces of this logic, but they aren't hooked up yet (that is if you run SELECT .. FROM foo ORDER BY x) it will still exhaust memory.

Some ideas of how to break this task down

Describe alternatives you've considered
Use two implementations of Sort (one in memory and one that can spill). This would ensure no performance regressions for the non-spilling case but would require users to decide which one to use.

Context
This is follow on work from the great PR from @yjshen in #1526 and part of the story of limiting memory used by DataFusion #587

@alamb
Copy link
Contributor Author

alamb commented Jan 15, 2022

cc @yjshen here are some thoughts on the next steps for sorting -- what do you think? Are you planning to do any/all of this? Is there something in particular I could help the most with?

@yjshen
Copy link
Member

yjshen commented Jan 16, 2022

Thanks @alamb . Let me start from #1572 first. Anyone interested in could drop in then.

@yjshen
Copy link
Member

yjshen commented Jan 19, 2022

Add #1611 to tracking spills in sort metrics.

@alamb
Copy link
Contributor Author

alamb commented Nov 28, 2022

I think this issue is mostly done. We can track additional work items as follow on epics, etc

@alamb alamb closed this as completed Nov 28, 2022
@alamb alamb changed the title Memory Limited Sort (Externalized / Spill) [EPIC] Memory Limited Sort (Externalized / Spill) Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants