Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: reduce amount of stack memory usage #4103

Merged
merged 5 commits into from
Jan 18, 2024
Merged

perf: reduce amount of stack memory usage #4103

merged 5 commits into from
Jan 18, 2024

Conversation

aljazerzen
Copy link
Member

@aljazerzen aljazerzen commented Jan 18, 2024

Followup for #2857 (comment)

Refactor a few functions to use heap memory instead of stack memory, to reduce stack memory usage in the case where we recurse into function calls (which happens in long pipelines).

Ideally, stack size of the compile would not grow linearly with PRQL source call nesting depth, but that would be much harder to pull-off than just optimizing our code a bit.


For future reference: my debugging process

  • put the offending query into _a.prql,
  • run LLDB, it should stop when the panic is detected,
  • run thread backtrace to get the list of all frames, enumerated. The name of the game is to refactor code such that number of frames in memory goes up, which means that the average size of a frame goes down.
  • to see frame sizes:
    • settings set frame-format '${frame.index} ${frame.sp} ${line.file.basename}\n'
    • thread backtrace
    • copy-paste output to lo calc
    • to convert hex SP into memory offset: =HEX2DEC(RIGHT(C12,9)) (RIGHT is needed because hex2dec cannot parse more than 10 digits)
    • =E11-E12 to get the frame size
    • frame sizes are offset by 1, because that's how stack pointers and frame layouts work
    • when I started, frame sizes were: 1520, 4320, 14864, 19744, 26688
    • somewhere in-between, frame sizes were: 1520, 4320, 13744, 13712, 22144

@aljazerzen aljazerzen changed the title perf stack usage perf: reduce amount of stack memory usage Jan 18, 2024
@aljazerzen
Copy link
Member Author

Also, we might want to test with --release on Windows, or set higher stack size.

@max-sixty
Copy link
Member

Very impressive debugging!!

FWIW I managed to sample the backtrace with the #![feature(backtrace_frames)] feature, described in #2857 (comment) :

+        let bt = backtrace::Backtrace::capture();
+        let len = bt.frames().len();
+        log::trace!("{}", len);
+        if len >= 120 {
+            log::trace!("{:?}", bt);
+        }

This has the advantage of no LLDB etc, but the disadvantage that you only get the backtraces before the stack overflow.


Also, we might want to test with --release on Windows, or set higher stack size.

Though this test is only there to test whether huge queries cause a stack overflow. I'm not sure we get much from it if we change the configs for the platform so it passes!

@max-sixty
Copy link
Member

Windows now passes!! 👏 👏

@max-sixty
Copy link
Member

OoI, how did you identify what should be boxed? Did you just have a sense of what the large things on the stack were?

@max-sixty
Copy link
Member

Re the error, we could possibly set the too-large-for-stack (or just ignore them at the specific sites). From https://rust-lang.github.io/rust-clippy/master/index.html#/boxed_local

@aljazerzen
Copy link
Member Author

OoI, how did you identify what should be boxed? Did you just have a sense of what the large things on the stack were?

You can look at individual frames on the stack with lldb and find stuff that is large (i.e. Func or Expr structs) and make them Box<Func> and Box<Expr>. You can see the size of our structs with prqlc debug ast or by using lldb commands.

@aljazerzen
Copy link
Member Author

Also, we might want to test with --release on Windows, or set higher stack size.

Though this test is only there to test whether huge queries cause a stack overflow. I'm not sure we get much from it if we change the configs for the platform so it passes!

Yeah, but prqlc is meant to be used in --release mode. That raises size of query that can fit into 1MB stack memory by a lot. Also, 1MB stack memory is not a lot.

An alternative would be to use stacker to resize the stack size dynamically.

@max-sixty
Copy link
Member

Yeah, but prqlc is meant to be used in --release mode. That raises size of query that can fit into 1MB stack memory by a lot. Also, 1MB stack memory is not a lot.

Right right, though I'm saying something a bit different:

  • All tests pass which test the functionality of PRQL on Windows
  • The test "do extremely long queries work" fails on Windows
  • So changing our testing environment to make that test pass doesn't show anything!
  • Maybe it shows that an extremely long query passes in release mode? But we know that already, and we could find an even longer query that would fail in release mode...

(not sure if I'm being clear, happy to speak about it on the call)

@max-sixty
Copy link
Member

An alternative would be to use stacker to resize the stack size dynamically.

Ah interesting, do you think that's better than doing something like a trampoline? (I understand these things conceptually but have never faced this problem technically, so am learning as much as helping...)

@aljazerzen
Copy link
Member Author

Maybe it shows that an extremely long query passes in release mode?

Exactly.

The test is basically saying that "under 1MB stack conditions, we support only queries as long as our long_query case". By not using --release mode in tests restricts these conditions much more that the actual "production" conditions.

But now that I've improved the resolver, we can leave the test as is and use these "too" restricted conditions as the benchmark for when we increase memory usage by mistake.


So I've never dealt with such problem neither, but I think that:

  • stacker would be an easy solution to get rid of this problem with minimal performance and implementation cost,
  • using a trampoline would not only get rid of this problem, but also improve performance (because it does not need to recurse anymore, but it can use iteration), but would be very hard to implement (since resolver recurses on many different locations).

@aljazerzen aljazerzen merged commit 2d9e8fc into main Jan 18, 2024
81 checks passed
@aljazerzen aljazerzen deleted the perf-stack-usage branch January 18, 2024 18:41
@max-sixty
Copy link
Member

OK great on both counts. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants