cargo test takes 8x more time on my laptop after the switch to Miden VM v0.10.5 (from v0.10.3) #317

Open
greenhat opened this issue Sep 11, 2024 · 13 comments · May be fixed by #319
Labels: bug (Something isn't working), testing
Milestone: Beta 1

@greenhat (Contributor)

After the switch to Miden VM v0.10.5 in 65365a0, cargo test takes 16 minutes on my laptop instead of the 2 minutes it took before the switch.
Interestingly, on CI the time went up only from ~19 minutes to ~32 minutes. I suspect that with more cores the performance degrades more.

@greenhat (Contributor Author) commented Sep 11, 2024

[attachment: flamegraph]

The flamegraph above, for one of the tests, was generated at 65365a0 (Miden VM v0.10.5) via:

cargo flamegraph --root --unit-test miden_integration_tests -- rust_masm_tests::instructions::band_i64
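For anyone reproducing this: the flamegraph subcommand is provided by the cargo-flamegraph project and, as far as I recall, can be installed with

cargo install flamegraph

The --root flag runs the underlying profiler with elevated privileges, which perf (Linux) and dtrace (macOS) typically require.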

@greenhat (Contributor Author)

[attachment: flamegraph]
And here is the same test at 202cb9f (the previous commit, with Miden VM v0.10.3), taking 1/10 of the time of the one above.

greenhat changed the title from "cargo test takes 8x times after the switch to Miden VM v0.10.5 (from v0.10.3)" to "cargo test takes 8x more time on my laptop after the switch to Miden VM v0.10.5 (from v0.10.3)" Sep 11, 2024
greenhat added a commit that referenced this issue Sep 11, 2024
Needed when running `cargo flamegraph --root` for #317
greenhat added this to the Beta 1 milestone Sep 11, 2024
@plafer commented Sep 11, 2024

Interesting that changes in 0.10.5 caused this... Are you able to zoom in to the Assembler methods to get a better sense of where the time is being spent?

Side note: I use cargo nextest locally, which is a faster test runner (although this is not a fix for the above regression!)

@greenhat (Contributor Author) commented Sep 11, 2024

> Interesting that changes in 0.10.5 caused this... Are you able to zoom in to the Assembler methods to get a better sense of where the time is being spent?

Thank you for looking into it. For some reason, when you click on the svg file it opens in a separate tab where zooming does not work, though mousing over displays the full text. However, if you download the svg file and open it locally, zooming (on click) works nicely.

> Side note: I use cargo nextest locally, which is a faster test runner (although this is not a fix for the above regression!)

Yes, we use it as well, but it doesn't help much here.

The test runs the compiled MAST 256 times (each run on a fresh VM instance) using:

/// Runs the compiled MASM package on the VM and compares the output against
/// the value produced by the equivalent Rust code.
pub fn run_masm_vs_rust<T>(
    rust_out: T,
    package: &Package,
    args: &[Felt],
    session: &Session,
) -> Result<(), TestCaseError>
where
    T: Clone + PopFromStack + std::cmp::PartialEq + std::fmt::Debug,
{
    // Each call constructs a fresh executor (and thus a fresh VM instance).
    let exec = Executor::for_package(package, args.to_vec(), session)
        .map_err(|err| TestCaseError::fail(err.to_string()))?;
    let output = exec.execute_into(&package.unwrap_program(), session);
    prop_assert_eq!(rust_out.clone(), output, "VM output mismatch");
    // TODO: Uncomment after https://github.com/0xPolygonMiden/compiler/issues/228 is fixed
    // let emul_out: T = (*execute_emulator(ir_program.clone(), args).first().unwrap()).into();
    // prop_assert_eq!(rust_out, emul_out, "Emulator output mismatch");
    Ok(())
}

I suspect starting and/or shutting down the VM instance became more expensive.
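For context, each of these tests is a proptest property test; a rough sketch of the harness shape is below (the strategy, test name, and the setup_package helper are illustrative placeholders, not the actual test code). With 256 cases configured, the helper above constructs 256 fresh Executor/VM instances per test function:

use proptest::prelude::*;

proptest! {
    // 256 cases per test function, hence 256 fresh VM instances.
    #![proptest_config(ProptestConfig::with_cases(256))]

    #[test]
    fn band_i64(a in any::<i64>(), b in any::<i64>()) {
        // Hypothetical fixture standing in for whatever produces the compiled
        // Package, the Felt arguments, and the Session for this test.
        let (package, args, session) = setup_package(a, b);
        run_masm_vs_rust(a & b, &package, &args, &session)?;
    }
}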

@bobbinth (Contributor)

My guess is that it has something to do with loading the standard library into the assembler/processor. Are we able to run this against the next branch and see if it still happens?

@bobbinth (Contributor)

That said, the main thing that changed between v0.10.3 and v0.10.5 was the merging of adjacent basic blocks, but that shouldn't have such a big impact.

@bitwalker (Contributor) commented Sep 12, 2024

The major increase in time here is 100% due to the loading of a MAST library, though a commit subsequent to the one you linked modified the compiler to cache the standard library after it is loaded and simply clone it thereafter. A similar change was made in the VM, though I can't recall whether that has been released yet.
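The caching approach amounts to something like the sketch below (the Library type and load_stdlib function are placeholders, not the compiler's actual API): pay the deserialization cost once per process, then hand out cheap clones.

use std::sync::OnceLock;

// Placeholder for the deserialized MAST library; not the real type.
#[derive(Clone)]
struct Library {
    // MAST forest, exports, decorators, ...
}

// Expensive: deserializes the MAST (this is where the loading overhead lives).
fn load_stdlib() -> Library {
    Library {}
}

// Cheap after the first call: the expensive load runs exactly once per
// process, and every subsequent caller just clones the cached value.
fn stdlib() -> Library {
    static CACHE: OnceLock<Library> = OnceLock::new();
    CACHE.get_or_init(load_stdlib).clone()
}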

I initially noticed this when we turned on implicit linking of the standard library (since we went from not linking against it at all in many tests to linking it in all of them), but the actual overhead comes from the hashing done when loading the MAST.

This should be fixed on the current next of the compiler, though; it certainly isn't taking me 16 minutes to run the test suite. But maybe we've had a regression somehow (possibly due to the transaction kernel library being implicitly linked when --target rollup is set). Did this come up while doing the cargo-miden changes? It may be related to that, since I think cargo-miden always sets --target rollup, and we haven't applied the caching "trick" we did with the standard library there yet.

@greenhat (Contributor Author) commented Sep 12, 2024

> My guess is that it has something to do with loading the standard library into the assembler/processor. Are we able to run this against the next branch and see if it still happens?

I ran our tests against almost the tip of the Miden VM next branch - 0xPolygonMiden/miden-vm@1b4609e (the lazy_lock introduced in 0xPolygonMiden/miden-vm@0ce2fe5 is still unstable in our rustc version) - and it resolved the issue! cargo test is back to under 2 minutes on my laptop.
I made #319 to switch to it.

EDIT: The tests on CI got back to 15 minutes (vs. ~33 minutes). See #319

@greenhat (Contributor Author)

> The major increase in time here is 100% due to the loading of a MAST library, though a commit subsequent to the one you linked modified the compiler to cache the standard library after it is loaded and simply clone it thereafter. A similar change was made in the VM, though I can't recall whether that has been released yet.

> I initially noticed this when we turned on implicit linking of the standard library (since we went from not linking against it at all in many tests to linking it in all of them), but the actual overhead comes from the hashing done when loading the MAST.

> This should be fixed on the current next of the compiler, though; it certainly isn't taking me 16 minutes to run the test suite. But maybe we've had a regression somehow (possibly due to the transaction kernel library being implicitly linked when --target rollup is set).

I'm still getting the same ~16 minutes (just tested and got 15 and a half minutes) on the next branch.
How much time does cargo test take on your machine? I believe we have the same laptop model, so we should see comparable results.

> Did this come up while doing the cargo-miden changes? It may be related to that, since I think cargo-miden always sets --target rollup, and we haven't applied the caching "trick" we did with the standard library there yet.

Nope, I've been experiencing this issue since last week on the next branch, so I started digging and was able to find the commit after which it started - 65365a0.

@bitwalker (Contributor)

> the lazy_lock introduced in .. is still unstable in our rustc version

Hmm, it was stabilized in 1.80.0, which is our MSRV and matches our current rustc toolchain version, so it shouldn't be unstable. It could be, though, that because ours is a nightly toolchain from before the stable 1.80.0 release, it hadn't been marked stable there yet. Either way, I wouldn't expect this to cause us any issues.
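For reference, the stabilized (1.80.0) form of that pattern looks roughly like this; the types are placeholders, not the VM's actual code:

use std::sync::LazyLock;

#[derive(Default)]
struct Library {
    // MAST forest, exports, decorators, ...
}

// Deserialized lazily, exactly once per process, on first access.
static STDLIB: LazyLock<Library> = LazyLock::new(|| Library::default());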

@bitwalker (Contributor)

> How much time does cargo test take on your machine? I believe we have the same laptop model, so we should see comparable results.

> so I started digging and was able to find the commit after which it started - 65365a0

I'm able to reproduce the issue, but what I was getting at is that the commit you are referencing was from before the fix that addressed the issue. Meaning that, somehow, our fix is no longer working. I definitely want to get to the bottom of that, even if it can be fixed by updating the VM deps to use git temporarily, as it means we are doing something, somewhere, that we shouldn't be.

bitwalker added the bug (Something isn't working) and testing labels Sep 12, 2024
@bobbinth (Contributor)

In next, we should be handling miden-stdlib very efficiently. The deserialization shouldn't involve any hashing and it should be done only once per process (because a static variable is used to hold the deserialized library). miden-base is not using static variables yet, but it should also be pretty efficient (i.e., no hashing on deserialization). So, I'm very curious what's causing the issue here.

(it is of course possible that something in the above doesn't work as expected).

@bitwalker (Contributor)

@bobbinth To be clear, the issue here is in the latest published release, not next.

One thing I did notice when examining the flamegraph is that dropping the vectors used to store the decorators gets really expensive when a decorator is attached to every node in the program (when debug info is enabled, as it always is in the compiler currently); ~30% of the program runtime is spent just on dropping those.
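To put a rough shape on that (a toy demonstration with made-up sizes, not the VM's actual data structures): one small heap allocation per node means the destructor has to perform one free per node, and all of that work lands at drop time.

use std::time::Instant;

fn main() {
    // Roughly: one decorator string per node, for a large program.
    let nodes = 1_000_000;
    let decorators: Vec<Vec<String>> = (0..nodes)
        .map(|i| vec![format!("trace.{i}")])
        .collect();

    let start = Instant::now();
    drop(decorators); // every per-node allocation is freed here
    println!("dropping decorators took {:?}", start.elapsed());
}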

Now, on to the actual issue, and the reason why it appears to be fixed in next:

Because next has switched to allocating libraries in an Arc, and doling out clones of the pointer rather than of the library data itself, this issue only presents itself when dropping libraries that aren't effectively immortal like miden-stdlib and miden-base. But the "fix" here covers only the narrow case of those two libraries, which are allocated and cached the first time they are loaded, so their drops never happen. It will continue to affect programs consisting of many libraries, or of large libraries, as soon as all of the live Arc references are dropped.
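A minimal sketch of that Arc-based sharing (placeholder type, not the VM's API): cloning is just a refcount bump, and the potentially expensive drop of the library data only runs when the last reference goes away.

use std::sync::Arc;

struct Library {
    // MAST forest, decorators, ...
}

fn main() {
    let stdlib = Arc::new(Library {});

    // Handing the library to the assembler/processor is a pointer clone,
    // not a copy (or a later re-drop) of the library data itself.
    let for_assembler = Arc::clone(&stdlib);
    let for_processor = Arc::clone(&stdlib);

    drop(for_assembler);
    drop(for_processor);
    // Only here, when the last Arc goes away, does `Library`'s destructor run;
    // for cached libraries like miden-stdlib that last drop effectively never happens.
    drop(stdlib);
}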

/cc @greenhat
