
[question] isolation + threadlocal + test selection with miri #1476

Closed
Firstyear opened this issue Jul 15, 2020 · 13 comments
Labels
C-support Category: Not necessarily a bug, but someone asking for support

Comments

@Firstyear
Contributor

Hi - this is a question about miri, and not a bug. It's quite likely that I'm doing something wrong,

I'm using miri with concread due to the amount of unsafe + potential concurrency issues that can exist in a datastructure. So far I believe that miri is "passing" but I'm seeing a couple of odd behaviours.

First, I can't seem to run miri against "targeted" tests. For example:

cargo miri test -- -Zmiri-disable-isolation -- bptree2

This still runs "all tests" rather than the subset I want to target. What's the correct way to run miri against only a subset of my test suite?

I'm also seeing test failures such as:

---- collections::bptree::tests::test_bptree2_map_from_iter_1 stdout ----
thread 'main' panicked at 'assertion failed: is_empt', src/collections/bptree/node.rs:70:5

failures:
 ...
    collections::bptree::tests::test_bptree2_map_from_iter_1
   ...

Without miri, this test passes. The failure is occurring in a memory leak check that runs inside my tests, but it relies on thread-locals / globals, and I'm wondering if disabling isolation is potentially causing the threads to stomp on each other somehow. There are no leaks in the full tree or stress tests, so I'm not sure it's a leak in my datastructure, but it's probably my fault in how I'm doing the leak check or using miri here.

One way to fix this would be a cfg(miri) flag (similar to cfg(test)) so that I could skip the custom leak check. Does such a flag exist?

Finally, I see my system memory grow to huge amounts - I exceeded 8GB of RAM during one attempt and started to swap. Does miri not release its data until the end of execution, or something similar? Or could it be due to the size of the tests I'm attempting to run (I have some multithreaded stress tests still in here)?

Anyway, thanks for your time, and again, thanks for making such an excellent tool!

@RalfJung RalfJung added the C-support Category: Not necessarily a bug, but someone asking for support label Jul 15, 2020
@RalfJung
Member

RalfJung commented Jul 15, 2020

This still runs "all tests" rather than the subset I want to target. What's the correct way to run miri against only a subset of my test suite?

The command you gave works for me and we are even testing that on CI. So I think we need some more details here to figure out what is happening.

The failure is occurring in a memory leak check that runs inside of my tests, but it relies on thread-locals / globals and I'm wondering if disable isolation is potentially causing the threads to stomp on each other somehow.

Miri forces the number of threads to be 1, which you can replicate outside of Miri using cargo test -- --test-threads=1. Also see #1360. Maybe that explains your problem here?

But other than that, Miri's implementation of thread-local storage should be correct, so it would be interesting to figure out the underlying problem here.

A way to fix this could be if there is a cfg(miri) flag (similar to cfg test) then I could skip the custom leak check. Does such a flag exist?

It does, Ctrl-F for "config flag" in our README. But of course it would be better to figure out what actually happens.
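The flag being referred to is the `miri` cfg, which is set when code is compiled for Miri. A minimal sketch of gating a hand-rolled check behind it, borrowing the reporter's `assert_released` name and using a purely illustrative atomic counter as the tracker:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in tracker: count of nodes allocated but not yet released.
static OUTSTANDING: AtomicUsize = AtomicUsize::new(0);

// The hand-rolled leak check only runs outside Miri: under
// `cargo miri test` the `miri` cfg is set, and Miri's built-in
// leak checker covers the same ground.
fn assert_released() {
    if !cfg!(miri) {
        assert_eq!(OUTSTANDING.load(Ordering::SeqCst), 0, "nodes were leaked");
    }
}

fn main() {
    assert_released();
    // Compiled normally (not under Miri), the check is active.
    println!("custom check active: {}", !cfg!(miri)); // prints "custom check active: true"
}
```

`cfg!(miri)` is evaluated at compile time, so the skipped branch costs nothing in normal builds; `#[cfg(not(miri))]` on whole items works the same way.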

Finally, I see my system memory grow to huge amounts - I exceeded 8GB of ram during one attempt and started to swap. Does miri not release it's data until the end of execution or something similar? Or could it be due to the size of the tests I'm attempting to run (I have some multithreaded stress tests still in here)?

Miri releases program memory when the program frees it. However, when the program does a lot of work on the same memory, a lot of metadata for the alias tracking can accumulate. You could try to see if -Zmiri-disable-stacked-borrows helps, but of course that means aliasing violations will not be detected any more.

Also since you mention concurrency, you should have gotten a warning that concurrency support is still experimental. In particular, thread-local storage is not freed currently (#1369). (We also keep some thread metadata around forever after the thread dies, but that's just ~100 bytes per thread so I doubt it will be a problem.)

Anyway, thanks for your time, and again, thanks for making such an excellent tool!

❤️ I am glad that Miri is helpful for you. :)

@Firstyear
Contributor Author

This still runs "all tests" rather than the subset I want to target. What's the correct way to run miri against only a subset of my test suite?

The command you gave works for me and we are even testing that on CI. So I think we need some more details here to figure out what is happening.

As mentioned it was likely my fault, and as soon as I write reproduction instructions, it works ...

I think what happens is that when you run a single test it passes, then cargo immediately compiles and tests a second binary, whose output shows 0 tests, and that's what confused me! Really sorry about that.

The failure is occurring in a memory leak check that runs inside of my tests, but it relies on thread-locals / globals and I'm wondering if disable isolation is potentially causing the threads to stomp on each other somehow.

Miri forces the number of threads to be 1, which you can replicate outside of Miri using cargo test -- --test-threads=1. Also see #1360. Maybe that explains your problem here?

It works with many threads and threads=1 outside of miri, even under stress, which is why I'm a bit confused and possibly assuming incorrectly what the cause could be.

It does, Ctrl-F for "config flag" in our README. But of course it would be better to figure out what actually happens.

Sure, but the other thing in my mind is that the "failure" right now happens in leak detection, which miri will do for me anyway. So even while the root cause is unknown, this does seem like the correct option: disable my own leak detector when running under miri. That also removes the need for the thread-locals etc. which, as you point out, could also be an issue.

Another thought that comes to mind is that I was previously testing on macOS and now I'm testing on Linux, so I wonder if there is some platform-specific issue that may be occurring.

Also since you mention concurrency, you should have gotten a warning that concurrency support is still experimental. In particular, thread-local storage is not freed currently (#1369). (We also keep some thread metadata around forever after the thread dies, but that's just ~100 bytes per thread so I doubt it will be a problem.)

This could be related: as mentioned, my leak tracking records every single allocated datastructure node id, and that set could grow quite large in some tests, and of course miri would also check those leak-tracking sets for correctness too. So I think the correct answer is, as before, to disable my leak tracker while under miri, as it's double-handling the same problem.

Anyway, I've used the cfg there to disable my leak check under miri, and indeed I can see a memory leak, so again - very likely a mistake in my program rather than miri :)

Thanks again!

@Firstyear
Contributor Author

Right, I understand the cause of this issue now.

So, during a normal test run, because I allocate and recreate the tracking set for the allocated nodes in each test, it would only show leaks "from that test case". There were 5 tests in my iter.rs module where I was not assigning a tree root correctly, causing those tests to leak. But those 5 tests were not calling my "assert_released" function to ensure everything was cleaned up. So on the "next test" the allocation set was replaced and those leaks were "missed". That means subsequent tests also correctly reported that they were not leaking (because they weren't!)

However, when running under miri, I think this replacement wasn't occurring. So once the leafiter tests had executed, the allocation sets persisted, which caused subsequent tests to fail: the tracking set still contained the nodes leaked by the leafiter tests, and later tests like from_iter, which were checking assert_released, detected the nodes leaked from the previous tests.

So I have now fixed the leak in my iter tests, and they 100% pass once again (I'll leave it to you to decide if this leak counts as a trophy, as the leak only affected tests, not the core datastructure, but miri was still invaluable in helping to find it and giving me trust in my code!).
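The failure mode described above can be reduced to a small sketch, using the same thread-local shape as the tracker definitions quoted later in this thread (the `track`/`release`/`outstanding` helper names are illustrative): because the set is thread-local and the whole suite runs on one thread under Miri, an entry one test forgets to release is still present when the next test checks.

```rust
use std::collections::BTreeSet;
use std::sync::Mutex;

// Same shape as the reporter's tracker: a per-thread set of live node ids.
thread_local!(static ALLOC_LIST: Mutex<BTreeSet<usize>> = Mutex::new(BTreeSet::new()));

fn track(id: usize) {
    ALLOC_LIST.with(|l| { l.lock().unwrap().insert(id); });
}

fn release(id: usize) {
    ALLOC_LIST.with(|l| { l.lock().unwrap().remove(&id); });
}

fn outstanding() -> usize {
    ALLOC_LIST.with(|l| l.lock().unwrap().len())
}

fn main() {
    // "Test 1" leaks node 7 and never calls assert_released.
    track(7);
    // "Test 2" runs on the same thread (as under Miri, or with
    // --test-threads=1), tracks and releases its own node...
    track(8);
    release(8);
    // ...but a leak check here still sees the stale entry from "test 1".
    assert_eq!(outstanding(), 1);
}
```

With one OS thread per test (the multi-threaded default of the test harness), each test gets a fresh thread-local set, which is why the stale entries only showed up once everything ran on a single thread.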

During the process I did notice one possible "quality of life improvement" though. With a subset of tests you get output like this:

running 10 tests
test collections::bptree::iter::tests::test_bptree2_iter_leafiter_1 ... ok
test collections::bptree::iter::tests::test_bptree2_iter_leafiter_2 ... ok
test collections::bptree::iter::tests::test_bptree2_iter_leafiter_3 ... ok
test collections::bptree::iter::tests::test_bptree2_iter_leafiter_4 ... ok
test collections::bptree::iter::tests::test_bptree2_iter_leafiter_5 ... ok
test collections::bptree_legacy::iter::tests::test_bptree_iter_leafiter_1 ... ok
test collections::bptree_legacy::iter::tests::test_bptree_iter_leafiter_2 ... ok
test collections::bptree_legacy::iter::tests::test_bptree_iter_leafiter_3 ... ok
test collections::bptree_legacy::iter::tests::test_bptree_iter_leafiter_4 ... ok
test collections::bptree_legacy::iter::tests::test_bptree_iter_leafiter_5 ... ok

test result: ok. 10 passed; 0 failed; 0 ignored; 0 measured; 175 filtered out

The following memory was leaked:
alloc876074 (Rust heap, size: 120, align: 8) {
    0x00 │ 07 00 00 00 00 00 00 20 0a 00 00 00 00 00 00 00 │ ....... ........
    0x10 │ 0b 00 00 00 00 00 00 00 0c 00 00 00 00 00 00 00 │ ................
    0x20 │ 0d 00 00 00 00 00 00 00 0e 00 00 00 00 00 00 00 │ ................
    0x30 │ 0f 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 │ ................
    0x40 │ 0a 00 00 00 00 00 00 00 0b 00 00 00 00 00 00 00 │ ................
    0x50 │ 0c 00 00 00 00 00 00 00 0d 00 00 00 00 00 00 00 │ ................
    0x60 │ 0e 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 │ ................
    0x70 │ 10 00 00 00 00 00 00 00                         │ ........
}
... <many more leaks below>

Now this is all well and good, we know we have leaks, but:

  • My allocations apparently are quite "random". The alloc id would never be the same between two executions, which meant that -Zmiri-track-alloc-id=<id> was unable to help me find the leaking allocation and leaking test
  • You can't see which allocation relates to which test, so it's impossible to tell which test leaked what memory

It would be "interesting" to track, at the least, the top-level function or test case that each allocation belongs to, so that you can more easily isolate which test is leaking, i.e. alloc<id> + test_id or similar.

Alternatively, perhaps a "bisect" function where, if you have a leak, the tests are re-run and reduced to the set of cases that do have the leak.

Anyway, once again, thanks for your time and patience with my questions, and I'm glad to have solved this. Thanks mate!

@oli-obk
Contributor

oli-obk commented Jul 16, 2020

My allocations apparently are quite "random". The alloc would never be the same between two executions, which meant that -Zmiri-track-alloc-id= was unable to help me find the leaking allocation and leaking test

Miri is deterministic unless you access the operating system. We have a seed (-Zmiri-seed) that defaults to 0, and you can change it to test different scenarios. Other than that, every execution should be the same, with the same alloc ids and everything.
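Assuming the same flag placement as the invocation earlier in this thread (Miri flags between the two `--` separators; the seed value here is arbitrary), trying a different seed would look something like:

```shell
# Re-run the filtered tests under Miri with a non-default interpreter seed.
cargo miri test -- -Zmiri-seed=42 -- bptree2
```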

Alternately, perhaps a "bisect" function where if you have a leak, the tests are re-run and reduced to the set of cases that do have the leak.

That seems like something that the testing infrastructure could benefit from independently of miri.

@RalfJung
Member

So on the "next test" the allocation set was replaced and those leaks were "missed". That means subsequent tests also did correctly report they were not leaking (because they weren't!)
However, when running under miri, I think this wasn't occurring.

That should never happen. We do not deallocate #[thread_local] statics, but that is not observable by the program unless the program is UB. Other than that, if a program behaves differently under Miri and "for real" (and this is not caused by e.g. there being just one CPU core in Miri and thus everything running on the same thread), that's a critical bug.

Why would whatever code "resets" your allocation statistics not run in Miri? Or was that "resetting" relying on being in a new thread with clean TLS? But then outside Miri with just a single thread for the test harness you should have the same problem.

My allocations apparently are quite "random". The alloc would never be the same between two executions

That should only happen if your program actually does something differently -- Miri itself introduces no non-determinism, as @oli-obk said. But if e.g. the exact number or size of allocations depends on things returned by the OS and those differ between runs, then alloc IDs can start to diverge.

It would be "interesting" to perhaps track at the least, the top level function or test case that each allocation belongs to, so that you can more easily isolate which test is leaking. I.e. alloc + test_id or similar.

Please open a separate issue for that. However, the test harness for Miri is just another program, and the top-level function is "main" of the test harness, so I have no good idea for how to achieve this. Miri doesn't even know it is running a test suite, it just runs the MIR that is generated by rustc --test.

@Firstyear
Contributor Author

So on the "next test" the allocation set was replaced and those leaks were "missed". That means subsequent tests also did correctly report they were not leaking (because they weren't!)
However, when running under miri, I think this wasn't occurring.

That should never happen. We do not deallocate #[thread_local] statics, but that is not observable by the program unless the program is UB. Other than that, if a program behaves differently under Miri and "for real" (and this is not caused by e.g. there being just one CPU core in Miri and thus everything running on the same thread), that's a critical bug.

While there is certainly unsafe in my library, I only use std collections and safe constructs in this tracker. The definition is:

thread_local!(static NODE_COUNTER: AtomicUsize = AtomicUsize::new(1));
thread_local!(static ALLOC_LIST: Mutex<BTreeSet<usize>> = Mutex::new(BTreeSet::new()));

At this point I suspect it is very likely something I am doing wrong, rather than a fault in miri, so I will think about this more and see if I can identify a reproducer for you.

Why would whatever code "resets" your allocation statistics not run in Miri? Or was that "resetting" relying on being in a new thread with clean TLS? But then outside Miri with just a single thread for the test harness you should have the same problem.

My allocations apparently are quite "random". The alloc would never be the same between two executions

That should only happen if your program actually does something differently -- Miri itself introduces no non-determinism, as @oli-obk said. But if e.g. the exact number or size of allocations depends on things returned by the OS and those differ between runs, then alloc IDs can start to diverge.

I'm using Box, which are the allocations in question. So I suspect you are correct that this involves the OS allocator.

It would be "interesting" to perhaps track at the least, the top level function or test case that each allocation belongs to, so that you can more easily isolate which test is leaking. I.e. alloc + test_id or similar.

Please open a separate issue for that. However, the test harness for Miri is just another program, and the top-level function is "main" of the test harness, so I have no good idea for how to achieve this. Miri doesn't even know it is running a test suite, it just runs the MIR that is generated by rustc --test.

That's true. Some thought is needed on how to achieve this. But I will open the issue.

@oli-obk
Contributor

oli-obk commented Jul 17, 2020

I'm using Box, which are the allocations in question. So I suspect you are correct that this involves the OS allocator.

Miri injects its own allocator. By "using the OS" Ralf means things like reading files or accessing the file system. Hypothetically even using syscalls, but I think we have no problematic ones there.

@RalfJung
Member

RalfJung commented Jul 17, 2020

Right, if e.g. the content of a file you are reading or an env var changes, that could lead to different program behavior and thus different allocation IDs. The allocator itself is fully deterministic in Miri.

@Firstyear
Contributor Author

I'm not reading any files, using any time sources, or changing environment variables. I do have isolation disabled though, because a separate test does need clock monotonic. Perhaps this is related.

@RalfJung
Member

Even if you don't change them, disabling isolation causes Miri to read all env vars on startup. That can already introduce differences.

@Firstyear
Contributor Author

That probably explains that. Thanks for your time!

@RalfJung
Member

The thing we still haven't explained is why your leak checker ceased to work under Miri -- or rather, why its state wasn't reset the way we'd expect. Would still be interesting if you could minimize that into code that you can share.

@Firstyear
Contributor Author

I'm trying to understand that still, and I'll report when I understand what's going wrong :)
