Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per test timeout #71

Merged
merged 9 commits into from
Aug 4, 2022
Merged

Per test timeout #71

merged 9 commits into from
Aug 4, 2022

Conversation

Licenser
Copy link
Contributor

So this is still a draft, it adds the option to define a timeout as part of the serial test something along the lines of:

#[test]
#[serial(timeout = 600)]
fn test() -> {}

I've not added tests or documentation yet as I wanted to get feedback on if this approach is OK before putting in that work :)

@palfrey
Copy link
Owner

palfrey commented Jul 1, 2022

I like the overall approach. Couple of notes:

  • Could the parameter become timeout_secs or similar? I don't like time parameters as numbers without units, as it causes confusion later on (as some users might assume msec or similar)
  • If timeout cfg is not set but someone does set an explicit timeout value for a test then there should be an error (ideally compile-time) as that won't do anything I think.

@Licenser
Copy link
Contributor Author

Licenser commented Jul 1, 2022

I'm wondering if perhaps this should remove the timeout future flag, it'd be a breaking change but that way timeout becomes optional setting (if none is given no timeout), it would reduce the configuration surface for users.

It might feel a bit odd to have tow PR's removing each other 😂 but for the sake of version progression (breaking vs non breaking changes in semver) it does make some sense to have #70 as a non breaking change, and then this as a breaking change moving more into the direction that reflects the vision of where things are going.

On the other hand there was a thought behind making it not error, mainly it allows a feature flag to quickly toggle timeouts on and off for testing with various settings, but I can totally see how hat is confusing.

@Licenser Licenser changed the title Per test timeout (WIP Per test timeout (WIP) Jul 1, 2022
@palfrey
Copy link
Owner

palfrey commented Jul 1, 2022

I'm wondering if perhaps this should remove the timeout future flag, it'd be a breaking change but that way timeout becomes optional setting (if none is given no timeout), it would reduce the configuration surface for users.

It might feel a bit odd to have tow PR's removing each other joy but for the sake of version progression (breaking vs non breaking changes in semver) it does make some sense to have #70 as a non breaking change, and then this as a breaking change moving more into the direction that reflects the vision of where things are going.

I think the timeout feature flag is maybe a semver breaking change (because anyone using default-features = no will get different behaviour). I'm currently leaning towards "get all of this in, sit on it for at least a few days just to check no-one else has anything else they haven't reported yet, then do 0.9.0" TBH v.s. any intermediate releases though!

But, back to the topic, yes, I think it makes sense to remove it. Going back to a "no timeouts" default, while permitting it getting added on a test-by-test basis is good. I suspect practically no-one uses the feature as it is released right now, and if they do, they can either not upgrade or switch to the new syntax. Also, that timeouts doesn't need any extra crates kinda leans me towards "support it for everyone".

@Licenser Licenser force-pushed the per-test-timeout branch from 7d0a595 to 109b73d Compare July 1, 2022 12:52
@palfrey
Copy link
Owner

palfrey commented Jul 10, 2022

How's this going @Licenser? Any thoughts on timelines for this one?

@Licenser
Copy link
Contributor Author

Heya,

sorry about that, I got sidetracked with 1000 other things and ended up sick the last few days :) should be able to wrap this up today or worst this week :)

@Licenser Licenser force-pushed the per-test-timeout branch 2 times, most recently from 3b97657 to c1f9034 Compare July 13, 2022 09:37
@Licenser Licenser marked this pull request as ready for review July 13, 2022 09:43
@Licenser
Copy link
Contributor Author

There we go, just needed to clean up the file_lock part. It didn't support timeouts in the past and I kept it this way but added the plumbing to add it later on if it's ever desired :).

@Licenser Licenser changed the title Per test timeout (WIP) Per test timeout Jul 14, 2022
@palfrey
Copy link
Owner

palfrey commented Jul 16, 2022

I'm happy with the implementation. It still needs docs adding for it, and at least one or two integration tests (i.e. actual uses of serial with the timeout_ms arg) before it gets merged.

Thanks!

@palfrey
Copy link
Owner

palfrey commented Jul 21, 2022

Tried adding some tests to serial_test_test/src/lib.rs of the form

    #[test]
    #[serial(timeout_key, timeout_ms = 300)]
    fn demo_timeout_with_key() {
    }

    #[test]
    #[serial(timeout_ms = 300)]
    fn demo_timeout() {
    }

and got the error expected struct `Duration`, found `u64`

@Licenser
Copy link
Contributor Author

Darn it, I'm sorry I didn't even notice that sub project -.-, my bad, should be fixed now :)

@Licenser Licenser force-pushed the per-test-timeout branch 3 times, most recently from f673797 to a40be1f Compare July 22, 2022 08:41
@Licenser
Copy link
Contributor Author

I'm confused why the tests fail at this point, I don't see a correlation to the changed code, and I can't reproduce the failures locally.

@palfrey
Copy link
Owner

palfrey commented Jul 22, 2022

I'm confused why the tests fail at this point, I don't see a correlation to the changed code, and I can't reproduce the failures locally.

Not sure. I also can't reproduce the failures locally. Possibly try bumping the timeouts to 60s just in case?

@Licenser
Copy link
Contributor Author

Giving that a shot :)

@palfrey
Copy link
Owner

palfrey commented Jul 22, 2022

So I'm pretty sure I know what's wrong. There's a set of tests parallel_with_key_[1-3] which only work because there's at least 3 test threads. On most modern machines, this doesn't matter, as the default thread count is "number of CPUs". On Github CI, the default ends up being 1, and so I set the RUST_TEST_THREADS variable to 3 so there's enough threads. If I set that locally, I can sometimes reproduce this.

What I think is happening is that somewhere in the serial_test function wrappers, some lock is being held on when it shouldn't, and so if one of the parallel tests kicks off before the demo_timeout ones, it then blocks the demo_timeout one working (or the other two parallel ones happening) in some way. Trying to wrap my head around what's happening here exactly!

We could fix this PR by just upping the RUST_TEST_THREADS count as I'm fairly sure that'll solve the problem, but it's a hack and I'd rather get this working properly while we've got a PR that is able to clearly demonstrate the problem, even if it's got nothing to do with the work here. Just a matter of just about enough more tests added!

I'll prod at this on a new branch, but it's one of those cases that's a pain to test...

@palfrey palfrey mentioned this pull request Jul 22, 2022
@palfrey
Copy link
Owner

palfrey commented Jul 22, 2022

I think that #72 which redoes things with dashmap might solve things. Any thoughts?

@Licenser
Copy link
Contributor Author

I just removed the timeouts from the functions to see if the tests still fail, it looks "promising" (as in while writing this the tests are still going and not finishing). At least that's a good indication that the timeouts are not what's being off but that the breakage is linked to the number of tests - which would validate your theory.

Reading check_new_keys (both in #72 and current) I wonder if there is a logic error in the locking, I thought it's my brain not getting it - happens a lot with multithreaded locking code ;) - but I'll share it perhaps the it was onto something there.

The read_lock to check if the name is present and the write lock for inserting the name are done separately. L69 locks for reads, checks if the key is present, if not, it drops the lock in L74. In L80 it tries to take the write lock. I could see two threads slipping past the read lock and reaching the waiting for the write lock at the same time, but with entry or_default this should ™️ not be an issue.

The second thing I noticed during digging is that the timeout is only counted for how long to wait until we can enter insert into the map, not how long the test runs or am I totally misunderstanding the check_key function?

@Licenser
Copy link
Contributor Author

We could fix this PR by just upping the RUST_TEST_THREADS count as I'm fairly sure that'll solve the problem, but it's a hack and I'd rather get this working properly while we've got a PR that is able to clearly demonstrate the problem, even if it's got nothing to do with the work here. Just a matter of just about enough more tests added!

I agree, there is no hurry to get this merged from my side, lets keep this as broken for now the hang is a bigger concern then the timeout attribute :)

@palfrey
Copy link
Owner

palfrey commented Jul 31, 2022

The read_lock to check if the name is present and the write lock for inserting the name are done separately. L69 locks for reads, checks if the key is present, if not, it drops the lock in L74. In L80 it tries to take the write lock. I could see two threads slipping past the read lock and reaching the waiting for the write lock at the same time, but with entry or_default this should tm not be an issue.

Correct. The ideal situation would be some sort of "read lock with possible upgrade to write" and most of those seem to lock it down to just one getting upgraded. The logic as it is assumes that multiple threads may indeed slip past the read lock. Only one should then get the write lock but there was some hard to test scenarios that seem to avoid this, so hence the default bits.

The second thing I noticed during digging is that the timeout is only counted for how long to wait until we can enter insert into the map, not how long the test runs or am I totally misunderstanding the check_key function?

Correct. It's original purpose was to try and catch deadlock scenarios.

See also #45 and #46 for details around both of these.

@palfrey
Copy link
Owner

palfrey commented Jul 31, 2022

Correct. The ideal situation would be some sort of "read lock with possible upgrade to write" and most of those seem to lock it down to just one getting upgraded.

To clarify: I didn't use those, as they all seemed to then cause other issues. I can't recall what offhand, but fundamentally I couldn't figure out how to use them, so this simpler setup was used.

@palfrey
Copy link
Owner

palfrey commented Aug 4, 2022

I think that #72 which redoes things with dashmap might solve things. Any thoughts?

I've merged this now. Can you resolve the conflicts here and let's see if it's still running clean and then let's get it merged.

@Licenser
Copy link
Contributor Author

Licenser commented Aug 4, 2022

On it, fingers crossed :)

Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
Signed-off-by: Heinz N. Gies <heinz@licenser.net>
@Licenser
Copy link
Contributor Author

Licenser commented Aug 4, 2022

It passed! Great call on the dashmap :) wooh!

@palfrey palfrey merged commit d108a62 into palfrey:main Aug 4, 2022
@Licenser Licenser deleted the per-test-timeout branch August 5, 2022 16:25
@palfrey
Copy link
Owner

palfrey commented Aug 10, 2022

Released as part of 0.9.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants