Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/flaky split round robin limited fds #6043

Merged

Conversation

cre4ture
Copy link
Contributor

@cre4ture cre4ture commented Mar 3, 2024

Addresses #6031

  • refactoring: split large function get_writer into two parts
  • robutify closing of fds by retry until it works or no fds to close left. Previous implemenation did only one try and failed then.
  • robustify test setup: add sleep 0.5 && to ensure that the limits are applied before split application starts.

update: I implemented the idea with pre_exec from @tertsdiepraam as replacement for the sleep 0.5 test-setup fix.
the results of 15254 (1620) runs can be seen here: https://github.com/cre4ture/coreutils/actions/runs/8155517787/job/22291219367?pr=10
There was no single case where the original implementation of split failed.

I proved that both changes on its own fix the flakiness of the test by running all tests on 22 runners, each ~50times:

grafik

Evaluation:
the reference contained 12 out of 22 cases where the split test failed.
the other two tests didn't contain a single case each where this test was failing.

Test runs:
reference: https://github.com/cre4ture/coreutils/actions/runs/8123936859/job/22204882466?pr=9
split robustification: https://github.com/cre4ture/coreutils/actions/runs/8124460686/job/22205966825?pr=11
test robustification: https://github.com/cre4ture/coreutils/actions/runs/8124437301/job/22205913310?pr=10

@cre4ture cre4ture changed the title Fix/flaky split round robin limited fds Fix/flaky split round robin limited fds Mar 3, 2024
Copy link

github-actions bot commented Mar 3, 2024

GNU testsuite comparison:

Congrats! The gnu test tests/mv/backup-dir is no longer failing!
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

1 similar comment
Copy link

github-actions bot commented Mar 3, 2024

GNU testsuite comparison:

Congrats! The gnu test tests/mv/backup-dir is no longer failing!
Skipping an intermittent issue tests/tail/inotify-dir-recreate (passes in this run but fails in the 'main' branch)

@tertsdiepraam
Copy link
Member

If I understand this all correctly, I think the correct thing to do is set the limits on the process not with prlimit but with setlimit from pre_exec. Would you like to give that a shot?

@cre4ture cre4ture force-pushed the fix/flaky_split_round_robin_limited_fds branch 3 times, most recently from 66d406c to eed5f1c Compare March 5, 2024 15:10
@cre4ture
Copy link
Contributor Author

cre4ture commented Mar 5, 2024

update: I implemented the idea with pre_exec from @tertsdiepraam.
the results of 15*2*54 (1620) runs can be seen here: https://github.com/cre4ture/coreutils/actions/runs/8155517787/job/22291219367?pr=10
There was no single case where the original implementation of split failed.

@cre4ture cre4ture force-pushed the fix/flaky_split_round_robin_limited_fds branch 2 times, most recently from ee9691b to a6b5699 Compare March 5, 2024 17:39
Copy link

github-actions bot commented Mar 5, 2024

GNU testsuite comparison:

Skip an intermittent issue tests/tail/inotify-dir-recreate (fails in this run but passes in the 'main' branch)

@cre4ture cre4ture force-pushed the fix/flaky_split_round_robin_limited_fds branch from a6b5699 to e69caed Compare March 5, 2024 19:43
Copy link
Member

@tertsdiepraam tertsdiepraam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test fix looks great! Now I just wonder about the split change. I can't think of a case where we'd have to close multiple fds to be able to open 1. Would this really make it more robust? Or has this to do with changing limits while running split?

@BenWiederhake
Copy link
Collaborator

Or has this to do with changing limits while running split?

I think that's precisely the case. The limit drastically shrunk while executing (from unlimited to 9), so split had to close many fds before being able to operate again. I don't know whether that's a scenario that we want to support, but given that cree4ture already wrote the code, this would be an easy win.

@tertsdiepraam
Copy link
Member

Sure, but I think we then need a comment explaining why it works the way it does.

@cre4ture
Copy link
Contributor Author

cre4ture commented Mar 6, 2024

I think the split-side change is still usefull. Even though the test is also green without it.

Why?
I could imagine some (possibly exotic?) usecases for this.
E.g. split running in parallel to other processes (e.g. another split) doing similar stuff, sharing the same limits.
In this scenario, after closing one fd, the other process might "steel" the freed fd and open a file on its side.
Then it would be beneficial if split would be able to close another fd before cancellation.

I'm not sure wether the limits of a common parent process are actually "shared" as I was explaining it. Or if each child gets exactly half of it, or again the same amount of free fds as the parent?

But at least according to https://unix.stackexchange.com/a/34335 the limits can also be set per user or group.
With this, multiple processes share the same free, limited ressources I assume. Then, with parallel execution anything can happen. ;-)

So, when we agree that this features is usefull, we can either comment this, or try to explicitly test this. I'm just not yet sure how a deterministic test for this could look like.

@tertsdiepraam
Copy link
Member

Alright, I'm convinced :)

Could you add a comment about it?

@cre4ture cre4ture force-pushed the fix/flaky_split_round_robin_limited_fds branch from e69caed to 2889119 Compare March 7, 2024 22:51
@cre4ture
Copy link
Contributor Author

cre4ture commented Mar 7, 2024

Alright, I'm convinced :)

Could you add a comment about it?

done :-)

@cre4ture cre4ture force-pushed the fix/flaky_split_round_robin_limited_fds branch from 2889119 to db142f9 Compare March 8, 2024 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants