Race condition on multiple nearly-simultaneous uploads of the same collection (different versions) #1571
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue on Sep 15, 2023: "This bugfix contains a migration and is therefore not backportable. fixes pulp#1571"
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue on Sep 15, 2023: "This is the backported version of the fix. A different strategy was pursued. fixes pulp#1571"
mdellweg added a commit that referenced this issue on Sep 19, 2023: "This bugfix contains a migration and is therefore not backportable. fixes #1571"
@mdellweg is the bugfix release planned anytime soon?
mdellweg added a commit to mdellweg/pulp_ansible that referenced this issue on Sep 26, 2023: "re pulp#1571 [noissue]"
webknjaz added a commit to webknjaz/ansible that referenced this issue on Dec 11, 2023: "It updates the underlying `pulp_ansible` to v0.20.2, which fixes a race condition that is occasionally observable in our CI when provisioning the initial dummy collections for testing." Refs: https://github.com/ansible/galaxy_ng/releases/tag/4.9.0, ansible/galaxy_ng#1950, ansible/galaxy#3267, pulp/pulp_ansible#1571, pulp/pulp_ansible#1572, pulp/pulp_ansible#1595, pulp/pulp_ansible#1623, pulp/pulp_ansible#1624
Version
Describe the bug
We use galaxy-ng in our tests for ansible-core upstream. The jobs are flaky and get stuck periodically.
We don't always see the helper container logs, as they aren't available in the CI, but I recently observed something locally that just screams "race condition", which leads me to believe that the problem is in this repository specifically.
Here are the corresponding log lines:
To Reproduce
Run `ansible-galaxy collection publish` for several versions of the same collection, all of them simultaneously (you could use a normal shell loop and add `&` to the publish command to spawn many processes); a minimal sketch follows after the expected behavior below.
Expected behavior
No errors. The latest version is always updated.
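For illustration, here is a minimal sketch of the reproduction scenario, assuming a locally running Galaxy instance: it publishes several versions of the same collection concurrently. The server URL, token, and tarball names are placeholders, not values from our actual setup.

```python
# Illustrative reproducer, not our actual CI code: publish many versions of one
# collection nearly simultaneously. Server URL, token, and tarball names are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:5001/api/automation-hub/"  # placeholder Galaxy URL
TOKEN = "<api token>"  # placeholder
TARBALLS = [f"namespace1-name1-0.0.{n}.tar.gz" for n in range(1, 11)]  # placeholder artifacts


def publish(tarball):
    # Each call runs `ansible-galaxy collection publish` in its own subprocess, so
    # several uploads of the same collection hit the server at nearly the same time.
    return subprocess.run(
        ["ansible-galaxy", "collection", "publish", tarball,
         "--server", SERVER, "--token", TOKEN],
        capture_output=True,
        text=True,
    )


# Our helper also caps concurrency at 4 publishes at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(publish, TARBALLS):
        print(result.returncode, result.stderr.strip()[-200:])
```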
Additional context
To add more context, our tests run a galaxy-ng container and provision a bunch of dummy collections for testing the dependency resolver used in the `ansible-galaxy collection install|download` commands. Then, we proceed to test the related scenarios.
Here are the collections provisioned with our helper: https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/vars/main.yml#L27-L167
The helper itself is https://github.com/ansible/ansible/blob/81c83c6/test/integration/targets/ansible-galaxy-collection/library/setup_collections.py and, upon closer inspection, it runs `ansible-galaxy collection publish` asynchronously in subprocesses, but no more than 4 at a time.
Back to the test collections being provisioned: one of them, `namespace1.name1`, has 14 different versions. That's the one most likely to hit race conditions, given the nature of the piece of logic in `pulp_ansible` where this is happening.
Our helper records the results associated with each upload, and the snippet above is extracted from its log.
Said log has successful uploads for all of the collections in the list (not just this one) except for `namespace1.name1:0.0.7`, even though all the other artifacts of this collection (0.0.1-0.0.6, 0.0.8-0.0.10, 0.1.0, 1.0.0, 1.0.9, 1.1.0-beta.1) are identical and are processed in the same loop.
So that `duplicate key value violates unique constraint "unique_is_highest"` is a clear indication of a database constraint kicking in and failing the DB entry update. It's not hard to imagine that, with the asynchronous tasks being scheduled, multiple uploads trigger simultaneous attempts to update the database table in a way that causes a conflict.
I'm convinced that if the code used transactions/locks/retries, it could be more robust and handle this case more gracefully.
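To illustrate the transactions/locks idea, here is a minimal sketch of serializing the "recompute the highest version" step with a row lock. It assumes `CollectionVersion` has `collection`, `version`, and `is_highest` fields; it is not the actual pulp_ansible task code, and the version comparison is simplified.

```python
# Sketch only: serialize the "recompute highest" step so concurrent uploads queue up
# instead of both marking their version as the highest one.
from django.db import transaction
from packaging.version import Version  # simplified; collections actually use semver

from pulp_ansible.app.models import CollectionVersion  # field names below are assumed


def update_highest_version(collection):
    with transaction.atomic():
        # Row-lock every version of this collection for the duration of the transaction.
        versions = list(
            CollectionVersion.objects.select_for_update().filter(collection=collection)
        )
        if not versions:
            return
        highest = max(versions, key=lambda v: Version(v.version))
        # Clear the old flag first so the partial unique constraint is never violated.
        CollectionVersion.objects.filter(
            collection=collection, is_highest=True
        ).exclude(pk=highest.pk).update(is_highest=False)
        if not highest.is_highest:
            highest.is_highest = True
            highest.save(update_fields=["is_highest"])
```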
The constraint itself is introduced @ https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/models.py#L234 / https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/migrations/0005_collectionversion_is_highest.py#L32.
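For readers unfamiliar with this kind of constraint, a conditional (partial) unique constraint that permits only one `is_highest=True` row per collection looks roughly like the following in Django; this is an illustrative shape, not a copy of the upstream model.

```python
# Illustrative shape of the "unique_is_highest" idea: at most one version row
# per collection may carry is_highest=True (trimmed-down, hypothetical model).
from django.db import models
from django.db.models import Q, UniqueConstraint


class CollectionVersion(models.Model):
    collection = models.ForeignKey("Collection", on_delete=models.CASCADE)
    version = models.CharField(max_length=128)
    is_highest = models.BooleanField(default=False)

    class Meta:
        constraints = [
            UniqueConstraint(
                fields=["collection", "is_highest"],
                name="unique_is_highest",
                condition=Q(is_highest=True),
            )
        ]
```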
I've also found this visualization of the DB schema: https://alikins.github.io/schemaspy/tables/ansible_collectionversion.html.
I believe that the buggy piece of code that gets into the race condition directly is https://github.com/pulp/pulp_ansible/blob/567b0c7/pulp_ansible/app/tasks/collections.py#L424-L477.
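To make the suspected failure mode concrete, the dangerous pattern is an unguarded read-compare-write: two upload tasks both read the current highest version, both decide their own version supersedes it, and both try to write `is_highest=True`. A schematic of that pattern, with hypothetical names (not the actual task code):

```python
# Schematic of the read-compare-write pattern that two concurrent upload tasks
# can interleave on (hypothetical names, not the real pulp_ansible task code).
from packaging.version import Version  # simplified comparator for the sketch

from pulp_ansible.app.models import CollectionVersion  # field names are assumed


def mark_highest_unsafely(collection, new_version):
    current = CollectionVersion.objects.filter(
        collection=collection, is_highest=True
    ).first()
    if current is None or Version(new_version.version) > Version(current.version):
        # Task A (uploading 0.0.7) and task B (uploading 0.0.8) can both reach this
        # point before either commits; both then set is_highest=True, and the second
        # write trips the "unique_is_highest" constraint.
        if current is not None:
            current.is_highest = False
            current.save(update_fields=["is_highest"])
        new_version.is_highest = True
        new_version.save(update_fields=["is_highest"])
```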