Tests: Manually stop daemon after verdi devel revive test #5689

Merged

@sphuber sphuber commented Oct 5, 2022

Fixes #5687

The `verdi process pause` test in
`tests/cmdline/commands/test_process.py` would fail because the timeout
was hit. The direct cause was that the daemon worker could not load the
node from the database, which in turn was because the session was in a
pending rollback state after a previous database operation had raised an
exception. That exception seemed to come from the daemon calling
`CalcJob.delete_state` or `Process.delete_checkpoint` in the
`on_terminated` calls. For some reason, the update statement executed to
remove the relevant attribute key matched 0 rows. The suspicion is that
the relevant node had already been removed from the database, probably
because another test, run between the two daemon tests, had cleaned the
database, so the node no longer existed but the process task somehow
did.
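
The "update matches 0 rows" symptom can be reproduced in isolation. The sketch below is only an illustration using the standard-library `sqlite3` module; the `node` table and `checkpoint` column are hypothetical stand-ins and not the actual AiiDA schema or its storage layer:

```python
import sqlite3

# Hypothetical stand-in table; not the real AiiDA schema.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE node (id INTEGER PRIMARY KEY, checkpoint TEXT)')
conn.execute("INSERT INTO node (id, checkpoint) VALUES (1, 'serialized-state')")

# Another test cleans the database while a task for node 1 is still pending.
conn.execute('DELETE FROM node')

# The pending task later tries to clear the checkpoint of the deleted node.
cursor = conn.execute('UPDATE node SET checkpoint = NULL WHERE id = ?', (1,))
matched = cursor.rowcount  # number of rows the UPDATE actually modified
print(f'rows matched by update: {matched}')
```

Plain `sqlite3` does not complain here, whereas a layer that expects exactly one matched row would raise at this point, which is consistent with the suspected chain of events described above.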

It is not clear exactly where the problem lies, but for now the
temporary workaround is to manually stop the daemon in the first test,
which apparently cleans the state such that the original exception is no
longer hit and the daemon doesn't get stuck with an inconsistent session.

sphuber commented Oct 5, 2022

I think this should fix the problem, although the solution is not ideal. It is a bit of a workaround, since I still don't understand 100% what is going on. It seems to work for now though, and since the failure is breaking all builds, we might want to consider merging this while we investigate further to find the real root cause.

@sphuber sphuber requested a review from chrisjsewell October 5, 2022 22:37
Warnings are raised when a profile is loaded that configures a RabbitMQ
server with an unsupported version or if the installed `aiida-core` code
is not a released version. These warnings are not relevant for testing
and so they are suppressed by setting the relevant config options.

The options are set on the automatically created config in the case of
the temporary test profile, as well as on the test profile that is
created manually beforehand for the tests run in the GitHub Actions
workflow.

The `Computer` created by the `aiida_localhost` fixture configures the
`core.direct` scheduler plugin, which does not support setting a maximum
memory directive. Doing so leads to a warning being logged every time a
job is submitted to the computer.

If the `submit_and_wait` fixture times out waiting for the submitted
process to reach the desired state, there is usually a problem with the
daemon workers. To make debugging easier, the status of the daemon as
well as the content of the daemon log file are included in the exception
message.
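
A minimal sketch of the idea behind that last change, assuming nothing about the actual fixture code: `wait_for_state`, `get_state` and `diagnostics` are hypothetical names, and in the real fixture the diagnostics would be the daemon status and the content of the daemon log file.

```python
import time


def wait_for_state(get_state, desired, timeout=5.0, interval=0.05, diagnostics=None):
    """Poll ``get_state`` until it returns ``desired`` or ``timeout`` seconds pass.

    ``diagnostics`` is an optional callable returning a string that is appended
    to the exception message on timeout (hypothetical stand-in for the daemon
    status and daemon log content).
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == desired:
            return state
        if time.monotonic() >= deadline:
            message = (
                f'timed out after {timeout}s waiting for state {desired!r}, '
                f'last state was {state!r}'
            )
            if diagnostics is not None:
                # Attach the extra context so the failure is debuggable from the traceback alone.
                message += '\n' + diagnostics()
            raise RuntimeError(message)
        time.sleep(interval)
```

Collecting the diagnostics only on the failure path keeps the successful case cheap while making a timeout in CI self-explanatory.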
@sphuber sphuber force-pushed the fix/5687/daemon-fixture-delete-checkpoint branch from 40dc604 to ae90d17 Compare October 7, 2022 10:03

sphuber commented Oct 7, 2022

@chrisjsewell I am merging this soon since it is blocking all other PRs. Let me know if you still want to have a look, otherwise I will go ahead.

@chrisjsewell chrisjsewell left a comment

cheers

@sphuber sphuber merged commit ea135d3 into aiidateam:main Oct 7, 2022
@sphuber sphuber deleted the fix/5687/daemon-fixture-delete-checkpoint branch October 7, 2022 12:25