API endpoint /api/tool_shed_repositories fails with PendingRollbackError under high load #18627
Comments
Pinging @davelopez.
The explicit session management seems fine as a fix; can you PR this? We should be closing all scoped sessions when the request ends, but that might not work for the install model. Are you using a separate database for this (…)?
We are using the same database connection for user data and tool shed install data. Ok, I'll PR this 👍
This is probably the more general fix for galaxyproject#18627.
Describe the bug
Under sufficiently high load, repeatedly calling the endpoint `/api/tool_shed_repositories` leads to HTTP 500 errors. Galaxy attempts to run a database query to get the list of repositories, but it fails with a `PendingRollbackError`.
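For readers unfamiliar with the error, here is a minimal, generic illustration (not Galaxy code) of how a SQLAlchemy session ends up in the pending-rollback state. The trigger below is a failed flush, whereas in this issue it is an invalidated connection under load, but the resulting session state is the same:

```python
# Generic illustration only, not Galaxy code: once a statement fails
# inside a session's transaction, every further use of that session
# raises PendingRollbackError until rollback() is called explicitly.
from sqlalchemy import Column, Integer, create_engine, select
from sqlalchemy.exc import IntegrityError, PendingRollbackError
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

session = Session(engine)
session.add(Item(id=1))
session.commit()

session.add(Item(id=1))  # duplicate primary key
try:
    session.flush()  # fails with IntegrityError
except IntegrityError:
    pass  # swallowing the error leaves the transaction in a failed state

try:
    session.execute(select(Item))  # any further use of the session fails
except PendingRollbackError as err:
    print(type(err).__name__)  # PendingRollbackError

session.rollback()  # the explicit rollback makes the session usable again
print(len(session.execute(select(Item)).scalars().all()))  # 1
```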
Server side error (Sentry issue):

Client side error (ephemeris):
This bug has been observed when running the automatic tool installation CI of useGalaxy.eu. At useGalaxy.eu, tools are managed via the usegalaxy-eu/usegalaxy-eu-tools repository. This repository contains a couple of YAML files (e.g. this one) that specify the name and owner of each tool in the Galaxy Tool Shed. A GitHub Actions workflow runs a script that generates lockfiles (analogous to Cargo.lock or poetry.lock) from those YAML files, pinning the versions of the tools that will be installed. Another script updates the lockfiles. Each Sunday, a Jenkins project runs ephemeris on all lockfiles to install the tools. The Jenkins job fails to install some tools, and, what is worse, even whole lockfiles, because of this error (see report).
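For context, entries in those YAML files follow the ephemeris tool-list format; a hypothetical entry (not taken from the actual repository) looks roughly like:

```yaml
tools:
- name: fastqc             # tool name in the Galaxy Tool Shed
  owner: devteam           # Tool Shed user that owns the repository
  tool_panel_section_label: Quality Control  # target tool panel section
```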
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.1 (minor 2.dev0)
Commit: 5d6f5af29144ceb352f6356019a81547fc73f083 (from the usegalaxy-eu/galaxy fork)
Browser and Operating System
Operating System: Linux
Browser: ephemeris -- uses --> bioblend -- uses --> Python requests library
To Reproduce
Either run the tool installation CI or a load-generating code snippet like the sketch below.
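A sketch of what such a load-generating snippet might look like, assuming plain requests against the API (the server URL, API key, and `num_processes` value are placeholders, and the real snippet may differ):

```python
# Hypothetical reconstruction, not the original snippet: hammer the
# endpoint from several processes in parallel until 500s show up.
from multiprocessing import Pool

import requests

GALAXY_URL = "https://usegalaxy.example.org"  # placeholder server URL
API_KEY = "replace-me"  # placeholder Galaxy API key
num_processes = 10  # raising this (e.g. to 100) increases the load

def hit_endpoint(_):
    response = requests.get(
        f"{GALAXY_URL}/api/tool_shed_repositories",
        headers={"x-api-key": API_KEY},
    )
    return response.status_code

if __name__ == "__main__":
    with Pool(num_processes) as pool:
        while True:
            # each round issues num_processes concurrent requests
            for status in pool.map(hit_endpoint, range(num_processes)):
                if status != 200:
                    print(f"got HTTP {status}")
```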
After a few minutes,

`Failed to get repositories list: GET: error 500: b'Internal Server Error', 0 attempts left: Internal Server Error`

will show up in your terminal. On Sentry, the `PendingRollbackError` pops up. In addition, running the code above has undesired side effects (things start malfunctioning) because it causes Galaxy to spawn new database connections until they are all exhausted.
Other examples:
Expected behavior
Either HTTP 200 responses, or 502 and 503 errors. The latter are expected when the server is unable to keep up with the amount of requests (for example, when setting `num_processes = 100` or higher in the code snippet from above). This API endpoint is costly in general; the server takes about 10 seconds to provide a response (but that is a separate issue).

Screenshots

Additional context
Replacing `_get_tool_shed_repositories` in lib/galaxy/webapps/galaxy/services/tool_shed_repositories.py with an explicitly managed session fixes the issue (see this error-free tool installation CI report from yesterday). It is unclear to me whether just `with self._install_model_context() as session:`, `check_database_connection(session)`, or the combination of both is needed. This also fixes the side effects mentioned earlier.
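A sketch of the shape of the fix follows. Only `self._install_model_context()` and `check_database_connection(session)` come from the discussion above; the import paths and the query body are assumptions for illustration:

```python
# A sketch, not the exact patch. Only self._install_model_context() and
# check_database_connection(session) come from the discussion above; the
# import paths and the query body are assumptions for illustration.
from sqlalchemy import select

from galaxy.model.base import check_database_connection  # assumed path
from galaxy.model.tool_shed_install import ToolShedRepository  # assumed path

def _get_tool_shed_repositories(self, **kwd):
    # Open a short-lived session scoped to this call instead of reusing
    # the long-lived scoped session.
    with self._install_model_context() as session:
        # Recover an invalidated connection (rolling it back if needed)
        # before issuing the query.
        check_database_connection(session)
        stmt = select(ToolShedRepository)
        for key, value in kwd.items():  # illustrative filter construction
            if value is not None:
                stmt = stmt.where(getattr(ToolShedRepository, key) == value)
        return session.scalars(stmt).all()
```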
From my point of view, the following issues and PRs are related to this problem or can be useful to understand it:
check_database_connection #17598

I assume the fix above works because, under sufficiently high load, calls are leaving an invalidated connection behind that must now be explicitly rolled back, as the SQLAlchemy docs claim. I am no expert in SQLAlchemy, but I guess it must have something to do with the fact that `scoped_session.remove()` is probably not being called after the request has been served (see Using Thread-Local Scope with Web Applications). I hope we can use this issue to collaboratively shed some light on this.