
API endpoint /api/tool_shed_repositories fails with PendingRollbackError under high load #18627

Closed
kysrpex opened this issue Jul 31, 2024 · 3 comments

Comments

@kysrpex
Contributor

kysrpex commented Jul 31, 2024

Describe the bug

Under sufficiently high load, repeated calls to the endpoint /api/tool_shed_repositories lead to HTTP 500 errors. Galaxy attempts to run a database query to get the list of installed repositories, but the query fails with a PendingRollbackError.

Server side error (Sentry issue):

PendingRollbackError: Can't reconnect until invalid transaction is rolled back.  Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)

Client side error (ephemeris):

Storing log file in: /tmp/ephemeris_kwp2f5zh
Traceback (most recent call last):
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/bin/shed-tools", line 8, in <module>
    sys.exit(main())
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 562, in main
    **kwargs)
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 152, in install_repositories
    filtered_repos = self.filter_installed_repos(repository_list)
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 87, in filter_installed_repos
    installed_repos = flatten_repo_info(self.installed_repositories())
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 76, in installed_repositories
    get_all_tools=True
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/get_tool_list_from_galaxy.py", line 140, in tool_list
    repo_list = self.repository_list
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/get_tool_list_from_galaxy.py", line 113, in repository_list
    repos = tsc.get_repositories()
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/bioblend/galaxy/toolshed/__init__.py", line 36, in get_repositories
    return self._get()
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/bioblend/galaxy/client.py", line 151, in _get
    status_code=r.status_code,
bioblend.ConnectionError: GET: error 500: b'Internal Server Error', 0 attempts left: Internal Server Error
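To illustrate what the error message is asking for, here is a minimal SQLAlchemy sketch (not Galaxy code; the DSN is made up) of how a session gets stuck in this state and how it is meant to recover:

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import Session

# hypothetical DSN, not the actual useGalaxy.eu database
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")

with Session(engine) as session:
    try:
        session.execute(text("SELECT 1"))
    except OperationalError:
        # the DBAPI connection was invalidated (e.g. dropped by the server under load);
        # SQLAlchemy requires an explicit rollback before the session can be used again
        session.rollback()
    # without the rollback() above, the next statement on an invalidated connection raises
    # PendingRollbackError: "Can't reconnect until invalid transaction is rolled back"
    session.execute(text("SELECT 1"))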

This bug has been observed when running the automatic tool installation CI of useGalaxy.eu. At useGalaxy.eu, tools are managed via the usegalaxy-eu/usegalaxy-eu-tools repository. This repository contains a set of YAML files (e.g. this one) that specify the name and owner of each tool in the Galaxy Tool Shed. A GitHub Actions workflow runs a script that generates lockfiles (analogous to cargo.lock or poetry.lock) from those YAML files, pinning the versions of the tools that will be installed; another script updates the lockfiles. Each Sunday, a Jenkins project runs ephemeris on all lockfiles to install the tools. The Jenkins job fails to install some tools and, worse, sometimes all tools from entire lockfiles because of this error (see report).

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.1 (minor 2.dev0)
Commit: 5d6f5af29144ceb352f6356019a81547fc73f083 (from the usegalaxy-eu/galaxy fork)

Browser and Operating System
Operating System: Linux
Browser: not applicable (ephemeris → bioblend → Python requests library)

To Reproduce

Either run the tool installation CI or the code snippet below.

import multiprocessing
import time

from bioblend.galaxy import GalaxyInstance
from bioblend.galaxy.toolshed import ToolShedClient
from ephemeris.get_tool_list_from_galaxy import GiToToolYaml

galaxy = GalaxyInstance(
    "https://usegalaxy.eu",
    key="********************************",  # Galaxy API key
)
gitotoolyaml = GiToToolYaml(galaxy)
tsc = ToolShedClient(gitotoolyaml.gi)

def hammer_with_requests(worker: int = 0):
    tries = 0
    while True:
        tries += 1
        now = time.time()
        try:
            repositories = tsc.get_repositories()
        except Exception as e:
            print(e)
            continue
        after = time.time()
        print(f"Worker {worker}, try {tries}, `len(repositories) = {len(repositories)}`, `time = {after-now}`")

if __name__ == "__main__":
    num_processes = 50

    processes = []
    for i in range(num_processes):
        process = multiprocessing.Process(target=hammer_with_requests, args=(i,))
        processes.append(process)
        process.start()

    # keep the parent process alive while the workers hammer the endpoint
    for process in processes:
        process.join()

After a few minutes, the message Failed to get repositories list: GET: error 500: b'Internal Server Error', 0 attempts left: Internal Server Error shows up in the terminal, and the PendingRollbackError pops up on Sentry.

In addition, running the code above has undesired side effects: it causes Galaxy to open new database connections until all available connection slots on the database server are exhausted, at which point other parts of Galaxy start malfunctioning.

galaxy@sn06:~$ gxadmin query latest-users
psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

Other examples:

OperationalError

(psycopg2.OperationalError) connection to server at "****.galaxyproject.eu" (xxx.xxx.xxx.xxx), port **** failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/20/e3q8)

OperationalError

(psycopg2.OperationalError) connection to server at "****.galaxyproject.eu" (xxx.xxx.xxx.xxx), port **** failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/20/e3q8)
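For context on the connection exhaustion: this is how SQLAlchemy's pool normally caps the number of connections a single process opens. The values and DSN below are made up and are not useGalaxy.eu's actual configuration:

from sqlalchemy import create_engine

# hypothetical values and DSN, not Galaxy's actual engine configuration
engine = create_engine(
    "postgresql+psycopg2://galaxy:***@db.example.org/galaxy",
    pool_size=5,         # connections kept open in the pool
    max_overflow=10,     # extra connections allowed under load (15 in total)
    pool_timeout=30,     # seconds to wait for a free connection before raising
    pool_pre_ping=True,  # test connections on checkout and discard stale ones
)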

Expected behavior
Either HTTP 200 responses, or 502 and 503 errors. The latter are expected when the server is unable to keep up with the amount of requests (for example, when setting num_processes = 100 or higher in the code snippet above). This API endpoint is costly in general; the server takes about 10 seconds to provide a response (but that is a separate issue).

Screenshots

Additional context

Replacing _get_tool_shed_repositories in lib/galaxy/webapps/galaxy/services/tool_shed_repositories.py with

    def _get_tool_shed_repositories(self, **kwd):
        stmt = select(ToolShedRepository)
        for key, value in kwd.items():
            if value is not None:
                column = ToolShedRepository.__table__.c[key]  # type:ignore[attr-defined]
                stmt = stmt.filter(column == value)
        stmt = stmt.order_by(ToolShedRepository.name).order_by(cast(ToolShedRepository.ctx_rev, Integer).desc())
        # logging.info("Listing tool_shed_repositories using the session as context manager")
        with self._install_model_context() as session:
            check_database_connection(session)
            return session.scalars(stmt).all()

fixes the issue (see this error-free tool installation CI report from yesterday). It is unclear to me whether the with self._install_model_context() as session: block alone, the call to check_database_connection(session) alone, or the combination of both is needed. This change also fixes the side effects mentioned earlier.
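For reference, my (non-expert) understanding of why the with block matters is that it guarantees the session is closed, and its connection returned to the pool, even when the query raises. A generic SQLAlchemy sketch of that pattern, with made-up names that only mimic what I assume self._install_model_context() does:

from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# hypothetical engine and session factory, not Galaxy's install model
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")
SessionFactory = sessionmaker(engine)

@contextmanager
def install_model_context():
    # yield a session and always clean it up afterwards
    session = SessionFactory()
    try:
        yield session
    except Exception:
        # roll back the (possibly invalidated) transaction so the connection
        # does not linger in the state that produces PendingRollbackError
        session.rollback()
        raise
    finally:
        # close() returns the connection to the pool
        session.close()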

From my point of view, the following issues and PRs are related to this problem or can be useful to understand it:

I assume the fix above works because, under sufficiently high load, calls are leaving behind an invalidated connection that must be explicitly rolled back, as the SQLAlchemy docs state. I am no expert in SQLAlchemy, but I suspect it has something to do with the fact that scoped_session.remove() is probably not being called after the request has been served (see Using Thread-Local Scope with Web Applications). I hope we can use this issue to collaboratively shed some light on this.
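For reference, the per-request cleanup described in that section of the SQLAlchemy docs boils down to something like the sketch below (the request handler and query are made up; this is not Galaxy's actual middleware):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

# hypothetical engine, not the actual Galaxy install model engine
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")
Session = scoped_session(sessionmaker(engine))

def do_work():
    # hypothetical application code; every call on this thread gets the
    # same thread-local session through the scoped_session registry
    return Session.execute(text("SELECT 1")).scalar()

def handle_request():
    # stand-in for a request handler or middleware hook
    try:
        return do_work()
    finally:
        # discard the thread-local session once the request has been served:
        # this rolls back any open (possibly invalidated) transaction and
        # returns the connection to the pool, so the next request handled by
        # this thread starts with a clean session
        Session.remove()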

@kysrpex
Contributor Author

kysrpex commented Jul 31, 2024

Pinging @davelopez.

@mvdbeek
Member

mvdbeek commented Jul 31, 2024

The explicit session management seems fine as a fix; can you PR this? We should be closing all scoped sessions when the request ends, but that might not work for the install model. Are you using a separate database for this (install_database_connection in galaxy.yml)?

@kysrpex
Contributor Author

kysrpex commented Jul 31, 2024

The explicit session management seems fine as a fix; can you PR this? We should be closing all scoped sessions when the request ends, but that might not work for the install model. Are you using a separate database for this (install_database_connection in galaxy.yml)?

We are using the same database connection for user data and tool shed install data.

Ok, I'll PR this 👍

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Jul 31, 2024
This is probably the more general fix for
galaxyproject#18627.
@kysrpex kysrpex closed this as completed Aug 6, 2024