
API endpoint /api/tool_shed_repositories fails with PendingRollbackError under high load #18627

Closed
kysrpex opened this issue Jul 31, 2024 · 3 comments

Comments

@kysrpex
Contributor

kysrpex commented Jul 31, 2024

Describe the bug

Under sufficiently high load, repeated calls to the endpoint /api/tool_shed_repositories lead to HTTP 500 errors. Galaxy attempts to run a database query to get the list of installed repositories, but the query fails with a PendingRollbackError.

Server side error (Sentry issue):

PendingRollbackError: Can't reconnect until invalid transaction is rolled back.  Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)

Client side error (ephemeris):

Storing log file in: /tmp/ephemeris_kwp2f5zh
Traceback (most recent call last):
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/bin/shed-tools", line 8, in <module>
    sys.exit(main())
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 562, in main
    **kwargs)
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 152, in install_repositories
    filtered_repos = self.filter_installed_repos(repository_list)
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 87, in filter_installed_repos
    installed_repos = flatten_repo_info(self.installed_repositories())
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/shed_tools.py", line 76, in installed_repositories
    get_all_tools=True
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/get_tool_list_from_galaxy.py", line 140, in tool_list
    repo_list = self.repository_list
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/ephemeris/get_tool_list_from_galaxy.py", line 113, in repository_list
    repos = tsc.get_repositories()
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/bioblend/galaxy/toolshed/__init__.py", line 36, in get_repositories
    return self._get()
  File "/scratch/shiningpanda/jobs/023ca033/virtualenvs/d41d8cd9/lib/python3.6/site-packages/bioblend/galaxy/client.py", line 151, in _get
    status_code=r.status_code,
bioblend.ConnectionError: GET: error 500: b'Internal Server Error', 0 attempts left: Internal Server Error
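To illustrate what the error message is asking for, here is a minimal SQLAlchemy sketch (not Galaxy code; the DSN is made up) of how a session gets stuck in this state and how it is meant to recover:

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError
from sqlalchemy.orm import Session

# hypothetical DSN, not the actual useGalaxy.eu database
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")

with Session(engine) as session:
    try:
        session.execute(text("SELECT 1"))
    except OperationalError:
        # the DBAPI connection was invalidated (e.g. dropped by the server under load);
        # SQLAlchemy requires an explicit rollback before the session can be used again
        session.rollback()
    # without the rollback() above, the next statement on an invalidated connection raises
    # PendingRollbackError: "Can't reconnect until invalid transaction is rolled back"
    session.execute(text("SELECT 1"))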

This bug has been observed when running the automatic tool installation CI of useGalaxy.eu. At useGalaxy.eu, tools are managed via the usegalaxy-eu/usegalaxy-eu-tools repository. This repository contains a set of YAML files (e.g. this one) that specify the name and owner of each tool in the Galaxy Tool Shed. A GitHub Actions workflow runs a script that generates lockfiles (analogous to cargo.lock or poetry.lock) from those YAML files, pinning the versions of the tools that will be installed; another script updates the lockfiles. Each Sunday, a Jenkins project runs ephemeris on all lockfiles to install the tools. The Jenkins job fails to install some tools and, worse, sometimes all tools from entire lockfiles because of this error (see report).

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 24.1 (minor 2.dev0)
Commit: 5d6f5af29144ceb352f6356019a81547fc73f083 (from the usegalaxy-eu/galaxy fork)

Browser and Operating System
Operating System: Linux
Browser: not applicable (ephemeris → bioblend → Python requests library)

To Reproduce

Either run the tool installation CI or the code snippet below.

import multiprocessing
import time

from bioblend.galaxy import GalaxyInstance
from bioblend.galaxy.toolshed import ToolShedClient
from ephemeris.get_tool_list_from_galaxy import GiToToolYaml

galaxy = GalaxyInstance(
    "https://usegalaxy.eu",
    key="********************************",  # Galaxy API key
)
gitotoolyaml = GiToToolYaml(galaxy)
tsc = ToolShedClient(gitotoolyaml.gi)

def hammer_with_requests(worker: int = 0):
    tries = 0
    while True:
        tries += 1
        now = time.time()
        try:
            repositories = tsc.get_repositories()
        except Exception as e:
            print(e)
            continue
        after = time.time()
        print(f"Worker {worker}, try {tries}, `len(repositories) = {len(repositories)}`, `time = {after-now}`")

if __name__ == "__main__":
    num_processes = 50

    processes = []
    for i in range(num_processes):
        process = multiprocessing.Process(target=hammer_with_requests, args=(i,))
        processes.append(process)
        process.start()

    # keep the parent process alive while the workers hammer the endpoint
    for process in processes:
        process.join()

After a few minutes, the message Failed to get repositories list: GET: error 500: b'Internal Server Error', 0 attempts left: Internal Server Error shows up in the terminal, and the PendingRollbackError pops up on Sentry.

In addition, running the code above has undesired side effects: it causes Galaxy to open new database connections until all available connection slots on the database server are exhausted, at which point other parts of Galaxy start malfunctioning.

galaxy@sn06:~$ gxadmin query latest-users
psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

Other examples:

OperationalError

(psycopg2.OperationalError) connection to server at "****.galaxyproject.eu" (xxx.xxx.xxx.xxx), port **** failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/20/e3q8)

OperationalError

(psycopg2.OperationalError) connection to server at "****.galaxyproject.eu" (xxx.xxx.xxx.xxx), port **** failed: Connection timed out
Is the server running on that host and accepting TCP/IP connections?
(Background on this error at: https://sqlalche.me/e/20/e3q8)
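For context on the connection exhaustion: this is how SQLAlchemy's pool normally caps the number of connections a single process opens. The values and DSN below are made up and are not useGalaxy.eu's actual configuration:

from sqlalchemy import create_engine

# hypothetical values and DSN, not Galaxy's actual engine configuration
engine = create_engine(
    "postgresql+psycopg2://galaxy:***@db.example.org/galaxy",
    pool_size=5,         # connections kept open in the pool
    max_overflow=10,     # extra connections allowed under load (15 in total)
    pool_timeout=30,     # seconds to wait for a free connection before raising
    pool_pre_ping=True,  # test connections on checkout and discard stale ones
)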

Expected behavior
Either HTTP 200 responses, or 502 and 503 errors. The latter are expected when the server is unable to keep up with the amount of requests (for example, when setting num_processes = 100 or higher in the code snippet above). This API endpoint is costly in general; the server takes about 10 seconds to provide a response (but that is a separate issue).

Screenshots

Additional context

Replacing _get_tool_shed_repositories in lib/galaxy/webapps/galaxy/services/tool_shed_repositories.py with

    def _get_tool_shed_repositories(self, **kwd):
        stmt = select(ToolShedRepository)
        for key, value in kwd.items():
            if value is not None:
                column = ToolShedRepository.__table__.c[key]  # type:ignore[attr-defined]
                stmt = stmt.filter(column == value)
        stmt = stmt.order_by(ToolShedRepository.name).order_by(cast(ToolShedRepository.ctx_rev, Integer).desc())
        # logging.info("Listing tool_shed_repositories using the session as context manager")
        with self._install_model_context() as session:
            check_database_connection(session)
            return session.scalars(stmt).all()

fixes the issue (see this error-free tool installation CI report from yesterday). It is unclear to me whether the with self._install_model_context() as session: block alone, the call to check_database_connection(session) alone, or the combination of both is needed. This change also fixes the side effects mentioned earlier.
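For reference, my (non-expert) understanding of why the with block matters is that it guarantees the session is closed, and its connection returned to the pool, even when the query raises. A generic SQLAlchemy sketch of that pattern, with made-up names that only mimic what I assume self._install_model_context() does:

from contextlib import contextmanager

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# hypothetical engine and session factory, not Galaxy's install model
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")
SessionFactory = sessionmaker(engine)

@contextmanager
def install_model_context():
    # yield a session and always clean it up afterwards
    session = SessionFactory()
    try:
        yield session
    except Exception:
        # roll back the (possibly invalidated) transaction so the connection
        # does not linger in the state that produces PendingRollbackError
        session.rollback()
        raise
    finally:
        # close() returns the connection to the pool
        session.close()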

From my point of view, the following issues and PRs are related to this problem or can be useful to understand it:

I assume the fix above works because, under sufficiently high load, calls are leaving behind an invalidated connection that must be explicitly rolled back, as the SQLAlchemy docs state. I am no expert in SQLAlchemy, but I suspect it has something to do with the fact that scoped_session.remove() is probably not being called after the request has been served (see Using Thread-Local Scope with Web Applications). I hope we can use this issue to collaboratively shed some light on this.
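For reference, the per-request cleanup described in that section of the SQLAlchemy docs boils down to something like the sketch below (the request handler and query are made up; this is not Galaxy's actual middleware):

from sqlalchemy import create_engine, text
from sqlalchemy.orm import scoped_session, sessionmaker

# hypothetical engine, not the actual Galaxy install model engine
engine = create_engine("postgresql+psycopg2://galaxy:***@db.example.org/galaxy")
Session = scoped_session(sessionmaker(engine))

def do_work():
    # hypothetical application code; every call on this thread gets the
    # same thread-local session through the scoped_session registry
    return Session.execute(text("SELECT 1")).scalar()

def handle_request():
    # stand-in for a request handler or middleware hook
    try:
        return do_work()
    finally:
        # discard the thread-local session once the request has been served:
        # this rolls back any open (possibly invalidated) transaction and
        # returns the connection to the pool, so the next request handled by
        # this thread starts with a clean session
        Session.remove()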

@kysrpex
Contributor Author

kysrpex commented Jul 31, 2024

Pinging @davelopez.

@mvdbeek
Member

mvdbeek commented Jul 31, 2024

The explicit session management seems fine as a fix; can you PR this? We should be closing all scoped sessions when the request ends, but that might not work for the install model. Are you using a separate database for this (install_database_connection in galaxy.yml)?

@kysrpex
Contributor Author

kysrpex commented Jul 31, 2024

The explicit session management seems fine as a fix; can you PR this? We should be closing all scoped sessions when the request ends, but that might not work for the install model. Are you using a separate database for this (install_database_connection in galaxy.yml)?

We are using the same database connection for user data and tool shed install data.

Ok, I'll PR this 👍

mvdbeek added a commit to mvdbeek/galaxy that referenced this issue Jul 31, 2024
This is probably the more general fix for
galaxyproject#18627.
@kysrpex kysrpex closed this as completed Aug 6, 2024