Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad performance for SQLAlchemy >= 1.4 due to disabled caching #436

Closed
lonvia opened this issue Mar 31, 2023 · 8 comments · Fixed by #529
Closed

Bad performance for SQLAlchemy >= 1.4 due to disabled caching #436

lonvia opened this issue Mar 31, 2023 · 8 comments · Fixed by #529

Comments

@lonvia
Copy link

lonvia commented Mar 31, 2023

#338 has disable caching for all Geometry and Geography types. This effectively disables SQLAlchemy's caching mechanism for anything where geometry columns are involved.

Take this simple table:

import sqlalchemy as sa
from geoalchemy2 import Geometry

t = sa.Table('test', sa.MetaData(),
             sa.Column('id', sa.BigInteger, nullable=False, unique=True),
             sa.Column('centroid', Geometry(srid=4326, spatial_index=False)))

Selecting the full table will create a cache key:

>>> sa.select(t)._generate_cache_key()
CacheKey(key=(
  '0', 
  <class 'sqlalchemy.sql.selectable.Select'>, 
  '_raw_columns', 
  (
    (
      <Table object at 0x7fb2689d2390>, 
    ),
  ),
  '_label_style', 
  symbol('LABEL_STYLE_DISAMBIGUATE_ONLY'), 
  '_compile_options', 
  (
    <class 'sqlalchemy.sql.selectable.SelectState.default_select_compile_options'>, 
    ()
  ),
),)

while only selecting the geometry column does not:

>>> sa.select(t.c.centroid)._generate_cache_key()
None

If I change Geometry to cache_ok=True, then I measure about 20% performance gain in my application, so the difference is fairly significant.

@adrien-berchet
Copy link
Member

Hi @lonvia
Thanks for this report!
Indeed, I was not sure how the new cache mechanism would behave so I preferred to disable it (and I did not benchmarked this, 20% is indeed a lot 😕). And re-enabling it actually breaks a test, so I think we should keep it disabled for now. But at a first glance, it seems to me that the failure is caused by an almost impossible state which consists in passing a WKB fetched from a Postgres DB to a MySQL DB. I would have liked to be able to switch from one dialect to another with the same WKB/WKT elements but maybe it's a bit too ambitious. I will think about this...
In the mean time, if you want to propose a patch for this I will be happy to review it :)

@calebj
Copy link

calebj commented Jan 4, 2025

I figured it out. The conversion of the value of a WKBElement from memoryview to bytes doesn't happen when the compilation is cached, and as a result, a memoryview object gets passed as the first parameter in this query:

SELECT ST_AsText(ST_GeomFromWKB(%s, %s))

which happens the second time that statement is used in a query, as seen in the output with sqlalchemy echo enabled:

INFO     sqlalchemy.engine.Engine:base.py:1846 [no key 0.00008s] ()
INFO     sqlalchemy.engine.Engine:base.py:1846 ALTER TABLE lake ADD SPATIAL INDEX(geom);
INFO     sqlalchemy.engine.Engine:base.py:1846 [generated in 0.00020s] ()
INFO     sqlalchemy.engine.Engine:base.py:1846 INSERT INTO lake (geom) VALUES (ST_GeomFromText(%s, 4326))
INFO     sqlalchemy.engine.Engine:base.py:1846 [generated in 0.00023s] ('LINESTRING (0 0, 1 1)',)
INFO     sqlalchemy.engine.Engine:base.py:1846 SELECT ST_AsText(ST_GeomFromWKB(%s, %s)) AS `ST_AsText_1`
INFO     sqlalchemy.engine.Engine:base.py:1846 [generated in 0.00011s] (b'\x01\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?', 4326)
INFO     sqlalchemy.engine.Engine:base.py:1846 SELECT ST_SRID(ST_GeomFromWKB(%s, %s)) AS `ST_SRID_1`
INFO     sqlalchemy.engine.Engine:base.py:1846 [generated in 0.00007s] (b'\x01\x02\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\xf0?', 4326)
INFO     sqlalchemy.engine.Engine:base.py:1846 SELECT ST_AsText(ST_GeomFromWKB(%s, %s)) AS `ST_AsText_1`
INFO     sqlalchemy.engine.Engine:base.py:1846 [cached since 0.001253s ago] (<memory at 0x782808def280>, 4326)

I hacked around this by removing the convertion to a memoryview in from_shape(). Changing all of the type checks to bytes in test_shape.py causes all tox tests to pass with all cache_ok set to True. Progress!

The root cause is relying on the conversion in an expression compile function. I don't think statement compilation is the place for that, and the compiler cache is doing the right thing when faced with the same function being called with the same parameter type (WKBElement).

I know psycopg2 will convert a memoryview by default. So I tried making mysqlclient do the same by modifying the engine fixture in the tests:

            # For other dialects the engine is directly returned
            connect_args = {}

            if db_url.startswith("mysql://") or db_url.startswith("mariadb://"):
                # top of file:
                # from MySQLdb.converters import conversions as mysqldb_conv
                conv = connect_args['conv'] = mysqldb_conv.copy()

                def mv_conv(mv: memoryview, conv_dict: dict):
                    return mv.tobytes().hex()

                conv[memoryview] = mv_conv

            current_engine = create_engine(db_url, echo=_engine_echo, connect_args=connect_args)
            current_engine.update_execution_options(search_path=["gis", "public"])

But I ran into the problem that mysqlclient doesn't escape the bytes correctly:

E       sqlalchemy.exc.ProgrammingError: (MySQLdb.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '\x01\x02' at line 1")
E       [SQL: SELECT ST_AsText(ST_GeomFromWKB(%s, %s)) AS `ST_AsText_1`]
E       [parameters: (<memory at 0x7e37e0fab4c0>, 4326)]
E       (Background on this error at: https://sqlalche.me/e/20/f405)

It shouldn't be passing it as a literal... even if that was the intent, the correct syntax is X'0102...', not '\x01\x02'. For some reason, the driver is escaping the binary string output by the conversion function as if it were a string. Obviously, this doesn't happen when passing bytes objects as parameters directly.

Tracing it to the source for mysqlclient was a dead end, since the code would have to be modified to pass memoryview via _bytes_literal instead of to escape(), which jumps into the implementation of that function in C.

With the driver as a dead end, I looked to sqlalchemy hooks and added this to the engine fixture instead of the connect args:

    if current_engine.dialect.name in ["mysql", "mariadb"]:
        ...
        @event.listens_for(current_engine, "before_cursor_execute", retval=True)
        def before_cursor_execute(
                conn, cursor, statement, parameters, context, executemany
        ):
            if isinstance(parameters, (tuple, list)):
                parameters = tuple(x.tobytes() if isinstance(x, memoryview) else x
                                   for x in parameters)
            elif isinstance(parameters, dict):
                for k in parameters:
                    if isinstance(parameters[k], memoryview):
                        parameters[k] = parameters[k].tobytes()

            return statement, parameters

...and it works, with caching! I think the internals will need some changes to play nice and only pass bytes objects to the DB, but this should get others going if needed.

Patch
diff --git a/geoalchemy2/admin/dialects/mysql.py b/geoalchemy2/admin/dialects/mysql.py
index 5b7ef6c..18dfe03 100644
--- a/geoalchemy2/admin/dialects/mysql.py
+++ b/geoalchemy2/admin/dialects/mysql.py
@@ -172,9 +172,6 @@ def _compile_GeomFromText_MySql(element, compiler, **kw):
 
 def _compile_GeomFromWKB_MySql(element, compiler, **kw):
     element.identifier = "ST_GeomFromWKB"
-    wkb_data = list(element.clauses)[0].value
-    if isinstance(wkb_data, memoryview):
-        list(element.clauses)[0].value = wkb_data.tobytes()
     compiled = compiler.process(element.clauses, **kw)
     srid = element.type.srid
 
diff --git a/geoalchemy2/types/__init__.py b/geoalchemy2/types/__init__.py
index 7696698..c044413 100644
--- a/geoalchemy2/types/__init__.py
+++ b/geoalchemy2/types/__init__.py
@@ -110,7 +110,7 @@ class _GISType(UserDefinedType):
     """ This is the way by which spatial operators are defined for
         geometry/geography columns. """
 
-    cache_ok = False
+    cache_ok = True
     """ Disable cache for this type. """
 
     def __init__(
@@ -245,7 +245,7 @@ class Geometry(_GISType):
     """ The element class to use. Used by the parent class'
         ``result_processor`` method. """
 
-    cache_ok = False
+    cache_ok = True
     """ Disable cache for this type. """
 
 
@@ -275,7 +275,7 @@ class Geography(_GISType):
     """ The element class to use. Used by the parent class'
         ``result_processor`` method. """
 
-    cache_ok = False
+    cache_ok = True
     """ Disable cache for this type. """
 
 
@@ -315,7 +315,7 @@ class Raster(_GISType):
     """ The element class to use. Used by the parent class'
         ``result_processor`` method. """
 
-    cache_ok = False
+    cache_ok = True
     """ Disable cache for this type. """
 
     def __init__(self, spatial_index=True, from_text=None, name=None, nullable=True) -> None:
diff --git a/tests/conftest.py b/tests/conftest.py
index df75114..a588d01 100644
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -3,7 +3,7 @@ from pathlib import Path
 
 import pytest
 from sqlalchemy import MetaData
-from sqlalchemy import create_engine
+from sqlalchemy import create_engine, event
 from sqlalchemy import text
 from sqlalchemy.dialects.mysql.base import MySQLDialect
 from sqlalchemy.dialects.sqlite.base import SQLiteDialect
@@ -236,6 +236,20 @@ def engine(tmpdir, db_url, _engine_echo, _require_all_dialects):
             else:
                 pytest.skip(reason=msg)
 
+        @event.listens_for(current_engine, "before_cursor_execute", retval=True)
+        def before_cursor_execute(
+                conn, cursor, statement, parameters, context, executemany
+        ):
+            if isinstance(parameters, (tuple, list)):
+                parameters = tuple(x.tobytes() if isinstance(x, memoryview) else x
+                                   for x in parameters)
+            elif isinstance(parameters, dict):
+                for k in parameters:
+                    if isinstance(parameters[k], memoryview):
+                        parameters[k] = parameters[k].tobytes()
+
+            return statement, parameters
+
     yield current_engine
     current_engine.dispose()

@adrien-berchet
Copy link
Member

adrien-berchet commented Jan 4, 2025

Ooooh that's super interesting @calebj , thank you very much for digging into this!!!
I will try to work on this soon to see how we can use your patch. And I will have to see how it can be used in the main code, not only in the tests.

@adrien-berchet
Copy link
Member

Hi there.
Based on the suggestion of @calebj I was able to enable the cache for all the types. I also created a plugin for SQLAlchemy Engine objects to automatically attach the relevant event listeners based on the dialect of the engine. These two changes should improve both the performance and the usability of GeoAlchemy2.
Just to be sure I didn't break anything, could you guys test the branch of #529 on your use cases please? @lonvia @calebj

@calebj
Copy link

calebj commented Jan 8, 2025

I use PostgreSQL with asyncpg so I have no use cases that need the plugin, but the changes didn't break anything as a side effect when I added it to the engine.

My only concern is regarding drivers that properly support conversion functions (most of them do). Checking all of the parameters in each statement is a heavy-handed fix to work around bad behavior of the driver used in the test suite. Most of the others support registering converter functions that work. Unfortunately, there is no universal case that can be added in geoalchemy. It would be nice to be able to disable the hook, maybe with a URL parameter.

EDIT: I linked the changes to implement the option as a URL parameter on the PR.

@adrien-berchet
Copy link
Member

Thank you very much for your feedback and suggestion @calebj !
But I'm not sure I understand your concerns and I suspect you missed something about the event listeners added by the plugin so let me clarify just in case: the plugin selects the event listeners and attaches them to the engine based on the dialect of the engine. So for PostgreSQL no event listeners are actually attached to the engine, so nothing happens (the function is not even called). So the workaround for casting the parameters will only impact the performance for the MySQL and MariaDB but will have zero impact on other dialects.
If this was clear but your concerns are about something else, could you elaborate a bit more to help me understand please?

@calebj
Copy link

calebj commented Jan 9, 2025

I understand how the plugin works. The concern is that the workaround applies to all MySQL drivers, when the workaround might only be necessary for some of them. We already know mysqldb misbehaves, but all tests pass using a converter on pymysql. Although it's impossible to test the async drivers aiomysql and asyncmy in this test suite, they should work the same way since they are heavily based on pymysql. That's why I think a parameter to disable the hook is called for.

@adrien-berchet
Copy link
Member

Ohhh ok, I misunderstood, I thought you were speaking about dialects instead of drivers, I see your point now, thanks for your explanation!
So it looks indeed hard to automatically chose the proper behavior depending on the driver, and even if we could do it now it will be hard to know if the behavior of one driver changes at some point. So your solution looks good, I'm going to check it and merge it.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants