Encoding mess recap #332
Can you register a custom typecaster for the text type, to return |
As discussed, let's:
|
Thanks @dvarrazzo, I didn't see your post before posting mine. |
As far as I understand, the problem is that the encoding we need to decode the query field from |
Ah sorry: creating a specific typecaster doesn't work in psycopg2, because the decoding is done internally, before passing the data to the Python function, which receives an already decoded string. If you only need to survive the accident of receiving a badly encoded char, getting a bytea from the query seems enough:

# Conn 1, in bad encoding
conn = psycopg2.connect("dbname=latin1 client_encoding=sql_ascii")
cur = conn.cursor()
cur.execute("select 'à'".encode("latin1")) # Conn 2 in utf8
...
cur.execute("select query::bytea from pg_stat_activity where datname = 'latin1'");
In [20]: cur.fetchall()
Out[20]: [(<memory at 0x7fb89ed57dc0>,)]
In [21]: bytes(_20[0][0])
Out[21]: b"select '\xe0', pg_sleep(10)"
In [22]: bytes(_20[0][0]).decode("utf8", "replace")
Out[22]: "select '�', pg_sleep(10)"

Do you need to know more precisely the content of the query? |
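To make the bytea route concrete, here is a minimal sketch of decoding it with the monitored database's own server encoding rather than a hardcoded "utf8" (the datname value and the "replace" policy are illustrative, and PostgreSQL encoding names don't always map one-to-one to Python codec names, e.g. SQL_ASCII has no Python codec):

import psycopg2

# Monitoring connection (any client_encoding).
conn = psycopg2.connect("")
cur = conn.cursor()

# Server-side encoding of the database we are watching.
cur.execute(
    "select pg_encoding_to_char(encoding) from pg_database where datname = %s",
    ("latin1",),
)
db_encoding = cur.fetchone()[0]  # e.g. 'LATIN1', which Python's codecs accept

cur.execute(
    "select query::bytea from pg_stat_activity where datname = %s",
    ("latin1",),
)
queries = [
    bytes(raw).decode(db_encoding, "replace")
    for (raw,) in cur.fetchall()
    if raw is not None
]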
I think we would still hit problem number 3 (which we unfortunately couldn't reproduce) |
Yes, problem n°3 being But, indeed, it might be worth moving to psycopg 3 while working on this issue. |
Is it widely available or is it only available on recent distributions? |
In psycopg3 the problem can be solved easily, because the decoding step is in the loader:

In [1]: import psycopg
In [2]: conn1 = psycopg.connect("dbname=latin1 client_encoding=sql_ascii")
In [3]: conn1.execute("select 'à'".encode("latin1"))
Out[3]: <psycopg.Cursor [TUPLES_OK] [INTRANS] (user=piro database=latin1) at 0x7f378203b740>
In [4]: conn = psycopg.connect()
In [5]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Traceback (most recent call last)
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 8: invalid continuation byte
In [6]: class SafeText(psycopg.types.string.TextLoader):
...: def load(self, data):
...: try:
...: return super().load(data)
...: except Exception:
...: return bytes(data).decode(self._encoding, "replace")
...:
In [7]: conn.adapters.register_loader("text", SafeText)
In [8]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Out[8]: "select '�'"

You can even use extra knowledge about the encoding of the database you are reading the query from, and use that one in the custom loader, instead of |
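As a minimal sketch of that variant, assuming the monitored database is known to be latin1 (the class name and the hardcoded encoding are illustrative, not part of psycopg or pg_activity):

import psycopg

class KnownEncodingText(psycopg.types.string.TextLoader):
    def load(self, data):
        # Decode with the encoding of the database the query comes from,
        # not with the client_encoding of this monitoring connection.
        return bytes(data).decode("latin1", "replace")

conn = psycopg.connect()
conn.adapters.register_loader("text", KnownEncodingText)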
(rather than the above, subclassing the built-in |
No, it's not widely available in distributions yet, which is indeed a problem for us.
|
Is |
Yes, Psycopg 3 is packaged by Debian (https://tracker.debian.org/pkg/psycopg3), but not yet available in stable. For RPM-based distributions, it's packaged by PGDG, so that would work for us.
|
ahah, I forgot https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016031 |
So, if psycopg 3 was packaged in apt pgdg too, pg_activity would be mostly covered, right? |
Yes.
|
So, the approach I'm taking to hopefully fix these issues is the following:
|
This is a recap of the encoding problem we face(d).
So if we had this:
PostgreSQL used to complain and crash pg_activity:
We added a conversion in the query:
\'s. The following kind of queries would fail with:
We added a replace in the query:
For all these problems, the query crashes, so even if we catch the exception we have nothing to display. It would therefore be better to manage this in Python, where we would have more granularity.
Solutions:
Getting the string from PostgreSQL and converting it to the correct encoding in Python wouldn't work (that's problem no1).
Getting a bytea and converting it in Python wouldn't work (that's problems no2 & no3).
Asking psycopg2 to return bytes instead of unicode strings, as proposed by @dvarrazzo a long time ago. This would convert all strings to bytes, so we would have to convert all of them in pg_activity based on the encoding we fetch from pg_database.encoding. I tried this with a custom cursor factory to do the automatic conversion when there are bytes in the result set (except I didn't drop the replace and cast in the query, which we should do); a sketch of that kind of setup is below. It looks rather complicated. Is there another solution?
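A minimal sketch of that third option, assuming the monitored database's encoding maps to a Python codec and a psycopg2 version that provides psycopg2.extensions.BYTES / BYTESARRAY (class and attribute names are illustrative, not existing pg_activity code):

import psycopg2
import psycopg2.extensions

class DecodingCursor(psycopg2.extensions.cursor):
    # Encoding of the monitored database, e.g. looked up via pg_database.encoding.
    db_encoding = "latin1"

    def fetchall(self):
        return [self._decode_row(row) for row in super().fetchall()]

    def _decode_row(self, row):
        # Decode every bytes value with the target database's encoding,
        # replacing undecodable bytes instead of crashing.
        return tuple(
            v.decode(self.db_encoding, "replace") if isinstance(v, bytes) else v
            for v in row
        )

conn = psycopg2.connect("")
# Ask psycopg2 to hand text columns back as bytes instead of str.
psycopg2.extensions.register_type(psycopg2.extensions.BYTES, conn)
psycopg2.extensions.register_type(psycopg2.extensions.BYTESARRAY, conn)
cur = conn.cursor(cursor_factory=DecodingCursor)
cur.execute("select query from pg_stat_activity")
rows = cur.fetchall()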