Encoding mess recap #332

Closed
blogh opened this issue Jan 11, 2023 · 17 comments · Fixed by #345


blogh commented Jan 11, 2023

This is a recap of the encoding problem we face(d).

  1. The queries listed in pg_stat_activity are encoded with the encoding of the database they run in. (FATAL: 'utf-8' codec can't decode byte 0xe7 in position 128: invalid continuation byte #149)

So if we had this:

denis@postgres=# CREATE DATABASE latin1 ENCODING 'latin1' TEMPLATE template0 LC_COLLATE 'fr_FR.latin1' LC_CTYPE 'fr_FR.latin1';
CREATE DATABASE
denis@postgres=# \c latin1 
You are now connected to database "latin1" as user "denis".
denis@latin1=# CREATE TABLE test (data text);
CREATE TABLE
denis@latin1=# BEGIN;
BEGIN
denis@latin1=# INSERT INTO test VALUES ('é');
INSERT 0 1

PostgreSQL used to complain and crash pg_activity:

>>> import psycopg2
>>> conn = psycopg2.connect(database="postgres")
>>> conn.encoding
'UTF8'
>>> cur = conn.cursor()
>>> cur.execute("select query from pg_stat_activity")
>>> cur.fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 26: invalid continuation byte

We added a conversion in the query:

convert_from(sq.query::bytea, coalesce(pg_catalog.pg_encoding_to_char(b.encoding), 'UTF8')) AS query
  2. It still failed in some cases (Encoding bug, still not out of the woods ... #275) where the query contains backslashes. The following kinds of queries:
SELECT '\It`s gona crash', pg_sleep(10);
select '123' ~ '\d+', pg_sleep(3);
select 'Hello world!\n', pg_sleep(3);

would fail with:

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea

We added a replace in the query:

convert_from(replace(a.query, '\', '\\')::bytea, coalesce(pg_catalog.pg_encoding_to_char(b.encoding), 'UTF8')) AS query
  3. It still isn't enough, because sometimes strings end up as invalid bytea sequences (bytea field causing stacktrace #302):
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea

In all these cases the query itself crashes, so even if we catch the exception we have nothing to display. It would therefore be better to handle this in Python, where we would have more granularity.

Solutions:

  • getting the string from PostgreSQL and converting it to the correct encoding in Python wouldn't work (that's problem no. 1);

  • getting a bytea and converting it in Python wouldn't work either (problems no. 2 and 3);

  • asking psycopg2 to return bytes instead of unicode strings, as proposed by @dvarrazzo a long time ago. This would return all strings as bytes, so we would have to convert all of them in pg_activity based on the encoding we fetch from pg_database.encoding. I tried this with a custom cursor factory to do the automatic conversion when there are bytes in the result set (except I didn't drop the replace and cast in the query, which we should do). It looks rather complicated. Is there another solution?

@blogh blogh self-assigned this Jan 11, 2023
@blogh blogh added the bug label Jan 11, 2023
@dvarrazzo

Can you register a custom typecaster for the text type, to return data.decode(conn.encoding, 'replace')?


blogh commented Jan 11, 2023

As discussed, let's:

  • add the register_type() only for the cursors that need it;
  • keep the data in bytes until we display them to avoid doing work on stuff we don't need.
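
A minimal pure-Python sketch of that second point, with hypothetical names (this is not pg_activity's actual API): keep the value as bytes until display time, then decode it with the encoding of the database it came from, using errors="replace" so a stray byte can never raise:

```python
def decode_for_display(raw, encoding):
    """Decode a query fetched as bytes, only when it is about to be shown.

    `encoding` is the encoding of the database the query ran in (as read
    from pg_database.encoding).  errors="replace" maps any byte that is
    invalid in that encoding to U+FFFD instead of raising.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding or "utf-8", errors="replace")
    return raw  # already a str, nothing to do


latin1_query = "select 'é', pg_sleep(10)".encode("latin1")
print(decode_for_display(latin1_query, "latin1"))  # decodes cleanly
print(decode_for_display(latin1_query, "utf-8"))   # bad byte replaced, no crash
```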


blogh commented Jan 11, 2023

Thanks @dvarrazzo, I didn't see your post before posting mine.
I'll check with @dlax.


dlax commented Jan 11, 2023

Can you register a custom typecaster for the text type, to return data.decode(conn.encoding, 'replace')?

As far as I understand, the problem is that the encoding we need to decode the query field from pg_stat_activity is probably not the connection encoding in general; rather it's the encoding of the database the query is running in.

@dvarrazzo

Ah sorry: creating a specific typecaster doesn't work in psycopg2 because the decode is done internally, before passing the data to the Python function, which receives a str. This comes from the fact that psycopg2 is pretty much totally text-based. In psycopg 3 it can be done, because it's binary-based and the loaders receive bytes, but let's first see if it's useful.

If you only need to survive the accident of receiving a badly encoded char, getting bytea from the query seems enough:

# Conn 1, in bad encoding
import psycopg2

conn = psycopg2.connect("dbname=latin1 client_encoding=sql_ascii")
cur = conn.cursor()
cur.execute("select 'à'".encode("latin1"))
# Conn 2 in utf8
...
cur.execute("select query::bytea from pg_stat_activity where datname = 'latin1'")

In [20]: cur.fetchall()
Out[20]: [(<memory at 0x7fb89ed57dc0>,)]

In [21]: bytes(_20[0][0])
Out[21]: b"select '\xe0', pg_sleep(10)"

In [22]: bytes(_20[0][0]).decode("utf8", "replace")
Out[22]: "select '�', pg_sleep(10)"

do you need to know more precisely the content of the query?


blogh commented Jan 11, 2023

I think we would still hit problem number 3 (which we unfortunately couldn't reproduce).


dlax commented Jan 11, 2023

Yes, problem n°3 being query::bytea producing invalid_text_representation: invalid input syntax for type bytea on postgres side.

But, indeed, it might be worth moving to psycopg 3 while working on this issue.


blogh commented Jan 11, 2023

Is it widely available, or only on recent distributions?

@dvarrazzo

In psycopg3 the problem can be solved easily, because the decoding step is in the loader:

In [1]: import psycopg

In [2]: conn1 = psycopg.connect("dbname=latin1 client_encoding=sql_ascii")

In [3]: conn1.execute("select 'à'".encode("latin1"))
Out[3]: <psycopg.Cursor [TUPLES_OK] [INTRANS] (user=piro database=latin1) at 0x7f378203b740>

In [4]: conn = psycopg.connect()

In [5]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 8: invalid continuation byte

In [6]: class SafeText(psycopg.types.string.TextLoader):
   ...:     def load(self, data):
   ...:         try:
   ...:             return super().load(data)
   ...:         except Exception:
   ...:             return bytes(data).decode(self._encoding, "replace")
   ...: 

In [7]: conn.adapters.register_loader("text", SafeText)

In [8]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Out[8]: "select '�'"

You can even use extra knowledge about the encoding of the database you are reading the query from, and use that one in the custom loader instead of self._encoding.

@dvarrazzo

(Rather than subclassing the built-in TextLoader as above, it would be wiser to write a Loader from scratch, starting by copying TextLoader from the source code, because _encoding is not part of the public interface.)


dlax commented Jan 11, 2023 via email

@dvarrazzo

Is pg_activity packaged in distributions? Other programs are starting to be based on psycopg 3 (pgadmin, pgcli...), so I'm expecting it to land in distros soon.


dlax commented Jan 11, 2023

Yes, pg_activity is packaged either directly in most distributions or through the PGDG repository (which many Postgres people obviously use), so recent versions are even available, through backports, in most stable or LTS distributions.

Psycopg 3 is packaged by Debian https://tracker.debian.org/pkg/psycopg3 [1], but not yet available in stable. For RPM-based distributions, it's packaged by PGDG, so that would work for us.

In pg_activity I think we cannot really move faster than distributions... unless our users accept staying on an old version, which is not quite our policy.

Footnotes

  [1]: strangely, it's named psycopg3!


dlax commented Jan 11, 2023

strangely, it's named psycopg3!

ahah, I forgot https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016031

@dvarrazzo

So, if psycopg 3 was packaged in apt pgdg too, pg_activity would be mostly covered, right?


dlax commented Jan 11, 2023 via email


dlax commented Feb 27, 2023

So, the approach I'm taking to hopefully fix these issues is the following:

  • load TEXT values as bytes whenever there's a query column from pg_stat_activity (or similar) involved
  • also load database encoding (not the one we're connected to, the one the query is running in)
  • decode bytes values when loading each row with this database encoding (using a row factory, as this cannot be done with a psycopg Loader -- as far as I understand -- because we need the whole row, not a single value)
  • use .decode(..., errors="replace") to prevent crash
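
The last two steps can be sketched in plain Python (decode_row is a hypothetical name for illustration, not the code that actually landed): decode every bytes value in a result row with the encoding of the database the query is running in, never the connection's encoding, and let errors="replace" guarantee the decode itself cannot raise:

```python
def decode_row(row, db_encoding):
    """Decode bytes values in one result row with the encoding of the
    database the query runs in; non-bytes values pass through unchanged."""
    return tuple(
        value.decode(db_encoding, errors="replace")  # can never raise
        if isinstance(value, bytes)
        else value
        for value in row
    )


# (pid, datname, query) as returned with TEXT loaded as bytes,
# for a query running in a latin1-encoded database:
row = (4242, "latin1", "select 'é', pg_sleep(10)".encode("latin1"))
print(decode_row(row, "latin1"))
```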

@dlax dlax closed this as completed in #345 Mar 2, 2023