Encoding mess recap #332

Closed
blogh opened this issue Jan 11, 2023 · 17 comments · Fixed by #345


blogh commented Jan 11, 2023

This is a recap of the encoding problem we face(d).

  1. The queries listed in pg_stat_activity are encoded with the encoding of the database they run in. (FATAL: 'utf-8' codec can't decode byte 0xe7 in position 128: invalid continuation byte #149)

So if we had this:

denis@postgres=# CREATE DATABASE latin1 ENCODING 'latin1' TEMPLATE template0 LC_COLLATE 'fr_FR.latin1' LC_CTYPE 'fr_FR.latin1';
CREATE DATABASE
denis@postgres=# \c latin1 
You are now connected to database "latin1" as user "denis".
denis@latin1=# CREATE TABLE test (data text);
CREATE TABLE
denis@latin1=# BEGIN;
BEGIN
denis@latin1=# INSERT INTO test VALUES ('é');
INSERT 0 1

PostgreSQL used to complain and crash pg_activity:

>>> import psycopg2
>>> conn = psycopg2.connect(database="postgres")
>>> conn.encoding
'UTF8'
>>> cur = conn.cursor()
>>> cur.execute("select query from pg_stat_activity")
>>> cur.fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 26: invalid continuation byte

We added a conversion in the query:

convert_from(sq.query::bytea, coalesce(pg_catalog.pg_encoding_to_char(b.encoding), 'UTF8')) AS query
  2. It still failed in some cases (Encoding bug, still not out of the woods ... #275) where the query contains backslashes. The following kinds of queries:
SELECT '\It`s gona crash', pg_sleep(10);
select '123' ~ '\d+', pg_sleep(3);
select 'Hello world!\n', pg_sleep(3);

would fail with:

psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea

We added a replace in the query:

convert_from(replace(a.query, '\', '\\')::bytea, coalesce(pg_catalog.pg_encoding_to_char(b.encoding), 'UTF8')) AS query
  3. It still isn't enough, because sometimes strings end up as invalid bytea sequences (bytea field causing stacktrace #302):
psycopg2.errors.InvalidTextRepresentation: invalid input syntax for type bytea

In all these cases the query itself crashes, so even if we catch the exception we have nothing to display. It would therefore be better to handle this in Python, where we would have more granularity.

Solutions:

  • getting the string from PostgreSQL and converting it to the correct encoding in Python wouldn't work (that's problem no. 1);

  • getting a bytea and converting it in Python wouldn't work either (problems no. 2 and 3);

  • asking psycopg2 to return bytes instead of unicode strings, as proposed by @dvarrazzo a long time ago. This would return all strings as bytes, so we would have to convert all of them in pg_activity based on the encoding we fetch from pg_database.encoding. I tried this with a custom cursor factory to do the automatic conversion when there are bytes in the result set (except I didn't drop the replace and cast in the query, which we should do). It looks rather complicated. Is there another solution?

@blogh blogh self-assigned this Jan 11, 2023
@blogh blogh added the bug label Jan 11, 2023
@dvarrazzo

Can you register a custom typecaster for the text type, to return data.decode(conn.encoding, 'replace')?


blogh commented Jan 11, 2023

As discussed, let's:

  • add the register_type() only for the cursors that need it;
  • keep the data in bytes until we display them to avoid doing work on stuff we don't need.
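
A minimal pure-Python sketch of that second point, with hypothetical names (this is not pg_activity's actual API): keep the value as bytes until display time, then decode it with the encoding of the database it came from, using errors="replace" so a stray byte can never raise:

```python
def decode_for_display(raw, encoding):
    """Decode a query fetched as bytes, only when it is about to be shown.

    `encoding` is the encoding of the database the query ran in (as read
    from pg_database.encoding).  errors="replace" maps any byte that is
    invalid in that encoding to U+FFFD instead of raising.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding or "utf-8", errors="replace")
    return raw  # already a str, nothing to do


latin1_query = "select 'é', pg_sleep(10)".encode("latin1")
print(decode_for_display(latin1_query, "latin1"))  # decodes cleanly
print(decode_for_display(latin1_query, "utf-8"))   # bad byte replaced, no crash
```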


blogh commented Jan 11, 2023

Thanks @dvarrazzo, I didn't see your post before posting mine.
I'll check with @dlax.


dlax commented Jan 11, 2023

Can you register a custom typecaster for the text type, to return data.decode(conn.encoding, 'replace')?

As far as I understand, the problem is that the encoding we need to decode the query field from pg_stat_activity is probably not the connection encoding in general; rather it's the encoding of the database the query is running in.

@dvarrazzo

Ah sorry: creating a specific typecaster doesn't work in psycopg2 because the decode is done internally, before passing the data to the Python function, which receives a str. This comes from the fact that psycopg2 is pretty much totally text-based. In psycopg 3 it can be done, because it's binary-based and the loaders receive bytes, but let's first see if it's useful.

If you only need to survive the accident of receiving a badly encoded char, getting bytea from the query seems enough:

# Conn 1, in bad encoding
import psycopg2

conn = psycopg2.connect("dbname=latin1 client_encoding=sql_ascii")
cur = conn.cursor()
cur.execute("select 'à'".encode("latin1"))
# Conn 2 in utf8
...
cur.execute("select query::bytea from pg_stat_activity where datname = 'latin1'")

In [20]: cur.fetchall()
Out[20]: [(<memory at 0x7fb89ed57dc0>,)]

In [21]: bytes(_20[0][0])
Out[21]: b"select '\xe0', pg_sleep(10)"

In [22]: bytes(_20[0][0]).decode("utf8", "replace")
Out[22]: "select '�', pg_sleep(10)"

do you need to know more precisely the content of the query?


blogh commented Jan 11, 2023

I think we would still hit problem number 3 (which we unfortunately couldn't reproduce).


dlax commented Jan 11, 2023

Yes, problem n°3 being query::bytea producing invalid_text_representation: invalid input syntax for type bytea on postgres side.

But, indeed, it might be worth moving to psycopg 3 while working on this issue.


blogh commented Jan 11, 2023

Is it widely available, or only on recent distributions?

@dvarrazzo

In psycopg3 the problem can be solved easily, because the decoding step is in the loader:

In [1]: import psycopg

In [2]: conn1 = psycopg.connect("dbname=latin1 client_encoding=sql_ascii")

In [3]: conn1.execute("select 'à'".encode("latin1"))
Out[3]: <psycopg.Cursor [TUPLES_OK] [INTRANS] (user=piro database=latin1) at 0x7f378203b740>

In [4]: conn = psycopg.connect()

In [5]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 8: invalid continuation byte

In [6]: class SafeText(psycopg.types.string.TextLoader):
   ...:     def load(self, data):
   ...:         try:
   ...:             return super().load(data)
   ...:         except Exception:
   ...:             return bytes(data).decode(self._encoding, "replace")
   ...: 

In [7]: conn.adapters.register_loader("text", SafeText)

In [8]: conn.execute("select query from pg_stat_activity where datname = 'latin1'").fetchone()[0]
Out[8]: "select '�'"

You can even use extra knowledge about the encoding of the database you are reading the query from, and use that one in the custom loader instead of self._encoding.

@dvarrazzo

(Rather than subclassing the built-in TextLoader as above, it would be wiser to write a Loader from scratch, starting by copying TextLoader from the source code, because _encoding is not part of the public interface.)


dlax commented Jan 11, 2023 via email

@dvarrazzo

Is pg_activity packaged in distributions? Other programs are starting to be based on psycopg 3 (pgadmin, pgcli...), so I'm expecting it to land in distros soon.


dlax commented Jan 11, 2023

Yes, pg_activity is packaged either directly in most distributions or through the PGDG repository (which many Postgres people obviously use), so recent versions are even available, through backports, in most stable or LTS distributions.

Psycopg 3 is packaged by Debian https://tracker.debian.org/pkg/psycopg3 [1], but not yet available in stable. For RPM-based distributions, it's packaged by PGDG, so that would work for us.

In pg_activity I think we cannot really move faster than distributions... unless our users accept staying on an old version, which is not quite our policy.

Footnotes

  [1]: strangely, it's named psycopg3!


dlax commented Jan 11, 2023

strangely, it's named psycopg3!

ahah, I forgot https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1016031

@dvarrazzo

So, if psycopg 3 was packaged in apt pgdg too, pg_activity would be mostly covered, right?


dlax commented Jan 11, 2023 via email


dlax commented Feb 27, 2023

So, the approach I'm taking to hopefully fix these issues is the following:

  • load TEXT values as bytes whenever there's a query column from pg_stat_activity (or similar) involved
  • also load database encoding (not the one we're connected to, the one the query is running in)
  • decode bytes values when loading each row with this database encoding (using a row factory, as this cannot be done with a psycopg Loader -- as far as I understand -- because we need the whole row, not a single value)
  • use .decode(..., errors="replace") to prevent crash
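
The last two steps can be sketched in plain Python (decode_row is a hypothetical name for illustration, not the code that actually landed): decode every bytes value in a result row with the encoding of the database the query is running in, never the connection's encoding, and let errors="replace" guarantee the decode itself cannot raise:

```python
def decode_row(row, db_encoding):
    """Decode bytes values in one result row with the encoding of the
    database the query runs in; non-bytes values pass through unchanged."""
    return tuple(
        value.decode(db_encoding, errors="replace")  # can never raise
        if isinstance(value, bytes)
        else value
        for value in row
    )


# (pid, datname, query) as returned with TEXT loaded as bytes,
# for a query running in a latin1-encoded database:
row = (4242, "latin1", "select 'é', pg_sleep(10)".encode("latin1"))
print(decode_row(row, "latin1"))
```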

@dlax dlax closed this as completed in #345 Mar 2, 2023