-
Notifications
You must be signed in to change notification settings - Fork 180
Fingerprinting
This document exists to enable implementation of a fingerprint
method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query
(Ruby) and pg_query.go
(Go).
The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.
It is inspired by what Postgres' pg_stat_statements
extension does in order to generate a queryid
that is used to group similar queries together for statistics purposes.
Different from pg_stat_statements
, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on master and slave as the same, even if the relation OIDs are different (pg_stat_statements
would see different relation oids as different queries).
This is what pg_query
(Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.
- V1 Fingerprints are represented as
0x01
+ SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes) - In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
- Nodes in the AST are traversed top-down
- All nodes in a multi-statement AST are processed (so
SELECT 1; SELECT 2;
parsed as one statement will give you one fingerprint) - Node type is written first, in Postgres source convention (e.g.
A_Const
) - Then, in normal cases, going by alphabetic order of the field names:
- In some cases, the field is not written at all:
- nil values
- 0 value for integer/byte values
- false values for booleans
- empty lists
- empty strings
- Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
- Otherwise, first the field name is written (as it appears in the Postgres source in
underscore_notation
), followed by the field's value as one or more strings:- Value nodes will be output with their type and fields, e.g. a Float would be
["Float", "str", "0.5"]
(not just"0.5"
) - "true" for booleans that are true
- "123" for integer values
- "0" for an enum value
- Enums are written as their integer values
- Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
- Value nodes will be output with their type and fields, e.g. a Float would be
- In some cases, the field is not written at all:
- Special cases
-
List
nodes (including a simple array that contains nodes)- If parent field name is
fromClause
,targetList
,cols
orrexpr
:- Fingerprint each node, and store the resulting tokens in an array
- Ignore duplicate nodes (based on equivalent fingerprint tokens)
- Sort nodes by its tokens (simple string comparison)
- Write sorted tokens into the result hash
- Otherwise just fingerprint each node in original order
- If parent field name is
- Node types to be ignored
A_Const
Alias
Paramref
SetToDefault
- Value nodes of type
IntList
,OidList
orNull
- Attributes to be ignored
-
location
(for all nodes) -
ResTarget.name
(if parent field name istargetList
) PrepareStmt.name
ExecuteStmt.name
DeallocateStmt.name
TransactionStmt.options
TransactionStmt.gid
-
-
-
Attributes to be ignored:
ResTarget.name
only applies toSelectStmt
parent nodes (if parent field name istargetList
)- In version 1 this was a problem with
UpdateStmt
in particular, where the name is significant
- In version 1 this was a problem with
This version retains the 0x01
prefix of version 1 due to it being a minor change only.