-
Notifications
You must be signed in to change notification settings - Fork 180
Fingerprinting
This document exists to enable implementation of a fingerprint
method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query
(Ruby) and pg_query.go
(Go).
The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.
It is inspired by what Postgres' pg_stat_statements
extension does in order to generate a queryid
that is used to group similar queries together for statistics purposes.
Different from pg_stat_statements
, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on different servers as the same, even if the relation OIDs are different (pg_stat_statements
would see different relation oids as different queries).
This is what pg_query
(Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.
- V1 Fingerprints are represented as
0x01
+ SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes) - In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
- Nodes in the AST are traversed top-down
- All nodes in a multi-statement AST are processed (so
SELECT 1; SELECT 2;
parsed as one statement will give you one fingerprint) - Node type is written first, in Postgres source convention (e.g.
A_Const
) - Then, in normal cases, going by alphabetic order of the field names:
- In some cases, the field is not written at all:
- nil values
- 0 value for integer/byte values
- false values for booleans
- empty lists
- empty strings
- Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
- Otherwise, first the field name is written (as it appears in the Postgres source in
underscore_notation
), followed by the field's value as one or more strings:- Value nodes will be output with their type and fields, e.g. a Float would be
["Float", "str", "0.5"]
(not just"0.5"
) - "true" for booleans that are true
- "123" for integer values
- "0" for an enum value
- Enums are written as their integer values
- Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
- Value nodes will be output with their type and fields, e.g. a Float would be
- In some cases, the field is not written at all:
- Special cases
-
List
nodes (including a simple array that contains nodes)- If parent field name is
fromClause
,targetList
,cols
orrexpr
:- Fingerprint each node, and store the resulting tokens in an array
- Ignore duplicate nodes (based on equivalent fingerprint tokens)
- Sort nodes by its tokens (simple string comparison)
- Write sorted tokens into the result hash
- Otherwise just fingerprint each node in original order
- If parent field name is
- Node types to be ignored
A_Const
Alias
ParamRef
SetToDefault
- Value nodes of type
IntList
,OidList
orNull
- Attributes to be ignored
-
location
(for all nodes) -
ResTarget.name
(if parent field name istargetList
) PrepareStmt.name
ExecuteStmt.name
DeallocateStmt.name
TransactionStmt.options
TransactionStmt.gid
-
-
-
Attributes to be ignored:
ResTarget.name
should only be ignored if parent node is aSelectStmt
and parent field name istargetList
- In version 1 this was a problem with
UpdateStmt
in particular, where the name is significant
- In version 1 this was a problem with
This version retains the 0x01
prefix of version 1 due to it being a minor change only.
-
Attributes to be ignored:
DeclareCursorStmt.portalname
FetchStmt.portalname
ClosePortalStmt.portalname
This version retains the 0x01
prefix of version 1 due to it being a minor change only.
-
Attributes to be ignored:
-
RangeVar.relname
(if node also hasRangeVar.relpersistence = "t"
)
-
-
Special cases: List nodes where parent field name is
valuesLists
- Follow same logic described for fromClause/targetList/cols/rexpr
This version retains the 0x01
prefix of version 1 due to it being a minor change only.
- Same fingerprinting rules as 1.3, but in addition:
-
Attributes to be ignored:
RawStmt.stmt_len
RawStmt.stmt_location
Note: Please note that whilst version 3.0 is inspired by prior versions, the complete (updated) fingerprinting rules are repeated for this version. When implementing based on the new Protobuf parse tree format in libpg_query, take care of field names - they need to match Postgres source, which is represented in the Protobuf descriptor's json_name
field names - the Protobuf field names themselves are often subtly different in terms of capitalization.
When the text below refers to a "Node", it references the C level struct in the Postgres parse tree. In the libpg_query JSON format this is represented as a JSON object, and in the libpg_query Protobuf format this is represented as a message.
- π V3 Fingerprints are represented as a 64 bit (8 byte) integer, commonly shown as a 16 character hex value when displayed
- π V3 Fingerprints are based on the xxHash XXH3 64-bit hash function, with the number
3
provided as a seed - Nodes in the parse tree are traversed top-down
- π Ignore any nodes more than 100 nodes deep (start counting depth with 0 at the top level, incrementing by 1 when going into a child node through a field - subsequent fields on a node do not increment the depth counter, nor do other nodes at the same level, e.g. in a list - if depth >= 100, ignore that node and any child nodes)
- All nodes in a multi-statement AST are processed (so
SELECT 1; SELECT 2;
parsed as one statement will give you one fingerprint) - For each node, field names are handled in alphabetic order:
- In some cases, the field is not written at all:
- nil values
- 0 value for integer/byte values
- false values for booleans
- empty lists
- empty strings
- Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
- Otherwise, first the field name is written (as it appears in the Postgres source), followed by the field's value as one or more strings:
- Value nodes will be output with their type and fields, e.g. a Float would be
["Float", "str", "0.5"]
(not just"0.5"
) - "true" for booleans that are true
- "123" for integer values
- π "ENUM_VALUE" for an enum value (as they show in the Postgres source)
- Node values are traversed immediately (so the child struct goes first, fields that follow in the parent go afterwards)
- Value nodes will be output with their type and fields, e.g. a Float would be
- In some cases, the field is not written at all:
- Special cases
- π
Node
generic node type- Output the actual node type name (as represented in the Postgres source or libpg_query JSON output, e.g.
SelectStmt
), before handling other fields -- except, if the node type isList
, in that case do not output the node type name
- Output the actual node type name (as represented in the Postgres source or libpg_query JSON output, e.g.
- π
List
node (note the JSON format often represents these as simple arrays)- If parent field name is
fromClause
,targetList
,cols
,rexpr
,valuesLists
or πargs
:- Fingerprint each node, and store the resulting tokens, π together with each node's hash value in an array
- π Ignore duplicate nodes (based on the node hash value being equivalent)
- π Sort the overall array by each node's hash value (ascending)
- Write sorted tokens into the result hash
- Otherwise just traverse each node in original order (and add fields continuously to the hash)
- If parent field name is
-
RangeVar
node-
relname
:- Ignore completely if node has
RangeVar.relpersistence = "t"
- π Otherwise, remove any two or more subsequent digits (
s/\d{2,}//g
) and write out the resulting value
- Ignore completely if node has
-
- π
A_Expr
node- If kind is
AEXPR_OP_ANY
, writeAEXPR_OP
kind instead - If kind is
AEXPR_IN
, writeAEXPR_OP
kind instead
- If kind is
- Node types to be ignored
A_Const
Alias
ParamRef
SetToDefault
- Value nodes of type
IntList
,OidList
orNull
- π
TypeCast
, if thearg
is a node of typeA_Const
orParamRef
- Attributes to be ignored
-
location
, for all nodes -
ResTarget.name
, if parent node is aSelectStmt
and parent field name istargetList
PrepareStmt.name
ExecuteStmt.name
DeallocateStmt.name
TransactionStmt.options
TransactionStmt.gid
- π
TransactionStmt.savepoint_name
DeclareCursorStmt.portalname
FetchStmt.portalname
ClosePortalStmt.portalname
RawStmt.stmt_len
RawStmt.stmt_location
-
- π
- Same fingerprinting rules as 3.0, but in addition:
- Consider lists that have a single NIL element as being present for determining whether the parent field should be included in the fingerprint. In particular, this matters for fingerprinting "SELECT DISTINCT" without columns specified.
-
Attributes to be ignored:
- CreateFunctionStmt.options (and thus the function definition itself)
- FunctionParameter.name (a function is deemed unique based on the function names and parameter types, not the parameter names)
- DoStmt.args (this effectively groups all DO statements together, avoiding bloating query tracking with anonymous blocks that'd all be considered unique)
- ListenStmt.conditionname, UnlistenStmt.conditionname and NotifyStmt.conditionname (this effectively groups all LISTEN, UNLISTEN and NOTIFY statements together - since channels are often made unique to notify about different per-session channel names, this can easily bloat query tracking)