Fingerprinting

Rationale

This document exists to enable implementation of a fingerprint method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query (Ruby) and pg_query.go (Go).

The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.

It is inspired by what Postgres' pg_stat_statements extension does in order to generate a queryid that is used to group similar queries together for statistics purposes.

Different from pg_stat_statements, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on master and slave as the same, even if the relation OIDs are different (pg_stat_statements would see different relation oids as different queries).

Version 0 (Ruby only, deprecated)

This is what pg_query (Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.

Version 1 (based on PostgreSQL 9.5)

V1 Fingerprints are represented as 0x01 + SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes)
In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
Nodes in the AST are traversed top-down
All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
Node type is written first, in Postgres source convention (e.g. A_Const)
Then, in normal cases, going by alphabetic order of the field names:
- In some cases, the field is not written at all:
  - nil values
  - 0 value for integer/byte values
  - false values for booleans
  - empty lists
  - empty strings
- Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
- Otherwise, first the field name is written (as it appears in the Postgres source in underscore_notation), followed by the field's value as one or more strings:
  - Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
  - "true" for booleans that are true
  - "123" for integer values
  - "0" for an enum value
  - Enums are written as their integer values
  - Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
Special cases
- List nodes (including a simple array that contains nodes)
  - If parent field name is fromClause, targetList, cols or rexpr:
    - Fingerprint each node, and store the resulting tokens in an array
    - Ignore duplicate nodes (based on equivalent fingerprint tokens)
    - Sort nodes by its tokens (simple string comparison)
    - Write sorted tokens into the result hash
  - Otherwise just fingerprint each node in original order
- Node types to be ignored
  - A_Const
  - Alias
  - Paramref
  - SetToDefault
  - Value nodes of type IntList, OidList or Null
- Attributes to be ignored
  - location (for all nodes)
  - ResTarget.name (if parent field name is targetList)
  - PrepareStmt.name
  - ExecuteStmt.name
  - DeallocateStmt.name
  - TransactionStmt.options
  - TransactionStmt.gid

Version 1.1 (based on PostgreSQL 9.5)

Attributes to be ignored: ResTarget.name should only be ignored if parent node is a SelectStmt and parent field name is targetList
- In version 1 this was a problem with UpdateStmt in particular, where the name is significant

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 1.2 (based on PostgreSQL 9.5)

Attributes to be ignored:
- DeclareCursorStmt.portalname
- FetchStmt.portalname
- ClosePortalStmt.portalname

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Provide feedback

Saved searches