Skip to content
Lukas Fittl edited this page Mar 2, 2021 · 50 revisions

Rationale

This document exists to enable implementation of a fingerprint method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query (Ruby) and pg_query.go (Go).

The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.

It is inspired by what Postgres' pg_stat_statements extension does in order to generate a queryid that is used to group similar queries together for statistics purposes.

Different from pg_stat_statements, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on different servers as the same, even if the relation OIDs are different (pg_stat_statements would see different relation oids as different queries).

Version 0 (Ruby only, deprecated)

This is what pg_query (Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.

Version 1 (based on PostgreSQL 9.5)

  • V1 Fingerprints are represented as 0x01 + SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes)
  • In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
  • Nodes in the AST are traversed top-down
  • All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
  • Node type is written first, in Postgres source convention (e.g. A_Const)
  • Then, in normal cases, going by alphabetic order of the field names:
    • In some cases, the field is not written at all:
      • nil values
      • 0 value for integer/byte values
      • false values for booleans
      • empty lists
      • empty strings
    • Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
    • Otherwise, first the field name is written (as it appears in the Postgres source in underscore_notation), followed by the field's value as one or more strings:
      • Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
      • "true" for booleans that are true
      • "123" for integer values
      • "0" for an enum value
      • Enums are written as their integer values
      • Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
  • Special cases
    • List nodes (including a simple array that contains nodes)
      • If parent field name is fromClause, targetList, cols or rexpr:
        • Fingerprint each node, and store the resulting tokens in an array
        • Ignore duplicate nodes (based on equivalent fingerprint tokens)
        • Sort nodes by its tokens (simple string comparison)
        • Write sorted tokens into the result hash
      • Otherwise just fingerprint each node in original order
    • Node types to be ignored
      • A_Const
      • Alias
      • Paramref
      • SetToDefault
      • Value nodes of type IntList, OidList or Null
    • Attributes to be ignored
      • location (for all nodes)
      • ResTarget.name (if parent field name is targetList)
      • PrepareStmt.name
      • ExecuteStmt.name
      • DeallocateStmt.name
      • TransactionStmt.options
      • TransactionStmt.gid

Version 1.1 (based on PostgreSQL 9.5)

  • Attributes to be ignored: ResTarget.name should only be ignored if parent node is a SelectStmt and parent field name is targetList
    • In version 1 this was a problem with UpdateStmt in particular, where the name is significant

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 1.2 (based on PostgreSQL 9.5)

  • Attributes to be ignored:
    • DeclareCursorStmt.portalname
    • FetchStmt.portalname
    • ClosePortalStmt.portalname

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 1.3 (based on PostgreSQL 9.5)

  • Attributes to be ignored:
    • RangeVar.relname (if node also has RangeVar.relpersistence = "t")
  • Special cases: List nodes where parent field name is valuesLists
    • Follow same logic described for fromClause/targetList/cols/rexpr

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Version 2.0 (based on PostgreSQL 10)

  • Same fingerprinting rules as 1.3, but in addition:
  • Attributes to be ignored:
    • RawStmt.stmt_len
    • RawStmt.stmt_location

Version 3.0 (based on PostgreSQL 13)

⚠️⚠️ Work in progress, not yet stable ⚠️⚠️

Please note that whilst version 3.0 is inspired by prior versions, the complete (updated) fingerprinting rules are repeated for this version. When implementing based on the new Protobuf parse tree format in libpg_query, take care of field names - they need to match Postgres source, which is represented in the Protobuf descriptor's json_name field names - the Protobuf field names themselves are often subtly different in terms of capitalization.

When the text below refers to a "Node", it references the C level struct in the Postgres parse tree. In the libpg_query JSON format this is represented as a JSON object, and in the libpg_query Protobuf format this is represented as a message.

  • 🆕 V3 Fingerprints are represented as a 64 bit (8 byte) integer, commonly shown as a 16 character hex value when displayed
  • 🆕 V3 Fingerprints are based on the xxHash64 hash function, with the number 3 provided as a seed
  • Nodes in the parse tree are traversed top-down
  • All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
  • For each node, field names are handled in alphabetic order:
    • In some cases, the field is not written at all:
      • nil values
      • 0 value for integer/byte values
      • false values for booleans
      • empty lists
      • empty strings
    • Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
    • Otherwise, first the field name is written (as it appears in the Postgres source in underscore_notation), followed by the field's value as one or more strings:
      • Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
      • "true" for booleans that are true
      • "123" for integer values
      • "ENUM_VALUE" for an enum value (as they show in the Postgres source)
      • Node values are traversed immediately (so the child struct goes first, fields that follow in the parent go afterwards)
  • Special cases
    • 🆕 Node generic node type
      • Output the actual node type (as represented in the Postgres source or libpg_query JSON output), before handling other fields
    • 🆕 List node (note the JSON format often represents these as simple arrays)
      • If parent field name is fromClause, targetList, cols, rexpr, valuesLists or args:
        • Fingerprint each node (using the hash function), and store the resulting hash in an array
        • Ignore duplicate nodes (based on equivalent hashes)
        • Sort hash values (ascending by their integer value)
        • ❓ Write hash values (as string representation of the integer) -- TODO: Should we keep adding the original tokens here to the hash (like in v2, better for debugging - with some extra calculation effort)
      • Otherwise just traverse each node in original order (and add fields continuously to the hash, instead of doing intermittent results)
      • Note that empty lists (which may especially be caused by ignored node types), result in the field not being written at all
    • 🆕 RangeVar node
      • relname: Ignore two or more subsequent digits (\d{2,}) and remove them from the string before writing (motivation: group most time-partitioned tables together)
    • 🆕 A_Expr node
      • If kind is AEXPR_OP_ANY, write AEXPR_OP kind instead
      • If kind is AEXPR_IN, write AEXPR_OP kind instead
    • Node types to be ignored
      • A_Const
      • Alias
      • Paramref
      • SetToDefault
      • Value nodes of type IntList, OidList or Null
      • TypeCast, if the arg is a node of type A_Const or Paramref
    • Attributes to be ignored
      • location, for all nodes
      • ResTarget.name, if parent field name is targetList
      • PrepareStmt.name
      • ExecuteStmt.name
      • DeallocateStmt.name
      • TransactionStmt.options
      • TransactionStmt.gid
      • RangeVar.relname, if node also has RangeVar.relpersistence = "t"
      • ResTarget.name, if parent node is a SelectStmt and parent field name is targetList
      • DeclareCursorStmt.portalname
      • FetchStmt.portalname
      • ClosePortalStmt.portalname
      • RawStmt.stmt_len
      • RawStmt.stmt_location
      • 🆕 TransactionStmt.savepoint_name
Clone this wiki locally