Skip to content
Lukas Fittl edited this page Jun 26, 2016 · 50 revisions

Rationale

This document exists to enable implementation of a fingerprint method for a given parsetree that is compatible across different libraries (in different languages) that build upon libpg_query. Initially this is aimed to synchronize fingerprinting of pg_query (Ruby) and pg_query.go (Go).

The goal of fingerprinting is to allow you to identify similar queries that are different only because of the specific object that is being queried for (i.e. different user ids/etc), or because of formatting.

It is inspired by what Postgres' pg_stat_statements extension does in order to generate a queryid that is used to group similar queries together for statistics purposes.

Different from pg_stat_statements, we generate the fingerprint based on the raw parsetree, instead of the post-analysis parsetree. This means that we treat a query that is run on master and slave as the same, even if the relation OIDs are different (pg_stat_statements would see different relation oids as different queries).

Version 0 (Ruby only, deprecated)

This is what pg_query (Ruby) used to do for fingerprinting. Its not portable and not advised for future usage.

Version 1 (based on PostgreSQL 9.5)

  • V1 Fingerprints are represented as 0x01 + SHA-1 sum, total size in bytes is 21, although fingerprints are usually returned as null-terminated hexstrings (43 bytes)
  • In general, all relevant tokens are collected into an array, and then a hash of all is computed using SHA-1
  • Nodes in the AST are traversed top-down
  • All nodes in a multi-statement AST are processed (so SELECT 1; SELECT 2; parsed as one statement will give you one fingerprint)
  • Node type is written first, in Postgres source convention (e.g. A_Const)
  • Then, in normal cases, going by alphabetic order of the field names:
    • In some cases, the field is not written at all:
      • nil values
      • 0 value for integer/byte values
      • false values for booleans
      • empty lists
      • empty strings
    • Note that an ignored node could cause an otherwise populated value (or list) to be empty - the same rule as above needs to be applied and the field name must not be written
    • Otherwise, first the field name is written (as it appears in the Postgres source in underscore_notation), followed by the field's value as one or more strings:
      • Value nodes will be output with their type and fields, e.g. a Float would be ["Float", "str", "0.5"] (not just "0.5")
      • "true" for booleans that are true
      • "123" for integer values
      • "0" for an enum value
      • Enums are written as their integer values
      • Node values are traversed immediately (so their tokens go first, fields that follow go afterwards)
  • Special cases
    • List nodes (including a simple array that contains nodes)
      • If parent field name is fromClause, targetList, cols or rexpr:
        • Fingerprint each node, and store the resulting tokens in an array
        • Ignore duplicate nodes (based on equivalent fingerprint tokens)
        • Sort nodes by its tokens (simple string comparison)
        • Write sorted tokens into the result hash
      • Otherwise just fingerprint each node in original order
    • Node types to be ignored
      • A_Const
      • Alias
      • Paramref
      • SetToDefault
      • Value nodes of type IntList, OidList or Null
    • Attributes to be ignored
      • location (for all nodes)
      • ResTarget.name (if parent field name is targetList)
      • PrepareStmt.name
      • ExecuteStmt.name
      • DeallocateStmt.name
      • TransactionStmt.options
      • TransactionStmt.gid

Version 1.1 (based on PostgreSQL 9.5)

  • Attributes to be ignored: ResTarget.name only applies to SelectStmt parent nodes (if parent field name is targetList)
    • In version 1 this was a problem with UpdateStmt in particular, where the name is significant

This version retains the 0x01 prefix of version 1 due to it being a minor change only.

Clone this wiki locally