This repository contains a collection of utility macros for working with Dataform. The macros are designed to be used in a Dataform project and are imported into projects as packages.
To use the macros in this repository, you will need to add this repository to the dependencies in your packages.json
file:
{
"name": "jaffle_shop",
"dependencies": {
"@dataform/core": "2.9",
"qbi-dataform-utils": "https://github.com/quickbi/qbi-dataform-utils/archive/main.tar.gz"
}
}
To use a specific version of the package, you can specify the version tag or commit hash in the URL:
{
"name": "jaffle_shop",
"dependencies": {
"@dataform/core": "2.9",
"qbi-dataform-utils": "https://github.com/quickbi/qbi-dataform-utils/archive/v0.1.0.tar.gz"
}
}
Once you have added the package to your packages.json
file, you can install the package by running the following command:
dataform install
To use the macros in this repository, you will need to import them into your Dataform project. To import, add a includes/qbi_dataform_utils.js
file to your Dataform project and import the macros into the file. For example:
// includes/qbi_dataform_utils.js
const { deduplicate, generate_surrogate_key, union_relations } = require("qbi-dataform-utils");
module.exports = { deduplicate, generate_surrogate_key, union_relations };
The name of the includes file acts as a namespace for the imported macros. In other words, if the includes file is named qbi_dataform_utils.js
, you can refer to the imported macros using the ${qbi_dataform_utils.<macro_name>}
syntax. For example, to refer to the deduplicate
macro, use the ${qbi_dataform_utils.deduplicate(<relation>, <partition_by>, <order_by>)}
syntax.
This macro is used to deduplicate data in a relation or a CTE. It uses the row_number()
window function to assign a unique row number to each row in the table, and then filters out the rows where the row number is greater than 1.
The macro takes an object with the following properties:
relation
(string): The relations or CTEs to deduplicate.partition_by
(string): The field or fields to partition by. Multiple fields must be separated by commas.order_by
(string): The field or fields to order by. Multiple fields must be separated by commas.
Usage:
-- definitions/users.sql
config {
type: 'table',
assertions: {
uniqueKey: ['user_id'],
nonNull: ['user_id']
}
}
-- Deduplicate the stg_users table by user_id, keeping the most recent record
${qbi_dataform_utils.deduplicate({
relation: ref('stg_users'),
partition_by: 'user_id',
order_by: 'loaded_at desc'
})}
-- definitions/users.sql
config {
type: 'table',
assertions: {
uniqueKey: ['user_id'],
nonNull: ['user_id']
}
}
with users as (
select
*
from ${ref('stg_users')}
),
-- Deduplicate the users CTE by user_id, keeping the most recent record
deduplicated as (
${qbi_dataform_utils.deduplicate({
relation: 'users',
partition_by: 'user_id',
order_by: 'loaded_at desc'
})}
)
select
*
from deduplicated
This macro is used to generate a hashed surrogate key. Use this macro to generate unique identifiers for tables that do not have a natural key.
Arguments:
fields
(array): An array of fields to hash.default_null_value
(string): The value to use when the fields to hash are null. Default is_qbi_dataform_utils_surrogate_key_null
.
Usage:
-- definitions/stg_exchange_rates.sql
config {
type: 'table',
assertions: {
uniqueKey: ['exchange_rate_id'],
nonNull: ['exchange_rate_id']
}
}
select
${qbi_dataform_utils.generate_surrogate_key(['exchange_rate', 'currency'])} as exchange_rate_id,
*
from ${ref('raw_exchange_rates')}
This macro is used to union two or more relations (or CTEs) together. It is useful when you have multiple relations that have the same schema and you want to combine them into a single relation. The relations to union are specified as a map, where the keys are arbitrary names and the values are the relations to union. The macro will automatically add _dataform_source_key
and _dataform_source_relation
columns to the output relation to indicate which relation the row came from. If the relations have different schemas, you need to specify the fields to union on.
The macro takes an object with the following properties:
relations
(object): A map of relation keys and their corresponding relations or CTEs.fields
(array): An array of fields to union on. If not specified, the macro will union all fields (i.e.,[*]
).key_column_name
(string): The name of the column to use for the source relation key column. Default is_dataform_source_key
.value_column_name
(string): The name of the column to use for the source relation value column. Default is_dataform_source_value
.
Usage:
config {
type: 'table',
}
with unioned as (
${
qbi_dataform_utils.union_relations({
relations: {
'usd': ref('stg_exchange_rates_usd'),
'eur': ref('stg_exchange_rates_eur')
},
fields: ['date', 'exchange_rate'], -- Union date and exchange_rate fields only
key_column_name: 'currency' -- Rename _dataform_source_key to currency
})
}
)
select
* except (_dataform_source_value)
from unioned
config {
type: 'table',
}
with usd as (
select
*,
'usd' as target_currency
from ${ref('stg_exchange_rates_usd')}
),
eur as (
select
*,
'usd' as target_currency
from ${ref('stg_exchange_rates_eur')}
),
unioned as (
${
qbi_dataform_utils.union_relations({
relations: {
'usd': 'usd',
'eur': 'eur'
},
fields: ['date', 'exchange_rate', 'target_currency'],
key_column_name: 'currency'
})
}
)
select
* except (_dataform_source_value)
from unioned