-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for ingestion time partition table on BigQuery as incremental materialization #75
Comments
@github-christophe-oudar So sorry again for the delayed response here! On the fourth or fifth read through, the proposal finally makes sense to me. I'm excited that you've identified a pure-SQL way to handle the creation of the ingestion-time-partitioned table ( One specific part just clicked for me: How does dbt "sort" each partition of data into its appropriate bucket, during the first / full-refresh table creation? The {{ config(
materialized = 'incremental',
partition_by={
"field": "_PARTITIONTIME",
"data_type": "timestamp"
}
) }}
select
transaction_date as _partitiontime,
transaction_id,
transaction_date
from {{ source('order_system', 'transactions') }} Then dbt performs the initial backfill (as well as subsequent merges) using that "derived" insert into mydataset.newtable (_partitiontime, transaction_id, transaction_date)
select
transaction_date as _partitiontime,
transaction_id,
transaction_date
from source_order_system.transactions And then, amazingly, this works: #legacySQL
select * from dbt_jcohen.my_incremental_model$20210101 So, what's tricky?
So, I'm happy with the proposal outlined here—and excited for what we can do performance-wise, in concert with the proposal in #77! I know it's my fault for getting back to you a few months late, and I know you've got other work on your plate these days. If another community member is interested and able to contribute this, we'd be happy to support you. |
@jtcohen6 I had a look at it last night but it's actually not so straightforward:
Then I tried to However as the query
is returning a It's a pretty thorny issue. It feels like it requires:
and save it to the "temp" table
And here we succeeded I guess 🙌 Whenever the table is created, we should do something similar: |
@github-christophe-oudar Ah, I see what you're saying. The limitation is, BigQuery won't let us run a query or create a temp table that contains a column named The downer is that, because this model's SQL returns a "forbidden" column name (
Revisiting the example above: {{ config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
partition_by={
"field": "transaction_date",
"data_type": "date",
"use_partitiontime": true # something like this?
}
) }}
select
transaction_id,
transaction_date
from {{ source('order_system', 'transactions') }} Then, we handle the rest: -- this is the DML that dbt encodes into the materialization, using metadata from preexisting table
insert into mydataset.newtable (_partitiontime, transaction_id, transaction_date)
select timestamp(transaction_date) as _partitiontime, * from (
-- this is the user-provided model SQL, templated here directly without first saving to temp table
select
transaction_id,
transaction_date
from source_order_system.transactions
) Put another way: {% set partition_time_exp =
partition_by.field if partition_by.data_type == 'timestamp'
else 'timestamp(' + partition_by.field ')' %}
insert into {{ target_relation }} (_partitiontime, {{ dest_columns_csv }})
select {{ partition_time_exp }} as _partitiontime from (
{{ sql }}
) We'd still need a way to get the column schema from the "source" (model SQL) query, in order to template the column list in the initial
select * from (
select
transaction_id,
transaction_date
from source_order_system.transactions
) as __dbt_sbq
where false
limit 0 The temp table approach might be preferable, since it's easier to reason about, and more similar to the logic that we have in the materialization today. |
I agree with you! I think the approach definitely makes sense! 🙌 I'll have another shot over the weekend (maybe earlier if I can find some time to do so 🤞). |
Hello, just to let you know that I started to work on the topic, I'm a bit slow as I just have some time on weekends lately. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
Describe the feature
This is a follow-up from a discussion over the topic with @jtcohen6.
The issues with current incremental materialization implementation
Ingestion time partition tables are supported as
table
materialization right now but the support is meant to be deprecated.The specificity for those tables is that partitioning field is
_PARTITIONTIME
OR_PARTITIONDATE
as a "pseudo column".Since that column doesn't really exist as a column within the table, Google doesn't allow the same kind of operations on the table.
Indeed if you would like to use
BigQuery doesn't let you create a ingestion time partitioned table using usual dbt approach (https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_ingestion-time_partitioned_table):
will fail as _PARTITIONTIME is not directly in the defined columns.
Working approach
So the required approach is to have first:
And then:
Once we move to the merge part, it's indeed possible to insert data with
Practically, it won't be 100% straightforward as the it requires to use
SELECT _PARTITIONTIME, * FROM mydataset.newtable__dbt_tmp
as_PARTITIONTIME
is not directly a column and therefore not in theSELECT *
but else it appears to work.Practically, I think everything can be done within incremental.sql
Describe alternatives you've considered
Alternatives are:
Additional context
At Teads, we use a lot of time ingestion partitioned tables as they were created prior to column type partitioned table feature on BigQuery. Migrating would be an option but the overhead introduced for selecting data from column type partitioned table is "a drag" to make that move.
Who will this benefit?
It would benefit anyone using time ingestion partition table.
Are you interested in contributing this feature?
Yes
The text was updated successfully, but these errors were encountered: