Incremental, partitioned session dim and fact tables. Allows for dimensional modeling on very large installs #175
Honestly, I do not understand the point of separating dim & fct tables for a single entity like sessions; it's not like the dim table keys anything long-lived. Since this MR doubles the tables, is this a good time to combine them?
I agree there isn't much value in separating these dim & fct tables as they are today, but I'd like to research comparable Kimball modeling scenarios before making a final call. It's a separate concern from what's being addressed in this PR, so let me open an issue and we can move the conversation there.
I approved these changes, but also wanted to add that I think we should have a view model on top of these session tables that handles cross-partition aggregation. I would go so far as to suggest that we give our session model or models the This would result in sessions breaking when they are at the start or the end of the date range being queried but would leave partitions within the range unbroken. I think this is a good default behavior that people can choose to override if they are willing to pay the performance price.
Agree that the organization could use more thought. I had a realization after completing this: It would be easy to create a Just putting that idea out there for now.
This is what I did in my own project some months back. Works well enough.
Description & motivation
Current Issue:
Building dim_ga4__sessions requires very expensive window functions that run against the entire event table. This is because GA4 sessions can span multiple days, and the primary method available to us to reduce query cost is to work incrementally with N days of data. This causes a conflict: when working with N days of data, you may be working with the last half of a session that started on an earlier date or the first half of a session that ends on a later date.
Solution:
The only way to align a date-partitioned strategy with session dimensions is to revert to the assumption used by GA3 that sessions are wholly contained within a date. This PR adds a series of _daily models that make this assumption. Users of the package can decide whether they would like to work with multi-day sessions (at the expense of excess query costs) or daily sessions that are query-efficient.
Summary of changes:
- session partition, which is a session split into date-grain records. A 2-day session will have 2 session partitions
- stg_ga4__sessions_traffic_sources_daily, which finds acquisition sources for daily session partitions
- stg_ga4__derived_session_properties_daily, which finds session properties within session partitions
- dim_ga4__sessions_daily, which finds the first value of all dimension columns, windowing on session partition
- stg_ga4__session_conversions_daily: incremental and partitioned (was already grouping by day)
- fct_ga4__sessions_daily: incremental and partitioned
Checklist
- dbt test and python -m pytest . to validate existing tests
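
For readers who want to see the session partition idea from the summary of changes in miniature, here is a rough sketch in plain Python rather than the package's SQL. Everything in it is an assumption made for illustration: the tuple layout, the names session_key and source, and the choice to take the value from the earliest event in each partition. It is not the package's implementation. It only shows how a session that crosses midnight becomes two date-grain records, and how a first-value lookup scoped to one partition can no longer see events from the previous day.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event rows: (session_key, event_timestamp, source).
# The layout and names are illustrative, not the package's schema.
events = [
    ("s1", datetime(2023, 1, 1, 23, 50), "google"),
    ("s1", datetime(2023, 1, 1, 23, 58), None),
    ("s1", datetime(2023, 1, 2, 0, 5), None),   # same session, next day
    ("s2", datetime(2023, 1, 2, 9, 0), "newsletter"),
]


def session_partitions(rows):
    """Group events by (session_key, event date): one group per session partition."""
    parts = defaultdict(list)
    for session_key, ts, source in rows:
        parts[(session_key, ts.date())].append((ts, source))
    return parts


def first_source_per_partition(rows):
    """Take the source of the earliest event in each partition.

    This mimics a first-value window scoped to the session partition,
    so it only ever looks at a single day of events.
    """
    result = {}
    for key, partition_events in session_partitions(rows).items():
        earliest = min(partition_events, key=lambda e: e[0])
        result[key] = earliest[1]
    return result


if __name__ == "__main__":
    for (session_key, day), source in sorted(first_source_per_partition(events).items()):
        print(session_key, day, source)
```

In this toy data, session s1 yields two partitions, and the second one has no source because the event carrying it happened the day before. That is the trade-off described above: each day can be processed on its own, at the cost of sessions breaking at date boundaries.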