add drop and recompute media model #1
Conversation
Great job Agnes, it's coming along really well! I think there are a couple of main points to look out for that are purely cosmetic:
- Standardising indentation across the board
- Making sure files end on a newline
- Adding as when providing aliases for columns
Then there's replacing the :: casting format with the various {{ dbt_utils.type_X }} macros, to ensure that this model is compatible with all of the different databases that are supported.
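As a rough illustration of that replacement (a minimal sketch; the specific columns here are only examples, not lifted from the model):

```sql
-- before: database-specific cast syntax
round(mp.duration)::integer as duration,
me.html_id::varchar as html_id

-- after: cross-database casts via the dbt_utils type macros
cast(round(mp.duration) as {{ dbt_utils.type_int() }}) as duration,
cast(me.html_id as {{ dbt_utils.type_string() }}) as html_id
```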
I'm also not sure if it's worth implementing fully incremental logic now, but it seems like a waste to continually recompute so many metrics around users and videos; if that's out of scope then that's fine too.
I think breaking down the models into different subsections, and therefore directories, could also help with the organisation of the project, although that's not necessary.
I think we can also add some more variables into dbt_project.yml to parameterise some settings and make them easier to modify.
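For example, a sketch of what that could look like in dbt_project.yml — snowplow__percent_progress_boundaries is the variable used later in pivot_base (the values shown are illustrative), while the valid-play threshold variable is purely hypothetical:

```yaml
# dbt_project.yml (sketch)
vars:
  # boundaries used by pivot_base when building percent-progress columns
  snowplow__percent_progress_boundaries: [10, 25, 50, 75]
  # hypothetical: threshold in seconds for counting a play as valid
  snowplow__valid_play_sec: 30
```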
Some of my comments/questions are probably quite ignorant since I haven't worked with the media dataset before, so feel free to ignore them if they don't make sense at all, but overall I'm excited to see how this project continues to evolve!
models/interactions.sql (outdated)
Is this just for Redshift/Postgres? If so, should we specify a subdirectory for Redshift within the models folder? Is the initial version only supporting Redshift/Postgres or everything? If everything, consider clustering and partitioning in the config.
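For reference, a minimal sketch of adapter-specific clustering/partitioning in the model config — the column choices are assumptions for illustration, and configs not relevant to an adapter should simply be ignored by it:

```sql
{{
  config(
    materialized='table',
    -- BigQuery/Snowflake: partitioning and clustering (assumed columns)
    partition_by={'field': 'start_tstamp', 'data_type': 'timestamp'},
    cluster_by=['media_id'],
    -- Redshift: sort and dist keys (assumed columns)
    sort='start_tstamp',
    dist='media_id'
  )
}}
```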
Initially it is for Redshift/Postgres, but we can certainly add the additional configs to support the rest.
Overall I think this is a really good effort. Well done.
I have left plenty of comments throughout, but here are some points I would like to highlight:
- The play time calculations I am still a little unclear on. I suggest we work through the comments, make any changes/clarifications and then revisit this part.
- I think we need to think through the session- and user-level tables. Do we include these in the package as standalone tables? Do we provide snippets showing how you might add these columns to the corresponding table in the web package? Do we provide models that do this for you? I am not sure at this point; I think we probably need to experiment with that a little. What we have already at a sessions and users level is really useful, we just need to think about how we surface it.
- How do we integrate Roku? It is tricky because Roku's session_id is taken from com_snowplowanalytics_snowplow_client_session_1, whereas all web traffic's session id is domain_sessionid. As a result, the mobile base module will surface all Roku events, whereas the web base module will surface the remaining media events. How we handle that I am not sure at this stage. It might be that we just focus on one or the other for now.
Once you have had some time to digest the comments and make any required changes, I suggest we sit down as a three and discuss any outstanding questions and think through the points above.
models/interactions.sql (outdated)
e.domain_userid,
coalesce(y.player_id, me.html_id, r.content_title) as media_id, -- replace content_title with id field when available
round(mp.duration) as duration,
coalesce(y.player_id, me.html_id, r.content_title) as title, -- replace ids with title fields when available
Is this not just the same as media_id?
For Roku we have a content_id and a content_title as well. As we just decided to split Roku from the web models, I will remove this (which was intended as a placeholder for now) and keep the id only.
models/interactions.sql (outdated)
event_type,
start_tstamp,
media_start_pos,
lead(start_tstamp, 1) over (partition by page_view_id, media_id order by start_tstamp) end_tstamp,
You are essentially partitioning by play_id here, right? What do you think about introducing the surrogate key play_id earlier in this model rather than in the plays_by_pageview model? That way it is exposed as early as possible. We could also use the surrogate_key utils macro rather than the hash macro. Very similar under the hood, but it ensures all columns are cast to strings and replaces nulls (which would otherwise result in a null primary key).
Thanks for your explanation, it makes sense. In the reworked model it will be added to interactions already.
Sorry I guess my point here was that it might be cleaner to have the surrogate key defined earlier within the interactions model so it is more obvious what the partition actually represents.
I wanted to do that of course, but the issue I faced was that {{ dbt_utils.surrogate_key(['p.page_view_id', coalesce(y.player_id, me.html_id) ]) }} play_id doesn't work. Do you have a workaround for this perhaps? The only thing I can think of is to add another CTE after prep and before the partitioning CTE just for the surrogate key calculation.
I think you just need to wrap the coalesce in quotes, i.e. {{ dbt_utils.surrogate_key(['p.page_view_id', 'coalesce(y.player_id, me.html_id)' ]) }} play_id.
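A minimal sketch of how that could sit in the interactions select, reusing the p/y/me aliases from the snippets above (the surrounding columns are illustrative):

```sql
select
  p.page_view_id,
  -- surrogate key defined as early as possible; the coalesce is quoted so
  -- dbt_utils.surrogate_key treats it as a single SQL expression
  {{ dbt_utils.surrogate_key(['p.page_view_id', 'coalesce(y.player_id, me.html_id)']) }} as play_id,
  coalesce(y.player_id, me.html_id) as media_id
from prep as p
-- joins to y and me omitted for brevity
```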
models/interactions.sql (outdated)
f.event_id,
f.event_type,
f.start_tstamp,
case when f.end_tstamp is null then p.start_tstamp else f.end_tstamp end end_tstamp,
Is this step really required? In the previous CTE, find_nulls_round_1, if instead of case when event_type in ('pause', 'paused', 'ended') then coalesce(end_tstamp, start_tstamp) else end_tstamp end end_tstamp you had coalesce(end_tstamp, start_tstamp), would you get the same result as this two-step NULL identification?
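To spell out the comparison (a sketch only, using the column names from the snippets above):

```sql
-- current two-step version:
-- 1) in find_nulls_round_1, fill NULLs only for pause/paused/ended events
case when event_type in ('pause', 'paused', 'ended')
     then coalesce(end_tstamp, start_tstamp)
     else end_tstamp
end as end_tstamp
-- 2) in this CTE, fill any remaining NULLs from p.start_tstamp

-- suggested single step in find_nulls_round_1:
coalesce(end_tstamp, start_tstamp) as end_tstamp
```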
I tried to highlight it in the comment, but essentially part 2 was just a placeholder to eventually calculate the end timestamps of events that are not a paused or ended event and take them from the web model's page view table's end_tstamps, to get a more accurate view. Not sure if it is worth the extra calculation and dependency though.
Yeah, I guess you could consider either taking the end_tstamp from the page views table or using the page ping events to calculate the end_tstamp within this model.
I left it with the simplified version, meaning I set the end_tstamp of the last event equal to its start_tstamp, because I found that when I added the end_tstamps from the page view table, the values I got were less than the start_tstamp of the last event.
models/media_stats.sql (outdated)
from {{ ref("interactions") }}
where event_type = 'ready'
Do you think we should include impressions in the plays_by_pageview model? I know strictly speaking it wouldn't be just plays then, but just thinking out loud.
As we agreed in the group discussion, I have changed the logic to fit the impressions into plays_by_pageview. This has led me to realise I need to reconsider the KPIs for media_stats in general. Currently I only have a rough idea which metric needs which logic (or more than one) out of the following 4 levels of filtering. Let's say for unique users, as a base we could take:
1) all interactions, including those cases where there is only a ready event within the page view and so no actual interaction with the video
2) all page views where some interaction with the video happened, but not necessarily a play event
3) interactions within page views with some play_time_sec
4) interactions with valid plays (currently longer than 30 min)
You will see I opted for calculating both 3) unique_users_with_plays and 4) unique_users_with_valid_plays, whereas when I look at completion_rate, I only consider option 3) (completion_rate_by_play).
Thank you both again for the reviews, I have reworked the model taking into consideration the recommendations you suggested. I have removed Roku from the model for now so that it is cleaner and the tests don't fail anymore. I have also added documentation for the columns so that we can be clear on the definitions behind the KPIs. There are still a handful of points that might be open and need further discussion; I have tried to reply to comments everywhere it is needed, so feel free to resolve any open points you raised to get a clearer view of what is actually left to discuss / rework.
Thanks for making these changes. I have started to review the code base again but stopped because I realised it doesn't actually run e2e. There is 1 issue in the …
Apologies, the join is now fixed. The rest of the models run fine.
I was just about to comment something similar but I'm glad it works now! I could only reproduce your issue if I first ran the pivot_base without 10 within …
I like the changes. Good work. I think this is shaping up nicely.
I am still slightly unsure about the play time calculation. I think it is less to do with your logic, more with the underlying data. I have pulled out some rows of data which I had questions on here. I have left comments in the relevant cells.
More generally, on this time calculation it seems like we have two approaches. I have outlined these below, along with my initial thoughts on each. I am not wedded to either method yet, but I would be interested to get both your thoughts on this.
1. Using tstamp of media events (Current implementation)
Pros:
- Can calc abs time between events.
- Assuming clean data, should be the most accurate calc.
Cons:
- Relatively expensive to use lag/lead functions across event level data
- The user must be tracking a specific set of events for this to work i.e. play, pause, seek, ended. Unsure how tracking additional media events will affect this calc.
- Reliant on accurate tstamps and sequence of events being correct.
- If there is only a single play event, you won't have any watch time. Could use page pings to accommodate this fact.
- Complex logic with difficult-to-follow assumptions, e.g. how seek events are being treated, what happens if elapsed time > play time.
2. Use more granular percentage boundaries (e.g. every 10%)
Pros:
- Less prone to significant error (I think)
- Cheaper to calc
- Would allow for more granular retention curves.
- Only need to capture boundary events for this calc to work.
- Accuracy can be increased by increasing number of boundaries.
Cons:
- Purely an estimate of play time, rather than abs time.
- Increases number of percentage boundary events
- Hard to calc things that require fine grain measurements, such as is_valid_play which could be play_time_sec > 1s
I haven't gone into the minutiae on this review as I feel the decision we make on the above point will impact the design a bit.
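For what it's worth, a rough sketch of how the option-2 estimate could be computed — the percent_progress column and 'percentprogress' event type are assumptions here, not taken from the current model, and the estimate assumes playback ran from the start without seeking:

```sql
-- estimate play time from the highest percent-progress boundary reached
select
  page_view_id,
  media_id,
  max(percent_progress) / 100.0 * max(duration) as play_time_sec_estimated
from {{ ref("interactions") }}
where event_type = 'percentprogress'
group by 1, 2
```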
models/plays_by_pageview.sql (outdated)
select
d.play_time_sec,
d.play_time_sec_muted,
d.play_time_sec_estimated,
d.retention_rate,
case when d.play_time_sec > 0 then true else false end is_played,
I feel like this is going to miss plays which just have a single play event and nothing else?! Perhaps it is better to define this on play events. WDYT?
Sure, we could do a count of plays as the basis of the logic to be 100% correct.
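A minimal sketch of what that could look like, assuming the event_type values referred to elsewhere in this PR:

```sql
select
  page_view_id,
  media_id,
  -- flag as played if at least one play event exists,
  -- regardless of whether any play time was measured
  max(case when event_type = 'play' then 1 else 0 end) = 1 as is_played
from {{ ref("interactions") }}
group by 1, 2
```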
models/media_stats.sql (outdated)
min(start_tstamp) as first_play,
max(start_tstamp) as last_play,
count(*) as valid_plays,
count(distinct domain_userid) as unique_users_with_valid_plays,
I would be tempted to just call this users_w_valid_plays
In the end I decided to leave this out and only make a distinction between users (who clicked play) and users with complete plays, as I find it slightly superfluous, but I shortened all the key metrics from with to w.
models/pivot_base.sql (outdated)
{% if not loop.last %}
{% for element in var("snowplow__percent_progress_boundaries") %}
{% if element < 0 or element > 100 %}
Probably cleaner to move these checks into a macro, like get_percentage_boundaries(tracked_boundaries), where you validate the input boundaries, append 100 to the array if needed, then return the array. You could call this macro from here and iterate over the results as you have done. Shout if this doesn't make sense, I can explain in a bit more detail.
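A rough sketch of what such a macro could look like — the name comes from the comment above, but the body is an assumption about how the validation might be written:

```sql
{% macro get_percentage_boundaries(tracked_boundaries) %}

  {# validate the configured boundaries before pivot_base uses them #}
  {% for element in tracked_boundaries %}
    {% if element < 0 or element > 100 %}
      {{ exceptions.raise_compiler_error("Percent progress boundary " ~ element ~ " is not between 0 and 100") }}
    {% endif %}
  {% endfor %}

  {# append 100 so the last bucket always closes at the end of the media #}
  {% if 100 not in tracked_boundaries %}
    {% do tracked_boundaries.append(100) %}
  {% endif %}

  {{ return(tracked_boundaries) }}

{% endmacro %}
```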
This is done now, thanks for the suggestion.
coalesce(sum(i.play_time_sec_amended) / nullif(max(i.duration), 0), 0) as retention_rate
sum(case when i.is_muted then i.play_time_sec_amended else 0 end) as play_time_sec_muted,
sum(i.playback_rate * i.play_time_sec) / nullif(sum(i.play_time_sec), 0) as avg_playback_rate,
What is the reason to use play_time_sec for this calc but play_time_sec_amended elsewhere?
play_time_sec_amended is renamed to play_time_sec in downstream models. It is probably cleaner if the play_time_sec in this model is called play_time_sec_original and play_time_sec_amended is called play_time_sec right away.
This is not relevant anymore. I have only kept one play_time_sec calculation which is the estimate based on the percent progresses reached, as discussed earlier.
models/media_stats.sql (outdated)
group by 1, 2, 3, 4, 5
group by 1, 2, 3, 4
)
, valid_play_stats as (
So is the plan to have this media stats model as drop and recompute, i.e. reprocess all plays each time? We might be able to be cunning and avoid that situation by using {{ this }} to self-reference the table, then adding the new data to the old data; however, this approach will likely not work for metrics that require count distincts.
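A minimal sketch of the {{ this }} pattern being suggested, assuming an incremental materialisation and that start_tstamp/last_play can be used to isolate new plays (the column choices are assumptions):

```sql
{{ config(materialized='incremental', unique_key='media_id') }}

with new_plays as (

  select
    media_id,
    count(*) as plays,
    max(start_tstamp) as last_play
  from {{ ref("plays_by_pageview") }}
  {% if is_incremental() %}
    -- only take plays that arrived since the last run
    where start_tstamp > (select max(last_play) from {{ this }})
  {% endif %}
  group by 1

)

select
  n.media_id,
  -- add new counts onto the previously stored totals via {{ this }};
  -- this works for additive metrics but not for count(distinct ...)
  {% if is_incremental() %}
  coalesce(t.plays, 0) + n.plays as plays,
  {% else %}
  n.plays,
  {% endif %}
  n.last_play
from new_plays as n
{% if is_incremental() %}
left join {{ this }} as t on n.media_id = t.media_id
{% endif %}
```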
Will try my best to come up with something along these lines. For the drop and recompute model I did not plan to make this change yet.
models/media_stats.sql (outdated)
from {{ ref("plays_by_pageview") }}
select
media_id,
min(start_tstamp) as first_play,
I feel like these should either be called first_valid_play, or we should move the first/last play tstamp calc to the prep CTE, i.e. consider both valid and non-valid plays when calculating first and last plays. WDYT?
I think both could be useful, especially now that we have decided to reconsider the play time calculation, which means valid plays will become less accurate in certain scenarios.
I have reworked the model according to the less expensive and less error-prone approach.
Closed since we have the incremental model instead. (#2)
The media model is written in a drop and recompute fashion until the integration details are finalised. It will most likely not be a standalone model; depending on the player, it should use either a web model or a mobile model (Roku) as a base.
The schema.yml describes what each model does, but as a whole the idea is that all relevant event-level fields are collected under 'interactions', which is aggregated to a page view level, which is the basis for play events. There are further downstream models which aggregate this data from different perspectives. The most important model is 'media_stats', which aggregates the page view level data to a video/audio level with the key video analytics KPIs in mind.
Please note that currently it only takes test data from the web model, therefore some key fields from Roku events are currently missing, which will result in failures for the 'not_null' tests and incorrect / missing calculations for the relevant data in the models. This should fix itself as soon as we have test data available and the currently missing ids are populated.