add drop and recompute media model #1
Conversation
Great job Agnes, it's coming along really well! I think there are a couple of main points to look out for that are purely cosmetic:
- Standardising indentation across the board
- Making sure files end on a newline
- Adding as when providing aliases for columns
Then there's replacing the :: casting format with the various {{ dbt_utils.type_X }} macros, to ensure that this model is compatible with all of the different databases that are supported.
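As a rough illustration of that replacement (a minimal sketch; the specific columns here are only examples, not lifted from the model):

```sql
-- before: database-specific cast syntax
round(mp.duration)::integer as duration,
me.html_id::varchar as html_id

-- after: cross-database casts via the dbt_utils type macros
cast(round(mp.duration) as {{ dbt_utils.type_int() }}) as duration,
cast(me.html_id as {{ dbt_utils.type_string() }}) as html_id
```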
I'm also not sure if it's worth implementing fully incremental logic now, but it seems like a waste to continually recompute so many metrics around users and videos; if that's out of scope then that's fine too.
I think breaking down the models into different subsections, and therefore directories, could also help with the organisation of the project, although that's not necessary.
I think we can also add some more variables into dbt_project.yml to parameterise some settings and make them easier to modify.
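For example, a sketch of what that could look like in dbt_project.yml — snowplow__percent_progress_boundaries is the variable used later in pivot_base (the values shown are illustrative), while the valid-play threshold variable is purely hypothetical:

```yaml
# dbt_project.yml (sketch)
vars:
  # boundaries used by pivot_base when building percent-progress columns
  snowplow__percent_progress_boundaries: [10, 25, 50, 75]
  # hypothetical: threshold in seconds for counting a play as valid
  snowplow__valid_play_sec: 30
```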
Some of my comments/questions are probably quite ignorant since I haven't worked with the media dataset before, so feel free to ignore them if they don't make sense at all, but overall I'm excited to see how this project continues to evolve!
models/interactions.sql (outdated)
Is this just for Redshift/Postgres? If so, should we specify a subdirectory for Redshift within the models folder? Is the initial version only supporting Redshift/Postgres or everything? If everything, consider clustering and partitioning in the config.
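For reference, a minimal sketch of adapter-specific clustering/partitioning in the model config — the column choices are assumptions for illustration, and configs not relevant to an adapter should simply be ignored by it:

```sql
{{
  config(
    materialized='table',
    -- BigQuery/Snowflake: partitioning and clustering (assumed columns)
    partition_by={'field': 'start_tstamp', 'data_type': 'timestamp'},
    cluster_by=['media_id'],
    -- Redshift: sort and dist keys (assumed columns)
    sort='start_tstamp',
    dist='media_id'
  )
}}
```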
Initially it is for Redshift/Postgres, but we can certainly add the additional configs to support the rest.
Overall I think this is a really good effort. Well done.
I have left plenty of comments throughout, but here are some points I would like to highlight:
- The play time calculations I am still a little unclear on. I suggest we work through the comments, make any changes/clarifications and then revisit this part.
- I think we need to think through the session- and user-level tables. Do we include these in the package as standalone tables? Do we provide snippets showing how you might add these columns to the corresponding table in the web package? Do we provide models that do this for you? I am not sure at this point; I think we probably need to experiment with that a little. What we have already at a sessions and users level is really useful, we just need to think about how we surface it.
- How do we integrate Roku? It is tricky because Roku's session_id is taken from com_snowplowanalytics_snowplow_client_session_1, whereas all web traffic's session id is domain_sessionid. As a result, the mobile base module will surface all Roku events, whereas the web base module will surface the remaining media events. How we handle that I am not sure at this stage. It might be that we just focus on one or the other for now.
Once you have had some time to digest the comments and make any required changes, I suggest we sit down as a three and discuss any outstanding questions and think through the points above.
models/interactions.sql (outdated)
e.domain_userid,
coalesce(y.player_id, me.html_id, r.content_title) as media_id, -- replace content_title with id field when available
round(mp.duration) as duration,
coalesce(y.player_id, me.html_id, r.content_title) as title, -- replace ids with title fields when available
Is this not just the same as media_id?
For Roku we have a content_id and a content_title as well. As we just decided to split Roku from the web models, I will remove this (which was intended as a placeholder for now) and keep the id only.
models/interactions.sql (outdated)
event_type,
start_tstamp,
media_start_pos,
lead(start_tstamp, 1) over (partition by page_view_id, media_id order by start_tstamp) end_tstamp,
You are essentially partitioning by play_id here, right? What do you think about introducing the surrogate key play_id earlier in this model rather than in the plays_by_pageview model? That way it is exposed as early as possible. We could also use the surrogate_key utils macro rather than the hash macro. Very similar under the hood, but it ensures all columns are cast to strings and replaces nulls (which would otherwise result in a null primary key).
Thanks for your explanation, it makes sense. In the reworked model it will be added to interactions already.
Sorry I guess my point here was that it might be cleaner to have the surrogate key defined earlier within the interactions model so it is more obvious what the partition actually represents.
I wanted to do that of course, but the issue I faced was that {{ dbt_utils.surrogate_key(['p.page_view_id', coalesce(y.player_id, me.html_id) ]) }} play_id doesn't work. Do you have a workaround for this perhaps? The only thing I can think of is to add another CTE after prep and before the partitioning CTE just for the surrogate key calculation.
I think you just need to wrap the coalesce in quotes, i.e. {{ dbt_utils.surrogate_key(['p.page_view_id', 'coalesce(y.player_id, me.html_id)' ]) }} play_id.
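A minimal sketch of how that could sit in the interactions select, reusing the p/y/me aliases from the snippets above (the surrounding columns are illustrative):

```sql
select
  p.page_view_id,
  -- surrogate key defined as early as possible; the coalesce is quoted so
  -- dbt_utils.surrogate_key treats it as a single SQL expression
  {{ dbt_utils.surrogate_key(['p.page_view_id', 'coalesce(y.player_id, me.html_id)']) }} as play_id,
  coalesce(y.player_id, me.html_id) as media_id
from prep as p
-- joins to y and me omitted for brevity
```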
models/interactions.sql (outdated)
f.event_id,
f.event_type,
f.start_tstamp,
case when f.end_tstamp is null then p.start_tstamp else f.end_tstamp end end_tstamp,
Is this step really required? In the previous CTE, find_nulls_round_1, if instead of case when event_type in ('pause', 'paused', 'ended') then coalesce(end_tstamp, start_tstamp) else end_tstamp end end_tstamp you had coalesce(end_tstamp, start_tstamp), would you get the same result as this two-step NULL identification?
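To spell out the comparison (a sketch only, using the column names from the snippets above):

```sql
-- current two-step version:
-- 1) in find_nulls_round_1, fill NULLs only for pause/paused/ended events
case when event_type in ('pause', 'paused', 'ended')
     then coalesce(end_tstamp, start_tstamp)
     else end_tstamp
end as end_tstamp
-- 2) in this CTE, fill any remaining NULLs from p.start_tstamp

-- suggested single step in find_nulls_round_1:
coalesce(end_tstamp, start_tstamp) as end_tstamp
```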
I tried to highlight it in the comment, but essentially part 2 was just a placeholder to eventually calculate the end timestamps of events that are not a paused or ended event and take them from the web model's page view table's end_tstamps, to get a more accurate view. Not sure if it is worth the extra calculation and dependency though.
Yeah, I guess you could consider either taking the end_tstamp from the page views table or using the page ping events to calculate the end_tstamp within this model.
I left it with the simplified version, meaning I set the end_tstamp of the last event equal to its start_tstamp, because I found that when I added the end_tstamps from the page view table, the values I got were less than the start_tstamp of the last event.
models/media_stats.sql (outdated)
from {{ ref("interactions") }}
where event_type = 'ready'
Do you think we should include impressions in the plays_by_pageview model? I know strictly speaking it wouldn't be just plays then, but just thinking out loud.
As we agreed in the group discussion, I have changed the logic to fit the impressions into plays_by_pageview. This has led me to realise I need to reconsider the KPIs for media_stats in general. Currently I only have a rough idea which metric needs which logic (or more than one) out of the following 4 levels of filtering. Let's say for unique users, as a base we could take:
1) all interactions, including those cases where there is only a ready event within the page view and so no actual interaction with the video
2) all page views where some interaction with the video happened, but not necessarily a play event
3) interactions within page views with some play_time_sec
4) interactions with valid plays (currently longer than 30 min)
You will see I opted for calculating both 3) unique_users_with_plays and 4) unique_users_with_valid_plays, whereas when I look at completion_rate, I only consider option 3) (completion_rate_by_play).
Thank you both again for the reviews, I have reworked the model taking into consideration the recommendations you suggested. I have removed Roku from the model for now so that it is cleaner and the tests don't fail anymore. I have also added documentation for the columns so that we can be clear on the definitions behind the KPIs. There are still a handful of points that might be open and need further discussion; I have tried to reply to comments everywhere it is needed, so feel free to resolve any open points you raised to get a clearer view of what is actually left to discuss / rework.
Thanks for making these changes. I have started to review the code base again but stopped because I realised it doesn't actually run e2e. There is 1 issue in the …
Apologies, the join is now fixed. The rest of the models run fine.
I was just about to comment something similar but I'm glad it works now! I could only reproduce your issue if I first ran the pivot_base without 10 within …
I like the changes. Good work. I think this is shaping up nicely.
I am still slightly unsure about the play time calculation. I think it is less to do with your logic, more with the underlying data. I have pulled out some rows of data which I had questions on here. I have left comments in the relevant cells.
More generally, on this time calculation it seems like we have two approaches. I have outlined these below, along with my initial thoughts on each. I am not wedded to either method yet, but I would be interested to get both your thoughts on this.
1. Using tstamp of media events (Current implementation)
Pros:
- Can calc abs time between events.
- Assuming clean data, should be the most accurate calc.
Cons:
- Relatively expensive to use lag/lead functions across event level data
- The user must be tracking a specific set of events for this to work i.e. play, pause, seek, ended. Unsure how tracking additional media events will affect this calc.
- Reliant on accurate tstamps and sequence of events being correct.
- If there is only a single play event, you won't have any watch time. Could use page pings to accommodate this fact.
- Complex logic with difficult-to-follow assumptions, e.g. how seek events are being treated, what happens if elapsed time > play time.
2. Use more granular percentage boundaries (e.g. every 10%)
Pros:
- Less prone to significant error (I think)
- Cheaper to calc
- Would allow for more granular retention curves.
- Only need to capture boundary events for this calc to work.
- Accuracy can be increased by increasing number of boundaries.
Cons:
- Purely an estimate of play time, rather than abs time.
- Increases number of percentage boundary events
- Hard to calc things that require fine grain measurements, such as is_valid_play which could be play_time_sec > 1s
I haven't gone into the minutiae on this review as I feel the decision we make on the above point will impact the design a bit.
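For what it's worth, a rough sketch of how the option-2 estimate could be computed — the percent_progress column and 'percentprogress' event type are assumptions here, not taken from the current model, and the estimate assumes playback ran from the start without seeking:

```sql
-- estimate play time from the highest percent-progress boundary reached
select
  page_view_id,
  media_id,
  max(percent_progress) / 100.0 * max(duration) as play_time_sec_estimated
from {{ ref("interactions") }}
where event_type = 'percentprogress'
group by 1, 2
```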
models/plays_by_pageview.sql (outdated)
select
d.play_time_sec,
d.play_time_sec_muted,
d.play_time_sec_estimated,
d.retention_rate,
case when d.play_time_sec > 0 then true else false end is_played,
I feel like this is going to miss plays which just have a single play event and nothing else?! Perhaps it is better to define this on play events. WDYT?
Sure, we could do a count of plays as the basis of the logic to be 100% correct.
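A minimal sketch of what that could look like, assuming the event_type values referred to elsewhere in this PR:

```sql
select
  page_view_id,
  media_id,
  -- flag as played if at least one play event exists,
  -- regardless of whether any play time was measured
  max(case when event_type = 'play' then 1 else 0 end) = 1 as is_played
from {{ ref("interactions") }}
group by 1, 2
```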
models/media_stats.sql (outdated)
min(start_tstamp) as first_play,
max(start_tstamp) as last_play,
count(*) as valid_plays,
count(distinct domain_userid) as unique_users_with_valid_plays,
I would be tempted to just call this users_w_valid_plays
In the end I decided to leave this out and only make a distinction between users (who clicked play) and users with complete plays, as I find it slightly superfluous, but I shortened all the key metrics from with to w.
models/pivot_base.sql (outdated)
{% if not loop.last %}
{% for element in var("snowplow__percent_progress_boundaries") %}
{% if element < 0 or element > 100 %}
Probably cleaner to move these checks into a macro, like get_percentage_boundaries(tracked_boundaries), where you validate the input boundaries, append 100 to the array if needed, then return the array. You could call this macro from here and iterate over the results as you have done. Shout if this doesn't make sense, I can explain in a bit more detail.
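A rough sketch of what such a macro could look like — the name comes from the comment above, but the body is an assumption about how the validation might be written:

```sql
{% macro get_percentage_boundaries(tracked_boundaries) %}

  {# validate the configured boundaries before pivot_base uses them #}
  {% for element in tracked_boundaries %}
    {% if element < 0 or element > 100 %}
      {{ exceptions.raise_compiler_error("Percent progress boundary " ~ element ~ " is not between 0 and 100") }}
    {% endif %}
  {% endfor %}

  {# append 100 so the last bucket always closes at the end of the media #}
  {% if 100 not in tracked_boundaries %}
    {% do tracked_boundaries.append(100) %}
  {% endif %}

  {{ return(tracked_boundaries) }}

{% endmacro %}
```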
This is done now, thanks for the suggestion.
coalesce(sum(i.play_time_sec_amended) / nullif(max(i.duration), 0), 0) as retention_rate
sum(case when i.is_muted then i.play_time_sec_amended else 0 end) as play_time_sec_muted,
sum(i.playback_rate * i.play_time_sec) / nullif(sum(i.play_time_sec), 0) as avg_playback_rate,
What is the reason to use play_time_sec for this calc but play_time_sec_amended elsewhere?
play_time_sec_amended is renamed to play_time_sec in downstream models. It is probably cleaner if the play_time_sec in this model is called play_time_sec_original and play_time_sec_amended is called play_time_sec right away.
This is not relevant anymore. I have only kept one play_time_sec calculation which is the estimate based on the percent progresses reached, as discussed earlier.
models/media_stats.sql (outdated)
group by 1, 2, 3, 4, 5
group by 1, 2, 3, 4
)
, valid_play_stats as (
So is the plan to have this media stats model as drop and recompute, i.e. reprocess all plays each time? We might be able to be cunning and avoid that situation by using {{ this }} to self-reference the table, then adding the new data to the old data; however, this approach will likely not work for metrics that require count distincts.
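A minimal sketch of the {{ this }} pattern being suggested, assuming an incremental materialisation and that start_tstamp/last_play can be used to isolate new plays (the column choices are assumptions):

```sql
{{ config(materialized='incremental', unique_key='media_id') }}

with new_plays as (

  select
    media_id,
    count(*) as plays,
    max(start_tstamp) as last_play
  from {{ ref("plays_by_pageview") }}
  {% if is_incremental() %}
    -- only take plays that arrived since the last run
    where start_tstamp > (select max(last_play) from {{ this }})
  {% endif %}
  group by 1

)

select
  n.media_id,
  -- add new counts onto the previously stored totals via {{ this }};
  -- this works for additive metrics but not for count(distinct ...)
  {% if is_incremental() %}
  coalesce(t.plays, 0) + n.plays as plays,
  {% else %}
  n.plays,
  {% endif %}
  n.last_play
from new_plays as n
{% if is_incremental() %}
left join {{ this }} as t on n.media_id = t.media_id
{% endif %}
```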
Will try my best to come up with something along these lines. For the drop and recompute model I did not plan to make this change yet.
models/media_stats.sql (outdated)
from {{ ref("plays_by_pageview") }}
select
media_id,
min(start_tstamp) as first_play,
I feel like these should either be called first_valid_play, or we should move the first/last play tstamp calc to the prep CTE, i.e. consider both valid and non-valid plays when calculating first and last plays. WDYT?
I think both could be useful, especially now that we have decided to reconsider the play time calculation, which means valid plays will become less accurate in certain scenarios.
I have reworked the model according to the less expensive and less error-prone approach.
Closed since we have the incremental model instead. (#2)
The media model is written in a drop and recompute fashion until the integration details are finalised. It will most likely not be a standalone model; depending on the player, it should use either a web model or a mobile model (Roku) as a base.
The schema.yml describes what each model does, but as a whole the idea is that all relevant event-level fields are collected under 'interactions', which is aggregated to a page view level, which is the basis for play events. There are further downstream models which aggregate this data from different perspectives. The most important model is 'media_stats', which aggregates the page view level data to a video/audio level with the key video analytics KPIs in mind.
Please note that currently it only takes test data from the web model, therefore some key fields from Roku events are currently missing, which will result in failures for the 'not_null' tests and incorrect / missing calculations for the relevant data in the models. This should fix itself as soon as we have test data available and the currently missing ids are populated.