High Flow and Occupancy Detection and Correction #506
Conversation
…ers for flow and occupancy
From a high level, this logic seems reasonable to me. I do wonder if it would be worth breaking out the outlier removal into a separate intermediate model: I could imagine that more than one "performance" model would want to use the version with outliers removed. What do you think @thehanggit ?
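For illustration, such a split might look roughly like the sketch below. The model filename and the is_high_flow_outlier flag are assumptions for the example, not taken from this PR; only the ref() target appears in the actual code under review.

```sql
-- hypothetical intermediate model, e.g. int_clearinghouse__detector_agg_no_outliers.sql
-- downstream "performance" models would select from this instead of the raw aggregate
select
    detector_id,
    sample_date,
    volume_sum,
    occupancy_avg
from {{ ref('int_clearinghouse__detector_agg_five_minutes') }}
-- is_high_flow_outlier is a hypothetical flag assumed to be computed upstream
where not coalesce(is_high_flow_outlier, false)
```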
monthly_stats as (
    select
        detector_id,
        date_trunc('month', sample_date) as month,
        avg(volume_sum) as volume_mean,
        stddev(volume_sum) as volume_stddev,
        -- consider using max_capacity
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from five_minute_agg
    group by detector_id, date_trunc('month', sample_date)
),
I don't think this covers the intended time period: five_minute_agg is an incremental model, so in normal usage it only covers the past two days. For the most part, this "monthly" aggregate will only have two days of data in it.
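To make the concern concrete, an incremental dbt model typically filters to a trailing window on scheduled runs, so any month-level group by over it only sees that window. A minimal sketch (not the actual five_minute_agg source; the upstream model name and two-day window are assumed for illustration):

```sql
{{ config(materialized='incremental') }}

select
    detector_id,
    sample_date,
    volume_sum,
    occupancy_avg
from {{ ref('stg_clearinghouse__detector_raw') }}  -- hypothetical upstream model
{% if is_incremental() %}
    -- on incremental runs only the trailing days are scanned, so a
    -- date_trunc('month', ...) group would hold roughly two days of rows
    where sample_date >= dateadd(day, -2, current_date)
{% endif %}
```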
In that case, I can use the preceding month's statistics (mean, std) to detect outliers in the current month.
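A join along these lines could express that idea. This is a sketch only: it assumes the monthly stats are materialized in their own non-incremental model, and it parenthesizes the z-score numerator.

```sql
select
    fa.detector_id,
    fa.sample_date,
    fa.volume_sum,
    (fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) as volume_zscore
from five_minute_agg as fa
inner join monthly_stats as ms
    on
        fa.detector_id = ms.detector_id
        -- join each row to the stats of the month *before* it,
        -- since the current month is incomplete on incremental runs
        and ms.month = dateadd(month, -1, date_trunc('month', fa.sample_date))
```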
I think that's fine, but the incremental logic would have to change: the logic here doesn't really work for any date, because you are never actually computing monthly stats (five_minute_agg only has two days of data in it).
The updated approach will be to create a separate high flow outlier detection model in the clearinghouse schema, which will need to account for the incremental logic scenario @ian-r-rose mentions in the previous comment. I can see a scenario where incrementality is not needed, but we will see what @thehanggit comes up with. Looking forward to seeing the update!
Sounds good!
Thank you @ian-r-rose for taking a look! I talked with @kengodleskidot just now, and we may want to detect and fix the true outliers in the clearinghouse. The outliers in the performance model might be attributable to imputation (the "fake" outliers), and we want to leave that part to the imputation model. I will develop a separate outlier removal model in the clearinghouse folder, because it would also influence the g-factor speed calculation.
…nd fixed the logic for the incremental model. Please check the logic and see if it's reasonable, so we can move forward with connecting the imputed data to downstream models.
…-pems into hang_high_flow
Nice work @thehanggit! I don't want my comments to hold up this PR, since I do not believe they would change the output of the model, but feel free to reach out if you would like to discuss any of my comments.
    occupancy_avg
from {{ ref('int_clearinghouse__detector_agg_five_minutes') }}
where
    sample_date >= dateadd(month, -1, date_trunc('month', current_date))
I believe one of these values is a timestamp and the other is a date. Do you run into any issues when performing this comparison?
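Snowflake will generally coerce a date to a timestamp for such comparisons, but if it does cause trouble, an explicit cast makes the intent unambiguous, e.g.:

```sql
-- casting the left-hand side to date removes any implicit
-- timestamp/date coercion in the comparison
where sample_date::date >= dateadd(month, -1, date_trunc('month', current_date))
```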
case
    when
        fa.volume_sum - ms.volume_mean / nullifzero(ms.volume_stddev) > 3
    then coalesce(ms.volume_95th, 173)
I recommend adding a note here explaining why a value of 173 is being used.
case
    when
        fa.occupancy_avg > ms.occupancy_95th
    then coalesce(ms.occupancy_95th, 0.8)
Adding a similar note here on why 0.8 is being used would be beneficial.
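A sketch of both suggestions together is below. The comment text is placeholder; the real rationale for 173 and 0.8 has to come from the author. Note also that, as quoted, the volume condition divides only ms.volume_mean by the stddev due to operator precedence; a z-score needs parentheses around the numerator, as written here.

```sql
case
    when (fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) > 3
        -- 173: fallback volume cap used when no 95th percentile exists for
        -- this detector/month (document where this number comes from)
        then coalesce(ms.volume_95th, 173)
    else fa.volume_sum
end as volume_corrected,
case
    when fa.occupancy_avg > ms.occupancy_95th
        -- 0.8: fallback occupancy cap when no 95th percentile is available
        -- (document the rationale here as well)
        then coalesce(ms.occupancy_95th, 0.8)
    else fa.occupancy_avg
end as occupancy_corrected
```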
    detector_id,
    sample_date
from {{ ref('int_diagnostics__detector_status') }}
where status = 'Good'
Should you add a date timeframe to this where clause, similar to the previous CTE? Otherwise, you are probably grabbing a very large data set that may affect your model's performance.
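That could mirror the filter from the aggregation CTE quoted earlier, e.g.:

```sql
select
    detector_id,
    sample_date
from {{ ref('int_diagnostics__detector_status') }}
where
    status = 'Good'
    -- restrict the scan to the same window as the five-minute aggregate
    and sample_date >= dateadd(month, -1, date_trunc('month', current_date))
```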
@@ -16,11 +16,10 @@ detector_agg as (
         station_id,
         lane,
         station_type,
         volume_sum,
         occupancy_avg,
-        speed_weighted,
I see you are removing speed_weighted from this model, will this cause any downstream issues?
@thehanggit this model timed out after taking 10 hours today -- I think we may need to disable it and take a closer look at the performance characteristics.
For sure, I will take some time to check the queries tomorrow. I think the mean and variance calculation based on one month of data might contribute to it.
@ian-r-rose I checked the query profile by running a test in a worksheet. It turns out the aggregate (not sure, but it may relate to the variance calculation?) is expensive. I switched from monthly data to weekly data, reducing the size by half. Another way is to use the 95th percentile only to detect outliers, which I tested before; it also performs well for high flow value detection.
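For reference, the percentile-only variant would drop the avg/stddev aggregates entirely, roughly as below (a sketch under the weekly-window change described above; the CTE name is illustrative):

```sql
weekly_stats as (
    select
        detector_id,
        date_trunc('week', sample_date) as week,
        -- thresholds only: no mean/stddev, which the profiling above
        -- suggested were the expensive part of the aggregate
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from five_minute_agg
    group by detector_id, date_trunc('week', sample_date)
)
```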
This is related to Issue #278