
High Flow and Occupancy Detection and Correction #506

Merged: 10 commits merged into main on Jan 15, 2025
Conversation

@thehanggit (Contributor):

This is related to Issue #278

@ian-r-rose (Member) left a comment:

From a high level, this logic seems reasonable to me. I do wonder if it would be worth breaking the outlier removal out into a separate intermediate model: I could imagine that more than one "performance" model would want to use the version with outliers removed. What do you think @thehanggit?

Comment on lines 42 to 53
monthly_stats as (
select
detector_id,
date_trunc('month', sample_date) as month,
avg(volume_sum) as volume_mean,
stddev(volume_sum) as volume_stddev,
-- consider using max_capacity
percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
from five_minute_agg
group by detector_id, date_trunc('month', sample_date)
),
Member:

I don't think this covers the intended time period: five_minute_agg is an incremental model, so in normal usage it only covers the past two days. For the most part, this "monthly" aggregate will only have two days of data in it.
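For context, a dbt incremental model typically restricts its source scan to a short recent window on incremental runs, which is why aggregates built directly on five_minute_agg only ever see that window. A minimal sketch of the pattern, assuming a standard is_incremental() lookback; the upstream model name and the exact two-day filter here are illustrative, not taken from this repo:

{{ config(materialized='incremental') }}

select detector_id, sample_date, volume_sum, occupancy_avg
from {{ ref('upstream_detector_data') }}  -- hypothetical upstream model
{% if is_incremental() %}
    -- on incremental runs only the last ~2 days are scanned, so any
    -- "monthly" aggregate over this relation covers ~2 days, not a month
    where sample_date >= dateadd(day, -2, current_date)
{% endif %}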

@thehanggit (Contributor, Author):

In that case, I can use the preceding month's statistics (mean, std) to detect outliers in the current month.
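A minimal sketch of that idea, assuming monthly_stats is built from a model that actually sees full months (rather than from the incremental five_minute_agg), with each row scored against the preceding month:

select
    fa.*,
    ms.volume_mean,
    ms.volume_stddev
from five_minute_agg as fa
inner join monthly_stats as ms
    on
        fa.detector_id = ms.detector_id
        -- join each sample to the statistics of the month before it
        and date_trunc('month', fa.sample_date) = dateadd(month, 1, ms.month)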

Member:

I think that's fine, but the incremental logic would have to change: the logic here doesn't really work for any date, because you are never actually computing monthly stats (five_minute_agg only has two days of data in it).

Contributor:

The updated approach will be to create a separate high-flow outlier detection model in the clearinghouse schema, which will need to account for the incremental-logic scenario @ian-r-rose mentions in the previous comment. I can see a scenario where incrementality is not needed, but we will see what @thehanggit comes up with. Looking forward to seeing the update!

Member:

Sounds good!

@thehanggit (Contributor, Author):

> From a high level, this logic seems reasonable to me. I do wonder if it would be worth breaking the outlier removal out into a separate intermediate model: I could imagine that more than one "performance" model would want to use the version with outliers removed. What do you think @thehanggit?

Thank you @ian-r-rose for taking a look! I talked with @kengodleskidot just now, and we may want to detect and fix the true outliers in the clearinghouse. The outliers in the performance model might be attributable to imputation (those are the "fake" outliers), and we want to leave that part to the imputation model. I will develop a separate outlier removal model in the clearinghouse folder, because it would also influence the g-factor speed calculation.

Commit: "…nd fixed the logic for incremental model. Please check the logic and see if it's reasonable so we can move forward to connecting the imputed data to downstream models."
@kengodleskidot (Contributor) left a comment:

Nice work @thehanggit! I don't want my comments to hold up this PR, since I do not believe they would change the output of the model, but feel free to reach out if you would like to discuss any of them.

occupancy_avg
from {{ ref('int_clearinghouse__detector_agg_five_minutes') }}
where
sample_date >= dateadd(month, -1, date_trunc('month', current_date))
Contributor:

I believe one of these values is a timestamp and the other is a date. Do you run into any issues when performing this comparison?
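As far as I know, Snowflake coerces the date side of a mixed comparison to a timestamp at midnight, so this filter should behave correctly either way, but an explicit cast makes the intent clear. A sketch, assuming sample_date is the timestamp column:

-- cast the timestamp down to a date so both sides of the comparison match
where sample_date::date >= dateadd(month, -1, date_trunc('month', current_date))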

case
when
(fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) > 3
then coalesce(ms.volume_95th, 173)
Contributor:

I recommend adding a note here explaining why a value of 173 is being used.

case
when
fa.occupancy_avg > ms.occupancy_95th
then coalesce(ms.occupancy_95th, 0.8)
Contributor:

Adding a similar note here on why 0.8 is being used would be beneficial.
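One lightweight way to address both of these comments is an inline comment on each fallback constant. A sketch below; the TODO text and the else branches are placeholders, since the snippets above are truncated and the actual rationale for 173 and 0.8 is not stated in this thread:

case
    when (fa.volume_sum - ms.volume_mean) / nullifzero(ms.volume_stddev) > 3
        -- fallback cap when no 95th percentile exists for this detector;
        -- TODO: document why 173 was chosen (rationale not given in this PR)
        then coalesce(ms.volume_95th, 173)
    else fa.volume_sum
end as volume_sum,
case
    when fa.occupancy_avg > ms.occupancy_95th
        -- TODO: document why 0.8 was chosen (rationale not given in this PR)
        then coalesce(ms.occupancy_95th, 0.8)
    else fa.occupancy_avg
end as occupancy_avg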

detector_id,
sample_date
from {{ ref('int_diagnostics__detector_status') }}
where status = 'Good'
Contributor:

Should you add a date timeframe to this where statement, similar to the previous CTE? Otherwise you are probably grabbing a very large data set that may affect your model's performance.
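A sketch of the suggested filter, mirroring the one-month window used in the earlier CTE:

select
    detector_id,
    sample_date
from {{ ref('int_diagnostics__detector_status') }}
where
    status = 'Good'
    and sample_date >= dateadd(month, -1, date_trunc('month', current_date))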

@@ -16,11 +16,10 @@ detector_agg as (
station_id,
lane,
station_type,
volume_sum,
occupancy_avg,
speed_weighted,
Contributor:

I see you are removing speed_weighted from this model; will this cause any downstream issues?

@thehanggit merged commit 1655e15 into main on Jan 15, 2025 (3 checks passed).
@thehanggit deleted the hang_high_flow branch on January 15, 2025 at 23:16.
@jkarpen mentioned this pull request on Jan 16, 2025.
@ian-r-rose (Member):

@thehanggit this model timed out after taking 10 hours today -- I think we may need to disable it and take a closer look at its performance characteristics.

@thehanggit (Contributor, Author):

> @thehanggit this model timed out after taking 10 hours today -- I think we may need to disable it and take a closer look at its performance characteristics.

For sure, I will take some time to check the queries tomorrow. I think the mean and variance calculation over one month of data might contribute to it.

@thehanggit (Contributor, Author):

@ian-r-rose I checked the query profile by running a test in a worksheet. It turns out the aggregate step (possibly related to the variance calculation?) is expensive. I switched from monthly data to weekly data, reducing the size by half. Another option is to use only the 95th percentile to detect outliers, which I tested before; it also performed well for detecting high flow values (see the sketch after the screenshots below).

[Attached screenshots: query profiles from the worksheet test]
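A minimal sketch of the percentile-only variant described above, assuming weekly windows; it drops the mean/stddev aggregates entirely and caps flagged values at the 95th percentile:

weekly_stats as (
    select
        detector_id,
        date_trunc('week', sample_date) as week,
        percentile_cont(0.95) within group (order by volume_sum) as volume_95th,
        percentile_cont(0.95) within group (order by occupancy_avg) as occupancy_95th
    from five_minute_agg
    group by detector_id, date_trunc('week', sample_date)
)

-- rows with volume_sum > volume_95th (or occupancy_avg > occupancy_95th)
-- would then be flagged as outliers and capped at the percentile value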
