Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add funnel plot methods #812

Merged
merged 19 commits into from
Oct 27, 2022
Merged

Add funnel plot methods #812

merged 19 commits into from
Oct 27, 2022

Conversation

Zhongkai-Wang
Copy link
Contributor

@Zhongkai-Wang Zhongkai-Wang commented Oct 17, 2022

Overview

Fix #750
Fix #751
Fix #752

  • New functions created:
    Analyze_Binary()
    Analyze_Rate()
    Flag_Funnel()
    Analyze_Binary_PredictBounds()
    Analyze_Rate_PredictBounds()

  • Assessment and workflow are updated to use new methods as default:
    Need more work to clean and update across the board (for example, in *_Assess(), Visualize_Scatter(), tests).

Test Notes/Sample Code

# LB ----------------------------------------------------------------------
dfInput <- LB_Map_Raw()

dfTransformed <- Transform_Rate(
  dfInput,
  strGroupCol = "SiteID",
  strNumeratorCol = "Count",
  strDenominatorCol = "Total"
)

dfAnalyzed <- Analyze_Binary(dfTransformed)

dfFlagged <- Flag_Funnel(dfAnalyzed, vThreshold = c(-3, -2, 2, 3))

dfBounds <- Analyze_Binary_PredictBounds(dfTransformed, vThreshold = c(-3, -2, 2, 3))

# Disp --------------------------------------------------------------------
dfInput <- Disp_Map_Raw()

dfTransformed <- Transform_Rate(
  dfInput,
  strGroupCol = "SiteID",
  strNumeratorCol = "Count",
  strDenominatorCol = "Total"
)

dfAnalyzed <- Analyze_Binary(dfTransformed)

dfFlagged <- Flag_Funnel(dfAnalyzed, vThreshold = c(-3, -2, 2, 3))

dfBounds <- Analyze_Binary_PredictBounds(dfTransformed, vThreshold = c(-3, -2, 2, 3))

# AE ----------------------------------------------------------------------
dfInput <- AE_Map_Raw() %>% na.omit()

dfTransformed <- Transform_Rate(
  dfInput,
  strGroupCol = "SiteID",
  strNumeratorCol = "Count",
  strDenominatorCol = "Exposure"
)
dfAnalyzed <- Analyze_Rate(dfTransformed)

dfFlagged <- Flag_Funnel(dfAnalyzed, vThreshold = c(-3, -2, 2, 3))

dfBounds <- Analyze_Rate_PredictBounds(dfTransformed, vThreshold = c(-3, -2, 2, 3))

# PD ----------------------------------------------------------------------
dfInput <- PD_Map_Raw() 

dfTransformed <- Transform_Rate(
  dfInput,
  strGroupCol = "SiteID",
  strNumeratorCol = "Count",
  strDenominatorCol = "Exposure"
)

dfAnalyzed <- Analyze_Rate(dfTransformed)

dfFlagged <- Flag_Funnel(dfAnalyzed, vThreshold = c(-3, -2, 2, 3))

dfBounds <- Analyze_Rate_PredictBounds(dfTransformed, vThreshold = c(-3, -2, 2, 3))

@Zhongkai-Wang Zhongkai-Wang added this to the v1.3.0 milestone Oct 17, 2022
@Zhongkai-Wang Zhongkai-Wang self-assigned this Oct 17, 2022
@gwu05 gwu05 self-requested a review October 18, 2022 18:02
@Zhongkai-Wang Zhongkai-Wang marked this pull request as ready for review October 19, 2022 13:39
@@ -69,7 +69,8 @@ AE_Map_Adam <- function(
dfInput <- dfs$dfADSL %>%
mutate(
SubjectID = .data[[lMapping$dfADSL$strIDCol]],
Exposure = as.numeric(.data[[lMapping$dfADSL$strEndCol]] - .data[[lMapping$dfADSL$strStartCol]]) + 1) %>%
Exposure = as.numeric(.data[[lMapping$dfADSL$strEndCol]] - .data[[lMapping$dfADSL$strStartCol]]) + 1
) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems accident enter press (probably not a change needed?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just an automated change @gwu05 - I ran the code formatter prior to my review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see it makes sense

Copy link
Contributor

@samussiah samussiah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's so cool seeing theory applied to real data - love it @Zhongkai-Wang! I reviewed the formulas in the paper and I kind of get what's going on but @gwu05 should probably have the final sign off on the stats.


dfAnalyzed <- dfTransformed %>%
mutate(
z_0 = (.data$Metric - mean(.data$Metric)) /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the estimate of p0 here should be: sum(.data$Numerator) / sum(.data$Denominator) as per Zink?

image

Copy link
Contributor Author

@Zhongkai-Wang Zhongkai-Wang Oct 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sum(.data$Numerator) / sum(.data$Denominator)

Good catch! Thank you, George! They are updated.

R/Analyze_Rate.R Outdated

dfAnalyzed <- dfTransformed %>%
mutate(
z_0 = (.data$Metric - mean(.data$Metric)) /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same applies for the rate :
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

#' The input data (`dfTransformed`) for Analyze_Poisson is typically created using
#' \code{\link{Transform_Rate}} and should be one record per site with columns for:
#' - `GroupID` - Unique subject ID
#' - `Numerator` - Number of Events
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the predictbounds, I'm seeing the output Numerator is actually the Metric, and we need to calculate out the actual numerator. I think may be ideal to have both generated from the dfBounds, so have both Numerator, Denominator, and Metric. This applies to Rate Predict Bounds as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the error and inconsistency. They are updated.

@gwu05
Copy link
Contributor

gwu05 commented Oct 25, 2022

Thanks! Looks good - for dfBounds, let's add in the issue we'll want to either output the 'center' line (i.e. the overall metric - 1 number when numerator is metric, and a line when y-axis is numerator) or have the viz code to plot it

@Zhongkai-Wang
Copy link
Contributor Author

for dfBounds, let's add in the issue we'll want to either output the 'center' line (i.e. the overall metric - 1 number when numerator is metric, and a line when y-axis is numerator) or have the viz code to plot it

Thanks! Added to #751.

Copy link
Contributor

@jwildfire jwildfire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks great! Nice job! Marking approved since functionality is working great for me.

Some (slightly annoying, but needed) naming changes below. These can be done here or in a new PR:

  • Combine Analyze_Binary and Analyze_Rate into Analyze_NormalApprox with a strMethod param with options for binary and rate
  • Change strMethod="funnel" to "normalApprox" in all Assess functions

@@ -55,14 +55,14 @@

AE_Assess <- function(dfInput,
vThreshold = NULL,
strMethod = "poisson",
strMethod = "funnel",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zhongkai-Wang @gwu05 - Can we call this "normalApprox" or just "normal" or something like that? A Funnel plot is really just a type of visualization, not a method for analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(not going to comment on this repeatedly, but it will require a bunch of minor changes)

#'
#' @export

Analyze_Binary <- function(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would Analyze_Normal_Binary or something similar be more descriptive?

Comment on lines +52 to +57
vMu = sum(.data$Numerator) / sum(.data$Denominator),
z_0 = (.data$Metric - .data$vMu) /
sqrt(.data$vMu * (1 - .data$vMu) / .data$Denominator),
phi = mean(.data$z_0^2),
z_i = (.data$Metric - .data$vMu) /
sqrt(.data$phi * .data$vMu * (1 - .data$vMu) / .data$Denominator)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean and easy to follow. Will be good to explain this (and link to references) in the proposed stat vignette.

Numerator = .data$Metric * .data$Denominator
) %>%
# Only positive percentages are meaningful bounds
filter(.data$Numerator >= 0) %>%
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dropping negative values is fine (and creates better visualizations) per our previous discussion, but might be worth considering whether there are edge cases where this will create unexpected results (like entire boundary lines being missing).

#'
#' @export

Analyze_Rate <- function(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyze_Normal_Rate?

@jwildfire
Copy link
Contributor

jwildfire commented Oct 27, 2022

image

@gwu05 @Zhongkai-Wang We're seeing lots of flag values of 1 and 2, but very few (only 1) negative flag. Is there a chance these new methods are missing situations where under-reporting is an issue? (Charts are from MakeSnapshot()$results_summary on the PR branch)

@@ -116,7 +118,10 @@ AE_Assess <- function(
if (!bQuiet) cli::cli_alert_success("{.fn Transform_Rate} returned output with {nrow(lData$dfTransformed)} rows.")

# dfAnalyzed --------------------------------------------------------------
if (strMethod == "poisson") {
if (strMethod == "funnel") {
lData$dfAnalyzed <- gsm::Analyze_Rate(lData$dfTransformed, bQuiet = bQuiet)
Copy link
Contributor

@jwildfire jwildfire Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lData$dfAnalyzed <- gsm::Analyze_Rate(lData$dfTransformed, bQuiet = bQuiet)
lData$dfAnalyzed <- gsm::Analyze_NormalApprox(lData$dfTransformed, strType="rate", bQuiet = bQuiet)

@gwu05
Copy link
Contributor

gwu05 commented Oct 28, 2022

image

@gwu05 @Zhongkai-Wang We're seeing lots of flag values of 1 and 2, but very few (only 1) negative flag. Is there a chance these new methods are missing situations where under-reporting is an issue? (Charts are from MakeSnapshot()$results_summary on the PR branch)

@jwildfire @Zhongkai-Wang
This one is an interesting observation - I tried running using our prior setup and got:

image

I think it's worthwhile to do comparison plots for each just to see what's going on in more detail. Based on the results so far, one guess is that there is more room for variability on the up-side, so the variance isn't symmetric but the normal approximation assumes so, which makes it more difficult to call under-reporting etc. (we can probably see this if we drew the boundaries [rather than making them 0] for the lower boundary, and we'll find that a large section of isn't a possible outcome for the lower boundary side)

Here, I drew an example:
image

non-jittered version:
image

@Zhongkai-Wang
Copy link
Contributor Author

I think it's worthwhile to do comparison plots for each just to see what's going on in more detail. Based on the results so far, one guess is that there is more room for variability on the up-side, so the variance isn't symmetric but the normal approximation assumes so, which makes it more difficult to call under-reporting etc. (we can probably see this if we drew the boundaries [rather than making them 0] for the lower boundary, and we'll find that a large section of isn't a possible outcome for the lower boundary side)

Thanks, George. We quickly looked at a few of these scatter plots at scrum yesterday and this is what we suspected as well. I will create a summary and some comparisons.
Exactly same as what you are pointing out here, an initial thought is the funnel plot is always assuming symmetric limits around the overall mean regardless the variance at each sample size. However, in reality with the actual data, the distribution at each sample size isn't symmetric and also it's truncated at zero.

@samussiah samussiah deleted the fix-750 branch December 19, 2022 20:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants