
Frequency Method COI = 1 threshold #21

Closed
arisp99 opened this issue Feb 24, 2022 · 1 comment
arisp99 commented Feb 24, 2022

I think the current check_freq_method is too conservative. For example, the following sample, judging by its WSMAF vs. PLMAF, is almost certainly COI = 2, but it fails check_freq_method because it has too few variant loci (6,550 observed against a 95% lower bound of 6,560). This is an extreme example, but I came across others with, say, 3,000 loci that were clearly COI = 2 yet had fewer variant loci than expected because of relatedness.
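To make the failure mode concrete, here is a hedged sketch of the kind of comparison described above (the function and variable names are illustrative, not coiaf internals): the observed variant-site count is compared against the lower 95% bound of the count expected under Hardy-Weinberg.

```r
# Illustrative sketch, not the coiaf implementation.
set.seed(1)
L <- 1e4
plmaf <- stats::rbeta(L, 1, 5)
plmaf[plmaf > 0.5] <- 1 - plmaf[plmaf > 0.5]

# Expected per-locus probability of a variant site under Hardy-Weinberg
hw <- 2 * plmaf * (1 - plmaf)

# Lower bound of the 95% binomial CI on the expected variant-site count
expected <- sum(hw)
lower_95 <- stats::binom.test(round(expected), L)$conf.int[1] * L

# A sample whose observed count falls below the bound is currently
# forced to COI = 1; relatedness can push a true COI = 2 sample
# just under this bound.
fails_check <- function(observed) observed < lower_95
```

A monoclonal sample should fall far below the bound, so the check catches clear COI = 1 cases; the problem arises only for samples sitting marginally under it.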

On reflection about the code design, rather than having coiaf return COI = 1, it would be better to have it execute as normal and attach the note. The end user can then decide whether the COI returned by coiaf should be taken at face value or should in fact be 1. As it currently works, we are forcing too many COI = 2 samples with some relatedness (which the Frequency Method is less affected by) to return COI = 1, when coiaf, left to run normally, would return COI = 2.

We could try loosening the threshold for setting COI = 1 by using a higher confidence interval than 95%. It may, as you mentioned, be better to let the algorithm run all the way through and add a note if there were not enough variant sites. With that approach, we will also see many more samples for which we estimate the maximum COI. I do worry that users will miss the note and not take the uncertainty surrounding the estimate into consideration, but that concern holds regardless of which strategy we employ.

With that in mind, I am starting to think the best course of action is to return a special value that makes the uncertainty explicit (perhaps just NA_real_ or NaN), with attributes attached to the result. We can then run our estimation on the data and add a note saying the COI could be 1 or it could be the estimated value. This gives more advanced users the opportunity to choose how to handle these samples while preventing less experienced users from making unintentional assumptions about the results.
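One way this could look (a sketch only; the function name and attribute names are assumptions, not the coiaf API) is to return NA_real_ and carry the continued estimate and explanatory note as attributes:

```r
# Hypothetical helper, not part of coiaf: flag an uncertain COI while
# preserving the estimate the algorithm would otherwise have returned.
flag_uncertain_coi <- function(estimated_coi) {
  result <- NA_real_
  attr(result, "estimate") <- estimated_coi
  attr(result, "note") <- paste(
    "Fewer variant sites than expected under Hardy-Weinberg;",
    "the true COI may be 1 or the attached estimate."
  )
  result
}

res <- flag_uncertain_coi(2)
is.na(res)              # basic users see an explicit missing value
attr(res, "estimate")   # advanced users can still recover the estimate
```

Basic users who ignore attributes see an unambiguous NA, while advanced users can inspect attr(res, "estimate") and decide for themselves.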

Maybe one option would be to have anything that the discrete Method 1 returns as COI = 1 also return from Method 2 as COI = 1 with a note, rather than using check_freq_method?

I think it is better to leave the two methods separate from one another and not have the Frequency Method call the Variant Method in its estimation.

Originally posted by @arisp99 in #17 (comment)


arisp99 commented Feb 24, 2022

In an effort to explore adjusting this threshold, I tested several solutions for determining the confidence interval (CI) on the expected number of variant sites given the PLMAF. While the four techniques tested produce essentially the same interval, moving to a 99% CI slightly decreases the lower bound on the expected number of variant sites.

library(coiaf)
library(magrittr) # provides the %>% pipe used below

# Define the number of loci and the distribution of minor allele frequencies
# (no seed is set, so rerunning will give slightly different values)
L <- 1e3
plmaf <- stats::rbeta(L, 1, 5)
plmaf[plmaf > 0.5] <- 1 - plmaf[plmaf > 0.5]

# Compute expected number of variant sites using Hardy-Weinberg
hardy_weinberg <- 2 * plmaf * (1 - plmaf)
n_loci <- length(plmaf)

# What we currently do in the package
Hmisc::binconf(sum(hardy_weinberg), n_loci) * n_loci
#>  PointEst    Lower    Upper
#>  235.9207 210.6474 263.2151

# Alternate solution #1
prop.test(sum(hardy_weinberg), n_loci)$conf.int * n_loci
#> [1] 210.1685 263.7323
#> attr(,"conf.level")
#> [1] 0.95

# Alternate solution #2
binom.test(round(sum(hardy_weinberg)), n_loci)$conf.int * n_loci
#> [1] 209.9908 263.5732
#> attr(,"conf.level")
#> [1] 0.95

# Alternate solution #3: simulate variant-site counts directly
# (note these quantiles span a 90% interval, slightly narrower than
# the 95% intervals above)
purrr::rerun(1e5, sum(runif(L) <= hardy_weinberg)) %>%
  unlist() %>%
  quantile(c(0.05, 0.95))
#>  5% 95% 
#> 215 257

# Produce the 99% confidence interval
Hmisc::binconf(sum(hardy_weinberg), n_loci, alpha = 0.01) * n_loci
#>  PointEst   Lower    Upper
#>  235.9207 203.148 272.1746
Created on 2022-02-24 by the reprex package (v2.0.1)

@arisp99 arisp99 added this to the 1.0.0 milestone Mar 23, 2022