The current check_freq_method is, I think, too conservative. For example, the following sample, judging by WSMAF vs PLMAF, is almost certainly a COI of 2, but it fails check_freq_method because it has too few loci (it has 6550 variant loci, while the 95% threshold is 6560). This is an extreme example, but there were others I came across with, say, 3000 loci that were clearly COI of 2, where relatedness meant we had fewer loci than expected.
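For context, the check in question amounts to comparing a sample's observed number of variant loci against the lower bound of a binomial confidence interval on the expected number under Hardy-Weinberg; in the example above, 6550 observed loci misses a lower bound of 6560 by only ten loci. A rough sketch of that logic (the helper name is made up here, not the actual coiaf internals):

```r
# Rough sketch of the check being discussed, not the actual coiaf code:
# a sample is flagged as COI = 1 when it has fewer variant loci than the
# lower bound of the binomial CI on the expected count under Hardy-Weinberg.
fails_variant_site_check <- function(n_variant_loci, plmaf, alpha = 0.05) {
  expected <- 2 * plmaf * (1 - plmaf)      # Hardy-Weinberg heterozygosity
  n_loci <- length(plmaf)
  ci <- Hmisc::binconf(sum(expected), n_loci, alpha = alpha) * n_loci
  n_variant_loci < ci[, "Lower"]           # TRUE = too few variant sites
}
```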
On reflection, regarding code design: rather than have coiaf return a COI of 1, have it execute as normal but attach the note. The end user can then decide whether the COI returned by coiaf should be taken at face value or should in fact be 1. As it currently works, we are making too many COI = 2 samples that have some relatedness (which the Frequency Method is less affected by) return as COI = 1, when, if left to coiaf, they would return as COI = 2.
We could try loosening the threshold for setting COI = 1 by using a higher confidence interval instead of the 95% confidence interval. It may, as you mentioned, be better to just let the algorithm run all the way through and add a note if we did not have enough variant sites. If we use this approach, we will also see many more samples for which we estimate the maximum COI. I do worry that users will not notice the note and will not take the uncertainty surrounding the estimation into consideration. That said, the concern about whether users will see the note holds regardless of which strategy we employ.
With that in mind, I am starting to think that the best course of action is to return a special value that makes it clear there is uncertainty in the calculation (perhaps just NA_real_ or NaN). We can then add attributes to this result: run our estimation on the data and attach a note saying the COI could be 1 or it could be the estimated value. This gives more advanced users the opportunity to choose how to handle these samples while preventing more basic users from making unintentional assumptions about the results.
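To make that concrete, here is a minimal sketch of the kind of return value I have in mind (the attribute names are just illustrative, not a settled API):

```r
# Illustrative only: return NA_real_ so the headline value cannot be taken at
# face value, and carry the algorithm's estimate plus a note in attributes.
result <- structure(
  NA_real_,
  estimated_coi = 2,  # what the estimation would have returned
  note = "Too few variant sites: the COI may be 1 or the estimated value."
)

# Advanced users can still pull out the underlying estimate
attr(result, "estimated_coi")
#> [1] 2
```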
Maybe one option would be: for anything that the discrete Method 1 returns as COI = 1, have Method 2 also return a COI of 1 with a note, rather than relying on check_freq_method?
I think it is better to leave the two methods separate from one another and not have the Frequency Method call the Variant Method in its estimation.
To explore adjusting this threshold, I tried several different ways of determining the confidence interval (CI) for the expected number of variant sites given the PLMAF. There is essentially no difference between the four techniques tested at the 95% level, but by moving to the 99% CI we can slightly decrease the lower bound on the expected number of variant sites.
```r
library(coiaf)

# Define the number of loci and the distribution of minor allele frequencies
L <- 1e3
plmaf <- stats::rbeta(L, 1, 5)
plmaf[plmaf > 0.5] <- 1 - plmaf[plmaf > 0.5]

# Compute expected number of variant sites using Hardy-Weinberg
hardy_weinberg <- 2 * plmaf * (1 - plmaf)
n_loci <- length(plmaf)

# What we currently do in the package
Hmisc::binconf(sum(hardy_weinberg), n_loci) * n_loci
#>  PointEst    Lower    Upper
#>  235.9207 210.6474 263.2151

# Alternate solution #1
prop.test(sum(hardy_weinberg), n_loci)$conf.int * 1000
#> [1] 210.1685 263.7323
#> attr(,"conf.level")
#> [1] 0.95

# Alternate solution #2
binom.test(round(sum(hardy_weinberg)), n_loci)$conf.int * 1000
#> [1] 209.9908 263.5732
#> attr(,"conf.level")
#> [1] 0.95

# Alternate solution #3
purrr::rerun(1e5, sum(ifelse(runif(L) <= hardy_weinberg, TRUE, FALSE))) %>%
  unlist() %>%
  quantile(c(0.05, 0.95))
#>  5% 95%
#> 215 257

# Produce the 99% confidence interval
Hmisc::binconf(sum(hardy_weinberg), n_loci, alpha = 0.01) * n_loci
#> PointEst   Lower    Upper
#> 235.9207 203.148 272.1746
```
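Continuing the snippet above, and using the lower bounds it printed (roughly 210.6 at 95% versus 203.1 at 99%), a borderline sample illustrates how the wider interval changes the decision; the observed count of 205 is an invented example:

```r
lower_95 <- (Hmisc::binconf(sum(hardy_weinberg), n_loci) * n_loci)[, "Lower"]
lower_99 <- (Hmisc::binconf(sum(hardy_weinberg), n_loci, alpha = 0.01) * n_loci)[, "Lower"]

observed <- 205  # invented borderline sample, out of these 1000 simulated loci

observed < lower_95  # TRUE with the bounds printed above (~210.6): flagged as COI = 1
observed < lower_99  # FALSE with the looser bound (~203.1): no longer flagged
```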
Originally posted by @arisp99 in #17 (comment)