Filter based on disease-level #18
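Assessing the issue
I've generated an R Markdown report assessing the scope of this issue:
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/HPO_annotations
As you can see, the number of diseases per phenotype ranges from 0 to 4125. That complicates linking phenotype-celltype associations back to particular diseases.
Potential solutions
1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with.
2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340", "HP:0011097.OMIM:619340")
3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).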
|
I think we assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. That has been my assumption throughout.
On Monday, March 20, 2023 at 12:18 PM, Brian M. Schilder wrote:
Assessing the issue
I've generated a rmarkdown report assessing the scope of this issue.
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/HPO_annotations
As you can see, the number of diseases/phenotype can range from 0 to 4125. That presents some complications when trying to link phenotype-celltype associations back to particular diseases.
Potential solutions
1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with.
2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340", "HP:0011097.OMIM:619340")
3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).
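As a rough illustration of what the gene-intersection heuristic in option 3 could look like, here is a minimal base R sketch. All gene symbols, disease labels, and variable names are hypothetical placeholders rather than real HPO annotations, and ranking by raw overlap count is just one assumed scoring choice.

```r
# Toy gene sets; every name below is a hypothetical placeholder.
phenotype_genes <- c("G1", "G2", "G3", "G4", "G5")  # genes annotated to one HPO term
celltype_genes  <- c("G2", "G3", "G9")              # genes specific to the enriched celltype
disease_genes   <- list(                            # genes annotated to each disease
  DISEASE_A = c("G1", "G2", "G3"),
  DISEASE_B = c("G4", "G5"),
  DISEASE_C = c("G3", "G9")
)

# Genes shared by the phenotype and the celltype signature
driver_genes <- intersect(phenotype_genes, celltype_genes)

# For each disease, count how many of those shared genes it also carries
overlap <- vapply(disease_genes,
                  function(g) length(intersect(g, driver_genes)),
                  integer(1))

# Drop diseases with no overlap and rank the rest (assumed scoring: raw count)
ranked <- sort(overlap[overlap > 0], decreasing = TRUE)
print(ranked)
```

In this toy case DISEASE_B shares no genes with the phenotype-celltype intersection, so it would not inherit the celltype association.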
|
Ok, that's potentially a rather strong assumption, considering you can get the same phenotype via different molecular mechanisms. What is your justification for this? |
I just don’t understand the alternative. I don’t see that “diseases” even exist. They are a mental construct. The real thing is the phenotypes. Diseases are just clusters of phenotypes.
|
I've given several here:
Perhaps in the abstract, but think about it from a methodological perspective, in the context of this study and the data we have available. In the HPO, genes get associated with phenotypes via diseases. You can observe this yourself by looking at the annotations:
So even if you believe that "diseases" don't exist, you can think of each phenotype-disease combination as its own "subphenotype", each with a semi-overlapping gene set.
Testing the assumption
I can quantify the effects of this "1 gene set per phenotype" assumption by computing correlations. If the disease that a gene list came from does not matter, the mean correlation within each group of subphenotypes (those belonging to a particular phenotype) should be high. If it is low, the subphenotypes are very different from one another (despite all being associated with the same phenotype).
results <- MultiEWCE::load_example_results()
length(unique(results$HPO_ID))
annot <- HPOExplorer::load_phenotype_to_genes(1)
annot[, HPO_ID.DatabaseID := paste(HPO_ID, LinkID, sep = ".")]
length(unique(annot$HPO_ID.DatabaseID))
#### Get combos with >=4 genes ####
gene_counts <- annot[, list(n = length(unique(Gene))),
                     by = "HPO_ID.DatabaseID"]
gene_counts <- cbind(
  gene_counts,
  data.table::data.table(
    stringr::str_split(gene_counts$HPO_ID.DatabaseID, "\\.", simplify = TRUE)
  ) |> `names<-`(c("HPO_ID", "DatabaseID"))
)
gene_counts_valid <- gene_counts[n >= 4]
# Percentage of phenotype-disease combos that pass the gene count filter
nrow(gene_counts_valid) / nrow(gene_counts) * 100
# Binary gene x subphenotype matrix
X <- HPOExplorer::hpo_to_matrix(
  phenotype_to_genes = annot[HPO_ID.DatabaseID %in% gene_counts_valid$HPO_ID.DatabaseID, ],
  formula = "Gene ~ HPO_ID.DatabaseID"
)
#### Parse names ####
# Use the columns of X here; X_cor only exists inside the function below
name_map <- cbind(
  HPO_ID.DatabaseID = colnames(X),
  data.table::data.table(
    stringr::str_split(colnames(X), "\\.", simplify = TRUE)
  ) |> `names<-`(c("HPO_ID", "DatabaseID"))
) |>
  data.table::setkeyv(cols = "HPO_ID.DatabaseID")
#### Aggregate corr/phenotype group ####
group_cor <- lapply(
  stats::setNames(unique(name_map$HPO_ID),
                  unique(name_map$HPO_ID)),
  function(id) {
    idx <- name_map[HPO_ID == id]$HPO_ID.DatabaseID
    X_sub <- X[, idx, drop = FALSE]
    #### Get number of overlapping genes ####
    rs <- Matrix::rowSums(X_sub)
    intersect_size <- sum(rs == ncol(X_sub))
    union_size <- sum(rs > 0)
    max_overlap <- max(rs, na.rm = TRUE)
    #### Compute corr ####
    X_cor <- WGCNA::cor(x = X_sub)
    diag(X_cor) <- NA
    row_means <- Matrix::rowMeans(X_cor, na.rm = TRUE)
    data.table::data.table(
      intersect_size = intersect_size,
      union_size = union_size,
      max_overlap = max_overlap,
      mean_cor = mean(row_means, na.rm = TRUE),
      sd = stats::sd(row_means, na.rm = TRUE)
    )
  }
) |> data.table::rbindlist(use.names = TRUE, idcol = "HPO_ID")
hist(group_cor$mean_cor, 50)
I found that the subphenotype gene lists are not very strongly correlated with one another. Since the presence/absence of a gene is binary, correlation here is roughly a measure of how much the gene lists overlap. In fact, if you compute correlations using only the subset of genes that appear in >=1 subphenotype within each phenotype group, the correlations even become negative. This is quite strange, but perhaps it has something to do with how the HPO is annotated. I'll summarise this as a report and ask Peter/Ben about it.
Taken together, this means that the gene lists for each subphenotype don't overlap very much.
Methodological drawbacks
That said, even given the low correlation between subphenotypes within a given phenotype, there are some drawbacks to this subphenotype-focused approach:
So theoretical debates aside, it seems that the benefit of aggregating gene lists to the level of phenotypes is that it increases our coverage of testable HPO terms (at least with EWCE) and decreases our multiple testing burden. That said, we still need to contend with the annotation merging issue. I'm leaning towards option 3 here, but need to figure out the details:
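Separately, on the negative correlations reported earlier in this comment: one possible contributing factor (an assumption on my part, not something verified against the HPO data) is purely arithmetic. Pearson correlation between two binary vectors is the phi coefficient, which counts shared absences as agreement. Dropping genes that are absent from every gene set removes those shared absences, which can flip the sign. A toy base R sketch with made-up gene sets:

```r
# Toy example: 1000 genes in the universe, two gene sets sharing 5 genes.
n_genes <- 1000
a <- integer(n_genes); a[1:10] <- 1L   # set A: genes 1-10
b <- integer(n_genes); b[6:15] <- 1L   # set B: genes 6-15

# Correlation across the whole universe: boosted by the 985 shared absences
cor_full <- stats::cor(a, b)

# Same two sets, restricted to genes present in at least one of them
in_union  <- (a + b) > 0
cor_union <- stats::cor(a[in_union], b[in_union])

# Jaccard index: shared genes / genes in either set
jaccard <- sum(a & b) / sum(a | b)

print(round(c(cor_full = cor_full, cor_union = cor_union, jaccard = jaccard), 3))
```

Here the same pair of sets gives a positive correlation over the full universe and a negative one over the union, so an overlap measure such as the Jaccard index may be easier to interpret for this comparison.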
Naming conventions: subphenotype vs. symptom
I'm going to switch all references to "subphenotype" to "symptom", as I think this better describes the presentation of a phenotype in the context of a particular disease. It also avoids confusion with phenotypes at different levels of the HPO ontology, which the term "subphenotype" accidentally implies.
Conclusions
Evaluating different strategies
Option 1 was never a good idea, bc it meant any celltype associated with a phenotype could be inherited by any of the diseases. Option 4 was impractical bc it didn't allow us to test enough symptoms/subphenotypes (a phenotype in a particular disease). Instead, I implemented a strategy that was a combination of Options 2 & 3. Basically, I ran really simple gene enrichment analyses on all possible symptoms x 77 celltypes. After some optimisation for parallelising these operations, I was able to get the tests to run all the way through without exploding the memory usage and crashing our private cloud:
Then, I filtered the celltype-symptom associations to only those that had at least 1 overlapping gene, merged them with our celltype-phenotype enrichment results from before (joining on the HPO_ID + CellType columns), and uploaded the new data to our Releases. In the
What have we gained?
This new dataset now has all of the disease-phenotype links as mediated by specific celltypes and genes! For example, it lets us find things like the association between AD (the disease, not the HPO term) + microglia. This was previously not showing up bc the HPO term for Alzheimer disease did not have enough genes by itself to get enrichment in microglia (or any celltype). But we can see here that microglia are indeed implicated in AD via Neurofibrillary tangles:
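For readers following along, the filter-and-merge step described above might look roughly like the base R sketch below. The tables, IDs, and every column name other than HPO_ID and CellType are hypothetical, and base R merge() stands in for whatever join the actual pipeline uses.

```r
# Hypothetical symptom-level tests (one row per phenotype-disease-celltype combo)
symptom_tests <- data.frame(
  HPO_ID    = c("HP:A", "HP:A", "HP:B"),
  DiseaseID = c("OMIM:1", "OMIM:2", "OMIM:3"),
  CellType  = c("Microglia", "Microglia", "Astrocytes"),
  n_overlap = c(3L, 0L, 1L),  # genes shared by the symptom and the celltype signature
  stringsAsFactors = FALSE
)

# Hypothetical phenotype-level enrichment results from the earlier analysis
phenotype_results <- data.frame(
  HPO_ID   = c("HP:A", "HP:B"),
  CellType = c("Microglia", "Microglia"),
  q        = c(0.001, 0.01),
  stringsAsFactors = FALSE
)

# Keep symptom-celltype pairs supported by at least 1 overlapping gene
symptom_hits <- symptom_tests[symptom_tests$n_overlap >= 1, ]

# Inner join on HPO_ID + CellType, so only enriched pairs survive
merged <- merge(symptom_hits, phenotype_results, by = c("HPO_ID", "CellType"))
print(merged)
```

Only the HP:A / OMIM:1 / Microglia row survives in this toy case: OMIM:2 has no overlapping genes, and HP:B was enriched in a different celltype than the one its symptom overlaps with.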
@NathanSkene @KittyMurphy
To date, I've been aggregating metadata attributes like age of death, severity, and age of onset to the phenotype (HPO_ID) level. This is because each phenotype can be associated with multiple diseases, which is where these annotations primarily come from (although sometimes the annotations are indeed referring to the phenotype, which is confusing).
Given this, it may make more sense to filter metadata attributes at the disease level first. I've refactored HPOExplorer and MultiEWCE to do just this.
However, a problem arises when you go to merge the metadata at certain steps within the prioritise_targets pipeline:
Error:
Internally, the problem arises at this step:
This happens because:
So now there is ambiguity when trying to match our celltype-specific enrichment results back to the disease-level metadata. @bobGSmith could you confirm whether this explanation makes sense?
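The original error message and failing step are not reproduced here, so the following is only a generic illustration of this kind of ambiguity, not a reproduction of the actual failure. It assumes phenotype-level enrichment results keyed on HPO_ID and disease-level metadata in which one HPO_ID maps to several diseases. All IDs, values, and column names other than HPO_ID and CellType are hypothetical.

```r
# Phenotype-level enrichment results: no disease column
enrichment <- data.frame(
  HPO_ID   = c("HP:A", "HP:A"),
  CellType = c("Microglia", "Astrocytes"),
  stringsAsFactors = FALSE
)

# Disease-level metadata: the same phenotype appears under three diseases
disease_meta <- data.frame(
  HPO_ID    = c("HP:A", "HP:A", "HP:A"),
  DiseaseID = c("OMIM:1", "OMIM:2", "OMIM:3"),
  Onset     = c("Infantile", "Adult", "Childhood"),
  stringsAsFactors = FALSE
)

# HPO_ID is not unique on either side, so the join is many-to-many
merged <- merge(enrichment, disease_meta, by = "HPO_ID")
print(nrow(merged))            # 2 celltypes x 3 diseases
print(table(merged$CellType))  # each celltype result is now tied to every disease
```

Each enrichment row is duplicated once per disease, and nothing in the join says which disease a given phenotype-celltype result actually belongs to.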
Session info
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] HPOExplorer_0.99.7 dplyr_1.1.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.6.1 usethis_2.1.6 devtools_2.4.5
[5] httr_1.4.5 rprojroot_2.0.3 tools_4.2.1 profvis_0.3.7
[9] utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2 BiocGenerics_0.44.0
[13] colorspace_2.1-0 urlchecker_1.0.1 Exact_3.2 tidyselect_1.2.0
[17] prettyunits_1.1.1 processx_3.8.0 ontologyIndex_2.10 compiler_4.2.1
[21] graph_1.76.0 cli_3.6.0 scKirby_0.1.0 Biobase_2.58.0
[25] BiocCheck_1.34.3 expm_0.999-7 network_1.18.1 plotly_4.10.1
[29] scales_1.2.1 mvtnorm_1.1-3 proxy_0.4-27 callr_3.7.3
[33] RBGL_1.74.0 stringr_1.5.0 digest_0.6.31 stringdist_0.9.10
[37] pkgconfig_2.0.3 htmltools_0.5.4 sessioninfo_1.2.2 fastmap_1.1.1
[41] readxl_1.4.2 htmlwidgets_1.6.1 rlang_1.1.0 rstudioapi_0.14
[45] shiny_1.7.4 generics_0.1.3 jsonlite_1.8.4 statnet.common_4.8.0
[49] RCurl_1.98-1.10 magrittr_2.0.3 ggnetwork_0.5.12 Matrix_1.5-3
[53] Rcpp_1.0.10 DescTools_0.99.48 munsell_0.5.0 fansi_1.0.4
[57] lifecycle_1.0.3 stringi_1.7.12 rootSolve_1.8.2.3 MASS_7.3-58.3
[61] pkgbuild_1.4.0 biocViews_1.66.3 grid_4.2.1 parallel_4.2.1
[65] promises_1.2.0.1 lmom_2.9 crayon_1.5.2 miniUI_0.1.1.1
[69] lattice_0.20-45 knitr_1.42 ps_1.7.2 pillar_1.8.1
[73] RUnit_0.4.32 gld_2.6.6 boot_1.3-28.1 codetools_0.2-19
[77] stats4_4.2.1 pkgload_1.3.2 XML_3.99-0.13 glue_1.6.2
[81] data.table_1.14.8 remotes_2.4.2 BiocManager_1.30.20 vctrs_0.6.0
[85] httpuv_1.6.9 cellranger_1.1.0 gtable_0.3.1 purrr_1.0.1
[89] tidyr_1.3.0 cachem_1.0.7 ggplot2_3.4.1 xfun_0.37
[93] mime_0.12 xtable_1.8-4 devoptera_0.99.0 e1071_1.7-13
[97] coda_0.19-4 later_1.3.0 class_7.3-21 viridisLite_0.4.1
[101] tibble_3.2.0 memoise_2.0.1 ellipsis_0.3.2 here_1.0.1