Filter based on disease-level #18

bschilder · 2023-03-20T11:59:16Z

@NathanSkene @KittyMurphy
To date, I've been aggregating metadata attributes like age of death, severity, and age of onset to the phenotype (HPO_ID) level. This is because each phenotype can be associated with multiple diseases, which is where these annotations primarily come from (although sometimes the annotations are indeed referring to the phenotype, which is confusing).

Given this, it may make more sense to filter metadata attributes at the disease level first. I've refactored HPOExplorer and MultiEWCE to do just this.

However, a problem arises when you go to merge the metadata at certain steps within the prioritise_targets pipeline:

devoptera::args2vars(MultiEWCE::prioritise_targets) ##initialise all variables

  results <- HPOExplorer::add_hpo_id(phenos = results,
                                     phenotype_to_genes = phenotype_to_genes,
                                     hpo = hpo,
                                     verbose = verbose)
  results <- HPOExplorer::add_ancestor(phenos = results,
                                       hpo = hpo,
                                       remove_descendants = remove_descendants,
                                       verbose = verbose)
  if(!is.null(q_threshold)){
    messager("Filtering @ q-value <=",q_threshold,v=verbose)
    results <- results[q<=q_threshold,]
  }
  if(!is.null(fold_threshold)){
    messager("Filtering @ fold-change >=",fold_threshold,v=verbose)
    results <- results[fold_change>=fold_threshold,]
  }
  results <- HPOExplorer::add_ont_lvl(phenos = results,
                                      absolute = TRUE,
                                      keep_ont_levels = keep_ont_levels,
                                      verbose = verbose)

#### keep_onsets #### <----- Errors first occur here
  results <- HPOExplorer::add_onset(phenos = results,
                                    keep_onsets = keep_onsets,
                                    allow.cartesian = TRUE,
                                    verbose = verbose)

Error:

Annotating phenos with Onset.
Annotating phenos with Disease
Importing existing file: ... phenotype.hpoa
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 995602 rows; more than 261047 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Internally, the problem arises at this step:

   results <- add_disease(phenos = results,
                          allow.cartesian = FALSE,
                          verbose = verbose)

This happens because:

1. Each phenotype can map to multiple diseases.
2. Each phenotype can be enriched in >1 celltype.
In the original EWCE analyses, we did not distinguish between gene sets of a phenotype that come from one disease vs. another disease. Instead, we took all genes that each phenotype can possibly be associated with (across all diseases) and took that as the signature.
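The failure can be reproduced on toy data. The sketch below is only an illustration, assuming made-up phenotype, celltype, and disease IDs (none of these values come from the real results). It relies solely on data.table's `merge` method and its `allow.cartesian` argument.

```r
library(data.table)

# Toy enrichment results: one phenotype enriched in 3 celltypes (placeholder values)
results <- data.table(HPO_ID   = "HP:0000001",
                      CellType = c("celltype_A", "celltype_B", "celltype_C"))
# Toy disease annotations: the same phenotype annotated to 4 diseases (placeholder IDs)
annot <- data.table(HPO_ID     = "HP:0000001",
                    DatabaseID = paste0("DISEASE:", 1:4))

# 3 x 4 = 12 joined rows exceeds nrow(x) + nrow(i) = 7, so data.table refuses
err <- tryCatch(
  merge(results, annot, by = "HPO_ID", allow.cartesian = FALSE),
  error = function(e) e
)
inherits(err, "error")

# Explicitly allowing the cartesian join returns every celltype x disease combination
merged <- merge(results, annot, by = "HPO_ID", allow.cartesian = TRUE)
nrow(merged)
```

Every celltype row is paired with every disease row for the phenotype, which is exactly the ambiguity described above: the join cannot tell which disease a given celltype enrichment belongs to.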

So now there is ambiguity when trying to match up our celltype-specific enrichment results back to the disease-level metadata. @bobGSmith could you confirm whether this explanation makes sense?

Session info

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] HPOExplorer_0.99.7 dplyr_1.1.0

loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.6.1 usethis_2.1.6 devtools_2.4.5
[5] httr_1.4.5 rprojroot_2.0.3 tools_4.2.1 profvis_0.3.7
[9] utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2 BiocGenerics_0.44.0
[13] colorspace_2.1-0 urlchecker_1.0.1 Exact_3.2 tidyselect_1.2.0
[17] prettyunits_1.1.1 processx_3.8.0 ontologyIndex_2.10 compiler_4.2.1
[21] graph_1.76.0 cli_3.6.0 scKirby_0.1.0 Biobase_2.58.0
[25] BiocCheck_1.34.3 expm_0.999-7 network_1.18.1 plotly_4.10.1
[29] scales_1.2.1 mvtnorm_1.1-3 proxy_0.4-27 callr_3.7.3
[33] RBGL_1.74.0 stringr_1.5.0 digest_0.6.31 stringdist_0.9.10
[37] pkgconfig_2.0.3 htmltools_0.5.4 sessioninfo_1.2.2 fastmap_1.1.1
[41] readxl_1.4.2 htmlwidgets_1.6.1 rlang_1.1.0 rstudioapi_0.14
[45] shiny_1.7.4 generics_0.1.3 jsonlite_1.8.4 statnet.common_4.8.0
[49] RCurl_1.98-1.10 magrittr_2.0.3 ggnetwork_0.5.12 Matrix_1.5-3
[53] Rcpp_1.0.10 DescTools_0.99.48 munsell_0.5.0 fansi_1.0.4
[57] lifecycle_1.0.3 stringi_1.7.12 rootSolve_1.8.2.3 MASS_7.3-58.3
[61] pkgbuild_1.4.0 biocViews_1.66.3 grid_4.2.1 parallel_4.2.1
[65] promises_1.2.0.1 lmom_2.9 crayon_1.5.2 miniUI_0.1.1.1
[69] lattice_0.20-45 knitr_1.42 ps_1.7.2 pillar_1.8.1
[73] RUnit_0.4.32 gld_2.6.6 boot_1.3-28.1 codetools_0.2-19
[77] stats4_4.2.1 pkgload_1.3.2 XML_3.99-0.13 glue_1.6.2
[81] data.table_1.14.8 remotes_2.4.2 BiocManager_1.30.20 vctrs_0.6.0
[85] httpuv_1.6.9 cellranger_1.1.0 gtable_0.3.1 purrr_1.0.1
[89] tidyr_1.3.0 cachem_1.0.7 ggplot2_3.4.1 xfun_0.37
[93] mime_0.12 xtable_1.8-4 devoptera_0.99.0 e1071_1.7-13
[97] coda_0.19-4 later_1.3.0 class_7.3-21 viridisLite_0.4.1
[101] tibble_3.2.0 memoise_2.0.1 ellipsis_0.3.2 here_1.0.1


bschilder · 2023-03-20T12:18:40Z

Assessing the issue

I've generated an R Markdown report assessing the scope of this issue.
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/HPO_annotations

As you can see, the number of diseases per phenotype can range from 0 to 4125. That presents some complications when trying to link phenotype-celltype associations back to particular diseases.

Potential solutions

1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. In practice, this means setting allow.cartesian=TRUE when merging disease annotations, which explodes the number of potential combinations.
2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340", "HP:0011097. OMIM:619340").
3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).
4. Run EWCE tests for each disease, so we can get a list of celltypes per disease to link back to phenotype-celltype associations. The drawback here is that only 345/8460 diseases have >=4 genes, meaning we wouldn't be able to link the majority of diseases to celltype-level results using this strategy.
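To make the gene-intersection heuristic (the third option) more concrete, here is a minimal sketch in base R. All gene names, disease names, and the scoring rule itself are hypothetical placeholders chosen for illustration; this is one possible reading of "gene intersection", not an implemented method.

```r
# Hypothetical gene sets; all names are placeholders, not real annotations
phenotype_genes <- c("G1", "G2", "G3", "G4", "G5", "G6")
celltype_genes  <- c("G2", "G3", "G4", "G9")
disease_genes   <- list(
  disease_A = c("G1", "G2", "G3"),
  disease_B = c("G5", "G6"),
  disease_C = c("G3", "G4", "G9")
)

# Genes shared by the phenotype and the celltype signature
driver_genes <- intersect(phenotype_genes, celltype_genes)

# Score each disease by how many of those shared genes it contributes
overlap <- vapply(disease_genes,
                  function(g) length(intersect(g, driver_genes)),
                  integer(1))

# Keep only diseases that share at least one gene with the phenotype-celltype pair
linked <- overlap[overlap > 0]
linked
```

Under this rule a disease is linked to a phenotype-celltype result only if it contributes at least one gene that is both a phenotype gene and a celltype-specific gene.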

NathanSkene · 2023-03-20T12:27:39Z

I think we assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. That has been my assumption throughout.

bschilder · 2023-03-20T12:29:58Z

I think we assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. That has been my assumption throughout.

Ok, that's potentially a rather strong assumption, considering you can get the same phenotype via different molecular mechanisms. What is your justification for this?

NathanSkene · 2023-03-20T12:40:02Z

I just don’t understand the alternative. I don’t see that “diseases” even exist. They are a mental construct. The real thing is the phenotypes. Diseases are just clusters of phenotypes.

bschilder · 2023-03-20T14:35:53Z

I just don’t understand the alternative.

I've given several here:
#18 (comment)

I don’t see that “diseases” even exist. They are a mental construct. The real thing is the phenotypes. Diseases are just clusters of phenotypes.

Perhaps in the abstract, but think about it from a methodological perspective in the context of this study and the data we have available.

In the HPO, genes get associated with phenotypes via diseases. You can observe this yourself by looking at the annotations:
https://hpo.jax.org/app/data/annotations

So even if you believe that "diseases" don't exist, you can think of each phenotype-disease combination as its own "subphenotype", each with semi-overlapping gene sets.

Testing the assumption

I can quantify the effects of this "1 gene set per phenotype" assumption by computing correlations.

If the disease where the gene lists came from does not matter, the mean correlations within groups of subphenotypes (belonging to a particular phenotype) should be high. If they are low, it means the subphenotypes are very different from one another (despite all being associated with the same phenotype).

results <- MultiEWCE::load_example_results()
length(unique(results$HPO_ID))

annot <- HPOExplorer::load_phenotype_to_genes(1)
annot[,HPO_ID.DatabaseID:=paste(HPO_ID,LinkID,sep=".")] 
length(unique(annot$HPO_ID.DatabaseID))

#### Get combos with >=4 genes ####
gene_counts <- annot[,list(n=length(unique(Gene))),
                     by="HPO_ID.DatabaseID"]
gene_counts <- cbind(gene_counts,
      data.table::data.table(stringr::str_split(gene_counts$HPO_ID.DatabaseID,"\\.", simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID")))
gene_counts_valid <- gene_counts[n>=4]
nrow(gene_counts_valid)/nrow(gene_counts)*100


X <- HPOExplorer::hpo_to_matrix(phenotype_to_genes = annot[HPO_ID.DatabaseID %in% gene_counts_valid$HPO_ID.DatabaseID,], 
                                formula =  "Gene ~ HPO_ID.DatabaseID")  

#### Parse names ####
name_map <- cbind(HPO_ID.DatabaseID=colnames(X),
      data.table::data.table(stringr::str_split(colnames(X),"\\.", simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID"))) |>
  data.table::setkeyv(cols = "HPO_ID.DatabaseID")
#### Aggregate corr/phenotype group ####
group_cor <- lapply(stats::setNames(unique(name_map$HPO_ID),
                                    unique(name_map$HPO_ID)),
                    function(id){
  idx <- name_map[HPO_ID==id]$HPO_ID.DatabaseID
  X_sub <- X[,idx, drop=FALSE]
  #### Get number of overlapping genes #####
  rs <- Matrix::rowSums(X_sub) 
  intersect_size <- sum(rs==ncol(X_sub))
  union_size <- sum(rs>0)
  max_overlap <- max(rs, na.rm = TRUE) 
  #### Compute corr ####
  X_cor <- WGCNA::cor(x = X_sub) 
  diag(X_cor) <- NA
  rm <- Matrix::rowMeans(X_cor, na.rm = TRUE)
  data.table::data.table(
    intersect_size=intersect_size,
    union_size=union_size,
    max_overlap=max_overlap,
    mean_cor=mean(rm, na.rm=TRUE),
    sd=sd(rm, na.rm = TRUE)
  )
}) |> data.table::rbindlist(use.names = TRUE, idcol = "HPO_ID")

hist(group_cor$mean_cor, 50)

I found that the subphenotype gene lists are not very strongly correlated with one another (the mean of all within-phenotype group mean correlations is 0.0658). Since the presence/absence of a gene is binary, correlation here is essentially a measure of how much the gene lists overlap.

In fact, if you compute correlations using only the subset of genes that appear in >=1 subphenotype within each phenotype group, the subphenotypes are even anti-correlated. This is quite strange, but perhaps it has something to do with how the HPO is annotated. I'll summarise this as a report and ask Peter/Ben about it.
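A small worked example may help interpret these numbers. The sketch below uses two made-up binary gene-membership vectors (not HPO data) to compare Pearson correlation with a direct overlap measure (the Jaccard index), and to show what happens when genes absent from both lists are dropped. Whether this effect accounts for the negative correlations seen in the real data is an assumption that would still need checking.

```r
# Two hypothetical binary gene-membership vectors over a 10-gene background
a <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
b <- c(0, 0, 1, 1, 1, 1, 0, 0, 0, 0)

# Pearson correlation of two binary vectors (the phi coefficient)
cor_all <- cor(a, b)

# Jaccard index: shared genes / genes present in either list
jaccard <- sum(a & b) / sum(a | b)

# Restrict to genes present in at least one list (drops the joint absences)
keep <- as.logical(a | b)
cor_union <- cor(a[keep], b[keep])

c(cor_all = cor_all, jaccard = jaccard, cor_union = cor_union)
```

In this toy case the two lists share 2 of 6 genes (Jaccard 1/3). The correlation is 1/6 on the full background but drops to -0.5 once joint absences are removed, because shared zeros are what pull the correlation upwards. Restricting to the union of genes therefore tends to push correlations down, and partially overlapping lists can end up negative.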

Taken together, this means that the gene lists for each subphenotype don't overlap very much.

Methodological drawbacks

That said, even given the low correlation between subphenotypes within a given phenotype, there are some drawbacks to this subphenotype-focused approach:

1. Each subphenotype potentially has a smaller set of genes than the phenotype. This means that more gene lists will be omitted from EWCE due to the >=4 gene rule.
2. The number of gene lists tested increases from 6,170 phenotypes (our current results) to 33,542 subphenotypes (only 3,946 phenotypes across 335 diseases).

So theoretical debates aside, it seems that the benefit of aggregating gene lists to the level of phenotypes is that it increases our coverage of testable HPO terms (at least with EWCE) and decreases our multiple testing burden.
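As a rough illustration of the multiple testing point, the sketch below compares per-test significance thresholds for the two numbers of gene lists quoted above. It uses a Bonferroni correction and alpha = 0.05 purely as simple assumptions for the example; the pipeline itself filters on q-values.

```r
# Assumed significance level (not taken from the pipeline)
alpha <- 0.05

# Number of gene lists tested under each strategy (figures quoted above)
n_tests <- c(phenotypes = 6170, subphenotypes = 33542)

# Bonferroni per-test threshold for each strategy
bonferroni <- alpha / n_tests

# How many times more gene lists the subphenotype strategy tests
ratio <- n_tests[["subphenotypes"]] / n_tests[["phenotypes"]]

bonferroni
ratio
```

The subphenotype strategy tests more than five times as many gene lists, so any fixed family-wise threshold becomes correspondingly stricter per test.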

That said, we still need to contend with the annotation merging issue. I'm leaning towards option 3 here, but need to figure out the details:
#18 (comment)

bschilder · 2023-03-20T20:43:29Z

Full report here:
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/symptoms

bschilder · 2023-03-22T12:08:16Z

Naming conventions: subphenotype vs. symptom

I'm going to switch all references to "subphenotype" to "symptom", as I think this better describes the presentation of a phenotype in the context of a particular disease. It also reduces confusion between phenotypes of different levels in the HPO ontology, which the term "subphenotype" accidentally implies.

bschilder · 2023-03-22T17:19:23Z

Conclusions

Evaluating different strategies

1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. In practice, this means setting allow.cartesian=TRUE when merging disease annotations, which explodes the number of potential combinations.

2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340", "HP:0011097. OMIM:619340").

3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).

4. Run EWCE tests for each disease, so we can get a list of celltypes per disease to link back to phenotype-celltype associations. The drawback here is that only 345/8460 diseases have >=4 genes, meaning we wouldn't be able to link the majority of diseases to celltype-level results using this strategy.

Option 1 was never a good idea, bc it meant any celltype associated with a phenotype could be inherited by any of the diseases.

Option 4 was impractical bc it didn't allow us to test enough symptoms/subphenotypes (phenotype in a particular disease).

Instead, I implemented a strategy that was a combination of Options 2 & 3. Basically, I ran really simple gene enrichment analyses on all possible symptoms x 77 celltypes.
This meant: 747,584 symptoms x 77 celltypes = 57,563,968 tests!
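The thread does not say which test was used for these "really simple" enrichment analyses. The sketch below therefore assumes a one-sided hypergeometric overlap test, a common choice for gene set enrichment, and applies it to a single hypothetical symptom x celltype pair. The background size, gene names, and set sizes are all placeholders.

```r
# Number of tests implied by the figures above
n_symptoms  <- 747584
n_celltypes <- 77
n_tests <- n_symptoms * n_celltypes

# One hypothetical symptom x celltype test (all gene names are placeholders)
background     <- paste0("G", 1:1000)
celltype_genes <- background[1:100]                       # celltype-specific genes
symptom_genes  <- c(background[1:5], background[501:505]) # genes for one symptom

# Observed overlap between the symptom and celltype gene sets
overlap <- length(intersect(symptom_genes, celltype_genes))

# P(overlap >= observed) under the hypergeometric null (assumed test, see above)
p_value <- phyper(q = overlap - 1,
                  m = length(celltype_genes),
                  n = length(background) - length(celltype_genes),
                  k = length(symptom_genes),
                  lower.tail = FALSE)
```

A test of this kind is cheap enough to repeat tens of millions of times, which is presumably why a lightweight approach was preferred here over full bootstrap-based EWCE.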

After some optimisation for parallelising these operations, I was able to get the tests to run all the way through without exploding the memory usage and crashing our private cloud:
neurogenomics/MSTExplorer#8

Then, I filtered the celltype-symptom associations to only those that had at least 1 overlapping gene, merged it with our celltype-phenotype enrichment results from before (joining on the HPO_ID + CellType columns), and uploaded the new data to our Releases.
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/symptoms
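The filter-then-merge step can be sketched with data.table as follows. The join columns `HPO_ID` and `CellType` come from the description above; the table contents, the `DatabaseID` values, and the `n_overlap` column name are hypothetical placeholders.

```r
library(data.table)

# Hypothetical celltype-symptom associations (one row per symptom x celltype)
symptom_res <- data.table(
  HPO_ID     = c("HP:A", "HP:A", "HP:A", "HP:B"),
  DatabaseID = c("DISEASE:1", "DISEASE:2", "DISEASE:1", "DISEASE:3"),
  CellType   = c("celltype_A", "celltype_A", "celltype_B", "celltype_A"),
  n_overlap  = c(3L, 0L, 1L, 2L)
)
# Hypothetical celltype-phenotype enrichment results
phenotype_res <- data.table(
  HPO_ID   = c("HP:A", "HP:A", "HP:B"),
  CellType = c("celltype_A", "celltype_B", "celltype_A"),
  q        = c(0.01, 0.20, 0.03)
)

# Keep associations supported by at least one shared gene
symptom_res <- symptom_res[n_overlap >= 1]

# Each (HPO_ID, CellType) pair is unique in phenotype_res, so the join
# adds phenotype-level results without multiplying rows
merged <- merge(symptom_res, phenotype_res, by = c("HPO_ID", "CellType"))
merged
```

Because the symptom-level table already carries the disease identifier, joining on both `HPO_ID` and `CellType` avoids the cartesian expansion that the phenotype-only join produced.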

In the MultiEWCE::prioritise_targets pipeline, I also added some new filters to remove celltype-symptom associations with nominal p-value >0.05. You can also filter by q-value, but this might be too stringent given the fact that we did 57 million tests. Besides, it's just meant as a way to link diseases to phenotypes in a celltype/gene-specific manner. Any spurious celltype enrichment results will still be removed when we filter by the celltype-phenotype EWCE test q-values (which prioritise_targets does by default).
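The two-stage filtering logic can be illustrated on a toy table. The column name `symptom_p`, both threshold variable names, and the 0.05 q-value cutoff are assumptions made for this sketch; they are not the actual argument names or defaults of `prioritise_targets`.

```r
library(data.table)

# Hypothetical merged table: symptom-level p-values and phenotype-level q-values
dat <- data.table(
  HPO_ID    = c("HP:A", "HP:A", "HP:B", "HP:C"),
  CellType  = c("celltype_A", "celltype_B", "celltype_A", "celltype_A"),
  symptom_p = c(0.001, 0.04, 0.20, 0.03),
  q         = c(0.01, 0.01, 0.02, 0.50)
)
symptom_p_threshold <- 0.05  # nominal cutoff described above
q_threshold <- 0.05          # assumed cutoff for the EWCE q-values

# Stage 1: lenient nominal p-value filter on the celltype-symptom link
stage1 <- dat[symptom_p <= symptom_p_threshold]

# Stage 2: stricter q-value filter on the celltype-phenotype EWCE result
stage2 <- stage1[q <= q_threshold]
stage2
```

The lenient first stage only decides which disease a phenotype-celltype result is linked to, while the second stage still controls false positives at the phenotype level.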

What have we gained?

This new dataset now has all of the disease-phenotype links as mediated by specific celltypes and genes!
Basically, we can now go down the chain of causality from diseases -> phenotypes -> celltypes -> genes with a decent level of confidence.

For example, it lets us find things like the association between AD (the disease, not the HPO term) + microglia. This was previously not showing up bc the HPO term for Alzheimer disease did not have enough genes by itself to get enrichment in microglia (or any celltype). But we can see here that microglia are indeed implicated in AD via Neurofibrillary tangles:

bschilder self-assigned this Mar 20, 2023

bschilder added the help wanted Extra attention is needed label Mar 20, 2023

bschilder closed this as completed Mar 22, 2023

This was referenced Mar 24, 2023

Assess whether cell types predict clinical course #23

Closed

Repeat symptom-level enrichment analyses using EWCE #25

Closed

bschilder added this to the Publish rare disease celltyping manuscript milestone Oct 23, 2023
