Filter based on disease-level #18

Closed
bschilder opened this issue Mar 20, 2023 · 8 comments

@bschilder
Contributor

@NathanSkene @KittyMurphy
To date, I've been aggregating metadata attributes like age of death, severity, and age of onset to the phenotype (HPO_ID) level. This is because each phenotype can be associated with multiple diseases, which is where these annotations primarily come from (although sometimes the annotations are indeed referring to the phenotype, which is confusing).

Given this, it may make more sense to filter metadata attributes at the disease level first. I've refactored HPOExplorer and MultiEWCE to do just this.

However, a problem arises when you go to merge the metadata at certain steps within the prioritise_targets pipeline:

devoptera::args2vars(MultiEWCE::prioritise_targets) ##initialise all variables

  results <- HPOExplorer::add_hpo_id(phenos = results,
                                     phenotype_to_genes = phenotype_to_genes,
                                     hpo = hpo,
                                     verbose = verbose)
  results <- HPOExplorer::add_ancestor(phenos = results,
                                       hpo = hpo,
                                       remove_descendants = remove_descendants,
                                       verbose = verbose)
  if(!is.null(q_threshold)){
    messager("Filtering @ q-value <=",q_threshold,v=verbose)
    results <- results[q<=q_threshold,]
  }
  if(!is.null(fold_threshold)){
    messager("Filtering @ fold-change >=",fold_threshold,v=verbose)
    results <- results[fold_change>=fold_threshold,]
  }
  results <- HPOExplorer::add_ont_lvl(phenos = results,
                                      absolute = TRUE,
                                      keep_ont_levels = keep_ont_levels,
                                      verbose = verbose)

#### keep_onsets #### <----- Errors first occur here
  results <- HPOExplorer::add_onset(phenos = results,
                                    keep_onsets = keep_onsets,
                                    allow.cartesian = TRUE,
                                    verbose = verbose)

Error:

Annotating phenos with Onset.
Annotating phenos with Disease
Importing existing file: ... phenotype.hpoa
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 995602 rows; more than 261047 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Internally, the problem arises at this step:

   results <- add_disease(phenos = results,
                          allow.cartesian = FALSE,
                          verbose = verbose)

This happens because:

  1. each phenotype can map to multiple diseases.
  2. each phenotype can be enriched in >1 celltype.
  3. In the original EWCE analyses, we did not distinguish between gene sets of a phenotype that come from one disease vs. another disease. Instead, we took all genes that each phenotype can possibly be associated with (across all diseases) and took that as the signature.

So now there is ambiguity when trying to match up our celltype-specific enrichment results back to the disease-level metadata. @bobGSmith could you confirm whether this explanation makes sense?
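The row explosion can be reproduced on toy data. This is only a minimal sketch: the phenotype, celltype, and disease IDs are made up, and it uses a plain data.table merge rather than the internals of add_disease.

```r
library(data.table)

# Toy enrichment results: one phenotype enriched in three celltypes (made-up IDs).
results <- data.table(
  HPO_ID   = rep("HP:TOY1", 3),
  CellType = c("celltype_A", "celltype_B", "celltype_C")
)
# Toy disease annotations: the same phenotype annotated in four diseases.
annot <- data.table(
  HPO_ID     = rep("HP:TOY1", 4),
  DatabaseID = paste0("DISEASE_", 1:4)
)

# Rows the join will produce: per key, (rows in results) x (rows in annot).
n_results <- results[, .N, by = HPO_ID]
n_annot   <- annot[, .N, by = HPO_ID]
expected_rows <- merge(n_results, n_annot, by = "HPO_ID")[, sum(N.x * N.y)]

# Without allow.cartesian, the many-to-many join is refused...
err <- tryCatch(merge(results, annot, by = "HPO_ID"), error = function(e) e)

# ...and with it, every celltype row is paired with every disease row.
merged <- merge(results, annot, by = "HPO_ID", allow.cartesian = TRUE)
```

Counting rows per key on both sides before joining (as `expected_rows` does) is a cheap way to see how large the merged table will be before committing to allow.cartesian = TRUE.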

Session info

```
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] HPOExplorer_0.99.7 dplyr_1.1.0

loaded via a namespace (and not attached):
[1] bitops_1.0-7 fs_1.6.1 usethis_2.1.6 devtools_2.4.5
[5] httr_1.4.5 rprojroot_2.0.3 tools_4.2.1 profvis_0.3.7
[9] utf8_1.2.3 R6_2.5.1 lazyeval_0.2.2 BiocGenerics_0.44.0
[13] colorspace_2.1-0 urlchecker_1.0.1 Exact_3.2 tidyselect_1.2.0
[17] prettyunits_1.1.1 processx_3.8.0 ontologyIndex_2.10 compiler_4.2.1
[21] graph_1.76.0 cli_3.6.0 scKirby_0.1.0 Biobase_2.58.0
[25] BiocCheck_1.34.3 expm_0.999-7 network_1.18.1 plotly_4.10.1
[29] scales_1.2.1 mvtnorm_1.1-3 proxy_0.4-27 callr_3.7.3
[33] RBGL_1.74.0 stringr_1.5.0 digest_0.6.31 stringdist_0.9.10
[37] pkgconfig_2.0.3 htmltools_0.5.4 sessioninfo_1.2.2 fastmap_1.1.1
[41] readxl_1.4.2 htmlwidgets_1.6.1 rlang_1.1.0 rstudioapi_0.14
[45] shiny_1.7.4 generics_0.1.3 jsonlite_1.8.4 statnet.common_4.8.0
[49] RCurl_1.98-1.10 magrittr_2.0.3 ggnetwork_0.5.12 Matrix_1.5-3
[53] Rcpp_1.0.10 DescTools_0.99.48 munsell_0.5.0 fansi_1.0.4
[57] lifecycle_1.0.3 stringi_1.7.12 rootSolve_1.8.2.3 MASS_7.3-58.3
[61] pkgbuild_1.4.0 biocViews_1.66.3 grid_4.2.1 parallel_4.2.1
[65] promises_1.2.0.1 lmom_2.9 crayon_1.5.2 miniUI_0.1.1.1
[69] lattice_0.20-45 knitr_1.42 ps_1.7.2 pillar_1.8.1
[73] RUnit_0.4.32 gld_2.6.6 boot_1.3-28.1 codetools_0.2-19
[77] stats4_4.2.1 pkgload_1.3.2 XML_3.99-0.13 glue_1.6.2
[81] data.table_1.14.8 remotes_2.4.2 BiocManager_1.30.20 vctrs_0.6.0
[85] httpuv_1.6.9 cellranger_1.1.0 gtable_0.3.1 purrr_1.0.1
[89] tidyr_1.3.0 cachem_1.0.7 ggplot2_3.4.1 xfun_0.37
[93] mime_0.12 xtable_1.8-4 devoptera_0.99.0 e1071_1.7-13
[97] coda_0.19-4 later_1.3.0 class_7.3-21 viridisLite_0.4.1
[101] tibble_3.2.0 memoise_2.0.1 ellipsis_0.3.2 here_1.0.1

```
@bschilder bschilder self-assigned this Mar 20, 2023
@bschilder bschilder added the help wanted Extra attention is needed label Mar 20, 2023
@bschilder
Contributor Author

bschilder commented Mar 20, 2023

Assessing the issue

I've generated a rmarkdown report assessing the scope of this issue.
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/HPO_annotations

As you can see, the number of diseases per phenotype ranges from 0 to 4125. This complicates linking phenotype-celltype associations back to particular diseases.

Potential solutions

  1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. In practice, this means setting allow.cartesian=TRUE when merging disease annotations, which explodes the number of potential combinations.
  2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340").
  3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).
  4. Run EWCE tests for each disease, so we can get a list of celltypes per disease to link back to phenotype-celltype associations. The drawback here is that only 345/8460 diseases have >=4 genes, meaning we wouldn't be able to link the majority of diseases to celltype-level results using this strategy.
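To make option 3 a bit more concrete, here is a rough sketch of a gene-intersection heuristic. Everything in it is an assumption for illustration: the gene names, the disease labels, and the scoring rule (count of disease genes that are also phenotype genes and celltype-specific genes) are hypothetical, not something already implemented in HPOExplorer or MultiEWCE.

```r
# Made-up gene sets for one phenotype-celltype enrichment result.
phenotype_genes <- c("G1", "G2", "G3", "G4", "G5")
celltype_genes  <- c("G2", "G3", "G4", "G9")

# Made-up gene sets for the diseases annotated with this phenotype.
disease_genes <- list(
  disease_A = c("G1", "G2", "G3"),
  disease_B = c("G5", "G8")
)

# Genes plausibly driving the enrichment: phenotype genes that are also
# specific to the celltype.
driver_genes <- intersect(phenotype_genes, celltype_genes)

# Score each disease by how many of its genes are among the driver genes.
scores <- vapply(disease_genes,
                 function(g) length(intersect(g, driver_genes)),
                 integer(1))

# Disease with the largest three-way overlap (ties go to the first one).
best_disease <- names(scores)[which.max(scores)]
```

A disease with a score of zero shares none of the genes behind the enrichment, so it could be dropped from that phenotype-celltype association rather than inheriting it.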

@NathanSkene

NathanSkene commented Mar 20, 2023 via email

@bschilder
Contributor Author

> I think we assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. That has been my assumption throughout.

Ok, that's potentially a rather strong assumption, considering you can get the same phenotype via different molecular mechanisms. What is your justification for this?

@NathanSkene

NathanSkene commented Mar 20, 2023 via email

@bschilder
Contributor Author

bschilder commented Mar 20, 2023

> I just don’t understand the alternative.

I've given several here:
#18 (comment)

> I don’t see that “diseases” even exist. They are a mental construct. The real thing is the phenotypes. Diseases are just clusters of phenotypes.

Perhaps in the abstract, but think about it from a methodological perspective in the context of this study and the data we have available.

In the HPO, genes get associated with phenotypes via diseases. You can observe this yourself by looking at the annotations:
https://hpo.jax.org/app/data/annotations

So even if you believe that "diseases" don't exist, you can think of each phenotype-disease combination as its own "subphenotype", each with semi-overlapping gene sets.

Testing the assumption

I can quantify the effects of this "1 gene set per phenotype" assumption by computing correlations.

If the disease where the gene lists came from does not matter, the mean correlations within groups of subphenotypes (belonging to a particular phenotype) should be high. If they are low, it means the subphenotypes are very different from one another (despite all being associated with the same phenotype).
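As a toy illustration of what this correlation measures (made-up gene names, base R only), two gene lists can be encoded as binary membership vectors over a shared gene universe and correlated. The correlation rises with overlap, although it is not numerically identical to an overlap proportion such as the Jaccard index:

```r
# Toy universe of 10 genes and two subphenotype gene lists (made-up names).
genes  <- paste0("GENE", 1:10)
list_a <- genes[1:4]
list_b <- genes[3:6]

# Binary membership vectors over the whole universe (1 = gene is in the list).
a <- as.integer(genes %in% list_a)
b <- as.integer(genes %in% list_b)

# Pearson correlation between the two membership vectors.
cor_ab <- stats::cor(a, b)

# Jaccard index: shared genes divided by genes in either list.
jaccard <- length(intersect(list_a, list_b)) / length(union(list_a, list_b))
```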

results <- MultiEWCE::load_example_results()
length(unique(results$HPO_ID))

annot <- HPOExplorer::load_phenotype_to_genes(1)
annot[,HPO_ID.DatabaseID:=paste(HPO_ID,LinkID,sep=".")] 
length(unique(annot$HPO_ID.DatabaseID))

#### Get combos with >=4 genes ####
gene_counts <- annot[,list(n=length(unique(Gene))),
                     by="HPO_ID.DatabaseID"]
gene_counts <- cbind(gene_counts,
      data.table::data.table(stringr::str_split(gene_counts$HPO_ID.DatabaseID,"\\.", simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID")))
gene_counts_valid <- gene_counts[n>=4]
nrow(gene_counts_valid)/nrow(gene_counts)*100


X <- HPOExplorer::hpo_to_matrix(phenotype_to_genes = annot[HPO_ID.DatabaseID %in% gene_counts_valid$HPO_ID.DatabaseID,], 
                                formula =  "Gene ~ HPO_ID.DatabaseID")  

#### Parse names ####
#### (use colnames(X): X_cor only exists inside the function below) ####
name_map <- cbind(HPO_ID.DatabaseID=colnames(X),
      data.table::data.table(stringr::str_split(colnames(X),"\\.", simplify = TRUE)) |> `names<-`(c("HPO_ID","DatabaseID"))) |>
  data.table::setkeyv(cols = "HPO_ID.DatabaseID")
#### Aggregate corr/phenotype group ####
group_cor <- lapply(stats::setNames(unique(name_map$HPO_ID),
                                    unique(name_map$HPO_ID)),
                    function(id){
  idx <- name_map[HPO_ID==id]$HPO_ID.DatabaseID
  X_sub <- X[,idx, drop=FALSE]
  #### Get number of overlapping genes #####
  rs <- Matrix::rowSums(X_sub) 
  intersect_size <- sum(rs==ncol(X_sub))
  union_size <- sum(rs>0)
  max_overlap <- max(rs, na.rm = TRUE) 
  #### Compute corr ####
  X_cor <- WGCNA::cor(x = X_sub) 
  diag(X_cor) <- NA
  rm <- Matrix::rowMeans(X_cor, na.rm = TRUE)
  data.table::data.table(
    intersect_size=intersect_size,
    union_size=union_size,
    max_overlap=max_overlap,
    mean_cor=mean(rm, na.rm=TRUE),
    sd=sd(rm, na.rm = TRUE)
  )
}) |> data.table::rbindlist(use.names = TRUE, idcol = "HPO_ID")

hist(group_cor$mean_cor, 50)

I found that the subphenotype gene lists are not very strongly correlated with one another: the mean of all within-phenotype group mean correlations is 0.0658. Since the presence/absence of a gene is binary, correlation basically equates to "proportion of genes that overlap".
[Image: mean_cor]

In fact, if you compute correlations using only the subset of genes that appear in >=1 subphenotype within each phenotype group, the gene lists are even anti-correlated. This is quite strange, but perhaps it has something to do with how the HPO is annotated. I'll summarise this as a report and ask Peter/Ben about it.
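A possible contributor, offered as an assumption rather than something confirmed in this thread: restricting to genes that appear in at least one list removes all the "absent from both" rows, and those shared absences are what pull a binary correlation upwards. The toy example below (made-up gene names) shows partially overlapping lists going from a positive to a negative correlation purely because of that restriction:

```r
# Toy universe of 10 genes; two partially overlapping lists (made-up names).
genes <- paste0("GENE", 1:10)
a <- as.integer(genes %in% genes[1:4])
b <- as.integer(genes %in% genes[3:6])

# Correlation over the whole universe, including genes absent from both lists.
cor_all <- stats::cor(a, b)

# Correlation restricted to genes present in at least one of the two lists.
in_union  <- (a + b) > 0
cor_union <- stats::cor(a[in_union], b[in_union])
```

If this is what is happening, the sign of the restricted correlation says more about the choice of gene background than about the HPO annotations themselves.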

[Image: mean_cor_sub]

This means that the gene lists for each subphenotype don't overlap very much.

Methodological drawbacks

That said, even given the low correlation between subphenotypes within a given phenotype, there are some drawbacks to this subphenotype-focused approach:

  1. Each subphenotype potentially has a smaller set of genes than the phenotype. This means that more gene lists will be omitted from EWCE due to the >=4 gene rule.
  2. The number of gene lists tested increases from 6,170 phenotypes (our current results) to 33,542 subphenotypes (only 3,946 phenotypes across 335 diseases).

So theoretical debates aside, it seems that the benefit of aggregating gene lists to the level of phenotypes is that it increases our coverage of testable HPO terms (at least with EWCE) and decreases our multiple testing burden.

That said, we still need to contend with the annotation merging issue. I'm leaning towards option 3 here, but need to figure out the details:
#18 (comment)

@bschilder
Contributor Author

bschilder commented Mar 20, 2023

@bschilder
Contributor Author

bschilder commented Mar 22, 2023

Naming conventions: subphenotype vs. symptom

I'm going to switch all references to "subphenotype" to "symptom", as I think this better describes the presentation of a phenotype in the context of a particular disease. It also reduces confusion between phenotypes of different levels in the HPO ontology, which the term "subphenotype" accidentally implies.

@bschilder
Contributor Author

bschilder commented Mar 22, 2023

Conclusions

Evaluating different strategies

  1. Assume that each phenotype always has the same underlying genes and celltype, no matter what disease it is associated with. In practice, this means setting allow.cartesian=TRUE when merging disease annotations, which explodes the number of potential combinations.
  2. Rerun all phenotype EWCE analyses while distinguishing between gene sets that come from different diseases (e.g. "HP:0011097.OMIM:619340").
  3. Figure out some sort of post-hoc heuristic for inferring the disease each phenotype-celltype enrichment result is most relevant to (e.g. gene intersection between phenotype genes + celltype-specific genes + disease genes).
  4. Run EWCE tests for each disease, so we can get a list of celltypes per disease to link back to phenotype-celltype associations. The drawback here is that only 345/8460 diseases have >=4 genes, meaning we wouldn't be able to link the majority of diseases to celltype-level results using this strategy.

Option 1 was never a good idea, bc it meant any celltype associated with a phenotype could be inherited by any of the diseases.

Option 4 was impractical bc it didn't allow us to test enough symptoms/subphenotypes (phenotype in a particular disease).

Instead, I implemented a strategy that was a combination of Options 2 & 3. Basically, I ran really simple gene enrichment analyses on all possible symptoms x 77 celltypes.
This meant: 747,584 symptoms x 77 celltypes = 57,563,968 tests!
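The thread does not say which enrichment test was used, so the following is only a sketch under the assumption of a one-sided hypergeometric overlap test; the gene counts are made up, and only the test count comes from the numbers above.

```r
# Made-up sizes for a single symptom x celltype test.
universe_size  <- 100  # background genes
celltype_genes <- 20   # genes specific to the celltype
symptom_genes  <- 5    # genes annotated to the symptom
overlap        <- 3    # genes in both sets

# P(overlap >= observed) when drawing symptom_genes from the universe.
p_value <- stats::phyper(overlap - 1,
                         m = celltype_genes,
                         n = universe_size - celltype_genes,
                         k = symptom_genes,
                         lower.tail = FALSE)

# Total number of tests implied by the figures quoted above.
n_tests <- 747584 * 77
```

A test this simple is cheap per call, which matters when it has to be repeated tens of millions of times.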

After some optimisation for parallelising these operations, I was able to get the tests to run all the way through without exploding the memory usage and crashing our private cloud:
neurogenomics/MSTExplorer#8

Then, I filtered the celltype-symptom associations to only those that had at least 1 overlapping gene, merged it with our celltype-phenotype enrichment results from before (joining on the HPO_ID + CellType columns), and uploaded the new data to our Releases.
https://neurogenomics.github.io/RareDiseasePrioritisation/reports/symptoms
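The filter-then-merge step can be sketched on toy data as follows. The IDs and values are made up, and the column names `n_overlap` and `q` are assumptions for illustration; only the join keys (HPO_ID + CellType) come from the description above.

```r
library(data.table)

# Toy celltype-symptom associations (symptom = phenotype within a disease).
symptoms <- data.table(
  HPO_ID     = c("HP:TOY1", "HP:TOY1", "HP:TOY2"),
  DatabaseID = c("DISEASE_1", "DISEASE_2", "DISEASE_1"),
  CellType   = c("celltype_A", "celltype_A", "celltype_B"),
  n_overlap  = c(2L, 0L, 1L)  # overlapping genes (hypothetical column)
)

# Toy celltype-phenotype enrichment results from the earlier analysis.
pheno_results <- data.table(
  HPO_ID   = c("HP:TOY1", "HP:TOY2"),
  CellType = c("celltype_A", "celltype_B"),
  q        = c(0.01, 0.20)
)

# Keep associations with at least 1 overlapping gene, then attach the
# phenotype-level results by joining on both key columns.
linked <- merge(symptoms[n_overlap >= 1],
                pheno_results,
                by = c("HPO_ID", "CellType"))
```

Because the phenotype-level table has one row per HPO_ID + CellType pair, this join adds columns without multiplying rows.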

In the MultiEWCE::prioritise_targets pipeline, I also added some new filters to remove celltype-symptom associations with nominal p-value >0.05. You can also filter by q-value, but this might be too stringent given the fact that we did 57 million tests. Besides, it's just meant as a way to link diseases to phenotypes in a celltype/gene-specific manner. Any spurious celltype enrichment results will still be removed when we filter by the celltype-phenotype EWCE test q-values (which prioritise_targets does by default).
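A toy version of this two-stage filter is sketched below. The column names `symptom_p` and `pheno_q` and all values are made up for illustration and are not the actual prioritise_targets arguments; Benjamini-Hochberg via stats::p.adjust is used only to show how much stricter a q-value filter on the symptom tests would be.

```r
library(data.table)

# Made-up associations: nominal p-value from the symptom-level test and
# q-value from the phenotype-level EWCE test.
assoc <- data.table(
  HPO_ID    = c("HP:TOY1", "HP:TOY1", "HP:TOY2", "HP:TOY2"),
  CellType  = c("celltype_A", "celltype_B", "celltype_A", "celltype_B"),
  symptom_p = c(0.01, 0.20, 0.03, 0.04),
  pheno_q   = c(0.001, 0.001, 0.30, 0.02)
)

# Stage 1: nominal p-value filter on the symptom-level tests.
# Stage 2: q-value filter on the phenotype-level EWCE results.
kept <- assoc[symptom_p <= 0.05 & pheno_q <= 0.05]

# For comparison: symptom-level associations surviving a BH-adjusted cutoff.
assoc[, symptom_q := stats::p.adjust(symptom_p, method = "BH")]
n_surviving_q <- assoc[symptom_q <= 0.05, .N]
```

Even on four tests, the adjusted cutoff keeps fewer symptom-level associations than the nominal one; with tens of millions of tests the gap would be much larger.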

What have we gained?

This new dataset now has all of the disease-phenotype links as mediated by specific celltypes and genes!
Basically, we can now go down the chain of causality from diseases -> phenotypes -> celltypes -> genes with a decent level of confidence.

For example, it lets us find things like the association between AD (the disease, not the HPO term) + microglia. This was previously not showing up bc the HPO term for Alzheimer disease did not have enough genes by itself to get enrichment in microglia (or any celltype). But we can see here that microglia are indeed implicated in AD via Neurofibrillary tangles:
[Image: Screenshot 2023-03-22 at 16 58 21]
