Use of background argument #359

Domoun · 2022-01-19T02:34:40Z


Hi,
First of all, thank you so much for this great tool!

I would like to get your advice about the use of the raw matrix as background.

I am working on scRNAseq data from mouse tissue. My major issue is high ambient RNA contamination (for some samples, the fraction of reads assigned to cells is as low as 30% in the 10X summary), especially by acinar transcripts (our acinar cells have a huge RNA content and do not tolerate the dissociation process well).

I tried DecontX on “clean” data (low quality cells were filtered out based on nFeature) and then chose to remove all cells with a score > 0.7 for downstream analysis. I found that this strategy improved clustering by removing cells I previously identified as “junk”.
However, “playing” with the z argument (using my Seurat clusters) can sometimes dramatically change the set of cells identified as “highly contaminated” (which makes sense, though), and I had some concerns about possible changes to gene expression that I might miss.
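
A minimal sketch of the score-based filtering described above. The per-cell scores are made-up toy values standing in for the contamination estimates that decontX reports (assumed here to be read from `colData(sce)$decontX_contamination`); only the 0.7 cutoff comes from the post.

```r
# Toy per-cell contamination scores (hypothetical values, not real output).
contamination <- c(cell1 = 0.05, cell2 = 0.82, cell3 = 0.40, cell4 = 0.71)

# Cutoff quoted in the post: drop cells scoring above 0.7.
threshold <- 0.7
keep <- contamination <= threshold

# Barcodes retained for downstream analysis.
kept_cells <- names(contamination)[keep]

# With a real SingleCellExperiment, the same logical vector would
# subset the columns, e.g. sce[, keep].
```

Because the scores shift when z changes, it may be worth recording which cells cross the cutoff under each clustering rather than relying on a single run.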

So, I wanted to use the raw matrix (containing empty droplets) as background to estimate and reduce the contamination in a more objective way. The problem is that the contamination score obtained with the background is very low and seems underestimated in genuinely bad samples (compared with the scores obtained in high quality samples). This prevents me from finding a single suitable threshold to filter out cells based on the DecontX score. It also seems a bit less efficient at removing contamination.
Here is an example +/- background, showing the lower score and lower correction of acinar markers in all clusters when using the background argument:

But, in the meantime, I noticed that some clusters were highly corrected without background (but not with background, see cluster 2 above). Although these clusters had mixed markers, I am worried about over-correction in the case without background.
This happened also with high quality samples (cells were FACS-sorted and the fraction of reads assigned to cells is > 97%), as highlighted with black rectangles on the example below:

Based on your experience, does the use of background argument tend to under-correct the contamination?
Do you think I could use an arbitrary threshold for the decontX score even if it is that low?
If I do not use the background argument, how can I assess a potential over-correction of gene expression in some clusters?

I sincerely thank you for your help!

joshua-d-campbell · 2022-01-19T13:49:50Z

joshua-d-campbell
Jan 19, 2022
Maintainer

Hi @Domoun, thanks for some great questions! Just for some background info, the clustering (and z label) is needed in both cases, with and without the background/raw matrix. This clustering is used to get a sense of what a “true” cell should look like for each cell population. The major difference is how the contamination distribution is calculated for each cell population. When not using a background matrix, a separate contamination distribution is calculated for each cell population as a weighted mixture of all other cell populations present in the data. This works well in many scenarios, but it does assume that the proportion of cell types captured in your data largely reflects the proportion of cell types that were present in the cell suspension. If one particular cell type is underrepresented in the data (maybe due to biases in the dissociation process or because cell sorting was performed and removed some cell types), then markers for that cell type may be present in the cell data but will not be subtracted out in this case. When the background matrix is supplied, a single contamination distribution is calculated for all cell types using the empty droplets. This has the advantage of capturing UMIs/reads from contamination of cell types that were not adequately represented in the cell data (i.e. the non-empty droplets).
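
To make the difference concrete, here is a deliberately simplified sketch of the two ways of building a contamination distribution described above. It is not celda's implementation (no priors, pseudocounts, or iterative inference), and the gene names, counts, and both function names are hypothetical toy choices.

```r
# Toy gene-by-population UMI totals for the captured cells. The acinar
# marker Cpa1 is low because acinar cells are missing from the cell data.
counts_by_pop <- matrix(
  c(900,  50, 50,    # Tcell column
     40, 800, 60),   # Bcell column
  nrow = 3,
  dimnames = list(c("Cd3e", "Cd79a", "Cpa1"), c("Tcell", "Bcell"))
)

# Without background: one distribution per population, built from the
# counts of all OTHER populations (a count-weighted mixture).
eta_without_background <- function(counts_by_pop) {
  other <- rowSums(counts_by_pop) - counts_by_pop
  sweep(other, 2, colSums(other), "/")
}

# With background: a single distribution shared by every population,
# built from the pooled empty-droplet counts.
eta_with_background <- function(empty_counts) {
  gene_totals <- rowSums(empty_counts)
  gene_totals / sum(gene_totals)
}

# Toy empty droplets dominated by acinar transcripts.
empty_counts <- matrix(
  c(1, 0, 9,
    0, 1, 8,
    1, 1, 10),
  nrow = 3,
  dimnames = list(rownames(counts_by_pop), paste0("empty", 1:3))
)

eta_no_bg <- eta_without_background(counts_by_pop)
eta_bg <- eta_with_background(empty_counts)
```

In this toy case the acinar marker carries far more weight in `eta_bg` than in any column of `eta_no_bg`, which is the underrepresented-cell-type situation the paragraph describes.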

Regarding less correction happening when a background matrix is supplied, one technical question: what was the class of the matrix that was supplied to decontX? You can find this by typing ‘class(mat)’, or ‘class(counts(mat))’ if using a SingleCellExperiment object (where ‘mat’ is the name of your object). We had a technical problem where matrices of class “dgTMatrix” were not being converted to “dgCMatrix” properly and the contamination was underestimated. Also, can you tell me how you are reading the matrices (both raw and filtered) into R, what version of celda you are using, and what technology generated your data (e.g. 10X)?
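
A small sketch of the class check and a manual conversion, using only the Matrix package on a toy matrix. Converting before calling decontX is a possible workaround, not something the thread prescribes, and `mat_t` and `mat_fixed` are placeholder names.

```r
library(Matrix)

# Build a toy sparse count matrix, then store it in triplet form so it
# has class "dgTMatrix", the class mentioned in the reply above.
mat_c <- sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 2), x = c(5, 1, 2),
                      dims = c(3, 2))
mat_t <- as(mat_c, "TsparseMatrix")
class(mat_t)

# Convert to column-compressed storage ("dgCMatrix") if needed.
mat_fixed <- if (is(mat_t, "dgTMatrix")) as(mat_t, "CsparseMatrix") else mat_t
class(mat_fixed)
```

For a SingleCellExperiment, the same check would be applied to the assay itself, as in the `class(counts(mat))` call mentioned above.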

For the last question about a single cluster being more heavily affected when not using background matrix, I have seen this before when that cluster was small and likely to be a doublet cluster. Do you think that is the case for your cluster that shows dramatic differences? Does that cluster express high levels of multiple cell type markers?


Domoun · 2022-01-19T19:59:46Z

Domoun
Jan 19, 2022
Author

Hi,

Thanks for the very prompt reply!

I initially did not get that, because the documentation says: “If you supply a raw matrix via the background parameter, then the z parameter will not have an effect as clustering will not be performed.” But I just ran DecontX with background and saw “Generating UMAP and estimating cell types”, so it seems that clustering is indeed performed. Is this an update? Is it now possible to use the background parameter along with the z argument (providing my Seurat clusters)?
Sorry, I should have provided this information in the first place:
Cells were processed with Chromium Single Cell 3' solution v3.
I am pretty new to scRNAseq analysis and got used to Seurat so, I first create my Seurat object, add metadata of interest, subset and then create a SCE object:

library(Seurat)  # Read10X, CreateSeuratObject, as.SingleCellExperiment
library(celda)   # decontX

#prep filtered matrix with all low quality cells removed: clean
clean <- Read10X(data.dir="path/outs/filtered_feature_bc_matrix")
clean <- CreateSeuratObject(counts = clean, min.cells = 3, min.features=200)
clean <- PercentageFeatureSet(clean, pattern = "^mt-", col.name = "percent.mt")
clean <- subset(clean, subset=nFeature_RNA>500 & percent.mt<10)
clean <- as.SingleCellExperiment(clean)
genes.list <- rownames(clean)

#prep sce object with raw matrix
raw <- Read10X(data.dir="path/outs/raw_feature_bc_matrix")
raw <- CreateSeuratObject(counts = raw, min.cells = 3, min.features=0)
raw <- subset(raw, features= genes.list)
raw <- as.SingleCellExperiment(raw)

#run decontX with background argument
clean.bg <- decontX(clean, assayName="counts", background=raw, bgAssayName="counts", verbose=TRUE)

As requested, I checked, and the count matrices in both SCE objects are “dgCMatrix”.

I use celda v1.8.1. Here is my entire session info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] kableExtra_1.3.4 dplyr_1.0.7 patchwork_1.1.1 scater_1.20.1
[5] scuttle_1.2.1 ggplot2_3.3.5 SeuratObject_4.0.4 Seurat_4.0.6
[9] singleCellTK_2.2.0 DelayedArray_0.18.0 Matrix_1.3-4 SingleCellExperiment_1.14.1
[13] SummarizedExperiment_1.22.0 Biobase_2.52.0 GenomicRanges_1.44.0 GenomeInfoDb_1.28.4
[17] IRanges_2.26.0 S4Vectors_0.30.2 BiocGenerics_0.38.0 MatrixGenerics_1.4.3
[21] matrixStats_0.61.0 celda_1.8.1

loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.22 R.utils_2.11.0 tidyselect_1.1.1
[5] htmlwidgets_1.5.4 grid_4.1.1 combinat_0.0-8 BiocParallel_1.26.2
[9] Rtsne_0.15 DropletUtils_1.12.3 munsell_0.5.0 ScaledMatrix_1.0.0
[13] codetools_0.2-18 ica_1.0-2 statmod_1.4.36 scran_1.20.1
[17] future_1.23.0 miniUI_0.1.1.1 withr_2.4.3 colorspace_2.0-2
[21] knitr_1.37 rstudioapi_0.13 ROCR_1.0-11 assertive.base_0.0-9
[25] tensor_1.5 listenv_0.8.0 labeling_0.4.2 GenomeInfoDbData_1.2.6
[29] GSVAdata_1.28.0 polyclip_1.10-0 farver_2.1.0 rhdf5_2.36.0
[33] parallelly_1.30.0 vctrs_0.3.8 generics_0.1.1 xfun_0.29
[37] fishpond_1.8.0 R6_2.5.1 doParallel_1.0.16 ggbeeswarm_0.6.0
[41] rsvd_1.0.5 RcppEigen_0.3.3.9.1 locfit_1.5-9.4 bitops_1.0-7
[45] rhdf5filters_1.4.0 spatstat.utils_2.3-0 gridGraphics_0.5-1 assertthat_0.2.1
[49] promises_1.2.0.1 scales_1.1.1 beeswarm_0.4.0 gtable_0.3.0
[53] beachmat_2.8.1 globals_0.14.0 goftest_1.2-3 rlang_0.4.12
[57] systemfonts_1.0.3 splines_4.1.1 lazyeval_0.2.2 spatstat.geom_2.3-1
[61] reshape2_1.4.4 abind_1.4-5 httpuv_1.6.5 tools_4.1.1
[65] ellipsis_0.3.2 spatstat.core_2.3-2 RColorBrewer_1.1-2 ggridges_0.5.3
[69] Rcpp_1.0.7 plyr_1.8.6 sparseMatrixStats_1.4.2 zlibbioc_1.38.0
[73] purrr_0.3.4 RCurl_1.98-1.5 dbscan_1.1-8 rpart_4.1-15
[77] deldir_1.0-6 pbapply_1.5-0 viridis_0.6.2 cowplot_1.1.1
[81] zoo_1.8-9 ggrepel_0.9.1 cluster_2.1.2 magrittr_2.0.1
[85] RSpectra_0.16-0 data.table_1.14.2 magick_2.7.3 scattermore_0.7
[89] lmtest_0.9-39 RANN_2.6.1 fitdistrplus_1.1-6 evaluate_0.14
[93] mime_0.12 xtable_1.8-4 gridExtra_2.3 compiler_4.1.1
[97] tibble_3.1.6 KernSmooth_2.23-20 crayon_1.4.2 R.oo_1.24.0
[101] htmltools_0.5.2 mgcv_1.8-36 later_1.3.0 tidyr_1.1.4
[105] MCMCprecision_0.4.0 DBI_1.1.2 assertive.files_0.0-2 MASS_7.3-54
[109] assertive.numbers_0.0-2 assertive.types_0.0-3 R.methodsS3_1.8.1 metapod_1.0.0
[113] igraph_1.2.11 pkgconfig_2.0.3 plotly_4.10.0 spatstat.sparse_2.1-0
[117] xml2_1.3.3 foreach_1.5.1 svglite_2.0.0 vipor_0.4.5
[121] dqrng_0.3.0 webshot_0.5.2 XVector_0.32.0 rvest_1.0.2
[125] stringr_1.4.0 digest_0.6.29 sctransform_0.3.2 RcppAnnoy_0.0.19
[129] spatstat.data_2.1-2 rmarkdown_2.11 leiden_0.3.9 enrichR_3.0
[133] uwot_0.1.11 edgeR_3.34.1 DelayedMatrixStats_1.14.3 shiny_1.7.1
[137] gtools_3.9.2 rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-152
[141] jsonlite_1.7.2 Rhdf5lib_1.14.2 BiocNeighbors_1.10.0 viridisLite_0.4.0
[145] limma_3.48.3 fansi_0.5.0 pillar_1.6.4 lattice_0.20-44
[149] fastmap_1.1.0 httr_1.4.2 survival_3.2-11 glue_1.6.0
[153] png_0.1-7 iterators_1.0.13 multipanelfigure_2.1.2 bluster_1.2.1
[157] assertive.properties_0.0-4 stringi_1.7.6 HDF5Array_1.20.0 BiocSingular_1.8.1
[161] irlba_2.3.5 future.apply_1.8.1

Actually, I have an extra question. Does subsetting the filtered matrix provided by CellRanger affect decontX performance with background? (In this case, I guess all barcodes not present in the “clean” matrix are considered empty, although some were not identified as such by EmptyDrops during processing by CellRanger.)
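
The concern can be illustrated with plain barcode sets. This sketch assumes, as the question itself guesses, that every raw barcode absent from the cell matrix ends up treated as background; the barcodes and the QC outcome are invented for illustration.

```r
# Barcodes CellRanger called as cells (toy values).
filtered_barcodes <- c("AAAC-1", "AAAG-1", "AACT-1")

# Hypothetical outcome of the extra nFeature / percent.mt filter.
qc_pass <- c(TRUE, FALSE, TRUE)
clean_barcodes <- filtered_barcodes[qc_pass]

# The raw matrix holds the called cells plus the empty droplets.
raw_barcodes <- c(filtered_barcodes, "TTTG-1", "TTTC-1")

# Under the stated assumption, everything not in the clean matrix
# would count as background.
treated_as_background <- setdiff(raw_barcodes, clean_barcodes)

# Called cells that were dropped by QC and so land in the background.
cells_in_background <- intersect(treated_as_background, filtered_barcodes)
```

Any barcode in `cells_in_background` is a real, if low-quality, cell whose transcripts would be pooled with the empty droplets.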

As for the last question, the cells labeled by DecontX as heavily contaminated are overall consistent with what I manually identified as contaminated droplets or obvious doublets. On the first dataset, cluster 2 indeed contained some doublets and contaminated droplets, but was also a real cell type. Expression of mixed markers was not especially higher than in other clusters, but I do not expect it to express the fibroblasts markers I provided (which in this case are under-corrected with background - if my hypotheses are correct).
My concerns mostly came from the results for the FACS-sorted cells (immune populations only; last figure in my previous message). Intellectually, I would expect the approach using the background to be more reliable. However, on the second dataset, genes expressed by macrophages, T cells and NK cells were corrected in the B cell cluster 6 only in the case without background. I am actually surprised that, for the same clustering (default mode, no z was provided), the method with background found less contamination… In the meantime, other packages did not label this cluster 6 as containing many doublets. But in immune populations it is sometimes hard to know for sure about cluster identity/gene expression specificity… so, I am not sure whether I am over-correcting my data in the case without background or whether using the raw matrix fails to identify doublets and low quality cells.

Sorry for the long message, and thank you again for your help!


joshua-d-campbell · 2022-01-22T01:48:51Z

joshua-d-campbell
Jan 22, 2022
Maintainer

Yes, that is a good catch. The documentation needs to be updated. Sorry for the confusion! The z labels are still used even when the raw matrix is supplied. So you can supply your own clustering labels if you want.
Thanks for providing your code, it helps a lot! You should definitely provide the full filtered matrix to decontX, so you can just remove the "subset" command before converting to an SCE object. The filtering will definitely produce some differences between the results with and without the background matrix. I would be curious to know what the correlation is between the contamination estimates with and without the background matrix in your data after you rerun it without filtering (in both datasets).
I would expect the same as you: if flow sorting was applied and some cell populations were selected for or against, then the background matrix would be the more appropriate choice. We are still learning about how best to use the background matrix in different scenarios. If you continue to get smaller contamination estimates with the background matrix and are able to share the dataset in some capacity, that may help us troubleshoot the issues and improve the approach. @yuan-yin-truly
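
One way the requested comparison could be computed, sketched on toy numbers. The two vectors stand in for per-cell contamination estimates from the two runs (assumed to come from `colData(...)$decontX_contamination`); the values and barcode names are invented.

```r
# Hypothetical per-cell estimates from the run WITH the background matrix.
with_bg <- c(c1 = 0.02, c2 = 0.10, c3 = 0.05, c4 = 0.30)

# Hypothetical estimates from the run WITHOUT it, in a different cell order.
without_bg <- c(c4 = 0.65, c1 = 0.08, c2 = 0.25, c3 = 0.12)

# Align by barcode so each pair of values refers to the same cell.
common <- intersect(names(with_bg), names(without_bg))

# Pearson compares magnitudes; Spearman compares only the ranking.
pearson <- cor(with_bg[common], without_bg[common])
spearman <- cor(with_bg[common], without_bg[common], method = "spearman")
```

A high rank correlation alongside systematically lower values with background would suggest that the two runs agree on which cells are worst and differ mainly in scale.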

