Predicts the cell type of a cluster from expression values.
Download the analyseMarkers.py script and run it on your output files.
If you have a list with marker genes for each cluster (list(clusterID0=..., clusterID1=...)
) then the best workflow to get expression values and a compatible data frame is: from within R/Seurat run the following script to generate a dataframe containing both marker gene values and expression data (code not yet tested ...):
makesummary = function(a, suffix)
{
out = {}
out["num"] = length(a)
if (length(a) == 0)
{
f = c(0,0,0,0,0)
meanA = 0
} else {
f = fivenum(a)
meanA = mean(a)
}
out["min"] = f[1]
out["lower_hinge"] = f[2]
out["median"] = f[3]
out["upper_hinge"] = f[4]
out["max"] = f[5]
out["mean"] = meanA
names(out) = paste(names(out), suffix, sep=".")
return(out)
}
getExprData = function(markerObj, markerCells, sampleSuffix, slot="data")
{
expTable = GetAssayData(object = subset(x=markerObj, cells=markerCells), slot = slot)
allgenes = rownames(expTable)
cellnames = colnames(expTable)
expt.r = as(expTable, "dgTMatrix")
expt.df = data.frame(r = expt.r@i + 1, c = expt.r@j + 1, x = expt.r@x)
DT <- data.table(expt.df)
res = DT[, as.list(makesummary(x, sampleSuffix)), by = r]
res[[paste("anum", sampleSuffix, sep=".")]] = length(cellnames)
res$gene = allgenes[res$r]
res = res[,r:=NULL]
return(res)
}
getDEXpressionDF = function ( scdata, markers, assay="SCT" )
{
outDF = NULL
DefaultAssay(object=scdata) = assay
clusterIDs = as.character(sort(unique(Idents(scdata))))
scCells = Idents(scdata)
scCells = names(scCells)
scCells = unlist(as.character(scCells))
for (clusterID in clusterIDs){
print(clusterID)
cellIdents = Idents(scdata)
cellIdents.c = names(cellIdents[cellIdents == clusterID])
cellIdents.c = unlist(lapply(cellIdents.c, as.character))
expvals = getExprData(scdata, cellIdents.c)
modmarkers = markers[[clusterID]]
modmarkers$gene = rownames(modmarkers)
markerdf = as.data.frame(modmarkers)
if ((nrow(markerdf) > 0) && (nrow(expvals) > 0))
{
expvals = merge(markerdf, expvals, all.x=T, by.x="gene", by.y = "gene")
}
expvals = as.data.frame(cbind(clusterID, expvals))
if (!is.data.frame(outDF) || nrow(outDF)==0)
{
outDF = expvals
} else {
outDF = as.data.frame(rbind(outDF, expvals))
}
}
return(outDF)
}
makeDEResults = function(inobj, assay="SCT", test="wilcox")
{
clusterIDs = as.character(sort(unique(Idents(inobj))))
retList = list()
for (clusterID in clusterIDs)
{
cellIdents = Idents(inobj)
cellIdents.c = names(cellIdents[cellIdents == clusterID])
cellIdents.c = unlist(lapply(cellIdents.c, as.character))
print(paste("Processing cluster", clusterID, "with a total of", length(cellIdents.c), "cells"))
deMarkers = FindMarkers(inobj, assay=assay, ident.1 = cellIdents.c, test.use=test)
retList[[clusterID]] = deMarkers
}
return(retList)
}
deRes = makeDEResults(seurat_obj, assay="RNA", test="MAST")
exprdf = getDEXpressionDF(seurat_obj, deRes, assay="RNA")
write.table(exprdf, "example/marker_genes_single_human_all.tsv", sep="\t", row.names=F, quote = F)
Then you can simply call
python3 analyseMarkers.py --markers example/marker_genes_single_human_all.tsv --predictions 1
or for newer versions: python3 analyseMarkers.py --logfc avg_log2FC --expr-mean mean.cluster --expressing-cell-count num.cluster --cluster-cell-count anum.cluster --markers example/marker_genes_single_human_all.tsv --predictions 1
Did not find panglao file. Downloading it now
Downloading from: https://panglaodb.se/markers/PanglaoDB_markers_27_Mar_2020.tsv.gz
in compressed format
Starting analysis
0 Fibroblasts;Connective tissue 3.7979322878001667 62 179
1 Smooth muscle cells;Smooth muscle 3.209013271632466 38 82
2 Pancreatic stellate cells;Pancreas 2.2069340605035954 20 29
3 Macrophages;Immune system 5.050242002842474 63 153
4 Smooth muscle cells;Smooth muscle 0.9633089909462268 23 82
5 Endothelial cells;Vasculature 7.505313125200414 84 195
6 Fibroblasts;Connective tissue 2.8734895530493767 55 179
7 T cells;Immune system 1.4604815939740858 35 107
8 Fibroblasts;Connective tissue 0.7440482333103114 34 179
9 Fibroblasts;Connective tissue 3.883677668408906 63 179
10 Pericytes;Vasculature 0.803543270719957 22 64
11 Smooth muscle cells;Smooth muscle 1.985453728430751 37 82
12 B cells naive;Immune system 0.9207522036965257 19 69
13 Plasma cells;Immune system 13.783575029154143 39 86
14 Dendritic cells;Immune system 3.998198909943591 49 133
15 Kupffer cells;Liver 1.9725414603905738 16 49
16 Pancreatic stellate cells;Pancreas 3.883768523344435 18 29
17 Endothelial cells;Vasculature 5.854839914806603 68 195
18 Endothelial cells;Vasculature 6.300630147493595 78 195
19 Schwann cells;Brain 2.201608826888472 21 48
20 Fibroblasts;Connective tissue 0.5834923037164054 37 179
21 NK cells;Immune system 3.665464137900602 40 98
23 Mast cells;Immune system 1.9691749668080014 21 162
and you will receive 1 prediction per cluster. In a real life scenario you might want to check more than just one.
The output format is as follow: cluster -> cell_type -> score -> accepted_marker_genes -> marker_genes_of_celltype
With the --seurat
flag, this tool will generate string that can easily be pasted into your R session:
new.cluster.ids <- c("Fibroblasts;Connective tissue", "Smooth muscle cells;Smooth muscle", "Fibroblasts;Connective tissue", "Macrophages;Immune system", "Smooth muscle cells;Smooth muscle", "Endothelial cells;Vasculature", "Fibroblasts;Connective tissue", "T memory cells;Immune system", "Fibroblasts;Connective tissue", "Fibroblasts;Connective tissue", "Pericytes;Vasculature", "Smooth muscle cells;Smooth muscle", "B cells;Immune system", "Plasma cells;Immune system", "Macrophages;Immune system", "Macrophages;Immune system", "Fibroblasts;Connective tissue", "Endothelial cells;Vasculature", "Endothelial cells;Vasculature", "Schwann cells;Brain", "Fibroblasts;Connective tissue", "Gamma delta T cells;Immune system", "Mesothelial cells;Epithelium", "Mast cells;Immune system")
orignames = Idents(seurat_obj)
names(new.cluster.ids) <- levels(orignames)
levels(orignames) = new.cluster.ids
seurat_obj$cellnames = orignames
You can visualize the assigned cell types in a UMAP plot with
DimPlot(obj.integrated, group.by="cellnames", reduction = "umap", label=T)
First you have to prepare a data frame containing the marker genes (ttest_df
) as well as a data frame containing expression statistics for all genes in the groups the marker genes were calculated for (fsgrouped
). Particularly generating the last data frame may take a while, depending on the data set.
sc.tl.rank_genes_groups(adata, 'leiden_0.6', method='t-test', key_added = "t-test")
ttest_df = sc.get.rank_genes_groups_df(adata, group=None, key='t-test')
def percentile(n):
def percentile_(x):
return np.percentile(x, n)
percentile_.__name__ = 'percentile_%s' % n
return percentile_
def anum():
def anum_(x):
return (x > 0).sum()
anum_.__name__ = 'anum'
return anum_
def num():
def num_(x):
return x.size
num_.__name__ = 'num'
return num_
genedf = sc.get.obs_df(
adata,
keys=["leiden_0.6", *[x for x in sc.get.var_df(adata).index]]
)
grouped = genedf.groupby("leiden_0.6")
fsgrouped = grouped.agg(["min", percentile(25), np.median, percentile(75), "max", "mean", anum(), num()])
fsgrouped.to_csv("expression_mean_df.tsv", sep="\t")
These two data frame then need to be merged.
def merge_expressions(fsgrouped, markerDF, idCol):
fsgrouped = fsgrouped.copy()
markerDF = markerDF.copy()
fsmelt = pd.melt(fsgrouped.reset_index(), id_vars=[idCol])
fsmelt2 = fsmelt.pivot_table(index=[idCol, 'variable_0'], columns='variable_1')
fsmelt3 = fsmelt2.reset_index()
lev0Names = fsmelt3.columns.get_level_values(0)
lev1Names = fsmelt3.columns.get_level_values(1)
newnames = [lev1Names[i] if lev1Names[i] != '' else lev0Names[i] for i in range(0, len(lev0Names))]
for i in range(0, len(newnames)):
if newnames[i] == "variable_0":
newnames[i] = "gene"
elif newnames[i].startswith(("leiden", "louvain")):
newnames[i] = "group"
fsmelt4 = fsmelt3.droplevel(0, axis=1)
fsmelt4.columns = newnames
markerDF["gene"] = markerDF["names"]
del markerDF["names"]
return pd.merge(markerDF, fsmelt4, on=["group", "gene"])
outdf = merge_expressions(fsgrouped, ttest_df)
outdf.to_csv("expr_ttest.tsv", sep="\t", index = False)
The output data frame (expr_ttest.tsv
) then serves as input for cPred:
python3 analyseMarkers.py --markers expr_ttest.tsv --gene gene --cluster group --logfc logfoldchanges --pvaladj pvals_adj --expr-mean mean --scanpy
With the --scanpy
flag, this tool will generate string that can easily be pasted into your R session:
group2cellname = {"0": "Monocytes;Immune system","1": "Dendritic cells;Immune system","2": "NK cells;Immune system","3": "Macrophages;Immune system","4": "Dendritic cells;Immune system","5": "Macrophages;Immune system","6": "Macrophages;Immune system","7": "Mast cells;Immune system","8": "NK cells;Immune system","9": "Mast cells;Immune system","10": "Mast cells;Immune system","11": "Dendritic cells;Immune system","12": "Monocytes;Immune system","13": "Plasma cells;Immune system","14": "NK cells;Immune system","15": "Plasma cells;Immune system","16": "Airway goblet cells;Lungs","17": "Plasma cells;Immune system","18": "B cells;Immune system","19": "Gamma delta T cells;Immune system","20": "B cells;Immune system","21": "Macrophages;Immune system","22": "Gamma delta T cells;Immune system","23": "Plasma cells;Immune system","24": "Macrophages;Immune system","25": "T cells;Immune system","26": "Plasma cells;Immune system" }
adata.obs['new_clusters'] = (
adata.obs['leiden_0.6']
.map(group2cellname)
.astype('category')
)
You can visualize the assigned cell types in a UMAP plot with
sc.pl.umap(adata, save="_cpred.png", color=["new_clusters"])
With the --aorta3
flag, this tool will generate a hdf5 object for input to Aorta3D.
python3 analyseMarkers.py --markers example/marker_genes_single_human_all.tsv --predictions 1 --aorta3d aorta3d/wirka
The cPred cell type prediction method makes use of the marker genes provided by PanglaoDB [1]. Together with the reported sensitivity and specificity reported by them as well, the provided marker genes per cell-type and tissue are important. The script will download this marker table automatically.
In their original publication [1] Franzén et al. propose a weight sum for determining the cell type in scRNAseq experiments, however do not provide an implementation that could be executed on the data directly.
Here a slightly modified version of the score is implemented.
The prediction of the cell type for a specfic cluster is achieved using a weighted sum.
For each cell type
is calculated.
Here meanSens refers to the sensitivity with which the gene is expressed in the cell type, likewise meanSpec refers to the associated specificity.
Contributing to this score is also the prevalence of the gene in the cluster, prevInCluster - a measure that is deducted from the cluster annotation, namely the number of cells which express the gene and the total cells in the cluster.
The average expression avgExpr relates to the mean expression of the specific gene in the cluster.
The importance of the gene impRC in all reference clusters is defined as
.
This gene score is summed up for each accepted gene of a cluster such that the totalScore
for all significant marker genes MG_k of cluster k.
The final cluster score is then
where accUniquerefers to the in cluster k accepted unique genes for cell type j, and allUnique to all unique genes of that specific cell type.
Likewise accGenes is the number of accepted genes in cluster k for cell type j, and allGenes the number of all marker genes for cell type
[1] O. Franzén, L.-M. Gan, and J. L. M. Björkegren, “PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data,” Database, vol. 2019, Jan. 2019.