Nextflow Pipeline for Single-Cell Data Processing and Classification

This Nextflow pipeline automatically annotates cell types in single-cell datasets loaded into the Gemma database. Cell types are assigned by a random forest classifier trained on scvi embeddings from the CellxGene data corpus [1][2][3].

Features

Downloads SCVI models based on organism and census version.
Processes query datasets using SCVI models.
Pulls reference datasets from CellxGene census data given an organism and collection name.
Performs cell type classification of query datasets using a random forest model.
Saves runtime parameters and outputs in a specified directory.

Requirements

Nextflow (=24.10.0)
Conda (for environment management)
My own conda environments are currently hard-coded into the pipeline (I will set up singularity environments in the future)
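
Because the pipeline expects one specific Nextflow release, it can help to pin the version for the current shell session. The sketch below uses Nextflow's NXF_VER environment variable, which the launcher reads to select a release. Whether the pipeline actually fails on other releases is not stated here, so treat the pin as a precaution rather than a requirement of the code.

```shell
# Pin the Nextflow release that the launcher will use in this shell session.
export NXF_VER=24.10.0

# Confirm the active version if Nextflow is installed; otherwise just report the pin.
if command -v nextflow >/dev/null 2>&1; then
  nextflow -version
else
  echo "nextflow not found on PATH; NXF_VER is set to ${NXF_VER}"
fi
```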

Installation

Clone this repository:

git clone https://github.com/rachadele/cell_annotation_cortex.git

Usage

The pipeline can be run with the following options:

nextflow run main.nf -profile conda \
  --organism <organism_name> \
  --census_version <version> \
  --outdir <output_directory> \
  --studies_dir <path_to_studies> \
  --subsample_ref <subsample_per_cell_type> \
  --ref_collections <list_of_collections> \
  --seed <random_seed> \
  --cutoff <classification_probability_cutoff>

Default parameters are as follows:

nextflow run main.nf -profile conda \
  --organism mus_musculus \
  --census_version 2024-07-01 \
  --outdir <organism>_subsample_ref_<subsample_ref> \
  --studies_dir /space/scratch/gemma-single-cell-data-ensembl-id/ \
  --subsample_ref 50 \
  --ref_collections ["A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation"] \
  --seed 42 \
  --cutoff 0

To run with defaults, simply run:

nextflow run main.nf -profile conda

Nextflow's own options begin with a single dash (e.g. -profile); pipeline-specific parameters are set on the CLI with a double dash (--).

To resume from the last completed step after an error, run:

nextflow run main.nf -profile conda -resume

Input

Input single-cell data should be dumped from Gemma in MEX format with ENSEMBL ids like so:

gemma-cli-sc getSingleCellDataMatrix -e <experiment_id> \
          --format mex \
          --scale-type count \
          --use-ensembl-ids \
          -o /space/scratch/gemma-single-cell-data-ensembl-id/<experiment_id>

Repeat this for as many single-cell datasets as you like, and collect the outputs in one parent directory (e.g. /space/scratch/gemma-single-cell-data-ensembl-id/). Be sure to check which organism each dataset comes from. I am working on incorporating this step into the pipeline.
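
Dumping several experiments into one parent directory can be scripted with a small loop. This is a minimal sketch, not part of the pipeline: the `run` and `dump_experiment` helpers and the `DRY_RUN` and `STUDIES_DIR` variables are made up for illustration, and the two experiment IDs are simply the ones that appear in the example output tree. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Dump several Gemma experiments into one parent directory (sketch).
set -eu

# Parent directory that will later be passed to the pipeline as --studies_dir.
STUDIES_DIR="${STUDIES_DIR:-/space/scratch/gemma-single-cell-data-ensembl-id}"

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dump one experiment in MEX format with ENSEMBL ids, as in the command above.
dump_experiment() {
  run gemma-cli-sc getSingleCellDataMatrix -e "$1" \
      --format mex \
      --scale-type count \
      --use-ensembl-ids \
      -o "${STUDIES_DIR}/$1"
}

# Placeholder experiment IDs; replace with the experiments you want to annotate.
for id in GSE152715 GSE198014; do
  dump_experiment "$id"
done
```

Set DRY_RUN=0 to execute the dumps once the printed commands look right.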

As of right now, experimental factors such as tissue or batch are not incorporated into the label transfer. The sample accession (i.e. each set of .mex files) is taken as the batch_key for the scvi forward pass.

Parameters

Parameter	Description
`organism`	The species being analyzed (one of `homo_sapiens`, `mus_musculus`).
`census_version`	The version of the single-cell census to use (do not change from default)
`outdir`	Directory where output files will be saved.
`studies_dir`	Path to the directory containing the input single-cell query datasets.
`subsample_ref`	Number of cells per cell type to subsample in reference.
`ref_collections`	A space-separated list of quoted reference collection names to use for annotation.
`seed`	Random seed for reproducibility of subsampling and processing.
`cutoff`	Minimum confidence score for assigning a cell type during classification (default = 0).

See Usage for default parameters.

Please note that to change the organism to homo_sapiens, you should also change --ref_collections to:

"Transcriptomic cytoarchitecture reveals principles of human neocortex organization" \
"SEA-AD: Seattle Alzheimer’s Disease Brain Cell Atlas"

(or one of the two). You can also change these parameters directly in nextflow.config, e.g.:

params.organism = "homo_sapiens"
params.ref_collections = ["Transcriptomic cytoarchitecture reveals principles of human neocortex organization", "SEA-AD: Seattle Alzheimer’s Disease Brain Cell Atlas"]
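
If you would rather not edit nextflow.config, the same settings can be kept in a separate parameter file and passed with Nextflow's -params-file option, which accepts YAML or JSON. The sketch below assumes the pipeline reads ref_collections as a list, as in the config snippet above; the file name params_human.yaml and the RUN_PIPELINE switch are arbitrary choices made for this example.

```shell
# Write the human settings to a parameter file (the file name is arbitrary).
PARAMS_FILE="${PARAMS_FILE:-params_human.yaml}"

cat > "$PARAMS_FILE" <<'EOF'
organism: "homo_sapiens"
ref_collections:
  - "Transcriptomic cytoarchitecture reveals principles of human neocortex organization"
  - "SEA-AD: Seattle Alzheimer’s Disease Brain Cell Atlas"
EOF

# Print the launch command by default; set RUN_PIPELINE=1 to actually start the run.
if [ "${RUN_PIPELINE:-0}" = "1" ]; then
  nextflow run main.nf -profile conda -params-file "$PARAMS_FILE"
else
  echo "nextflow run main.nf -profile conda -params-file $PARAMS_FILE"
fi
```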

Output

For each run, an output directory with the following structure will be written:

.
└── mus_musculus_subsample_ref_50
    ├── GSE152715
    │   └── GSE152715_predicted_celltype.tsv
    ├── GSE198014
    │   └── GSE198014_predicted_celltype.tsv
    ├── params.txt
    └── refs
        └── cortex_and_hippocampus_-_10x_3_v3_and_Smart-seq_V4.h5ad

One params.txt file stores the parameters used for cell type classification across all of the given studies (e.g. GSE198014). Likewise, one reference dataset (stored in refs/) is used for each batch of automatic annotation.
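
A quick way to check a finished run is to count the rows in each per-study prediction file. The sketch below relies only on the directory layout shown above; the `summarize_predictions` helper is hypothetical, and it assumes each TSV starts with a single header row, which is not confirmed here.

```shell
# Print "<study><TAB><row count>" for every prediction file under an output directory.
summarize_predictions() {
  outdir="$1"
  for f in "$outdir"/*/*_predicted_celltype.tsv; do
    # Skip the unexpanded pattern when no prediction files exist.
    [ -e "$f" ] || continue
    # Subtract one line on the assumption that each file has a header row.
    n=$(( $(wc -l < "$f") - 1 ))
    printf '%s\t%s\n' "$(basename "$f" _predicted_celltype.tsv)" "$n"
  done
}

# Default to the output directory name used in the example tree above.
summarize_predictions "${1:-mus_musculus_subsample_ref_50}"
```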

Workflow Description

References

[1] Lim N., et al. “Curation of over 10,000 transcriptomic studies to enable data reuse.” Database, 2021.
[2] CZI Single-Cell Biology Program, Shibla Abdulla, Brian Aevermann, Pedro Assis, Seve Badajoz, Sidney M. Bell, Emanuele Bezzi, et al. “CZ CELL×GENE Discover: A Single-Cell Data Platform for Scalable Exploration, Analysis and Modeling of Aggregated Data,” November 2, 2023. https://doi.org/10.1101/2023.10.30.563174.
[3] Lopez, Romain, Jeffrey Regier, Michael B. Cole, Michael I. Jordan, and Nir Yosef. “Deep Generative Modeling for Single-Cell Transcriptomics.” Nature Methods 15, no. 12 (December 2018): 1053–58. https://doi.org/10.1038/s41592-018-0229-2.
