fixed info custom gene lists #140

Merged
merged 1 commit into from
Nov 27, 2023
111 changes: 65 additions & 46 deletions docs/usage/gene_list_format.md
@@ -8,12 +8,12 @@ ribosomal genes, or excluding genes from HVG selection such as those constitutin

We provide an example of a preformatted gene lists file in [resources/qc_genelist_1.0.csv](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv).

All <sup>[1](#footnote1)</sup> files provided to the pipeline should be in a 3 columns format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), feature and group. The group column is used to distinguish different gene groups.
All custom gene list files provided to the pipeline should be in a 3-column format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), "feature" and "group". The group column is used to distinguish different gene groups.


**mod**: the modality for the feature in use
**feature**: feature name, i.e. gene
**group**: the group the gene belongs to
- **mod**: the modality for the feature in use
- **feature**: feature name, i.e. gene
- **group**: the group the gene belongs to
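
As a minimal sketch of how a file in this format could be read, the snippet below parses a 3-column CSV with Python's standard `csv` module and collects the features per modality and group. The inline example rows and the helper name `load_gene_groups` are assumptions made for illustration; they are not part of panpipes.

```python
import csv
import io
from collections import defaultdict

# Hypothetical example content; real files follow the same 3-column layout.
EXAMPLE_CSV = """mod,feature,group
rna,MT-ND1,mt
rna,MT-CO1,mt
rna,RPL3,rp
prot,CD3,markers
"""

REQUIRED_COLUMNS = ("mod", "feature", "group")


def load_gene_groups(handle):
    """Return {(mod, group): [features]} from a 3-column gene list CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    groups = defaultdict(list)
    for row in reader:
        groups[(row["mod"], row["group"])].append(row["feature"])
    return dict(groups)


gene_groups = load_gene_groups(io.StringIO(EXAMPLE_CSV))
print(gene_groups[("rna", "mt")])
```

Keying on the `(mod, group)` pair keeps groups with the same name in different modalities separate.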

| mod | feature | group |
| --- | ------- | ------- |
@@ -34,9 +34,8 @@ are stored in [resources/cell_cycle_genes.csv](https://github.com/DendrouLab/pan

Unlike the other custom gene files, the cell cycle file should be a **tab separated file with two columns**:

**gene_name**: the name of the gene

**cc_phase**: which phase of the cell cycle is the gene expression indicative of.
- **gene_name**: the name of the gene
- **cc_phase**: the phase of the cell cycle that the gene's expression is indicative of.


| gene_name | cc_phase |
@@ -48,70 +47,90 @@ Differently from the other custom gene file, the cell cycle file should be a **t
| CKAP2L | g2m |
| ... | ... |
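
A minimal sketch of reading this two-column, tab-separated layout with the standard library is shown below. The example rows and the helper name `split_cell_cycle_genes` are hypothetical, and the phase label `s` is an assumption (only `g2m` is visible in the table above).

```python
import csv
import io

# Hypothetical rows in the gene_name / cc_phase layout described above.
EXAMPLE_TSV = "gene_name\tcc_phase\nMCM5\ts\nPCNA\ts\nHMGB2\tg2m\nCKAP2L\tg2m\n"


def split_cell_cycle_genes(handle):
    """Split a gene_name/cc_phase table into S and G2/M gene lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    s_genes, g2m_genes = [], []
    for row in reader:
        phase = row["cc_phase"].strip().lower()
        if phase == "s":  # assumed label for S-phase genes
            s_genes.append(row["gene_name"])
        elif phase == "g2m":
            g2m_genes.append(row["gene_name"])
    return s_genes, g2m_genes


s_genes, g2m_genes = split_cell_cycle_genes(io.StringIO(EXAMPLE_TSV))
print(s_genes, g2m_genes)
```

The two resulting lists correspond to the `s_genes` and `g2m_genes` arguments that `scanpy.tl.score_genes_cell_cycle` expects.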

## Explaining actions
## Using custom gene lists to calculate QC metrics

Panpipes uses "actions" to define which tasks to use which gene list for.
Specify the "group" name of the genes you want to use to apply the action i.e. calc_proportion: mt will calculate
proportion of reads mapping to the genes whose group is "mt"
Panpipes uses "actions" to define which tasks use the provided gene list.
We encode three main actions that use gene lists to describe qualities of the cells and populate the metadata (`.obs`) of the object with the newly calculated cell QC metrics.
Specify the "group" name of the genes to which you want to apply a specific action to calculate the cell QC metric.

If left blank, these actions will not be performed (i.e. no calculation of % of mt genes per cell will be included in the ingestion of the data)

The genes are scored for each modality using
### Supplying custom gene lists to calculate QC metrics

- [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html#scanpy.pp.calculate_qc_metrics).
For example, for the rna modality, including a list of mitochondrial
genes in the group `mt`, will add `pct_counts_mt`, and `total_counts_mt`
to the `mdata["rna"].obs` assay.
The custom gene list file can be supplied by the user in two workflows to perform the three main actions:

- [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html).
1. **Ingest workflow**


(for pipeline_ingest.py)
**calc_proportions:** calculate proportion of reads mapping to X genes over total number of reads, per cell
**score_genes:** using the scanpy.tl.score_genes function, calculates the average expression of a set of genes minus the average expression of a reference set of genes. First introduced in Satija et al. Nature Biotechnology (2015).
pipeline_ingest config file: (pipeline.yml)

(for pipeline_preprocess.py)
**exclude:** exclude these genes from the HVG selection, if they are deemed Highly Variable.
```
custom_genes_file: resources/qc_genelist_1.0.csv
```

For the exclude action, if set to `default` the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`)
2. **Preprocess workflow**

If left blank, these actions will not be performed (i.e. no calculation of % of mt genes per cell will be included in the ingestion of the data)

### Cell cycle action
pipeline_preprocess config file: (pipeline.yml)

**ccgenes:**
Setting the `ccgenes` param to `default` in the ingest workflow will calculate the cell cycle phase of each cell with `scanpy.tl.score_genes_cell_cycle`, using the file provided in panpipes/resources/cell_cicle_genes.tsv
```
exclude_file: resources/qc_genelist_1.0.csv
```

Users can create their own list, and need to specify the path to this new file in the `ccgenes` param to score the cells with their custom list.
*Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.*

If left blank, the cellcycle score will not be calculated.

### Explaining custom gene lists actions

1. **Ingest workflow** (pipeline_ingest.py)

- **calc_proportions:** calculate proportion of reads mapping to X genes over total number of reads, per cell, using [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html#scanpy.pp.calculate_qc_metrics).

## Supplying custom gene lists

The custom gene list file can be supplied by the user in three workflows:
For example, for the rna modality, including a list of mitochondrial
genes in the group `mt`, and setting

calc_proportion: mt

Ingest workflow
-------------
will calculate the proportion of reads mapping to the genes whose group is "mt" and will add `pct_counts_mt` and `total_counts_mt`
to the `mdata["rna"].obs` assay.

pipeline_ingest config file: (pipeline.yml)

```
custom_genes_file: resources/qc_genelist_1.0.csv
```
- **score_genes:** using [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html), it calculates the average expression of a set of genes minus the average expression of a reference set of genes. First introduced in Satija et al. Nature Biotechnology (2015).
This action will generate a column **GroupName_score** and add it to the `mdata[MOD].obs` assay. For example, for the rna modality, including a list of markers for a cell type in the group 'MarkersNeutro' and setting

Preprocess workflow
----------------
score_genes: MarkersNeutro

will score the cells for the list of genes provided and add a column `MarkersNeutro_score` to the `mdata["rna"].obs` assay.

pipeline_preprocess config file: (pipeline.yml)

```
exclude_file: resources/qc_genelist_1.0.csv
```

*Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.*
2. **Preprocess workflow** (pipeline_preprocess.py)
- **exclude:** exclude these genes from the HVG selection, if they are deemed Highly Variable.

For the exclude action, if set to `default` the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`)
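
To make the three actions above more concrete, here is a simplified, plain-Python sketch of what each one computes for a single cell. It is not the panpipes or scanpy implementation: the toy counts, the group names and the helper functions are hypothetical, and the real `scanpy.tl.score_genes` samples its reference genes from expression-matched bins rather than using all remaining genes as done here.

```python
from statistics import mean

# Hypothetical toy data: counts per gene for two cells.
counts = {
    "cell_1": {"MT-ND1": 10, "MT-CO1": 10, "CD3E": 60, "ACTB": 20},
    "cell_2": {"MT-ND1": 1, "MT-CO1": 4, "CD3E": 15, "ACTB": 80},
}
# Hypothetical gene groups, as they might be read from a gene list file.
groups = {"mt": ["MT-ND1", "MT-CO1"], "MarkersT": ["CD3E"], "exclude": ["ACTB"]}


def calc_proportion(cell_counts, genes):
    """Return (total_counts_<group>, pct_counts_<group>) for one cell."""
    total = sum(cell_counts.values())
    group_total = sum(cell_counts.get(gene, 0) for gene in genes)
    pct = 100.0 * group_total / total if total else 0.0
    return group_total, pct


def score_genes(cell_values, genes):
    """Mean of the gene set minus mean of all other genes (simplified)."""
    gene_set = [cell_values[g] for g in genes if g in cell_values]
    reference = [v for g, v in cell_values.items() if g not in genes]
    return mean(gene_set) - mean(reference)


def apply_exclude(highly_variable, genes):
    """Return HVG flags with every excluded gene set to False."""
    return {g: flag and g not in genes for g, flag in highly_variable.items()}


mt_total, mt_pct = calc_proportion(counts["cell_1"], groups["mt"])
marker_score = score_genes(counts["cell_1"], groups["MarkersT"])
hvg_flags = apply_exclude({"CD3E": True, "ACTB": True}, groups["exclude"])
print(mt_total, mt_pct, round(marker_score, 2), hvg_flags)
```

In this sketch an excluded gene is only unflagged as highly variable, not removed from the data; whether panpipes handles exclusion in exactly this way is an assumption.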

Visualization workflow

### Cell cycle actions

As described before, we also rely on a user-supplied list of genes to calculate the cell cycle phase of a cell. We believe that this choice offers the maximum flexibility to use a trusted gene set for the calculation of this metric.
The cell cycle scoring happens in the `ingest` workflow using the `ccgenes` parameter. The cell cycle action is performed using `scanpy.tl.score_genes_cell_cycle`.

**ccgenes:**
Setting the `ccgenes` param to `default` in the ingest workflow will calculate the cell cycle phase of each cell with `scanpy.tl.score_genes_cell_cycle`, using the file provided in panpipes/resources/cell_cicle_genes.tsv. Using this file, this action will produce at least 3 columns in the `mdata["rna"].obs` assay, namely 'S_score', 'G2M_score', 'phase'.

Users can create their own list, and need to specify the path to this new file in the `ccgenes` param to score the cells with their custom list.

If left blank, the cellcycle score will not be calculated.
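
As a rough illustration of how the 'phase' column relates to 'S_score' and 'G2M_score', the sketch below applies the assignment rule as we understand it from scanpy: a cell is called G1 when both scores are negative, G2M when the G2M score is the larger one, and S otherwise. The function name `assign_phase` is hypothetical and the rule should be treated as an assumption to check against the scanpy documentation.

```python
def assign_phase(s_score, g2m_score):
    """Assign a cell cycle phase from the two scores (assumed rule)."""
    if s_score < 0 and g2m_score < 0:
        return "G1"  # neither gene set is expressed above its reference
    if g2m_score > s_score:
        return "G2M"
    return "S"


# Hypothetical (S_score, G2M_score) pairs for three cells.
example_scores = [(-0.1, -0.2), (0.3, 0.1), (0.1, 0.4)]
phases = [assign_phase(s, g2m) for s, g2m in example_scores]
print(phases)
```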


Using custom gene lists to plot: the Visualization workflow
---------------
Users may also supply custom gene lists to plot markers using standard visualizations, such as UMAPs or dotplots of gene expression.
We have designated an entire workflow just for this purpose. The Visualization workflow accepts custom gene files in the [same 3-column format as above]().

These files can be specified in the `viz` configuration file as follows:

pipeline_vis config file: (pipeline.yml)

@@ -126,7 +145,7 @@ minimal:

```

These require the same 3-column format as above.

Generally, in the visualisation pipeline all gene groups in the input are plotted. For heatmaps and dot
plots, one plot per group is produced. For umaps, one plot per gene is
produced, and a new file is saved per group.
2 changes: 1 addition & 1 deletion panpipes/python_scripts/run_scanpyQC_rna.py
@@ -162,7 +162,7 @@
L.info("saving anndata and obs in a metadata tsv file")
write_obs(mdata, output_prefix=args.sampleprefix,
output_suffix="_cell_metadata.tsv")
# CRITICAL to do WORK OUT WHICH QC SCRIPT TO USE TO SAVE THE MDATA OR ANNDATA

mdata.write(args.outfile)

L.info("done")