DendrouLab · bio-la · Nov 27, 2023 · Nov 27, 2023
diff --git a/docs/usage/gene_list_format.md b/docs/usage/gene_list_format.md
@@ -8,12 +8,12 @@ ribosomal genes, or excluding genes from HVG selection such as those constitutin
 
 We provide an example of a preformatted gene lists file in [resources/qc_genelist_1.0.csv](https://github.com/DendrouLab/panpipes/blob/main/panpipes/resources/qc_genelist_1.0.csv).
 
-All <sup>[1](#footnote1)</sup> files provided to the pipeline should be in a 3 columns format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), feature and group. The group column is used to distinguish different gene groups.
+All custom gene list files provided to the pipeline should be in a 3-column format, where the column headers are "mod" (modality: "rna", "prot", or "atac"), "feature" and "group". The group column is used to distinguish different gene groups.
 
 
-**mod**: the modality for the feature in use 
-**feature**: feature name, i.e. gene
-**group**: the group the gene belongs to
+- **mod**: the modality for the feature in use 
+- **feature**: feature name, i.e. gene
+- **group**: the group the gene belongs to
 
 | mod | feature | group   |
 | --- | ------- | ------- |
@@ -34,9 +34,8 @@ are stored in [resources/cell_cycle_genes.csv](https://github.com/DendrouLab/pan
 
 Differently from the other custom gene file, the cell cycle file should be a **tab separated file with two columns**:
 
-**gene_name**:  the name of the gene
-
-**cc_phase**: which phase of the cell cycle is the gene expression indicative of. 
+- **gene_name**: the name of the gene
+- **cc_phase**: the cell cycle phase that the gene's expression is indicative of
 
 
 | gene_name | cc_phase |
@@ -48,70 +47,90 @@ Differently from the other custom gene file, the cell cycle file should be a **t
 | CKAP2L    | g2m      |
 | ...       | ...      |
 
-## Explaining actions
+## Using custom gene lists to calculate QC metrics
 
-Panpipes uses "actions" to define which tasks to use which gene list for.
-Specify the "group" name of the genes you want to use to apply the action i.e. calc_proportion: mt will calculate
-proportion of reads mapping to the genes whose group is "mt"
+Panpipes uses "actions" to define which tasks use which group of genes from the provided gene list.
+Three main actions use gene lists to describe qualities of the cells and populate the metadata (`.obs`) of the object with the newly calculated cell QC metrics.
+For each action, specify the "group" name of the genes that the action should use to calculate the cell QC metric.
 
+If an action is left blank, it will not be performed (e.g. the percentage of mt genes per cell will not be calculated during ingestion of the data).
 
-The genes are scored for each modality using 
+### Supplying custom gene lists to calculate QC metrics
 
-- [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html#scanpy.pp.calculate_qc_metrics).
-For example, for the rna modality, including a list of mitochondiral
-genes in the group `mt`, will add `pct_counts_mt`, and `total_counts_mt`
-to the `mdata["rna"].obs` assay.
+The custom gene list file can be supplied by the user in two workflows to perform the three main actions:
 
-- [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html).
+1. **Ingest workflow**
 
 
-(for pipeline_ingest.py)
-**calc_proportions:** calculate proportion of reads mapping to X genes over total number of reads, per cell
-**score_genes:** using scanpy.tl.score_genes function, the average expression of a set of genes, subtracted of the average expression of a reference set of genes. First introduced in Satija et al. Nature Biotechnology (2015).
+    pipeline_ingest config file: (pipeline.yml)
 
-(for pipeline_preprocess.py)
-**exclude:** exclude these genes from the HVG selection, if they are deemed Highly Variable.
+    ```
+    custom_genes_file: resources/qc_genelist_1.0.csv
+    ```
 
-For the exclude action, if set to `default` the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are specifying your custom gene list and you want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`)
+2. **Preprocess workflow**
 
-If left blank, these actions will not be performed (i.e. no calculation of % of mt genes per cell will be included in the ingestion of the data)
 
-### Cell cycle action
+    pipeline_preprocess config file: (pipeline.yml)
 
-**ccgenes:**  
-Setting the `ccgenes` param to `default` in the ingest workflow will calculate the phase of the cell cycle in which the cell is by using `scanpy.tl.score_genes_cell_cycle` using the file provided in panpipes/resources/cell_cicle_genes.tsv
+    ```
+    exclude_file: resources/qc_genelist_1.0.csv
+    ```
 
-Users can create their own list, and need to specify the path to this new file in in the `ccgenes` param to score the cells with their custom list.
+*Note that we have formatted an example file containing all the genes used in both workflows, and therefore supply the same file to both; users can instead provide an independent file for each workflow.*
 
-If left blank, the cellcycle score will not be calculated.
 
+### Explaining custom gene list actions
+
+1. **Ingest workflow** (pipeline_ingest.py)
+
+- **calc_proportions:** calculates the proportion of reads mapping to the genes in a given group over the total number of reads, per cell, using [scanpy.pp.calculate_qc_metrics](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.calculate_qc_metrics.html#scanpy.pp.calculate_qc_metrics).
 
-## Supplying custom gene lists
 
-The custom genelist file can be supplied by the user in three worflows: 
+    For example, for the rna modality, including a list of mitochondrial
+    genes in the group `mt`, and setting
+
+        calc_proportions: mt
 
-Ingest workflow
--------------
+    will calculate the proportion of reads mapping to the genes whose group is "mt" and will add `pct_counts_mt` and `total_counts_mt`
+    to the `mdata["rna"].obs` assay.
 
-pipeline_ingest config file: (pipeline.yml)
 
-```
-custom_genes_file: resources/qc_genelist_1.0.csv
-```
+- **score_genes:** calculates the average expression of a set of genes minus the average expression of a reference set of genes, using [scanpy.tl.score_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.score_genes.html). This approach was first introduced in Satija et al., Nature Biotechnology (2015).
+  
+  This action will generate a column **GroupName_score** and add it to the `mdata[MOD].obs` assay. For example, for the rna modality, including a list of markers for a cell type in the group `MarkersNeutro` and setting
 
-Preprocess workflow
-----------------
+      score_genes: MarkersNeutro
+
+  will score the cells for the list of genes provided and add a column `MarkersNeutro_score` to the `mdata["rna"].obs` assay.
 
-pipeline_preprocess config file: (pipeline.yml)
 
-```
-exclude_file: resources/qc_genelist_1.0.csv
-```
 
-*Note that we have formatted an example file containing all genes to use in both workflows, and therefore supply the same file to both workflows but users can have independent files for each of them.*
+2. **Preprocess workflow** (pipeline_preprocess.py)
+- **exclude:** excludes these genes from the HVG selection if they are deemed highly variable.
+
+    For the exclude action, if set to `default`, the workflow will look for genes whose group is set to `exclude` in the supplied qc_genelist file. Alternatively, if you are supplying your own custom gene list and want to exclude another set of genes, for example a group you call `TCR_genes`, specify this group (i.e. `exclude: TCR_genes`).
 
-Vizualization workflow
+
+### Cell cycle actions
+
+As described before, we also rely on a user-supplied list of genes to calculate the cell cycle phase of a cell. We believe that this choice offers the maximum flexibility to use a trusted gene set for the calculation of this metric.
+The cell cycle scoring happens in the `ingest` workflow using the `ccgenes` parameter. The cell cycle action is performed using `scanpy.tl.score_genes_cell_cycle`.
+
+**ccgenes:**  
+Setting the `ccgenes` param to `default` in the ingest workflow will calculate the cell cycle phase of each cell with `scanpy.tl.score_genes_cell_cycle`, using the file provided in panpipes/resources/cell_cycle_genes.tsv. With this file, the action will produce at least 3 columns in the `mdata["rna"].obs` assay, namely 'S_score', 'G2M_score' and 'phase'.
+
+Users can create their own list and must specify the path to this new file in the `ccgenes` param to score the cells with their custom list.
+
+If left blank, the cell cycle score will not be calculated.
+
+
+Using custom gene lists to plot: the Visualization workflow
 ---------------
+Users may also supply custom gene lists to plot markers using standard visualizations, such as UMAPs or dotplots of gene expression.
+We have designated an entire workflow just for this purpose. The Visualization workflow accepts custom gene files in the [same 3-column format as above]().
+
+These files can be specified in the `viz` configuration file as follows:
 
 pipeline_vis config file: (pipeline.yml)
 
@@ -126,7 +145,7 @@ minimal:
 
 ```
 
-These require the same 3-column format as above.
+
 Generally in the visualisation pipeline all gene groups in the input are plotted. In heatmaps and dot
 plots, one dotplot per group is plotted. For umaps, one plot per gene is
 plotted, and a new file is saved per group.
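
To make the 3-column format and the `calc_proportions` action described in the documentation above more concrete, here is a minimal standard-library sketch. It is not how panpipes implements the action (the docs state that panpipes relies on `scanpy.pp.calculate_qc_metrics`); the gene names, the toy counts, and the helper names `load_gene_groups` and `pct_counts` are hypothetical and used for illustration only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature gene list in the 3-column format (mod, feature, group).
GENE_LIST_CSV = """mod,feature,group
rna,MT-CO1,mt
rna,MT-ND1,mt
rna,RPL3,rp
prot,CD3,exclude
"""


def load_gene_groups(handle):
    """Return {(mod, group): [features]} from a 3-column gene list file."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[(row["mod"], row["group"])].append(row["feature"])
    return dict(groups)


def pct_counts(cell_counts, genes):
    """Percentage of a cell's total counts that map to the given genes."""
    total = sum(cell_counts.values())
    if total == 0:
        return 0.0
    in_group = sum(cell_counts.get(gene, 0) for gene in genes)
    return 100.0 * in_group / total


groups = load_gene_groups(io.StringIO(GENE_LIST_CSV))

# Toy counts for a single cell (hypothetical values).
cell = {"MT-CO1": 5, "MT-ND1": 5, "RPL3": 10, "ACTB": 80}

# Analogous to what `calc_proportions: mt` reports as `pct_counts_mt`.
print(pct_counts(cell, groups[("rna", "mt")]))
```

Keying the groups by both modality and group name mirrors the role of the "mod" column: the same group name could, in principle, appear for more than one modality without the features being mixed.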

diff --git a/panpipes/python_scripts/run_scanpyQC_rna.py b/panpipes/python_scripts/run_scanpyQC_rna.py
@@ -162,7 +162,7 @@
 L.info("saving anndata and obs in a metadata tsv file")
 write_obs(mdata, output_prefix=args.sampleprefix, 
         output_suffix="_cell_metadata.tsv")
-# CRITICAL to do WORK OUT WHICH QC SCRIPT TO USE  TO SAVE THE MDATA OR ANNDATA
+
 mdata.write(args.outfile)
 
 L.info("done")