sc-eQTL

This repository contains the scripts used to parse the single-cell RNA-sequencing (scRNA-seq) data and to perform downstream analyses, such as variance partitioning and eQTL mapping, from the manuscript [Refining the resolution of the yeast genotype-phenotype map using single-cell RNA-sequencing](BioRxiv link). The supplementary files "Matrix_gene_expression_barcodes_1_to_9000.csv" and "Matrix_gene_expression_barcodes_9001_to_18233.csv" contain the expression profiles of the single cells. The gene names are listed in the file "Table_expressed_genes.csv". The file "Table_single_cell_barcodes_mapping_to_reference_panel_strains_0_based_index.csv" contains the assignment of each single cell to a reference panel strain (0-based index). In that table, the column best_match gives the 0-based index of the reference panel strain closest to each single cell. The column significant_best_match holds the same information, but with a missing value (NA) when the relatedness between the single cell and its closest strain is not statistically significant (N'Guessan et al., 2023).
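As a rough illustration of how the best_match and significant_best_match columns relate, the sketch below parses a small table and keeps only the cells with a significant assignment. The column names best_match and significant_best_match come from the description above; the barcode column name, the barcodes and the values are made up for the example, so check them against the actual file before reusing this.

```python
import csv
import io


def read_strain_assignments(csv_text, barcode_column="barcode"):
    """Map each barcode to (best_match, significant_best_match).

    Both values are 0-based indices into the reference panel. A missing
    significant_best_match (empty or "NA") is returned as None.
    The barcode column name is an assumption, not taken from the real file.
    """
    assignments = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        best = int(float(row["best_match"]))
        raw = row["significant_best_match"].strip()
        significant = None if raw in ("", "NA") else int(float(raw))
        assignments[row[barcode_column]] = (best, significant)
    return assignments


# Made-up rows, for illustration only.
example = """barcode,best_match,significant_best_match
AAAC-1,12,12
AAAG-1,0,NA
"""

assignments = read_strain_assignments(example)

# Keep only the cells whose closest strain is statistically supported.
confident = {b: s for b, (_, s) in assignments.items() if s is not None}
print(confident)
```

Because the indices are 0-based, a value of 0 is a valid strain and must not be confused with a missing value, which is why the missing case is represented as None rather than by a falsy number.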

Pipeline

I. Expression count

To obtain the expression profile of each single cell, filter out noisy single cells (empty droplets, low number of reads, etc.) and obtain the read mapping needed for the allele count, run the following scripts in the specified order:

  1. submit_cellCOUNT_BY_ref.sh: In this script file, replace the argument values starting with "$" with the appropriate values described at https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count#cr-count, where $BY-reference is the path to the reference transcriptome (e.g. https://github.com/arnaud00013/sc-eQTL/tree/main/BY_reference). Replace #HEADER with the allocated resources if you are running the script on a server.
  2. submit_cellCOUNT_RM_ref.sh: In this script file, replace the argument values starting with "$" with the appropriate values described at https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/count#cr-count, where $RM-reference is the path to the reference transcriptome (e.g. https://github.com/arnaud00013/sc-eQTL/tree/main/RM_reference). Replace #HEADER with the allocated resources if you are running the script on a server.
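For orientation, the two runs above amount to calling cellranger count once per reference transcriptome. The sketch below only assembles the two command lines and does not execute anything. The options --id, --transcriptome, --fastqs and --sample are standard cellranger count arguments, but the exact arguments used inside the submit scripts are not shown here, so the run identifiers, the FASTQ path and the sample name are all hypothetical. Some Cell Ranger versions also require additional options.

```python
import shlex


def cellranger_count_command(run_id, transcriptome, fastqs, sample):
    """Build the argument list for one `cellranger count` run."""
    return [
        "cellranger",
        "count",
        f"--id={run_id}",
        f"--transcriptome={transcriptome}",
        f"--fastqs={fastqs}",
        f"--sample={sample}",
    ]


# Hypothetical paths and names; replace them with your own values.
references = {"BY": "BY_reference", "RM": "RM_reference"}

commands = {
    parent: cellranger_count_command(
        run_id=f"count_{parent}_ref",
        transcriptome=reference,
        fastqs="fastq_dir",
        sample="my_sample",
    )
    for parent, reference in references.items()
}

# Print the BY run first, then the RM run, as in the list above.
for parent in ("BY", "RM"):
    print(shlex.join(commands[parent]))
```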

II. Creating single-cell imputed and corrected genotypes

To create single-cell imputed and corrected genotypes from the raw single-cell read counts, run the following scripts in the specified order:

  1. submit_Estimate_perrs_HMM.sh: Make sure to run the script from the directory sc-eQTL/II_scRNA-seq_genotyping/. Replace the following arguments in the script with the corresponding values: $workspace_path (path to the directory containing the file that lists the barcodes and the data/ sub-directory, which contains the BAM file), $bam_filename, $list_of_barcodes_filename, $number_of_minimum_mismatch_within_the_same_read_for_index_swapping, $minimum_coverage_per_site and $number_of_cpus. Replace #HEADER with the allocated resources if you are running the script on a server.
  2. submit_scRNAseq_genotyping_pipeline.sh: Make sure to run the script from the directory sc-eQTL/II_scRNA-seq_genotyping/. Replace the following arguments in the script with the corresponding values: $Nb_cells (the number of barcodes or cells) and $Nb_cpus (the number of CPUs allocated for this task). Replace #HEADER with the allocated resources if you are running the script on a server.

III. Genotype analysis

To determine the closest reference panel strain to each single cell and to estimate the number of breakpoints per genotype, run the following scripts in the specified order:

  1. submit_get_uncorrected_gen_dist_mtx.sh: Make sure to run the script from the directory sc-eQTL/III_Genotype_analysis/. Replace the following arguments in the script with the corresponding values: $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I), $Nb_cpus (the number of CPUs allocated for this task) and $Number_of_subsampes_for_lineage_assignment (the recommended value is 500). The reference panel genotypes can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.1rn8pk0vd. You can then search for "Import reference panel genotypes" in the Python script generate_dist_cell_to_lineage_uncorrected_gen.py and edit the code so that it imports the reference panel genotypes with the correct file names. Replace #HEADER with the allocated resources if you are running the script on a server.
  2. submit_get_corrected_gen_dist_mtx.sh: Make sure to run the script from the directory sc-eQTL/III_Genotype_analysis/. Replace the following arguments in the script with the corresponding values: $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I), $Nb_cpus (the number of CPUs allocated for this task) and $Number_of_subsampes_for_lineage_assignment (the recommended value is 500). The reference panel genotypes can be downloaded from https://datadryad.org/stash/dataset/doi:10.5061/dryad.1rn8pk0vd. You can then search for "Import reference panel genotypes" in the Python script generate_dist_cell_to_lineage_corrected_gen.py and edit the code so that it imports the reference panel genotypes with the correct file names. Replace #HEADER with the allocated resources if you are running the script on a server.
  3. submit_Count_nb_breakpoints.sh: Make sure to run the script from the directory sc-eQTL/III_Genotype_analysis/. Replace the following arguments in the script with the corresponding values: $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I) and $Nb_cpus (the number of CPUs allocated for this task). Replace #HEADER with the allocated resources if you are running the script on a server.
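To make the two quantities computed in this step more concrete, here is a deliberately simplified sketch. It is not the method implemented in the repository scripts, which also involve subsampling and a significance assessment. It assumes genotypes are encoded as 0 (BY allele) and 1 (RM allele) at ordered sites, with None for sites not observed in a cell, and it uses a plain mismatch fraction as the distance. All names and data are illustrative.

```python
def mismatch_fraction(cell, strain):
    """Fraction of differing alleles among the sites observed in the cell."""
    compared = [(c, s) for c, s in zip(cell, strain) if c is not None]
    if not compared:
        raise ValueError("the cell has no observed sites")
    return sum(c != s for c, s in compared) / len(compared)


def closest_strain(cell, panel):
    """0-based index of the panel strain with the smallest mismatch fraction."""
    distances = [mismatch_fraction(cell, strain) for strain in panel]
    return min(range(len(panel)), key=lambda i: distances[i])


def count_breakpoints(genotype):
    """Number of allele switches between consecutive sites.

    Apply this to one chromosome at a time, so that a change of allele
    across a chromosome boundary is not counted as a breakpoint.
    """
    return sum(a != b for a, b in zip(genotype, genotype[1:]))


# Toy reference panel: three strains, six ordered sites on one chromosome.
panel = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
]

# Toy single cell with two unobserved sites.
cell = [0, None, 0, 1, None, 1]

print(closest_strain(cell, panel))
print([count_breakpoints(strain) for strain in panel])
```

Restricting the comparison to observed sites matters for scRNA-seq data, where most sites have no coverage in any given cell.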

IV. Variance partitioning

To evaluate the association between the genotype, gene expression and the phenotype, run the following scripts in the specified order:

  1. submit_Varpart_Pheno_vs_Genotype_and_Expression.sh: Make sure to run the script from the directory sc-eQTL/IV_variance_partitioning/. Replace the following arguments in the script with the corresponding values: $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I) and $Nb_cpus (the number of CPUs allocated for this task). Replace #HEADER with the allocated resources if you are running the script on a server.
  2. submit_generate_pca_expression_partitions.sh: Make sure to run the script from the directory sc-eQTL/IV_variance_partitioning/. Replace the following arguments in the script with the corresponding values: $nb_expression_pcs_partitions (this script forms groups, or partitions, of expression principal components, and this argument sets the number of partitions), $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I) and $Nb_cpus (the number of CPUs allocated for this task). Replace #HEADER with the allocated resources if you are running the script on a server.
  3. submit_Varpart_Pheno_vs_Genotype_and_Expression.sh: This script computes the correlation between a single partition of expression PCs and the corresponding genotypes. Make sure to run the script from the directory sc-eQTL/IV_variance_partitioning/. Replace the following arguments in the script with the corresponding values: $the_ind_partition (the 0-based index of the partition), $nb_expression_pcs_partitions (the total number of expression PC partitions), $nb_expression_PCs (the number of expression principal components explaining 99% of the expression variance), $workspace (the path to the directory containing the data/ sub-directory), $cellranger_outs_folder (the path to the cellranger output directory generated in step I) and $Nb_cpus (the number of CPUs allocated for this task). Replace #HEADER with the allocated resources if you are running the script on a server.
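The arguments $nb_expression_PCs, $nb_expression_pcs_partitions and $the_ind_partition can be pictured with the sketch below. It shows one way to find how many principal components are needed to reach 99% of the variance, and one way to split those components into partitions addressed by a 0-based index. The split into contiguous, near-equal groups is an assumption made for illustration; the repository script may group the components differently. The function names and the variance ratios are made up.

```python
def n_components_for_variance(explained_variance_ratio, threshold=0.99):
    """Smallest number of leading PCs whose cumulative ratio reaches threshold."""
    total = 0.0
    for n, ratio in enumerate(explained_variance_ratio, start=1):
        total += ratio
        if total >= threshold:
            return n
    return len(explained_variance_ratio)


def partition_bounds(n_pcs, n_partitions, ind_partition):
    """Half-open range [start, stop) of PC indices in one partition.

    Partitions are contiguous and differ in size by at most one PC.
    ind_partition is 0-based.
    """
    if not 0 <= ind_partition < n_partitions:
        raise ValueError("ind_partition must be in [0, n_partitions)")
    base, extra = divmod(n_pcs, n_partitions)
    start = ind_partition * base + min(ind_partition, extra)
    stop = start + base + (1 if ind_partition < extra else 0)
    return start, stop


# Made-up explained variance ratios, sorted from the first PC onwards.
ratios = [0.6, 0.25, 0.1, 0.045, 0.004, 0.001]
nb_expression_pcs = n_components_for_variance(ratios)
print(nb_expression_pcs)

# Splitting 10 PCs into 3 partitions, one job per partition index.
bounds = [partition_bounds(10, 3, i) for i in range(3)]
print(bounds)
```

Each partition index can then be submitted as a separate job, which is presumably why the third script takes the index of a single partition as an argument.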