SeCNV, a single-cell copy number profiling tool.
Please install Git LFS and use the following command to clone the repository.
git lfs clone https://github.com/deepomicslab/SeCNV.git
The scripts are written in Python 3. The following Python packages should be installed:
- numpy
- pandas
- scipy
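- sklearn (distributed on PyPI as scikit-learn)
- linecache (part of the Python standard library, so no separate installation is needed)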
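A quick way to confirm that the packages above are importable is a small check like the following. This is only a sketch: the helper name `missing_modules` is hypothetical and is not part of SeCNV.

```python
import importlib.util

# Import names for the packages listed above. The import name "sklearn"
# is provided by the scikit-learn distribution.
REQUIRED_MODULES = ["numpy", "pandas", "scipy", "sklearn", "linecache"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python packages:", ", ".join(missing))
    else:
        print("All required Python packages are importable.")
```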
Since the SeCNV pipeline includes some bioinformatics pre-processing steps, the following bioinformatics tools should be installed and added to your environment variables.
- bwa
- samtools
- bedtools
- bigWigAverageOverBed
- pyfaidx
- picard
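To verify that the command-line tools are reachable from your environment, a check along these lines may help. It is a sketch under assumptions: the executable names are taken from the list above, `java` is checked in place of picard because picard is run as `java -jar picard.jar` in the steps below, and pyfaidx is left out because it is a Python package. The helper name `missing_tools` is hypothetical.

```python
import shutil

# Executable names assumed from the tool list above.
# "java" stands in for picard, which is invoked as "java -jar picard.jar".
REQUIRED_TOOLS = ["bwa", "samtools", "bedtools", "bigWigAverageOverBed", "java"]


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that are not found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```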
To run SeCNV, the bigWig files hg19_mappability.bigWig and hg38_mappability.bigWig should be downloaded from Zenodo and placed in the Scripts folder.
The hg19 or hg38 reference, which can be downloaded from NCBI, should be prepared and indexed.
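Before running the pipeline, it can be useful to confirm that the mappability file and the reference are where SeCNV expects them. The sketch below assumes the script folder is named Scripts (matching the `cd Scripts` command used later) and that only the bigWig file for the chosen reference is needed. It does not check the index files, because their exact names depend on the indexing tool used. The function names are hypothetical.

```python
from pathlib import Path


def mappability_path(scripts_dir, ref="hg19"):
    """Build the expected path of the mappability bigWig for a reference name."""
    return Path(scripts_dir) / f"{ref}_mappability.bigWig"


def check_inputs(scripts_dir, ref_file, ref="hg19"):
    """Return a list of problems found; an empty list means both files exist."""
    problems = []
    bigwig = mappability_path(scripts_dir, ref)
    if not bigwig.is_file():
        problems.append(f"mappability file not found: {bigwig}")
    if not Path(ref_file).is_file():
        problems.append(f"reference file not found: {ref_file}")
    return problems


if __name__ == "__main__":
    for problem in check_inputs("Scripts", "hg19.fa", ref="hg19"):
        print(problem)
```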
For alignment, sorting, adding read groups, and deduplication, we recommend the following steps.
- Align FASTQ files to the reference genome
bwa mem -M -t 8 hg19.fa file_name.fastq.gz > file_name.sam
samtools view -bS file_name.sam > file_name.bam
- Sort
java -Xmx30G -jar picard.jar SortSam INPUT=file_name.bam OUTPUT=file_name.sorted.bam SORT_ORDER=coordinate
- Add read group
java -Xmx40G -jar picard.jar AddOrReplaceReadGroups I=file_name.sorted.bam O=file_name.sorted.rg.bam RGID=file_name RGLB=NAVIN_Et_Al RGPL=ILLUMINA RGPU=machine RGSM=file_name
- Remove duplicates and build the BAM index
java -Xmx40G -jar picard.jar MarkDuplicates REMOVE_DUPLICATES=true I=file_name.sorted.rg.bam O=file_name.sorted.rg.dedup.bam METRICS_FILE=file_name.sorted.rg.dedup.metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_VERSION=null PROGRAM_GROUP_NAME=MarkDuplicates
java -jar picard.jar BuildBamIndex I=file_name.sorted.rg.dedup.bam
Please change hg19.fa to your reference location and file_name to your FASTQ file name.
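When many cells need to be processed, the commands above can be generated per sample instead of edited by hand. The following is a minimal sketch that reproduces the same commands for a given sample name. The function names and the `picard_jar` and `threads` parameters are hypothetical, and sample names are assumed to contain no spaces or shell metacharacters.

```python
import subprocess


def preprocessing_commands(sample, reference="hg19.fa", picard_jar="picard.jar", threads=8):
    """Return the shell commands from the steps above for one sample name."""
    sorted_bam = f"{sample}.sorted.bam"
    rg_bam = f"{sample}.sorted.rg.bam"
    dedup_bam = f"{sample}.sorted.rg.dedup.bam"
    return [
        # Align and convert to BAM
        f"bwa mem -M -t {threads} {reference} {sample}.fastq.gz > {sample}.sam",
        f"samtools view -bS {sample}.sam > {sample}.bam",
        # Sort by coordinate
        f"java -Xmx30G -jar {picard_jar} SortSam INPUT={sample}.bam "
        f"OUTPUT={sorted_bam} SORT_ORDER=coordinate",
        # Add read group
        f"java -Xmx40G -jar {picard_jar} AddOrReplaceReadGroups I={sorted_bam} "
        f"O={rg_bam} RGID={sample} RGLB=NAVIN_Et_Al RGPL=ILLUMINA RGPU=machine RGSM={sample}",
        # Remove duplicates
        f"java -Xmx40G -jar {picard_jar} MarkDuplicates REMOVE_DUPLICATES=true "
        f"I={rg_bam} O={dedup_bam} METRICS_FILE={sample}.sorted.rg.dedup.metrics.txt "
        "PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_VERSION=null "
        "PROGRAM_GROUP_NAME=MarkDuplicates",
        # Index the final BAM
        f"java -jar {picard_jar} BuildBamIndex I={dedup_bam}",
    ]


def run_preprocessing(sample, **kwargs):
    """Run each command in order and stop at the first failure."""
    for command in preprocessing_commands(sample, **kwargs):
        subprocess.run(command, shell=True, check=True)
```

Calling `preprocessing_commands("file_name")` only returns the command strings, so they can be inspected before `run_preprocessing` is used to execute them.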
Next, SeCNV takes the bam files as input to profile copy number.
cd Scripts
python SeCNV.py input_fold output_fold ref_file
input_fold is the folder containing the BAM files, output_fold is the folder where the output files will be written (an empty folder is recommended), and ref_file is the path to the indexed hg19 or hg38 reference. Other parameters are shown below:
- -r or --ref: The reference used (hg19 or hg38) [default: hg19].
- -b or --bin_size: The length of each bin [default: 500000].
- -min or --min_ploidy: The minimal ploidy [default: 1.5].
- -max or --max_ploidy: The maximal ploidy [default: 5].
- -p or --pattern: The pattern of bam file names [default: *dedup.bam].
- -K or --topK: The K largest distances used to construct the adjacency matrix [default: auto_set].
- -s or --sigma: The standard deviation of the Gaussian kernel function [default: auto_set].
- -n or --normal_cell: The file with normal cell IDs [default: None].
For more information, please use python SeCNV.py -h or python SeCNV.py --help.
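If SeCNV is launched from another Python script, the call can be assembled as an argument list. The sketch below uses only the positional arguments and the `--ref`, `--bin_size`, and `--pattern` options documented above. The function name `secnv_command` and the example paths are hypothetical.

```python
import subprocess


def secnv_command(input_fold, output_fold, ref_file, ref="hg19",
                  bin_size=500000, pattern="*dedup.bam"):
    """Build the SeCNV command line as a list of arguments."""
    return [
        "python", "SeCNV.py", input_fold, output_fold, ref_file,
        "--ref", ref,
        "--bin_size", str(bin_size),
        # Passed as a list item, so the shell does not expand the wildcard.
        "--pattern", pattern,
    ]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own folders and reference.
    command = secnv_command("/path/to/bams", "/path/to/output", "/path/to/hg19.fa")
    # SeCNV.py is run from inside the Scripts folder, as in the command above.
    subprocess.run(command, cwd="Scripts", check=True)
```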
WANG Ruohan ruohawang2-c@my.cityu.edu.hk
@article{ruohan2022resolving,
title={Resolving single-cell copy number profiling for large datasets},
author={Ruohan, Wang and Yuwei, Zhang and Mengbo, Wang and Xikang, Feng and Jianping, Wang and Shuai Cheng, Li},
journal={Briefings in Bioinformatics},
volume={23},
number={4},
pages={bbac264},
year={2022},
publisher={Oxford University Press}
}