Vector stability analysis

This reposotory provides the code and describes the analysis steps for assesing vector stability and integration using long-read sequencing. The first step is to create the appropriate reference for mapping that incudes the vector sequence. The rest of the steps are performed by a snakemake pipeline. Description of the steps are found here. This pipeline borrows from the AAV analysis by Elizabeth Tseng (Magdoll).

Pipeline

Installation

Depedencies

miniconda
The rest of the dependencies (including snakemake) are installed via conda through the environment.yml file. This includes the tools used for all analysis steps.

Installation process

Clone the directory:

git clone --recursive https://github.com/egustavsson/vector-analysis.git

Create conda environment for the pipeline which will install all the dependencies:

cd vector-analysis
conda env create -f environment.yml

Input

PacBio CCS reads in FASTA format.
Reference genome assembly in FASTA format. Described in the analysis tutorial.

How to use

Edit config.yml to set up the working directory and input files/directories. snakemake command should be issued from within the pipeline directory. Please note that before you run any of the snakemake commands, make sure to first activate the conda environment using the command conda activate vector-analysis.

cd vector-analysis
conda activate vector-analysis
snakemake --use-conda -j <num_cores> all

It is a good idea to do a dry run (using -n parameter) to view what would be done by the pipeline before executing the pipeline.

snakemake --use-conda -n all

To exit a running snakemake pipeline, hit ctrl+c on the terminal. If the pipeline is running in the background, you can send a TERM signal which will stop the scheduling of new jobs and wait for all running jobs to be finished.

killall -TERM snakemake

To deactivate the conda environment:

conda deactivate

Analysis steps

1. Preparing the genome and annotation file

The following fasta files (if available) should be combined into a single "genome" fasta file:

host genome (ex: hg38). Can be downloaded from Ensembl, GENCODE or NCBI
vector (including the vector + plasmid backbone as a single sequence)

NOTE the sequence IDs should be free of blank spaces and symbols. Stick with numbers, alphabet letters, and _ and -. If necessary, rename the sequence IDs in the combined fasta file.

Create a annotation.txt file according to the following format:

NAME=<sequence id>;TYPE={vector|helper|repcap|host|lambda};REGION=<start>-<end>;

Only the vector annotation is required and must be marked with REGION=. All other types are optional. For example:

NAME=my_plasmid;TYPE=vector;REGION=100-2000;
NAME=chr1;TYPE=host;
NAME=chr2;TYPE=host;
NAME=chr3;TYPE=host;

IMPORTANT!!! you must have exactly the same number of chromosomes in the reference fasta file as annotations file. This is especially common if you are including human genome (hg38) which has a lot of alternative chromosomes. It is recommended that you use a version of hg38 that only lists the major chromosomes.

Combinig the host genome and the vector into a single fasta can be done by:

paste host.fasta vector.fasta > combined.fasta

2. BAM to FASTA

PacBio HiFi reads are generally generated as an unmapped BAM file. This only needs to be doen once regardless of reference used for mapping or other changes while mapping. To convert BAM to FASTA format use this example:

bam2fasta -u -o out in.bam

3. Mapping reads and calling structural variants

After fallowing steps 1 and 2, generating required genomes references and FASTA input are done, make sure the config.yml is edited. These are the parameters:

Parameter	Description
pipeline	this will be the folder with the output under the workdir
workdir	Set path to working directory
sample_name	sample name which will be the prefix of output files. Default is `Sample`
genome	genome fasta that will be used to map against
CCS_fasta	The HiFi CCS fasta file
minimap_opts	Optional options passed to minimap for minimap for mapping
sniffles_opts	Optional options passed to sniffles for SV calling
threads	Number of threads to use. Default is 10

Mapping is done using minimap2. For detailed minimap2 instructions, see their manual. Structural variants are called using Sniffles2. For Sniffles2 instructions see their github repo or run:

sniffles --help

The mapping step and SV calling is done by running the snakemake as previously described:

cd vector-analysis
snakemake --use-conda -j <num_cores> all

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
schema		schema
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.yml		config.yml
environment.yml		environment.yml
snakefile		snakefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vector stability analysis

Pipeline

Installation

Depedencies

Installation process

Input

How to use

Analysis steps

1. Preparing the genome and annotation file

2. BAM to FASTA

3. Mapping reads and calling structural variants

About

Releases

Packages

License

egustavsson/vector-analysis

Folders and files

Latest commit

History

Repository files navigation

Vector stability analysis

Pipeline

Installation

Depedencies

Installation process

Input

How to use

Analysis steps

1. Preparing the genome and annotation file

2. BAM to FASTA

3. Mapping reads and calling structural variants

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages