Recombinant Population Analysis

This repository represents the largest portion of my Bachelor's thesis. I graduated in Genomics at the University of Bologna. This project started in summer 2023 during the Biozentrum research summer project, at NeherLab. I graduated in July 2024 with a final score of 110/110 cum laude, along with an honourable mention.

The main purpose of this repository is to process the data produced by an Aionostat experiment. The Aionostat is a machine that allows automatic experimental evolution of phages. You can find details on the specific experiment discussed in the thesis by exploring its contents. Right now, this repository aims to be a general tool for analysing any heterogeneous population of recombinant molecular entities sequenced with ONT. The experimental requirements are the following:

The recombinant population has to arise from just two ancestral species that mixed.
The two ancestral species have to be fairly similar, allowing homologous recombination.
Only homologous recombination is detected.

The pipeline follows this schematic workflow:

The two references corresponding to the ancestral phages are combined into a hybrid reference. The reads of the recombinant population are aligned to this reference with minimap2. For each read, the resulting alignment is treated as an approximation of the MSA of the two references plus the recombinant read. From the MSA, the evidence for the read belonging to ancestral sequence 1 or 2 is extracted and fed to the HMM. For more details on the HMM, see here.
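
As a rough illustration of the last step, the sketch below decodes a per-site evidence string with a two-state HMM using the Viterbi algorithm. The evidence encoding (0 = uninformative site, 1 = supports ancestor 1, 2 = supports ancestor 2), the function name viterbi_ancestry and every probability value are assumptions made for this example. They are not the model or the parameters used by the pipeline.

```python
import math


def viterbi_ancestry(evidence, p_switch=0.01, p_informative=0.1, p_error=0.05):
    """Return the most likely hidden state at each site of a read.

    State 0 means "derived from ancestor 1", state 1 means "derived from
    ancestor 2". evidence is a sequence of 0 (uninformative), 1 (supports
    ancestor 1) or 2 (supports ancestor 2). All defaults are illustrative.
    """
    if not evidence:
        return []

    def emissions(own, other):
        # Log-probability of each evidence symbol given the hidden state.
        probs = {
            0: 1.0 - p_informative,
            own: p_informative * (1.0 - p_error),
            other: p_informative * p_error,
        }
        return {symbol: math.log(p) for symbol, p in probs.items()}

    emit = [emissions(1, 2), emissions(2, 1)]
    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch)

    # Forward pass: best log-score of any path ending in each state.
    score = [math.log(0.5) + emit[s][evidence[0]] for s in (0, 1)]
    backpointers = []
    for symbol in evidence[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            pointers.append(s if stay >= switch else 1 - s)
            new_score.append(max(stay, switch) + emit[s][symbol])
        score = new_score
        backpointers.append(pointers)

    # Backtrack from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these illustrative parameters, a single conflicting site is explained as an error, whereas a sustained run of evidence for the other ancestor produces a state switch, which is the behaviour one would want from a recombination detector.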

Configuration

You can run the pipeline by properly setting up the run_config.yml file and by creating a folder with the input data.

Input folder

The input folder should have the following structure:

data/
    reads/
        [replicate_code]_[timestep_code].fastq.gz
        ...
    references.fasta

Each fastq file should be named with two codes: one identifying the experimental replicate and one progressively numbering successive timesteps (in the case of a time-series analysis).
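
The naming convention can be checked with a small helper such as the one below. This is only a sketch: the function name parse_read_filename is hypothetical, and it assumes that the two codes are separated by the last underscore in the file name.

```python
from pathlib import PurePath


def parse_read_filename(path):
    """Split '<replicate>_<timestep>.fastq.gz' into (replicate, timestep)."""
    name = PurePath(path).name
    suffix = ".fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"not a .fastq.gz file: {name}")
    stem = name[: -len(suffix)]
    # Assumption: the timestep code is everything after the last underscore.
    replicate, separator, timestep = stem.rpartition("_")
    if not separator or not replicate or not timestep:
        raise ValueError(f"expected '<replicate>_<timestep>.fastq.gz', got: {name}")
    return replicate, timestep
```

For a file named P2_7.fastq.gz (made-up codes), the helper returns the pair ("P2", "7").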

The two reference genomes should be included in the same fasta file named "references.fasta".
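
Since the pipeline expects exactly two ancestral references, it may be worth checking the fasta file before a run. The sketch below is one possible check, not part of the pipeline. The function names are hypothetical, and it takes the record name to be the first whitespace-delimited token of each header line.

```python
def reference_names(fasta_lines):
    """Return the record names found in the lines of a fasta file."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first token after the '>' character.
            tokens = line[1:].split()
            names.append(tokens[0] if tokens else "")
    return names


def check_two_references(fasta_lines):
    """Return the two reference names, or raise if the file is unsuitable."""
    names = reference_names(fasta_lines)
    if len(names) != 2:
        raise ValueError(f"expected 2 reference records, found {len(names)}")
    if names[0] == names[1]:
        raise ValueError(f"the two references share the name {names[0]!r}")
    return names[0], names[1]
```

Both functions accept any iterable of lines, so an open file handle can be passed directly.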

run_config.yml

The run_config file has 4 sections:

run_config: describes the file configuration of the pipeline run. Write down the names of the two references and of the replicates and timesteps that have to be analyzed.
alignments: set the length threshold below which reads are ignored and not aligned to the hybrid reference.
HMM: define the HMM parameters. For more details, see here.
plots: set the coverage threshold below which the inferences for a site are not shown in the plots.
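
A quick sanity check on a parsed configuration might look like the sketch below. Only the four section names come from the description above. The helper missing_sections and every key and value inside example_config are hypothetical placeholders, so refer to the run_config.yml shipped with the repository for the real field names.

```python
# Section names as listed in this README.
REQUIRED_SECTIONS = ("run_config", "alignments", "HMM", "plots")


def missing_sections(config):
    """Return the required top-level sections absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Hypothetical content: the inner keys and values are placeholders only.
example_config = {
    "run_config": {
        "references": ["ancestor_A", "ancestor_B"],
        "replicates": ["P2"],
        "timesteps": ["1", "7"],
    },
    "alignments": {"length_threshold": 5000},
    "HMM": {"transition_probability": 0.001},
    "plots": {"coverage_threshold": 10},
}

if __name__ == "__main__":
    print(missing_sections(example_config))
```

The function works on a plain dictionary, so it can be applied to whatever a YAML parser returns for run_config.yml.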

Running the pipeline

Local execution

snakemake --profile local --configfile run_config.yml

HPC execution

snakemake --profile cluster --configfile run_config.yml

Output

Two plots are produced by the pipeline:

Coverage plot

After gathering the inferences made on all reads, the fraction of reads assigned to ancestral sequence 1 and 2 is plotted for each site of the hybrid reference.
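
The quantity behind this plot can be sketched as follows. The representation of a read as a (start, path) pair, where path holds one inferred state per site (0 for ancestor 1, 1 for ancestor 2), is an assumption made for the example and ignores insertions and deletions. The function name ancestry_fractions is hypothetical too.

```python
def ancestry_fractions(reads, reference_length, coverage_threshold=1):
    """Per-site fraction of reads assigned to ancestor 1 and ancestor 2.

    reads: iterable of (start, path) pairs; path[i] is the state (0 or 1)
    inferred at hybrid-reference position start + i.
    Sites with coverage below coverage_threshold are reported as None.
    """
    counts = [[0, 0] for _ in range(reference_length)]
    for start, path in reads:
        for offset, state in enumerate(path):
            position = start + offset
            if 0 <= position < reference_length:
                counts[position][state] += 1

    fractions = []
    for first, second in counts:
        coverage = first + second
        if coverage == 0 or coverage < coverage_threshold:
            # Too few reads to show a reliable estimate at this site.
            fractions.append(None)
        else:
            fractions.append((first / coverage, second / coverage))
    return fractions
```

Masking low-coverage sites with None mirrors the role of the coverage threshold set in the plots section of the configuration.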

Recombination plot

After gathering the inferences made on all reads, all the recombination events (i.e. the positions in a recombinant read where a switch from one reference to the other is inferred) are plotted on the hybrid reference genome, normalised by the total number of reads mapped to each position.
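
The normalisation described above can be sketched as follows. As before, each read is assumed to be a (start, path) pair with one inferred state (0 or 1) per site, and a switch is attributed to the first site of the new state. Both choices, and the function name recombination_rate, are assumptions made for this example.

```python
def recombination_rate(reads, reference_length):
    """Per-site number of inferred switches divided by the read coverage.

    reads: iterable of (start, path) pairs; path[i] is the state (0 or 1)
    inferred at hybrid-reference position start + i.
    Sites without any mapped read are reported as None.
    """
    switches = [0] * reference_length
    coverage = [0] * reference_length
    for start, path in reads:
        for offset, state in enumerate(path):
            position = start + offset
            if not 0 <= position < reference_length:
                continue
            coverage[position] += 1
            # A switch is recorded at the first site of the new state.
            if offset > 0 and state != path[offset - 1]:
                switches[position] += 1
    return [s / c if c else None for s, c in zip(switches, coverage)]
```

Dividing by coverage keeps highly sequenced regions from appearing as recombination hotspots simply because more reads map there.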
