This repository contains a series of scripts which can be used to perform and visualise macrosynteny analysis on chromosome level genomes of metazoan species.
SyntenyFinder.py
is a python script to run on command line. Given a list of NCBI genome assembly accessions, the script downloads the associated annotated genomes, runs OrthoFinder, and automatically generates the karyotype and coordinate files necessary to run synteny analysis in RIdeogram.
For genomes without a published annotation on NCBI, the folder SyntenyFinder_customisable
contains Synteny_main.ipynb
, which is a fully customisable script which can intake a wide range of annotation formats.
plot_ideogram.R
takes the generated files and creates macrosynteny plots.
File tree for SyntenyFinder:
SyntenyFinder
├── SyntenyFinder.py
├── dependencies
│ └── Synteny_functions.py
├── plot_ideogram
│ └── plot_ideogram.R
├── sample_files
│ ├── RIdeogram_output.png
│ └── sample_figure.png
└── SyntenyFinder_customisable
├── Synteny_main.ipynb
├── dependencies
│ └── Synteny_functions.ipynb
├── example_run.ipynb
├── input_data (*)
│ ├── gene_rows
│ │ └── ...
│ ├── genomes
│ │ └── ...
│ └── proteomes
│ └── ...
└── synteny_v5.5.R
Items marked with a (*)
are not provided in this repository.
SyntenyFinder.py
is a python script which downloads genome assemblies from NCBI, runs OrthoFinder, and generates the karyotype and coordinate files necessary for macrosynteny analysis.dependencies
contains the files needed to runSyntenyFinder.py
Synteny_functions.py
contains the required helper functions.
plot_ideogram
provides the file tree necessary to visualise the macrosynteny using RIdeogram- The karyotype and coordinate output files from
SyntenyFinder
should be copied into theinput
folder. plot_ideogram.R
generates the plots and saves the resulting svg and pdf files inideograms/svg
andideograms/pdf
respectively.
- The karyotype and coordinate output files from
SyntenyFinder_customisable
contains the code and data used to generate the figures for our research article and is customisable, allowing for different sources of intake files. This folder works independently and could be omitted from, or used instead of, the rest of the repository.Synteny_main.ipynb
generates karyotype and coordinate files in a more customisable manner, allowing for a wider variety of intake files.synteny_v5.5.R
generates Oxford dot plots and RIdeogram macrosynteny plots, and additionally analysed mixing rates.
SyntenyFinder.py
relies on OrthoFinder
to identify single-copy orthologous genes. Please first make sure OrthoFinder and all necessary dependencies are correctly installed.
Additionally, SyntenyFinder.py
uses NCBI's command line tools datasets
and dataformat
to download the requested accessions and pull assembly information such as species and number of chromosomes. Please ensure these are installed and running correctly.
The majority of packages used by SyntenyFinder are part of the Python Standard Library; namely subprocess
, os
, re
, zipfile
, argparse
, concurrent.futures
, and io
. These should be available by default.
Additionally, Pandas
and Biopython
are required. These can be installed with the following commands:
pip install pandas
pip install biopython
Use the following command:
python /path/to/SyntenyFinder.py --help
To generate karyotype and coordinate files, run the following command in /SyntenyFinder
:
python SyntenyFinder.py --accessions GCF_910592395.1,GCF_902652985.1 --run_name get_synteny
This command downloads the annotated genome assemblies GCF_910592395.1 for the nemertean Lineus longissimus and GCF_902652985.1 for the scallop Pecten maximus. It runs OrthoFinder and finds single-copy orthologues, then generates karyotype and coordinate files. The folder get_synteny
is created within the working directory /SyntenyFinder
to store output, as well as intermediate run files.
python /path/to/SyntenyFinder.py \
--accessions accession1,accession2,accession3 \
--run_name run_name1 \
--algs first \
--orthofinder path/to/orthofinder \
--threads 20 \
--directory path/to/root/folder
-
Required:
-
--accessions
: Comma separated list of accessions for annotated chromosome level genomes on NCBI. -
--run_name
: Name of run iteration used to create a directory
-
-
Optional:
-
--algs
: Code or accession of species used in run to trace ancestral linkage groups. -
--directory
: Root directory for analysis folder to be created. (Default:./
) -
--orthofinder
: Path to executable OrthoFinder, if not installed. If any errors are encountered using this flag, please ensure the bin containing OrthoFinder is in the path. -
--threads
: Number of threads to use when running OrthoFinder
-
Running SyntenyFinder.py
results in the generation of the following tree of files:
directory (root directory provided)
├── ncbi_downloads
│ ├── <Sp1>
│ │ └── ncbi_dataset
│ │ └── data
│ │ └── accession1
│ ├── <Sp2>
│ │ └── ncbi_dataset
│ │ └── data
│ │ └── accession2
│ └── ...
└── run_name
├── output
│ ├── Sp1_coordinates.tsv
│ ├── Sp1_karyotype.txt
│ ├── Sp2_coordinates.tsv
│ ├── Sp2_karyotype.txt
│ ├── Sp3_coordinates.tsv
│ ├── Sp3_karyotype.txt
│ └── ...
└── run_files
├── orthofinder_output
│ └── Results_MmmDD
│ ├── ...
│ ├── Orthogroups
│ └── ...
└── run_proteomes
The files XXX_coordinates.tsv
and XXX_karyotype.txt
are the intake files necessary for running RIdeogram to generate a macrosynteny figure. Copy them to plot_ideogram/input
and generate the plots using plot_ideogram.R
.
- An ncbi_downloads folder is created in the provided root
directory
, rather than in therun_name
folder. This improves efficiency, as it prevents the same accession from being downloaded multiple times across different run folders. - Species codes are defined by the first letter of an organism's genus name followed by the first two letters of its species name. In runs where multiple species share the same code, codes are defined by the first two letters of an organism's genus name followed by the first three letters of its species name. Unexpected issues may arise if multiple species share the same species code within the same directory provided.
The script SyntenyFinder/plot_ideogram/plot_ideogram.R
takes the karyotype and coordinate files generated by SyntenyFinder and plots them using RIdeogram.
To use this script, first copy the output files from root_directory/run_name/output
to plot_ideogram/input
. Then create the directories indicated with an (*)
as shown below for the output files of plot_ideogram.R
.
SyntenyFinder
└── plot_ideogram
├── ideograms (*)
│ ├── pdf (*)
│ └── svg (*)
├── input (*)
│ └── <copy output from SyntenyFinder>
└── plot_ideogram.R
This script generates SVG and PDF files which can then be modified in illustrator to create the final figures.
Please cite our preprint if you use this pipeline: