LDWeaver accepts a sequence alignment (fasta) and its reference annotation
(genbank or gff) as inputs and identifies linkage disequilibrium (LD) between
pairs of variants (links) that is unusually high given the genomic distance
between the pair. This high LD could be the result of co-selection or epistasis.
Approximate statistical significance is used to rank outlier links and the
output is reported in tsv
format, along with several other helpful annotations and figures.
Additionally, LDWeaver has functions to assist the detection of genomic regions
that have potentially undergone co-selection or epistasis. LDWeaver tsv
output
can be directly used as input for
GWESExplorer
for dynamic link visualisation.
LDWeaver
is available on github. It can be installed with
devtools
install.packages("devtools")
devtools::install_github("Sudaraka88/LDWeaver")
Preferred Method - using mamba
If you haven't already, first, you need to install mamba. Afterwards, create a new environment and install r-ldweaver
:
conda create -n r-ldweaver
conda activate r-ldweaver
mamba install -c conda-forge -c bioconda r-ldweaver
Alternatively, you can use conda.
conda create -n r-ldweaver
conda activate r-ldweaver
conda install -c conda-forge -c bioconda r-ldweaver
To run LDWeaver, all you need is a fasta alignment (can be gz compressed) and the corresponding genbank (preferred, or gff3 also supported) annotation file with the reference sequence. You can also use a SNP-only alignment (can be generated from snp-sites or FastaR) with a numeric vector of SNP positions as per the reference genome.
A toy dataset is provided in the package itself to confirm that everything is setup correctly.
Note For GWES analysis, it is recommended to use an alignment with a large number of sequences (>500). Such alignments can be several gigabytes in size and not suitable to bundle into an R package. A proper example using real data is available below.
The toy alignment comprises the first 50kb of 400 S. pnuemoniae
isolates (randomy selected from the 616 reported
here).
The reads were aligned against the
ATCC
700669 reference genome using
snippy.
Since this is only a part of an alignment, we set
validate_ref_ann_lengths = F
, which forces LDWeaver to ignore the
mismatch in sequence lengths between the genbank reference sequence and the
alignment. Several additional options are also used, see
help(package="LDWeaver")
for more details.
# devtools::install_github("Sudaraka88/LDWeaver")
library(LDWeaver)
dset <- "sample"
aln_path <- system.file("extdata", "sample.aln.gz", package = "LDWeaver")
gbk_path <- system.file("extdata", "sample.gbk", package = "LDWeaver")
snp_filt_method = "relaxed"
LDWeaver(dset = dset, aln_path = aln_path, gbk_path = gbk_path, validate_ref_ann_lengths = F,
num_clusts_CDS = 2, SnpEff_Annotate = F, snp_filt_method = snp_filt_method)
Note If you are using a SNP-only alignment, set
aln_has_all_bases = F
and providepos
, a numeric vector of SNP positions. Each SNP in the SNP-only alignment must have a unique SNP position.
Please cite LDWeaver using: Mallawaarachchi, S. et al. Detecting co-selection through excess linkage disequilibrium in bacterial genomes, NAR genom. bioinform., 6(2),lqae061 (2024): https://doi.org/10.1093/nargab/lqae061
The following analysis demonstrates most of the options available in LDWeaver. The alignment with 616 S. pnuemoniae genomes is available here, and the same sample.gbk annotation was used to generate this alignment (also available here).
Note: Alternatively, you can download and use the SNP-only alignment with the accompanying positions file to generate the same results.
For this example, it is assumed that the current working directory is set to
~/LDWeaver_run
and the
alignment
is available in the same folder. Please note that file paths here are written for Linux/macOS operating systems,
windows users will need to modify as needed.
The following few lines of code can perform the complete LDWeaver analysis.
library(LDWeaver)
### WARNING! Possible runtime > 1h
setwd("~/LDWeaver_run")
dset <- "msch"
aln_path <- "spn23f_msch.aln.gz"
gbk_path <- system.file("extdata", "sample.gbk", package = "LDWeaver")
LDWeaver::LDWeaver(dset = dset,
aln_path = aln_path,
gbk_path = gbk_path,
save_additional_outputs = T)
LDWeaver::LDWeaver()
one-liner is versatile for most
analyses. If previously created outputs are available in <dset>, this
function will load those instead of repeating possibly time and resource heavy analysis.
It is also possible to write customised pipelines using available functions. For a full list of available functions
and options, run: help(package="LDWeaver")
.
library(LDWeaver)
dset <- "msch"
dir.create(dset) # folder to save outputs
aln_path <- "~/LDWeaver_run/spn23f_msch.aln.gz"
gbk_path <- system.file("extdata", "sample.gbk", package = "LDWeaver")
ncores = parallel::detectCores()
snp.dat = LDWeaver::parse_fasta_alignment(aln_path = aln_path) # parse the alignment and extract SNPs
gbk = LDWeaver::parse_genbank_file(gbk_path = gbk_path, g = snp.dat$g) # parse the annotation
cds_var = LDWeaver::estimate_variation_in_CDS(gbk = gbk, snp.dat = snp.dat,
ncores = ncores,
num_clusts_CDS = 3,
clust_plt_path = "msch/CDS_clustering.png")
hdw = LDWeaver::estimate_Hamming_distance_weights(snp.dat = snp.dat) # Hamming distance weights
# Perform MI computation model fitting and ARACNE - this will take some time...
sr_links = LDWeaver::perform_MI_computation(snp.dat = snp.dat, hdw = hdw,
cds_var = cds_var, ncores = ncores,
lr_save_path = "msch/lr_links.tsv",
sr_save_path = "msch/sr_links.tsv",
plt_folder = dset)
# Generate GWES Plots (short-range)
LDWeaver::make_gwes_plots(sr_links = sr_links, plt_folder = dset)
# Identify the top hits by performing snpEff annotations
tophits = LDWeaver::perform_snpEff_annotations(dset_name = dset, annotation_folder = file.path(getwd(), dset),
gbk = gbk, gbk_path = gbk_path, cds_var = cds_var,
links_df = sr_links, snp.dat = snp.dat,
tophits_path = "msch/sr_tophits.tsv")
This will generate several outputs comprising annotations into the <msch> folder, please see Performing Annotations.
# Generate tanglegram
LDWeaver::create_tanglegram(tophits = tophits, gbk = gbk, tanglegram_folder = "msch/SR_Tanglegram")
Above line will create 5 tanglegrams in html format, the first one should look like this:
# Generate GWES Explorer outputs
LDWeaver::write_output_for_gwes_explorer(snp.dat = snp.dat, tophits = tophits,
gwes_explorer_folder = "msch/SR_GWESExplorer")
Above line will create three files in <msch/SR_GWESExplorer/> that can be used as inputs for GWESExplorer. In addition, it is possible to provide a GFF3 annotation file, phenotype data and a core genome phylogeny. The circular GWESExplorer plot should look like this:
Next step is to analyse the long range links
# Analyse long range links
LDWeaver::analyse_long_range_links(dset = dset, lr_links_path = "msch/lr_links.tsv",
sr_links_path = "msch/sr_links.tsv", SnpEff_Annotate = T,
snp.dat = snp.dat, gbk_path = gbk_path, cds_var = cds_var)
By this time, the msch
folder is cluttered with a large number of outputs.
LDWeaver::LDWeaver
one-liner will automatically tidy up these files in to several folders.
This can be done using the cleanup
function.
LDWeaver::cleanup(dset)
It is possible to generate a genomewide LD distribution map using the following:
LDWeaver::genomewide_LDMap(lr_links_path = "msch/Temp/lr_links.tsv",
sr_links_path = "msch/Temp/sr_links.tsv",
plot_save_path = "msch/GWLD.png")
Note The paths have now updated after running LDWeaver::cleanup().
A network plot
for the pbp genes showcasing both the number of links between
sites and their magnitude can be generated using:
# Generate the Network Plot for pbp genes
network = LDWeaver::create_network_for_gene("pbp",
sr_annotated_path = "msch/Annotated_links/sr_links_annotated.tsv",
lr_annotated_path = "msch/Annotated_links/lr_links_annotated.tsv",
level = 2)
LDWeaver::create_network(network,
plot_title = "pbp network",
netplot_path = "msch/pbp_network.png",
plot_w = 2000, plot_h = 2000)
Note With LDWeaver >1.5, you can analyse mega scale datasets with > 2^(32-1) elements. This requires spam and spam64 packages. Set
mega_dset=T
inLDWeaver::LDWeaver()
to use this feature. Warning! This is currently considerably slower than the default mode (mega_dset=F
) and only supports single core operations. There will also be minor discrepancies between the two methods due to floating point errors, however, this should only have a minimal impact on the final link ranking.
If the above steps worked as expected, the following output will be saved to a
folder called sample
, which should be created in the current working directory.
(Working directory can be queried using: getwd()
).
- Figures
- sample/cX_fit.png - shows the distribution and modelling of the background linkage disequilibrium (estimated using weighted Mutual Information) vs. bp-separation within each cluster (X = 1,2 in the example)
- sample/CDS_clustering.png - shows the genome partitioning, based on the CDS diversity (compared to the reference sequence)
- sample/sr_gwes_clust.png - short-range GWES plot for each cluster (2 in this case)
- sample/sr_gwes_combi.png - combined short-range GWES plot (for links with bp positions spanning two clusters, the max srp_value is used)
- sample/lr_gwes.png - Long range GWES plot (similar to the output from SpydrPick)
- Outputs
- sample/sr_links.tsv - tab separated file containing details on short-range links (i.e. links <= sr_dist bp apart)
- sample/lr_links.tsv - tab separated file containing details on long-range links (i.e. links > sr_dist bp apart)
Note The default
sr_dist
value in LDWeaver is 20000bp (user modifiable). This means that by default, links from SNPs <20kb apart are considered short-range.
- Additional Outputs (not generated) - can be used to avoid costly re-computations.
- Additional_Outputs/snp_ACGTN.rds - list comprising sparse SNP data from the alignment
- Additional_Outputs/parsed_gbk.rds - GenBankRecord of the genbank annotation data
- Additional_Outputs/hdw.rds - named vector comprising Hamming distance weights for each sequence
- Additional_Outputs/cds_var.rds - list comprising alignment diversity information
Note For very large datsets, the user has the option to set
save_additional_outputs=T
. When these four files are present in <dset>/Additional_Outputs/, the saved information will be used instead of re-computing.
By default, LDWeaver performs detailed annotations on all link SNPs using
SnpEff.
This will create the following outputs in <dset>. Note that X
here
refers to sr (short range) or lr (long range).
- Outputs
- Annotated_links/X_links_annotated.tsv - tab separated file similar to sample/X_links.tsv with additional SnpEff annotations and allele distribution information
- Tophits/X_tophits.tsv - tab separated file containing the top 250
links (user modifiable with
max_tophipts
) . Several filters are applied to extract the top links from Annotated_links/X_links_annotated.tsv - SR_Tanglegram - folder compirising html tanglegrams to easily visualise links and the corresponding genomic regions
- GWESExplorer/X_GWESExplorer - folder containing the outputs necessary to dynamically explore links using GWESExplorer (X = sr,lr).
Note The default srp_cutoff is 3 (i.e., p=0.001). Short-range links with p>0.001 are automatically discarded, this can be modified using the <srp_cutoff> option. The default max_tophits value is 250, this can be modified using the <max_tophits> option.
- Temporary files created during snpEff annotations. These are all written to <dset>/Temp and can be ignored or safely deleted)
- Temp/snpEff_data - data folder for snpEff
- Temp/snpEff.config - configuration file for snpEff
- Temp/X_annotations.tsv - tab separated file containing full snpEff annotations on each site associated with a short-range GWES link with srp_max > srp_cutoff
- Temp/X_annotataed_stats.genes.txt - annotations and statistics in tab separated format
- Temp/X_annotated_stats.html - annotations and statistics in html format
- Temp/X_snps.vcf, Temp/X_snps_ann.vcf - input and output from the snpEff annotation pipeline