Quantitative and population genetics analyses using pool sequencing data (i.e. SNP data where each sample is a pool or group of individuals, a population or a single polyploid individual).
- Install rustup from https://www.rust-lang.org/tools/install.

- Download this repository:

      git clone https://github.com/jeffersonfparil/poolgen.git

- Compile and run:

      cd poolgen/
      cargo build --release
      ./target/release/poolgen -h

- Detailed documentation:

      cargo doc --open

Header line/s and comments should be prepended by '#'.
Pileup: summarised or piled-up base calls of reads aligned to a reference genome.
- Column 1: name of chromosome, scaffold or contig
- Column 2: locus position
- Column 3: reference allele
- Column 4: coverage, i.e. number of times the locus was sequenced
- Column 5: read codes, i.e. "." ("," for reverse strand) reference allele; "A/T/C/G" ("a/t/c/g" for reverse strand) alternative alleles; "\[+-][0-9]+[ACGTNacgtn]" insertions and deletions; "^[" start of read including the mapping quality score; "$" end of read; and "*" deleted or missing locus.
- Column 6: base qualities encoded as ASCII characters, where the base-call error probability is 10 ^ -((ascii value of the character - 33) / 10)
- Columns 7 to 3n + 3: coverages, read codes, and base qualities of the second to the n-th pool (3 columns per pool).
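
The base quality encoding in column 6 can be illustrated with a minimal Rust sketch. The function name `phred33_to_error_probability` is hypothetical and is not part of poolgen's API; it only applies the formula given above.

```rust
/// Convert one Phred+33 base quality character into a base-call error probability,
/// i.e. 10 ^ -((ascii value of the character - 33) / 10).
/// Returns None for characters that cannot encode a quality (non-ASCII or below '!').
fn phred33_to_error_probability(quality: char) -> Option<f64> {
    let ascii = quality as u32;
    if !quality.is_ascii() || ascii < 33 {
        return None;
    }
    let phred = (ascii - 33) as f64;
    Some(10f64.powf(-phred / 10.0))
}
```

For example, `'+'` has ASCII value 43, which gives a Phred score of 10 and an error probability of 0.1.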
Canonical variant calling or genotype data format for individual samples. This should include the AD field (allele depth); genotype calls are not required since allele frequencies from allele depth will be used. The input vcf file can be generated with bcftools mpileup like: bcftools mpileup -a AD... . The vcf2sync utility is expected to work with vcf versions 4.2 and 4.3. See VCFv4.2 and VCFv4.3 for details in the format specifications.
An extension of popoolation2's sync or synchronised pileup file format, which includes a header line prepended by '#' showing the names of each column including the names of each pool. Additional header line/s and comments prepended with '#' may be added anywhere within the file.
- Header line/s: optional header line/s including the names of the pools, e.g.
# chr pos ref pool1 pool2 pool3 pool4 pool5
- Column 1: chromosome or scaffold name
- Column 2: locus position
- Column 3: reference allele, e.g. A, T, C, G
- Column/s 4 to n: colon-delimited allele counts: A:T:C:G:DEL:N, where "DEL" refers to insertion/deletion, and "N" is unclassified. A pool, population or polyploid individual is represented by a single column of these colon-delimited allele counts.
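
The layout above can be sketched as a small parser. This is an illustration only, not poolgen's own parser: `PoolCounts` and `parse_sync_line` are hypothetical names, and including DEL and N in the frequency denominator is an assumption.

```rust
/// Allele counts of one pool at one locus, in sync order: A, T, C, G, DEL, N.
#[derive(Debug, PartialEq)]
struct PoolCounts {
    counts: [u64; 6],
}

impl PoolCounts {
    /// Parse one colon-delimited column such as "12:0:3:0:0:1".
    fn parse(column: &str) -> Option<PoolCounts> {
        let fields: Vec<&str> = column.trim().split(':').collect();
        if fields.len() != 6 {
            return None;
        }
        let mut counts = [0u64; 6];
        for (slot, field) in counts.iter_mut().zip(fields.iter()) {
            *slot = field.parse().ok()?;
        }
        Some(PoolCounts { counts })
    }

    /// Allele frequencies over all six classes (assumption); None when coverage is zero.
    fn frequencies(&self) -> Option<[f64; 6]> {
        let total: u64 = self.counts.iter().sum();
        if total == 0 {
            return None;
        }
        let mut freqs = [0.0; 6];
        for (freq, count) in freqs.iter_mut().zip(self.counts.iter()) {
            *freq = *count as f64 / total as f64;
        }
        Some(freqs)
    }
}

/// Parse one sync line into (chromosome, position, reference allele, pools).
/// Header and comment lines starting with '#' yield None.
fn parse_sync_line(line: &str) -> Option<(String, u64, String, Vec<PoolCounts>)> {
    if line.starts_with('#') {
        return None;
    }
    let mut columns = line.split_whitespace();
    let chromosome = columns.next()?.to_string();
    let position: u64 = columns.next()?.parse().ok()?;
    let reference = columns.next()?.to_string();
    let pools: Option<Vec<PoolCounts>> = columns.map(PoolCounts::parse).collect();
    Some((chromosome, position, reference, pools?))
}
```

A line such as `chr1 456 A 8:0:2:0:0:0` would yield one pool with frequencies 0.8 for A and 0.2 for C.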
- A simple delimited file, e.g. "csv" and "tsv" with a column for the individual IDs, and at least one column for the phenotypic values. Header line/s and comments should be prepended by '#'.
- GWAlpha-compatible text file (i.e. "py"):
  - Line 1: phenotype name
  - Line 2: standard deviation of the phenotype across pools or for the entire population
  - Line 3: minimum phenotype value
  - Line 4: maximum phenotype value
  - Line 5: cumulative pool size percentiles (e.g. 0.2,0.4,0.6,0.8,1.0)
  - Line 6: phenotype values corresponding to each percentile (e.g. 0.16,0.20,0.23,0.27,0.42)
Convert pileup from samtools mpileup into a synchronised pileup format. Pileup from alignments can be generated similar to below:
samtools mpileup \
-b /list/of/samtools/-/indexed/bam/or/cram/files.txt \
-l /list/of/SNPs/in/tab/-/delimited/format/or/bed/-/like.txt \
-d 100000 \
-q 30 \
-Q 30 \
-f /reference/genome.fna \
-o /output/file.pileup
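
The core of a pileup to sync conversion is tallying the read codes of column 5 into A:T:C:G:DEL:N counts. The sketch below is a simplified illustration, not poolgen's implementation: `pileup_reads_to_counts` is a hypothetical name, base quality filtering is omitted, inserted or deleted sequences after "+" or "-" are skipped rather than counted, and any unrecognised symbol is tallied as N.

```rust
/// Count alleles in one pileup read-code string (column 5), in sync order
/// A, T, C, G, DEL, N. `reference` is the reference allele from column 3.
fn pileup_reads_to_counts(reads: &str, reference: char) -> [u64; 6] {
    let mut counts = [0u64; 6];
    let chars: Vec<char> = reads.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        match c {
            // '^' marks a read start and is followed by one mapping-quality character.
            '^' => i += 1,
            // '$' marks a read end and carries no base call.
            '$' => {}
            // '+' or '-' starts an indel: skip the length digits and the indel sequence.
            '+' | '-' => {
                let mut length = 0usize;
                while i + 1 < chars.len() && chars[i + 1].is_ascii_digit() {
                    length = length * 10 + chars[i + 1].to_digit(10).unwrap() as usize;
                    i += 1;
                }
                i += length;
            }
            // '*' is a deleted or missing base at this locus.
            '*' => counts[4] += 1,
            _ => {
                // '.' and ',' stand for the reference allele on either strand.
                let base = if c == '.' || c == ',' { reference } else { c };
                match base.to_ascii_uppercase() {
                    'A' => counts[0] += 1,
                    'T' => counts[1] += 1,
                    'C' => counts[2] += 1,
                    'G' => counts[3] += 1,
                    _ => counts[5] += 1,
                }
            }
        }
        i += 1;
    }
    counts
}
```

Joining the six counts with ":" gives one pool column of the sync format.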
Convert the most widely used genotype data format, variant call format (*.vcf), into a synchronised pileup format, making use of allele depths to estimate allele frequencies and omitting genotype class information, including genotype likelihoods. This utility should be compatible with vcf versions 4.2 and 4.3.
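
As a rough illustration of estimating allele frequencies from allele depths, the sketch below reads the AD field of one sample column. The function name `allele_frequencies_from_ad` is hypothetical, and this is not the vcf2sync code itself.

```rust
/// Estimate allele frequencies for one sample from the AD (allele depth) field.
/// `format` is the VCF FORMAT column (e.g. "GT:AD:DP") and `sample` is one sample column.
/// Returns None when AD is absent, missing ("."), malformed, or sums to zero.
fn allele_frequencies_from_ad(format: &str, sample: &str) -> Option<Vec<f64>> {
    // Locate AD among the colon-delimited FORMAT keys.
    let ad_index = format.split(':').position(|key| key == "AD")?;
    let ad_field = sample.split(':').nth(ad_index)?;
    // AD is comma-delimited: reference depth first, then each alternative allele.
    let depths: Option<Vec<u64>> = ad_field.split(',').map(|d| d.parse().ok()).collect();
    let depths = depths?;
    let total: u64 = depths.iter().sum();
    if total == 0 {
        return None;
    }
    Some(depths.iter().map(|&d| d as f64 / total as f64).collect())
}
```

With FORMAT `GT:AD:DP` and a sample column `0/1:30,10:40`, the reference and alternative allele frequencies are 0.75 and 0.25; the genotype call is ignored.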
Convert synchronised pileup format into a matrix.
Perform Fisher's exact test per locus.
Perform Chi-square test per locus.
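
A minimal sketch of a per-locus chi-square statistic follows, assuming the test is applied to a pools-by-alleles table of counts at one locus (the source does not spell out the table layout). The name `chi_square_statistic` is hypothetical; the p-value step, which needs the chi-square distribution function, is left out.

```rust
/// Pearson's chi-square statistic and degrees of freedom for a table of allele counts
/// with one row per pool and one column per allele. Rows and columns that sum to zero
/// are dropped. Returns None for ragged tables or fewer than 2 informative rows/columns.
fn chi_square_statistic(table: &[Vec<u64>]) -> Option<(f64, usize)> {
    let n_cols = table.first()?.len();
    if table.iter().any(|row| row.len() != n_cols) {
        return None;
    }
    let row_sums: Vec<f64> = table
        .iter()
        .map(|row| row.iter().sum::<u64>() as f64)
        .collect();
    let col_sums: Vec<f64> = (0..n_cols)
        .map(|j| table.iter().map(|row| row[j]).sum::<u64>() as f64)
        .collect();
    let total: f64 = row_sums.iter().sum();
    let rows: Vec<usize> = (0..table.len()).filter(|&i| row_sums[i] > 0.0).collect();
    let cols: Vec<usize> = (0..n_cols).filter(|&j| col_sums[j] > 0.0).collect();
    if rows.len() < 2 || cols.len() < 2 {
        return None;
    }
    let mut statistic = 0.0;
    for &i in &rows {
        for &j in &cols {
            // Expected count under independence of pool and allele.
            let expected = row_sums[i] * col_sums[j] / total;
            statistic += (table[i][j] as f64 - expected).powi(2) / expected;
        }
    }
    Some((statistic, (rows.len() - 1) * (cols.len() - 1)))
}
```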
Calculate correlations between allele frequencies per locus and phenotype data.
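
As an illustration, the sketch below computes a Pearson correlation between the frequencies of one allele across pools and the pool phenotypes. Pearson's coefficient is an assumption here, since the source does not name the correlation type, and `pearson_correlation` is a hypothetical name.

```rust
/// Pearson correlation between allele frequencies `x` (one value per pool) and
/// phenotypes `y`. Returns None when lengths differ, fewer than two pools are given,
/// or either vector has zero variance.
fn pearson_correlation(x: &[f64], y: &[f64]) -> Option<f64> {
    let n = x.len();
    if n != y.len() || n < 2 {
        return None;
    }
    let mean_x = x.iter().sum::<f64>() / n as f64;
    let mean_y = y.iter().sum::<f64>() / n as f64;
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (a, b) in x.iter().zip(y.iter()) {
        let dx = a - mean_x;
        let dy = b - mean_y;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}
```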
Perform ordinary linear least squares regression between allele frequencies and phenotypes per locus, independently.
Perform ordinary linear least squares regression between allele frequencies and phenotypes using a kinship matrix.
Perform linear regression between allele frequencies and phenotypes using maximum likelihood estimation per locus, independently.
Perform linear regression between allele frequencies and phenotypes using maximum likelihood estimation and a kinship matrix.
Perform parametric genomewide association study using pool sequencing data, i.e. pool-GWAS. Refer to Fournier-Level, et al, 2017 for more details.
Perform ridge regression between allele frequencies and phenotypes per locus, independently.
Perform genomic prediction cross-validation using various models including ordinary least squares (OLS), ridge regression (RR), least absolute shrinkage and selection operator (LASSO), and elastic-net (glmnet).
Estimate pairwise genetic differentiation between pools using unbiased estimates of heterozygosity.
Estimates per sliding window heterozygosities within populations using the unbiased method discussed above.
Computes Watterson's estimator of $\theta$, the population-scaled mutation rate.
Computes Tajima's D per sliding (overlapping/non-overlapping) window.
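
For reference, the classical single-window formulas for Watterson's estimator and Tajima's D (Tajima, 1989) can be sketched as below. This assumes a sample of `n` sequences with `segregating_sites` counted over the same window as `pi` (the mean number of pairwise differences); pool-sequencing data generally need coverage-aware corrections, which this sketch does not attempt. All function names are hypothetical.

```rust
/// Sums a1 = sum(1/i) and a2 = sum(1/i^2) for i = 1..n-1.
fn harmonic_sums(n: usize) -> (f64, f64) {
    let mut a1 = 0.0;
    let mut a2 = 0.0;
    for i in 1..n {
        let i = i as f64;
        a1 += 1.0 / i;
        a2 += 1.0 / (i * i);
    }
    (a1, a2)
}

/// Watterson's estimator for a window: number of segregating sites divided by a1.
fn watterson_theta(segregating_sites: usize, n: usize) -> Option<f64> {
    if n < 2 {
        return None;
    }
    let (a1, _) = harmonic_sums(n);
    Some(segregating_sites as f64 / a1)
}

/// Tajima's D for a window, given `pi`, the mean number of pairwise differences.
/// Returns None when n < 4 or no site segregates, where the variance term is degenerate.
fn tajima_d(pi: f64, segregating_sites: usize, n: usize) -> Option<f64> {
    if n < 4 || segregating_sites == 0 {
        return None;
    }
    let nf = n as f64;
    let s = segregating_sites as f64;
    let (a1, a2) = harmonic_sums(n);
    let b1 = (nf + 1.0) / (3.0 * (nf - 1.0));
    let b2 = 2.0 * (nf * nf + nf + 3.0) / (9.0 * nf * (nf - 1.0));
    let c1 = b1 - 1.0 / a1;
    let c2 = b2 - (nf + 2.0) / (a1 * nf) + a2 / (a1 * a1);
    let e1 = c1 / a1;
    let e2 = c2 / (a1 * a1 + a2);
    let variance = e1 * s + e2 * s * (s - 1.0);
    if variance <= 0.0 {
        return None;
    }
    Some((pi - s / a1) / variance.sqrt())
}
```

Applying these functions per sliding window, overlapping or not, would give one estimate per window.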
Genomewide unbiased determination of the modes of convergent evolution. Per population, significant troughs (selective sweeps) and peaks (balancing selection) are detected and the widths of which are measured. Per population pair, significant deviations from mean genomewide Fst within the identified significant Tajima's D peaks and troughs are also identified. Narrow Tajima's D troughs/peaks imply de novo mutation as the source of the genomic segment under selection, while wider ones imply standing genetic variation as the source. At the loci under selection, pairwise Fst which are significantly lower than genomewide Fst imply migration of causal variants between populations, significantly higher implies independent evolution within each population, and non-significantly deviated pairwise Fst implies a shared source of the variants under selection.
The simplest regression model implemented is the ordinary least squares (OLS), where the allele effects are estimated as:
- if $n \ge p$, then $\hat{\beta} = (X^{T}X)^{-1} X^{T} y$
- if $n < p$, then $\hat{\beta} = X^{T} (XX^{T})^{-1} y$

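where $n$ is the number of observations (rows of $X$), $p$ is the number of predictors (columns of $X$), $y$ is the vector of phenotypes, and f64::EPSILON is Rust's 64-bit floating-point machine epsilon.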
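
To make the $n \ge p$ case concrete, the sketch below solves $(X^{T}X)^{-1} X^{T} y$ in closed form for the smallest useful design: an intercept plus the frequency of a single allele. The name `ols_intercept_slope` is hypothetical, and using f64::EPSILON as the cut-off for a degenerate predictor is an assumption made for this example, not a statement about poolgen's internals.

```rust
/// Ordinary least squares for y = intercept + slope * x, where `x` holds the frequency
/// of one allele across pools and `y` the pool phenotypes. This is the closed form of
/// (X'X)^-1 X'y when X has a column of ones and one predictor column.
/// Returns None when lengths differ, n < 2, or x is (numerically) constant.
fn ols_intercept_slope(x: &[f64], y: &[f64]) -> Option<(f64, f64)> {
    let n = x.len();
    if n != y.len() || n < 2 {
        return None;
    }
    let mean_x = x.iter().sum::<f64>() / n as f64;
    let mean_y = y.iter().sum::<f64>() / n as f64;
    let mut sxx = 0.0;
    let mut sxy = 0.0;
    for (a, b) in x.iter().zip(y.iter()) {
        sxx += (a - mean_x) * (a - mean_x);
        sxy += (a - mean_x) * (b - mean_y);
    }
    // Assumed cut-off: treat X'X as singular when the predictor has no variance.
    if sxx <= f64::EPSILON {
        return None;
    }
    let slope = sxy / sxx;
    let intercept = mean_y - slope * mean_x;
    Some((intercept, slope))
}
```

With more predictors than observations ($n < p$), $X^{T}X$ is singular, which is why the second formula inverts the $n \times n$ matrix $XX^{T}$ instead.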
GWAlpha: genomewide estimation of additive effects based on quantile distributions from pool-sequencing experiments
GWAlpha (Fournier-Level, et al, 2016) iteratively estimates for each locus the effect of each allele on the phenotypic ranking of each pool. This allele effect is defined as $\hat{\alpha} = W \frac{\hat{\mu}_{0} - \hat{\mu}_{1}}{\sigma_{y}}$, where:

- $a$ is the allele in question, and $q$ is the frequency of $a$,
- $W = 2 \sqrt{q(1-q)}$ is the penalisation for low allele frequency,
- $\hat{\mu}_{0}$ is the mean of the beta distribution representing $a$ across pools,
- $\hat{\mu}_{1}$ is the mean of the beta distribution representing $1-a$ (i.e. additive inverse of $a$) across pools,
- $\sigma_{y}$ is the standard deviation of the phenotype,
- $\beta(\Theta=\{\theta_{1}, \theta_{2}\})$ is used to model the distributions of $a$ and $1-a$ across pools, where:
  - $\Theta$ is estimated via maximum likelihood, i.e. $L(\Theta \mid Q) \sim \prod^{k}_{i=1} \beta_{pdf}(q_{i} \mid \Theta)$ for the $i^{th}$ pool,
  - $Q = \{q_{1},...,q_{k}\}$ is the cumulative sum of allele frequencies across increasing-phenotypic-value-sorted pools where $k$ is the number of pools,
  - $\beta_{pdf}(q_{i} \mid \Theta)$ is the probability density function for the $\beta$ distribution,
  - $q_{i} = \beta_{cdf}(y^{\prime}_{i},\Theta) - \beta_{cdf}(y^{\prime}_{i-1},\Theta)$, where $y^{\prime}_{i} \in Y^{\prime}$, and
  - $Y^{\prime}$ is the inverse quantile-normalised phenotype data such that $Y^{\prime} \in [0,1]$.

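
Once the two beta distributions have been fitted, the allele effect itself is a short calculation. The sketch below assumes the shape parameters are already estimated (the maximum likelihood step is not shown) and uses the standard mean of a beta distribution, $\theta_{1} / (\theta_{1} + \theta_{2})$. The names `beta_mean` and `gwalpha_alpha` are hypothetical.

```rust
/// Mean of a Beta(theta_1, theta_2) distribution.
fn beta_mean(theta_1: f64, theta_2: f64) -> f64 {
    theta_1 / (theta_1 + theta_2)
}

/// Allele effect alpha = W * (mu_0 - mu_1) / sigma_y, with W = 2 * sqrt(q * (1 - q)).
/// `theta_a` and `theta_not_a` are the fitted (theta_1, theta_2) pairs of the beta
/// distributions for allele `a` and for `1 - a`; fitting them is outside this sketch.
fn gwalpha_alpha(
    q: f64,
    theta_a: (f64, f64),
    theta_not_a: (f64, f64),
    sigma_y: f64,
) -> Option<f64> {
    let shapes_ok = [theta_a.0, theta_a.1, theta_not_a.0, theta_not_a.1]
        .iter()
        .all(|&t| t > 0.0);
    if !(0.0..=1.0).contains(&q) || sigma_y <= 0.0 || !shapes_ok {
        return None;
    }
    // Penalise alleles at low (or very high) frequency.
    let w = 2.0 * (q * (1.0 - q)).sqrt();
    let mu_0 = beta_mean(theta_a.0, theta_a.1);
    let mu_1 = beta_mean(theta_not_a.0, theta_not_a.1);
    Some(w * (mu_0 - mu_1) / sigma_y)
}
```

The penalty $W$ peaks at 1 when $q = 0.5$ and shrinks to 0 as the allele approaches fixation or loss, so rare alleles contribute smaller effects.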
I have attempted to create a penalisation algorithm similar to the elastic-net implemented in the glmnet package (Friedman et al, 2010).
- Fst and Tajima's D estimation
- Imputation
- Canonical variant call format (vcf) file to sync file conversion
- Simulation of genotype and phenotype data
- Improve genomic prediction cross-validation's use of PredictionPerformance fields, i.e. output the predictions and predictor distributions
- Additional imputation algorithm. See details below:
Performs OLS or elastic-net regression to predict missing allele counts per window for each pool with at least one locus with missing data. This imputation method requires at least one pool without missing data across the window. It follows that to maximise the number of loci we can impute, we need to impose a maximum window size equal to the length of the sequencing read used to generate the data, e.g. 100 bp to 150 bp for Illumina reads.
For each pool with missing data we model the allele frequencies in the locus with some missing data as: $$ y_{p} = X_{p}\beta + \epsilon. $$
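We then obtain the estimates $\hat{\beta}$ (via OLS or elastic-net regression, as noted above), and impute the missing allele counts as: $$ \hat{y_{m}} = X_{m}\hat{\beta}, $$
where: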
- $y_{p}$ is the vector of allele counts of one of the pools with missing data at the loci without missing data (length: mₚ non-missing loci × 7 alleles);
- $X_{p}$ is the matrix of allele counts of pools without missing data at the loci without missing data in the other pools (dimensions: mₚ non-missing loci × 7 alleles, nₚ pools without missing loci);
- $\hat{\beta}$ is the vector of estimates of the effects of each pool without missing data on the allele counts of one of the pools with missing data (length: nₚ pools without missing loci);
- $\hat{y_{m}}$ is the vector of imputed allele counts of one of the pools with missing data (length: mₘ missing loci × 7 alleles); and
- $X_{m}$ is the matrix of allele counts of pools without missing data at the loci with missing data in the other pools (dimensions: mₘ missing loci × 7 alleles, nₚ pools without missing loci).

Finally, the imputed allele counts are averaged across the windows sliding one locus at a time.
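
The two steps of this imputation can be sketched for the simplest case of a single pool without missing data ($n_{p} = 1$), where the OLS estimate of the model above reduces to $\hat{\beta} = x_{p}^{T} y_{p} / x_{p}^{T} x_{p}$. This is an illustration under that assumption, with hypothetical names `fit_single_pool` and `impute_counts`; it covers neither the elastic-net variant nor the averaging across sliding windows.

```rust
/// OLS estimate of beta for y_p = x_p * beta + e with one complete pool (no intercept).
/// `x_p` holds the complete pool's allele counts and `y_p` the target pool's allele
/// counts, both at the loci without missing data within the window.
fn fit_single_pool(x_p: &[f64], y_p: &[f64]) -> Option<f64> {
    if x_p.len() != y_p.len() || x_p.is_empty() {
        return None;
    }
    let xty: f64 = x_p.iter().zip(y_p.iter()).map(|(x, y)| x * y).sum();
    let xtx: f64 = x_p.iter().map(|x| x * x).sum();
    if xtx == 0.0 {
        return None;
    }
    Some(xty / xtx)
}

/// Imputed allele counts y_m = x_m * beta, where `x_m` holds the complete pool's allele
/// counts at the loci that are missing in the target pool.
fn impute_counts(x_m: &[f64], beta: f64) -> Vec<f64> {
    x_m.iter().map(|x| x * beta).collect()
}
```

For instance, if the target pool consistently shows half the counts of the complete pool at the observed loci, then $\hat{\beta} = 0.5$ and each missing count is imputed as half of the complete pool's count.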