Estimation of the significance of an overlap between two whole genome sets of regions with respect to a third one, by Monte Carlo simulation and normal distribution modeling
AreaSonic2 program estimates the significance (P-value) of a total overlap between two sets of while-genome regions (tracks) with respect to a third track. All three tracks conventianally are described in BED format. As an option for RNA-seq data analysis where three tracks are peaks and promoters of differentially expressed genes (DEGs) and the rest genes, AreaSonic2 takes as input data a BED files for peaks and promoters of all genomes (with an obligatory column of gene ID) and a list IDs for DEGs. The one of the pillars of probability theory, Central limit theorem proves that the overlap between two tracks should possess the normal distribution due to (a) large number of iterations in Monte Carlo simulation and (b) large counts of regions in both tracks. AreaSonic2 fits a normal model of the total overlap length, and applies Monte Carlo simulation to estimate the distribution of expected overlap length. AreaSonic2 takes into account and perfectly models both the distributions of lengths of tracks regions and those of spacers lengths between regions. AreaSonic2 assigns to the first track the label 'fixed', while the second and third ones are marked as 'permuted'. Next, AreaSonic2 combines the first and the second tracks, and the first and the third ones, applying the approach of AreaSonic, see Scheme 1 below. Below we consider the pair first/second. At first step, the observed value 'Real' of total overlap length between two tracks is computed (see Step #1 in 1st scheme below). Now, the first iteration is started: regions and spacers between them for the permuted track are indexed, this gives two arrays {Regions} and {Spacers} (Step #2); indices of each array are swapped (Step #3), and chimeric sequence of regions and spacers is assembled (Step #4); thus, an estimate of the overlap length between the fixed and permuted track is computed at the end of the first itatation (Step #5). Next, hundreeds or thousand iterations are performed (Step #6). These iterations estimate the expected overlap length 'Av' and its standard deviation 'SD' (Step #7). These values are used to compute Z-score estimate as follows: Z-score = (Real - Av) / SD. Positive/negative Z-score imply the enrichment/depletion in the total overlap length between two tracks. Finally, for extremly large Z-scores (Z-score > 14) the asymptotic expansion of the complementary error function is used to compute P-value, otherwise it computed directly as the integral for the stadard normal distribution (Step #8).
Scheme 1 of AreaSonic2 algorithm, permutation algorithm for a pair of tracks
Scheme 2 of AreaSonic2 algorithm, combination of two tracks with respect to another track
Thus, for two pairs of tracks, first/second and first/third, two overlap lengths, two expected overlap lengths, two standard deviations, two Z-scores and two p-values are computed (Real12 and Real13, Av12 and Av13, σ12 and σ13, Z-score12 and Z-score13, p-value12 and p-value13), and three total lengths of tracks L1, L2 and L3. We should check the hypothesis that ratios between (a) the deviations of the overlap length Real12 and Real13 and from their expectations Av12 and Av13, and (b) the total length of tracks L2 and L3, are equal, see Scheme 2 below. Therefore, we check
Z-score = {(Real12 - Av12) / L2 - (Real13 - Av13) / L3} / SQRT{(σ12 * σ12) / (L2 * L2) + (σ13 * σ13) / (L3 * L3)}.
This value allow to compute the final significance of the overlap of the second and third track woth respect to the first track, p-value. It is computed through the calculation of integral or for the large Z-scores with the asypmtoptic approximation, as described above.
Scheme 2 of AreaSonic2 algorithm, combination of three tracks: test of the significance of two track with respect to another track
The Areasonic2 program is based on the AreaSonic algorithm that was successively applied earlier in Khoroshko et al. (2016) and Boldyreva et al. (2017). Another motivation for AreaSonic2 tool came from the CisCross tool, which aimed to reveal the enriched profile of DAP-seq/ChIP-seq peaks significantly enriched in promoters of DEGs compared to the promoters of the rest genes.
AreaSonic2 source code is written in C++ language. To compile exetubables from the source code you need:
- In Linux system, C++ compiler, e.g. GCC compiler
- In Windiws system any VC++ package, e.g. Microsoft Visual Studio Community
Input data include two tracks in BED format. First three columns in a track file are critically important, they represent a chromosome name and left/right positions of genomic regions. Note that regions in a BED file are presumed to be sorted in the ascending order of positions, the overlaps between neighbor regions are forbidden. Two input text files for the list of chromosome names and their lengths are required too, see their ready examples for D. melanogaster (dm5), A. thaliana (at10), M. musculus (mm10) and H. sapiens (hg38) genomes in src folder, e.g. chr_name_dm.txt and chr_length_dm5.txt
Folder src contains files with AreaSonic2 source codes, they respect to decribed below separate modules of pipeline.
- In Linux system:
git clone https://github.com/parthian-sterlet/areasonic2 cd areasonic2/src chmod a+x build.sh ./build.sh
- In Windows system:
separate compilation of all source files in VC++
Modules Partition of a genome track by separate chromosomes and Permutations should run consequently if input files are given as whole-genome track, if they are already partitioned by chromosomes, the second module Permutations can be used alone
Lists of command line arguments for all modules are described below
This is a parsing of a bed formatted file according to values in the first column designating chromosome names. For subsequent separate files for all chromosomes are required tabnslolbik_wordget.cpp
Command line arguments:
- input file, text table
- output file, processed text table
- int number of criteria
- int list of columns numbers, comma separated
- char list of words respecting columns
- int match type: 1 exact coincidence (word in a column exactly equal to given input word), 0 only occurrence (substring in a string is enough)
- int contain type: 0/1 mean searches of lines containing / do not containing words
- int logic: 1 means stringent requirement for criteria in all columns, 0 means mild requirement for only one of criteria in at least one of columns
The first mode of the main part of algorithm performing Monte Carlo simulation, three tracks in BED format area_shuffling_pair_bed.cpp
Command line arguments:
- input file in BED format, permuted track
- input file, the list of chromosome lengths
- input file, list of chromosome names
- input file in BED format, fixed track, test
- input file in BED format, fixed track, control
- integer number of iterations, minimal 100, at least 500 is required for stable results
- output file, statistical estimates for the overlap length between two tracks
- output file, distribution of expected overlap length
The second mode of the main part of algorithm performing Monte Carlo simulation, one track in BED format, promoters of whole genome promoters with gene IDs, list of gene IDs for DEGs area_shuffling_pair_bed.cpp
Command line arguments:
- input file in BED format, permuted track
- input file, the list of chromosome lengths
- input file, list of chromosome names
- input file in BED format, fixed track, promoters with gene IDs, BED format
- input file in BED format, fixed track, list of gene IDs for DEGs
- integer number of iterations, minimal 100, at least 500 is required for stable results
- output file, statistical estimates for the overlap length between two tracks
- output file, distribution of expected overlap length
Files 1.bat and 2.bat presnt example command lines for two modes of input data of AreaSonic2 program (three files in BED format, and one file in BED format, promoters, and list of gene IDs forDEGs), Bat files run in Windows OS. Perl script files test_bed.pl and test_nobed.pl show the example runs of AreaSonic2 for two types of input data (1) three tracks in BED format, and (2) a track in BED format, another track in BED format for all genome promoters with gene IDs, and the list of gene ID for DEGs. Example presents two runs of programs alternatively using first and second tracks as the permuted one. The output file from 8-th command line argument asb.dist shows the expected distribution for the total overlap length the first and second tracks, and between the first and third tracks.
The output file from 7-th command line argument asb.txt lists calculation results as follows (version with three tracks in BED format).
Label | Value | Meaning |
---|---|---|
AreaVar | EIN3_Chang 1230 275.513 | Permuted track (first): file name, total number of regions, total length of regions in kbp (1000 bp) |
AreaConst1 | DHS_Ath_seedling_normal 36064 5465.344 | Fixed track (second): file name, total number of regions, total length of regions in kbp |
AreaConst2 | DHS_Ath_root_normal 62804 9558.864 | Fixed track (third): file name, total number of regions, total length of regions in kbp |
Ncyc | 5000 5000 | Number of iterations between the first and second tracks, and between the first and third tracks |
Real | 98.463 85.477 | Real12, Real13 Observed overlap length between the first and second, the first and third tracks, in kbp |
Av | 12.592590 22.001664 | Av12, Av13 Average expected overlap length between the first and second tracks, the first and third tracks, in kbp |
SD | 1.221718 1.559410 | Standard deviation of expected overlap length, the first and second tracks, the first and third tracks, in kbp |
Z | 70.286589 40.704702 | Z12, Z13 Z-scores of expected overlap length, between the first and second tracks, and the first and third tracks, positive/negative Z-score implies the enrichment/depletion in the total overlap length |
-Log10[Pv] | 1074.7 361.494 | P-value in logarithmic scale, estimation of the significance for the total overlap length between first and second tracks, and between the first and third tracks |
Av | 0.015712 0.006640 | Ratios between (a) the deviations of the overlap length Real12 and Real13 and from their expectations Av12 and Av13, and (b) the total length of tracks L2 and L3, are equal |
SD | 0.000277 | Standard deviation of expected overlap length, in kbp (1000 bp) |
Z | 32.779568 | Z-score of expected overlap length, Z-score12 = (Real12 - Av12) / SD12, positive/negative Z-score implies the enrichment/depletion in the total overlap length between first and third tracks |
-Log10[Pv] | 234.94 | P-value in logarithmic scale, estimation of the significance for the total overlap length between second and third tracks, with respect to first track |
The output file from 7-th command line argument as.txt lists calculation results as follows (one track in BED format, promoters of whole genome promoters with gene IDs, list of gene IDs for DEGs).
Label | Value | Meaning |
---|---|---|
AreaVar | EIN3_Chang 1230 275.513 | Permuted track (first, BED format): file name, total number of regions, total length of regions in kbp |
AreaConst1 | ein3_deg 2139 3210.639 | Fixed track (second, list of IDs for DEGs): file name, total number of regions, total length of regions in kbp |
AreaConst2 | ups1500_protcod_tair10 25063 37619.563 | Fixed track (third, all promoters with gene IDs in BED format, DEGs and non-DEGs): file name, total number of regions, total length of regions in kbp |
Ncyc | 5000 5000 | Number of iterations between the first and second tracks, and between the first and third tracks |
Real | 36.263 145.003 | Real12, Real13 Observed overlap length between the first and second tracks, the first and third tracks, in kbp (1000 bp) |
Av | 7.467108 86.555878 | Av12, Av13 Average expected overlap length between the first and second tracks, the first and third tracks, in kbp (1000 bp) |
SD | 1.459687 4.783833 | Standard deviation of expected overlap length, the first and second tracks, the first and third tracks, in kbp (1000 bp) |
Z | 19.727438 12.217633 | Z12, Z13 Z-scores of expected overlap length, between the first and second tracks, and the first and third tracks, positive/negative Z-score implies the enrichment/depletion in the total overlap length |
-Log10[Pv] | 5.9051 36.813 | P-value in logarithmic scale, estimation of the significance for the total overlap length between first and second tracks, and between the first and third tracks |
Av | 0.008969 0.001554 | Ratios between (a) the deviations of the overlap length Real12 and Real13 and from their expectations Av12 and Av13, and (b) the total length of tracks L2 and L3 |
SD | 0.000472 | Standard deviation of the differnce between two ratios |
Z | 15.707311 | Z-score, positive/negative Z-score implies the enrichment/depletion for the total overlap length between DEGs and non-DEGs, with respect to first track |
-Log10[Pv] | 54.8754 | P-value in logarithmic scale, estimation of the significance for the total overlap length between DEGs and non-DEGs, with respect to first track |