Skip to content

Releases: jsh58/Genrich

Version 0.6.1

17 Mar 12:13
Compare
Choose a tag to compare

Two bugs fixed:

  • analyzing singleton alignments with no alignment scores (#30)
  • removing PCR duplicates when one of multiple samples has no unpaired alignments (#72)

Version 0.6

04 Aug 14:45
Compare
Choose a tag to compare
  • New default peak-calling parameters: -p 0.01 -a 200.0. This also means that, by default, q-values are not calculated.
  • In ATAC-seq mode, the ends of fragments are adjusted by +5/-5 to account for Tn5 transposase occupancy, when determining the centers of cut site intervals. This adjustment can be avoided with -D.
  • The number of warning messages of reads extending off the ends of chromosomes has been limited to 128 (per input file).
  • A convenience option (-S) was added, which prevents Genrich from checking the sort order of the input SAM/BAM file. The input file is still parsed under the assumption that it is sorted by queryname, so this option should be used judiciously.

Version 0.5

14 Jan 15:55
Compare
Choose a tag to compare
  • The null statistical model was changed to the log-normal distribution. This was determined to be a more appropriate choice than other distributions (including the previously used exponential) from examining ChIP-seq control samples.

  • The minimum length (-l) is now a regular peak-calling parameter, not an alternative peak-calling method. The default value is 0bp. Those who prefer peak-calling by minimum length and not AUC can run -a 0 -l <int>.

  • Two convenience options were added:

    • -X: option to skip peak-calling, but perform all alignment parsing (including identifying PCR duplicates)
    • -P: option to call peaks directly from a bedgraph-ish log file (-f) produced by a previous Genrich run

Version 0.4

17 Dec 15:01
Compare
Choose a tag to compare
  • Option to remove PCR duplicates (-r). The procedure is explained here. Incorporates ideas from findDups9.py, in particular the process of evaluating multimapping reads/fragments.

    • A log file listing duplicates can be produced via -R <file>.
  • A reasonably complete README was added to the homepage.

  • Input SAM/BAM files are now required to be queryname-sorted (samtools sort -n).

  • A bug in alignment parsing (paired secondary alignments with matching positions) was fixed.

Version 0.3

04 Oct 14:23
Compare
Choose a tag to compare
  • New default peak-calling method: area under the curve (AUC). For a peak to be called, the total significance of the region must exceed a minimum value (-a <float>, default 20.0).

    • The total significance is calculated as the sum of the -log(q) values above the -q threshold over the length of the region (i.e. the area under the -log(q) "curve"). If a -p threshold is specified, the area under the -log(p) curve is calculated.
    • The maximum gap parameter (-g <int>) still allows multiple regions to be linked.
    • No minimum length is required for a peak to be called.
    • Can be overridden by specifying -l <int>, in which case peak-calling reverts to the previous method, with the given minimum length for peaks.
  • Option to provide a BED file of genomic regions to exclude from analysis (-E <file>).

    • The regions will affect peak calls, such that no peak may extend into or around an excluded region.
    • The regions' lengths will be subtracted from the genome length calculated by the program.
    • In the output log files, excluded regions will have treatment/control pileup values of 0.0 and p-/q-values of NA.
    • Multiple BED files can be specified, comma-separated (or space-separated, in quotes).
  • Accessory script findNs.py to produce a BED file of 'N' homopolymers from a fasta file (e.g. a reference genome). The output can (and should) be given to Genrich via -E (above).

  • Option to keep unpaired alignments, with lengths changed to a given value, has been changed to -w <int> (formerly -a <int>).

Version 0.2

24 Sep 13:56
Compare
Choose a tag to compare
  • Multiple replicates are now analyzed separately, with p-values calculated for each. At each position, the multiple p-values are then combined by Fisher's method, before conversion to q-values.

  • Three optional output log files:

    • -f <file>: with one replicate, it lists treatment/control pileups, p- and q-values, and significance for each interval; with multiple replicates, it lists p-values of each replicate, combined p-value, q-value, and significance for each interval.
    • -k <file>: for each replicate, sequentially, it lists a header line (# treatment file: <name>; control file: <name>), followed by treatment/control pileups and a p-value for each interval. This is the way to examine pileup values with multiple replicates, since the -f file will not supply them.
    • -b <file>: an unsorted BED file of the reads/fragments analyzed. The 4th column gives the read name, number of alignments, 'T'reatment or 'C'ontrol, and sample number (0-based), e.g. SRR5427885.57_2_T_0.
  • Option to analyze reads in ATAC-seq mode (-j). Instead of analyzing full fragments, the program uses "cut site intervals" centered on the ends of fragments. The interval lengths are determined by the -d parameter (def. 100bp).

  • Control sample pileups are now scaled to match the treatment, based on fragment/interval lengths.

  • Fixed bug related to sorting alignments by scores.

  • Fixed bugs related to reference sequences (chromosomes) missing from a SAM/BAM file. Reads aligning to chromosomes either on the "skipped" list (-e argument) or not appearing in the header of the treatment SAM/BAM file will not be analyzed.

Version 0.1

31 Aug 19:54
Compare
Choose a tag to compare
  • Analyzes secondary alignments:
    • Keeps alignments whose alignment scores (AS) are within a specified value (-s <float>) of the best alignment. The default value of -s 0 means that only alignments judged as equivalent by the aligner will be kept.
    • Each of the n alignments for a read/fragment is counted as 1/n for the pileup.
    • To avoid excessive memory usage and the imprecision inherent in floating-point values, a maximum of 10 alignments per read is analyzed. Reads with more than 10 alignments will be subsampled based on the best alignment scores; in the case of ties, alignments appearing first in the SAM/BAM are favored.
  • Input SAM/BAM files are required to be name sorted (samtools sort -n)
  • A simple hashtable is implemented for efficiently compiling p-values to calculate q-values. The q-value calculation is still based on the Benjamini-Hochberg procedure.

Version 0.0

13 Aug 14:08
Compare
Choose a tag to compare

Peak-calling for genomic enrichment assays of treatment sample(s) (-t [<file>]+), with or without control sample(s) (-c [<file>]+). Input files must be in SAM/BAM format. Incorporates ideas from SAMtoBED, removeChrom, and MACS2 v2.1.2_dev.

  • Control of analysis of alignments:
    • Properly paired alignments only (default); fragments are inferred appropriately
    • Also keeping unpaired alignments:
      • As they appear in the SAM/BAM file (-y)
      • Length increased to specified value (-a <int>)
      • Length increased to average value of paired alignments (-x)
  • Filtering options:
    • Chromosomes (reference sequences) to ignore (-e <arg>)
    • Minimum mapping quality (-m <int>)
  • Peak-calling options:
    • Maximum q-value (-q <float>, def. 0.05)
    • Maximum p-value (-p <float>)
    • Minimum length of a peak (-l <int>, def. 100bp)
    • Maximum distance between significant sites (-g <int>, def. 100bp)
  • Output options:
    • Output peak file, in ENCODE narrowPeak format (-o <file)
    • Output bedgraph file listing treatment/control pileups and p/q values (-b <file>)