Skip to content
gpratt edited this page Apr 5, 2013 · 15 revisions

Welcome to the clipper wiki!

CLIPper is a tool to define peaks in your CLIP-seq dataset. CLIPper was developed in the Yeo Lab at the University of California, San Diego.

For up to date information on clipper please join our announcement and discussion google groups

https://groups.google.com/forum/?fromgroups#!forum/clipper-discuss https://groups.google.com/forum/?fromgroups#!forum/clipper-announce

Installing CLIPper is (should be) easy! Using git, you can clone the most recent version of this repository to a writable location on your system.

git clone git://github.com/YeoLab/clipper.git
cd clipper
python setup.py install
#check whether clipper is actually installed
cd ..
clipper -h

The last step should produce a help message describing CLIPper and its parameters.

Before you use CLIPper, you'll have to generate mapped reads. This can be accomplished many ways, but we recommend using an aligner which is tolerant of spliced reads, in the future we will include options to detect mRNA-bound vs pre-mRNA-bound transcripts. We prefer to use GSNAP for this, but you may use other algorithms; however, note that all testing is done with GSNAP-derived mapped reads.

An example mapping procedure using gsnap could be this:

initial pre-processing: raw_reads ---> <Quality control> ---> <Trim sequencing adaptor sequences> ---> CLIP-seq_reads.gz

gsnap -t 4 -N 1 -A sam --gunzip -B 5 -s mm9 -d mm9 CLIP-seq_reads.gz > CLIP-seq_reads.sam
samtools view -bS CLIP-seq_reads.sam > CLIP-seq_reads.bam
samtools sort CLIP-seq_reads.bam CLIP-seq_reads.srt
samtools flagstat CLIP-seq_reads.srt.bam > CLIP-seq_reads.mapStats.txt

Do not use samtools rmdup. Bad things happen.

Then index your bam alignment:

samtools index CLIP-seq_reads.srt.bam

In its simplest incarnation, CLIPper can be called like this:

clipper -b CLIP-seq_reads.srt.bam -s hg19

Where -s is the species. The Yeo lab uses a special gene definition file internally (details available upon request). However in the case you are not using a model organism (currently only hg19 and mm9) one can use a bed12 file to specify gene definitions. To use this instead of the custom gene definition file use the argument:

--bedFile

Where the bed file is in a bed12 format, and contains all exon and intron information. (can be downloaded from the UCSC genome browser by downloading any of the gene and gene prediction tracks in the bed format.

However, like all great software packages, it can be more complicated than that!

CLIPper is not a ChIP-seq peak-finder. As opposed to ChIP-seq peak-finding methods, there a few caveats that one must take into consideration when defining peaks. First, since we are not dealing with DNA, we're dealing with RNA, we must significance thresholds must be defined on a gene-by-gene basis. That is to say that we can't use a genome-wide cutoff for significance as is done for ChIP-seq because of differences in gene expression of targets. To account for this effect, we define all of our peaks' significance based on the number of reads within a peak, relative to the number of other reads on a gene and the length of that gene. To further complicate matters, we're not certain whether the CLIP-seq reads are from a pre-mRNA or a mature mRNA. Some RNA-binding proteins (RBPs) bind before an RNA is processed, others after, and the likelihood that a peak is significant changes with respect to the biology of that RBP. We have provided a pre-complied list of "effective" gene-lengths for human and mouse mRNA and pre-mRNAs from ensembl, subtracting out the length of repetitive genomic elements in the span of the gene. If you suspect your RBP is binding pre-mRNAs, please specify

--premRNA

as an option when you run CLIPper and it will use pre-mRNA lengths, instead of the default mRNA lengths.

CLIPper combines features from many CLIP peak-finding algorithms. To reduce false-positives, we employ a three-pass filter on our peaks. First for each gene we calculate the false-discovery rate threshold (FDR), which is the "height" of reads mapped at a single genomic position that is likely to be noise, determined by randomly scattering the same number of faux reads as real reads across a faux transcript that is the same effective length as the real transcript. Details of this method can be found here. If you would like to manipulate the FDR, you may do so with this parameter:

--FDR <float>

or, alternatively, you can elect to skip the FDR calculation entirely and manually set a height threshold using:

--threshold <(int ≥ 0)>

Next, we remove peaks which have fewer reads than would be significant under a poisson distribution, using a p-value cutoff set by the parameter:

--poisson-cutoff < (float ≤ 1) >

An example of peak-finding using only poisson from our lab is found here. If you'd like to ignore the poisson cutoff, set it to 1. We use this same parameter to determine whether each peak is significant relative to all other reads on and the length of the entire transcriptome. You may disable the transcriptome-wide cutoff with:

--disable_global_cutoff

To improve the accuracy of the peak finder, we use cubic spline fitting to approximate the "shape" of a peak. We define the bounds of a peak as the points on the fitted curve which fall below FDR or are local minima above the FDR. This achieves a greater resolution and allows us to disambiguate multiple binding sites which are close to one another, as opposed to clumping them together. In order to efficiently perform the curve-fitting operation, we sub-divide each gene into "sections" which are regions which have ≥ 1 read and within a certain margin apart from one another. You may manipulate the margin using the

--margin <int>

parameter. Setting this value too low will result in multiple fits and slow CLIPper down dramatically, setting this value too high will under-fit the data and result in giant peaks. Testing multiple values for --margin shows that 15 is a reasonable number to set for this parameter, but results will vary and future implementations of CLIPper may migrate towards a more intelligent way of defining this parameter internally.

Another feature which is semi-experimental, is the

--superlocal

option, which re-defines the poisson cutoff and FDR in each "section" (within <--margin> nucleotides) defined above. In this implementation, we reasoned that some significant peaks are over-shadowed and therefore their significance is discounted when there are many reads aligned in a far-away part of a gene. This method uses a window of 1 kB in each direction of a putative peak to define FDR and poisson significance and serves to pick up peaks which would be otherwise missed with genome-wide or gene-wide thresholds. Approach this option with caution, but please do approach it. In our experience it improves sensitivity.

Clipper outputs a bed8 file:

chromosome, genomic_start, genomic_stop, cluster_name, min_pval, strand, thick_start, thick_stop

cluster_name contains the gene name, the number of reads in the cluster and a counter to give the cluster a unique identifier.

The genomic start and genomic start are the positions above the FDR cutoff that we used to call the peak. Any reads with their "middle" in this region are counted towards the cluster's read count. Thick start and thick stop specify the apex of the fitted curve in the peak.

We hope your experience with CLIPper is a productive one. If you have any comments, questions or concerns please do not hesitate to contact us. There are more options which should be self-explanatory but documentation of these options will nonetheless be written here eventually.

Clone this wiki locally