spikesd17 edited this page Aug 8, 2012 · 15 revisions

Welcome to the clipper wiki!

CLIPper is a tool to define peaks in your CLIP-seq dataset. CLIPper was developed in the Yeo Lab at the University of California, San Diego.

Installing CLIPper is (should be) easy! Using git, you can clone the most recent version of this repository to a writable location on your system.

git clone https://github.com/YeoLab/clipper.git
cd clipper
python setup.py install
#check whether clipper is actually installed
cd ..
clipper -h

The last step should produce a help message describing CLIPper and its parameters.

Before you use CLIPper, you'll have to generate mapped reads. This can be accomplished in many ways, but we recommend using an aligner that is tolerant of spliced reads; in the future we will include options to detect mRNA-bound vs pre-mRNA-bound transcripts. We prefer to use GSNAP for this. You may use other aligners, but note that all testing is done with GSNAP-derived mapped reads.

An example mapping procedure using GSNAP could be this:

initial pre-processing: raw_reads ---> <Quality control> ---> <Trim sequencing adaptor sequences> ---> CLIP-seq_reads.gz

gsnap -t 4 -N 1 -A sam --gunzip -B 5 -s mm9 -d mm9 CLIP-seq_reads.gz > CLIP-seq_reads.sam
samtools view -bS CLIP-seq_reads.sam > CLIP-seq_reads.bam
samtools sort CLIP-seq_reads.bam CLIP-seq_reads.srt
samtools flagstat CLIP-seq_reads.srt.bam > CLIP-seq_reads.mapStats.txt

Optionally, remove reads with redundant start/stop positions. If you're not sure, omit this step; you can ask CLIPper to collapse these later with the --trim option.

samtools rmdup -s CLIP-seq_reads.srt.bam CLIP-seq_reads.srt.rmdup.bam
samtools flagstat CLIP-seq_reads.srt.rmdup.bam >> CLIP-seq_reads.mapStats.txt

Then index your BAM alignment (index the .srt.rmdup.bam file instead if you removed duplicates and plan to call peaks on it):

samtools index CLIP-seq_reads.srt.bam

In its simplest incarnation, CLIPper can be called like this (set -s to the genome you aligned against; hg19 is shown here):

clipper --bam CLIP-seq_reads.srt.bam -s hg19

However, like all great software packages, it can be more complicated than that!

CLIPper is not a ChIP-seq peak-finder. Compared with ChIP-seq peak-finding methods, there are a few caveats that one must take into consideration when defining peaks. First, since we are dealing with RNA rather than DNA, we must re-define significance thresholds on a gene-by-gene basis. We can't use a genome-wide cutoff for significance, as is done for ChIP-seq, because this would simply bias results towards more highly expressed transcripts. To account for this effect, we define each peak's significance based on the number of reads within the peak, relative to the number of other reads on the gene and the length of that gene.

It gets tricky here again because we're not certain whether the CLIP-seq reads are from a pre-mRNA or a mature mRNA. Some RNA-binding proteins (RBPs) bind before an RNA is processed, others after, and the likelihood that a peak is significant changes with respect to the biology of that RBP. We have provided a pre-compiled list of "effective" gene lengths for human and mouse mRNAs and pre-mRNAs from Ensembl, subtracting out the length of repetitive genomic elements in the span of the gene. If you suspect your RBP is binding pre-mRNAs, please specify

--premRNA

as an option when you run CLIPper and it will use pre-mRNA lengths, instead of the default mRNA lengths.

CLIPper combines features from many CLIP peak-finding algorithms. To reduce false positives, we employ a three-pass filter on our peaks. First, for each gene we calculate the false-discovery rate (FDR) threshold, which is the "height" of reads mapped at a single genomic position that is likely to be noise. We estimate it by randomly scattering the same number of faux reads as real reads across a faux transcript that has the same effective length as the real transcript. Details of this method can be found here. If you would like to manipulate the FDR, you may do so with this parameter:

--FDR <float>

or, alternatively, you can elect to skip the FDR calculation entirely and manually set a height threshold using:

--threshold <(int ≥ 0)>
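The random-scatter idea behind the FDR height threshold can be sketched in a few lines of Python. This is a simplified illustration and not CLIPper's actual implementation: the function name `height_threshold`, its parameters, the uniform read length, and the rule "smallest height reached by fewer than a fraction `fdr` of positions, averaged over randomizations" are all assumptions made for the example.

```python
import random
from collections import Counter


def height_threshold(n_reads, read_length, effective_length,
                     fdr=0.05, n_iter=50, seed=0):
    """Estimate the read height that random placement alone would rarely reach.

    Hypothetical helper: scatters n_reads faux reads uniformly over a faux
    transcript of effective_length and averages the per-randomization cutoff.
    """
    rng = random.Random(seed)
    read_length = min(read_length, effective_length)
    cutoffs = []
    for _ in range(n_iter):
        # Build per-position coverage for one randomization.
        coverage = [0] * effective_length
        for _ in range(n_reads):
            start = rng.randrange(effective_length - read_length + 1)
            for pos in range(start, start + read_length):
                coverage[pos] += 1
        counts = Counter(coverage)
        # Walk heights upward until fewer than `fdr` of positions reach them.
        height = 1
        while sum(c for h, c in counts.items() if h >= height) / effective_length >= fdr:
            height += 1
        cutoffs.append(height)
    return int(round(sum(cutoffs) / len(cutoffs)))


if __name__ == "__main__":
    # A heavily covered gene needs a taller pile of reads to stand out.
    print(height_threshold(20, 30, 5000))
    print(height_threshold(2000, 30, 5000))
```

Because the threshold is computed from each gene's own read count and effective length, a highly expressed gene gets a higher noise height than a sparsely covered one, which is the gene-by-gene behaviour described above.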

Next, we remove peaks which have fewer reads than would be significant under a Poisson distribution, using a p-value set by the parameter:

--poisson-cutoff <float ≤ 1>

An example of peak-finding using only Poisson from our lab is found here. If you'd like to ignore the Poisson cutoff, set it to 1. We use this same parameter to determine whether each peak is significant relative to all other reads on, and the length of, the entire transcriptome. It is not possible to toggle this transcriptome-wide filter yet, but it will be possible in future versions.
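As a rough illustration of the gene-level Poisson filter, the sketch below computes the probability of seeing at least as many reads in a peak as were observed, given the read density of the whole gene. It is a minimal example under stated assumptions, not CLIPper's code: the function names, the expected-count formula (gene reads × peak length / effective gene length), and the 0.05 default cutoff are hypothetical.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Sum P(X = i) for i < k in log space to avoid underflow for large lam.
    cdf = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
              for i in range(k))
    return max(0.0, 1.0 - cdf)


def peak_passes(peak_reads, peak_length, gene_reads, effective_gene_length,
                cutoff=0.05):
    """Return (p_value, keep) for one peak; cutoff=0.05 is an assumed default."""
    # Reads expected in the peak if the gene's reads were spread evenly.
    expected = gene_reads * peak_length / effective_gene_length
    p_value = poisson_sf(peak_reads, expected)
    return p_value, p_value <= cutoff


if __name__ == "__main__":
    print(peak_passes(50, 50, 100, 10000))          # dense peak on a sparse gene
    print(peak_passes(1, 50, 100, 10000))           # a single read is not enough
    print(peak_passes(1, 50, 100, 10000, cutoff=1)) # cutoff of 1 keeps everything
```

Since every p-value is at most 1, a cutoff of 1 lets every peak through, which is why setting the parameter to 1 effectively disables the filter.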

To improve the accuracy of the peak finder, we use cubic spline fitting to approximate the "shape" of a peak, and we define the bounds of a peak as the points on the fitted curve which fall below the FDR threshold or are local minima above it. This achieves a greater resolution and allows us to disambiguate multiple binding sites which are close to one another, as opposed to clumping them together. In order to perform the curve-fitting operation efficiently, we sub-divide each gene into "sections", which are regions that have ≥ 1 read and lie within a certain margin of one another. You may manipulate the margin using the

--margin <int>

parameter. Setting this value too low will result in multiple fits and slow CLIPper down dramatically; setting it too high will under-fit the data and result in giant peaks. Testing multiple values for --margin shows that 15 is a reasonable value for this parameter, but results will vary, and future implementations of CLIPper may migrate towards a more intelligent way of defining this parameter internally.
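The two ideas above, sectioning by margin and cutting peaks on a fitted curve, can be sketched as follows. This is an illustrative sketch only: `split_into_sections` and `peak_bounds` are hypothetical names, "within the margin" is assumed to mean a gap of at most `margin` nucleotides, and the spline fit itself is not performed here (the `fitted` list stands in for the already-fitted curve).

```python
def split_into_sections(covered_positions, margin=15):
    """Group positions with >= 1 read into (start, end) sections.

    Neighbouring covered positions stay in one section when the gap between
    them is at most `margin` (an assumed reading of the --margin option).
    """
    sections = []
    for pos in sorted(set(covered_positions)):
        if sections and pos - sections[-1][1] <= margin:
            sections[-1][1] = pos          # extend the current section
        else:
            sections.append([pos, pos])    # gap too large: start a new one
    return [tuple(s) for s in sections]


def peak_bounds(fitted, threshold):
    """Split fitted heights (one per position) into inclusive (start, end) peaks.

    A peak ends where the curve falls below `threshold`, or at a local
    minimum that is still above the threshold.
    """
    peaks, start = [], None
    last = len(fitted) - 1
    for i, height in enumerate(fitted):
        if height < threshold:
            if start is not None:
                peaks.append((start, i - 1))
                start = None
        elif start is None:
            start = i
        elif i < last and fitted[i - 1] > height < fitted[i + 1]:
            peaks.append((start, i))       # local minimum splits two peaks
            start = i
    if start is not None:
        peaks.append((start, last))
    return peaks


if __name__ == "__main__":
    print(split_into_sections([1, 2, 3, 10, 40, 41], margin=15))
    print(split_into_sections([1, 2, 3, 10, 40, 41], margin=30))
    print(peak_bounds([0, 3, 6, 3, 7, 2, 0, 0, 5, 0], threshold=2))
```

Running the first function with a small and a large margin shows the trade-off described above: a small margin yields many short sections (many fits), while a large one merges distant reads into a single long section.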

Another feature, which is semi-experimental but uber-cool, is the

--superlocal

option, which re-defines the Poisson cutoff and FDR in each "section" (within --margin nucleotides) defined above. In this implementation, we reasoned that some significant peaks are overshadowed, and their significance therefore discounted, when many reads are aligned in a far-away part of a gene. This method uses a window of 1 kb in each direction of a putative peak to define FDR and Poisson significance, and serves to pick up peaks which would otherwise be missed with genome-wide or gene-wide thresholds. Approach this option with caution, but please do approach it. In our experience it improves sensitivity without doing much damage to specificity.

We hope your experience with CLIPper is a productive one. If you have any comments, questions, or concerns, please do not hesitate to contact us. There are more options which should be self-explanatory, but documentation of these options will nonetheless be written here eventually.
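To see why a local window can rescue an overshadowed peak, the sketch below compares the number of reads expected in a peak under a gene-wide background with the number expected when only reads within 1 kb on either side are counted. It is a toy illustration, not CLIPper's code; `expected_reads`, its parameters, and the even-spread background model are assumptions.

```python
def expected_reads(read_positions, peak_start, peak_end,
                   gene_start, gene_end, window=None):
    """Reads expected in a peak if background reads were spread evenly.

    With window=None the background is the whole gene; otherwise it is the
    region extending `window` nucleotides on each side of the peak, clipped
    to the gene. All coordinates are inclusive.
    """
    if window is None:
        lo, hi = gene_start, gene_end
    else:
        lo = max(gene_start, peak_start - window)
        hi = min(gene_end, peak_end + window)
    background_reads = sum(lo <= pos <= hi for pos in read_positions)
    background_length = hi - lo + 1
    peak_length = peak_end - peak_start + 1
    return background_reads * peak_length / background_length


if __name__ == "__main__":
    # 1000 reads piled up at the 5' end, 10 reads in a small distant peak.
    reads = [i % 100 for i in range(1000)] + list(range(50000, 50010))
    gene_wide = expected_reads(reads, 50000, 50009, 0, 99999)
    local = expected_reads(reads, 50000, 50009, 0, 99999, window=1000)
    print(gene_wide, local)
```

In this toy gene the distant pile of reads inflates the gene-wide expectation for the small peak, while the local window ignores it; a lower expected count makes the same observed reads more significant, which matches the gain in sensitivity described above.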
