Cufflinks

Transcript assembly, differential expression, and differential regulation for RNA-Seq

      
Please Note If you have questions about how to use Cufflinks or would like more information about the software, please email tophat.cufflinks@gmail.com, though we ask you to have a look at the paper and the supplemental methods first, as your question may be answered there.

Site Map

News and updates

    New releases and related tools will be announced through the mailing list

Getting Help

    Questions about Cufflinks and Cuffdiff should be posted on our Google Group. Please use tophat.cufflinks@gmail.com for private communications only. Please do not email technical questions to Cufflinks contributors directly.

Releases

Related Tools

  • Monocle: Single-cell RNA-Seq analysis
  • CummeRbund: Visualization of RNA-Seq differential analysis
  • TopHat: Alignment of short RNA-Seq reads
  • Bowtie: Ultrafast short read alignment

Publications

Contributors

Links

How Cufflinks works


What is Cufflinks?


Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide. Cufflinks runs on Linux and OS X.

Cufflinks is described in our recent paper, and much of the algorithmic and mathematical material is presented in the supplemental methods


How does Cufflinks assemble transcripts?


Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's Theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.

While Cufflinks works well with unpaired RNA-Seq reads, it is designed with paired reads in mind. The assembly algorithm explicitly handles paired end reads by treating the alignment for a given pair as a single object in the covering relation. The proof of Dilworth's theorem finds a maximum cardinality matching on the bipartite graph of the transitive closure of the DAG. However, there is not necessarily a unique maximum cardinality matching, reflecting the fact that due to the limited size of RNA-Seq cDNA fragments, we may not know with certainty which outcomes of alternative splicing events go together in the same transcripts. Cufflinks tries to find the correct parsimonious set of transcripts by performing a minimum cost maximum matching. The cost of associating splicing events is based on the "percent-spliced-in" score introduced in

This matching is then extended to a minimum cost path cover of the DAG, with each path representing a different transcript.


The algorithm builds on ideas behind the ShoRAH algorithm for haplotype abundance estimation in viral populations, described in:

  • Nicholas Eriksson, Lior Pachter, Yumi Mitsuya, Soo-Yon Rhee, Chunlin Wang, Baback Gharizadeh, Mostafa Ronaghi, Robert W. Shafer, Niko Beerenwinkel Viral population estimation using pyrosequencing, PLoS Computational Biology, 4(5):e1000074
The assembler also borrows some ideas introduced with the PASA algorithm for annotating genomes from EST and full length mRNA evidence, described in:
  • Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.

Cufflinks is implemented in C++ and makes substantial use of the Boost Libraries as well as the LEMON Graph Library, which was launched by the Egerváry Research Group on Combinatorial Optimization (EGRES).

How does Cufflinks calculate transcript abundances?


In RNA-Seq experiments, cDNA fragments are sequenced and mapped back to genes and ideally, individual transcripts. Properly normalized, the RNA-Seq fragment counts can be used as a measure of relative abundance of transcripts, and Cufflinks measures transcript abundances in Fragments Per Kilobase of exon per Million fragments mapped (FPKM), which is analagous to single-read "RPKM", originally proposed in:



In paired-end RNA-Seq experiments, fragments are sequenced from both ends, providing two reads for each fragment. To estimate isoform-level abundances, one must assign fragments to individual transcripts, which may be difficult because a read may align to multiple isoforms of the same gene. Cufflinks uses a statistical model of paired-end sequencing experiments to derive a likelihood for the abundances of a set of transcripts given a set of fragments. This likelihood function can be shown to have a unique maximum, which Cufflinks finds using a numerical optimization algorithm. The program then multiplies these probabilities to compute the overall likelihood that one would observe the fragments in the experiment, given the proposed abundances on the transcripts. Because Cufflinks' statistical model is linear, the likelihood function has a unique maximum value, and Cufflinks finds it with a numerical optimization algorithm.

Using this statistical method, Cufflinks can estimate the abundances of the isoforms present in the sample, either using a known "reference" annotation, or after an ab-initio assembly of the transcripts using only the reference genome.


How does Cufflinks estimate the fragment length distribution?


The probability distribution on the length of fragments plays an important role in assembly, abundance estimation, and bias correction. The accuracy of this distribution will have a great impact on the accuracy of our results. Because of this, we now attempt to "learn" this distribution from the input data instead of relying on an approximate Gaussian distribution, whenever possible.

  • If only single-end reads are provided, there is no way to estimate the empirical distribution, so Cufflinks must use an approximate Gaussian distribution with either default or user-provided parameters (see the manual for more details).
  • If the alignment file contains paired-end reads and an asssembly is provided, Cufflinks is able to learn the distribution from reads that map to single-isoform genes. Because the assembly provides the splicing structure, introns can be removed from between the paired reads, providing the most accurate estimation.
  • If given paired end reads and no assembly, Cufflinks will search for large "open ranges" where the alignments contain no splices within the reads. Within these ranges, Cufflinks uses the genomic length of the paired-end reads to estimate the distribution. If not enough reads can be found within these ranges, Cufflinks will default to the Gaussian approximation as in the single-end case. To ensure an empirical distribution will be used, one can first assemble with the Gaussian and then supply the output GTF assembly in a second run of Cufflinks.

How does Cufflinks identify and correct for sequence bias?


Often in RNA-Seq experiments, a sequence-specific bias is introduced during the library preparation that challenges the assumption of uniform coverage. For example, a sequence-specific bias caused by the use of random hexamers was identified in



Because this bias is usually caused by primers used either in PCR or reverse transcription, it appears near the ends of the sequenced fragments. We have developed a method to correct this bias by “learning” what sequences are being selected for (or ignored) in a given experiment, and including these measurements in the abundance estimation.

The first step in the process is to generate an initial abundance estimation without using bias correction. Since different transcripts will have different sequences appear in them, we use this approximate abundance to weight reads by the expression level of the transcript from which they arise. This helps us avoid over-counting sequences that may be common in the mapping data due to high expression rather than bias.

We next revisit each fragment in the alignment file and apply the abundance weighting as we “learn” features of the sequence in a window surrounding the 5’ and 3’ end of the transcript using a graphical model of the statistical dependencies between bases in the window. We keep a separate model for each end of the read since the biases in the first and second strand synthesis of the fragment are not always the same.

Finally, we re-estimate the abundances using a new likelihood function that has been adjusted to take the sequence bias into account, based on the parameters of the graphical model we computed in the previous step. The result is a new set of FPKMs that are less affected by sequence-specific bias.

Note that since it must know which ends of reads are fragment ends, Cufflinks will not bias correct reads mapping to transcripts with unknown strandedness.

The full details of our method can be found in



How does Cufflinks handle multi-mapped reads?


Individual reads will sometimes be mapped to multiple positions in the genome due to sequence repeats and homology. By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position.

If multi-mapped read correction is enabled (-u/--multi-read-correct), Cufflinks will improve its estimation in a manner inspired by (but using more information than) the 'rescue' method described in



Cufflinks will first calculate initial abundance estimates for all transcripts using the uniform dividing scheme. Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabalistically based on the initial abundance estimation of the genes it maps to, the inferred fragment length, and fragment bias (if bias correction is enabled).


How does Reference Annotation Based Transcript (RABT) assembly work?


Reference annotation based assembly seeks to build upon available information about the transcriptome of an organism to find novel genes and isoforms. When a reference GTF is provided with the -g/--GTF-guide option, the reference transcripts are tiled with faux-reads that will aid in the assembly of novel isoforms. These faux reads are combined with the sequencing reads and are input into the regular Cufflinks assembler. The assembled transfrags are then compared to the reference transcripts to determine if they are sufficiently different to be considered novel. Those that are labeled novel by our criteria (see Cufflinks options to adjust the parameters) are output along with the transcripts from the annotation.

The use of faux-reads was inspired by the methods of


  • J Venter, M Adams, E Myers, P Li, R Mural, G Sutton, H Smith, M Yandell, C Evans, R Holt, et al. The sequence of the human genome Science, Volume 291, 1304 - 1351 (2001)

What is Cuffdiff?


Cuffdiff is a program that uses the Cufflinks transcript quantification engine to calculate gene and transcript expression levels in more than one condition and test them for signficant differences. You can use it to find differentially expressed genes and transcripts, as well as genes that are being differentially regulated at the transcriptional and post-transcriptional level.

Cuffdiff is described in detail in the manuscript below:


How does Cuffdiff 2 test for differentially expressed and regulated genes?


To identify a gene or transcript as DE, Cuffdiff 2 tests the observed log-fold-change in its expression against the null hypothesis of no change (i.e. the true log-fold-change is zero). Because measurement error, technical variability, and cross-replicate biological variability might result in an observed log-fold-change that is nonzero, Cuffdiff assesses significance using a model of variability in the log-fold-change under the null hypothesis. This model is described in detail in detail in Trapnell and Hendrickson et al. Briefly, Cuffdiff two constructs, for each condition, a table that predicts how much variance there is in the number of reads originating from a gene or transcript. The table is keyed by the average reads across replicates, so to look up the variance for a transcript using the table, Cuffdiff estimates how many reads originated from that transcript, and then queries the table to retrieve the variance for that number of reads. Cuffdiff 2 then accounts for read mapping and assignment uncertainty by simulating probabilistic assignment of the reads mapping to a locus to the splice isoforms for that locus. At the end of the estimation procedure, Cuffdiff 2 obtains an estimate of the number of reads that originated from each gene and transcript, along with variances in those estimates. The read counts are reported along with FPKM values and their variances. Change in expression is reported as the log fold change in FPKM, and the FPKM variances allow the program to estimate the variance in the log-fold-change itself. Naturally, a gene that has highly variable expression will have a highly variable log-fold-change between two conditions.


What's changed since the paper?


Numerous small changes, bugfixes, and minor features have appeared since version 2.0.2 of Cuffdiff (which was used for Trapnell and Hendrickson et al). There are also two more substantial changes:

The first change is that Cuffdiff now reports the FPKM for each condition as the average FPKM calculated for each individual replicate. Previous versions pooled the reads from all replicates of a condition together and computed FPKM for each gene and transcript from the pool. The two approaches yield extremely similar values, but the new method is faster and simpler, and will make implementing several planned features much easier.

The second change is that Cuffdiff 2.1 introduced a new testing method that substantially improves performance over previous releases, including the one used for the paper. The modifications made in Cuffdiff 2.1 improve sensitivity in calling differentially expressed (DE) genes and transcripts while maintaining a low false positive rate. They stem from the method used to calculate the variability in the log fold change in expression. In Trapnell et al, Cuffdiff 2 used the “delta method” to estimate the variance of the log fold change estimate for a gene or transcript. This method yields a simple equation that takes as input the mean and variance of the transcript’s expression in two conditions and produces a variance for the log fold change. However, the equation contains no explicit accounting for the number of replicates used to produce those estimates – they are assumed to be perfectly accurate. The equation also assumes that the distribution of log fold changes (after a particular transformation) is approximately normal. For most genes and transcripts, this approximation is a good one. However, for the remaining genes and transcripts, Cuffdiff 2 sometimes failed to detect a signficant change in expression.

The improved version of Cuffdiff 2 more accurately estimates the variance in the log-fold-change using simulated draws from the model of variance in expression for each of the two conditions. Imagine an experiment that has n replicates in condition A and m replicates in condition B. To estimate the distribution of the log-fold-change in expression for a gene G under the null hypothesis, Cuffdiff first draws n times from the distribution of expression of G according to the algorithm’s model of expression. Cuffdiff then takes the average of the n draws to obtain an expression “measurement”. Then, Cuffdiff draws m from the same distribution and again takes their average. Cuffdiff then takes the log ratio of these averages, places this value in a list, and then repeats the procedure until there are thousands of such log-fold-change samples in the list. The software then makes a similar list, this time using the expression model for condition B – the null hypothesis assumes both sets of replicates originate from the same condition, but we do not know whether A or B is the better representative of that condition, so we must draw samples from both and combine them. To calculate a p-value of observing the real log-fold-change under this null model, we simply sort all the samples and count how many of them are more extreme than the log fold change we actually saw in the real data. This number divided by the total number of draws is our estimate for the p-value.

Cuffdiff 2 reports not only genes and transcripts that are significantly DE between conditions, but also groups of transcripts (i.e. the isoforms of a gene) that show significant changes in expression relative to one another. The test for this is similar to what is described in Trapnell et al, but comparably modified along the lines described above for single genes or transcripts. Draws of expression are made for each transcript in a group according to the number of replicates in the experiment. These are averaged, and the shift in relative transcript abundance for the draw is made using the Jensen-Shannon metric. These draws are added to a list and used to calculate p-values for significance of observed shifts in relative abundance under the null hypothesis.