News and updates
New releases and related tools will be announced through the mailing list.
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Nature Biotechnology doi:10.1038/nbt.1621
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias
Genome Biology doi:10.1186/gb-2011-12-3-r22
Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq
Trapnell C, Hendrickson D, Sauvageau S, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq
Nature Biotechnology doi:10.1038/nbt.2450
Frequently Asked Questions
What's the difference between FPKM and RPKM?
They're almost the same thing. RPKM stands for Reads Per Kilobase of transcript per Million mapped reads. FPKM stands for Fragments Per Kilobase of transcript per Million mapped reads. In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. Paired-end RNA-Seq experiments produce two reads per fragment, but that doesn't necessarily mean that both reads will be mappable. For example, the second read may be of poor quality. If we were to count reads rather than fragments, we might double-count some fragments but not others, leading to a skewed expression value. Thus, FPKM is calculated by counting fragments, not reads.
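As a rough illustration of the definition above, FPKM can be sketched as a fragment count scaled by transcript length in kilobases and by millions of mapped fragments. The function name and all numbers below are hypothetical, and the sketch uses the plain transcript length rather than the effective length that Cufflinks uses, so it is not a reimplementation of Cufflinks' calculation.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Simplified sketch: uses the plain transcript length, not an effective length.
    """
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical paired-end data for one transcript:
both_mates_mapped = 400   # fragments where both reads mapped
one_mate_mapped = 100     # fragments where only one read mapped

# Counting fragments treats every fragment once.
fragments = both_mates_mapped + one_mate_mapped

# Counting reads double-counts only the fragments whose mates both mapped.
reads = 2 * both_mates_mapped + one_mate_mapped

example_fpkm = fpkm(fragments, transcript_length_bp=2_000,
                    total_mapped_fragments=10_000_000)
print(fragments, reads, example_fpkm)
```

Note that the read count here is neither equal to the fragment count nor exactly twice it, which is the skew the answer above describes.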
Can Cufflinks handle strand-specific RNA-Seq data?
Yes. If you mapped your reads with TopHat using the strand-specific mode or manually added XS tags, Cufflinks will automatically treat your data as strand-specific. If this is not the case, you should use the --library-type option along with the type of strand-specific protocol that was used to generate your data. More information can be found here.
Can I use Cufflinks with RNA-Seq data from Bacteria?
Sure, with one or two caveats. We don't recommend assembling bacterial transcripts using Cufflinks at first. If you're working on a new bacterial genome, consider using a computational gene finding application such as Glimmer. These tools are very mature and work well. With an annotation, you can use Cufflinks and Cuffdiff to profile expression and look for differences. Also, note that Cufflinks doesn't handle circular genomes in any special way, so make sure your genome and annotation are suitably linearized.
I want to find differentially expressed genes. Can I use Cufflinks in conjunction with count-based differential expression packages?
It's possible, but we strongly advise against this. Current count-based differential expression tools are poorly suited to differential expression analysis in genomes with alternatively spliced genes. The main reason for this is that when a gene has multiple isoforms, a change in the total number of reads or fragments from that gene doesn't always correspond to a change in expression for that gene. Conversely, a gene's expression may change, but the total number of fragments generated by its isoforms may be very similar. In order to detect changes accurately, it's necessary to estimate how many fragments came from each individual splice variant in each sample. Current count-based tools don't do this (to our knowledge - please send us email if you know of one!). Even if they did, fragments that come from parts of genes that are shared by more than one splice variant can't generally be assigned to a single isoform, so the fragment counts for each isoform are only estimates, and there is some uncertainty in the counts. Isoforms that are very similar will have a great deal of uncertainty surrounding their fragment counts. This uncertainty needs to be accounted for when testing for differential expression. So while you could use Cufflinks to estimate isoform-level counts, you'd be throwing away Cufflinks' uncertainty, and thus have more confidence in the differences you see than you really should. This will probably lead to many false positives in your analysis. Furthermore, when calculating FPKM we normalize by an effective length rather than simply by the transcript length, as explained in our publications. Calculating counts from FPKM by multiplying by the length will give incorrect results. We strongly encourage you to consider using Cuffdiff to find differentially expressed genes and transcripts.
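The point about gene-level fragment counts can be sketched with a toy model. The snippet assumes that a transcript generates fragments in proportion to its abundance times its length, and it uses a hypothetical gene with two isoforms of 4 kb and 1 kb. All names and numbers are invented for illustration, and the model ignores effective length, bias, and estimation uncertainty.

```python
# Hypothetical isoform lengths in kilobases for a single gene.
ISOFORM_LENGTH_KB = {"long_isoform": 4, "short_isoform": 1}


def gene_fragment_count(abundance_by_isoform):
    """Toy model: each isoform yields abundance * length fragments."""
    return sum(abundance * ISOFORM_LENGTH_KB[name]
               for name, abundance in abundance_by_isoform.items())


def gene_expression(abundance_by_isoform):
    """Gene expression taken as the sum of its isoform abundances."""
    return sum(abundance_by_isoform.values())


condition_1 = {"long_isoform": 10, "short_isoform": 0}
condition_2 = {"long_isoform": 0, "short_isoform": 10}   # isoform switch
condition_3 = {"long_isoform": 0, "short_isoform": 40}   # expression up 4x

for label, condition in [("condition_1", condition_1),
                         ("condition_2", condition_2),
                         ("condition_3", condition_3)]:
    print(label, gene_expression(condition), gene_fragment_count(condition))
```

Between conditions 1 and 2 the gene's expression is unchanged while its fragment count drops fourfold. Between conditions 1 and 3 the fragment count is identical although expression has quadrupled. A test on gene-level counts alone would get both comparisons wrong.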
How can I find out how many fragments come from each transcript?
In response to user demand, and to support the development of analysis methods downstream of abundance estimation, we have added this functionality in Cuffdiff 2.0. See the count tracking table description in the manual page. Support for count tracking will be coming to Cufflinks soon. However, people who have asked for this almost always want to use Cufflinks in conjunction with count-based differential expression packages, which is not a good idea.
How is Cuffdiff different from other differential expression tools?
To our knowledge, most other differential expression packages work by counting up the number of fragments that originate from each gene in each sample, looking for differences in the count for each gene across conditions, and testing the observed differences for statistical significance. This approach works well when genes have a single isoform and occur only once in the genome (i.e. aren't members of "gene families"). However, when dealing with genes with more than one isoform, it's generally not possible to tell with certainty which isoform generated each fragment. This greatly complicates the problem of looking for differentially expressed genes.
Most differential expression packages count reads at the gene level and look for differences in these counts directly. However, this approach can be very inaccurate for genes that have multiple isoforms, because a change in the relative abundance of the isoforms can change the overall gene abundance.
Cuffdiff looks for differentially expressed genes by estimating how many fragments came from each isoform and then converting the counts into isoform expression levels. To find differentially expressed genes, Cuffdiff calculates expression levels in each condition for each gene by adding up the expression levels for each gene's splice isoforms. Cuffdiff tests for differences in each gene's expression level across conditions. See here for more details about how this works.
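The aggregation step described above, adding up isoform expression levels to get a gene-level value, can be sketched as follows. This is only an illustration of the summation; the function name, identifiers, and FPKM values are hypothetical, and the sketch leaves out the variance estimation and statistical testing that Cuffdiff performs.

```python
from collections import defaultdict


def gene_levels(isoform_fpkm, isoform_to_gene):
    """Sum isoform-level expression values into gene-level values."""
    totals = defaultdict(float)
    for isoform, value in isoform_fpkm.items():
        totals[isoform_to_gene[isoform]] += value
    return dict(totals)


# Hypothetical annotation: which gene each isoform belongs to.
isoform_to_gene = {"iso1": "geneX", "iso2": "geneX", "iso3": "geneY"}

# Hypothetical isoform expression estimates in one condition.
isoform_fpkm = {"iso1": 12.5, "iso2": 7.5, "iso3": 3.0}

levels = gene_levels(isoform_fpkm, isoform_to_gene)
print(levels)
```

Running the same summation for each condition gives the per-gene values that are then compared across conditions.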
Can I use Cufflinks to find SNPs or RNA editing sites?
Not at the moment, because Cufflinks doesn't do any base calling of its own. We recommend that you check out samtools for this purpose.
Does Cufflinks discover or quantify gene fusions or trans-spliced transcripts?
Not yet, but we will likely support this in a future release (we have no idea how far away that release will be).
Do Cufflinks and Cuffdiff support both BAM and SAM?
Yes. If a SAM is supplied, a message will be output that the file is not a valid BAM file. However, Cufflinks will recognize this and treat the file as a SAM. When using a SAM file, you should include a proper header or ensure that the reads are sorted lexicographically by chromosome and then numerically by left position. You can accomplish this sorting with the command sort -k3,3 -k4,4n in.sam > out.sam.
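For readers who prefer to see the required order spelled out, here is a minimal Python sketch of the same ordering: lexicographic on the third SAM column (reference name), then numeric on the fourth (leftmost position). The function name and the toy records are hypothetical. Unlike the plain sort command, this sketch keeps any header lines (those starting with "@") at the top. Python compares strings by code point, which may differ from sort under some locale settings.

```python
def sort_sam_lines(lines):
    """Return SAM lines with header first, then records ordered by
    reference name (lexicographic) and leftmost position (numeric)."""
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if line and not line.startswith("@")]

    def sort_key(line):
        fields = line.split("\t")
        return (fields[2], int(fields[3]))  # RNAME, POS

    return header + sorted(records, key=sort_key)


# Toy records: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
toy_sam = [
    "@HD\tVN:1.0",
    "r1\t0\tchr2\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr2\t20\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr10\t50\t255\t4M\t*\t0\t0\tACGT\tIIII",
]

sorted_sam = sort_sam_lines(toy_sam)
read_names = [line.split("\t")[0] for line in sorted_sam[1:]]
print(read_names)
```

In this ordering chr10 comes before chr2 because the comparison is on characters, not on chromosome numbers.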
How does Cuffdiff calculate its p-values?
Cuffdiff is able to estimate p-values for differential expression of individual transcripts. Like other tools, it uses a negative binomial model estimated from data to obtain variance estimates from which p-values are computed. More details will be available in a forthcoming publication.
What does "NOTEST" mean in Cuffdiff's output?
This status code is used by Cuffdiff to indicate that neither of the conditions contained enough reads in a locus to support a reliable calculation of expression level. Basically, Cuffdiff wasn't confident that it had enough data in that gene. You can control Cuffdiff's behavior in this regard by lowering or raising the -c option.
What does "LOWDATA" mean in Cuffdiff's output?
This status code is used by Cuffdiff to indicate that in both conditions, the gene being tested was either too complex or too shallowly sequenced to support a reliable calculation of abundance. Cuffdiff can often (though not always) tell via some simple linear algebra that it doesn't have enough reads to pick apart expression levels for all of a gene's isoforms, and without these, it can't calculate expression for the gene. When this happens in both conditions, Cuffdiff marks any observed difference in expression as not statistically significant and sets the status code as LOWDATA. You should treat any such observed changes as highly unreliable.
What does "FAIL" mean in Cuffdiff's output?
This status code is used by Cuffdiff to indicate that in one or both conditions, Cuffdiff encountered a numerical exception while calculating expression levels in the locus. This can happen because of rounding errors and other "facts of life" when doing math on computers. Typically, this occurs in very complex genes with relatively low expression levels. When this happens in one or both conditions being tested, Cuffdiff marks any observed difference in expression as not statistically significant and sets the status code as FAIL. You should treat any such observed changes as highly unreliable.
Can Cufflinks/Cuffdiff give me FPKM for each exon or splicing event?
For the foreseeable future, no.
How do I find differentially expressed genes with Cuffdiff?
Run Cuffdiff (see here for an example) and look at the gene_exp.diff output file.
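Once Cuffdiff has written gene_exp.diff, a short script can pull out the genes of interest. The sketch below assumes the file is tab-delimited with a header row containing columns named gene_id, status, and q_value; check these names against the header of your own file. The 0.05 cutoff, the function name, and the toy rows (which show only a subset of columns) are hypothetical.

```python
import csv
import io


def significant_genes(handle, alpha=0.05):
    """Return (gene_id, q_value) pairs for tests with status OK and q_value <= alpha.

    Assumes a tab-delimited table with 'gene_id', 'status' and 'q_value' columns.
    """
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["status"] != "OK":
            continue  # skip NOTEST, LOWDATA and FAIL rows
        q_value = float(row["q_value"])
        if q_value <= alpha:
            hits.append((row["gene_id"], q_value))
    return hits


# Toy stand-in for a gene_exp.diff file (invented values, subset of columns).
toy_table = io.StringIO(
    "gene_id\tstatus\tq_value\n"
    "G1\tOK\t0.001\n"
    "G2\tNOTEST\t1\n"
    "G3\tOK\t0.4\n"
    "G4\tFAIL\t1\n"
)

hits = significant_genes(toy_table)
print(hits)
```

With a real file, the handle would come from open("gene_exp.diff", newline="") instead of the in-memory table.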
I'm looking for a GTF file with annotated transcripts. Do you provide annotation files?
No. Curating these annotation files would be way too much work for us, and organizations like Ensembl and UCSC already do a great job. We suggest you check out either of these for a high quality annotation GTF for your organism.
I'm trying to assemble a sample. Cufflinks is almost done, but it seems to be hanging at "99% complete". What's going on?
Cufflinks spawns threads for each locus to assemble and quantitate the "bundle" of reads in that locus. Some loci may have more reads and more complicated alternative splicing than others, which requires more CPU cycles. These bundles can continue processing long after all others have completed, leading to this behavior. You may be able to decrease the number of such bundles by masking out ribosomal and mitochondrial RNA using the -M/--mask-file option described in the Manual.
Can I use Cuffdiff on time-series data?
Yes. Check out the -T option in Cuffdiff.
How can I visualize my Cufflinks and Cuffdiff results?
I'm having trouble installing Cufflinks on my system. Can you help me?
While binaries are provided for easy installation on Mac and Linux, Cufflinks can also be installed from source on numerous systems. If you are using a Mac we recommend MacPorts. For other systems, we encourage you to visit SeqAnswers for discussions about installation issues on various platforms. You can try sending email to our support address, but please be aware that we get a great deal of email, and requests for installation help are generally the lowest priority for us.