Cufflinks

Transcript assembly, differential expression, and differential regulation for RNA-Seq

      

Site Map

News and updates

    New releases and related tools will be announced through the mailing list

Getting Help

    Questions about Cufflinks and Cuffdiff should be posted on our Google Group. Please use tophat.cufflinks@gmail.com for private communications only. Please do not email technical questions to Cufflinks contributors directly.

Releases

Related Tools

  • Monocle: Single-cell RNA-Seq analysis
  • CummeRbund: Visualization of RNA-Seq differential analysis
  • TopHat: Alignment of short RNA-Seq reads
  • Bowtie: Ultrafast short read alignment

Publications

Contributors

Links

GFF files


Some of the Cufflinks modules take as input a file (or more) containing known gene annotations or other transcript data in GFF format (General Feature Format). GFF has many versions, but the two most popular that are supported by Cufflinks (and other programs in the Tuxedo suite, like Tophat) are GTF2 (Gene Transfer Format, described here) and GFF3 (defined here). Here are a few notes about the way these formats are interpreted by the Cufflinks programs.

GTF2

As seen in the GTF2 specification, the transcript_id attribute is also required by our GFF parser, and a gene_id attribute, though not strictly required in our programs, is very useful for grouping alternative transcripts under a gene/locus identifier. An optional gene_name attribute, if found, will be taken and shown as  a symbolic gene name or short-form abbreviation (e.g. gene symbols from HGNC or Entrez Gene). Some annotation sources (e.g. Ensembl) place a "human readable" gene name/symbol in the gene_name attribute, like a HUGO symbol (while gene_id might be just an automatically generated numeric identifier for the gene).

TopHat and Cufflinks generally expect exon features to define a transcript structure, with optional CDS features to specify the coding segments. Our GFF reader will ignore redundant features like start_codon, stop_codon when whole CDS features were provided, or *UTR features when whole exon features were also given. However, it is still possible to provide only CDS and *UTR features and our GFF parser will reassemble the exonic structure accordingly.

Example of a GTF2 transcript record with minimal attributes:


GFF3

As defined by the GFF3 specification, the parent features (usually transcripts, i.e. "mRNA" features) are required to have an ID attribute, but here again an optional gene_name attribute can be used to specify a common gene name abbreviation. If gene_name is not given, it can be also inferred from the Name or ID attributes of the parent gene feature of the current parent mRNA feature (if given in the input file). Exon or CDS features arerequired to have a Parent attribute whose value must match the value of the ID attribute of a parent transcript feature (usually a "mRNA" feature).

Feature restrictions

For various reasons we currently assume the following limits (maximum values) for the genomic length (span) of gene and transcript features:
  • genes and transcripts cannot span more than 7 Megabases on the genomic sequence
  • exons cannot be longer than 30 Kilobases
  • introns cannot be larger than 6 Megabases
Also, transcript IDs are expected to be unique per GFF input file (though we relaxed this restriction by limiting it to a chromosome/contig scope).

Due to these requirements, Cufflinks programs may fail to load the user provided GFF file, and an error message should specify the offending GFF record. The user is expected to remove or correct such GFF records in order to continue the analysis.

Example of a GFF3 transcript record with minimal attributes:


The gffread utility

A program called gffread is included with the Cufflinks package and it can be used to verify or perform various operations on GFF files (use gffread -h to see the various usage options). Because the program shares the same GFF parser code with Cufflinks and other programs in the Tuxedo suite, it could be used to verify that a GFF file from a certain annotation source is correctly "understood" by these programs. Thus the gffread utility can be used to simply read the transcripts from the file and print these transcripts back, in either GFF3 (default) or GTF2 format (-T option), discarding any extra attributes and keeping only the essential ones, so the user can quickly verify if the transcripts in that file are properly parsed by the GFF reader code. The command line for such a quick cleanup and visual inspection of a given GFF file could be:

gffread -E annotation.gff -o- | more

This will show the minimalist GFF3 re-formatting of the transcript records given in the input file (annotation.gff in this example). The -E option directs gffread to "expose" (display warnings about) any potential issues encountered while parsing the given GFF file.
In order to see the GTF2 version of the same transcripts the -T option should be added:

gffread -E annotation.gff -T -o- | more

From these examples it can be seen that gffread can also be used to convert a file between GTF2 and GFF3 formats.

Extracting transcript sequences

The gffread utility can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation a fasta file with the genomic sequences have to be provided as well. For example, one might want to extract the sequence of all transfrags assembled from a Cufflinks assembly session. This can be accomplished with a command line like this:

gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf

The file genome.fa in this example would be a multi fasta file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the 1st column of the input GFF file (transcript.gtf in this example) must have a corresponding sequence entry in chromosomes.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with Tophat. Note that the retrieval of the transcript sequences this way is going to be quicker if a fasta index file (genome.fa.fai in this example) is found in the same directory with the genomic fasta file. Such an index file can be created with samtools prior to running gffread, like this:

samtools faidx genome.fa

Then in subsequent runs using the -g option gffread will find the fasta index and use it to speed up the extraction of the transcript sequences.