Some of the Cufflinks modules take as input one or more files containing known gene annotations or other transcript data in GFF format (General Feature Format). GFF has many versions, but the two most popular ones supported by Cufflinks (and by other programs in the Tuxedo suite, like TopHat) are GTF2 (Gene Transfer Format) and GFF3. Here are a few notes about the way these formats are interpreted by the Cufflinks programs.
GTF2
As in the GTF2 specification, the transcript_id attribute is also required by our GFF parser. A gene_id attribute, though not strictly required by our programs, is very useful for grouping alternative transcripts under a gene/locus identifier. An optional gene_name attribute, if found, will be taken and shown as a symbolic gene name or short-form abbreviation (e.g. gene symbols from HGNC or Entrez Gene). Some annotation sources (e.g. Ensembl) place a "human readable" gene name/symbol, like a HUGO symbol, in the gene_name attribute (while gene_id might be just an automatically generated numeric identifier for the gene).
Example of a GTF2 transcript record with minimal attributes:
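The record in the following sketch is hypothetical: the identifiers G1, G1.1 and ABC1 are invented for illustration. The helper gtf_attr is likewise an assumption and not part of Cufflinks. It only shows, in shell with awk, where the transcript_id, gene_id and gene_name attributes described above sit in the ninth column of a GTF2 line.

```shell
# Hypothetical GTF2 exon record; the identifiers are invented for illustration.
# Columns are tab-separated:
# seqname, source, feature, start, end, score, strand, frame, attributes
gtf_record=$(printf 'chr1\texample\texon\t1000\t1500\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1"; gene_name "ABC1";')

# Print the value of one attribute from the 9th column, or nothing if absent.
# $1 = attribute name, $2 = full GTF2 line
gtf_attr() {
    printf '%s\n' "$2" | awk -F '\t' -v key="$1" '{
        n = split($9, parts, ";")
        for (i = 1; i <= n; i++) {
            item = parts[i]
            sub(/^[ ]+/, "", item)               # trim leading spaces
            if (index(item, key " ") == 1) {     # attribute name followed by a space
                val = substr(item, length(key) + 2)
                gsub(/"/, "", val)               # strip the surrounding quotes
                print val
                exit
            }
        }
    }'
}

gtf_attr transcript_id "$gtf_record"   # required by the parser; prints G1.1
gtf_attr gene_id "$gtf_record"         # groups transcripts by gene; prints G1
gtf_attr gene_name "$gtf_record"       # optional symbol; prints ABC1
```

This simple split on ";" assumes that no attribute value contains a semicolon. A record without a transcript_id would make the first call print nothing.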
GFF3
As defined by the GFF3 specification, the parent features (usually transcripts, i.e. "mRNA" features) are required to have an ID attribute, but here again an optional gene_name attribute can be used to specify a common gene name abbreviation. If gene_name is not given, it can also be inferred from the Name or ID attributes of the parent gene feature of the current parent mRNA feature (if given in the input file). Exon or CDS features are required to have a Parent attribute whose value must match the value of the ID attribute of a parent transcript feature (usually an "mRNA" feature).
Feature restrictions
For various reasons we currently assume the following limits (maximum values) for the genomic length (span) of gene and transcript features:
Due to these requirements, Cufflinks programs may fail to load the user-provided GFF file, and an error message should specify the offending GFF record. The user is expected to remove or correct such GFF records in order to continue the analysis.
Example of a GFF3 transcript record with minimal attributes:
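The records in the following sketch are hypothetical: the names tx1, tx2 and ABC1 are invented for illustration. The helper gff3_orphans is an assumption and not part of Cufflinks. It illustrates the rule that every exon or CDS feature must carry a Parent attribute matching the ID of some feature in the file, and deliberately includes one exon that breaks it.

```shell
# Hypothetical GFF3 records: one mRNA with a gene_name, one exon pointing to it,
# and one exon whose Parent (tx2) matches no ID in the input.
gff3_sample=$(printf 'chr1\texample\tmRNA\t1000\t2000\t.\t+\t.\tID=tx1;gene_name=ABC1\nchr1\texample\texon\t1000\t1500\t.\t+\t.\tParent=tx1\nchr1\texample\texon\t1800\t2000\t.\t+\t.\tParent=tx2\n')

# Report exon/CDS lines whose Parent attribute matches no feature ID in the input.
# Reads GFF3 on stdin; prints the line number and Parent value of each offender.
gff3_orphans() {
    awk -F '\t' '
        /^#/ || NF < 9 { next }                  # skip comments and malformed lines
        {
            id = ""; parent = ""
            n = split($9, parts, ";")
            for (i = 1; i <= n; i++) {
                if (parts[i] ~ /^ID=/)     id = substr(parts[i], 4)
                if (parts[i] ~ /^Parent=/) parent = substr(parts[i], 8)
            }
            if (id != "") ids[id] = 1
            if ($3 == "exon" || $3 == "CDS") want[NR] = parent
        }
        END {
            for (ln in want)
                if (!(want[ln] in ids)) print "line " ln ": Parent=" want[ln]
        }'
}

printf '%s\n' "$gff3_sample" | gff3_orphans   # prints: line 3: Parent=tx2
```

The sketch treats the Parent value as a single identifier, so a comma-separated list of parents would need extra handling. It also does not check that the parent is a transcript feature.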
The gffread utility
A program called gffread is included with the Cufflinks package and can be used to verify or perform various operations on GFF files (use gffread -h to see the usage options). Because the program shares the same GFF parser code with Cufflinks and other programs in the Tuxedo suite, it can be used to verify that a GFF file from a given annotation source is correctly "understood" by these programs. gffread can simply read the transcripts from the file and print them back in either GFF3 (default) or GTF2 format (-T option), discarding any extra attributes and keeping only the essential ones, so the user can quickly verify whether the transcripts in that file are properly parsed by the GFF reader code. The command line for such a quick cleanup and visual inspection of a given GFF file could be:
gffread -E annotation.gff -o- | more
This will show the minimalist GFF3 re-formatting of the transcript records given in the input file (annotation.gff in this example). The -E option directs gffread to "expose" (display warnings about) any potential issues encountered while parsing the given GFF file.
In order to see the GTF2 version of the same transcripts the -T option should be added:
gffread -E annotation.gff -T -o- | more
From these examples it can be seen that gffread can also be used to convert a file between GTF2 and GFF3 formats.
Extracting transcript sequences
The gffread utility can be used to generate a FASTA file with the DNA sequences for all transcripts in a GFF file. For this operation a FASTA file with the genomic sequences has to be provided as well. For example, one might want to extract the sequences of all transfrags assembled in a Cufflinks assembly session. This can be accomplished with a command line like this:
gffread -w transcripts.fa -g /path/to/genome.fa transcripts.gtf
The file genome.fa in this example would be a multi-FASTA file with the genomic sequences of the target genome. This also requires that every contig or chromosome name found in the first column of the input GFF file (transcripts.gtf in this example) has a corresponding sequence entry in genome.fa. This should be the case in our example if genome.fa is the file corresponding to the same genome (index) that was used for mapping the reads with TopHat. Note that retrieving the transcript sequences this way is quicker if a FASTA index file (genome.fa.fai in this example) is found in the same directory as the genomic FASTA file. Such an index file can be created with samtools prior to running gffread, like this:
samtools faidx genome.fa
Then, in subsequent runs using the -g option, gffread will find the FASTA index and use it to speed up the extraction of the transcript sequences.
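Before running the extraction, it can help to confirm that every sequence name in the first column of the GFF file has an entry in the genomic FASTA file. The following is a minimal sketch of such a check. The helper missing_seqs and the small test files are assumptions made up for illustration; they are not part of gffread or Cufflinks.

```shell
# List sequence names used in column 1 of a GFF/GTF file that have no
# matching entry in a multi-FASTA file.
# $1 = GFF/GTF file, $2 = genome FASTA file
missing_seqs() {
    awk -F '\t' '
        NR == FNR {                              # first file: the FASTA
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*$/, "", name)        # keep the name up to the first whitespace
                have[name] = 1
            }
            next
        }
        /^#/ || NF < 9 { next }                  # second file: skip comments
        !($1 in have) && !($1 in seen) { seen[$1] = 1; print $1 }
    ' "$2" "$1"
}

# Hypothetical inputs: the FASTA has chr1 and chr2, the GTF refers to chr1 and chr3.
tmpdir=$(mktemp -d)
printf '>chr1 test sequence\nACGTACGT\n>chr2\nGGCC\n' > "$tmpdir/genome.fa"
printf 'chr1\tx\texon\t1\t4\t.\t+\t.\ttranscript_id "t1";\nchr3\tx\texon\t1\t4\t.\t+\t.\ttranscript_id "t2";\n' > "$tmpdir/transcripts.gtf"

missing_seqs "$tmpdir/transcripts.gtf" "$tmpdir/genome.fa"   # prints chr3
rm -rf "$tmpdir"
```

An empty result means that every name in the annotation was found among the FASTA headers. The comparison uses the header text up to the first whitespace, which is an assumption about how the sequence names are written in the FASTA file.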