Manual
Prerequisites
Cufflinks runs on intel-based computers running Linux or Mac OS X and
that have GCC 4.0 or greater installed. You can install pre-compiled
binaries or build Cufflinks from the source code.
If you wish to build Cufflinks yourself, you will need to install the
Boost C++ libraries. See
Installing Boost, on the getting started page. You will also need to build and install the
SAM tools, but you should take a look at the getting
started page for detailed instructions, because the headers and libbam must be accessible
to the Cufflinks build scripts.
Running Cufflinks
Run cufflinks from the command line as follows:
Usage: cufflinks [options]* <aligned_reads.(sam/bam)>
The following is a detailed description of the options used to control
Cufflinks:
Arguments:
|
|
<aligned_reads.(sam/bam)>
|
A file of RNA-Seq read alignments in the
SAM format. SAM is a standard short read alignment, that allows aligners to
attach custom tags to individual alignments, and Cufflinks requires that
the alignments you supply have some of these tags. Please see
Input formats for more details.
|
General Options:
|
|
-h/--help
|
Prints the help message and exits
|
-o/--output-dir <string>
|
Sets the name of the directory in which Cufflinks will write all of its
output. The default is "./".
|
-p/--num-threads <int>
|
Use this many threads to align reads. The default is 1.
|
-G/--GTF <reference_annotation.(gtf/gff)>
|
Tells Cufflinks to use the supplied reference annotation (a GFF file)
to estimate isoform expression. It will not assemble novel transcripts, and the program
will ignore alignments not structurally compatible with any reference transcript.
|
-g/--GTF-guide <reference_annotation.(gtf/gff)>
|
Tells Cufflinks to use the supplied reference annotation (GFF) to guide RABT assembly.
Reference transcripts will be tiled with faux-reads to provide additional information
in assembly. Output will include all reference transcripts as well as any novel
genes and isoforms that are assembled.
|
-M/--mask-file <mask.(gtf/gff)>
|
Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF
file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant
transcripts you wish to ignore in your analysis in this file. Due to variable efficiency
of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often
improves the overall robustness of transcript abundance estimates.
|
-b/--frag-bias-correct <genome.fa>
|
Providing Cufflinks with a multifasta file via this option instructs it to run our new
bias detection and correction algorithm which can significantly improve accuracy of
transcript abundance estimates. See How Cufflinks Works for more details.
|
-u/--multi-read-correct
|
Tells Cufflinks to do an initial estimation procedure to more accurately weight reads
mapping to multiple locations in the genome. See How Cufflinks Works for more details.
|
--library-type
|
See Library Types
|
--library-norm-method
|
See Library Normalization Methods
|
Advanced Abundance Estimation Options:
|
|
-m/--frag-len-mean <int>
|
This is the expected (mean) fragment length. The default is 200bp. Note:
Cufflinks now learns the fragment length mean for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
-s/--frag-len-std-dev <int>
|
The standard deviation for the distribution on fragment lengths. The default is 80bp. Note:
Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
-N/--upper-quartile-norm
|
With this option, Cufflinks normalizes by the upper quartile of the number of fragments mapping
to individual loci instead of the total number of sequenced fragments. This can improve robustness
of differential expression calls for less abundant genes and transcripts.
|
--total-hits-norm
|
With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is active by default.
|
--compatible-hits-norm
|
With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped hits used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default, and can only be used in combination with --GTF. Use with either RABT or ab initio assembly is not supported
|
--max-mle-iterations <int>
|
Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
|
--max-bundle-frags <int>
|
Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are listed in skipped.gtf. Default: 1000000
|
--no-effective-length-correction
|
Cufflinks will not employ its "effective" length normalization to transcript FPKM.
|
--no-length-correction
|
Cufflinks will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.
|
Advanced Assembly Options:
|
|
-L/--label
|
Cufflinks will report transfrags in GTF format, with a prefix given by
this option. The default prefix is "CUFF".
|
-F/--min-isoform-fraction <0.0-1.0>
|
After calculating isoform abundance for a gene, Cufflinks filters out
transcripts that it believes are very low abundance, because isoforms
expressed at extremely low levels often cannot reliably be assembled,
and may even be artifacts of incompletely spliced precursors of processed
transcripts. This parameter is also used to filter out introns
that have far fewer spliced alignments supporting them. The default is
0.1, or 10% of the most abundant isoform (the major isoform) of the gene.
|
-j/--pre-mrna-fraction <0.0-1.0>
|
Some RNA-Seq protocols produce a significant amount of reads that originate
from incompletely spliced transcripts, and these reads can confound the
assembly of fully spliced mRNAs. Cufflinks uses this parameter to filter
out alignments that lie within the intronic intervals implied by the
spliced alignments. The minimum depth of coverage in the intronic region covered
by the alignment is divided by the number of spliced reads, and if the
result is lower than this parameter value, the intronic alignments are
ignored. The default is 15%.
|
-I/--max-intron-length <int>
|
The maximum intron length. Cufflinks will not report transcripts with
introns longer than this, and will ignore SAM alignments with REF_SKIP
CIGAR operations longer than this. The default is 300,000.
|
-a/--junc-alpha <0.0-1.0>
|
The alpha value for the binomial test used during false positive spliced alignment
filtration. Default: 0.001
|
-A/--small-anchor-fraction <0.0-1.0>
|
Spliced reads with less than this percent of their length on each side of the junction
are considered suspicious and are candidates for filtering prior to assembly. Default: 0.09.
|
--min-frags-per-transfrag <int>
|
Assembled transfrags supported by fewer than this many aligned RNA-Seq fragments are
not reported. Default: 10.
|
--overhang-tolerance <int>
|
The number of bp allowed to enter the intron of a transcript when determining if a read or
another transcript is mappable to/compatible with it. The default is 8 bp based on the default
bowtie/TopHat parameters.
|
--max-bundle-length <int>
|
Maximum genomic length allowed for a given bundle. The default is 3,500,000 bp.
|
--min-intron-length <int>
|
Minimum intron size allowed in genome. The default is 50 bp.
|
--trim-3-avgcov-thresh <int>
|
Minimum average coverage required to attempt 3' trimming. The default is 10.
|
--trim-3-dropoff-frac <int>
|
The fraction of average coverage below which to trim the 3' end of an assembled
transcript. The default is 0.1.
|
--max-multiread-fraction <0.0-1.0>
|
The fraction a transfrag's supporting reads that may be multiply mapped to the genome. A transcript composed of more than this fraction will not be reported by the assembler. Default: 0.75 (75% multireads or more is suppressed).
|
--overlap-radius <int>
|
Transfrags that are separated by less than this distance get merged together, and the gap is filled. Default: 50 (in bp).
|
Advanced Reference Annotation Based Transcript (RABT) Assembly Options:
|
|
These options have an affect only when used in conjuction with -g/--GTF-guide.
|
|
--3-overhang-tolerance <int>
|
The number of bp allowed to overhang the 3' end of a reference transcript when determining
if an assembled transcript should be merged with it (ie, the assembled transcript is not novel).
The default is 600 bp.
|
--intron-overhang-tolerance <int>
|
The number of bp allowed to enter the intron of a reference transcript when determining if an
assembled transcript should be merged with it (ie, the assembled transcript is not novel).
The default is 50 bp.
|
--no-faux-reads
|
This option disables tiling of the reference transcripts with faux reads. Use this if you only
want to use sequencing reads in assembly but do not want to output assembled transcripts that lay
within reference transcripts. All reference transcripts in the input annotation will also
be included in the output.
|
Advanced Program Behavior Options:
|
|
-v/--verbose
|
Print lots of status updates and other diagnostic information.
|
-q/--quiet
|
Suppress messages other than serious warnings and errors.
|
--no-update-check
|
Turns off the automatic routine that contacts the Cufflinks server to check for a more
recent version.
|
Cufflinks takes a text file of SAM alignments, or a binary SAM (BAM) file as input. For more details
on the SAM format, see the
specification. The RNA-Seq read mapper TopHat
produces output in this format, and is
recommended for use with Cufflinks. However Cufflinks will accept SAM
alignments generated by any read mapper. Here's an example of an alignment Cufflinks will accept:
s6.25mer.txt-913508 16 chr1 4482736 255 14M431N11M * 0 0 \
CAAGATGCTAGGCAAGTCTTGGAAG IIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 XS:A:-
Note the use of the custom tag XS. This attribute, which must
have a value of "+" or "-", indicates which strand the RNA that produced
this read came from. While this tag can be applied to any alignment,
including unspliced ones, it must be present for all
spliced alignment records (those with a 'N' operation in the CIGAR string).
The SAM file supplied to Cufflinks must be sorted by
reference position. If you aligned your reads with TopHat, your
alignments will be properly sorted already. If you used another tool, you may
want to make sure they are properly sorted as follows:
sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted
Cufflinks Output
Cufflinks produces three output files:
1) transcripts.gtf
This GTF file contains Cufflinks'
assembled isoforms. The first 7 columns are standard GTF, and the last column
contains attributes, some of which are also standardized ("gene_id", and "transcript_id").
There one GTF record per row, and each record represents either a transcript
or an exon within a transcript. The columns are defined as follows:
Column number | Column name | Example | Description |
1 | seqname | chrX | Chromosome or contig name |
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
3 | feature | exon | The type of record (always either "transcript" or "exon". |
4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate) |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM) |
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
7 | frame | . | Cufflinks does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used. |
8 | attributes | ... | See below. |
Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
gene_id | CUFF.1 | Cufflinks gene id |
transcript_id | CUFF.1.1 | Cufflinks transcript id |
FPKM | 101.267 | Isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments |
frac | 0.7647 | Reserved. Please ignore, as this attribute may be deprecated in the future |
conf_lo | 0.07 | Lower bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo) |
conf_hi | 0.1102 | Upper bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_lo) |
cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
2) isoforms.fpkm_tracking
This file contains the estimated isoform-level expression values in the generic FPKM Tracking Format. Note, however that as there is only one sample, the "q" format is not used.
3) genes.fpkm_tracking
This file contains the estimated gene-level expression values in the generic FPKM Tracking Format. Note, however that as there is only one sample, the "q" format is not used.
Running Cuffcompare
Cufflinks includes a program that you can use to help analyze the transfrags
you assemble. The program cuffcompare helps you:
- Compare your assembled transcripts to a reference annotation
- Track Cufflinks transcripts across multiple experiments (e.g. across a time course)
From the command line, run cuffcompare as follows:
cuffcompare [options]* <cuff1.gtf> [cuff2.gtf] ... [cuffN.gtf]
Cuffcompare takes Cufflinks' GTF output as input, and optionally can take
a "reference" annotation (such as from Ensembl)
Arguments:
|
|
<cuff1.gtf>
|
A GTF file produced by cufflinks.
|
Options:
|
|
-h
|
Prints the help message and exits
|
-o <outprefix>
|
All output files created by Cuffcompare will have this prefix (e.g. <outprefix>.loci, <outprefix>.tracking, etc.).
If this option is not provided the default output prefix being used is: "cuffcmp"
|
-r
|
An optional "reference" annotation GFF file. Each sample is matched against
this file, and sample isoforms are tagged as overlapping, matching, or
novel where appropriate. See the refmap and
tmap output file descriptions below.
|
-R
|
If -r was specified, this option causes cuffcompare to
ignore reference transcripts that are not overlapped by any transcript in
one of cuff1.gtf,...,cuffN.gtf. Useful for ignoring
annotated transcripts that are not present in your RNA-Seq samples and thus
adjusting the "sensitivity" calculation in the accuracy report
written in the <outprefix> file
|
-s <seq_dir>
|
Causes cuffcompare to look into for fasta files with the
underlying genomic sequences (one file per contig) against which your
reads were aligned for some optional classification functions. For example,
Cufflinks transcripts consisting mostly of lower-case bases are classified
as repeats. Note that <seq_dir> must contain one
fasta file per reference chromosome, and each file must be named after
the chromosome, and have a .fa or .fasta extension.
|
-C
|
Enables the "contained" transcripts to be also written in the <outprefix>.combined.gtffile, with the attribute "contained_in"
showing the first container transfrag found. By default, without this option, cuffcompare does not write in that file isoforms that were found to be
fully contained/covered (with the same compatible intron structure) by other transfrags in the same locus.
|
-V
|
Cuffcompare is a little more verbose about what it's doing, printing messages to stderr, and it will also show warning messages about any inconsistencies or potential issues
found while reading the given GFF file(s).
|
Cuffcompare Output
Cuffcompare produces the following output files:
1) <outprefix>.stats
Cuffcompare reports various statistics related to the "accuracy" of the transcripts
in each sample when compared to the reference annotation data. The typical gene finding measures of "sensitivity"
and "specificity" (as defined in Burset, M., Guigó, R. : Evaluation of gene structure prediction programs (1996) Genomics, 34 (3), pp. 353-367.
doi: 10.1006/geno.1996.0298)
are calculated at various levels (nucleotide, exon, intron, transcript, gene) for each input file and reported in this file.
The Sn and Sp columns show specificity and sensitivity values at each level, while the fSn and fSp columns
are "fuzzy" variants of these same accuracy calculations, allowing for a very small variation in exon boundaries to still be counted as a "match".
(If the -o option was not given the default prefix "cuffcmp" is used and these stats will be printed into a file named cuffcmp.stats in the current directory)
2) <outprefix>.combined.gtf
Cuffcompare reports a GTF file containing the "union" of all transfrags in
each sample. If a transfrag is present in both samples, it is thus reported
once in the combined gtf.
3) <outprefix>.tracking
This file matches transcripts up between samples. Each row contains
a transcript structure that is present in one or more input GTF files.
Because the transcripts will generally have different IDs (unless you
assembled your RNA-Seq reads against a reference transcriptome),
cuffcompare examines the structure of each the transcripts,
matching transcripts that agree on the coordinates and order of all of
their introns, as well as strand. Matching transcripts are allowed to
differ on the length of the first and last exons, since these lengths
will naturally vary from sample to sample due to the random nature of
sequencing.
If you ran cuffcompare with the -r option, the first and second
columns contain the closest matching reference transcript to the one
described by each row.
Here's an example of a line from the tracking file:
TCONS_00000045 XLOC_000023 Tcea|uc007afj.1 j \
q1:exp.115|exp.115.0|100|3.061355|0.350242|0.350207 \
q2:60hr.292|60hr.292.0|100|4.094084|0.000000|0.000000
In this example, a transcript present in the two input files, called
exp.115.0 in the first and 60hr.292.0 in the second,
doesn't match any reference transcript exactly, but shares exons with
uc007afj.1, an isoform of the gene Tcea, as indicated by the
class code j.
The first three columns are as follows:
Column number |
Column name |
Example |
Description |
1 |
Cufflinks transfrag id |
TCONS_00000045 |
A unique internal id for the transfrag |
2 |
Cufflinks locus id |
XLOC_000023 |
A unique internal id for the locus |
3 |
Reference gene id |
Tcea |
The gene_name attribute of the reference GTF record
for this transcript, or '-' if no reference transcript overlaps
this Cufflinks transcript |
4 |
Reference transcript id |
uc007afj.1 |
The transcript_id attribute of the reference GTF
record for this transcript, or '-' if no reference transcript
overlaps this Cufflinks transcript |
5 |
Class code |
c |
The type of match between the Cufflinks transcripts
in column 6 and the reference transcript. See class codes |
Each of the columns after the fifth have the following format:
qJ:<gene_id>|<transcript_id>|<FMI>|<FPKM>|<conf_lo>|<conf_hi>|<cov>|<len>
A transcript need not be present in all samples to be reported in the tracking
file. A sample not containing a transcript will have a "-" in its entry
in the row for that transcript.
(The following output files are created for each of the <cuff_in> file given, in the same directories where the <cuff_in> files reside)
4) <cuff_in>.refmap
This tab delimited file lists, for each reference transcript, which cufflinks
transcripts either fully or partially match it. There is one row per
reference transcript, and the columns are as follows:
Column number | Column name | Example | Description |
1 | Reference gene name | Myog | The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used. |
2 | Reference transcript id | uc007crl.1 | The transcript_id attribute of the reference GTF record for this transcript |
3 | Class code | c | The type of match between the Cufflinks transcripts in column 4 and the reference transcript. One of either 'c' for partial match, or '=' for full match. |
4 | Cufflinks matches | CUFF.23567.0,CUFF.24689.0 | A comma separated list of Cufflinks transcript ids matching the reference transcript |
5) <cuff_in>.tmap
This tab delimited file lists the most closely matching reference
transcript for each Cufflinks transcript. There is one row per Cufflinks transcript,
and the columns are as follows:
Column number |
Column name |
Example |
Description |
1 |
Reference gene name |
Myog |
The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used. |
2 |
Reference transcript id |
uc007crl.1 |
The transcript_id attribute of the reference GTF record for this transcript |
3 |
Class code |
c |
The type of relationship between the Cufflinks transcripts in column 4 and the reference transcript (as described in the Class Codes section below) |
4 |
Cufflinks gene id |
CUFF.23567 |
The Cufflinks internal gene id |
5 |
Cufflinks transcript id |
CUFF.23567.0 |
The Cufflinks internal transcript id |
6 |
Fraction of major isoform (FMI) |
100 |
The expression of this transcript expressed as a fraction of the major isoform for the gene. Ranges from 1 to 100. |
7 |
FPKM |
1.4567 |
The expression of this transcript expressed in FPKM |
8 |
FPKM_conf_lo |
0.7778 |
The lower limit of the 95% FPKM confidence interval |
9 |
FPKM_conf_hi |
1.9776 |
The upper limit of the 95% FPKM confidence interval |
10 |
Coverage |
3.2687 |
The estimated average depth of read coverage across the transcript. |
11 |
Length |
1426 |
The length of the transcript |
12 |
Major isoform ID |
CUFF.23567.0 |
The Cufflinks ID of the gene's major isoform |
Class Codes
If you ran cuffcompare with the -r option, tracking rows will
contain the following values. If you did not use -r, the rows will
all contain "-" in their class code column.
Priority |
Code |
Description |
1 |
= |
Complete match of intron chain |
2 |
c |
Contained |
|
3 |
j |
Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript |
|
4 |
e |
Single exon transfrag overlapping a reference
exon and at least 10 bp of a reference intron, indicating a
possible pre-mRNA fragment. |
|
5 |
i |
A transfrag falling entirely within a
reference intron |
|
6 |
o |
Generic exonic overlap with a reference transcript |
|
7 |
p |
Possible polymerase run-on fragment (within 2Kbases of a reference transcript) |
|
8 |
r |
Repeat. Currently determined by looking at the soft-masked reference
sequence and applied to transcripts where at least 50% of the
bases are lower case |
|
9 |
u |
Unknown, intergenic transcript |
|
10 |
x |
Exonic overlap with reference on the opposite strand |
|
11 |
s |
An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
|
12 |
. |
(.tracking file only, indicates multiple classifications) |
|
Merging assemblies with cuffmerge
Cufflinks includes a script called cuffmerge that you can use to merge
together several Cufflinks assemblies. It handles also handles running
Cuffcompare for you, and automatically filters a number of transfrags
that are probably artfifacts. If you have a reference GTF file available,
you can provide it to the script in order to gracefully merge novel
isoforms and known isoforms and maximize overall assembly quality.
The main purpose of this script is to make it easier to make an assembly
GTF file suitable for use with Cuffdiff.
From the command line, run cuffmerge as follows:
cuffmerge [options]* <assembly_GTF_list.txt>
cuffmerge takes several assembly GTF files from Cufflinks' as input. Input GTF files must be specified in a "manifest" file listing full paths to the files.
Arguments:
|
|
<assembly_list.txt>
|
Text file "manifest" with a list (one per line) of GTF files that you'd like to merge together into a single GTF file.
|
Options:
|
|
-h/--help
|
Prints the help message and exits
|
-o <outprefix>
|
Write the summary stats into the text output file <outprefix>(instead of stdout)
|
-g/--ref-gtf
|
An optional "reference" annotation GTF. The input assemblies are merged together with
the reference GTF and included in the final output.
|
-p/--num-threads <int>
|
Use this many threads to align reads. The default is 1.
|
-s/--ref-sequence <seq_dir>/<seq_fasta>
|
This argument should point to the genomic DNA sequences for the reference.
If a directory, it should contain one fasta file per contig. If a
multifasta file, all contigs should be present. The merge script will
pass this option to cuffcompare, which will use the sequences
to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example,
Cufflinks transcripts consisting mostly of lower-case bases are classified
as repeats. Note that <seq_dir> must contain one
fasta file per reference chromosome, and each file must be named after
the chromosome, and have a .fa or .fasta extension.
|
cuffmerge Output
cuffmerge produces a GTF file that contains an assembly that merges together the input assemblies.
<outprefix>/merged.gtf
Running Cuffquant
Run cuffquant from the command line as follows:
Usage: cuffquant [options]* <annotation.(gtf/gff)> <aligned_reads.(sam/bam)>
The following is a detailed description of the options used to control
Cuffquant:
Arguments:
|
|
<annotation.(gtf/gff)>
|
Tells Cuffquant to use the supplied reference annotation (a GFF file)
to estimate isoform expression. The program
will ignore alignments not structurally compatible with any reference transcript.
|
<aligned_reads.(sam/bam)>
|
A file of RNA-Seq read alignments in the
SAM format. SAM is a standard short read alignment, that allows aligners to
attach custom tags to individual alignments, and Cuffquant requires that
the alignments you supply have some of these tags. Please see
Input formats for more details.
|
General Options:
|
|
-h/--help
|
Prints the help message and exits
|
-o/--output-dir <string>
|
Sets the name of the directory in which Cuffquant will write all of its
output. The default is "./".
|
-p/--num-threads <int>
|
Use this many threads to align reads. The default is 1.
|
-M/--mask-file <mask.(gtf/gff)>
|
Tells Cuffquant to ignore all reads that could have come from transcripts in this GTF
file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant
transcripts you wish to ignore in your analysis in this file. Due to variable efficiency
of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often
improves the overall robustness of transcript abundance estimates.
|
-b/--frag-bias-correct <genome.fa>
|
Providing Cuffquant with a multifasta file via this option instructs it to run
bias detection and correction algorithm which can significantly improve accuracy of
transcript abundance estimates. See How Cufflinks Works for more details.
|
-u/--multi-read-correct
|
Tells Cuffquant to do an initial estimation procedure to more accurately weight reads
mapping to multiple locations in the genome. See How Cufflinks Works for more details.
|
--library-type
|
See Library Types
|
Advanced Abundance Estimation Options:
|
|
-m/--frag-len-mean <int>
|
This is the expected (mean) fragment length. The default is 200bp. Note:
Cuffquant learns the fragment length mean for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
-s/--frag-len-std-dev <int>
|
The standard deviation for the distribution on fragment lengths. The default is 80bp. Note:
Cuffquant learns the fragment length standard deviation for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
--max-mle-iterations <int>
|
Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
|
--max-bundle-frags <int>
|
Sets the maximum number of fragments a locus may have before being skipped. Default: 1000000
|
--no-effective-length-correction
|
Cuffquant will not employ its "effective" length normalization to transcript FPKM.
|
--no-length-correction
|
Cuffquant will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.
|
Advanced Program Behavior Options:
|
|
-v/--verbose
|
Print lots of status updates and other diagnostic information.
|
-q/--quiet
|
Suppress messages other than serious warnings and errors.
|
--no-update-check
|
Turns off the automatic routine that contacts the Cufflinks server to check for a more
recent version.
|
Cuffquant Output
Cuffquant produces writes a single output file, abundances.cxb, into the output directory. CXB files are binary files, and can be passed to Cuffnorm or Cuffdiff for further processing.
Running Cuffdiff
Cufflinks includes a program, "Cuffdiff", that you can use to find significant changes
in transcript expression, splicing, and promoter use. From the command line, run cuffdiff as follows:
cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]
Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more
SAM files containing the fragment alignments for two or more samples.
It produces a number of output files that contain test results for
changes in expression at the level of transcripts, primary transcripts,
and genes. It also tracks changes in the relative abundance of transcripts sharing a common
transcription start site, and in the relative abundances of the primary
transcripts of each gene. Tracking the former allows one to see changes in
splicing, and the latter lets one see changes in relative promoter use
within a gene.
If you have more than one replicate for a sample, supply the SAM files for
the sample as a single comma-separated list. It is not necessary to
have the same number of replicates for each sample.
Note that Cuffdiff can also accepted BAM files (which are binary, compressed SAM files). It can also accept CXB files produced by Cuffquant. Note that mixing SAM and BAM files is supported, but you cannot currently mix CXB and SAM/BAM. If one of the samples is supplied as a CXB file, all of the samples must be supplied as CXB files.
Cuffdiff requires that transcripts in the input GTF be annotated with
certain attributes in order to look for changes in primary transcript
expression, splicing, coding output, and promoter use. These attributes
are:
Attribute |
Description |
tss_id |
The ID of this transcript's inferred start site.
Determines which primary transcript this processed transcript
is believed to come from. Cuffcompare appends this attribute to
every transcript reported in the .combined.gtf file. |
p_id |
The ID of the coding sequence this transcript contains. This
attribute is attached by Cuffcompare to the .combined.gtf records
only when it is run with a reference annotation that include CDS
records. Further, differential CDS analysis is only
performed when all isoforms of a gene have p_id
attributes, because neither Cufflinks nor Cuffcompare attempt to
assign an open reading frame to transcripts. |
Note:
If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by
Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these
attributes with a command like this:
cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf
The resulting cuffcmp.combined.gtf file created by this command will have the tss_id and
p_id attributes added to each record and this file can be used as input for cuffdiff.
Arguments:
|
|
<transcripts.(gtf/gff)>
|
A transcript annotation file produced by cufflinks, cuffcompare, or other source.
|
<sample1.(sam/bam/cxb)>
|
A SAM file of aligned RNA-Seq reads. If more than two are provided,
Cuffdiff tests for differential expression and regulation between all pairs of
samples.
|
Options:
|
|
-h/--help
|
Prints the help message and exits
|
-o/--output-dir <string>
|
Sets the name of the directory in which Cuffdiff will write all of its
output. The default is "./".
|
-L/--labels <label1,label2,...,labelN>
|
Specify a label for each sample, which will be included in various output files
produced by Cuffdiff.
|
-p/--num-threads <int>
|
Use this many threads to align reads. The default is 1.
|
-T/--time-series
|
Instructs Cuffdiff to analyze the provided samples as a time series, rather than
testing for differences between all pairs of samples. Samples should be provided in
increasing time order at the command line (e.g first time point SAM, second timepoint SAM, etc.)
|
--total-hits-norm
|
With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. It is inactive by default.
|
--compatible-hits-norm
|
With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. This option can be combined with -N/--upper-quartile-norm. Using this mode is generally recommended in Cuffdiff to reduce certain types of bias caused by differential amounts of ribosomal reads which can create the impression of falsely differentially expressed genes. It is active by default.
|
-b/--frag-bias-correct <genome.fa>
|
Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our
bias detection and correction algorithm which can significantly improve accuracy of
transcript abundance estimates. See How Cufflinks Works for more details.
|
-u/--multi-read-correct
|
Tells Cufflinks to do an initial estimation procedure to more accurately weight reads
mapping to multiple locations in the genome. See How Cufflinks Works for more details.
|
|
-c/--min-alignment-count <int>
|
The minimum number of alignments in a locus for needed to conduct
significance testing on changes in that locus observed between samples.
If no testing is performed, changes in the locus are deemed not signficant,
and the locus' observed changes don't contribute to correction for
multiple testing. The default is 10 fragment alignments.
|
-M/--mask-file <mask.(gtf/gff)>
|
Tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF
file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant
transcripts you wish to ignore in your analysis in this file. Due to variable efficiency
of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often
improves the overall robustness of transcript abundance estimates.
|
--FDR <float>
|
The allowed false discovery rate. The default is 0.05.
|
--library-type
|
See Library Types
|
--library-norm-method
|
See Library Normalization Methods
|
--dispersion-method
|
See Cross-replicate dispersion estimation methods
|
Advanced Options:
|
|
-m/--frag-len-mean <int>
|
This is the expected (mean) fragment length. The default is 200bp. Note:
Cufflinks now learns the fragment length mean for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
-s/--frag-len-std-dev <int>
|
The standard deviation for the distribution on fragment lengths. The default is 80bp. Note:
Cufflinks now learns the fragment length standard deviation for each SAM file, so using this option is
no longer recommended with paired-end reads.
|
--num-importance-samples <int>
|
Deprecated
|
--max-mle-iterations <int>
|
Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
|
-v/--verbose
|
Print lots of status updates and other diagnostic information.
|
-q/--quiet
|
Suppress messages other than serious warnings and errors.
|
--no-update-check
|
Turns off the automatic routine that contacts the Cufflinks server to check for a more
recent version.
|
--poisson-dispersion
|
Use the Poisson fragment dispersion model instead of learning one
in each condition.
|
--emit-count-tables
|
Cuffdiff will output a file for each condition (called <sample>_counts.txt)
containing the fragment counts, fragment count variances, and fitted
variance model. For internal debugging only. This option will be removed
in a future version of Cuffdiff.
|
-F/--min-isoform-fraction <0.0-1.0>
|
Cuffdiff will round down to zero the abundance of alternative isoforms quantified at below
the specified fraction of the major isoforms. This is done after MLE estimation
but before MAP estimation to improve robustness of confidence interval
generation and differential expression analysis. The default is 1e-5, and
we recommend you not alter this parameter.
|
--max-bundle-frags <int>
|
Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are marked with status HIDATA. Default: 1000000
|
--max-frag-count-draws <int>
|
Cuffdiff will make this many draws from each transcript's predicted negative binomial
random numbder generator. Each draw is a number of fragments that will be
probabilistically assigned to the transcripts in the transcriptome. Used to estimate
the variance-covariance matrix on assigned fragment counts. Default: 100.
|
--max-frag-assign-draws <int>
|
For each fragment drawn from a transcript, Cuffdiff will assign it this
many times (probabilistically), thus estimating the assignment uncertainty
for each transcript. Used to estimate
the variance-covariance matrix on assigned fragment counts. Default: 50.
|
-F/--min-outlier-p <0.0-1.0>
|
DEPRECATED
|
--min-reps-for-js-test <int>
|
Cuffdiff won't test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.
|
--no-effective-length-correction
|
Cuffdiff will not employ its "effective" length normalization to transcript FPKM.
|
--no-length-correction
|
Cuffdiff will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.
|
Cuffdiff Output
1) FPKM tracking files
Cuffdiff calculates the FPKM of each transcript, primary transcript,
and gene in each sample. Primary transcript and gene FPKMs are computed
by summing the FPKMs of transcripts in each primary transcript group
or gene group. The results are output in FPKM tracking files in the format
described here.
There are four FPKM tracking files:
isoforms.fpkm_tracking | Transcript FPKMs |
genes.fpkm_tracking | Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id |
cds.fpkm_tracking | Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id |
tss_groups.fpkm_tracking | Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id |
2) Count tracking files
Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript,
and gene in each sample. Primary transcript and gene counts are computed
by summing the counts of transcripts in each primary transcript group
or gene group. The results are output in count tracking files in the format
described here.
There are four Count tracking files:
isoforms.count_tracking | Transcript counts |
genes.count_tracking | Gene counts. Tracks the summed counts of transcripts sharing each gene_id |
cds.count_tracking | Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id |
tss_groups.count_tracking | Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id |
3) Read group tracking files
Cuffdiff calculates the expression and fragment count for each transcript, primary transcript,
and gene in each replicate. The results are output in per-replicate tracking files in the format
described here.
There are four read group tracking files:
isoforms.read_group_tracking | Transcript read group tracking |
genes.read_group_tracking | Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate |
cds.read_group_tracking | Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate |
tss_groups.read_group_tracking | Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate |
4) Differential expression tests
This tab delimited file lists the results of differential
expression testing between samples for spliced transcripts,
primary transcripts, genes, and coding sequences.
For each pair of samples x and y, four files are created
isoform_exp.diff | Transcript differential FPKM. |
gene_exp.diff | Gene differential FPKM. Tests difference sin the summed FPKM of transcripts sharing each gene_id |
tss_group_exp.diff | Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id |
cds_exp.diff | Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id |
Each of the above files has the following format:
Column number | Column name | Example | Description |
1 | Tested id | XLOC_000001 | A unique identifier describing the transcipt, gene, primary transcript, or CDS being tested |
2 | gene | Lypla1 | The gene_name(s) or gene_id(s) being tested |
3 | locus | chr1:4797771-4835363 | Genomic coordinates for easy browsing to the genes or transcripts being tested. |
4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
6 | Test status | NOTEST | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing. |
7 | FPKMx | 8.01089 | FPKM of the gene in sample x |
8 | FPKMy | 8.551545 | FPKM of the gene in sample y |
9 | log2(FPKMy/FPKMx) | 0.06531 | The (base 2) log of the fold change y/x |
|
10 | test stat | 0.860902 | The value of the test statistic used to compute significance of the observed change in FPKM |
11 | p value | 0.389292 | The uncorrected p-value of the test statistic |
12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
13 | significant | no | Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing |
5) Differential splicing tests - splicing.diff
This tab delimited file lists, for each primary transcript, the amount of overloading
detected among its isoforms, i.e. how much differential splicing exists between
isoforms processed from a single primary transcript. Only primary
transcripts from which two or more isoforms are spliced are listed in this file.
Column number | Column name | Example | Description |
1 | Tested id | TSS10015 | A unique identifier describing the primary transcript being tested. |
2 | gene name | Rtkn | The gene_name or gene_id that the primary transcript being tested belongs to |
3 | locus | chr6:83087311-83102572 | Genomic coordinates for easy browsing to the genes or transcripts being tested. |
4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
6 | Test status | OK | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing. |
7 | Reserved | 0 | |
8 | Reserved | 0 | |
9 | √JS(x,y) | 0.22115 | The splice overloading of the primary transcript, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants |
|
10 | test stat | 0.22115 | The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y) |
11 | p value | 0.000174982 | The uncorrected p-value of the test statistic. |
12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
13 | significant | yes | Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing |
6) Differential coding output - cds.diff
This tab delimited file lists, for each gene, the amount of overloading
detected among its coding sequences, i.e. how much differential
CDS output exists between samples. Only genes producing two or more
distinct CDS (i.e. multi-protein genes) are listed here.
Column number | Column name | Example | Description |
1 | Tested id | XLOC_000002-[chr1:5073200-5152501] | A unique identifier describing the gene being tested. |
2 | gene name | Atp6v1h | The gene_name or gene_id |
3 | locus | chr1:5073200-5152501 | Genomic coordinates for easy browsing to the genes or transcripts being tested. |
4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
6 | Test status | OK | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing. |
7 | Reserved | 0 | |
8 | Reserved | 0 | |
9 | √JS(x,y) | 0.0686517 | The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences |
|
10 | test stat | 0.0686517 | The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y) |
11 | p value | 0.00546783 | The uncorrected p-value of the test statistic |
12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
13 | significant | yes | Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing |
This tab delimited file lists, for each gene, the amount of overloading
detected among its primary transcripts, i.e. how much differential
promoter use exists between samples. Only genes producing two or more
distinct primary transcripts (i.e. multi-promoter genes) are listed here.
Column number | Column name | Example | Description |
1 | Tested id | XLOC_000019 | A unique identifier describing the gene being tested. |
2 | gene name | Tmem70 | The gene_name or gene_id |
3 | locus | chr1:16651657-16668357 | Genomic coordinates for easy browsing to the genes or transcripts being tested. |
4 | sample 1 | Liver | Label (or number if no labels provided) of the first sample being tested |
5 | sample 2 | Brain | Label (or number if no labels provided) of the second sample being tested |
6 | Test status | OK | Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing. |
7 | Reserved | 0 | |
8 | Reserved | 0 | |
9 | √JS(x,y) | 0.0124768 | The promoter overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the primary transcripts |
|
10 | test stat | 0.0124768 | The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y) |
11 | p value | 0.394327 | The uncorrected p-value of the test statistic |
12 | q value | 0.985216 | The FDR-adjusted p-value of the test statistic |
13 | significant | no | Can be either "yes" or "no", depending on whether p is greater then the FDR after Benjamini-Hochberg correction for multiple-testing |
8) Read group info - read_groups.info
This tab delimited file lists, for each replicate, key properties used by Cuffdiff during quantification, such as library normalization factors.
The read_groups.info file has the following format:
Column number | Column name | Example | Description |
1 | file | mCherry_rep_A/accepted_hits.bam | BAM or SAM file containing the data for the read group |
2 | condition | mCherry | Condition to which the read group belongs |
3 | replicate_num | 0 | Replicate number of the read group |
4 | total_mass | 4.72517e+06 | Total number of fragments for the read group |
5 | norm_mass | 4.72517e+06 | Fragment normalization constant used during calculation of FPKMs. |
6 | internal_scale | 1.23916 | Internal scaling factor, used to transform replicates of a single condition onto the "internal" common count scale. |
7 | external_scale | 0.96 | External scaling factor, used to transform counts from different conditions onto an internal common count scale. |
8) Run info - run.info
This tab delimited file lists various bits of information about a Cuffdiff run to help track what options were provided. For example:
param value
cmd_line cuffdiff base_ref.gtf mCherry/accepted_hits.bam R49/accepted_hits.bam
version 2.0.0
SVN_revision 3258
boost_version 104700
Running Cuffnorm
Cufflinks includes a program, "Cuffnorm", that you can use to generate tables of expression values that are properly normalized for library size. From the command line, run cuffnorm as follows:
cuffnorm [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]
Running Cuffnorm is very similar to running Cuffdiff. Cuffnorm takes a GTF2/GFF3 file of transcripts as input, along with two or more
SAM, BAM, or CXB files for two or more samples.
It produces a number of output files that contain expression levels and normalized fragment counts at the level of transcripts, primary transcripts,
and genes. It also tracks changes in the relative abundance of transcripts sharing a common
transcription start site, and in the relative abundances of the primary
transcripts of each gene. Tracking the former allows one to see changes in
splicing, and the latter lets one see changes in relative promoter use
within a gene.
If you have more than one replicate for a sample, supply the SAM files for
the sample as a single comma-separated list. It is not necessary to
have the same number of replicates for each sample.
Note that Cuffnorm can also accepted BAM files (which are binary, compressed SAM files). It can also accept CXB files produced by Cuffquant. Note that mixing SAM and BAM files is supported, but you cannot currently mix CXB and SAM/BAM. If one of the samples is supplied as a CXB file, all of the samples must be supplied as CXB files.
Cuffnorm also requires a GFF/GTF file, conforming to the same specifications as needed for Cuffdiff.
Arguments:
|
|
<transcripts.(gtf/gff)>
|
A transcript annotation file produced by cufflinks, cuffcompare, or other source.
|
<sample1.(sam/bam/cxb)>
|
A SAM file of aligned RNA-Seq reads. If more than two are provided,
Cuffdiff tests for differential expression and regulation between all pairs of
samples.
|
Options:
|
|
-h/--help
|
Prints the help message and exits
|
-o/--output-dir <string>
|
Sets the name of the directory in which Cuffdiff will write all of its
output. The default is "./".
|
-L/--labels <label1,label2,...,labelN>
|
Specify a label for each sample, which will be included in various output files
produced by Cuffdiff.
|
-p/--num-threads <int>
|
Use this many threads to align reads. The default is 1.
|
--total-hits-norm
|
With this option, Cuffquant counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. It is inactive by default.
|
--compatible-hits-norm
|
With this option, Cuffnorm counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. It is active by default.
|
--library-type
|
See Library Types
|
--library-norm-method
|
See Library Normalization Methods
|
--output-format
|
See Output format options
|
Advanced Options:
|
|
-v/--verbose
|
Print lots of status updates and other diagnostic information.
|
-q/--quiet
|
Suppress messages other than serious warnings and errors.
|
--no-update-check
|
Turns off the automatic routine that contacts the Cufflinks server to check for a more
recent version.
|
Cuffnorm Output
Cuffnorm outputs a set of files containing normalized expression levels for each gene, transcript, TSS group, and CDS group in the experiment. It does not perform differential expression analysis. To assess the significance of changes in expression for genes and transcripts between conditions, use Cuffdiff. Cuffnorm's output files are useful when you have many samples and you simply want to cluster them or plot expression levels of genes important in your study.
By default, Cuffnorm reports expression levels in the "simple-table" tab-delimted text files. The program also reports information about your samples and about the genes, transcripts, TSS groups, and CDS groups as tab delimited text files. Note that these files have a different format than the files used by Cuffdiff. However, you can direct Cuffnorm to report its output in the same format as used by Cuffdiff if you wish. Simply supply the option --output-format cuffdiff at the command line.
Cuffnorm will report both FPKM values and normalized, estimates for the number of fragments that originate from each gene, transcript, TSS group, and CDS group. Note that because these counts are already normalized to account for differences in library size, they should not be used with downstream differential expression tools that require raw counts as input.
To see the details of the simple table format used by Cuffnorm, refer to the simple table expression format, simple table sample attribute format, and simple table feature (e.g. gene) attribute format sections below.
Sample sheets for Cuffdiff and Cuffnorm
Both Cuffdiff and Cuffnorm can be run by specifying a list of SAM, BAM, or CXB files at the command line. For analysis with many such files, specifying them in this way can be cumbersome and error-prone. Both programs also allow you to specify these inputs in a simple, tab-delimited table. Create a file called sample_sheet.txt or another name of your choice, and specify samples as follows, one per line:
Column number | Column name | Example | Description |
1 | sample_id | C1_R1.sam | the path to the SAM/BAM/CXB file for this sample |
2 | group_label | C1 | The condition label for this sample. Replicates of a condition should share the same label |
To run Cuffdiff or Cuffnorm with a sample sheet, create one and then at the command line, provide the --use-sample-sheet option and replace the list of SAM/BAM/CXB files with the name of your sample sheet file, as follows:
cuffdiff --use-sample-sheet <transcripts.gtf> <sample_sheet.txt>
An example sample sheet might look like this:
sample_id group_label
C1_R1.sam C1
C1_R2.sam C1
C2_R1.sam C2
C2_R2.sam C2
Contrast files for Cuffdiff
Cuffdiff, by default, compares each pair of conditions in your experiment. If you have many conditions, this can create a lot of additional work for the program. These extra conditions can cause Cuffdiff's output files to be very large, which can slow down CummeRbund and other downstream analysis software. Often, you are not interested in all pairwise contrasts. Rather, you'd like to compare all conditions to a common control, or only look at matched pairs of samples. You can specify the contrasts Cuffdiff should perform using a contrast file.
Contrast files are simple, tab delimited text files. They should have a single header line as the first line in the file, followed by one line for each contrast you'd like to perform. The files should have two columns, as specified below:
Column number | Column name | Example | Description |
1 | condition_A | Ctrl | A condition label. Must match one of the labels specified through -L or in the sample sheet. |
1 | condition_B | Ctrl | A condition label. Must match one of the labels specified through -L or in the sample sheet. |
To run Cuffdiff with a contrast file, simply provide the -C <contrasts.txt> option at the command line.
An example table might look like this:
condition_A condition_B
Ctrl Mutant_X
Ctrl Mutant_Y
Ctrl Mutant_Z
Cufflinks, Cuffdiff, and Cuffnorm all output numerous types of files. They fall into one of the formats described below:
FPKM tracking files use a generic format to output estimated expression values. Each FPKM tracking file has the following format:
Column number | Column name | Example | Description |
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript) |
2 | class_code | = | The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present |
3 | nearest_ref_id | NM_008866.1 | The reference transcript to which the class code refers, if any |
4 | gene_id | NM_008866 | The gene_id(s) associated with the object |
5 | gene_short_name | Lypla1 | The gene_short_name(s) associated with the object |
6 | tss_id | TSS1 | The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present |
7 | locus | chr1:4797771-4835363 | Genomic coordinates for easy browsing to the object |
8 | length | 2447 | The number of base pairs in the transcript, or '-' if not a transcript/primary transcript |
9 | coverage | 43.4279 | Estimate for the absolute depth of read coverage across the object |
10 | q0_FPKM | 8.01089 | FPKM of the object in sample 0 |
11 | q0_FPKM_lo | 7.03583 | the lower bound of the 95% confidence interval on the FPKM of the object in sample 0 |
12 | q0_FPKM_hi | 8.98595 | the upper bound of the 95% confidence interval on the FPKM of the object in sample 0 |
13 | q0_status | OK | Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
14 | q1_FPKM | 8.55155 | FPKM of the object in sample 1 |
15 | q1_FPKM_lo | 7.77692 | the lower bound of the 95% confidence interval on the FPKM of the object in sample 0 |
16 | q1_FPKM_hi | 9.32617 | the upper bound of the 95% confidence interval on the FPKM of the object in sample 1 |
17 | q1_status | 9.32617 | the upper bound of the 95% confidence interval on the FPKM of the object in sample 1. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
3N + 12 | qN_FPKM | 7.34115 | FPKM of the object in sample N |
3N + 13 | qN_FPKM_lo | 6.33394 | the lower bound of the 95% confidence interval on the FPKM of the object in sample N |
3N + 14 | qN_FPKM_hi | 8.34836 | the upper bound of the 95% confidence interval on the FPKM of the object in sample N |
3N + 15 | qN_status | OK | Quantification status for the object in sample N. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
Count tracking files use a generic format to output estimated fragment count values. Each Count tracking file has the following format:
Column number | Column name | Example | Description |
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript) |
2 | q0_count | 201.334 | Estimated (externally scaled) number of fragments generated by the object in sample 0 |
3 | q0_count_variance | 5988.24 | Estimated variance in the number of fragments generated by the object in sample 0 |
4 | q0_count_uncertainty_var | 170.21 | Estimated variance in the number of fragments generated by the object in sample 0 due to fragment assignment uncertainty. |
5 | q0_count_dispersion_var | 4905.63 | Estimated variance in the number of fragments generated by the object in sample 0 due to cross-replicate variability. |
6 | q0_status | OK | Quantification status for the object in sample 0. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
7 | q1_count | 201.334 | Estimated (externally scaled) number of fragments generated by the object in sample 1 |
8 | q1_count_variance | 5988.24 | Estimated variance in the number of fragments generated by the object in sample 1 |
9 | q1_count_uncertainty_var | 170.21 | Estimated variance in the number of fragments generated by the object in sample 1 due to fragment assignment uncertainty. |
10 | q1_count_dispersion_var | 4905.63 | Estimated variance in the number of fragments generated by the object in sample 1 due to cross-replicate variability. |
11 | q1_status | OK | Quantification status for the object in sample 1. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
7 | qN_count | 201.334 | Estimated (externally scaled) number of fragments generated by the object in sample N |
4N + 6 | qN_count_variance | 7.34115 | Estimated variance in the number of fragments generated by the object in sample N |
4N + 7 | qN_count_uncertainty_var | 6.33394 | Estimated variance in the number of fragments generated by the object in sample N due to fragment assignment uncertainty. |
4N + 8 | qN_count_dispersion_var | 8.34836 | Estimated variance in the number of fragments generated by the object in sample N due to cross-replicate variability. |
4N + 9 | qN_status | OK | Quantification status for the object in sample N. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
Read group tracking files record per-replicate expression and count data. Each Count tracking file has the following format:
Column number | Column name | Example | Description |
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript) |
2 | condition | Fibroblasts | Name of the condition |
3 | replicate | 1 | Name of the replicate of the condition |
4 | raw_frags | 170.21 | The estimate number of (unscaled) fragments originating from the object in this replicate |
5 | internal_scaled_frags | 4905.63 | Estimated number of fragments originating from the object, after transforming to the internal common count scale (for comparison between replicates of this condition.) |
6 | external_scaled_frags | 99.21 | Estimated number of fragments originating from the object, after transforming to the external common count scale (for comparison between conditions) |
7 | FPKM | 201.334 | FPKM of this object in this replicate |
8 | effective_length | 5988.24 | The effective length of the object in this replicate. Currently a reserved, unreported field. |
9 | status | OK | Quantification status for the object. Can be one of OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents deconvolution. |
Cuffnorm reports two different types of files with this format: *.fpkm_table files and *.count_table files for each group of features: genes, transcripts, TSS groups, and CDS groups. The files start with a column indicating the feature ID for each row. There is one subsequent column for each sample in the analysis:
Column number | Column name | Example | Description |
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript) |
2 | conditionX_N | = | The FPKM value (for *.fpkm_table files) or normalized fragment count (for *.count_table files) for this feature in replicate N of conditionX |
Cuffdiff reports metadata for each gene, transcript, TSS group, and CDS group in the following tab delimited format:
Column number | Column name | Example | Description |
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript) |
2 | class_code | = | The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present |
3 | nearest_ref_id | NM_008866.1 | The reference transcript to which the class code refers, if any |
4 | gene_id | NM_008866 | The gene_id(s) associated with the object |
5 | gene_short_name | Lypla1 | The gene_short_name(s) associated with the object |
6 | tss_id | TSS1 | The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present |
7 | locus | chr1:4797771-4835363 | Genomic coordinates for easy browsing to the object |
8 | length | 2447 | The number of base pairs in the transcript, or '-' if not a transcript/primary transcript |
Cuffnorm reports some information about each sample (i.e. each SAM, BAM, or CXB file) in the analysis in the following format:
Column number | Column name | Example | Description |
1 | sample_id | q1_0 | A unique identifier describing the sample. Has the format conditionX_N, meaning replicate N of conditionX. |
2 | file | C1_R1.sam | The path to the file (SAM/BAM/CXB) attribute for the sample |
3 | total_mass | 94610 | The total (un-normalized) number of fragment alignments for this sample |
4 | internal_scale | 1.0571 | The scaling factor used to normalize this sample library size. |
5 | external_scale | 1 | Reserved |
Library Types
In cases where Cufflinks cannot determine the platform and protocol used to generate
input reads, you can supply this information manually, which will allow Cufflinks to
infer source strand information with certain protocols. The available options are listed below.
For paired-end data, we currently only support protocols where reads are point towards each other.
Library Type | Examples | Description |
fr-unstranded (default) | Standard Illumina | Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand. |
fr-firststrand | dUTP, NSR, NNSR | Same as above except we enforce the rule that the right-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during first strand synthesis is sequenced. |
fr-secondstrand | Directional Illumina (Ligation), Standard SOLiD | Same as above except we enforce the rule that the left-most end of the fragment (in transcript coordinates) is the first sequenced (or only sequenced for single-end reads). Equivalently, it is assumed that only the strand generated during second strand synthesis is sequenced. |
Please contact tophat.cufflinks@gmail.com to request support for a new protocol.
Library Normalization Methods
You can control how library sizes (i.e. sequencing depths) are normalized in Cufflinks and Cuffdiff. Cuffdiff has several methods that require multiple libraries in order to work. Library normalization methods supported by Cufflinks work on one library at a time.
Normalization Method | Supported by Cufflinks | Supported by Cuffdiff | Description |
classic-fpkm | Yes | Yes | Library size factor is set to 1 - no scaling applied to FPKM values or fragment counts. (default for Cufflinks) |
geometric | No | Yes | FPKMs and fragment counts are scaled via the median of the geometric means of fragment counts across all libraries, as described in Anders and Huber (Genome Biology, 2010). This policy identical to the one used by DESeq. (default for Cuffdiff) |
quartile | No | Yes | FPKMs and fragment counts are scaled via the ratio of the 75 quartile fragment counts to the average 75 quartile value across all libraries. |
Cross-replicate dispersion estimation methods
Cuffdiff works by modeling the variance in fragment counts across replicates as a function of the mean fragment count across replicates. Strictly speaking, models a quantitity called dispersion - the variance present in a group of samples beyond what is expected from a simple Poisson model of RNA_Seq. You can control how Cuffdiff constructs its model of dispersion in locus fragment counts. Each condition that has replicates can receive its own model, or Cuffdiff can use a global model for all conditions. All of these policies are identical to those used by DESeq (Anders and Huber, Genome Biology, 2010)
Dispersion Method | Description |
pooled | Each replicated condition is used to build a model, then these models are averaged to provide a single global model for all conditions in the experiment. (Default) |
per-condition | Each replicated condition receives its own model. Only available when all conditions have replicates. |
blind | All samples are treated as replicates of a single global "condition" and used to build one model. |
poisson | The Poisson model is used, where the variance in fragment count is predicted to equal the mean across replicates. Not recommended. |
Which method you choose largely depends on whether you expect variability in each group of samples to be similar. For example, if you are comparing two groups, A and B, where A has low cross-replicate variability and B has high variability, it may be best to choose per-condition. However, if the conditions have similar levels of variability, you might stick with the default, which sometimes provides a more robust model, especially in cases where each group has few replicates. Finally, if you only have a single replicate in each condition, you must use blind, which treats all samples in the experiment as replicates of a single condition. This method works well when you expect the samples to have very few differentially expressed genes. If there are many differentially expressed genes, Cuffdiff will construct an overly conservative model and you may not get any significant calls. In this case, you will need more replicates in your experiment.
|