edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Abstract
Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions.
Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).
Contact: mrobinson@wehi.edu.au
1 INTRODUCTION
Modern molecular biology data present major challenges for the statistical methods that are used to detect differential expression, such as the need for multiple testing procedures and, increasingly, empirical Bayes or similar methods that share information across all observations to improve inference. For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, effectively a continuous response, whereas for digital gene expression (DGE) data the abundance is observed as a count. Therefore, procedures that are successful for microarray data are not directly applicable to DGE data.
This note describes the software package
2 MODEL
Bioinformatics researchers have learned many things from the analysis of microarray data. For instance, power to detect differential expression can be improved and false discoveries reduced by sharing information across all probes. One such procedure is
We assume the data can be summarized into a table of counts, with rows corresponding to genes (or tags or exons or transcripts) and columns to samples. For RNA-seq experiments, these may be counts at the exon, transcript or gene level. We model the data as negative binomial (NB) distributed,

$$Y_{gi} \sim \mathrm{NB}(M_i p_{gj}, \phi_g) \qquad (1)$$

for gene g and sample i, where Ygi denotes the observed count. Here, Mi is the library size (total number of reads), ϕg is the dispersion and pgj is the relative abundance of gene g in experimental group j to which sample i belongs. We use the NB parameterization where the mean is μgi = Mi pgj and the variance is μgi(1 + μgi ϕg). For differential expression analysis, the parameters of interest are pgj.
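The parameterization above can be made concrete with a minimal sketch. This is not the package's implementation: the function names and numeric values below are hypothetical, and the probability mass function is the standard NB form with size 1/ϕ, chosen because it gives mean μ and variance μ(1 + μϕ). The sketch computes the mean and variance for one gene in one sample, then checks numerically that the mass function reproduces them.

```python
import math

def nb_mean_var(library_size, rel_abundance, dispersion):
    """Mean and variance of a count under the NB parameterization in the text."""
    mu = library_size * rel_abundance       # mu_gi = M_i * p_gj
    var = mu * (1.0 + mu * dispersion)      # mu_gi * (1 + mu_gi * phi_g)
    return mu, var

def nb_log_pmf(y, mu, dispersion):
    """log P(Y = y) for an NB with mean mu and dispersion phi > 0 (size r = 1/phi)."""
    r = 1.0 / dispersion
    log_coef = math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
    # (mu*phi / (1 + mu*phi))^y * (1 / (1 + mu*phi))^r, written on the log scale
    return (log_coef
            + y * math.log(mu * dispersion)
            - (y + r) * math.log1p(mu * dispersion))

# Hypothetical values: one million reads, relative abundance 1e-4, dispersion 0.1.
mu, var = nb_mean_var(library_size=1_000_000, rel_abundance=1e-4, dispersion=0.1)

# Numerical check: the pmf should sum to ~1 and recover the stated mean and variance.
probs = [math.exp(nb_log_pmf(y, mu, 0.1)) for y in range(3000)]
total = sum(probs)
mean_check = sum(y * p for y, p in enumerate(probs))
var_check = sum((y - mean_check) ** 2 * p for y, p in enumerate(probs))
print(mu, var, total, mean_check, var_check)
```

With these illustrative inputs the variance is eleven times the mean, which is the kind of overdispersion a Poisson model cannot represent.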
The NB distribution reduces to Poisson when ϕg = 0. In some DGE applications, technical variation can be treated as Poisson. In general, ϕg represents the squared coefficient of variation of the biological variation between the samples, so that √ϕg is the biological coefficient of variation. In this way, our model is able to separate biological from technical variation.
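One way to see this separation is to divide the NB variance μ(1 + μϕ) by μ², which gives a squared coefficient of variation of 1/μ + ϕ. The sketch below illustrates that split under the assumption that the 1/μ term is the Poisson (technical) part and ϕ the biological part. The function names and numbers are hypothetical and chosen only for illustration.

```python
import math

def nb_variance(mu, dispersion):
    """NB variance under the parameterization in the text: mu * (1 + mu * phi)."""
    return mu * (1.0 + mu * dispersion)

def squared_cv_parts(mu, dispersion):
    """Split the squared coefficient of variation, Var / mu^2 = 1/mu + phi."""
    technical = 1.0 / mu       # Poisson contribution, shrinks as counts grow
    biological = dispersion    # NB dispersion, independent of the mean
    return technical, biological

# Hypothetical gene with mean count 1000 and dispersion 0.04.
mu, phi = 1000.0, 0.04
technical, biological = squared_cv_parts(mu, phi)
total_cv2 = nb_variance(mu, phi) / mu ** 2

print(total_cv2, technical + biological)  # the two expressions agree
print(math.sqrt(biological))              # biological coefficient of variation

# With zero dispersion the variance equals the mean, as for a Poisson count.
poisson_like_variance = nb_variance(50.0, 0.0)
print(poisson_like_variance)
```

For highly expressed genes the 1/μ term becomes negligible, so the total variability is dominated by ϕ; for low counts the Poisson term dominates.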
3 FEATURES
The required inputs for
For users of
A number of features have been added to the
Many of the early RNA-seq datasets involve sequence reads from technical replicates (e.g. same source of RNA) as opposed to biological replicates (e.g. RNA from different individuals). Technical replicates will generally have lower variability than biological replicates and in our experience, the dispersion parameter (and the moderation procedure in
4 DISCUSSION
We have developed a Bioconductor package
Funding: National Health and Medical Research Council Program (Grant 406657 to G.K.S.); NHMRC, Independent Research Institutes Infrastructure Support Scheme (Grant 361646); Victorian State Government OIS grant (awarded to the WEHI); a Melbourne International Research Scholarship (to M.D.R.); Belz, Harris and IBS Honours scholarships (to D.J.M.).
Conflict of Interest: none declared.