Shilin Zhao, Chung-I Li, Yan Guo, Quanhu Sheng, Yu Shyr.
Abstract
BACKGROUND: One of the most important and often neglected components of a successful RNA sequencing (RNA-Seq) experiment is sample size estimation. A few negative binomial model-based methods have been developed to estimate sample size based on the parameters of a single gene. However, thousands of genes are quantified and tested for differential expression simultaneously in RNA-Seq experiments. Thus, additional issues should be carefully addressed, including the false discovery rate for multiple statistical tests and the widely varying read counts and dispersions of different genes.
Keywords: Power analysis; RNA-Seq; Sample size; Simulation
Year: 2018 PMID: 29843589 PMCID: PMC5975570 DOI: 10.1186/s12859-018-2191-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 RnaSeqSampleSize package workflow
Fig. 2 Read count and dispersion distributions greatly influence the estimated sample size and power. (a) The read count and dispersion distributions for all genes from the TCGA Rectum adenocarcinoma (READ) dataset. The red lines indicate read counts equal to 1 and 10, and the green line indicates the 95% quantile of all gene dispersions. (b) The estimated sample size required to achieve 0.8 power for different combinations of read counts and dispersions
Fig. 3 Sample size estimation with real data. (a) The read count distribution for all genes from the TCGA Breast Invasive Carcinoma (BRCA) and Rectum adenocarcinoma (READ) datasets. (b) The dispersion distribution for all genes from the TCGA BRCA and READ datasets. (c) The power distribution based on the count and dispersion distributions in the TCGA BRCA dataset when the sample size equals 71. The red lines indicate the mean value of the power distribution. (d) The power distribution based on the count and dispersion distributions in the TCGA READ dataset when the sample size equals 71. The red lines indicate the mean value of the power distribution
Fig. 4 Sample size estimation for genes of interest. (a) The read count distribution for genes in three KEGG pathways in the TCGA READ dataset. (b) The dispersion distribution for genes in three KEGG pathways in the TCGA READ dataset. (c) The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Calcium signaling pathway when the sample size equals 71. The red lines indicate the mean value of the power distribution. (d) The power distribution based on the count and dispersion distributions in the TCGA READ dataset for genes in the Proteasome pathway when the sample size equals 71. The red lines indicate the mean value of the power distribution
Fig. 5 Power curve visualization and parameter optimization by RnaSeqSampleSize. (a) Power curves for balanced (same sample size in the two groups) and unbalanced (different sample sizes in the two groups) experiment designs. The power curves indicate that the balanced design (red line) achieves the highest power for the same total number of samples. (b) Optimization of parameters in sample size estimation. The dispersion and fold change were set to 0.5 and 2, respectively. A power matrix was generated for different pairs of sample numbers and read counts. The power distribution indicates that the number of samples plays a more significant role in determining the power, and suggests that at least 96 samples should be used in RNA-Seq experiments with these parameters to reach 0.8 power
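To make the idea of a power matrix over sample numbers and read counts concrete, the following is a minimal Python sketch. It is not the RnaSeqSampleSize implementation. It assumes a simple normal approximation for the log fold change of a single gene under a negative binomial model, a balanced two-group design, and an unadjusted per-gene significance level `alpha` (no false discovery rate control). The function names `approx_power`, `approx_sample_size`, and `power_matrix` are hypothetical, and the numbers it produces are not expected to match the figures above.

```python
from math import ceil, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(n_per_group, mean_count, dispersion, fold_change, alpha=0.05):
    """Approximate two-sided power for one gene in a balanced two-group design.

    Assumption (not taken from the source): the log mean count of a group of
    size n has variance (1 / mean_count + dispersion) / n, and the same
    mean_count is used for both groups as a simplification.
    """
    se = sqrt(2.0 * (1.0 / mean_count + dispersion) / n_per_group)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    effect = abs(log(fold_change)) / se
    # The small chance of rejecting in the wrong direction is ignored.
    return _STD_NORMAL.cdf(effect - z_crit)


def approx_sample_size(mean_count, dispersion, fold_change, alpha=0.05, power=0.8):
    """Smallest per-group sample size reaching the target power (same approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    variance_term = 1.0 / mean_count + dispersion
    n = 2.0 * (z_alpha + z_beta) ** 2 * variance_term / log(fold_change) ** 2
    return ceil(n)


def power_matrix(sample_sizes, mean_counts, dispersion, fold_change, alpha=0.05):
    """Map each (n_per_group, mean_count) pair to its approximate power."""
    return {
        (n, mu): approx_power(n, mu, dispersion, fold_change, alpha)
        for n in sample_sizes
        for mu in mean_counts
    }


if __name__ == "__main__":
    # Dispersion 0.5 and fold change 2, as in the Fig. 5b caption; the grid
    # values below are arbitrary illustration choices.
    grid = power_matrix([10, 20, 40, 80], [1, 10, 100], dispersion=0.5, fold_change=2)
    for (n, mu), p in sorted(grid.items()):
        print(f"n per group = {n:3d}, mean count = {mu:4d}, approx. power = {p:.3f}")
```

In this approximation the read count enters only through the term 1 / mean_count, which becomes small next to the dispersion once counts are moderately large, so further gains in power then come mainly from adding samples. Controlling the false discovery rate across thousands of genes would require a stricter per-gene `alpha` than the default used here, and therefore more samples.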