Nysia I George1, John F Bowyer2, Nathaniel M Crabtree3, Ching-Wei Chang1.
Abstract
The discrete data structure and large sequencing depth of RNA sequencing (RNA-seq) experiments can often generate outlier read counts in one or more RNA samples within a homogeneous group. Thus, how to identify and manage outlier observations in RNA-seq data is an emerging topic of interest. One of the main objectives in these research efforts is to develop statistical methodology that effectively balances the impact of outlier observations and achieves maximal power for statistical testing. To reach that goal, strengthening the accuracy of outlier detection is an important precursor. Current outlier detection algorithms for RNA-seq data are executed within a testing framework and may be sensitive to sparse data and heavy-tailed distributions. Therefore, we propose a univariate algorithm that uses a probabilistic approach to measure the deviation between an observation and the distribution generating the remaining data, and we implement it within an iterative leave-one-out design strategy. Analyses of real and simulated RNA-seq data show that the proposed methodology has higher outlier detection rates for both non-normalized and normalized negative binomial distributed data.
Year: 2015 PMID: 26039068 PMCID: PMC4454687 DOI: 10.1371/journal.pone.0125224
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Flowchart of the iLOO algorithm.
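The abstract describes the approach only at a high level: each observation is compared with the distribution generating the remaining data, inside an iterative leave-one-out loop. The following is a minimal sketch of that idea for the counts of a single feature, not the authors' implementation. Several details are assumptions that do not come from the source: the negative binomial is fitted by the method of moments (with a Poisson fallback when the remaining counts are not overdispersed), the probability cutoff `threshold` is a user-supplied placeholder, and the names `nb_log_pmf`, `loo_log_prob`, and `iloo_sketch` are hypothetical.

```python
import math
from statistics import mean, variance


def nb_log_pmf(k, mu, size):
    """Log-probability of count k under a negative binomial with
    mean mu and variance mu + mu**2 / size."""
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )


def loo_log_prob(k, rest):
    """Log-probability of the held-out count k under a distribution
    fitted to the remaining counts (method of moments; an assumption)."""
    mu = mean(rest)
    var = variance(rest)  # sample variance, needs at least 2 values
    if mu == 0:
        # All remaining counts are zero: only k == 0 is possible.
        return 0.0 if k == 0 else float("-inf")
    if var > mu:
        size = mu * mu / (var - mu)  # moment estimate of the dispersion
        return nb_log_pmf(k, mu, size)
    # No overdispersion in the remaining data: fall back to a Poisson.
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def iloo_sketch(counts, threshold):
    """Return the sorted indices flagged as outliers for one feature.

    threshold is a hypothetical probability cutoff; the source does not
    state the value used by iLOO.
    """
    log_threshold = math.log(threshold)
    active = list(range(len(counts)))  # indices not yet flagged
    flagged = []
    while len(active) >= 3:  # keep at least 2 counts to fit on
        newly = []
        for i in active:
            rest = [counts[j] for j in active if j != i]
            if loo_log_prob(counts[i], rest) < log_threshold:
                newly.append(i)
        if not newly:
            break  # no new outliers: the iteration has converged
        flagged.extend(newly)
        active = [i for i in active if i not in newly]
    return sorted(flagged)


# Made-up counts for illustration; the last value is an obvious outlier.
print(iloo_sketch([10, 12, 9, 11, 13, 10, 500], threshold=1e-6))
```

Working in log space avoids numerical underflow for extreme counts. Flagged observations are dropped before the next pass, so one extreme count does not inflate the variance estimate used to judge the others.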
Number of features with 0 through 4 detected outliers in the control group of rat RNA-seq data.
| No. of Outliers | | |
|---|---|---|
| 0 | 15967 | 16144 |
| 1 | 857 | 706 |
| 2 | 68 | 45 |
| 3 | 2 | 0 |
| 4 | 1 | 0 |
Fig 2. Venn diagram of the number of features with outliers detected by iLOO and edgeR-robust.
The totals provided present the number of (a) single outlier features and (b) features with two detected outliers identified by iLOO and edgeR-robust in the control group of rat RNA-seq data.
Fig 3. Scatterplot of read counts observed in real data for a sample of features.
Scatterplot of raw counts for six representative features displaying counts identified as outliers by iLOO (purple diamond), edgeR-robust (red diamond), and both methods (blue diamond) in the control group of rat RNA-seq data.
Mean (standard deviation) of accuracy metrics from simulated RNA-seq data comparing iLOO (using edgeR normalized sequencing depths) to edgeR-robust.
| Method | Accuracy | Sample size 5 | Sample size 10 | Sample size 15 | Sample size 20 | Sample size 30 | Sample size 40 |
|---|---|---|---|---|---|---|---|
| | | 0.9260 (0.0525) | 0.9756 (0.0372) | 0.9797 (0.0297) | 0.9867 (0.0195) | 0.9899 (0.019) | 0.9885 (0.0176) |
| | | 0.7801 (0.0203) | 0.9703 (0.044) | 0.9774 (0.0325) | 0.9862 (0.0203) | 0.9898 (0.0199) | 0.9887 (0.0173) |
| | | 0.9422 (0.0575) | 0.9762 (0.0364) | 0.9799 (0.0294) | 0.9867 (0.0194) | 0.9899 (0.0189) | 0.9884 (0.0177) |
| | | | | | | | |
| | | 0.9118 (0.0039) | 0.9244 (0.0118) | 0.9371 (0.0195) | 0.9486 (0.0238) | 0.9512 (0.0213) | 0.9515 (0.0283) |
| | | 0.1256 (0.0287) | 0.2440 (0.1194) | 0.3743 (0.2001) | 0.4954 (0.2521) | 0.5770 (0.2662) | 0.6753 (0.2332) |
| | | 0.9991 (0.0023) | 0.9999 (0.0003) | 0.9997 (0.0017) | 0.9989 (0.0044) | 0.9927 (0.0185) | 0.9822 (0.0376) |