Yongchao Dou, Xiaomei Guo, Lingling Yuan, David R Holding, Chi Zhang.
Abstract
To improve the applicability of RNA-seq technology, a large number of RNA-seq data analysis methods and correction algorithms have been developed. Although these new methods and algorithms have steadily improved transcriptome analysis, greater prediction accuracy is needed to better guide experimental designs with computational results. In this study, a new tool for the identification of differentially expressed genes with RNA-seq data, named GExposer, was developed. This tool introduces a local normalization algorithm to reduce the bias of nonrandomly positioned read depth. The naive Bayes classifier is employed to integrate fold change, transcript length, and GC content to identify differentially expressed genes. Results on several independent tests show that GExposer has better performance than other methods. The combination of the local normalization algorithm and naive Bayes classifier with three attributes can achieve better results; both false positive rates and false negative rates are reduced. However, only a small portion of genes is affected by the local normalization and GC content correction.
Year: 2015 PMID: 26339642 PMCID: PMC4538581 DOI: 10.1155/2015/789516
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
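The abstract states that a naive Bayes classifier integrates fold change, transcript length, and GC content. The following is a minimal sketch of that idea, not GExposer's implementation: the Gaussian likelihoods, the feature scaling, the function names, and the toy training rows are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps every variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def log_posterior(model, row):
    """Unnormalized log posterior per class, treating features as independent."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        scores[label] = score
    return scores


def predict(model, row):
    """Return the class with the highest log posterior."""
    scores = log_posterior(model, row)
    return max(scores, key=scores.get)


# Made-up rows: (|log2 fold change|, log10 transcript length, GC fraction).
train_rows = [
    (3.1, 3.4, 0.52), (2.7, 3.2, 0.48), (2.2, 3.6, 0.55), (3.5, 3.1, 0.50),
    (0.2, 3.3, 0.47), (0.4, 3.0, 0.51), (0.1, 3.5, 0.53), (0.6, 3.2, 0.49),
]
train_labels = ["DE"] * 4 + ["NDE"] * 4
model = fit_gaussian_nb(train_rows, train_labels)
```

Calling `predict(model, (2.9, 3.3, 0.50))` classifies a gene with a large fold change. In this toy data the fold-change feature dominates the decision, which is in line with the attribute-omission results reported later in this record, where dropping fold change lowers AUC the most.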
Figure 1: Flowchart of GExposer.
Summary of all data sets used in this paper.
| Data set | DE genes | NDE genes | SRA accession number |
|---|---|---|---|
| MAQC UHRR and HBRR | 1966 | 3388 | SRA010153.1 |
| Colorectal cancer | 13 | 0 | SRX026158 and SRX026158 |
| Maize leaf | 6 | 9 | SRA012297 |
Figure 2: ROC curves of different methods tested on the training data set.
AUC values of six methods on the training data set using leave-one-out cross-validation.
| Method | | False positive test | False negative test |
|---|---|---|---|
| edgeR | 0.8997 | 0.8567 | 0.7945 |
| DESeq | 0.9002 | 0.8602 | 0.7909 |
| Cuffdiff | 0.8347 | 0.7740 | 0.7610 |
| NOISeq | 0.8679 | 0.8460 | 0.7267 |
| Gfold | 0.9079 | 0.8886 | 0.7790 |
| GExposer | 0.9255 | 0.9030 | 0.8054 |
AUC values of six methods on the training data set with different numbers of replicates.
| Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| False positive test, no-call as NDE | |||||||
| edgeR | 0.8535 | 0.8607 | 0.8595 | 0.8591 | 0.8584 | 0.8574 | 0.8567 |
| DESeq | 0.6621 | 0.8626 | 0.8622 | 0.8618 | 0.8613 | 0.8610 | 0.8602 |
| Cuffdiff | 0.5963 | 0.7904 | 0.7772 | 0.7904 | 0.7773 | 0.7748 | 0.7740 |
| NOISeq | 0.8334 | 0.8392 | 0.8425 | 0.8445 | 0.8452 | 0.8456 | 0.8460 |
| Gfold | 0.8870 | 0.8334 | 0.8871 | 0.8875 | 0.8874 | 0.8886 | 0.8886 |
| GExposer | 0.8968 | 0.9016 | 0.9024 | 0.9024 | 0.9028 | 0.9032 | 0.9030 |
| False negative test, no-call as DE | |||||||
| edgeR | 0.7845 | 0.7903 | 0.7930 | 0.7934 | 0.7936 | 0.7934 | 0.7945 |
| DESeq | 0.5800 | 0.7871 | 0.7895 | 0.7902 | 0.7912 | 0.7905 | 0.7909 |
| Cuffdiff | 0.5894 | 0.7674 | 0.7645 | 0.7674 | 0.7623 | 0.7606 | 0.7610 |
| NOISeq | 0.7269 | 0.7342 | 0.7335 | 0.7308 | 0.7288 | 0.7276 | 0.7267 |
| Gfold | 0.7905 | 0.6702 | 0.7498 | 0.7670 | 0.7753 | 0.7753 | 0.7790 |
| GExposer | 0.7942 | 0.8015 | 0.8045 | 0.8051 | 0.8053 | 0.8051 | 0.8054 |
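The AUC tables above distinguish two treatments of no-call genes: counted as NDE in the false positive test and as DE in the false negative test. The sketch below shows one common way to compute AUC (the pairwise Mann-Whitney formulation) under each relabeling. The function names and the example scores are illustrative assumptions; the source does not say how its AUC values were computed.

```python
def relabel(calls, no_call_as):
    """Map 'DE' / 'NDE' / 'no-call' strings to 1/0, assigning no-call as requested."""
    if no_call_as not in ("DE", "NDE"):
        raise ValueError("no_call_as must be 'DE' or 'NDE'")
    labels = []
    for call in calls:
        if call == "no-call":
            call = no_call_as
        labels.append(1 if call == "DE" else 0)
    return labels


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up classifier scores and reference calls for four genes.
scores = [0.9, 0.6, 0.4, 0.2]
calls = ["DE", "no-call", "NDE", "no-call"]
auc_fp_test = auc(scores, relabel(calls, no_call_as="NDE"))
auc_fn_test = auc(scores, relabel(calls, no_call_as="DE"))
```

The pairwise loop is quadratic in the number of genes, which is fine for a small illustration; a rank-based computation would be preferable for data sets the size of the MAQC set.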
Ranking of 13 genes by six different methods.
| Gene | edgeR | DESeq | Cuffdiff | NOISeq | Gfold | GExposer |
|---|---|---|---|---|---|---|
| LAPTM4B | 109 | 109 | 99 | | | |
| TSPAN12 | 59 | 54 | 18 | 55 | 98 | |
| TNNI2 | 8750 | 8730 | 41066 | 671 | 241 | 15861 |
| H19 | | | | | | |
| ZNF185 | 651 | 672 | 56286 | 7879 | 81 | 1008 |
| MR1 | 144 | 141 | 22837 | 51 | | 4156 |
| ASRGL1 | 125 | 269 | 8560 | 101 | | 854 |
| C12orf59 | | | 70 | | | |
| KLK6 | 680 | 646 | 1568 | | | 115 |
| ATOH8 | 66 | 70 | 956 | 224 | | 99 |
| FUT3 | 57 | 62 | 1583 | | | |
| KRT20 | 356 | 625 | 462 | 399 | | |
| OLR1 | | | 8741 | | | |
The numbers in bold font correspond to genes that were ranked in top 50.
Distribution of the top 10 differentially expressed genes ranked by each of the six methods on maize RNA-seq data.
| Method | DE | No-call | NDE |
|---|---|---|---|
| edgeR | 4 | 6 | 0 |
| DESeq | 3 | 7 | 0 |
| Cuffdiff | 4 | 6 | 0 |
| NOIseq | 4 | 6 | 0 |
| Gfold | 4 | 5 | 1 |
| GExposer | 5 | 5 | 0 |
Results of four maize genes with and without the local normalization.
| Method | GRMZM2 | GRMZM2 | GRMZM2 | GRMZM2 |
|---|---|---|---|---|
| log2(FC) without local normalization | −1.20 | 1.69 | −1.72 | 4.18 |
| log2(FC) with local normalization | −0.61 | 0.40 | −0.59 | 0.43 |
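The table above shows log2 fold changes shrinking toward zero once local normalization is applied. This record does not specify the normalization rule, so the sketch below uses a hypothetical stand-in: each position's read depth is clipped at a multiple of the gene's median depth before the fold change is computed. The `cap_local_depth` rule, the `factor` value, the pseudocount, and the depth vectors are all assumptions chosen only to show how a localized depth spike can inflate log2(FC).

```python
import math
from statistics import median


def log2_fold_change(depth_a, depth_b, pseudocount=1.0):
    """log2 ratio of summed per-position depth in sample A over sample B."""
    return math.log2((sum(depth_a) + pseudocount) / (sum(depth_b) + pseudocount))


def cap_local_depth(depth, factor=3.0):
    """Hypothetical rule: clip each position at factor times the median depth."""
    limit = factor * median(depth)
    return [min(d, limit) for d in depth]


# Made-up per-position depths; sample A has a spike at two positions.
sample_a = [10, 12, 11, 200, 210, 9, 10, 11]
sample_b = [10, 11, 12, 10, 9, 11, 10, 12]

raw_lfc = log2_fold_change(sample_a, sample_b)
adjusted_lfc = log2_fold_change(cap_local_depth(sample_a), cap_local_depth(sample_b))
```

Here the spike alone pushes the raw value well above 2, while the clipped value stays below 1, the same qualitative pattern as in the table. Clipping discards the reads above the limit, which is at least consistent with the "discarded reads" reported in the Figure 4 caption, but the actual GExposer procedure may differ.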
Figure 3: (a) RT-PCR results of four genes in o2 and QPM lines. (b) Short read distribution in three exons of GRMZM2G002678 in W64 o2 and QPM; the red dotted line indicates the depth adjusted by the local normalization method.
Figure 4: Fractions of reads discarded by the local normalization method for both real and simulated RNA-seq data.
Performance of GExposer omitting each attribute.
| Method | | False positive test | False negative test |
|---|---|---|---|
| ΔGCC | 0.9028 | 0.8964 | 0.8017 |
| ΔARPK | 0.8791 | 0.8656 | 0.7470 |
| ΔFC | 0.6315 | 0.5946 | 0.6141 |
| GExposer | 0.9255 | 0.9030 | 0.8054 |
Figure 5: The distributions of nucleotides with different depths.