J Zyprych-Walczak, A Szabelska, L Handschuh, K Górczak, K Klamecka, M Figlerowicz, I Siatkowski.
Abstract
High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by the sequencers create a need for statistical and computational methods that can handle the analysis and management of these data. Data normalization is one of the most crucial steps of data processing, and it must be considered carefully because it has a profound effect on the results of the analysis. In this work, we present a comprehensive comparison of five normalization methods related to sequencing depth that are widely used for transcriptome sequencing (RNA-seq) data, and we assess their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied to select the optimal normalization procedure for any particular data set. The workflow includes calculation of the bias and variance values for the control genes, the sensitivity and specificity of the methods, and the classification errors, as well as generation of diagnostic plots. Combining this information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
Year: 2015 PMID: 26176014 PMCID: PMC4484837 DOI: 10.1155/2015/621690
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
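As an illustration of what a depth-related normalization does, the sketch below computes per-sample size factors with the median-of-ratios approach. This is an assumption made for illustration: the abstract does not spell out the five methods, and the function names and toy counts are hypothetical, not taken from the study.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts: list of genes, each a list of raw counts (one per sample)."""
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue  # a zero count makes the geometric mean zero, so skip the gene
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Size factor of a sample = exp(median log ratio to the per-gene geometric mean)
    return [math.exp(median(r)) for r in log_ratios]

def normalize(counts, size_factors):
    """Divide every count by the size factor of its sample."""
    return [[c / s for c, s in zip(gene, size_factors)] for gene in counts]

# Hypothetical toy data: sample 2 is sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [30, 60], [0, 5]]
size_factors = median_of_ratios_size_factors(toy_counts)
normalized = normalize(toy_counts, size_factors)
```

In this toy case the second size factor is twice the first, so the normalized counts of the first three genes coincide across the two samples.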
Summary of data set information.
| Data set | Number of samples | Number of genes | Number of genes (after filtering) | Minimum mean gene abundance | Maximum mean gene abundance | Number of housekeeping (HK) genes |
|---|---|---|---|---|---|---|
| Cheung | 41 | 52 580 | 12 410 | 0.024 | 90180 | 124 |
| Bodymap | 16 | 52 580 | 13 131 | 0 | 934100 | 131 |
| AML | 27 | 57 736 | 12 749 | 50.04 | 482400 | 127 |
Classifiers used in the calculations, with the corresponding R function, R package, and package version.
| Classification method | Function in R | R package | Version |
|---|---|---|---|
| Naive Bayes | NaiveBayesI | MLInterfaces | 1.40.0 |
| Neural network | nnetI | MLInterfaces | 1.40.0 |
| k-nearest neighbours | knnI | MLInterfaces | 1.40.0 |
| Support vector machines | svmI | MLInterfaces | 1.40.0 |
| Random forest | randomForestI | MLInterfaces | 1.40.0 |
Figure 1. Bar plots of the DEGs with specified levels of count abundance in all studied data sets. The x-axis shows the normalization methods, and the y-axis shows the number of DEGs determined after each normalization procedure. The bar colours represent groups of genes with a particular level of expression.
Rank of the bias values obtained using five normalization methods and RD for the normalization of the three tested data sets.
| Normalization method | Cheung data set rank (bias value) | Bodymap data set rank (bias value) | AML data set rank (bias value) |
|---|---|---|---|
| TMM | 3.5 (0.890) | 4 (1.123) | 6 (0.587) |
| UQ | 3.5 (0.890) | 5 (1.139) | 4 (0.564) |
| DES | 1 (0.885) | 2 (1.102) | 1 (0.512) |
| EBS | 2 (0.887) | 3 (1.111) | 3 (0.532) |
| PS | 5 (0.893) | 1 (1.098) | 2 (0.523) |
| RD | 6 (0.908) | 6 (1.159) | 5 (0.581) |
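The ranks in the table above are consistent with ordinary ascending ranking in which tied values share the mean of their positions (TMM and UQ both receive 3.5 for the Cheung data). A minimal sketch of that ranking rule follows; the function name is hypothetical and the code is not the authors' implementation.

```python
def average_ranks(values, reverse=False):
    """Rank values (1 = smallest, or largest if reverse=True); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

methods = ["TMM", "UQ", "DES", "EBS", "PS", "RD"]
cheung_bias = [0.890, 0.890, 0.885, 0.887, 0.893, 0.908]  # values from the table
cheung_ranks = dict(zip(methods, average_ranks(cheung_bias)))
```

Passing `reverse=True` ranks the largest value first, which is the direction needed for quantities where higher is better.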
Rank of the variance values obtained using the five normalization methods and RD for the normalization of the three tested data sets.
| Normalization method | Cheung data set rank (variance value) | Bodymap data set rank (variance value) | AML data set rank (variance value) |
|---|---|---|---|
| TMM | 4 (0.779) | 4 (1.305) | 6 (0.364) |
| UQ | 3 (0.778) | 5 (1.330) | 4 (0.335) |
| DES | 1 (0.768) | 2 (1.270) | 1 (0.283) |
| EBS | 2 (0.771) | 3 (1.283) | 3 (0.300) |
| PS | 5 (0.782) | 1 (1.268) | 2 (0.291) |
| RD | 6 (0.812) | 6 (1.390) | 5 (0.355) |
Sensitivity and specificity of the studied five normalization methods and RD applied to the AML data set.
| Methods | TMM | UQ | DES | EBS | PS | RD |
|---|---|---|---|---|---|---|
| Sensitivity (%) | 11.4 | 20.5 | 18.2 | 45.5 | 40.9 | 18.2 |
| Rank of sensitivity | 6 | 3 | 4.5 | 1 | 2 | 4.5 |
| Specificity (%) | 97.7 | 84.1 | 97.7 | 52.3 | 65.9 | 97.7 |
| Rank of specificity | 2 | 4 | 2 | 6 | 5 | 2 |
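For reference, sensitivity and specificity can be computed from a confusion matrix as sketched below. The source reports only percentages, so the counts used here are hypothetical; they were chosen so that the output rounds to the TMM column of the table.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) in percent from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)  # share of truly DE genes that are detected
    specificity = 100.0 * tn / (tn + fp)  # share of non-DE genes correctly left out
    return sensitivity, specificity

# Hypothetical counts (not reported in the source)
sens, spec = sensitivity_specificity(tp=5, fn=39, tn=43, fp=1)
```

The table shows the usual trade-off: the methods with the highest sensitivity (EBS, PS) have the lowest specificity.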
Performance of the normalization methods with informative genes, based on five classifiers and LOOCV applied to the Cheung, Bodymap, and AML data sets. The first number in each cell is the mean prediction error (%) across the five classification methods. The number in brackets is the median of the five prediction errors (%).
| Data set | TMM | UQ | DES | EBS | PS | RD |
|---|---|---|---|---|---|---|
| Cheung | 11.1 (9.8) | 10.7 (7.3) | 4.5 (0.4) | 4.4 (0.0) | 3.9 (0.0) | 11.6 (12.2) |
| Bodymap | 16.2 (18.8) | 16.2 (18.8) | 16.0 (17.7) | 15.8 (16.7) | 16.0 (17.7) | 16.2 (18.8) |
| AML | 8.1 (7.4) | 8.0 (7.4) | 7.7 (7.4) | 7.4 (7.4) | 7.3 (7.4) | 8.3 (7.4) |
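Leave-one-out cross-validation (LOOCV) holds out each sample in turn, trains on the rest, and reports the share of held-out samples that are misclassified. The sketch below uses a simple 1-nearest-neighbour rule as a stand-in classifier; the toy data, the per-classifier errors, and all names are hypothetical, and the study itself used the R classifiers listed earlier.

```python
from statistics import mean, median

def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training sample closest to x (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def loocv_error(xs, ys, predict):
    """Percentage of samples misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return 100.0 * errors / len(xs)

# Hypothetical expression profiles forming two well-separated groups
xs = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
ys = ["A", "A", "A", "B", "B", "B"]
knn_error = loocv_error(xs, ys, nearest_neighbour_predict)

# Hypothetical LOOCV errors (%) from five classifiers, summarized as in the table
hypothetical_errors = [0.0, 2.5, 5.0, 0.0, 10.0]
mean_error = mean(hypothetical_errors)
median_error = median(hypothetical_errors)
```

Reporting both the mean and the median limits the influence of a single poorly performing classifier on the comparison.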
Figure 2. Diagnostic plots for the AML data set. Besides the five normalization methods, raw data (RD) were also included in the plots as a benchmark. (a) presents calculated normalization factor values across the samples by each method. The samples are ordered by the minimum values of normalization factors. (b) shows 95% confidence intervals for the mean of the percentages of classification errors calculated for each method based on five selected classifiers. (c) shows the numbers of common DE genes across each pair of normalization methods. The size and shading of the circles represent the average percentage value of common genes between each pair of methods. (d) presents the results of clustering of the normalization methods based on 20 common DE genes found by each of these methods. A dendrogram was created with hierarchical clustering based on Ward's method.
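The pairwise comparison in panel (c) can be sketched as a set intersection over the DE gene lists of each method. The gene identifiers below are hypothetical, and the definition of the "average percentage" (the mean of the overlap relative to each of the two lists) is an assumption, since the caption does not define it.

```python
from itertools import combinations

def pairwise_overlap(deg_sets):
    """Map each pair of methods to (number of common DE genes, average % overlap)."""
    result = {}
    for a, b in combinations(sorted(deg_sets), 2):
        common = deg_sets[a] & deg_sets[b]
        pct_a = 100.0 * len(common) / len(deg_sets[a])  # overlap relative to list a
        pct_b = 100.0 * len(common) / len(deg_sets[b])  # overlap relative to list b
        result[(a, b)] = (len(common), (pct_a + pct_b) / 2)
    return result

# Hypothetical DE gene lists (each list must be non-empty)
hypothetical_degs = {
    "DES": {"g1", "g2", "g3", "g4"},
    "PS": {"g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9"},
    "TMM": {"g1", "g2"},
}
overlap = pairwise_overlap(hypothetical_degs)
```

Methods whose lists overlap heavily are candidates for being used interchangeably, which is what the clustering in panel (d) summarizes.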
Summary of comparison results for the five normalization methods under consideration. The final rank is based on the bias and variance values, the prediction errors, sensitivity, specificity values, and the number of common DEGs for AML data.
| Criterion | TMM | UQ | DES | EBS | PS |
|---|---|---|---|---|---|
| Bias | 5.0 | 4.0 | 1.0 | 3.0 | 2.0 |
| Variance | 5.0 | 4.0 | 1.0 | 3.0 | 2.0 |
| Sensitivity | 5.0 | 3.0 | 4.0 | 1.0 | 2.0 |
| Specificity | 1.5 | 3.0 | 1.5 | 5.0 | 4.0 |
| Prediction errors | 5.0 | 4.0 | 3.0 | 2.0 | 1.0 |
| Common DEGs | 5.0 | 2.0 | 1.0 | 4.0 | 3.0 |
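The table above lists per-criterion ranks but the aggregation rule behind the final rank is not shown here. One simple possibility, given purely as an assumption, is an unweighted sum of the six ranks per method (lower is better); the sketch below applies it to the table values.

```python
# Ranks copied from the summary table (AML data)
summary_ranks = {
    "Bias":              {"TMM": 5.0, "UQ": 4.0, "DES": 1.0, "EBS": 3.0, "PS": 2.0},
    "Variance":          {"TMM": 5.0, "UQ": 4.0, "DES": 1.0, "EBS": 3.0, "PS": 2.0},
    "Sensitivity":       {"TMM": 5.0, "UQ": 3.0, "DES": 4.0, "EBS": 1.0, "PS": 2.0},
    "Specificity":       {"TMM": 1.5, "UQ": 3.0, "DES": 1.5, "EBS": 5.0, "PS": 4.0},
    "Prediction errors": {"TMM": 5.0, "UQ": 4.0, "DES": 3.0, "EBS": 2.0, "PS": 1.0},
    "Common DEGs":       {"TMM": 5.0, "UQ": 2.0, "DES": 1.0, "EBS": 4.0, "PS": 3.0},
}

def rank_sums(ranks_by_criterion):
    """Assumed aggregation: unweighted sum of the per-criterion ranks of each method."""
    totals = {}
    for ranks in ranks_by_criterion.values():
        for method, rank in ranks.items():
            totals[method] = totals.get(method, 0.0) + rank
    return totals

totals = rank_sums(summary_ranks)
ordering = sorted(totals, key=totals.get)  # smallest rank sum first
```

A weighted sum, or a different tie-handling rule, could change the ordering, so the result depends on the aggregation chosen.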