Qinxue Meng, Daniel Catchpoole, David Skillicorn, Paul J Kennedy.
Abstract
BACKGROUND: Data from patients with rare diseases is often produced using different platforms and probe sets because patients are widely distributed in space and time. Aggregating such data requires a method of normalization that makes patient records comparable.
Keywords: Distribution; Gene expression data; Normalization; R
Year: 2017 PMID: 29187149 PMCID: PMC5706403 DOI: 10.1186/s12859-017-1912-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An example application of DBNorm. Probesets from data source 1 (top left) and probesets from data source 2 (bottom left) are visualised with PCA and show platform-specific variation. DBNorm takes histograms of probe sets from each data source (second column) and fits a curve to the shape (column 3). The formula (middle of column 3) normalises data source 1 to take the shape of data source 2. PCA visualisation of the combined dataset (bottom right) shows a reduction in platform-specific variation.
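The idea in the caption, reshaping one data source so that its distribution takes the shape of another, can be sketched as a cumulative-distribution mapping. This is a minimal illustration and not the DBNorm implementation: it uses empirical distributions (sorted values) instead of the fitted curves described above, and the function name `match_distribution` is hypothetical.

```python
from bisect import bisect_right


def match_distribution(source, target):
    """Map each source value onto the target distribution.

    Assumption: empirical CDFs stand in for the fitted curves that
    DBNorm uses. Each source value is replaced by the target value
    that sits at the same cumulative position.
    """
    src_sorted = sorted(source)
    tgt_sorted = sorted(target)
    n, m = len(src_sorted), len(tgt_sorted)
    matched = []
    for x in source:
        # Number of source values <= x, in the range 1..n.
        rank = bisect_right(src_sorted, x)
        # ceil(rank * m / n) - 1, kept in integer arithmetic to avoid
        # floating-point rounding at the bin edges.
        idx = (rank * m + n - 1) // n - 1
        matched.append(tgt_sorted[idx])
    return matched


if __name__ == "__main__":
    platform_1 = [3.0, 1.0, 2.0, 4.0]      # toy values, not real expression data
    platform_2 = [10.0, 20.0, 30.0, 40.0]
    print(match_distribution(platform_1, platform_2))
```

The mapping preserves the rank order of the source values while giving them the value range and shape of the target, which is the property the caption describes.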
Westmead Acute Lymphoblastic Leukaemia (ALL) dataset
| Platform | Patients | Probesets | Relapse (Y/N) |
|---|---|---|---|
| Affymetrix_U133A | 18 | 22,283 | 6/12 |
| Affymetrix_U133A2 | 44 | 22,277 | 13/31 |
| Affymetrix_U133Plus2 | 44 | 54,675 | 6/38 |
| Affymetrix_HG1ST | 40 | 33,297 | 8/32 |
Public domain datasets
| Dataset | Platform | Samples | Microarray |
|---|---|---|---|
| Dilution/mixture | HG-U95A cRNA data source A | 75 | 201,800 |
| Dilution/mixture | HG-U95A cRNA data source B | 75 | 201,800 |
| Spike-in | HGU133 | 42 | 248,152 |
| Spike-in | HGU95 | 59 | 201,807 |
| Public ALL | HG-U133A | 20 | 22,283 |
| Public ALL | HG-U133B | 20 | 22,283 |
Statistics of the ALL microarray data
| Platform | Probesets | min | max | mean | std |
|---|---|---|---|---|---|
| Affymetrix_U133A | 11,288 | 3.335 | 14.504 | 6.457 | 1.606 |
| Affymetrix_U133A2 | 11,288 | 2.961 | 14.946 | 6.577 | 2.075 |
| Affymetrix_U133Plus2 | 11,288 | 2.451 | 15.019 | 6.515 | 2.294 |
| Affymetrix_HG1ST | 11,288 | 1.749 | 13.874 | 6.274 | 1.966 |
Fig. 2 Plots of distributions of all genes across all patients from platforms U133A, U133A2, U133Plus2 and HG1ST before and after normalization by the proposed distribution-based normalization method
Results of Kullback–Leibler Divergence
| Distribution Comparison | Before normalization | After normalization |
|---|---|---|
| U133A vs. U133Plus2 | 0.1475 | 0.0001 |
| U133A2 vs. U133Plus2 | 0.1177 | 0.0001 |
| HG1ST vs. U133Plus2 | 0.2300 | 0.0007 |
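For readers who want to reproduce this kind of comparison, the following is a minimal sketch of the Kullback–Leibler divergence between two histogrammed samples. The binning, the smoothing constant `eps`, the natural logarithm, and the direction of the divergence are all assumptions; the source does not state which choices were used for the table above.

```python
import math


def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so that v == hi (or out-of-range values) land in an edge bin.
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) in nats for two histograms with the same bins.

    `eps` is added to every bin (an assumption) to avoid log(0)
    when a bin is empty in one of the histograms.
    """
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    a = histogram([0.5, 1.5, 1.6, 2.9], lo=0.0, hi=3.0, bins=3)
    b = histogram([0.2, 0.4, 1.5, 2.5], lo=0.0, hi=3.0, bins=3)
    print(kl_divergence(a, b))
```

A value near zero, as in the "After normalization" column, indicates that the two histograms are almost indistinguishable under this measure.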
Statistical evaluation of normalization methods
| Method | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| z-score | −1.741 | −0.739 | −0.112 | 0.000 | 0.553 | 4.688 |
| AvgDiff | −1.827 | −0.742 | −0.097 | 0.000 | 0.591 | 4.689 |
| Quantile | −1.628 | −0.846 | −0.082 | 0.000 | 0.673 | 3.910 |
| ComBat | −2.536 | −0.811 | −0.064 | 0.000 | 0.681 | 4.910 |
| DBNorm | −1.636 | −0.854 | −0.073 | 0.000 | 0.690 | 3.748 |
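The columns of this table follow the usual six-number summary (minimum, quartiles, mean, maximum). As an illustration only, the sketch below applies a z-score transform and then computes that summary. It assumes the population standard deviation and linearly interpolated quartiles; the source does not say which variants were used.

```python
import statistics


def z_score(values):
    """Centre values on zero and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std
    return [(v - mean) / std for v in values]


def six_number_summary(values):
    """Return min, 1st quartile, median, mean, 3rd quartile and max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


if __name__ == "__main__":
    toy = [1.0, 2.0, 3.0, 4.0, 5.0]  # toy values, not real expression data
    print(six_number_summary(z_score(toy)))
```

By construction the mean after a z-score transform is zero, which matches the Mean column above for every method that centres the data.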
Fig. 3 Visualizing gene expression data by principal component analysis (PCA). In the figure, green dots are patient data from HG1ST; blue dots are from U133A; red dots are from U133A2; and black dots are from U133Plus2
Evaluating normalization by SVM
| Method | Averaged LOOCV Training Accuracy | Averaged LOOCV Training F-measure | Averaged LOOCV Training ROC AUC | Averaged LOOCV Test Accuracy | Averaged LOOCV Test F-measure | Averaged LOOCV Test ROC AUC |
|---|---|---|---|---|---|---|
| Unnormalized | 0.57 | 0.31 | 0.63 | 0.19 | 0.08 | 0.22 |
| z-score | 0.79 | 0.49 | 0.81 | 0.23 | 0.11 | 0.26 |
| AvgDiff | 0.87 | 0.58 | 0.90 | 0.41 | 0.29 | 0.45 |
| Quantile | 0.91 | 0.60 | 0.93 | 0.79 | 0.52 | 0.82 |
| ComBat | 0.94 | 0.67 | 0.95 | 0.81 | 0.61 | 0.85 |
| DBNorm | 0.97 | 0.72 | 0.98 | 0.84 | 0.73 | 0.87 |
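The evaluation protocol behind this table, leave-one-out cross-validation (LOOCV) scored by accuracy and F-measure, can be sketched as follows. To keep the example dependency-free it swaps the SVM for a nearest-centroid classifier, so it shows the protocol only and will not reproduce the reported numbers. All function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT an SVM): pick the label of the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], sample))


def loocv(features, labels, positive=1):
    """Hold out each sample once; return (accuracy, F-measure) on the held-out samples."""
    tp = fp = fn = correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        pred = nearest_centroid_predict(train_x, train_y, features[i])
        correct += pred == labels[i]
        tp += pred == positive and labels[i] == positive
        fp += pred == positive and labels[i] != positive
        fn += pred != positive and labels[i] == positive
    accuracy = correct / len(features)
    denom = 2 * tp + fp + fn
    f_measure = 2 * tp / denom if denom else 0.0
    return accuracy, f_measure


if __name__ == "__main__":
    # Toy one-feature data: label 1 plays the role of "relapse".
    x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
    y = [0, 0, 0, 1, 1, 1]
    print(loocv(x, y))
```

With imbalanced classes such as the relapse counts reported for the Westmead dataset, F-measure is a useful companion to accuracy, since a classifier that always predicts the majority class can score a high accuracy and an F-measure of zero.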
p-values from comparing DBNorm with the other normalization methods in terms of ROC AUC on the test dataset
| | Unnormalized | z-score | AvgDiff | Quantile | ComBat |
|---|---|---|---|---|---|
| DBNorm | 2.2 × 10⁻¹⁶ | 6.7 × 10⁻¹⁶ | 4.5 × 10⁻¹³ | 2.1 × 10⁻¹⁴ | 2.9 × 10⁻¹³ |
Fig. 4 Visualization and comparison of normalization performance on the Dilution/mixture dataset
Fig. 5 Visualization and comparison of normalization performance on the Spike-in dataset
Fig. 6 Visualization and comparison of normalization performance on the public ALL dataset