Nicole Hartmann1, Evert Luesink1, Edward Khokhlovich1, Joseph D Szustakowski1, Lukas Baeriswyl1, Joshua Peterson1, Andreas Scherer2, Nirmala R Nanguneri1, Frank Staedtler1.
Abstract
BACKGROUND: Exact sample annotation in expression microarray datasets is essential for any type of pharmacogenomics research.
Keywords: Biomarkers; Dip test; HLA-DQA1; HLA-DRB4; Microarray quality control (QC); mRNA
Year: 2014 PMID: 25285214 PMCID: PMC4184161 DOI: 10.1186/2050-7771-2-17
Source DB: PubMed Journal: Biomark Res ISSN: 2050-7771
Result of Hartigans’ dip test
| Probe set | Gene symbol | Entrez Gene ID | Cytoband | Dip statistic | Empirical p-value |
|---|---|---|---|---|---|
| | | 8653 | Yq11 | 0.147 | <1.00E-06 |
| | | 3117 | 6p21.3 | 0.147 | <1.00E-06 |
| 228492_at | USP9Y | 8287 | Yq11.2 | 0.143 | <1.00E-06 |
| 232618_at | TXLNG2P | 246126 | Yq11.222 | 0.140 | <1.00E-06 |
| | | 6192 | Yp11.3 | 0.139 | <1.00E-06 |
| | | 3126 | 6p21.3 | 0.137 | <1.00E-06 |
| 223646_s_at | TXLNG2P | 246126 | Yq11.222 | 0.133 | <1.00E-06 |
| | | 7503 | Xq13.2 | 0.131 | <1.00E-06 |
| | | 8284 | Yq11 | 0.130 | <1.00E-06 |
| | | 9086 | Yq11.223 | 0.124 | <1.00E-06 |
| | | 8653 | Yq11 | 0.123 | <1.00E-06 |
| | | 7503 | Xq13.2 | 0.121 | <1.00E-06 |
| | | 7503 | Xq13.2 | 0.120 | <1.00E-06 |
| | | 7503 | Xq13.2 | 0.118 | <1.00E-06 |
| 231592_at | TSIX | 9383 | Xq13.2 | 0.117 | <1.00E-06 |
| | | 7503 | Xq13.2 | 0.116 | <1.00E-06 |
| 211149_at | UTY | 7404 | Yq11 | 0.113 | <1.00E-06 |
| 226736_at | CHURC1 | 91612 | 14q23.3 | 0.108 | <1.00E-06 |
| 235446_at | --- | --- | --- | 0.107 | <1.00E-06 |
| 1560263_at | --- | --- | --- | 0.103 | 2.00E-06 |
| 223645_s_at | TXLNG2P | 246126 | Yq11.222 | 0.099 | 6.00E-06 |
| 208067_x_at | UTY | 7404 | Yq11 | 0.096 | 1.00E-05 |
| 230760_at | ZFY | 7544 | Yp11.3 | 0.093 | 1.60E-05 |
| | | 9086 | Yq11.223 | 0.093 | 1.80E-05 |
| | | 7503 | Xq13.2 | 0.092 | 2.90E-05 |
| 205048_s_at | PSPH | 5723 | 7p11.2 | 0.090 | 4.00E-05 |
| 214131_at | TXLNG2P | 246126 | Yq11.222 | 0.089 | 4.70E-05 |
| 207805_s_at | PSMD9 | 5715 | 12q24.31 | 0.089 | 5.60E-05 |
| 238900_at | HLA-DRB1 | 3123 | 6p21.3 | 0.088 | 6.70E-05 |
| 1559003_a_at | CCDC163P | 126661 | 1p34.1 | 0.088 | 7.10E-05 |
| 208909_at | UQCRFS1 | 7386 | 19q12 | 0.086 | 0.000112 |
| 215333_x_at | GSTM1 | 2944 | 1p13.3 | 0.085 | 0.000158 |
| 208919_s_at | NADK | 65220 | 1p36.33 | 0.085 | 0.000163 |
| 241808_at | ZC2HC1A | 51101 | 8q21.12 | 0.082 | 0.000287 |
| 225318_at | --- | --- | --- | 0.081 | 0.000345 |
| 212262_at | QKI | 9444 | 6q26 | 0.081 | 0.000379 |
| 225236_at | RBM18 | 92400 | 9q33.2 | 0.081 | 0.000434 |
| 206279_at | PRKY | 5616 | Yp11.2 | 0.080 | 0.000554 |
| 1554094_at | ENTPD5 | 957 | 14q24 | 0.080 | 0.000554 |
| 203280_at | SAFB2 | 9667 | 19p13.3 | 0.080 | 0.000574 |
| 226990_at | CAPRIN1 | 4076 | 11p13 | 0.079 | 0.000628 |
| 203056_s_at | PRDM2 | 7799 | 1p36.21 | 0.079 | 0.000726 |
| 241033_at | --- | --- | --- | 0.078 | 0.000844 |
| 205173_x_at | CD58 | 965 | 1p13 | 0.078 | 0.000872 |
| 235104_at | ERAP2 | 64167 | 5q15 | 0.078 | 0.000882 |
The table shows probe sets with an empirical p-value < 0.001, sorted by descending dip test statistics. About 47% of the genes in the list are located on heterosomes, among those all genes of the gender marker “REDKX” ([6], italicized for visualization purposes). Probe sets for HLA-genes, which were further considered during the marker validation process, are highlighted in bold.
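The record does not state how the empirical p-values were obtained. A minimal sketch, assuming they come from comparing each observed dip statistic with dip statistics of simulated unimodal data (Figure 1 mentions simulated data), is given below. The function names and the idea that the `<1.00E-06` floor corresponds to one million simulations with no exceedance are assumptions made for illustration, not details taken from the paper.

```python
from bisect import bisect_left


def empirical_p_value(observed, null_stats):
    """Fraction of simulated (null) statistics that are >= the observed one."""
    ordered = sorted(null_stats)
    n_greater_equal = len(ordered) - bisect_left(ordered, observed)
    return n_greater_equal / len(ordered)


def format_p(p, n_sim):
    """Report p = 0 as an upper bound of 1/n_sim (assumed reporting convention)."""
    floor = 1.0 / n_sim
    return f"<{floor:.2E}" if p == 0 else f"{p:.2E}"


# Toy null distribution; a real run would use dip statistics of simulated
# unimodal samples with the same sample size as the dataset.
toy_null = [0.01, 0.02, 0.03, 0.04]
print(empirical_p_value(0.03, toy_null))   # two of four null values are >= 0.03
print(format_p(0.0, 1_000_000))            # no exceedance in 1e6 simulations
```

Under this convention the smallest p-value that can be reported exactly is 1/n_sim, which is why a fixed floor appears for the most strongly bimodal probe sets.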
Allele-specificity of HLA-DQA1 probe sets
| Allele (predicted) | | | | | | | | | | | | | | | | | | | | | | |
| HLA-DQA1*0101.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||
| HLA-DQA1*0102 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||
| HLA-DQA1*0103 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||||||||
| HLA-DQA1*0201 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| HLA-DQA1*0301.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| HLA-DQA1*0401 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||||
| HLA-DQA1*0501 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||||||||
The nucleotide sequence of 203290_at perfectly matches the *0401 allele, and that of 213831_at perfectly matches the *0103 allele; each of the two probe sets contains at least one mismatched probe for the other alleles. Perfect matches are indicated as an italicized bold 1. Allele sequences are from [12].
Figure 1. Quantile-quantile plot of Hartigans’ dip test statistics. The line of identity (in red) indicates unimodal distribution of data. Simulated data are distributed along this line, while some of the probe sets from the dataset GSE7753 deviate from unimodal distribution. Three candidate marker probe sets, 203290_at (HLA-DQA1*0401), 213831_at (HLA-DQA1*0103), and 209728_at (HLA-DRB4), are pointed out.
Figure 2. Bimodal intensity distribution of three candidate marker probe sets in the dataset GSE7753. The training set consisted of 47 samples. Horizontal lines on the Y-axes indicate the intensity thresholds, which were empirically determined for each probe set separately. The box plots represent (from bottom to top) the 10th, 25th, 50th, 75th and 90th percentile of the distribution. Number of bins = 50.
Application of the score to public datasets
| Patient | Time point | CEL file | 203290_at | 213831_at | 209728_at | HLA score | Gender | Flagged by | Observation | Conclusion |
|---|---|---|---|---|---|---|---|---|---|---|
| 55 | Time point 1 | GSM155503.CEL | 28 | 4182 | 48 | 010 | F | | | |
| | Time point 2 | GSM155504.CEL | 23 | 4606 | 72 | 010 | F | | | |
| | Time point 3 | GSM155505.CEL | 10 | 2412 | 24 | 010 | F | | | |
| | Time point 4 | GSM155506.CEL | 28 | 4765 | 21 | 010 | F | | | |
| 54 | Time point 1 | GSM155499.CEL | 33 | 516 | 985 | 011 | F | | | |
| | Time point 2 | GSM155500.CEL | 49 | 681 | 1245 | 011 | F | | | |
| | Time point 3 | GSM155501.CEL | 28 | 1073 | 3142 | 011 | F | | | |
| | Time point 4 | GSM155502.CEL | 26 | 914 | 2573 | 011 | F | | | |
| 45 | Time point 1 | GSM155495.CEL | 45 | 8041 | 187 | | F | HLA-score | Intensity of 209728_at HLA-DRB4 slightly above threshold | Not critical |
| | Time point 2 | GSM155496.CEL | 59 | 7619 | 123 | 010 | F | | | |
| | Time point 3 | GSM155497.CEL | 32 | 6385 | 105 | 010 | F | | | |
| | Time point 4 | GSM155498.CEL | 59 | 7062 | 40 | 010 | F | | | |
| 35 | Time point 1 | GSM155475.CEL | 1218 | 42 | 6444 | | | HLA-score and REDKX QC-score | Gender different for patient samples | Possible sample mix-up; follow up |
| | Time point 2 | GSM155476.CEL | 436 | 1491 | 231 | 111 | M | | | |
| | Time point 3 | GSM155477.CEL | 508 | 1751 | 200 | 111 | M | | | |
| | Time point 4 | GSM155478.CEL | 126 | 113 | 5 | | M | HLA-score | Intensity of 209728_at HLA-DRB4 and 213831_at HLA-DQA1 10× to 46× smaller than those from other patient samples | Possible sample mix-up; follow up |
| 32 | Time point 1 | GSM155471.CEL | 1235 | 32 | 8231 | 101 | F | HLA-score and REDKX QC-score | | |
| | Time point 2 | GSM155472.CEL | 420 | 1878 | 291 | | | | Gender different for patient samples | Possible sample mix-up; follow up |
| | Time point 3 | GSM155473.CEL | 1807 | 49 | 7332 | 101 | F | | | |
| | Time point 4 | GSM155474.CEL | 1128 | 42 | 7122 | 101 | F | | | |
Each sample of the dataset is labelled with a three-digit score, one flag per probe set. An intensity threshold, determined empirically for each marker separately, marks the transcript as “on” (flag “1”) or “off” (flag “0”). Score differences among the samples of one patient prompt follow-up investigation of their nature and source: in some instances the difference is due to a mild threshold violation, in others it may be due to a sample mix-up. The score picks up the samples that are flagged by the REDKX gender QC and detects further samples with issues (labeled in bold). REDKX panel expression values are provided in Additional file 1: Table S1.
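The scoring rule described in this caption can be sketched as follows. The threshold values are hypothetical placeholders chosen only so that the example reproduces the scores printed in the table; the paper's actual thresholds are not given in this record. The order of the three intensity columns (203290_at, 213831_at, 209728_at) is likewise an assumption inferred from the observations noted in the table.

```python
from collections import defaultdict

PROBE_SETS = ("203290_at", "213831_at", "209728_at")  # assumed column order

# Hypothetical thresholds for illustration only, NOT the published values.
THRESHOLDS = {"203290_at": 200.0, "213831_at": 200.0, "209728_at": 150.0}


def hla_score(intensities):
    """Return one '1'/'0' flag per probe set: '1' if intensity exceeds its threshold."""
    return "".join(
        "1" if intensities[p] > THRESHOLDS[p] else "0" for p in PROBE_SETS
    )


def inconsistent_patients(samples):
    """Map each patient whose samples carry more than one distinct score to those scores."""
    scores = defaultdict(set)
    for patient, intensities in samples:
        scores[patient].add(hla_score(intensities))
    return {p: sorted(s) for p, s in scores.items() if len(s) > 1}


def sample(patient, values):
    return patient, dict(zip(PROBE_SETS, values))


# Intensities of patients 55 and 45, copied from the table above.
samples = [
    sample("55", (28, 4182, 48)), sample("55", (23, 4606, 72)),
    sample("55", (10, 2412, 24)), sample("55", (28, 4765, 21)),
    sample("45", (45, 8041, 187)), sample("45", (59, 7619, 123)),
    sample("45", (32, 6385, 105)), sample("45", (59, 7062, 40)),
]
print(inconsistent_patients(samples))
```

With these placeholder thresholds, patient 55 scores consistently while patient 45 is reported because one time point differs from the other three, which matches the "slightly above threshold" note in the table.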
Summary of the score tests
| | Dataset | Tissue | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | GSE6751 | PBMC | 59 | 15 | 5 | 3 | 5 | 4 | Follow-up |
| | GSE6281 | Skin | 33 | 11 | 0 | 0 | 0 | 0 | NA |
| | GSE20489 | Whole blood | 54 | 11 | 1 | 0 | 1 | 1 | Follow-up |
| | GSE24206 | Lung | 12 | 6 | 0 | 0 | 0 | 0 | NA |
| | GSE32473 | Skin | 30 | 10 | 0 | 0 | 0 | 0 | NA |
| Summary | 5 datasets | 4 tissues | 188 | 53 | 6 | 3 | 6 | 5 | |
The candidate markers were applied to datasets from the public domain (http://www.ncbi.nlm.nih.gov/geo). About 3% of samples have annotation issues which could not be resolved by visual inspection of the intensity data alone. The individual REDKX marker intensity values are not shown. Individual results are shown in Additional file 1: Table S1.