Cong Liu1,2, Jianping Jiang2,3, Jianlei Gu2,3,4, Zhangsheng Yu2,3, Tao Wang5,6, Hui Lu7,8,9,10.
Abstract
BACKGROUND: High-throughput technology can generate thousands to millions of biomarker measurements in one experiment. However, results from high-throughput analysis are often barely reproducible because of small sample sizes. Different statistical methods have been proposed to tackle this "small n and large p" scenario; for example, different datasets can be pooled or integrated to improve reproducibility. However, the raw data are either unavailable or hard to integrate because of different experimental conditions, so there is an emerging need for a method of "knowledge integration" in high-throughput data analysis.
Keywords: Dimension reduction; Knowledge integration; SKI; Sure independence screening; Variable selection
Year: 2016 PMID: 28155690 PMCID: PMC5260139 DOI: 10.1186/s12918-016-0358-0
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Fig. 1 A brief description of the (i)SKI procedure. For each variable, two ranks are generated: one based on prior knowledge (R0), the other based on marginal correlation (R1). A predefined α (or one estimated from the dev. ratio) controls the weight of prior knowledge. Variables are then sorted by the weighted geometric mean of the two ranks. SKI first reduces the number of variables from p to d, and then a more sophisticated method such as SCAD is used to further refine the model to size d′ and estimate the parameters. iSKI updates the marginal-correlation-based rank (R1) by regressing the residuals on the remaining p − d′ variables. The procedure is repeated until the desired number of parameters is obtained.
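The screening step in the caption above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `ski_scores` and `ski_screen` are hypothetical, and the exact weighting is an assumption (here α is the exponent on the prior-knowledge rank R0 and 1 − α the exponent on the marginal-correlation rank R1, so α = 0 reduces to ranking by R1 alone).

```python
def ski_scores(prior_rank, marginal_rank, alpha):
    """Weighted geometric mean of two ranks, one score per variable.

    Assumption (not confirmed by the source): alpha weights the
    prior-knowledge rank R0 and (1 - alpha) weights the rank R1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [r0 ** alpha * r1 ** (1.0 - alpha)
            for r0, r1 in zip(prior_rank, marginal_rank)]


def ski_screen(prior_rank, marginal_rank, alpha, d):
    """Return the indices of the d variables with the smallest combined score.

    Rank 1 is the best rank, so a smaller score means a stronger candidate.
    sorted() is stable, so tied scores keep their original variable order.
    """
    scores = ski_scores(prior_rank, marginal_rank, alpha)
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    return order[:d]


if __name__ == "__main__":
    r0 = [1, 2, 3, 4]  # hypothetical prior-knowledge ranks
    r1 = [4, 3, 2, 1]  # hypothetical marginal-correlation ranks
    print(ski_screen(r0, r1, alpha=0.0, d=2))  # driven by R1 only
    print(ski_screen(r0, r1, alpha=1.0, d=2))  # driven by R0 only
```

The variables kept by `ski_screen` would then be passed to a penalized method such as SCAD for refinement, a step this sketch does not cover.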
Simulation results comparing the number of true positives among different methods
| %b | σ² (internal)c | σ² (external)d | αe | Top 1%a: SISf | Top 1%: SKIg | Top 1%: Ph | Top 5%: SIS | Top 5%: SKI | Top 5%: P | Top 10%: SIS | Top 10%: SKI | Top 10%: P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 1 | 1 | 0.075 | 38.96 | 38.94 | 36.36 | 45.78 | 45.72 | 43.63 | 47.66 | 47.63 | 45.63 |
| 0.5 | 1 | 1 | 0.275 | 38.53 | 43.06 | 45.22 | 45.66 | 47.65 | 48.54 | 47.53 | 48.85 | 49.13 |
| 1.0 | 1 | 1 | 0.384 | 38.5 | 46.34 | 47.99 | 45.65 | 48.9 | 49.58 | 47.49 | 49.51 | 49.83 |
| 0.0 | 1 | 3 | 0.090 | 39.10 | 38.97 | 35.01 | 45.81 | 45.80 | 42.94 | 47.71 | 47.72 | 44.03 |
| 0.5 | 1 | 3 | 0.249 | 38.92 | 42.55 | 43.85 | 45.80 | 47.31 | 48.28 | 47.57 | 48.55 | 49.10 |
| 1.0 | 1 | 3 | 0.368 | 39.04 | 45.81 | 47.58 | 45.88 | 48.60 | 49.44 | 47.65 | 49.21 | 49.73 |
| 0.0 | 3 | 1 | 0.113 | 36.84 | 36.43 | 35.77 | 44.61 | 44.01 | 43.37 | 46.69 | 46.57 | 46.19 |
| 0.5 | 3 | 1 | 0.261 | 37.27 | 42.16 | 44.90 | 45.15 | 47.36 | 48.34 | 47.07 | 48.56 | 49.03 |
| 1.0 | 3 | 1 | 0.374 | 36.91 | 46.01 | 48.89 | 44.76 | 49.42 | 49.51 | 47.12 | 49.86 | 49.90 |
| 0.0 | 3 | 3 | 0.104 | 37.84 | 37.48 | 35.19 | 45.73 | 45.43 | 44.07 | 47.63 | 47.53 | 45.93 |
| 0.5 | 3 | 3 | 0.264 | 37.26 | 42.52 | 44.48 | 45.03 | 47.35 | 48.26 | 47.19 | 48.58 | 49.00 |
| 1.0 | 3 | 3 | 0.355 | 37.05 | 45.20 | 47.37 | 45.1 | 48.6 | 49.39 | 47.05 | 49.36 | 49.76 |
a The top 1%, 5% and 10% of variables were selected, respectively, under each setting
b The percentage of non-zero β's that overlap between the two datasets
c σ²: the variance added in the internal dataset to generate response Y
d σ²: the variance added in the external dataset to generate response Y
e α: the estimated value of α, which controls the weight of the two ranks in the geometric mean
f SIS: variables were sorted by marginal correlation using only the internal dataset
g SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks, one from each dataset
h Pool: the two datasets were pooled and treated as a single dataset, and variables were then sorted by marginal correlation
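The three rankings compared in this table all start from marginal correlation, which can be sketched as follows. This is an illustrative sketch under stated assumptions, not the simulation code from the paper: the helper names `pearson`, `marginal_ranks` and `pooled_ranks` are hypothetical, variables are ranked by the absolute value of their Pearson correlation with the response, and each dataset is a list of sample rows.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def marginal_ranks(X, y):
    """Rank each column of X by |correlation with y|; rank 1 is the strongest.

    X is a list of n rows, each holding p variable values.
    """
    p = len(X[0])
    strength = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    order = sorted(range(p), key=lambda j: -strength[j])
    ranks = [0] * p
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def pooled_ranks(X_int, y_int, X_ext, y_ext):
    """'Pool' baseline: stack both datasets, then rank as a single dataset."""
    return marginal_ranks(X_int + X_ext, y_int + y_ext)


if __name__ == "__main__":
    # Tiny made-up data: variable 0 tracks y, variable 1 does not.
    X_int = [[1, 5], [2, 3], [3, 6], [4, 4]]
    y_int = [1, 2, 3, 4]
    print(marginal_ranks(X_int, y_int))  # SIS-style ranks from one dataset
```

In this setting SIS corresponds to `marginal_ranks` on the internal dataset alone, while SKI would combine that rank with the rank computed the same way on the external dataset, so it needs only the external ranks rather than the external raw data.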
Simulation results comparing the number of true positives between iterative and non-iterative approaches when the top 1% of variables were selected
| %a | ρb | αc | SISd | SKIe | iSISf | iSKIg |
|---|---|---|---|---|---|---|
| 0 | 0.3 | 0.061 | 23.32 | 23.12 | 25.22 | 22.53 |
| 0.5 | 0.3 | 0.342 | 24.83 | 33.20 | 26.13 | 34.43 |
| 1 | 0.3 | 0.443 | 23.14 | 34.41 | 26.33 | 38.85 |
| 0 | 0.6 | 0.044 | 37.35 | 36.34 | 41.11 | 36.17 |
| 0.5 | 0.6 | 0.392 | 36.47 | 41.67 | 39.67 | 44.83 |
| 1 | 0.6 | 0.453 | 37.12 | 45.83 | 40.44 | 49.40 |
a %: the percentage of non-zero β's that overlap between the two datasets
b ρ: correlation coefficient between two neighboring variables in each cluster
c α: the estimated value of α, which controls the weight of the two ranks in the geometric mean
d SIS: variables were sorted by marginal correlation using only the internal dataset
e SKI: variables were sorted by the weighted geometric mean of two marginal-correlation-based ranks, one from each dataset
f iSIS: iterative version of SIS
g iSKI: iterative version of SKI
Fig. 2 Boxplot of squared error for selumetinib response prediction using two methods. Whiskers indicate the minimum and maximum, the upper box limit the 75th percentile, the lower box limit the 25th percentile, and the line the median
The 18 variables, among the top 100 selected by the SKI procedure, whose association with selumetinib could be found in the database
| Gene Symbol | Probe ID | Type | Rank (marginal correlation)a | Rank (prior knowledge integrated)b |
|---|---|---|---|---|
| BRAF | NA | Mut | 4 | 1 |
| ADCK3 | 56997_at | Exp | 172 | 5 |
| TESK1 | 7016_at | Exp | 194 | 6 |
| DCLK2 | 166614_at | Exp | 196 | 8 |
| TNIK | 23043_at | Exp | 206 | 9 |
| NUAK2 | 81788_at | Exp | 209 | 10 |
| ERBB3 | 2065_at | Exp | 328 | 14 |
| PRKCD | 5580_at | Exp | 338 | 15 |
| MYLK | 4638_at | Exp | 479 | 20 |
| MAP3K1 | 4214_at | Exp | 502 | 21 |
| ULK3 | 25989_at | Exp | 519 | 23 |
| FGFR1 | 2260_at | Exp | 556 | 25 |
| SNRK | 54861_at | Exp | 582 | 26 |
| RPS6KA3 | 6197_at | Exp | 623 | 29 |
| STK10 | 6793_at | Exp | 691 | 31 |
| MAPK9 | 5601_at | Exp | 756 | 34 |
| TAOK3 | 51347_at | Exp | 761 | 35 |
| PIK3CB | 5291_at | Exp | 764 | 36 |
a R: rank by marginal correlation
b R: rank with prior knowledge integrated