Petr Smirnov, Ian Smith, Zhaleh Safikhani, Wail Ba-Alawi, Farnoosh Khodakarami, Eva Lin, Yihong Yu, Scott Martin, Janosch Ortmann, Tero Aittokallio, Marc Hafner, Benjamin Haibe-Kains.
Abstract
BACKGROUND: Identifying associations among biological variables is a major challenge in modern quantitative biological research, particularly given the systemic and statistical noise endemic to biological systems. Drug sensitivity data has proven to be a particularly challenging field for identifying associations to inform patient treatment.
Keywords: Association testing; Biomarker; Drug sensitivity; Non-parametric statistics; Pharmacogenomics; Power analysis; Statistics
Year: 2022 PMID: 35585485 PMCID: PMC9118710 DOI: 10.1186/s12859-022-04693-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 The asymptotic approximation of the CI null distribution produces an excess of small p-values. We took independent samples from a normal and a beta distribution, computed their similarity using the coefficients above, and calculated asymptotic p-values using the approximations from the text. Because the samples are independent, their p-values should be uniformly distributed. The Q-Q plots for normal (a) and beta (c) distributions, for samples of length N = 100 sampled 200,000 times, show an excess of small p-values for CI and rCI. In the case of the normal distribution, p-values of occur over twenty times more often than would be expected, and for the beta distribution nearly one hundred times more often for rCI. (b, normal) and (d, beta) summarize the frequency of for different sample sizes. As the number of samples grows large, the asymptotic approximation becomes more accurate, but even in the regime of hundreds of samples, extreme p-values occur several times more often than they should under the null
Fig. 2 The analytical null accurately computes exact p-values for CI. (a) The analytical distribution matches a permutation null of K = 1e6 samples of length 100 from a standard normal distribution. As CI is entirely non-parametric, the choice of distribution is irrelevant. (b) The Q-Q plot shows the -log10 empirical rank of the CI on the x-axis and the -log10 theoretical quantile from the analytical null (red) and asymptotic null (blue). The analytical p-values are both monotonic and correctly approximate the uniform distribution (grey)
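To illustrate how an exact null for CI can be obtained without permutation, here is a minimal sketch. It assumes that CI is the fraction of concordant pairs and that the data contain no ties; under independence, the number of discordant pairs then has the same distribution as the number of inversions in a uniformly random permutation. The function names are hypothetical, and this is not necessarily the authors' implementation.

```python
import itertools

def concordance_index(x, y):
    """Fraction of pairs ordered the same way in x and y (assumes no ties)."""
    pairs = list(itertools.combinations(range(len(x)), 2))
    concordant = sum((x[i] - x[j]) * (y[i] - y[j]) > 0 for i, j in pairs)
    return concordant / len(pairs)

def inversion_counts(n):
    """counts[k] = number of permutations of n items with exactly k inversions."""
    counts = [1]  # a single item has one ordering and zero inversions
    for m in range(2, n + 1):
        # Inserting the m-th item creates between 0 and m - 1 new inversions.
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for extra in range(m):
                new[k + extra] += c
        counts = new
    return counts

def exact_p_value(ci, n):
    """One-sided P(CI >= ci) under independence for n untied samples."""
    counts = inversion_counts(n)
    total_pairs = n * (n - 1) // 2
    # A large CI means few discordant pairs, so sum the lower tail of inversions.
    max_discordant = total_pairs - round(ci * total_pairs)
    return sum(counts[: max_discordant + 1]) / sum(counts)

x = [0.3, 1.2, -0.5, 2.0, 0.9]
y = [0.1, 0.8, -0.2, 1.5, 1.1]
ci = concordance_index(x, y)
print(ci, exact_p_value(ci, len(x)))
```

Because the counts are exact integers, the resulting p-values are not limited by the number of permutations drawn, which is the practical advantage over a sampled permutation null.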
Fig. 3 Power analysis for data simulated using the bivariate Gaussian family. (a) The effect of the delta parameter on the empirical power at a fixed effect size of population r = 0.3. Other statistics, which are unaffected by the parameter, are plotted for comparison. (b) The empirically observed power for the rCI statistic only, plotting the dependence on delta at 3 different effect sizes. The power is normalized as a percentage of the maximum power achieved for each effect size to highlight the optimal region for choosing delta. (c) Empirical power for as the population expected Pearson correlation increases. (d) Empirical power for a varying sample size, as the effect size is modified to keep a theoretically constant power of 0.5 for the Pearson correlation. Power is plotted as a percentage of the Pearson correlation power achieved in simulation
Fig. 4 Drug recall analysis across pharmacogenomic datasets. For all pairs of datasets, the similarity between the vector of cell line responses for all pairs of drugs is computed with each coefficient for (a) all drugs and (b) those drugs with at least fifty cell lines in common across datasets. For drugs present in both datasets, the rank of the matched drug relative to all drugs is extracted. The x-axis is the rank of the matched drug, where 0 is most similar and 1 is least similar. The y-axis is the empirical CDF of the matched drugs for a given rank, or the fraction of matched drugs with rank less than x
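The recall rank described in this caption can be sketched as follows. This is a hedged illustration only: the dictionary layout, the function names, and the choice of Pearson correlation as the similarity are assumptions, and each response profile is assumed to be already restricted to the cell lines shared by the two datasets.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length response profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def matched_drug_rank(drug, dataset_a, dataset_b, similarity=pearson):
    """Normalized rank of `drug` in dataset_b (0 = most similar, 1 = least),
    scored against the same drug's profile in dataset_a."""
    query = dataset_a[drug]
    scores = {name: similarity(query, profile)
              for name, profile in dataset_b.items()}
    # Count the drugs in dataset_b that score strictly higher than the match.
    better = sum(score > scores[drug] for score in scores.values())
    return better / (len(scores) - 1)

# Hypothetical toy data: responses of each drug over four shared cell lines.
dataset_a = {"drugA": [0.1, 0.4, 0.5, 0.9]}
dataset_b = {
    "drugA": [0.2, 0.35, 0.6, 0.8],
    "drugB": [0.9, 0.1, 0.7, 0.2],
    "drugC": [0.5, 0.6, 0.4, 0.5],
}
print(matched_drug_rank("drugA", dataset_a, dataset_b))
```

Any of the other coefficients could be passed as `similarity` in place of `pearson`, provided a larger value means a more similar pair of profiles.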
A table of the areas under the empirical CDF curves for drug recall across datasets
| Coefficient | Area under CDF | Area under CDF, N > 50 cell lines |
|---|---|---|
| Pearson | 0.8899 | 0.9382 |
| Spearman | 0.8704 | 0.9230 |
| CI | 0.8698 | 0.9238 |
| rCI | 0.8639 | 0.9222 |
| kCI | 0.8808 | 0.9341 |
Because the drug recall computes the normalized rank of the matched drug, the area under the empirical CDF is equivalent to 1 minus the mean rank of matched drugs across studies
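The equivalence between the area under the empirical CDF and 1 minus the mean rank holds for any set of ranks in [0, 1], and can be checked numerically. The sketch below uses a hypothetical helper name and made-up ranks purely for illustration.

```python
def area_under_ecdf(ranks):
    """Exact area under the empirical CDF of `ranks` over the interval [0, 1]."""
    xs = sorted(ranks)
    n = len(xs)
    area = 0.0
    # The ECDF equals k/n between the k-th and (k+1)-th sorted ranks.
    for k in range(1, n):
        area += (k / n) * (xs[k] - xs[k - 1])
    # From the largest rank up to 1 the ECDF equals 1.
    area += 1.0 - xs[-1]
    return area

ranks = [0.0, 0.1, 0.5]  # hypothetical normalized ranks of matched drugs
mean_rank = sum(ranks) / len(ranks)
print(area_under_ecdf(ranks), 1 - mean_rank)
```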
| rCI decision boundary | True value: Positive | True value: Negative |
|---|---|---|
| Positive | | |
| Negative | | |