Susanne Gerber1, Lukas Pospisil2, Stanislav Sys1, Charlotte Hewel1, Ali Torkamani3, Illia Horenko2.
Abstract
Mislabeling of cases as well as controls in case-control studies is a frequent source of strong bias in prognostic and diagnostic tests and algorithms. Common data processing methods available to researchers in the biomedical community do not allow for consistent and robust treatment of labeled data in situations where both the case and the control groups contain a non-negligible proportion of mislabeled data instances. This is an especially prominent issue in studies of late-onset conditions, where individuals who may later convert to cases can populate the control group, and in screening studies, which often have high false-positive and false-negative rates. To address this problem, we propose a method for simultaneous robust inference of Lasso-reduced discriminative models and of latent group-specific mislabeling risks that does not require any exactly labeled data. We apply it to a standard breast cancer imaging dataset and infer the mislabeling probabilities (the rates of false-negative and false-positive core-needle biopsies) together with a small set of simple diagnostic rules, outperforming the state-of-the-art BI-RADS diagnostics on these data. The inferred mislabeling rates for breast cancer biopsies agree with published, purely empirical studies. Applying the method to human genomic data from a healthy-ageing cohort reveals a previously unreported compact combination of single-nucleotide polymorphisms that is strongly associated with a healthy-ageing phenotype in Caucasians. It determines that 7.5% of Caucasians in the 1000 Genomes dataset (selected as a control group) carry a pattern characteristic of healthy ageing.
Keywords: bias; bioinformatics; label noise; latent variable estimation; machine learning; mislabeling; regression
Year: 2022; PMID: 35072059; PMCID: PMC8766632; DOI: 10.3389/frai.2021.739432
Source DB: PubMed; Journal: Front Artif Intell; ISSN: 2624-8212
FIGURE 1. Application of (1–4) to the analysis of the standard BI-RADS dataset from http://archive.ics.uci.edu/ml/datasets/Mammographic+Mass: (A) Model selection by cross-validation (bootstrap-averaged values of the functional L from (1) with optimal parameters from the training sets being evaluated on the validation datasets); (B) optimal parameter vector α*; (C) probability of a malignant diagnosis as a function of BI-RADS features for the two groups of patients; (D) average impact of single BI-RADS features (sensitivity of the risk to the 7 binary features of importance).
FIGURE 2. (A) Significant correlation between the true synthetically induced mislabeling rate of the mammography data and the error rate predicted by the co-inference method. (B) Performance of different model types on the mammography dataset with various mislabeling rates. All models, except the original one, were trained using the respective mislabeled dataset. The average prediction accuracy was calculated based on the original mammography dataset. Co-inference outperforms a linear SVC and performs nearly on par with a state-of-the-art SVC using an RBF kernel.
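The synthetic mislabeling experiment in Figure 2 can be illustrated with a minimal sketch. It assumes symmetric, independent label flips on binary labels. The record does not state how the mislabeling was induced, so the function names `flip_labels` and `realized_flip_rate`, the toy label vector, and the list of rates are all hypothetical and are not the authors' procedure.

```python
import random


def flip_labels(labels, rate, seed=0):
    """Return a copy of binary (0/1) labels, each flipped independently with probability `rate`.

    Assumption: flips are symmetric (the same rate for cases and controls).
    """
    rng = random.Random(seed)
    return [1 - y if rng.random() < rate else y for y in labels]


def realized_flip_rate(original, noisy):
    """Fraction of positions where the noisy label differs from the original one."""
    return sum(a != b for a, b in zip(original, noisy)) / len(original)


# Toy stand-in for the true labels; the real study uses the mammography dataset.
clean = [i % 2 for i in range(1000)]

for rate in (0.0, 0.1, 0.2, 0.3):
    noisy = flip_labels(clean, rate, seed=42)
    print(f"target rate {rate:.1f}, realized rate {realized_flip_rate(clean, noisy):.3f}")
```

In an experiment of this kind, models would be trained on `noisy` and scored against `clean`, which mirrors the caption's statement that accuracy was calculated on the original dataset.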
FIGURE 3. Application of (1–4) to the analysis of filtered SNP data from (Erikson et al., 2016): (A) posterior probability distribution of the inferred optimal mislabelings from the “close-to-Wellderly European” cohort from 1000 Genomes (basis for the control group); (B) posterior probability distribution of the inferred optimal mislabelings from the “Wellderly Caucasian” cohort (Erikson et al., 2016) (basis for the case group); (C) estimated optimal weights α of the SNP patterns together with their 95% confidence intervals; (D) individual probabilities of Wellderly being due to genomic factors (together with their 95% confidence intervals), as inferred from the optimal feature weights α from panel (C). Posterior distributions in (A,B) and confidence intervals in (C,D) are obtained by non-parametric bootstrap sampling (Efron and Tibshirani, 1993) with 100 ensemble realizations.
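For readers unfamiliar with the non-parametric bootstrap mentioned in this caption, the following is a minimal sketch of a percentile confidence interval built from 100 resamples. It assumes a simple statistic (the mean) on a toy sample; the authors re-fit their full model on each resample, which is not reproduced here. The function name `bootstrap_ci` and the example data are hypothetical.

```python
import math
import random
import statistics


def bootstrap_ci(values, stat=statistics.mean, n_boot=100, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `values`.

    Each of the `n_boot` realizations resamples the data with replacement
    (same size as the original) and recomputes the statistic.
    """
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(math.floor(tail * (n_boot - 1)))]
    upper = estimates[int(math.ceil((1.0 - tail) * (n_boot - 1)))]
    return lower, upper


# Toy sample standing in for a quantity estimated from the data.
sample = [0.1 * i for i in range(50)]
low, high = bootstrap_ci(sample, n_boot=100, level=0.95, seed=1)
print(f"95% bootstrap CI for the mean: [{low:.3f}, {high:.3f}]")
```

With only 100 realizations the interval endpoints are themselves noisy; a larger `n_boot` would tighten them at proportional computational cost.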