Empirically-derived synthetic populations to mitigate small sample sizes.

Erin E Fowler, Anders Berglund, Michael J Schell, Thomas A Sellers, Steven Eschrich, John Heine.

Abstract

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), considered as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization by covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Observed and synthetic samples were compared under the assumption that their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from the Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For the samples, PCA scores and residuals did not deviate significantly when compared with their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level, statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples.
To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while synthesizing the respective population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
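
The generate-then-compare idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Gaussian kernel, swaps in a Scott's-rule bandwidth matrix in place of the DE-optimized one, and computes only a biased estimate of the squared MMD with an RBF kernel (no permutation test, no categorical variable, no PCA or DModX step). All function names, the toy data, and the kernel width `gamma` are hypothetical choices made for illustration. It uses only the Python standard library.

```python
import math
import random


def covariance(rows):
    """Unbiased sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def scott_bandwidth(rows):
    """Full bandwidth matrix H = n^(-2/(d+4)) * covariance (Scott's rule).

    Assumption: a stand-in for the DE-optimized matrix described in the abstract.
    """
    n, d = len(rows), len(rows[0])
    factor = n ** (-2.0 / (d + 4))
    return [[factor * c for c in row] for row in covariance(rows)]


def sample_kde(observed, bandwidth, n_draws, rng):
    """Draw from a Gaussian KDE: pick an observed row, add N(0, bandwidth) noise."""
    d = len(observed[0])
    low = cholesky(bandwidth)
    draws = []
    for _ in range(n_draws):
        centre = rng.choice(observed)
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        noise = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(d)]
        draws.append([c + e for c, e in zip(centre, noise)])
    return draws


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma):
    """Biased estimate of the squared maximum mean discrepancy."""
    def mean_k(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# Toy data: 180 individuals with four continuous variables (hypothetical values).
rng = random.Random(0)
observed = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(180)]

synthetic = sample_kde(observed, scott_bandwidth(observed), 180, rng)
shifted = [[v + 5.0 for v in row] for row in observed]  # clearly different sample

gamma = 1.0 / (2 * 4)  # arbitrary kernel width for this illustration
print("MMD^2 observed vs synthetic:", mmd_squared(observed, synthetic, gamma))
print("MMD^2 observed vs shifted:  ", mmd_squared(observed, shifted, gamma))
```

In this sketch a synthetic sample drawn from the KDE should give a much smaller squared MMD against the observed sample than a deliberately shifted sample does. A formal comparison would also need a significance threshold, for example from a permutation procedure, which is omitted here.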

Keywords: Differential evolution; Distance to the model in X-space; Kernel density estimation; Overfitting; Principal component analysis; Synthetic data generation

Year: 2020 PMID: 32173502 PMCID: PMC7839232 DOI: 10.1016/j.jbi.2020.103408

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
