Roshan A Karunamuni, Minh-Phuong Huynh-Le, Chun C Fan, Rosalind A Eeles, Douglas F Easton, Zsofia Kote-Jarai, Ali Amin Al Olama, Sara Benlloch Garcia, Kenneth Muir, Henrik Gronberg, Fredrik Wiklund, Markus Aly, Johanna Schleutker, Csilla Sipeky, Teuvo L J Tammela, Børge G Nordestgaard, Tim J Key, Ruth C Travis, David E Neal, Jenny L Donovan, Freddie C Hamdy, Paul Pharoah, Nora Pashayan, Kay-Tee Khaw, Stephen N Thibodeau, Shannon K McDonnell, Daniel J Schaid, Christiane Maier, Walther Vogel, Manuel Luedeke, Kathleen Herkommer, Adam S Kibel, Cezary Cybulski, Dominika Wokolorczyk, Wojciech Kluzniak, Lisa Cannon-Albright, Hermann Brenner, Ben Schöttker, Bernd Holleczek, Jong Y Park, Thomas A Sellers, Hui-Yi Lin, Chavdar Slavov, Radka Kaneva, Vanio Mitev, Jyotsna Batra, Judith A Clements, Amanda Spurdle, Manuel R Teixeira, Paula Paulo, Sofia Maia, Hardev Pandha, Agnieszka Michael, Ian G Mills, Ole A Andreassen, Anders M Dale, Tyler M Seibert.
Abstract
We determined the effect of sample size on the performance of polygenic hazard score (PHS) models in prostate cancer. Age and genotypes were obtained for 40,861 men from the PRACTICAL consortium. The dataset included 201,590 SNPs per subject and was split into training and testing sets. Established-SNP models considered 65 SNPs that had been previously associated with prostate cancer. Discovery-SNP models used stepwise selection to identify new SNPs. The performance of each PHS model was calculated for random samples of the training set at a range of sample sizes. The performance of a representative Established-SNP model was estimated for random samples of the testing set at a range of sample sizes. Mean HR98/50 (hazard ratio of the top 2% to the average in the test set) of the Established-SNP model increased from 1.73 [95% CI: 1.69-1.77] to 2.41 [2.40-2.43] when the number of training samples was increased from 1 thousand to 30 thousand. The corresponding HR98/50 of the Discovery-SNP model increased from 1.05 [0.93-1.18] to 2.19 [2.16-2.23]. HR98/50 values of a representative Established-SNP model using testing set sample sizes of 0.6 thousand and 6 thousand observations were 1.78 [1.70-1.85] and 1.73 [1.71-1.76], respectively. We estimate that a study population of 20 thousand men is required to develop Discovery-SNP PHS models, while 10 thousand men should be sufficient for Established-SNP models.
Year: 2020 PMID: 32514134 PMCID: PMC7608255 DOI: 10.1038/s41431-020-0664-2
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
Performance metrics used in the evaluation of polygenic hazard scores.
| Performance metric | Description |
|---|---|
| HR98/50 | Hazard ratio of the top 2% to the average (30–70%) in the test set |
| HR20/50 | Hazard ratio of the bottom 20% to the average (30–70%) in the test set |
| HR98/20 | Hazard ratio of the top 2% to the bottom 20% in the test set |
| HR80/20 | Hazard ratio of the top 20% to the bottom 20% in the test set |
| beta | Coefficient of PHS in a Cox proportional hazards model using PHS as a sole predictor of age in the test set |
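
As a rough illustration of how the metrics in this table relate to one another, the sketch below derives each hazard ratio from a Cox coefficient and a list of PHS values in the test set. It rests on an assumption that the record above does not state: that the hazard ratio between two percentile bands equals exp(beta × difference in mean PHS between the bands). The function names `band_mean` and `performance_metrics` are hypothetical, and `beta` is taken as already estimated elsewhere.

```python
import math


def band_mean(sorted_phs, lo_pct, hi_pct):
    """Mean PHS of subjects whose rank falls between two percentiles.

    `sorted_phs` must be sorted in ascending order.
    """
    n = len(sorted_phs)
    start = math.floor(n * lo_pct / 100)
    stop = math.ceil(n * hi_pct / 100)
    band = sorted_phs[start:stop]
    return sum(band) / len(band)


def performance_metrics(phs, beta):
    """Compute the table's metrics from test-set PHS values and a Cox beta.

    Assumption (not from the source): HR between two bands is
    exp(beta * (mean PHS of numerator band - mean PHS of denominator band)).
    """
    s = sorted(phs)
    top2 = band_mean(s, 98, 100)      # top 2%
    top20 = band_mean(s, 80, 100)     # top 20%
    bottom20 = band_mean(s, 0, 20)    # bottom 20%
    average = band_mean(s, 30, 70)    # "average" band, 30 to 70%

    def hr(numerator, denominator):
        return math.exp(beta * (numerator - denominator))

    return {
        "HR98/50": hr(top2, average),
        "HR20/50": hr(bottom20, average),
        "HR98/20": hr(top2, bottom20),
        "HR80/20": hr(top20, bottom20),
        "beta": beta,
    }
```

Under this assumption every ratio is driven by the same coefficient, so HR98/20 equals HR98/50 divided by HR20/50, and a beta of zero gives a hazard ratio of 1 for every pair of bands.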
Fig. 1 Comparison of performance metrics between Established-SNP (EST) and Discovery-SNP (DIS) models across 50 random samples of the training set, each with a sample size of 30 thousand.
There is more variability with the Discovery process. Established SNPs, though, were discovered using the data in the training set; this circularity is not accounted for in the present study, which focuses on sample size effects.
Fig. 2 Performance metrics of the Established-SNP model.
Box plots of performance metrics are shown for random samples of the training set using sample sizes of 1, 5, 10, 15, 20, 25, and 30 thousand total observations. Within each box plot, the horizontal line represents the median and the box extends from the 25th to 75th percentile.
Fig. 3 Performance metrics of the Discovery-SNP model.
Box plots of performance metrics are shown for random samples of the training set using sample sizes of 1, 5, 10, 15, 20, 25, and 30 thousand total observations. Within each box plot, the horizontal line represents the median and the box extends from the 25th to 75th percentile.
Fig. 4 Performance as a function of testing sample size.
Box plots of performance metrics of the representative Established-SNP model in random samples of the testing set from 0.5 to 6 thousand total observations.
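
The repeated-sampling design behind the box plots can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' pipeline: `subsample_performance` and `fit_and_score` are hypothetical names, and `fit_and_score` stands in for whatever step fits a PHS model on a random sample and returns one performance metric (for example HR98/50) from the evaluation data.

```python
import random
import statistics


def subsample_performance(subject_ids, sizes, n_repeats, fit_and_score, seed=0):
    """Summarize a performance metric over repeated random samples.

    For each sample size, draw `n_repeats` random samples without
    replacement, score each one with the caller-supplied (hypothetical)
    `fit_and_score` callable, and report the quartiles that a box plot
    would display.
    """
    rng = random.Random(seed)
    summary = {}
    for size in sizes:
        scores = [
            fit_and_score(rng.sample(subject_ids, size))
            for _ in range(n_repeats)
        ]
        # quantiles(n=4) returns the 25th, 50th and 75th percentile cut points
        q1, median, q3 = statistics.quantiles(scores, n=4)
        summary[size] = {"q1": q1, "median": median, "q3": q3}
    return summary


if __name__ == "__main__":
    # Toy stand-in scorer: the mean subject ID of the sample.
    # A real scorer would fit a model and evaluate it on held-out data.
    ids = list(range(1000))
    result = subsample_performance(
        ids,
        sizes=[10, 100, 500],
        n_repeats=50,
        fit_and_score=lambda sample: sum(sample) / len(sample),
    )
    for size, stats in result.items():
        print(size, stats)
```

Fixing the seed makes the random samples reproducible, which helps when comparing models on identical samples.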