Emma V A Sylvester, Paul Bentzen, Ian R Bradbury, Marie Clément, Jon Pearce, John Horne, Robert G Beiko.
Abstract
Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine-scale population assignment. We applied these methods to an unpublished SNP data set for Atlantic salmon (Salmo salar) and a published SNP data set for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50–700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each data set, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic data sets to improve assignment for management and conservation of exploited populations.
Keywords: SNP selection; conservation genetics; fisheries management; individual assignment; random forest
Year: 2017 PMID: 29387152 PMCID: PMC5775496 DOI: 10.1111/eva.12524
Source DB: PubMed Journal: Evol Appl ISSN: 1752-4571 Impact factor: 5.183
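As a rough illustration of the FST-ranking baseline described in the abstract, the sketch below ranks biallelic SNPs by a simple heterozygosity-based FST, (HT − HS)/HT, computed from per-population allele frequencies, and keeps the top-ranked markers as a panel. The abstract does not say which FST estimator the authors used, so the estimator, the equal weighting of populations, the function names (`fst_per_locus`, `top_fst_snps`) and the toy frequencies are all assumptions made for illustration.

```python
from statistics import mean


def fst_per_locus(freqs):
    """Heterozygosity-based FST = (HT - HS) / HT for one biallelic SNP.

    freqs: frequency of one allele in each population.
    Populations are weighted equally (an assumption of this sketch).
    """
    p_bar = mean(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                # expected heterozygosity, pooled
    h_s = mean(2 * p * (1 - p) for p in freqs)   # mean within-population heterozygosity
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def top_fst_snps(freq_table, panel_size):
    """Return the names of the panel_size SNPs with the highest FST."""
    ranked = sorted(
        freq_table,
        key=lambda snp: fst_per_locus(freq_table[snp]),
        reverse=True,
    )
    return ranked[:panel_size]


# Hypothetical allele frequencies in three populations.
toy = {
    "snp_a": [0.10, 0.90, 0.50],  # strongly differentiated
    "snp_b": [0.50, 0.50, 0.50],  # no differentiation
    "snp_c": [0.40, 0.60, 0.50],  # weakly differentiated
}
panel = top_fst_snps(toy, 2)
```

Ranking each locus independently is what makes this baseline simple; unlike the random forest-based methods, it cannot account for redundancy or interactions among markers.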
Table 1. Site locations and sample size for all study collections of juvenile salmon, sampled in 2013 and 2014
| River name | Sample size | Site ID | Latitude (N) | Longitude (W) |
|---|---|---|---|---|
| Cape Caribou River | 21 | CB | 53°32′48,8″ | 60°36′27,0″ |
| Caroline Brook | 20 | CL | 53°15,232′ | 60°31,899′ |
| Peters River | 21 | PR1 | 53°20′10,4″ | 60°47′15,3″ |
| | | PR2 | 53°20,345′ | 60°37,293′ |
| Red Wine River | 22 | RW1 | 53°52,764′ | 61°27,976′ |
| | | RW2 | 53°52,928′ | 61°28,730′ |
| Susan River | 22 | SR1 | 53°44,365′ | 61°3,275′ |
| | | SR2 | 53°44,184′ | 61°02,216′ |
| Crooked River | 21 | CR | 53°50,991′ | 60°48,863′ |
| Kenamu River | 22 | KE | 52°50,952′ | 60°08,279′ |
| Main Brook River | 21 | MB | 54°04,355′ | 57°52,374′ |
| Mulligan River | 17 | MU | 53°52,138′ | 60°05,392′ |
| Sebaskachu River | 22 | SK1 | 53°47,397′ | 60°08,523′ |
| | | SK2 | 53°46,10′ | 60°10,575′ |
| Traverspine River | 22 | TR | 53°08,853′ | 60°27,769′ |
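The site coordinates mix two notations: degrees, minutes and seconds (for example 53°32′48,8″) and degrees with decimal minutes (for example 53°15,232′), both written with a decimal comma. A minimal sketch for converting either form to decimal degrees follows. The function name `to_decimal_degrees` is hypothetical, and the negative sign for west longitude follows the usual mapping convention rather than anything stated in the table.

```python
import re

# degrees, then minutes (optionally decimal), then optional seconds
_COORD = re.compile(r"^\s*(\d+)°(\d+(?:[.,]\d+)?)′(?:(\d+(?:[.,]\d+)?)″)?\s*$")


def _to_float(text):
    """Parse a number that may use a decimal comma."""
    return float(text.replace(",", "."))


def to_decimal_degrees(text):
    """Convert '53°32′48,8″' or '53°15,232′' to unsigned decimal degrees."""
    match = _COORD.match(text)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {text!r}")
    degrees, minutes, seconds = match.groups()
    value = int(degrees) + _to_float(minutes) / 60
    if seconds is not None:
        value += _to_float(seconds) / 3600
    return value


# Cape Caribou River (CB): latitude N is positive, longitude W is negative.
cb_lat = to_decimal_degrees("53°32′48,8″")
cb_lon = -to_decimal_degrees("60°36′27,0″")

# Caroline Brook (CL) uses degrees and decimal minutes.
cl_lat = to_decimal_degrees("53°15,232′")
```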
Figure 1. Sampling locations of (a) Atlantic salmon (Salmo salar) from Lake Melville, Labrador, Canada and (b) Chinook salmon (Oncorhynchus tshawytscha) from western Alaska and the Yukon River. See Table 1 for site coordinates, site ID and sample size for Atlantic salmon sampling. Coordinates for Chinook salmon sampling sites were obtained from Larson et al. (2014a). Maps were created using ArcGIS (ESRI, 2011)
Table 2. Properties of panels selected for assignment analysis by SNP selection method: FST rank, random forest (RF), regularized random forest (RRF) and guided regularized random forest (GRRF) (see Section “2.2”). As RF rank was selected to create panels of target size, the panel size column indicates “(Rank) panel size” for RF‐selected panels. See Fig. S3 for intersections of SNPs across methods
| Method | Parameter for selection | Parameter value | Panel size, Atlantic salmon | Panel size, Chinook salmon |
|---|---|---|---|---|
| FST | Top ranked | – | 60 | 47 |
| | | – | 85 | 65 |
| | | – | 104 | 88 |
| | | – | 130 | 112 |
| | | – | 184 | 134 |
| | | – | 266 | 182 |
| | | – | 344 | 240 |
| | | – | 508 | 384 |
| | | – | 519 | 454 |
| | | – | 670 | 509 |
| RF | Within (×) rank across all 5 runs | – | (800) 66 | (400) 41 |
| | | – | (825) 90 | (600) 74 |
| | | – | (850) 110 | (700) 91 |
| | | – | (875) 135 | (850) 125 |
| | | – | (900) 157 | (950) 167 |
| | | – | (950) 201 | (1,000) 216 |
| | | – | (1,050) 298 | (1,100) 277 |
| | | – | (1,200) 435 | (1,250) 341 |
| | | – | (1,400) 605 | (1,400) 437 |
| | | – | (1,500) 697 | (1,500) 519 |
| RRF | Penalty coefficient (λ) | 0.75 | 51 | 47 |
| | | 0.8 | 83 | 71 |
| | | 0.825 | 114 | 94 |
| | | 0.85 | 140 | 110 |
| | | 0.875 | 180 | 150 |
| | | 0.9 | 275 | 191 |
| | | 0.925 | 336 | 260 |
| | | 0.95 | 515 | 364 |
| | | 0.975 | 604 | 470 |
| | | 0.99 | 710 | 528 |
| GRRF | Weight of penalty (γ) | 0.25 | 60 | 47 |
| | | 0.2 | 85 | 65 |
| | | 0.175 | 104 | 88 |
| | | 0.15 | 130 | 112 |
| | | 0.125 | 184 | 134 |
| | | 0.1 | 266 | 182 |
| | | 0.075 | 344 | 240 |
| | | 0.05 | 508 | 384 |
| | | 0.025 | 519 | 454 |
| | | 0.01 | 670 | 509 |
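For orientation on the two tuning parameters in the table: in regularized random forests as commonly described, the split gain of a feature not yet used by the forest is multiplied by a coefficient in (0, 1], so a smaller coefficient discourages adding new features. RRF applies one coefficient λ to every feature, while GRRF derives a per-feature coefficient from γ and an importance score taken from an ordinary RF. The sketch below assumes the form λ_i = (1 − γ) + γ · imp_i, with importances scaled by their maximum. The source does not give this formula, so treat it, the function names and the toy importances as assumptions.

```python
def grrf_coefficients(importances, gamma):
    """Per-feature penalty coefficients, assuming lambda_i = (1 - gamma) + gamma * imp_i.

    importances: non-negative importance scores from an ordinary random forest.
    gamma: weight given to the importance scores, between 0 and 1.
    """
    top = max(importances)
    if top <= 0:
        raise ValueError("at least one importance must be positive")
    return [(1 - gamma) + gamma * (imp / top) for imp in importances]


def penalised_gain(gain, coefficient, already_selected):
    """Split gain after regularisation: only features not yet in use are penalised."""
    return gain if already_selected else coefficient * gain


# Hypothetical importances for three SNPs.
coefs_strict = grrf_coefficients([0.0, 0.5, 1.0], gamma=0.25)
coefs_loose = grrf_coefficients([0.0, 0.5, 1.0], gamma=0.0)
```

Under this assumed form, lowering γ moves every coefficient toward 1 and weakens the penalty, which is consistent with the table: the smallest γ values and the largest λ values both give the largest panels.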
Figure 2. Average, overall self‐assignment accuracy of identified SNP panels (50–700 SNPs) for (a) Atlantic salmon and (b) Chinook salmon (Larson et al., 2014a) calculated across sampling sites. SNP selection method (FST rank, RF, RRF and GRRF) is indicated by colour (see Section “2.2” for more information)
Figure 3. Self‐assignment accuracy of identified SNP panels (50–700 SNPs) across all sampling sites as indicated by site ID (see Table 1) for (a) Atlantic salmon and (b) Chinook salmon (Larson et al., 2014a). SNP selection method (FST rank, RF, RRF and GRRF) is indicated by colour (see Section “2.2” for more information)
Figure 4. Assignment matrix heatmaps indicating per cent assignment calculated across the best performing panel of the smallest panels (Figure 3). Assignment as determined by (a) FST for Atlantic salmon and (b) RF for Chinook salmon (Larson et al., 2014a). Colour intensity indicates the probability of an individual from a reference population (rows) being assigned to a given population (columns), where red indicates the highest probability and blue the lowest
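The quantities plotted in Figures 2 to 4 can be sketched from a list of true and assigned sites: a row-normalised assignment matrix in per cent (rows are reference populations, columns are assigned populations), per-site self-assignment accuracy taken from its diagonal, and an overall accuracy. Averaging the per-site values with equal weight is an assumption of this sketch, as are the function names and the toy assignments; the assignment test itself is not reproduced here.

```python
from collections import Counter, defaultdict


def assignment_matrix(true_sites, assigned_sites):
    """Per cent of individuals from each true site (row) assigned to each site (column)."""
    counts = defaultdict(Counter)
    for true, assigned in zip(true_sites, assigned_sites):
        counts[true][assigned] += 1
    sites = sorted(set(true_sites) | set(assigned_sites))
    matrix = {}
    for origin in sorted(counts):
        total = sum(counts[origin].values())
        matrix[origin] = {dest: 100.0 * counts[origin][dest] / total for dest in sites}
    return matrix


def self_assignment_accuracy(matrix):
    """Return (per-site accuracy, unweighted mean across sites), both in per cent."""
    per_site = {site: row[site] for site, row in matrix.items()}
    overall = sum(per_site.values()) / len(per_site)
    return per_site, overall


# Hypothetical assignments for six individuals from two sites in Table 1.
true_sites = ["CB", "CB", "CB", "CB", "KE", "KE"]
assigned_sites = ["CB", "CB", "CB", "KE", "KE", "KE"]

matrix = assignment_matrix(true_sites, assigned_sites)
per_site, overall = self_assignment_accuracy(matrix)
```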