Stijn Vanderzande, Nicholas P Howard, Lichun Cai, Cassia Da Silva Linge, Laima Antanaviciute, Marco C A M Bink, Johannes W Kruisselbrink, Nahla Bassil, Ksenija Gasic, Amy Iezzoni, Eric Van de Weg, Cameron Peace.
Abstract
High-quality genotypic data is a requirement for many genetic analyses. For any crop, errors in genotype calls, phasing of markers, linkage maps, pedigree records, and unnoticed variation in ploidy levels can lead to spurious marker-locus-trait associations and incorrect origin assignment of alleles to individuals. High-throughput genotyping requires automated scoring, as manual inspection of thousands of scored loci is too time-consuming. However, automated SNP scoring can result in errors that should be corrected to ensure recorded genotypic data are accurate and thereby ensure confidence in downstream genetic analyses. To enable quick identification of errors in a large genotypic data set, we have developed a comprehensive workflow. This multiple-step workflow is based on inheritance principles and on removal of markers and individuals that do not follow these principles, as demonstrated here for apple, peach, and sweet cherry. Genotypic data was obtained on pedigreed germplasm using 6-9K SNP arrays for each crop and a subset of well-performing SNPs was created using ASSIsT. Use of correct (and corrected) pedigree records readily identified violations of simple inheritance principles in the genotypic data, streamlined with FlexQTL software. Retained SNPs were grouped into haploblocks to increase the information content of single alleles and reduce computational power needed in downstream genetic analyses. Haploblock borders were defined by recombination locations detected in ancestral generations of cultivars and selections. Another round of inheritance-checking was conducted, for haploblock alleles (i.e., haplotypes). High-quality genotypic data sets were created using this workflow for pedigreed collections representing the U.S. breeding germplasm of apple, peach, and sweet cherry evaluated within the RosBREED project. These data sets contain 3855, 4005, and 1617 SNPs spread over 932, 103, and 196 haploblocks in apple, peach, and sweet cherry, respectively. 
The highly curated phased SNP and haplotype data sets, as well as the raw iScan data, of germplasm in the apple, peach, and sweet cherry Crop Reference Sets are available through the Genome Database for Rosaceae.
Year: 2019 PMID: 31246947 PMCID: PMC6597046 DOI: 10.1371/journal.pone.0210928
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Errors observed during the curation process and their possible causes.
Causes that should be (mostly) already resolved by the stage a researcher would start checking for specific errors are in parentheses and italicized.
| Error | Cause | Solution |
|---|---|---|
| Low call rate and impossible cluster identification | Probe binding issues | Remove SNP from data set |
| Unexpected B-allele frequencies | Unexpected ploidy | Remove sample from data set |
| | Low sample quality | Remove sample from data set |
| High number P(P)C errors | Incorrect pedigree | Adjust pedigree record |
| | Incorrect clustering | Manually determine genotype clusters |
| | Incorrect genotype call(s) not due to cluster issues | Adjust genotype call(s) or remove SNP from data set |
| Low number P(P)C errors | Incorrect clustering | Manually determine genotype clusters |
| | Incorrect genotype call(s) not due to cluster issues | Adjust genotype call(s) |
| High number double recombinations | Incorrect clustering | Manually determine genotype clusters |
| | Incorrect marker position in map | Adjust marker position or remove marker if it cannot be accurately mapped |
| | Incorrect genotype call(s) not due to cluster issues | Adjust genotype call(s) |
| | Incorrect phasing | Find responsible individual and make genotype missing |
| Low number double recombinations | Nearby double recombination* | Resolve nearby double recombination |
| | Incorrect marker position in map | Adjust marker position or remove marker if it cannot be accurately mapped |
| | Incorrect genotype call(s) not due to cluster issues | Adjust genotype call(s) |
| | Incorrect phasing | Wait for haploblock analysis to resolve issue |
| Incorrect haplotype determination | Incorrect phasing | Manually correct phasing (determine correct haplotypes) |
| | Recombination within haplotype | Adjust haploblock borders |
*Nearby double recombination can occur for two adjacent markers with many double recombinations and markers with few double recombinations. However, nearby double recombinations rarely lead to a high number of double recombinations for a single marker.
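The double-recombination errors listed in the table can be illustrated with a minimal sketch. It assumes that phasing has already assigned each marker allele of one haplotype to a grandparental origin, and it flags a marker whose origin differs from both of its informative neighbours. The function name, the `"M"`/`"P"` labels, and the single-marker definition are illustrative assumptions, not the way FlexQTL reports double recombinations.

```python
from typing import List, Optional


def find_double_recombinations(origins: List[Optional[str]]) -> List[int]:
    """Return indices of markers that look like single-marker double recombinations.

    `origins` holds, for one phased haplotype in map order, the assumed
    grandparental origin of each marker allele ("M" or "P"), or None when the
    marker is uninformative. A marker is flagged when its two nearest
    informative neighbours agree with each other but not with the marker.
    """
    informative = [i for i, origin in enumerate(origins) if origin is not None]
    hits = []
    # Slide a window of three consecutive informative markers along the map.
    for prev_i, cur_i, next_i in zip(informative, informative[1:], informative[2:]):
        flanks_agree = origins[prev_i] == origins[next_i]
        if flanks_agree and origins[cur_i] != origins[prev_i]:
            hits.append(cur_i)
    return hits


if __name__ == "__main__":
    haplotype = ["M", "M", "P", "M", None, "M", "P", "P"]
    print(find_double_recombinations(haplotype))
```

In this toy haplotype only the marker at index 2 is flagged; the switch from `"M"` to `"P"` at the end is an ordinary single recombination. Whether a flagged marker reflects a clustering, map-position, genotype-call, or phasing problem still has to be decided as described in the table.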
Fig 1. Steps of the high-resolution genotypic data curation workflow to ensure a quick and efficient curation process.
Steps that identify errors are shown in white boxes; procedures that are needed to detect, keep track of, and resolve errors but that do not identify errors directly are shown in grey boxes. After obtaining a first set of genotypic data, initial steps ensure that inheritance principles can be readily applied by removing individuals and markers that do not follow these principles and by ensuring pedigree records are correct. In the next set of steps, inheritance principles are applied at the individual marker level. In the final set of steps, these principles are applied at the haploblock level. Outputs used to detect and resolve observed errors at each step are given in italics. The leaf symbol indicates errors at the level of the individual; the intensity plot symbol indicates errors at the level of SNP scoring; the genetic map symbol indicates errors at the level of genetically linked markers and phased alleles. When applying inheritance principles in parts 2 and 3, alleles that do not occur in an individual’s parents (‘Mendelian-inconsistent errors’) are resolved first, before the remaining genotyping errors (‘Mendelian-consistent errors’) are addressed. Several procedures, such as marker call adjustments and map order adjustments, are performed throughout the steps of the workflow to resolve the errors detected. Each time these common procedures are performed, specific steps of the workflow must be repeated, forming an iterative process that ends when all errors are resolved.
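A Mendelian-inconsistent error, as defined in the caption, is an allele in an individual that neither parent could have transmitted. The sketch below shows that check for a single biallelic SNP in a diploid. The genotype coding (`"AA"`, `"AB"`, `"BB"`) and the function names are assumptions made for illustration; null alleles and polyploids are ignored, and this is not how GenomeStudio or FlexQTL implement the test.

```python
from itertools import product
from typing import Optional, Set


def normalize(genotype: str) -> str:
    """Sort the two allele letters so that "BA" and "AB" compare equal."""
    return "".join(sorted(genotype))


def possible_offspring(parent1: str, parent2: str) -> Set[str]:
    """All diploid genotypes obtainable by taking one allele from each parent."""
    return {normalize(a + b) for a, b in product(parent1, parent2)}


def is_mendelian_inconsistent(
    child: Optional[str], parent1: Optional[str], parent2: Optional[str] = None
) -> bool:
    """True when the child's call cannot be explained by the known parent(s).

    Missing calls (None) for the child or the first parent are treated as
    uninformative, not as errors.
    """
    if child is None or parent1 is None:
        return False
    if parent2 is None:
        # Parent-child pair only: the child must share at least one allele.
        return not (set(child) & set(parent1))
    return normalize(child) not in possible_offspring(parent1, parent2)


if __name__ == "__main__":
    print(is_mendelian_inconsistent("BB", "AA", "AB"))  # B cannot come from parent 1
    print(is_mendelian_inconsistent("AB", "AA", "BB"))  # expected hybrid genotype
```

A call that passes this test can still be wrong (a Mendelian-consistent error), which is why the workflow follows it with recombination-based checks on linked markers.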
Fig 2. Histograms of B-allele frequency (left) and B-allele frequency for each SNP plotted against its genomic position (right). Such histograms were used to assess a sample’s genotyping quality and ploidy. Examples shown are of a sample with good quality genotype calls (panel A), with intermediate quality of genotype calls (B), with bad quality of genotype calls (C), and that is triploid (D).
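The assessment in Fig 2 was made by visual inspection of the plots. As a rough illustration of the underlying idea only, the sketch below summarizes where a sample's intermediate B-allele frequencies fall: near 0.5, as expected for diploid heterozygotes, or near 1/3 and 2/3, as expected for a triploid. The function name, the margins, and the 0.8 cut-off are arbitrary assumptions chosen for the example and are not values from the paper.

```python
from typing import List


def summarize_baf(
    bafs: List[float],
    homo_margin: float = 0.15,   # assumed: values this close to 0 or 1 count as homozygous
    het_window: float = 0.08,    # assumed: tolerance around the expected cluster centres
    min_fraction: float = 0.8,   # assumed: share of values needed to call a pattern
) -> str:
    """Give a coarse label for one sample's B-allele frequency distribution."""
    intermediate = [b for b in bafs if homo_margin < b < 1 - homo_margin]
    if not intermediate:
        return "no heterozygous calls"
    near_half = sum(abs(b - 0.5) <= het_window for b in intermediate)
    near_thirds = sum(
        min(abs(b - 1 / 3), abs(b - 2 / 3)) <= het_window for b in intermediate
    )
    if near_half / len(intermediate) >= min_fraction:
        return "diploid-like"
    if near_thirds / len(intermediate) >= min_fraction:
        return "triploid-like"
    return "unclear or low quality"


if __name__ == "__main__":
    print(summarize_baf([0.0, 0.01, 0.5, 0.49, 0.52, 1.0, 0.99]))
    print(summarize_baf([0.0, 0.33, 0.34, 0.66, 0.68, 1.0]))
    print(summarize_baf([0.2, 0.45, 0.6, 0.8, 0.3]))
```

A summary like this could help prioritize which samples to plot first, but it does not replace looking at the histogram and the genome-position plot for each sample.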
Summary of SNP classification by ASSIsT for apple, peach, and sweet cherry.
SNP classifications are grouped in retained and discarded SNPs.
| SNP classification | Apple | Peach | Sweet Cherry |
|---|---|---|---|
| Retained SNPs | | | |
| Robust | 1434 | 743 | 373 |
| OneHomozygRare_HWE | 357 | 62 | 109 |
| OneHomozygRare_NotHWE | 366 | 188 | 161 |
| DistortedAndUnexSegreg | 1362 | 3696 | 555 |
| ShiftedHomo | 914 | 1409 | 529 |
| Discarded SNPs | | | |
| NullAllele-Failed | 52 | 145 | 43 |
| Monomorphic | 708 | 1057 | 3478 |
| Failed | 2565 | 844 | 448 |
| Total discarded | | | 3969 |
Summary of SNPs retained and discarded for apple, peach and sweet cherry during the steps of the workflow.
| Crop | SNP curation step | SNPs discarded | SNPs retained |
|---|---|---|---|
| Apple | 1b. Set of reliable SNPs with ASSIsT | 4252 | 4536 |
| | 2a. Mendelian-inconsistent error detection at SNP level | 319 | 4217 |
| | 2b. Mendelian-consistent error detection at SNP level | 329 | 3888 |
| | - Removed SNPs due to mapping issues | 15 | |
| | - Removed SNPs due to genotyping issues | 314 | |
| | 3. Error detection at haplotype level | 33 | 3855 |
| Peach | 1b. Set of reliable SNPs with ASSIsT | 2046 | 6098 |
| | 2a. Mendelian-inconsistent error detection at SNP level | 231 | 5867 |
| | 2b. Mendelian-consistent error detection at SNP level | 1862 | 4005 |
| | - Removed SNPs due to mapping issues | 156 | |
| | - Removed SNPs due to genotyping issues | 1706 | |
| | 3. Error detection at haplotype level | - | 4005 |
| Sweet cherry | 1b. Set of reliable SNPs with ASSIsT | 3969 | 1727 |
| | 2a. Mendelian-inconsistent error detection at SNP level | 47 | 1680 |
| | 2b. Mendelian-consistent error detection at SNP level | 63 | 1617 |
| | - Removed SNPs due to mapping issues | 63 | |
| | - Removed SNPs due to genotyping issues | 0 | |
| | 3. Error detection at haplotype level | - | 1617 |
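The counts in this table can be checked for internal consistency: at each step, the SNPs retained should equal the SNPs retained at the previous step minus the SNPs discarded. The short sketch below does that bookkeeping. It assigns each block of rows to a crop by matching the final retained counts to the abstract (3855, 4005, and 1617 for apple, peach, and sweet cherry), and it reads "-" as zero SNPs discarded; both are assumptions of this example.

```python
# (step, SNPs discarded, SNPs retained), copied from the table above.
# A "-" in the table is read as 0 discarded SNPs (assumption).
CURATION = {
    "apple": [("1b", 4252, 4536), ("2a", 319, 4217), ("2b", 329, 3888), ("3", 33, 3855)],
    "peach": [("1b", 2046, 6098), ("2a", 231, 5867), ("2b", 1862, 4005), ("3", 0, 4005)],
    "sweet cherry": [("1b", 3969, 1727), ("2a", 47, 1680), ("2b", 63, 1617), ("3", 0, 1617)],
}


def inconsistent_steps(rows):
    """Return the labels of steps where previous retained - discarded != retained."""
    bad = []
    for (_, _, prev_kept), (step, dropped, kept) in zip(rows, rows[1:]):
        if prev_kept - dropped != kept:
            bad.append(step)
    return bad


if __name__ == "__main__":
    for crop, rows in CURATION.items():
        entering = rows[0][1] + rows[0][2]  # SNPs evaluated at step 1b
        print(crop, entering, inconsistent_steps(rows))
```

With the values in the table, every step is consistent for all three crops. The same check is a cheap safeguard when SNP lists are edited by hand across iterations of the workflow.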
Recommended software for each step of the genetic marker data curation workflow when using Illumina Infinium SNP arrays.
| Workflow step | Recommended software |
|---|---|
| Identify polyploids, aneuploids, and samples with low quality | GenomeStudio to obtain B-allele frequencies, R to plot B-allele frequency for each sample |
| Create subset of reliable SNPs | ASSIsT |
| Identify duplicate samples | PLINK |
| Identify incorrect P(P)C relationships | GenomeStudio |
| Identify unknown P(P)C relationships | R |
| Identify unknown grandparent-grandchild relationships | Excel |
| Identify and resolve (remaining) Mendelian-inconsistent errors | GenomeStudio, FlexQTL |
| Identify and resolve Mendelian-consistent errors | Visual FlexQTL + GenomeStudio |
| Identify and correct map order inconsistencies | Visual FlexQTL |
| Identify phasing issues | FlexQTL + Visual FlexQTL |
| Haploblock border determination | Visual FlexQTL |
| Haplotype determination | |
| - Phasing | FlexQTL |
| - Haplotype assignment | PediHaplotyper |
| - Curation (automated) | FlexQTL |
* Template in Suppl. File 1 of Van de Weg and co-workers (2018) [23]
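To make the haplotype assignment step in the table concrete, the sketch below shows one simple way that phased SNP alleles could be collapsed into haploblock alleles: the alleles of each phased homolog are sliced at the haploblock borders, and every distinct allele string within a block receives a label. The input layout, the `H1`, `H2`, ... labels, and the function name are hypothetical; this is not PediHaplotyper's interface or algorithm, and missing data are not handled.

```python
from typing import Dict, List, Tuple


def assign_haploblock_alleles(
    phased: Dict[str, Tuple[str, str]],
    blocks: List[Tuple[int, int]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Label the haplotype carried by each homolog in each haploblock.

    `phased` maps an individual to its two phased homologs, each written as a
    string of SNP alleles in map order. `blocks` lists haploblock borders as
    (start, end) SNP indices with `end` exclusive.
    """
    labels_per_block = []
    for start, end in blocks:
        # Collect every distinct SNP-allele string observed within this block.
        distinct = sorted(
            {homolog[start:end] for pair in phased.values() for homolog in pair}
        )
        labels_per_block.append({seq: f"H{i}" for i, seq in enumerate(distinct, 1)})

    result = {}
    for individual, (hom1, hom2) in phased.items():
        result[individual] = [
            (labels[hom1[start:end]], labels[hom2[start:end]])
            for (start, end), labels in zip(blocks, labels_per_block)
        ]
    return result


if __name__ == "__main__":
    phased = {"ind1": ("AABAB", "ABBAA"), "ind2": ("AABAA", "ABBAB")}
    blocks = [(0, 3), (3, 5)]
    for individual, alleles in assign_haploblock_alleles(phased, blocks).items():
        print(individual, alleles)
```

Treating each block as a single multi-allelic locus is what raises the information content per allele; a recombination inside a block would appear here as a new haplotype label, which is one reason the borders may need adjusting.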