Lucie A Bergeron, Søren Besenbacher, Tychele Turner, Cyril J Versoza, Richard J Wang, Alivia Lee Price, Ellie Armstrong, Meritxell Riera, Jedidiah Carlson, Hwei-Yen Chen, Matthew W Hahn, Kelley Harris, April Snøfrid Kleppe, Elora H López-Nandam, Priya Moorjani, Susanne P Pfeifer, George P Tiley, Anne D Yoder, Guojie Zhang, Mikkel H Schierup.
Abstract
In the past decade, several studies have estimated the human per-generation germline mutation rate using large pedigrees. More recently, estimates for various nonhuman species have been published. However, methodological differences among studies in detecting germline mutations and estimating mutation rates make direct comparisons difficult. Here, we describe the many different steps involved in estimating pedigree-based mutation rates, including sampling, sequencing, mapping, variant calling, filtering, and appropriately accounting for false-positive and false-negative rates. For each step, we review the different methods and parameter choices that have been used in the recent literature. Additionally, we present the results from a 'Mutationathon,' a competition organized among five research labs to compare germline mutation rate estimates for a single pedigree of rhesus macaques. We report almost a twofold variation in the final estimated rate among groups using different post-alignment processing, calling, and filtering criteria, and detail the sources of variation across studies. Though the difference among estimates is not statistically significant, this discrepancy emphasizes the need for standardized methods in mutation rate estimation and the difficulty in comparing rates from different studies. Finally, this work aims to provide guidelines for computational and statistical benchmarks for future studies interested in identifying germline mutations from pedigrees.
Keywords: computational pipeline; evolutionary biology; genetics; genomics; mutation rate; ngs analysis; pedigree-based estimation; rhesus macaque
Year: 2022 PMID: 35018888 PMCID: PMC8830884 DOI: 10.7554/eLife.73577
Source DB: PubMed Journal: Elife ISSN: 2050-084X Impact factor: 8.140
Vertebrate species with a direct estimate of the mutation rate using a pedigree approach.
The list of species includes 10 primates, 5 nonprimate mammals, 1 bird, and 4 fish (see Supplementary file 1b for differences in study design and methodology).
| Species | Mutation rate per site per generation (× 10⁻⁸) | Number of trios | Parental age | Reference |
|---|---|---|---|---|
| Orangutan | 1.66 | 1 | | |
| Human | 1.17 | 1 (CEU) | Unspecified | |
| Chimpanzee | 1.20 | 6 | | |
| Gorilla | 1.13 | 2 | | |
| Baboon | 0.57 | 12 | | |
| Rhesus macaque | 0.58 | 14 | | |
| Green monkey | 0.94 | 3 | | |
| Owl monkey | 0.81 | 14 | | |
| Marmoset | 0.43 | 1 | ~2.80 | |
| Gray mouse lemur | 1.52 | 2 | | |
| Mouse | 0.57 | 8 | Unspecified | |
| Cattle | 1.17 | 5 | Unspecified | |
| Wolf | 0.45 | 4 | | |
| Domestic cat | 0.86 | 11 | ♂: 4.70 and ♀: 2.90 | |
| Platypus | 0.70 | 2 | Unspecified | |
| Collared flycatcher | 0.46 | 7 | Unspecified | |
| Herring | 0.20 | 12 | Unspecified | |
| Cichlid | 0.35 | 9 | Unspecified | |
Depending on the study, the parental ages are reported as average paternal age (♂), average maternal age (♀), average parental age (~), or unspecified.
Figure 1. Detection of a de novo mutation (DNM) in a trio sample (mother, father, and offspring).
Potential candidates for DNMs are sites where approximately half of the reads (indicated as gray bars) from the offspring have a variant (indicated in green) that is absent from the parental reads.
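The criterion in this caption can be sketched as a check on read counts at a single site. This is a minimal illustration, not any group's pipeline: the function name, the read-count inputs, and the 0.3 to 0.7 window used to express "approximately half of the reads" are assumptions made for the example.

```python
def is_candidate_dnm(child_ref, child_alt, mother_alt, father_alt,
                     ab_min=0.3, ab_max=0.7):
    """Return True if a site looks like a candidate de novo mutation.

    child_ref / child_alt: offspring reads supporting the reference / variant allele.
    mother_alt / father_alt: parental reads supporting the variant allele.
    ab_min / ab_max: assumed bounds on the offspring allele balance.
    """
    depth = child_ref + child_alt
    if depth == 0:
        return False  # no offspring reads, nothing to evaluate
    if mother_alt > 0 or father_alt > 0:
        return False  # variant is seen in a parent, so it may be inherited
    allele_balance = child_alt / depth
    # roughly half of the offspring reads should carry the variant
    return ab_min <= allele_balance <= ab_max


# Hypothetical read counts: 17 of 35 offspring reads carry the variant,
# and neither parent has any supporting read.
print(is_candidate_dnm(child_ref=18, child_alt=17, mother_alt=0, father_alt=0))
```

Real pipelines add further checks (depth, genotype quality, strand support) before a site is accepted as a candidate.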
Figure 2. Flow of the main steps to call de novo mutations (DNMs) from pedigree samples.
Each step lists the various choices in study design and methodology that might impact mutation rate estimates.
Figure 3. Candidate de novo mutations (DNMs) from the Mutationathon.
(a) The pedigree of three generations of rhesus macaques that was sequenced and shared with five groups of researchers. Sequencing coverage is indicated for each individual. (b) UpSet plot of the 43 candidate DNMs detected in Heineken by the five research groups (LB: Lucie Bergeron; SB: Søren Besenbacher; CV: Cyril Versoza; TT: Tychele Turner; RW: Richard Wang). The first six vertical bars are the candidates shared by at least four different groups. PCR amplification and Sanger sequencing validation showed that 33 candidates were true-positive DNMs, 6 were false-positive calls (red bars), and 4 did not successfully amplify (gray bars). See Materials and methods for details on the experiment and Figure 3—source data 2 for the results of the PCR experiment.
TP: validated as a true-positive DNM; FP: found to be a false positive. The genotypes of all individuals, as determined by the PCR validation, are presented.
For each alignment, the candidate germline mutation position is located under the black square. The last six chromatograms (surrounded by red boxes) are the false-positive candidates.
Figure 3—figure supplement 1. Mutation spectrum of the trio of rhesus macaques.
’All TPs’ correspond to all true-positive de novo mutations (DNMs) validated by the PCR experiment. The different colors correspond to the true-positive DNMs found by each pipeline (LB: Lucie Bergeron; SB: Søren Besenbacher; CV: Cyril Versoza; TT: Tychele Turner; RW: Richard Wang).
Figure 4. Estimated germline mutation rates from the Mutationathon.
(a) Number of candidate de novo mutations (DNMs) found by each group (Tychele Turner found two candidates on a sex chromosome). (b) Estimation of the denominator (i.e., the callable genome corrected by the false-negative rate [FNR]) by each group. (c) Estimated mutation rate per site per generation; the error bars correspond to the confidence intervals for binomial probabilities (calculated using the R function 'binconf').
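The quantities in panels (a) to (c) can be combined as in the sketch below. It is an illustration under stated assumptions rather than the calculation used by any of the five groups: the function name is hypothetical, the factor of 2 assumes that both haploid copies of each callable site can mutate, and the interval is a Wilson score interval, one common choice for binomial proportions. The input values are made up.

```python
import math


def mutation_rate(n_dnm, callable_sites, fnr=0.0, z=1.96):
    """Per-site, per-generation rate with an approximate 95% Wilson interval.

    n_dnm: number of DNMs retained after filtering.
    callable_sites: number of sites where a DNM could have been called.
    fnr: false-negative rate used to correct the denominator.
    z: normal quantile for the interval (1.96 for roughly 95%).
    """
    # Assumed denominator: two haploid genomes per callable site, shrunk by the FNR.
    trials = 2 * callable_sites * (1.0 - fnr)
    p = n_dnm / trials
    # Wilson score interval for a binomial proportion.
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return p, centre - half, centre + half


# Made-up inputs: 28 DNMs, 2.3 billion callable sites, 5% false-negative rate.
rate, low, high = mutation_rate(28, 2.3e9, fnr=0.05)
print(f"{rate:.2e} ({low:.2e} to {high:.2e})")
```

Because the denominator enters the rate directly, differences in how each group estimates the callable genome and the FNR change the final rate even when the candidate counts are similar.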
Site-specific and sample-specific filters used by the different groups to detect de novo mutations (DNMs) in Heineken (differences in the other steps of the pipelines are given in Table 2—source data 1).
| Research group | Candidate DNMs | Site-specific filters | Sample-specific filters | Additional filters |
|---|---|---|---|---|
| CV | 18 | GATK Best Practices | 0.5 × dpind < DP < 2 × dpind | |
| RW | 22 | QD < 2.0 | 20 < DP < 80 | Alternative allele on both strands |
| TT | 27 | Remove variants in recent repeats or in homopolymers of AAAAAAAAAA or TTTTTTTTTT | DP > 10 | Overlap of three different variant callers |
| LB | 28 | QD < 2.0 | 0.5 × dpind < DP < 2 × dpind | Manual curation (six candidates removed) |
| SB | 32 | FS > 30.0 | 10 < DP < 2 × dpind | Alternative allele on both strands; lowQ AD2 > 1 |
LB: Lucie Bergeron; SB: Søren Besenbacher; CV: Cyril Versoza; TT: Tychele Turner; RW: Richard Wang.
Figure 5. The impact of individual filters on the estimated rate of a trio of rhesus macaques.
The default filters used by the Lucie Bergeron (LB) pipeline were DP < 0.5 × depth_individual; DP > 2 × depth_individual; GQ < 60; AB < 0.3; AB > 0.7; no AD filter.
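The default thresholds listed in this caption can be written as a single predicate. The sketch below assumes that each listed expression describes a condition under which a genotype is discarded, and that the allele balance (AB) is only evaluated where it is defined; the function and parameter names are hypothetical and this is not the LB pipeline itself.

```python
def passes_default_filters(dp, gq, ab, mean_depth,
                           min_gq=60, ab_min=0.3, ab_max=0.7):
    """Return True if a genotype survives the depth, GQ, and AB thresholds.

    dp: read depth at the site for this individual.
    gq: genotype quality.
    ab: allele balance (fraction of reads with the alternative allele),
        or None when it does not apply (e.g., a homozygous-reference parent).
    mean_depth: average sequencing depth of this individual.
    """
    # Depth must lie between half and twice the individual's average depth.
    if dp < 0.5 * mean_depth or dp > 2 * mean_depth:
        return False
    # Low-confidence genotypes are discarded.
    if gq < min_gq:
        return False
    # Skewed allele balance suggests a sequencing or mapping artifact.
    if ab is not None and (ab < ab_min or ab > ab_max):
        return False
    return True


# Hypothetical offspring genotype at 40x average depth.
print(passes_default_filters(dp=38, gq=99, ab=0.47, mean_depth=40))
```

Changing one threshold at a time while holding the others fixed is the kind of comparison the figure summarizes.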
Information that should ideally be reported when presenting results on de novo mutations (DNMs).
See Table 2—source data 1 for an example of this table filled out for the five pipelines used to analyze the trio of rhesus macaques.
| Step of the analysis | Information to report |
|---|---|
| 1. Sampling and sequencing | Type of sample (tissue, etc.) |
| | Storage duration, buffer, temperature |
| | Type of library preparation |
| | Average sequencing coverage |
| | Sequencing technology and read lengths |
| 2. Alignment and post-alignment processing | Trimming of adaptors and low-quality reads |
| | Reference assembly version |
| | Autosomes only or whole genome? |
| | Mapping software and version |
| | Duplicate removal software and version |
| | Base quality score recalibration (yes/no) |
| | If yes, which type of data used as known variants? |
| | Realignments around indels? |
| | Other filters? |
| 3. Variant calling | Software and version |
| | Mode: joint genotyping? GVCF blocks? GVCF in base-pair resolution? |
| 4. Detecting DNMs | Site filters on .vcf files and justification |
| | Individual filters, threshold, and remaining candidates after each filter |
| | False discovery rate estimation method: PCR validation? Manual curation? Transmission rate deviation? Removal of low-complexity regions, cluster mutations, or recurrent mutations? |
| 5. Mutation rate estimation | Callable genome estimation method: File used? Filters taken into account? |
| | False-negative rate estimation method: simulation? Filters? Probability? |