Cory D Dunn1. 1. Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland.
Abstract
Next-generation sequencing can quickly reveal genetic variation potentially linked to heritable disease. As databases encompassing human variation continue to expand, rare variants have been of high interest, since the frequency of a variant is expected to be low if the genetic change leads to a loss of fitness or fecundity. However, the use of variant frequency when seeking genomic changes linked to disease remains very challenging. Here, I explored the role of selection in controlling human variant frequency using the HelixMT database, which encompasses hundreds of thousands of mitochondrial DNA (mtDNA) samples. I found that a substantial number of synonymous substitutions, which have no effect on protein sequence, were never encountered in this large study, while many other synonymous changes are found at very low frequencies. Further analyses of human and mammalian mtDNA datasets indicate that the population frequency of synonymous variants is predominantly determined by mutational biases rather than by strong selection acting upon nucleotide choice. My work has important implications that extend to the interpretation of variant frequency for non-synonymous substitutions.
Next-generation sequencing can quickly reveal genetic variation potentially linked to heritable disease. As databases encompassing human variation continue to expand, rare variants have been of high interest, since the frequency of a variant is expected to be low if the genetic change leads to a loss of fitness or fecundity. However, the use of variant frequency when seeking genomic changes linked to disease remains very challenging. Here, I explored the role of selection in controlling human variant frequency using the HelixMT database, which encompasses hundreds of thousands of mitochondrial DNA (mtDNA) samples. I found that a substantial number of synonymous substitutions, which have no effect on protein sequence, were never encountered in this large study, while many other synonymous changes are found at very low frequencies. Further analyses of human and mammalian mtDNA datasets indicate that the population frequency of synonymous variants is predominantly determined by mutational biases rather than by strong selection acting upon nucleotide choice. My work has important implications that extend to the interpretation of variant frequency for non-synonymous substitutions.
In this era of genomic medicine, next-generation sequence data obtained from patients, families, and populations are used to reveal and predict which genes and variants may be linked to disease (Claussnitzer et al., 2020; Shendure et al., 2019). Researchers have often embraced rare variants when seeking genomic changes in protein-coding sequences that may be pathogenic, as a reduction in variant frequency is an expected outcome of selection (Bomba et al., 2017; Gibson, 2012; Sazonovs and Barrett, 2018; Zuk et al., 2014). However, while triage of rare variants has led to some success in illuminating genes linked to heritable disease (Fuchsberger et al., 2016; Genovese et al., 2016; Lencz et al., 2021; Luo et al., 2017), the interpretation and utilization of rare genomic changes remains very challenging (Macklin et al., 2018; Manrai et al., 2016; Uricchio et al., 2016).Human mitochondrial DNA (mtDNA) encodes proteins and RNAs required for the process of oxidative phosphorylation, and mitochondrial mutations are linked to a number of metabolic diseases (Gorman et al., 2016; Thompson et al., 2020). Recently, the mtDNAs of nearly 200,000 individuals were sequenced in order to produce the HelixMT database (HelixMTdb), a large catalog of human mtDNA variation (Bolze et al., 2019 preprint). Here, I found that many synonymous nucleotide substitutions were never detected within this quite substantial survey of human mtDNA. Subsequent study of more than one thousand mammalian mtDNAs suggested that selection on synonymous base substitution in mitochondrial protein-coding genes is minimal and unlikely to explain the absence or rarity of many synonymous changes within the human population. Rather, the mutational propensities of mtDNA (Kumar, 1996; Reyes et al., 1998) are more likely to have a dominant influence upon variant frequency. My findings have general implications for the interpretation of variant frequencies when studying heritable disease.
RESULTS
During an exploration of selective pressures that may act upon mitochondria-encoded polypeptides, I simulated every possible nucleotide substitution from the human reference mtDNA sequence within all protein-coding sequences, then cross-referenced these nucleotide and potential amino acid changes with the nucleotide substitutions tabulated in HelixMTdb. Consistent with selection against the vast majority of amino acid changes, non-synonymous substitutions were depleted to a far greater extent than synonymous substitutions when considering either the number of samples harboring variants of each type (Fig. 1A) or whether the nucleotide substitution was encountered at all during compilation of the HelixMTdb (Fig. 1B).
Fig. 1.
Many synonymous variants are never encountered within the HelixMT database. (A) Non-synonymous and synonymous variants differ in their population frequency. The sample count for variants within coding regions that are reachable by a single substitution from the human mtDNA reference sequence were obtained from HelixMTdb and plotted (Kolmogorov–Smirnov approximate P-value, <0.0001). Non-synonymous and synonymous substitutions never encountered are reflected as smaller dots along the x-axis at the zero position on the y-axis. (B) While the great majority of non-synonymous substitutions were never encountered during the HelixMTdb study, a substantial fraction of synonymous substitutions are also apparently lacking from the human population.
Many synonymous variants are never encountered within the HelixMT database. (A) Non-synonymous and synonymous variants differ in their population frequency. The sample count for variants within coding regions that are reachable by a single substitution from the human mtDNA reference sequence were obtained from HelixMTdb and plotted (Kolmogorov–Smirnov approximate P-value, <0.0001). Non-synonymous and synonymous substitutions never encountered are reflected as smaller dots along the x-axis at the zero position on the y-axis. (B) While the great majority of non-synonymous substitutions were never encountered during the HelixMTdb study, a substantial fraction of synonymous substitutions are also apparently lacking from the human population.Since synonymous changes would not be expected to change the structure or function of mitochondria-encoded proteins, the number of changes (2925, or ∼35% of potential synonymous substitutions) for which a synonymous substitution was never encountered (abbreviated here as ‘SSNEs’) during the HelixMTdb investigation was considered noteworthy. While HelixMTdb contained samples from almost all human haplogroups (Bolze et al., 2019 preprint), and while closely related individuals were removed from the study, more than 90% of samples were classified within the ‘N’ macro-haplogroup, a lineage associated with the exit of modern humans from Africa (Ingman et al., 2000; Maca-Meyer et al., 2001). Consequently, it was possible that SSNE abundance was linked to limited sample diversity. Therefore, in an attempt to assess HelixMTdb sampling biases, as well as to explore this dataset more deeply, I examined in greater detail the presence or absence of HelixMTdb samples harboring each synonymous change to a third codon position (abbreviated as a ‘P3’). When considering those amino acids associated only with twofold P3 degeneracy, or the ability to accept only two different bases at P3 without changing the protein sequence, samples diverging from the human reference sequence were present in HelixMTdb for nearly all (>97%) analyzed positions (Fig. 2A). This result indicates that the HelixMTdb does indeed cover a substantial amount of human mtDNA sequence diversity. Furthermore, our analysis demonstrates that SSNEs are unlikely to be associated with twofold degenerate sites and that codon choice at twofold degenerate sites governed by transitions (a purine–purine change or a pyrimidine–pyrimidine change) is unlikely to be under strong selection in humans. Since transitions at twofold degenerate sites cause changes in local GC content, as well as slight alterations in global nucleotide content, our results also argue against substantial selection upon human mtDNA that might be based upon these factors.
Fig. 2.
Transversion substitutions from the reference sequence are heavily depleted at fourfold degenerate third-codon positions of human mitochondrial coding sequences. (A) Nearly all synonymous changes from the human reference sequence can be identified within HelixMTdb at amino acid positions characterized by twofold degenerate P3s. (B) Nearly all amino acid positions characterized by fourfold degeneracy at P3 could be associated with at least one base substitution in the HelixMTdb dataset. (C) Synonymous mutations never encountered in the HelixMTdb study are most abundant at amino acids for which at least one codon is fourfold degenerate at P3. (D) The vast majority of synonymous substitutions never encountered by the HelixMTdb study are transversions. (E) For those substitutions that are encountered in HelixMTdb, population prevalence is linked to substitution type. For each transition or transversion found in the HelixMTdb at an amino acid for which at least one codon is fourfold degenerate at P3, the HelixMTdb population count is plotted. Bar and error bars represent median of population count and interquartile range, respectively. For all comparisons of variant frequencies for a given amino acid (transition versus transversion from reference), Kolmogorov–Smirnov approximate P-values are <0.0001.
Transversion substitutions from the reference sequence are heavily depleted at fourfold degenerate third-codon positions of human mitochondrial coding sequences. (A) Nearly all synonymous changes from the human reference sequence can be identified within HelixMTdb at amino acid positions characterized by twofold degenerate P3s. (B) Nearly all amino acid positions characterized by fourfold degeneracy at P3 could be associated with at least one base substitution in the HelixMTdb dataset. (C) Synonymous mutations never encountered in the HelixMTdb study are most abundant at amino acids for which at least one codon is fourfold degenerate at P3. (D) The vast majority of synonymous substitutions never encountered by the HelixMTdb study are transversions. (E) For those substitutions that are encountered in HelixMTdb, population prevalence is linked to substitution type. For each transition or transversion found in the HelixMTdb at an amino acid for which at least one codon is fourfold degenerate at P3, the HelixMTdb population count is plotted. Bar and error bars represent median of population count and interquartile range, respectively. For all comparisons of variant frequencies for a given amino acid (transition versus transversion from reference), Kolmogorov–Smirnov approximate P-values are <0.0001.Next, I examined variation at amino acids associated only with fourfold P3 degeneracy, or the ability to accept all nucleotides at P3 without altering the protein sequence. Nearly all analyzed fourfold degenerate positions (99%) harbored at least one sample with a base differing from that found in the human reference (Fig. 2B), again suggesting that HelixMTdb encompasses a sizable amount of human mtDNA diversity. However, I encountered apparent limitations on base variability at some of these fourfold degenerate P3s, as all four base possibilities could be identified within HelixMTdb at only 10.4% of these mtDNA locations.I then tested whether SSNEs might be predominantly associated with specific amino acids. I found that for the amino acids arginine, threonine, alanine, valine, serine, glycine, proline, and leucine, a more substantial fraction of synonymous changes were never seen in the HelixMTdb relative to the other amino acids (Fig. 2C). All of these amino acids can be associated with vertebrate mtDNA codons of fourfold degeneracy.While twofold degenerate P3s allow only transitions without altering the amino acid, fourfold degenerate P3s also permit transversions (a purine-pyrimidine change, or vice versa) that leave the protein code unchanged. Since amino acids with fourfold degenerate P3s were characterized by a higher number of SSNEs, I tested whether SSNEs might be more closely associated with transversions or transitions. I found that nearly all SSNEs assigned to these eight amino acids (>96%) were linked to a potential transversion (Fig. 2D). Expanding my analysis to synonymous substitutions encountered at least once in HelixMTdb, the population frequency of a variant was also clearly linked to whether the nucleotide change at fourfold degenerate P3s was a transition or a transversion (Fig. 2E).Two conspicuous and non-exclusive hypotheses exist regarding the notable enrichment of transversions among SSNEs at fourfold degenerate P3s. First, there may be unanticipated, yet substantial selection that acts upon P3s and leads to depletion of even synonymous transversions from the human population. Second, mutational biases related to mtDNA replication and maintenance may make transversions at degenerate P3s far less likely than transitions (Aquadro and Greenberg, 1983; Belle et al., 2005; Brown and Simpson, 1982; Kennedy et al., 2013; Kumar, 1996; Tamura and Nei, 1993; Vermulst et al., 2007; Wakeley, 1996; Zaidi et al., 2019). To address the first possibility, I further examined the extent of selection on P3s among mammals by examining the nucleotide frequencies at approximately 5 million P3s across the coding sequences of 1317 mammalian mtDNAs. Here, I also took into account the two different mitochondrial tRNAs recognizing leucine codons and the two mitochondrial tRNAs recognizing serine codons. As encountered in previous studies (Kumar, 1996; Reyes et al., 1998), guanine was depleted from mitochondrial P3s (the vast majority of which are encoded by the L-strand) for which the presence of any purine does not lead to an amino acid change (Fig. 3A), while adenine dominated at those positions. Cytosine and thymine were both well-represented at P3s for which any pyrimidine is permitted without altering the encoded amino acid. However, even considering the relative depletion of guanine from all fourfold degenerate P3s and twofold degenerate purine P3s, guanine was nonetheless detected at thousands of P3 positions (Fig. 3B). Consequently, nucleotide frequencies at P3s appear unlikely to reflect generalized codon limitations inherent to the process of mitochondrial translation.
Fig. 3.
Abundant substitution and degeneracy across the third codon positions of mammals. (A) Base occupancy is not equally distributed among nucleotides at degenerate P3s (n≥33,073 instances of each amino acid are analyzed across the set of input mammalian mtDNAs). (B) Guanine is not excluded from recognition by any tRNA presumed to allow silent purine substitution at P3. (C) Nearly all I-P3s can accept all of the possible synonymous substitutions available for a particular amino acid. (D) Substitution is common at nearly all mammalian I-P3 positions, although the TSS distributions of I-P3s associated with amino acids can differ within each degeneracy class (fourfold degenerate, twofold degenerate purine, or twofold degenerate pyrimidine). Kolmogorov–Smirnov approximate P-values corrected for multiple comparisons are shown (red, ≤0.001; orange, ≤0.01; green, ≤0.05; no line,>0.05). Bar and error bars represent median with 95% confidence interval. (E) Population frequency of synonymous variants is unlinked to TSS at fourfold degenerate I-P3s. Statistical significance is represented as in D, and the bar and error bars represent median and 95% confidence intervals.
Abundant substitution and degeneracy across the third codon positions of mammals. (A) Base occupancy is not equally distributed among nucleotides at degenerate P3s (n≥33,073 instances of each amino acid are analyzed across the set of input mammalian mtDNAs). (B) Guanine is not excluded from recognition by any tRNA presumed to allow silent purine substitution at P3. (C) Nearly all I-P3s can accept all of the possible synonymous substitutions available for a particular amino acid. (D) Substitution is common at nearly all mammalian I-P3 positions, although the TSS distributions of I-P3s associated with amino acids can differ within each degeneracy class (fourfold degenerate, twofold degenerate purine, or twofold degenerate pyrimidine). Kolmogorov–Smirnov approximate P-values corrected for multiple comparisons are shown (red, ≤0.001; orange, ≤0.01; green, ≤0.05; no line,>0.05). Bar and error bars represent median with 95% confidence interval. (E) Population frequency of synonymous variants is unlinked to TSS at fourfold degenerate I-P3s. Statistical significance is represented as in D, and the bar and error bars represent median and 95% confidence intervals.While degeneracy at the third codon position is a general feature of mtDNA-encoded amino acids, these results are not necessarily informative about the possibility that strong selection acts upon specific P3s encoded by the mitochondrial genome. To further explore the extent to which individual P3s might be under selection, I focused my attention upon codons for which the first and second positions, and therefore the encoded amino acids, are 100% identical in an alignment consisting of 1251 mammals and an outlier reptile sequence (Iguana iguana) used to root an inferred phylogenetic tree. Leucine codons could not be included in this analysis, as degeneracy at the first codon position always led to substitution between codons recognized by the L1 and L2 tRNAs at mammalian protein alignment sites harboring only leucine. I found that 560/561 (99.8%) of the resulting set of P3s (‘I-P3s’, indicating identity of codon positions one and two throughout the alignment) can be inhabited by any nucleotide permitting synonymous substitution (Fig. 3C). Only the I-P3 of the codon annotated in humans as encoding the COX3 starting methionine appears to be totally constrained with respect to nucleotide choice, as this P3 is always occupied by guanine in mammals. Interestingly, COX3 is reported to be unique among mitochondrial polypeptides for its lack of an amino-terminal formyl-methionine (Walker et al., 2009), although whether there is a mechanistic relationship between these two observations remains to be determined. Nearly complete degeneracy among mammalian I-P3s again argues against strong selection upon synonymous nucleotide substitution within mammals based upon codon preference and the thermodynamics of base pairing, while also arguing against selection due to the presence of conserved protein binding sites.I then calculated for each I-P3 the total substitution score (TSS), a sum of substitutions occurring at a specific site throughout an inferred mammalian phylogenetic tree (Akpınar et al., 2021 preprint). I found, with few exceptions, that nearly all fourfold degenerate, twofold degenerate pyrimidine, and twofold degenerate purine I-P3s have been subject to base substitution tens, or even hundreds of times, during approximately 200 million years of mammal evolution (Fig. 3D), a result quite consistent with minimal selection upon nucleotide choice at mitochondrial P3s. However, here I did note statistically significant divergence between TSS distributions at I-P3s when comparing amino acids within a given degeneracy class. These results suggest, when also considering the findings described above, that while synonymous base substitutions are not under strong selective pressure, weak selection may have helped to shape some mitochondrial third codon positions throughout the course of mammalian evolution.Next, I asked whether a low frequency of human variants at fourfold degenerate I-P3s would correspond with lower mammal-derived TSS values of corresponding positions, a potential indicator of selection extending across mammals to humans. Here, I placed variation occurring at I-P3s into the classes ‘absent’ (zero counts among 195983 HelixMTdb samples), ‘ultra-rare’ (variant frequency <0.01%), ‘rare’ (variant frequency <1% and ≥0.01%), ‘low-frequency’ (variant frequency ≥1% and Finally, I explored how the tabulation of heteroplasmic samples within the HelixMTdb might be informative regarding potential selection on synonymous and non-synonymous substitutions. Most human cells harbor hundreds of mtDNA molecules. Repeated encounters with a variant in a heteroplasmic state, where the variant is not found in all of the sequenced mtDNA molecules, is often considered to be a signal of pathogenicity, as homoplasmy of a deleterious variant is expected to lead to a fitness defect (Ye et al., 2014). However, if a synonymous change to mtDNA is neutral, whether a synonymous variant is encountered as heteroplasmic or homoplasmic should be a function of drift and the number of cell divisions since the initial appearance of the novel mutation in a population (Chinnery et al., 2000; Schaack et al., 2020; Stewart and Chinnery, 2015). I plotted the frequency of heteroplasmy calls against the number of samples harboring the selected variant for those substitutions encountered in at least ten HelixMTdb samples. Synonymous variants decreased in the frequency at which they were labelled as heteroplasmic as the population frequency of encounters increased, consistent either with neutral drift toward homoplasmy or with selection (Fig. 4). However, a trend toward a higher frequency of heteroplasmy calls for non-synonymous variants than for synonymous variants was easily visualized at lower population frequencies, again consistent with a general lack of selection on human synonymous variation.
Fig. 4.
Reduced selection on synonymous substitution compared to non-synonymous substitution is indicated by an analysis of variant heteroplasmy. Only variants represented by at least ten samples in HelixMTdb are plotted. The grey curve imposed upon the figure highlights a notable divergence in heteroplasmic sample fractions at low variant frequency that becomes apparent when comparing synonymous and non-synonymous variants.
Reduced selection on synonymous substitution compared to non-synonymous substitution is indicated by an analysis of variant heteroplasmy. Only variants represented by at least ten samples in HelixMTdb are plotted. The grey curve imposed upon the figure highlights a notable divergence in heteroplasmic sample fractions at low variant frequency that becomes apparent when comparing synonymous and non-synonymous variants.
DISCUSSION
Taken together, my findings indicate that human variation at mtDNA-encoded P3s is mostly constrained by the substitution rates of each nucleotide. More specifically, variation appears highly restricted by the reduced likelihood of transversion relative to transition in mtDNA, a phenomenon well documented in earlier studies of humans and other mammals (Aquadro and Greenberg, 1983; Belle et al., 2005; Brown and Simpson, 1982; Kennedy et al., 2013; Kumar, 1996; Tamura and Nei, 1993; Vermulst et al., 2007; Wakeley, 1996; Zaidi et al., 2019). Any role for selection at synonymous sites is relatively minor, with strength of selection potentially dependent upon the specific amino acid under analysis. The low selection on codon choice encountered during my study is quite consistent with earlier, more limited analyses of mitochondrial codon choice (Castellana et al., 2011; Faith and Pollock, 2003; Jia and Higgs, 2008; Uddin and Chakraborty, 2017) and with the highly streamlined tRNA set used during mitochondrial protein synthesis.The strong link that I have revealed between mutational biases and human variant frequencies in the large HelixMTdb dataset highlights the difficulties in assigning potential pathogenicity to non-synonymous variants based, in part, upon the variant frequency in the population (Bomba et al., 2017; Gibson, 2012; McInnes et al., 2021; Povysil et al., 2019; Zuk et al., 2014). Accordingly, attempts to link rare and de novo variation to disease are likely to be most successful when the mutational biases for each nucleotide can be estimated and properly taken into account.
MATERIALS AND METHODS
Calculation of protein changes caused by single nucleotide substitutions from the human reference sequence
The human reference mtDNA sequence and accompanying annotation (accession NC_012920.1) was downloaded from GenBank and used as input for the script ‘amino_acid_changes_caused_by_mtDNA_nucleotide_changes_in_reference.py’. This script generates a file reporting the protein sequence change associated with each simulated single base substitution to each human mitochondrial protein coding gene.
Analysis of merged human mtDNA variation
The HelixMTdb dataset (Bolze et al., 2019 preprint) was downloaded on August 26, 2021 (https://www.helix.com/pages/mitochondrial-variant-database). The script ‘shape_HelixMTdb.py’ takes as input the HelixMTdb dataset and the output of ‘amino_acid_changes_caused_by_mtDNA_nucleotide_changes_in_reference.py’, then links HelixMTdb variation to simulated amino acid changes. A subsequent analysis of transitions and transversions from the reference sequence was performed by running the script ‘transitions_transversions.py’ on the output of ‘shape_HelixMTdb.py’.To check the diversity of mtDNA sequences tabulated within HelixMTdb, and to further explore this dataset, I used the script ‘check_HelixMTdb_diversity.py’, which tests how many silent P3 changes from the reference sequence can be identified for each amino acid position, further classified by twofold or fourfold P3 degeneracy. To simplify this analysis, leucine and serine codons were excluded from this analysis, since these amino acids are associated with tRNAs of both twofold and fourfold P3 degeneracy.
Detection of third codon position selection by analysis of mammalian mtDNAs
Records for mammalian reference mtDNAs were obtained using the Organelle Genome Resources provided by the National Center for Biotechnology Information Reference Sequence project (NCBI RefSeq, Release 207, https://www.ncbi.nlm.nih.gov/genome/organelle/) (O'Leary et al., 2016). All accessions not containing ‘NC_’ at the beginning of their accession name were removed, and this list of accessions was used to download full GenBank records [‘mammalian_mtDNA_NC_AUG_26_2021_noroot.gb’] using the NCBI Batch Entrez server (https://www.ncbi.nlm.nih.gov/sites/batchentrez). These GenBank records were analyzed by the script ‘third_codon_position_selections_all_mammals.py’ to determine total counts and frequencies of P3 bases for each amino acid. Records not containing all of the following coding sequence annotations were discarded: ‘ND1’, ‘ND2’, ‘COX1’, ‘COX2’, ‘ATP8’, ‘ATP6’, ‘COX3’, ‘ND3’, ‘ND4L’, ‘ND4’, ‘ND5’, ‘ND6’, ‘CYTB’.To calculate TSSs (Akpınar et al., 2021 preprint) for I-P3 positions, the GenBank record for I. iguana (NC_002793.1) was added to the set of mammalian GenBank records [‘mammalian_mtDNA_NC_AUG_26_2021_Iguana_root.gb’]. The script ‘I-P3_Part_1.py’ was used to extract and align sequences from the resulting input GenBank file. Again, accessions without all of the following coding sequence annotations were discarded by this script: ‘ND1’, ‘ND2’, ‘COX1’, ‘COX2’, ‘ATP8’, ‘ATP6’, ‘COX3’, ‘ND3’, ‘ND4L’, ‘ND4’, ‘ND5’, ‘ND6’, ‘CYTB’. Accessions with any coding sequence found duplicated in another accession were also discarded. This script calls upon MAFFT v7.487 (Katoh and Standley, 2013) to align mtDNA-derived coding sequences using the FFT-NS-2 algorithm. A concatenated alignment of coding sequences output from this script was used to infer a maximum likelihood tree in RAxML-NG v1.0.3 (Kozlov et al., 2019) using a single partition, a GTR+FO+G4 m model of DNA change, and a seed of 777. Ten random and ten parsimony-based starting trees were used to initiate tree construction, and the average relative Robinson-Foulds distance (Robinson and Foulds, 1981) for the inferred trees was 0.01. 600 bootstrap replicates were generated using RAxML-NG v1.0.3, and a weighted Robinson-Foulds distance converged below a 1% cutoff value (seed of 2000). Felsenstein's Bootstrap Proportions (Felsenstein, 1985) [‘FBP_AUG_26_2021.raxml.support] and the Transfer Bootstrap Expectations (Lemoine et al., 2018) [‘TBE_AUG_26_2021.raxml.support’] were calculated and used to label the best scoring tree [‘T3_AUG_26_P3_REV.raxml.bestTree’]. This tree was used for downstream analyses after using FigTree 1.4.4 (https://github.com/rambaut/figtree/releases) to place the root upon the branch leading to I. iguana [‘AUG_26_P3_REV_tree_rooted_Iguana_iguana.nwk’].This rooted tree and the coding sequence alignments for each mitochondria-encoded protein were used to determine the TSSs associated with each class of P3 and to determine which bases can occupy mammalian I-P3sites. Most fourfold degenerate P3s within codons with identical first and second positions were analyzed using the ‘I-P3_Part_2_AGPRTV.py’ script, most twofold degenerate purine P3s were analyzed with ‘I-P3_Part_2_EKMQW.py’, and most twofold degenerate pyrimidine P3s were analyzed with ‘I-P3_Part_2_CDFHINY.py’. Any amino acids positions encoded by tRNA L1 in all mammals and the outgroup, and therefore harboring T and T at the first and second codon positions in all samples, were sought by script ‘I-P3_Part_2_L1.py’. Similarly, amino acids positions encoded by tRNA L2 in all mammals and the outgroup, and therefore harboring C and T at the first and second codon positions in all samples, would have been identified and analyzed by script ‘I-P3_Part_2_L2.py’. The twofold degenerate P3s associated with the use of tRNA S2 were analyzed by script ‘I-P3_Part_2_S2.py’, and the fourfold degenerate P3 associated with serines encoded by tRNA S1 were analyzed by script ‘I-P3_Part_2_S1.py’. Within each of these scripts, MAFFT v7.487 was used to perform alignments using the FFT-NS-2 algorithm, script ‘ungap_on_reference.py’ v1.0 (Dunn, 2021) was used to ungap alignments based on the human reference sequence, ancestral character predictions were made at internal nodes using RAxML-NG v1.0.3 (Kozlov et al., 2019), and seqkit v0.16.1 (Shen et al., 2016) was used while formatting the node names associated with ancestral sequences.
Comparison of total substitution scores to HelixMTdb dataset
Output of the above-mentioned scripts was combined into new tables and further annotated based upon whether the P3 data were obtained from twofold degenerate purine sites [‘two_fold_AG_P3_ALL.csv’], twofold degenerate pyrimidine sites [‘two_fold_CT_P3_ALL.csv’], or fourfold degenerate sites [‘four_fold_P3_ALL.csv’]. Further processing to quantify any relationship between TSS and general P3 type, TSS and amino acid, the link between (non-)synonymous mutation and sample counts in HelixMTdb was performed using the script ‘I-P3_Part_3.py’.
Statistical analyses
The Kolmogorov–Smirnov test, a non-parametric approach testing the hypothesis that two samples are drawn from different populations, was used to compare the total variant counts of synonymous and non-synonymous substitutions, to contrast total transversion versus transition counts for specific amino acids, and to compare TSS distributions between amino acids or frequency classes. Kolmogorov–Smirnov analyses were performed in Prism 9.2.0 or SciPy v1.7.1 (Virtanen et al., 2020). When analyzing I-P3 TSS values among different frequency classes, TSS distributions are arranged by amino acid, which are then compared by Kolmogorov–Smirnov testing only within a given category (fourfold degenerate, twofold degenerate purine, twofold degenerate pyrimidine). Moreover, the ‘common’ frequency class was not subject to statistical testing due to the limited number of variants found within this category. Correction for multiple testing (Bonferroni correction) was accomplished by multiplying each single test P-value by the number of tests.
Authors: Marc Vermulst; Jason H Bielas; Gregory C Kujoth; Warren C Ladiges; Peter S Rabinovitch; Tomas A Prolla; Lawrence A Loeb Journal: Nat Genet Date: 2007-03-04 Impact factor: 38.330
Authors: Giulio Genovese; Menachem Fromer; Eli A Stahl; Douglas M Ruderfer; Kimberly Chambert; Mikael Landén; Jennifer L Moran; Shaun M Purcell; Pamela Sklar; Patrick F Sullivan; Christina M Hultman; Steven A McCarroll Journal: Nat Neurosci Date: 2016-10-03 Impact factor: 24.884
Authors: Yang Luo; Katrina M de Lange; Luke Jostins; Loukas Moutsianas; Joshua Randall; Nicholas A Kennedy; Christopher A Lamb; Shane McCarthy; Tariq Ahmad; Cathryn Edwards; Eva Goncalves Serra; Ailsa Hart; Chris Hawkey; John C Mansfield; Craig Mowat; William G Newman; Sam Nichols; Martin Pollard; Jack Satsangi; Alison Simmons; Mark Tremelling; Holm Uhlig; David C Wilson; James C Lee; Natalie J Prescott; Charlie W Lees; Christopher G Mathew; Miles Parkes; Jeffrey C Barrett; Carl A Anderson Journal: Nat Genet Date: 2017-01-09 Impact factor: 41.307
Authors: Christian Fuchsberger; Jason Flannick; Tanya M Teslovich; Anubha Mahajan; Vineeta Agarwala; Kyle J Gaulton; Clement Ma; Pierre Fontanillas; Loukas Moutsianas; Davis J McCarthy; Manuel A Rivas; John R B Perry; Xueling Sim; Thomas W Blackwell; Neil R Robertson; N William Rayner; Pablo Cingolani; Adam E Locke; Juan Fernandez Tajes; Heather M Highland; Josee Dupuis; Peter S Chines; Cecilia M Lindgren; Christopher Hartl; Anne U Jackson; Han Chen; Jeroen R Huyghe; Martijn van de Bunt; Richard D Pearson; Ashish Kumar; Martina Müller-Nurasyid; Niels Grarup; Heather M Stringham; Eric R Gamazon; Jaehoon Lee; Yuhui Chen; Robert A Scott; Jennifer E Below; Peng Chen; Jinyan Huang; Min Jin Go; Michael L Stitzel; Dorota Pasko; Stephen C J Parker; Tibor V Varga; Todd Green; Nicola L Beer; Aaron G Day-Williams; Teresa Ferreira; Tasha Fingerlin; Momoko Horikoshi; Cheng Hu; Iksoo Huh; Mohammad Kamran Ikram; Bong-Jo Kim; Yongkang Kim; Young Jin Kim; Min-Seok Kwon; Juyoung Lee; Selyeong Lee; Keng-Han Lin; Taylor J Maxwell; Yoshihiko Nagai; Xu Wang; Ryan P Welch; Joon Yoon; Weihua Zhang; Nir Barzilai; Benjamin F Voight; Bok-Ghee Han; Christopher P Jenkinson; Teemu Kuulasmaa; Johanna Kuusisto; Alisa Manning; Maggie C Y Ng; Nicholette D Palmer; Beverley Balkau; Alena Stančáková; Hanna E Abboud; Heiner Boeing; Vilmantas Giedraitis; Dorairaj Prabhakaran; Omri Gottesman; James Scott; Jason Carey; Phoenix Kwan; George Grant; Joshua D Smith; Benjamin M Neale; Shaun Purcell; Adam S Butterworth; Joanna M M Howson; Heung Man Lee; Yingchang Lu; Soo-Heon Kwak; Wei Zhao; John Danesh; Vincent K L Lam; Kyong Soo Park; Danish Saleheen; Wing Yee So; Claudia H T Tam; Uzma Afzal; David Aguilar; Rector Arya; Tin Aung; Edmund Chan; Carmen Navarro; Ching-Yu Cheng; Domenico Palli; Adolfo Correa; Joanne E Curran; Denis Rybin; Vidya S Farook; Sharon P Fowler; Barry I Freedman; Michael Griswold; Daniel Esten Hale; Pamela J Hicks; Chiea-Chuen Khor; Satish Kumar; Benjamin Lehne; Dorothée Thuillier; Wei Yen Lim; Jianjun Liu; Yvonne T van der Schouw; Marie Loh; Solomon K Musani; Sobha Puppala; William R Scott; Loïc Yengo; Sian-Tsung Tan; Herman A Taylor; Farook Thameem; Gregory Wilson; Tien Yin Wong; Pål Rasmus Njølstad; Jonathan C Levy; Massimo Mangino; Lori L Bonnycastle; Thomas Schwarzmayr; João Fadista; Gabriela L Surdulescu; Christian Herder; Christopher J Groves; Thomas Wieland; Jette Bork-Jensen; Ivan Brandslund; Cramer Christensen; Heikki A Koistinen; Alex S F Doney; Leena Kinnunen; Tõnu Esko; Andrew J Farmer; Liisa Hakaste; Dylan Hodgkiss; Jasmina Kravic; Valeriya Lyssenko; Mette Hollensted; Marit E Jørgensen; Torben Jørgensen; Claes Ladenvall; Johanne Marie Justesen; Annemari Käräjämäki; Jennifer Kriebel; Wolfgang Rathmann; Lars Lannfelt; Torsten Lauritzen; Narisu Narisu; Allan Linneberg; Olle Melander; Lili Milani; Matt Neville; Marju Orho-Melander; Lu Qi; Qibin Qi; Michael Roden; Olov Rolandsson; Amy Swift; Anders H Rosengren; Kathleen Stirrups; Andrew R Wood; Evelin Mihailov; Christine Blancher; Mauricio O Carneiro; Jared Maguire; Ryan Poplin; Khalid Shakir; Timothy Fennell; Mark DePristo; Martin Hrabé de Angelis; Panos Deloukas; Anette P Gjesing; Goo Jun; Peter Nilsson; Jacquelyn Murphy; Robert Onofrio; Barbara Thorand; Torben Hansen; Christa Meisinger; Frank B Hu; Bo Isomaa; Fredrik Karpe; Liming Liang; Annette Peters; Cornelia Huth; Stephen P O'Rahilly; Colin N A Palmer; Oluf Pedersen; Rainer Rauramaa; Jaakko Tuomilehto; Veikko Salomaa; Richard M Watanabe; Ann-Christine Syvänen; Richard N Bergman; Dwaipayan Bharadwaj; Erwin P Bottinger; Yoon Shin Cho; Giriraj R Chandak; Juliana C N Chan; Kee Seng Chia; Mark J Daly; Shah B Ebrahim; Claudia Langenberg; Paul Elliott; Kathleen A Jablonski; Donna M Lehman; Weiping Jia; Ronald C W Ma; Toni I Pollin; Manjinder Sandhu; Nikhil Tandon; Philippe Froguel; Inês Barroso; Yik Ying Teo; Eleftheria Zeggini; Ruth J F Loos; Kerrin S Small; Janina S Ried; Ralph A DeFronzo; Harald Grallert; Benjamin Glaser; Andres Metspalu; Nicholas J Wareham; Mark Walker; Eric Banks; Christian Gieger; Erik Ingelsson; Hae Kyung Im; Thomas Illig; Paul W Franks; Gemma Buck; Joseph Trakalo; David Buck; Inga Prokopenko; Reedik Mägi; Lars Lind; Yossi Farjoun; Katharine R Owen; Anna L Gloyn; Konstantin Strauch; Tiinamaija Tuomi; Jaspal Singh Kooner; Jong-Young Lee; Taesung Park; Peter Donnelly; Andrew D Morris; Andrew T Hattersley; Donald W Bowden; Francis S Collins; Gil Atzmon; John C Chambers; Timothy D Spector; Markku Laakso; Tim M Strom; Graeme I Bell; John Blangero; Ravindranath Duggirala; E Shyong Tai; Gilean McVean; Craig L Hanis; James G Wilson; Mark Seielstad; Timothy M Frayling; James B Meigs; Nancy J Cox; Rob Sladek; Eric S Lander; Stacey Gabriel; Noël P Burtt; Karen L Mohlke; Thomas Meitinger; Leif Groop; Goncalo Abecasis; Jose C Florez; Laura J Scott; Andrew P Morris; Hyun Min Kang; Michael Boehnke; David Altshuler; Mark I McCarthy Journal: Nature Date: 2016-07-11 Impact factor: 69.504
Authors: Alina G Mikhailova; Alina A Mikhailova; Kristina Ushakova; Evgeny O Tretiakov; Dmitrii Iliushchenko; Victor Shamansky; Valeria Lobanova; Ivan Kozenkov; Bogdan Efimenko; Andrey A Yurchenko; Elena Kozenkova; Evgeny M Zdobnov; Vsevolod Makeev; Valerian Yurov; Masashi Tanaka; Irina Gostimskaya; Zoe Fleischmann; Sofia Annis; Melissa Franco; Kevin Wasko; Stepan Denisov; Wolfram S Kunz; Dmitry Knorre; Ilya Mazunin; Sergey Nikolaev; Jacques Fellay; Alexandre Reymond; Konstantin Khrapko; Konstantin Gunbin; Konstantin Popadin Journal: Nucleic Acids Res Date: 2022-10-14 Impact factor: 19.160