| Literature DB >> 32243542 |
Chase W Nelson1,2, Zachary Ardern3, Xinzhu Wei4,5.
Abstract
Purifying (negative) natural selection is a hallmark of functional biological sequences, and can be detected in protein-coding genes using the ratio of nonsynonymous to synonymous substitutions per site (dN/dS). However, when two genes overlap the same nucleotide sites in different frames, synonymous changes in one gene may be nonsynonymous in the other, perturbing dN/dS. Thus, scalable methods are needed to estimate functional constraint specifically for overlapping genes (OLGs). We propose OLGenie, which implements a modification of the Wei-Zhang method. Assessment with simulations and controls from viral genomes (58 OLGs and 176 non-OLGs) demonstrates low false-positive rates and good discriminatory ability in differentiating true OLGs from non-OLGs. We also apply OLGenie to the unresolved case of HIV-1's putative antisense protein gene, showing significant purifying selection. OLGenie can be used to study known OLGs and to predict new OLGs in genome annotation. Software and example data are freely available at https://github.com/chasewnelson/OLGenie (last accessed April 10, 2020).Entities:
Keywords: zzm321990 antisense protein (asp) gene; zzm321990 dzzm321990 N/dS; gene prediction; genome annotation; human immunodeficiency virus-1; open reading frame; overlapping gene (OLG); purifying (negative) selection
Mesh:
Year: 2020 PMID: 32243542 PMCID: PMC7531306 DOI: 10.1093/molbev/msaa087
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
. 1Overlapping genes: reading frames and terminology. (A) The six possible protein-coding open reading frames (ORFs) of a double-stranded nucleic acid sequence. Codons are denoted with solid black boxes, each comprising three ordered nucleotide positions (1, 2, 3) with light gray boundaries. The reference gene frame is shown with a white background, whereas alternate gene frames are shown with a gray background. Frame relationships are indicated using the nomenclature of Wei and Zhang (2015), where “ss” indicates “sense–sense” (same-strand), “sas” indicates “sense–antisense” (opposite-strand), and the numbers indicate which codon position of the alternate gene (second number) overlaps codon position 1 of the reference gene (first number). For all alternate frames except sas13, one reference codon partially overlaps each of two alternate codons. (B) Example of an overlapping gene in the ss13 frame. A minimal overlapping unit of 6 nt is shown, comprising one reference gene codon and its two overlapping codons in the alternate gene. At position 2 of the reference codon (highlighted in yellow), three nucleotide changes are possible: two cause nonsynonymous changes in both genes (NN; nonsynonymous/nonsynonymous) and one causes a nonsynonymous change in the reference gene but a synonymous change in the alternate gene (NS; nonsynonymous/synonymous). No synonymous/nonsynonymous (SN) or synonymous/synonymous (SS) changes are possible at this site. Thus, this site is counted as two-thirds of an NN site and one-third of an NS site. Finally, a pair of sequences having a C/A or C/G difference at this site is counted as having 1 NN difference, whereas a pair of sequences having a C/T difference at this site is counted as having 1 NS difference. (C) Example calculation of dNN, dSN, dNS, and dSS for a pair of sequences with an overlapping gene in ss13. Codons are denoted with brackets above (reference gene) and below (alternate gene) each sequence. The distance d is calculated for each site type (NN, SN, NS, and SS) as the number of differences divided by the number of sites of that type. Because the first and last reference codons only partially overlap alternate codons, they are excluded from analysis and the numbers of sites sum to 15 (= 5 codons × 3 nt; codons 2–6). Numbers of sites are not an exact multiple of 1/3 because nucleotide 6 of sequence 2 (TTT; alternate codon TTG) does not tolerate a change to A, as this would lead to a stop codon in the alternate gene (TAG). Thus, this position is considered an SN site in sequence 1, but one-half of an NN site and one-half of an SN site in sequence 2, for a mean of 0.25 NN and 0.75 SN sites. The table shows the mean numbers of sites for the two sequences (sequence 1 = 4.33 NN, 5 SN, 5.67 NS, and 0 SS; sequence 2 = 5.83 NN, 4.5 SN, 4.67 NS, and 0 SS), used to calculate each d value. For a multiple sequence alignment, the mean number of differences and sites for all pairwise comparisons would be used.
Methods with Available Implementations for Detecting Selection in Overlapping Genes.
| Program | Reference | Target | Implementation | Method Description | Advantages and Limitations | Available from |
|---|---|---|---|---|---|---|
| OLGenie | This study | Protein-coding sequence | Perl | Estimates | Fast; applicable to multiple sequence alignments; tree-agnostic; conservative for purifying selection and high levels of divergence, but nonconservative for positive selection; loss of power for pairwise distance >0.1 and neighboring variants |
|
| “Frameshift” |
| Protein-coding sequence | R | Finds ORFs longer than expected by chance given nucleotide context; includes two complementary methods: “codon permutation” and “synonymous mutation” | Medium to high accessibility as an R script requiring minor modifications. Can only detect relatively long OLGs. Slow for long sequences. |
|
| “StopStatistics” |
| Protein-coding sequence | Python, bash | Tests for depletion of those stop codons in sas12 that would be synonymous in reference; also applicable to enrichment of start codons | Low accessibility; scripts specific to particular data sets |
|
| FRESCo |
| Constraint at synonymous sites | HYPHY batch language | Rates of nucleotide evolution across an alignment inferred using a maximum-likelihood model. Models of neutral and nonneutral evolution tested in sliding windows to infer regions with excess synonymous constraint | Suitable for short genomes/regions despite using a codon model; requires a phylogenetic tree; performs best at deep sequence coverage and increased sequence divergence |
|
| Wei–Zhang method |
| Protein-coding sequence | Perl | Estimates | Accurate but slow, especially for highly diverged sequences; tree-agnostic; outperforms Sabath et al. method (according to | http://www.umich.edu/~zhanglab/download/Xinzhu_GBE2014/index.htm, last accessed April 10, 2020. |
| Synplot2 |
| Constraint at synonymous sites | C++; Web-interface | Evolution at synonymous sites in a codon alignment compared to a null model of neutral evolution in order to infer sites with excess constraint; expected diversity at synonymous sites is set equal to diversity over the full alignment, and diversity is measured between sequential pairs around a phylogenetic tree | Medium accessibility; fast; limited use in the case of sas12; requires a phylogenetic tree; does not distinguish between coding and noncoding overlapping features |
|
|
| ||||||
| KaKi (“Multilayer”) |
| Unexpected variation at synonymous sites | C++ | Maximum-likelihood codon model approach that allows variation in both the synonymous and nonsynonymous substitution rates along a sequence; accounting for variability in the baseline substitution rate allows more reliable inference of positive selection | Low accessibility (requires an old Linux distribution to install); requires a phylogenetic tree; complex input and results; focus of explicit testing is on positive selection; applicable (but not specific) to protein-coding OLGs. |
|
| https://www.tau.ac.il/~talp/readme.txt, last accessed April 10, 2020. | ||||||
| Sabath et al. method |
| Protein-coding sequence | MATLAB | Maximum-likelihood framework for estimating | Slower than Wei–Zhang; not recommended for highly similar sequences (pairwise distance <0.08); similar to OLGenie in the use of 6 nt (“sextet”) units; only implemented for pairs of sequences; low accessibility and scalability |
|
| MLOGD |
| Protein-coding sequence | C++ | Simple statistics on properties of sequence variation by codon position, and a maximum-likelihood statistic (MLOGD) taking into account nucleotide and amino acid substitution rates and codon usage | Less sensitive at detecting OLGs than Synplot2 (according to |
|
Programs in descending order by year of publication; methods lacking implementations at active URLs are not listed.
Nomenclature for Overlapping Protein-Coding Reading Frames.
| Study | Frame | |||||
|---|---|---|---|---|---|---|
|
|
|
|
|
|
| |
|
|
|
|
|
|
| |
| OLGenie; | Reference (ss11) | ss12 | ss13 | sas11 | sas12 | sas13 |
|
| +1 | +3 | +2 | −3 | −2 | −1 |
|
| +0 | +2 | +1 | −1 | −2 | −0 |
|
| +0 | +2 | +1 | −c2 | −c1 | −c0 |
|
| 0 | 2 (same-strand) | 1 (same-strand) | 1 (opposite-strand) | 2 (opposite-strand) | 0 (opposite-strand) |
|
| 0 | −1 | +1 | rc-1 | rc+1 | rc0 |
|
| 0 | +2 | +1 | −1 | −2 | −3 |
|
| − | − | − | C1 | C3 | C2 |
|
| − | +2 | +1 | −1 | 0 | −2 |
|
| 0 | 2 | 1 | 5 | 3 | 4 |
Studies in descending order by year of publication.
Black denotes the reference frame and blue denotes the alternate frame; one alternate codon position is underlined to show overlap with reference codon position 1 (e.g., position 3 for ss13).
As reported by Lèbre and Gascuel (2017).
. 2Assessment of OLGenie using simulated sequences. Calibration plots show the accuracy and precision of OLGenie dN/dS estimates for the reference (top row; dNN/dSN) and alternate (bottom row; dNN/dNS) genes when mean pairwise distance is set to 0.0585 per site (median of biological controls). For each frame relationship, estimated dN/dS is shown as a function of the actual simulated value, indicated by horizontal black line segments (x axis values), and of the dN/dS value of the overlapping gene, indicated by color (left to right: purple = 0.1; blue = 0.5; green = 1.0; orange = 1.5; and red = 2.0). For example, all purple points in the top row refer to simulations with alternate gene dN/dS = 0.1, whereas all purple points in the bottom row refer to simulations with reference gene dN/dS = 0.1. To obtain highly accurate point estimates, each parameter combination (reference dN/dS, alternate dN/dS, frame) was simulated using 1,024 sequences of 100,000 codons (supplementary table S1, Supplementary Material online). Then, to obtain practical estimates of variance relevant to real OLG data, simulations were again carried out for each parameter combination so as to emulate our biological control data set: a sample size of 234, with sequence lengths (number of codons) and numbers of alleles (max 1,024) randomly sampled with replacement from the controls (supplementary table S2, Supplementary Material online). Error bars show SEM, estimated from replicates with defined dN/dS values (≤234) using 10,000 bootstrap replicates (reference codon unit). A transition/transversion ratio (R) of 0.5 (equal rates) was used; similar results are obtained using R = 2 (supplementary fig. S3, Supplementary Material online). Full simulation results are presented in supplementary tables S1–S6 and figures S1–S6, Supplementary Material online.
. 3Assessment of OLGenie using biological controls. (A and B) Receiver operating characteristic (ROC) curves for overlapping (alternate) gene prediction at varying P value cut-offs. The y axis shows the true-positive rate (sensitivity) and the x axis shows the false-positive rate (1−specificity). Curves show subsets of the data corresponding to differing minimum length (A) and maximum dN/dS (B) criteria, following the approach of Schlub et al. (2018), with red indicating the strictest criteria. The full data set is represented by purple in (A) (overlaps blue). Area under the curve (AUC) is reported in parentheses in the key (supplementary tables S6–S8, Supplementary Material online), and the ROC expected using random classification (AUC = 0.5) is shown as a diagonal gray line. Vertical dashed lines show mean false-positive rates for P value cut-offs of 0.001, 0.01, and 0.05 (left to right). The site-rich dNN/dNS ratio was used to analyze 234 controls (81 ss12 and 153 ss13): 58 positive (16 ss12 and 42 ss13) and 176 negative (65 ss12 and 111 ss13). Of these, 162 (30 positive, 132 negative) had length ≥ 300 nt, and 14 (10 positive, 4 negative) had dN/dS ≤0.2. (C) The HIV-1 env gene was analyzed in sas12 with the site-rich ratio dNN/dNS using 25-codon sliding windows (step size = 1 codon), limiting to codons with ≥6 defined (nongap) sequences. The hypothesized asp gene is located at codons 655–1,033 (supplementary table S15, Supplementary Material online). The y axis shows significance, calculated as the natural logarithm of the inverse P value, as suggested by Firth (2014), using Z tests of the null hypothesis that dNN = dNS (1,000 bootstrap replicates per window; reference codon unit). The horizontal dashed gray line shows the multiple comparisons P value threshold (0.000924) suggested by Meydan et al. (2019) and described in supplementary section S5, Supplementary Material online, that is, a threshold of 0.05/(CDS length/window size). Results for other frames are shown in supplementary figure S9, Supplementary Material online. Positive selection (red) refers to dN/dS > 1; purifying selection (blue) refers to dN/dS < 1. Sequence features are described in supplementary table S15, Supplementary Material online and shown here as shaded rectangles: yellow for hypothesized sas12 genes, green for the highly structured RNA Rev response element (RRE), and gray otherwise.