Literature DB >> 34450622

ProteinEvolverABC: Coestimation of Recombination and Substitution Rates in Protein Sequences by approximate Bayesian computation.

Abstract

MOTIVATION: The evolutionary processes of mutation and recombination, upon which selection operates, are fundamental to understand the observed molecular diversity. Unlike nucleotide sequences, the estimation of the recombination rate in protein sequences has been little explored, neither implemented in evolutionary frameworks, despite protein sequencing methods are largely used.
RESULTS: In order to accommodate this need, here I present a computational framework, called ProteinEvolverABC, to jointly estimate recombination and substitution rates from alignments of protein sequences. The framework implements the approximate Bayesian computation approach, with and without regression adjustments, and includes a variety of substitution models of protein evolution, demographics and longitudinal sampling. It also implements several nuisance parameters such as heterogeneous amino acid frequencies and rate of change among sites and, proportion of invariable sites. The framework produces accurate coestimation of recombination and substitution rates under diverse evolutionary scenarios. As illustrative examples of usage, I applied it to several viral protein families, including coronaviruses, showing heterogeneous substitution and recombination rates. AVAILABILITY: ProteinEvolverABC is freely available from https://github.com/miguelarenas/proteinevolverabc, includes a graphical user interface for helping the specification of the input settings, extensive documentation and ready-to-use examples. Conveniently, the simulations can run in parallel on multicore machines. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.

Entities: Chemical

Year: 2021 PMID： 34450622 PMCID： PMC8696103 DOI： 10.1093/bioinformatics/btab617

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 Introduction

Mutation and recombination play a central role in acquiring molecular diversity during the evolution of organisms. Consequently, the analysis of molecular evolution through the estimation of recombination and substitution (fixed mutation) rates constitutes a common object of study in evolutionary biology. Despite a variety of evolutionary methods are available to estimate the recombination rate in DNA sequences (Arenas, 2021; Martin ), none of them allow it in protein sequences. However, methods to study molecular evolution in protein sequences are necessary because most of protein sequencing methodologies [i.e. based on RMN and X-ray diffraction, among others (see for a review Edman and Begg, 1967; Smith, 2003), some of them developed before than DNA sequencing (see for a review Strasser, 2010)] and protein databases [i.e. PFAM, PDB and CATH, among others (see for a review Kwon )], do not involve DNA sequences. Analyzing protein evolution allows us to understand the direct consequences of selection from protein stability and function (Bastolla ; Pál ; Taverna and Goldstein, 2002; Wilke, 2012) and resolve phylogenetic relationships (e.g. Alvarez-Ponce, 2021; Arenas, 2020; Bastolla and Arenas, 2019; Gabaldón, 2005; Gupta, 1998; Pazos and Valencia, 2008), among others (Patthy, 2008). In this concern, molecular signatures of recombination were detected in proteins, with strong selection against intra-domain recombination (compared to recombination in protein domain boundaries) to conserve protein function and folding stability (Meyer ; Voigt ), but still the presence of recombination should be quantified by the estimation of the recombination rate. The estimation of the recombination rate is not trivial, especially in data presenting low genetic variation (note that recombination cannot be detected in absence of substitution events) or lack of phylogenetic tree discordance among recombinant fragments [a number of recombination estimation methods are based on this feature (e.g. de Oliveira Martins ) but not always the recombined fragments display different evolutionary histories (Arenas and Posada, 2010)]. Thus, the recombination rate can be underestimated in certain evolutionary scenarios (Martin ; Posada and Crandall, 2001). A recent example is the problematic detection of recombination along the SARS-CoV-2 genome due to the relatively low levels of genetic diversity displayed by this virus (VanInsberghe ). The estimation of the recombination rate can be affected by the substitution rate and vice versa [note that the recombination rate can exceed the substitution rate in some organisms such as HIV (Perez-Losada ; Shriner ) and affect estimations of diverse evolutionary parameters such as the synonymous and non-synonymous substitution rates (Anisimova ; Arenas and Posada, 2010; Del Amparo ; Shriner )]. Therefore, the coestimation of both parameters is convenient due to accounting for their parameters interactions (Lopes ). Indeed, the estimation of the recombination rate could be more complex at the protein level than at the nucleotide level because of the usually lower levels of sequence diversity in proteins compared to DNA (i.e. note that synonymous substitutions are not observed at the protein level). However, the accuracy of recombination rate estimates from protein sequences respect to estimates from DNA sequences was not yet formally evaluated. Next, at the protein level, probabilistic phylogenetic methods to analyze protein sequences traditionally mimic the substitution process with substitution models of evolution composed by a 20 × 20 exchangeability matrix of relative rates of change among amino acids and the 20 amino acid frequencies (Arenas, 2015; Yang, 2006), which differ from the non-coding DNA substitution models (4 × 4 exchangeability matrix and 4 nucleotide frequencies) and coding DNA substitution models (61 × 61 exchangeability matrix and 61 codon frequencies, stop codons are often excluded) (Arenas, 2015; Yang, 2006). These models can also include variation of the rate of change among sites and a proportion of invariable sites (Yang ). In this concern, it is known that accounting for the best fitting substitution model of protein evolution is convenient for evolutionary inferences (e.g. Arenas and Bastolla, 2020; Bordner and Mittelmann, 2014; Lemmon and Moriarty, 2004) although there is a recent discussion concerning its specific application to phylogenetic tree reconstructions (Spielman, 2020). In any case, accounting for the substitution model of evolution that best fits the data in the estimation of the recombination rate with probabilistic methods was found necessary (Lopes ). However, it is yet unclear if the estimation of the recombination rate with probabilistic methods could be affected by analyzing different molecular markers (i.e. molecular evolution is mimicked by different substitution models of evolution). In general, there is a need for methods that coestimate multiple evolutionary parameters (especially for evolutionary analyses of protein sequences) and their implementation into useful evolutionary frameworks. Coestimation methods usually require algorithms based on complex models of evolution and present limitations concerning the derivation of analytical formulae or can lead to computationally too expensive evaluations of likelihood functions (Beaumont and Rannala, 2004). In such situations, an alternative is the approximate Bayesian computation (ABC) approach (Beaumont, 2010; Csillery ), which considers computer simulations under complex evolutionary scenarios followed by a statistical adjustment for the estimation of parameters without the need of a likelihood function. Hence, taking advantage of ABC, we previously developed an ABC framework for the joint estimation of recombination, selection [by the non-synonymous/synonymous rate ratio, dN/dS (Del Amparo )] and substitution rates in coding sequences that outperformed other methods (including maximum-likelihood methods) in the estimation of the recombination rate (Arenas ; Lopes ). Here, I adapted that ABC method to the evolutionary analysis of protein sequences and I implemented it in a user-friendly computational framework called ProteinEvolverABC. ProteinEvolverABC provides the coestimation of substitution and recombination rates from multiple alignments of protein sequences under ABC based on computer simulations of coalescent evolutionary histories (coalescent modified with recombination, demographics and serial sampling) followed by protein evolution (under a variety of substitution models of evolution). The framework includes diverse nuisance parameters (simulated under user-specified prior distributions but not estimated) that are optional to provide a more realistic modeling (by accounting for their uncertainty), multiple summary statistics designed to extract the evolutionary information from protein sequences and ABC estimation under both rejection and multiple linear regression approaches. The framework was evaluated under different levels of substitution and recombination rates and showed an acceptable estimation performance. In addition, the framework was able to accurately estimate the recombination rates observed in coding sequences. As illustrative practical examples, I applied it to the analysis of some viral protein families of general interest.

2 System and methods

2.1 ProteinEvolverABC framework

The framework ProteinEvolverABC follows the standards of ABC to perform parameters estimation through four main steps: specification of input information including prior distributions, simulation of evolutionary histories and protein sequences evolution (upon those evolutionary histories), computation of summary statistics and, joint estimation of recombination and substitution rates with the rejection and regression statistical approaches. Details about these steps are provided below. (1) Specification of input information. Despite the ABC approach has some advantages respect to other analytical approaches in terms of flexibility when dealing with complex models (Arenas, 2015; Beaumont, 2010), it also requires some decisions (to be made by the user) that can affect the estimations. One of them is the specification of prior distributions for the parameters to be estimated and, optionally, for nuisance parameters. A list with the main parameters implemented in the framework is shown in Supplementary Table S1, Supplementary Material. The prior distribution for any parameter should be wide and sampled enough to include a sufficient number of simulated data with parameter values close to the true value (Beaumont, 2010). The user of ProteinEvolverABC has to specify prior distributions for the recombination and substitution rates and, optionally, for some nuisance parameters (i.e. amino acid frequencies, heterogeneity in the substitution rate across sites and proportion of invariable sites) that could help provide more realistic simulations. Indeed, the user has to specify a substitution model of protein evolution [that can be previously identified with frameworks such as ProtTest (Darriba )] and some parameters for the ABC estimation such as the tolerance (number of simulations used for the estimation with summary statistics closer to the summary statistics of the study dataset) or the estimation approach (i.e. rejection and multiple linear regression) (Beaumont ). The software documentation includes detailed recommendations for specifying every input parameter properly. (2) The simulation of protein sequences is internally performed with an adapted version of the simulator ProteinEvolver (Arenas ) to ABC. This simulator produces multiple alignments of protein sequences by the following two steps: (i) simulation of evolutionary histories with the coalescent modified with recombination and demographics (Hudson, 1983; Kingman, 1982), where note that the coalescent allows a rapid computation (Arenas, 2012) that is convenient for ABC methods that usually require a large number of simulated data to properly explore the parameters landscape (Beaumont, 2010) and, (ii) simulation of protein sequence evolution upon the previously simulated evolutionary histories under a user-specified substitution model of protein evolution (Supplementary Table S1) (Arenas, 2012; Yang, 2006), optionally including substitution rate variation among sites and proportion of invariable sites (Yang, 1994). ProteinEvolver is based on previous coalescent simulators (Arenas and Posada, 2007; Arenas and Posada, 2010) and has been widely used and validated (e.g. Arenas and Bastolla, 2020; Arenas ; Arenas ; Pascual-Garcia ). The simulations are parameterized sampling from the user-specified prior distributions (Supplementary Table S1) and, conveniently, can run in parallel on multicore computers. (3) The framework ProteinEvolverABC implements a total of 16 summary statistics to extract the evolutionary information from the observed and simulated protein data (Supplementary Table S2; Supplementary Material). The summary statistics comprise the mean, standard deviation, skewness and kurtosis of diversity, heterozygosity and pairwise sequence identity. They also include the number of segregating sites and three fast recombination tests [pairwise homoplasy index (Bruen ), neighbor similarity score (Jakobsen and Easteal, 1996) and maximum chi-squared (Maynard Smith, 1992)]. Some of these summary statistics (i.e. heterozygosity) are based on a previous ABC study to estimate substitution and recombination rates from codon sequences with acceptable error (Lopes ) and were adapted here to the analysis of protein sequences. (4) Finally, ProteinEvolverABC estimates the substitution and recombination rates with the abc R library (Csillery ). In particular, the framework estimates (i) the population recombination rate ρ = 4Nrl (McVean ), where N is the effective population size, r is the recombination rate per site and l is the sequence length; (ii) the population amino acid substitution rate θ = 4 Nµl, where µ is the substitution rate per site. Before performing the estimation, the framework includes some goodness-of-fit analyses that are useful to evaluate the fitting of the simulation model with the observed data and the information provided by every summary statistic. The implemented goodness-of-fit analyses include (i) distance between the distribution of summary statistic values from retained simulations and summary statistic value of the observed data for every summary statistic, (ii) evaluation of summary statistic values (for every summary statistic) with corresponding parameters (recombination and substitution rates) values of the simulation and, (iii) the two first principal components of a principal component analysis based on all the summary statistics from a sample of all the simulations, retained simulations and observed data (Csillery ). Further details about the implemented goodness-of-fit analyses can be found in the documentation and illustrative examples distributed with ProteinEvolverABC. The ABC estimation can be performed with the rejection and weighted multiple linear regression approaches, which are well established in the field (Beaumont, 2010; Beaumont ; Blum and François, 2010; Csillery ). In practice, the user has to specify the number of simulations, the prior distributions for the parameters of interest, the tolerance level and the estimation approach, as well as other minor options. The framework is distributed with a documentation that includes detailed recommendations, and illustrative examples, about the specifications (see also next section). In general, a total of 50 000 simulations can be a good starting point (see next section), but of course some datasets (especially those presenting low sequence identity) may require more simulations. The framework ProteinEvolverABC runs on the command line but includes a graphical user interface (GUI) to facilitate the input parameterization of the whole estimation procedure. Because the simulation step is usually slow, the framework can run the simulations in parallel on multicore machines, which allows a reduction of the computation time (see later). ProteinEvolverABC consists of a pipeline written in Java (GUI), C, Perl and R, and it is freely available from https://github.com/miguelarenas/proteinevolverabc. The package includes executable files, source code, detailed documentation and illustrative examples with input and output files.

2.2 ProteinEvolverABC validation

Despite a previous ABC study already showed accurate coestimation of recombination and substitution rates (even more accurate than maximum-likelihood methods) in the analysis of protein-coding data (Lopes ), here I performed an evaluation of ProteinEvolverABC under different evolutionary scenarios based on multiple combinations of substitution and recombination rates. I simulated protein sequences under different levels of ρ [0, 30, 60 and 90, which are levels of recombination that can be observed in nature (e.g. Arenas ; Carvajal-Rodriguez ; Lopes )] and θ [50, 100, 200, 300 and 400, which involve a wide range of sequence identity (from above 0.9 to below 0.6) fitting with observations in nature (e.g. Cornish-Bowden, 1977; Pascual-Garcia )], for alignments of 25 sequences with 500 amino acids, assuming a fixed effective population size of 1000 individuals (Carvajal-Rodriguez ) and the JTT empirical substitution model of protein evolution (Jones ). For every combination of parameters [4 × 5 = 20 combinations, which are more combinations than those evaluated in (Lopes )], I simulated 100 multiple sequence alignments (test datasets). Next, the framework was used to estimate ρ and θ in the test datasets under ABC based on a total of 50 000 simulations parameterized under the following prior distributions: ρ = Uniform(0,120) and θ = Uniform(0,500), which are prior distributions wider than those used in (Lopes ) and encompass values that are commonly observed in real data (e.g. Carvajal-Rodriguez ; Monteiro ; Perez-Losada ; Perez-Losada ; Stumpf and McVean, 2003). Following previous works (Arenas ; Lopes ), ABC estimates were obtained assuming an acceptance rate of 0.2% (100 points) in the adjustment with the rejection and the weighted multiple linear regression approaches. The validation of ProteinEvolverABC showed that the parameter estimates are generally accurate (the true values usually fell within the 95% confidence interval of the estimates; Fig. 1) and in good agreement with previous ABC methods (Arenas ; Lopes ). In particular, the results indicate that both parameters can be accurately estimated using 50 000 computer simulations under any implemented ABC estimation approach (rejection and multiple linear regression). As expected, the estimation of the recombination rate was more difficult at low levels of substitution and recombination rates (in agreement with Posada and Crandall, 2001) (see Fig. 1A with θ = 50 and ρ =30) showing a small overestimation. Increasing the number of simulations to 100 000 overall improved the accuracy of the estimations (also reduced the overestimation of the recombination rate at low levels of substitution and recombination rates) (Supplementary Fig. S1; Supplementary Material) but still, in general, 50 000 simulations were enough to obtain acceptable estimates concerning accuracy.

Fig. 1.

Accuracy of ProteinEvolverABC in the estimation of recombination and substitution rates under different evolutionary scenarios and based on ABC with 50 000 simulations. For each studied combination of ρ and θ (evolutionary scenario based on 100 simulations) the figure shows the estimates of ρ (above) and θ (below). The black bars indicate the true value. Clear and dark grey bars correspond to the mode of the estimated posterior distributions (using the rejection and multiple linear regression approaches, respectively, both based on 50 000 simulations) and error bars indicate the 95% confidence interval Since additional comparisons of accuracy of estimated recombination and substitution rates between ProteinEvolverABC and other evolutionary frameworks cannot be performed yet due to the lack of frameworks implementing these estimations in protein sequences, next I evaluated the accuracy of ProteinEvolverABC in the estimation of the recombination and substitution rates present in coding data. The aim is to evaluate the bias of estimating these parameters from protein sequences respect to the codon level. I applied the CoalEvol framework (Arenas and Posada, 2014) to simulate coding sequences under different levels of ρ (0 and 60) and θ (50, 200 and 400) for alignments of 25 sequences with 500 codons (1500 nucleotides), assuming a fixed effective population size of 1000 individuals (Carvajal-Rodriguez ) and the GY94 substitution model of codon evolution with dN/dS = 1 (Goldman and Yang, 1994). For every combination of parameters (2 × 3 = 6 combinations), I simulated 100 multiple sequence alignments. Next, the multiple alignments of codon sequences were translated to protein sequences considering the universal genetic code. Finally, ProteinEvolverABC (using the previously described parameterization of prior distributions, total number of simulations and acceptance rate) was applied to estimate the recombination and substitution rates. The accuracy of ProteinEvolverABC in estimating recombination and substitution rates observed in coding data differed between these parameters. The recombination rate was accurately estimated (Fig. 2), suggesting that the framework can capture signatures of recombination observed at the nucleotide level. However, the substitution rate was underestimated, especially when it is large (Fig. 2). This means that the substitution rate occurring at the nucleotide level cannot be estimated at the protein level. This is not surprising because note that synonymous codon substitutions are not transferred to protein sequences. In any case, users of ProteinEvolverABC should take this issue into account.

Fig. 2.

Accuracy of ProteinEvolverABC in the estimation of recombination and substitution rates present in coding data. For each studied combination of ρ and θ (evolutionary scenario based on 100 simulations) the figure shows the estimates of ρ (above) and θ (below). The black bars indicate the true value (recombination and substitution rates present in coding data). Clear and dark grey bars correspond to the mode of the estimated posterior distributions (using the rejection and multiple linear regression approaches, respectively, both based on 50 000 simulations) and error bars indicate the 95% confidence interval

2.3 Illustrative application to viral protein families

As illustrative practical examples of the application of ProteinEvolverABC (and also to provide an idea about running times) I analyzed 8 real datasets of viral protein families (Table 1). These protein families include coronaviruses [motivated by the current pandemic and since they have shown genetic signatures of recombination (e.g. Graham and Baric, 2010; Zhu )] and aspartyl proteases [due to its interest as a molecular target of antiretroviral therapies (e.g. Ghosh ; van Leeuwen ) and where recombination could favor the emergence of resistance to therapies (e.g. Fraser, 2005; Shi )]. Indeed, note that the coalescent can be used to mimic virus evolution (Bouckaert ; Pybus and Rambaut, 2009). All the protein families were downloaded from the PFAM database (Finn ) and realigned with MAFFT (Katoh and Standley, 2013). I ran a total of 50 000 simulations under the prior distributions used for the analysis of the simulated data (see above). The analyses of these datasets took several hours and, as expected, they ran faster when using more cores (Supplementary Fig. S2; Supplementary Material). In particular, increasing the number of processors always reduced computer times, especially in big datasets where simulations (which are parallelized) require more time. However, the decrease of computer time as a function of the number of processors was not linear (Supplementary Fig. S2) because other ABC phases could not be parallelized. Also as expected, bigger datasets, with more and longer sequences, required longer computer times (Supplementary Figs S2 and S3AB; Supplementary Material). By contrast, the computer time was similar for different substitution models of protein evolution (Supplementary Fig. S3C), which was also expected since all the empirical substitution models of protein evolution present exchangeability matrices with same dimension (20 × 20) (Arenas, 2015; Yang, 2006). The goodness-of-fit analyses produced acceptable results with summary statistics of real data within the distribution of summary statistics of simulated data (including retained simulated data; Supplementary Figs S4–S11, Supplementary Material). The results (Table 1) showed that both recombination and substitution rates are heterogeneous among coronavirus protein families and that the aspartyl protease protein family presents a high recombination rate in agreement with previous observations (e.g. Speranskaya ; Sun ).

Table 1.

Recombination and substitution rates estimated with ProteinEvolverABC for the studied protein families

Description of the protein family	PFAM code, number of sequences, sequence length	Mode	Mean	Median	97.5% HPDI
Coronavirus small envelope protein E	PF02723, 27, 82	(ρ) 48.68	(ρ) 53.95	(ρ) 48.79	(ρ) 47.64–119.00
Coronavirus small envelope protein E	PF02723, 27, 82	(θ) 179.40	(θ) 182.51	(θ) 179.44	(θ) 170.45–237.06
Coronavirus 2′-O-methyltransferase	PF06460, 5, 299	(ρ) 46.67	(ρ) 46.53	(ρ) 46.63	(ρ) 45.49–47.01
Coronavirus 2′-O-methyltransferase	PF06460, 5, 299	(θ) 230.62	(θ) 290.14	(θ) 272.73	(θ) 132.93–477.30
Coronavirus non-structural protein NS12.7	PF04753, 9, 109	(ρ) 1.30	(ρ) 2.27	(ρ) 1.69	(ρ) 0.40–5.82
Coronavirus non-structural protein NS12.7	PF04753, 9, 109	(θ) 125.72	(θ) 120.98	(θ) 123.10	(θ) 67.07–162.22
Coronavirus replicase NSP7	PF08716, 5, 83	(ρ) 95.02	(ρ) 82.39	(ρ) 92.75	(ρ) 5.89–98.06
Coronavirus replicase NSP7	PF08716, 5, 83	(θ) 237.74	(θ) 267.56	(θ) 238.56	(θ) 186.48–495.16
Coronavirus replicase NSP8	PF08717, 5, 198	(ρ) 38.45	(ρ) 44.99	(ρ) 41.83	(ρ) 14.59–103.27
Coronavirus replicase NSP8	PF08717, 5, 198	(θ) 323.13	(θ) 303.82	(θ) 309.27	(θ) 167.88–468.66
Coronavirus RNA synthesis protein NSP10	PF09401, 6, 122	(ρ) 28.45	(ρ) 44.19	(ρ) 29.52	(ρ) 19.45–119.99
Coronavirus RNA synthesis protein NSP10	PF09401, 6, 122	(θ) 166.01	(θ) 176.39	(θ) 169.29	(θ) 146.82–237.30
Betacoronavirus viroporin	PF11289, 5, 273	(ρ) 21.03	(ρ) 25.49	(ρ) 22.00	(ρ) 8.34–51.98
Betacoronavirus viroporin	PF11289, 5, 273	(θ) 49.43	(θ) 65.17	(θ) 54.80	(θ) 36.07–127.61
Aspartate protease	PF09668, 10, 124	(ρ) 106.23	(ρ) 106.61	(ρ) 106.26	(ρ) 105.58–111.09
Aspartate protease	PF09668, 10, 124	(θ) 178.27	(θ) 197.65	(θ) 183.42	(θ) 115.15–357.84

Note: The first column provides information about the protein family and the second column includes the PFAM code, number of sequences and sequence length in amino acids, respectively. For each parameter (recombination rate ρ or substitution rate θ), the table presents the mode, mean, median and 97.5% HPDI (highest posterior density interval) of the estimated posterior distribution. The full posterior distributions are shown in Supplementary Figures S12–S19, Supplementary Material.

Recombination and substitution rates estimated with ProteinEvolverABC for the studied protein families Note: The first column provides information about the protein family and the second column includes the PFAM code, number of sequences and sequence length in amino acids, respectively. For each parameter (recombination rate ρ or substitution rate θ), the table presents the mode, mean, median and 97.5% HPDI (highest posterior density interval) of the estimated posterior distribution. The full posterior distributions are shown in Supplementary Figures S12–S19, Supplementary Material.

3 Discussion

There is a general need for evolutionary frameworks implementing the estimation of the recombination rate in protein data. Motivated by this aspect, here I introduce an ABC framework for the coestimation of substitution and recombination rates from a multiple alignment of protein sequences. The user can specify diverse prior distributions for a variety of evolutionary parameters (including nuisance parameters, Supplementary Table S1) and perform ABC estimation under both rejection and multiple linear regression approaches considering protein data simulated under an empirical substitution model of protein evolution (among a variety of implemented substitution models; Table S1) upon coalescent evolutionary histories with and without recombination. An advantage of using computer simulations in an analytical approach is to consider the influence of parameters interactions in their estimation, which is often omitted in other analytical methods such as maximum likelihood (Li and Stephens, 2003). However, ABC estimation methods usually require computational efforts due to the large number of simulations necessary to cover the parameters landscape described by the prior distributions (although if the user has prior information can reduce the range of the prior distribution and/or choose an informative prior distribution sampling frequently near to the true parameter value, thus reducing the required number of simulations), but the accuracy of the estimates may justify the effort. Similarly to the preceding ABC methods of ProteinEvolverABC (Arenas ; Lopes ), I found that ProteinEvolverABC provides accurate estimates under different evolutionary conditions. In addition, it can estimate the recombination rate that is observed at the nucleotide level, although (as expected) it underestimates the substitution rate observed at the nucleotide level. However, but comparably to other estimation methods, the user has to consider that the accuracy of the estimates provided by ProteinEvolverABC could be affected by processes ignored in the framework such as a small population size (note that the coalescent requires a much larger population size compared to the sample size), among others. Moreover, analyzing protein fragments (i.e. domains) is not recommended, especially if the fragment is conserved, because of the lack of evolutionary information that could produce statistically unsupported estimates. In general, the user of an evolutionary framework (including ProteinEvolverABC) should be cautious interpreting the results, take into account the estimated credible interval and consider that other biological processes can affect the estimates. Indeed, care should be taken when specifying ABC settings such as number of simulations, prior distributions or tolerance. In this concern, ProteinEvolverABC is distributed as a package with a GUI that helps to specify the entire analysis, a detailed documentation that includes theory, usage and recommendations about the input parameters and, several illustrative practical examples. Concerning the prior distributions, in absence of previous biological information the user may specify uniform distributions. However, if there is any prior information (i.e. a suspected low substitution rate) the user can define more precise prior distributions that could improve the accuracy of the estimation (Beaumont, 2010). In any case, the prior distributions should include biologically reasonable parameter values and prior expectations of the parameter values in the study data. In general, I recommend applying by default the settings presented in the validation section (50 000 simulations and acceptance rate of 0.2%). However, for complex real data (i.e. data presenting very high or low genetic diversity), increasing the number of simulations and acceptance rate [i.e. up to 10 times (following Lopes )] can be useful to obtain more accurate estimates. Indeed, repeating an analysis increasing the number of simulations, exploring different acceptance rates or even a different set of the implemented summary statistics (evaluated with the goodness-of-fit analyses), can help in finding a proper input parameterization to obtain reliable estimates. As illustrative examples, the framework was applied to several viral protein families, especially from coronavirus. Coronavirus protein families showed contrasting recombination and substitution rates. It is remarkable the low recombination rate estimated for the protein families NS12.7 (studies involving recombination in this protein were not found) and viroporins [in agreement with the minor recombination events detected in (Lulla and Firth, 2020) and that can be expected due to conservation for maintaining viral pathogenicity (Nieva ) and thus being ideal target for antiviral treatments (Nieva )] and, the large recombination rate estimated for the replicase NSP7 [where recombinant forms of SARS-CoV NSP7 and NSP8 could greatly improve the association of these proteins (te Velthuis )]. Concerning the aspartyl protease protein family, it presented a high recombination rate that agrees with previous studies (e.g. Speranskaya ; Sun ).

Data Availability

ProteinEvolverABC is freely available at https://github.com/miguelarenas/proteinevolverabc.

Funding

This work was supported by the Spanish Ministerio de Ciencia e Innovación through the Grants [RYC-2015-18241] and [PID2019-107931GA-I00]. Funding for open access charge: Universidade de Vigo/CISUG. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Conflict of Interest: none declared. Click here for additional data file.

72 in total

1. Evaluation of methods for detecting recombination from DNA sequences: computer simulations.

Authors: D Posada; K A Crandall
Journal: Proc Natl Acad Sci U S A Date: 2001-11-20 Impact factor: 11.205

2. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites.

Authors: Maria Anisimova; Rasmus Nielsen; Ziheng Yang
Journal: Genetics Date: 2003-07 Impact factor: 4.562

3. ProtASR: An Evolutionary Framework for Ancestral Protein Reconstruction with Selection on Folding Stability.

Authors: Miguel Arenas; Claudia C Weber; David A Liberles; Ugo Bastolla
Journal: Syst Biol Date: 2017-11-01 Impact factor: 15.683

4. A codon-based model of nucleotide substitution for protein-coding DNA sequences.

Authors: N Goldman; Z Yang
Journal: Mol Biol Evol Date: 1994-09 Impact factor: 16.240

Review 5. Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission.

Authors: Rachel L Graham; Ralph S Baric
Journal: J Virol Date: 2009-11-11 Impact factor: 5.103

6. The fine-scale structure of recombination rate variation in the human genome.

Authors: Gilean A T McVean; Simon R Myers; Sarah Hunt; Panos Deloukas; David R Bentley; Peter Donnelly
Journal: Science Date: 2004-04-23 Impact factor: 47.728

7. The SARS-coronavirus nsp7+nsp8 complex is a unique multimeric RNA polymerase capable of both de novo initiation and primer extension.

Authors: Aartjan J W te Velthuis; Sjoerd H E van den Worm; Eric J Snijder
Journal: Nucleic Acids Res Date: 2011-10-29 Impact factor: 16.971

8. Biochemical characterization of recombinant Avihepatovirus 3C protease and its localization.

Authors: Di Sun; Mingshu Wang; Xingjian Wen; Sai Mao; Anchun Cheng; Renyong Jia; Qiao Yang; Ying Wu; Dekang Zhu; Shun Chen; Mafeng Liu; Xinxin Zhao; Shaqiu Zhang; Xiaoyue Chen; Yunya Liu; Yanling Yu; Ling Zhang
Journal: Virol J Date: 2019-04-29 Impact factor: 4.099

9. Evolutionary dynamics of the human pseudoautosomal regions.

Authors: Bruno Monteiro; Miguel Arenas; Maria João Prata; António Amorim
Journal: PLoS Genet Date: 2021-04-19 Impact factor: 5.917