
Inverse Potts model improves accuracy of phylogenetic profiling.

Tsukasa Fukunaga^1,2, Wataru Iwasaki^3,4,5,6,7,8.

Abstract

MOTIVATION: Phylogenetic profiling is a powerful computational method for revealing the functions of function-unknown genes. Although conventional similarity metrics in phylogenetic profiling achieved high prediction accuracy, they have two estimation biases: an evolutionary bias and a spurious correlation bias. While previous studies reduced the evolutionary bias by considering a phylogenetic tree, few studies have analyzed the spurious correlation bias.
RESULTS: To reduce the spurious correlation bias, we developed metrics based on the inverse Potts model (IPM) for phylogenetic profiling. We also developed a metric based on both the IPM and a phylogenetic tree. In an empirical dataset analysis, we demonstrated that these IPM-based metrics improved the prediction performance of phylogenetic profiling. In addition, we found that the integration of several metrics, including the IPM-based metrics, had superior performance to a single metric.
AVAILABILITY: The source code is freely available at https://github.com/fukunagatsu/Ipm.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2022 PMID: 35060594 PMCID: PMC8963296 DOI: 10.1093/bioinformatics/btac034

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Genome sequences of many species have been determined, and accordingly, many function-unknown genes have been discovered. Revealing the functions of these function-unknown genes is an important research topic, but it is too time-consuming to experimentally verify the functions of all the genes. Therefore, the computational predictions of these gene functions are essential, and various methods have long been developed in bioinformatics. Phylogenetic profiling is one such analysis method. In this method, when two ortholog groups (OGs) have similar occurrence patterns among species in a table of OGs, the two OGs are presumed to be functionally related (Kensche ; Moi ; Niu ; Pellegrini ; Stupp ; Tremblay ; Tsaban ). Although phylogenetic profiling was first proposed to detect protein–protein interactions, this method in principle captures any functional relationships between genes. Phylogenetic profiling has been widely used to estimate the functions of function-unknown genes in various phylogenetic groups from prokaryotes to eukaryotes (Kumagai ; Sherill-Rofe ). In conventional phylogenetic profiling, similarities in occurrence patterns between two OGs are directly calculated from a table of OGs. This direct calculation implicitly assumes that the species included in the table of OGs are independent of each other. This assumption is, however, incorrect because the species have evolutionary relationships. In other words, the conventional calculation of similarity introduces an evolutionary bias in the estimation. Therefore, methods that consider a phylogenetic tree were proposed and showed good performance (Barker ; Cohen ; Moi ; Ta ; Vert, 2002). Another possible estimation bias is the spurious correlation bias between two OGs. In statistics, spurious correlation means that two unrelated (or weakly related) variables appear to be strongly related due to the influence of confounding factors. 
As a simple example, suppose there are functional relationships between OGs A and B and between OGs A and C, but no (or only a weak) functional relationship between OGs B and C. In this case, OGs B and C can show similar occurrence patterns through OG A, which acts as a confounding factor. In real cases, transitive correlations among many genes and evolutionary relationships between species result in complex patterns of spurious correlations. Ignoring the possibility of spurious correlations should negatively influence the accuracy of the function predictions, but few studies have analyzed the spurious correlation bias. Kim and Price considered the spurious correlation bias in phylogenetic profiling and showed that the bias could be reduced using partial correlation based on a Gaussian graphical model (Kim and Price, 2011). However, they did not explicitly deal with the evolutionary bias, and they implicitly assumed that tables of OGs follow a Gaussian distribution, an assumption that does not hold for categorical data. Metrics commonly used for phylogenetic profiling are mutual information (MI), correlation coefficients and Jaccard coefficients. These metrics are local metrics calculated from only two OG profiles, and this locality causes spurious correlations whose confounding factors are the other OGs. Therefore, we can reduce spurious correlation biases by using global metrics calculated from all OG profiles. The inverse Potts model (IPM), also called direct coupling analysis or evolutionary coupling (Cocco ), is an analysis method for categorical datasets to calculate global metrics. The IPM has been applied to various biological data analyses, such as protein–protein interaction prediction (Cong ; Weigt ), protein structure prediction (Marks ; Muscat ), neural data analysis (Schneidman ; Watanabe ) and genome-wide association studies (Schubert ; Skwark ), and has improved prediction performance.
Recently, Croce identified physically interacting protein domain pairs by applying the IPM to tabular data whose rows and columns are species and protein domains. They revealed that the IPM could detect interacting domain pairs with higher accuracy than simple correlation coefficients. Their study was similar to phylogenetic profiling, but their goal was to predict domain-domain interactions and not to estimate gene functional associations. In this study, we applied the IPM to phylogenetic profiling to accurately predict gene functions. We used direct information (DI) calculated based on the IPM as the global metric. We also developed DI that considers phylogenetic tree information to explicitly deal with the evolutionary bias. We investigated the performance of several metrics in phylogenetic profiling, and verified that the IPM-based metrics improved the accuracy of predicting gene functions. In addition, we found that the integration of several metrics, including the IPM-based metrics, has superior performance to a single metric.

2 Materials and methods

2.1 Input data

Two settings were assumed in our study: standard and evolutionary settings. Under the standard setting, the input data for our method is a table of OGs D, which consists of N species and L OGs. $D_{ij}$ represents whether species i has OG j and takes either 0 or 1. Under the evolutionary setting, the input data for our method is a table of OG gain/losses D, which consists of N phylogenetic tree branches and L OGs. Given a phylogenetic tree and a table of OGs, gene-content evolutionary history is reconstructed to infer gene gain/losses on each branch of the tree. $D_{ij}$ represents whether a gain/loss event of OG j occurred at branch i. The value takes 0, 1 or 2, indicating that there are no gene gain/loss events, gene gain events or gene loss events, respectively. For the experiments in this study, we used three empirical datasets: archaea (domain), micrococcales (order) and fungi (kingdom) (Fukunaga and Iwasaki, 2021). The tables of OGs were prepared by preprocessing OG data in the STRING database (Szklarczyk ). We ignored gene copy number information and removed OGs that were shared by fewer than 10% or more than 90% of the species to reduce the computational time to prepare D. The proportions of remaining OGs were 24.7%, 20.0% and 16.8% in the archaea, micrococcales and fungi datasets, respectively, because the datasets contained many OGs with few genes. Because the computational time of our method is proportional to the square of the number of OGs, this filtering reduced the computational time by 95%. The removed OGs were expected not to have significant impacts on the results because of their low information content. The archaea, micrococcales and fungi datasets consisted of 151 species and 2875 OGs, 111 species and 1905 OGs, and 123 species and 5786 OGs, respectively. Under the evolutionary setting, we prepared D by reconstructing the gene-content evolutionary history for the three empirical datasets.
We used Mirage (Fukunaga and Iwasaki, 2021) with the BDARD model (Kim and Hao, 2014) and the PM model (default parameters were used for the others). Phylogenetic trees were supplied by the Genome Taxonomy Database release 89 (Parks ) for the archaea and micrococcales datasets and the SILVA database release 111 (Yarza ; Yilmaz ) for the fungi dataset.
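The preprocessing under the standard setting can be illustrated with a minimal Python sketch. The function name `binarize_and_filter`, the dictionary-based input format and the handling of OGs lying exactly on the 10% or 90% boundary are assumptions made for illustration only; they are not taken from the published implementation.

```python
from typing import Dict, List


def binarize_and_filter(copy_numbers: Dict[str, List[int]],
                        min_frac: float = 0.10,
                        max_frac: float = 0.90) -> Dict[str, List[int]]:
    """Turn per-species gene copy numbers into 0/1 profiles and drop
    OGs shared by fewer than min_frac or more than max_frac of species."""
    kept: Dict[str, List[int]] = {}
    for og, counts in copy_numbers.items():
        # Ignore copy number information: only presence (1) or absence (0).
        profile = [1 if c > 0 else 0 for c in counts]
        frac = sum(profile) / len(profile)
        # Assumption: OGs exactly at the boundaries are kept.
        if min_frac <= frac <= max_frac:
            kept[og] = profile
    return kept


# Hypothetical toy table with 20 species.
toy = {
    "OG_rare": [1] + [0] * 19,      # present in 5% of species
    "OG_core": [1] * 19 + [0],      # present in 95% of species
    "OG_mid": [2, 0, 1, 0] * 5,     # present in 50% of species
}
result = binarize_and_filter(toy)
```

In this toy table only `OG_mid` survives, because the other two OGs fall outside the 10% to 90% prevalence window.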

2.2 The IPM

We introduce $\mathrm{MI}_{ab}$, which is the MI between OG a and OG b. The formula is as follows:

$$\mathrm{MI}_{ab} = \sum_{i=0}^{Q} \sum_{j=0}^{Q} f_{ab}(i,j) \log \frac{f_{ab}(i,j)}{f_a(i)\, f_b(j)}$$

where $f_a(i)$ and $f_{ab}(i,j)$ are the relative frequencies of OG a taking i and of OG a and OG b taking i and j, respectively, in the dataset D. Q is the maximum value that an OG can take (i.e. Q = 1 under the standard setting and Q = 2 under the evolutionary setting). The more OGs a and b depend on each other, the larger $\mathrm{MI}_{ab}$ is. If $\mathrm{MI}_{ab}$ is 0, OGs a and b are completely independent. Note that $\mathrm{MI}_{ab}$ can detect not only gene pairs with similar occurrence patterns but also those with anti-correlated relationships (i.e. if a genome contains one of the genes, it is unlikely to contain the other). Several previous studies showed that anti-correlated relationships also provide clues to the functions of function-unknown genes (Croce ; Kim and Price, 2011; Morett ). We defined standard MI (SMI) and EMI as MI calculated under the standard and evolutionary settings, respectively. $\mathrm{MI}_{ab}$ is a local metric calculated from only two OG profiles and is vulnerable to spurious correlations. Therefore, we calculated a global metric using all OG profiles based on the IPM. We first formulate the joint probabilities of all OGs as follows (Cocco ):

$$P(x_1, \ldots, x_L) = \frac{1}{Z} \exp\left( \sum_{a=1}^{L} h_a(x_a) + \sum_{1 \le a < b \le L} J_{ab}(x_a, x_b) \right), \qquad (x_1, \ldots, x_L) \in \Omega$$

$P(x_1, \ldots, x_L)$ is the joint probability that OG a takes $x_a$ for every a. $h_a(x_a)$ is a weight parameter when OG a is $x_a$, and $J_{ab}(x_a, x_b)$ is also a weight parameter when OG a is $x_a$ and OG b is $x_b$. Ω is the set of all possible combinations that all OGs can take, and Z is a normalizing constant, which is called the partition function. This probabilistic model is obtained by deriving a model that maximizes entropy under the following constraints: $P_a(i) = f_a(i)$ for all a and i, and $P_{ab}(i,j) = f_{ab}(i,j)$ for all a, b, i and j. $P_a(i)$ and $P_{ab}(i,j)$ are the marginal probabilities of $P(x_1, \ldots, x_L)$ and represent the probabilities of OG a taking i and of OG a and OG b taking i and j, respectively. This model is generally called the Potts model in statistical physics (when Q = 1, this model is specifically called the Ising model). Note that this model is also a particular form of the Boltzmann machine or Markov random field.
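The MI between two OG profiles, as defined at the start of this section, can be computed directly from the relative frequencies. The following Python sketch is only an illustration: the function name `mutual_information` is hypothetical, the natural logarithm is assumed, and terms with zero joint frequency are skipped under the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information(profile_a: Sequence[int], profile_b: Sequence[int]) -> float:
    """MI between two categorical profiles of equal length (natural log)."""
    n = len(profile_a)
    if n == 0 or n != len(profile_b):
        raise ValueError("profiles must be non-empty and of equal length")
    count_a = Counter(profile_a)                    # single-OG counts
    count_b = Counter(profile_b)
    count_ab = Counter(zip(profile_a, profile_b))   # joint counts
    mi = 0.0
    # Only state pairs that occur contribute; zero-frequency terms are 0.
    for (i, j), c in count_ab.items():
        f_ab = c / n
        mi += f_ab * math.log(f_ab / ((count_a[i] / n) * (count_b[j] / n)))
    return mi


same = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])      # identical patterns
anti = mutual_information([0, 1, 0, 1], [1, 0, 1, 0])      # anti-correlated
indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])     # independent
```

Identical and perfectly anti-correlated profiles give the same MI, which illustrates why MI can detect both kinds of relationship, whereas the independent pair gives 0.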
In the derivation of the Potts model, the number of substantial constraints is $LQ + \frac{L(L-1)}{2}Q^2$ because $\sum_{i} P_a(i) = 1$ and $\sum_{j} P_{ab}(i,j) = P_a(i)$ must be satisfied. On the other hand, the number of parameters in the model is $L(Q+1) + \frac{L(L-1)}{2}(Q+1)^2$, which is larger than the number of substantial constraints. This over-parameterization leads to the non-identification of the model. Therefore, it is necessary to introduce additional constraints on the parameters to reduce the degrees of freedom of the model. In this study, we used the following constraints, called the lattice-gas gauge, for ease of implementation (Cocco ): all parameters $h_a(i)$ and $J_{ab}(i,j)$ that involve one fixed reference state are set to 0. To calculate the parameters $h_a(i)$ and $J_{ab}(i,j)$ analytically, we need to enumerate all the combinations in Ω. However, the computational cost becomes prohibitive when L is large because the number of combinations, $(Q+1)^L$, grows exponentially with L. Therefore, these parameters are learned from the dataset in an unsupervised manner (Section 2.3). Then, using the estimated parameters, the dependence between OG a and OG b is measured as $\mathrm{DI}_{ab}$ as follows (Weigt ): This definition is slightly different from the original definition (Weigt ). In the original DI calculation, was used instead of , and was re-calculated from . Similar to $\mathrm{MI}_{ab}$, the more OGs a and b depend on each other, the larger $\mathrm{DI}_{ab}$ is. Note that $\mathrm{DI}_{ab}$ can also detect anti-correlated relationships. We defined standard DI (SDI) and EDI as the DI calculated under the standard and evolutionary settings, respectively. In addition to DI, Frobenius norm (FN) and average product correction (APC) are widely used metrics to quantify dependencies between two elements in the IPM. These metrics are gauge-dependent quantities, and the best gauge is the zero-sum gauge (Ekeberg ). On the other hand, DI has gauge-independent characteristics (Ekeberg ). Because we used the lattice-gas gauge in this study, we used DI instead of FN and APC for the metrics.
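As a rough illustration of how a DI-like score can be obtained from estimated parameters, the sketch below assumes the common direct-coupling form in which a two-OG distribution is built from the coupling and the two fields, $P^{dir}_{ab}(i,j) \propto \exp(J_{ab}(i,j) + h_a(i) + h_b(j))$, and the score is the MI of that two-OG distribution. The function name `direct_information` is hypothetical, and the choice of reference distribution inside the logarithm (the product of the marginals of $P^{dir}_{ab}$) is an assumption of this sketch that may differ from the definition used in this study.

```python
import math
from typing import List


def direct_information(J: List[List[float]],
                       h_a: List[float],
                       h_b: List[float]) -> float:
    """MI of the two-site model P_dir(i, j) ~ exp(J[i][j] + h_a[i] + h_b[j])."""
    n_a, n_b = len(h_a), len(h_b)
    weights = [[math.exp(J[i][j] + h_a[i] + h_b[j]) for j in range(n_b)]
               for i in range(n_a)]
    z = sum(sum(row) for row in weights)           # two-site partition function
    p = [[w / z for w in row] for row in weights]  # normalised P_dir
    p_a = [sum(row) for row in p]                  # marginal over OG b
    p_b = [sum(p[i][j] for i in range(n_a)) for j in range(n_b)]
    return sum(p[i][j] * math.log(p[i][j] / (p_a[i] * p_b[j]))
               for i in range(n_a) for j in range(n_b))


# Assumption: state 0 is the reference state, so its parameters are 0.
no_coupling = direct_information([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.3], [0.0, -0.5])
positive = direct_information([[0.0, 0.0], [0.0, 2.0]], [0.0, 0.0], [0.0, 0.0])
negative = direct_information([[0.0, 0.0], [0.0, -2.0]], [0.0, 0.0], [0.0, 0.0])
```

With no coupling the score vanishes regardless of the fields, while both positive and negative couplings give a positive score, in line with the statement that DI can also detect anti-correlated relationships.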

2.3 Parameter estimation method

To date, various algorithms have been developed to estimate the parameters of the Potts model, for example, mean-field approximation (Morcos ), pseudo-likelihood maximization (Ekeberg ), adaptive cluster expansion (Barton ) and Markov chain Monte Carlo (MCMC) methods (Figliuzzi ). There is an approximate trade-off between the computational speed and estimation accuracy in these methods, that is, more accurate methods require longer run times. In this study, we focused on the estimation accuracy, and used the persistent contrastive divergence (PCD) method (Hinton, 2002; Tieleman, 2008), which is a variant of the MCMC method. We maximized the likelihood with the L2-regularization term to avoid overfitting the data in the PCD method. The algorithm for the PCD method is as follows. We first randomly sampled K samples with replacement from the dataset D, and let the initial sampled dataset be D0. In this study, we set K to 200. In addition, we set all the initial parameters to 0. Next, we obtained the dataset D1 from D0 and the initial parameters based on the following Gibbs sampler: This sampling was performed LK times to obtain D1. Then, we calculated the relative frequencies of OG a taking i and of OG a and OG b taking i and j in the dataset D1. Subsequently, the model parameters were updated using the following formula: and are the L2-regularization terms, and we used one of 0, 0.01, 0.05, 0.1, 0.5, 1.0 or 5.0 as λ. Note that λ = 0 indicates simple likelihood maximization without the regularization terms. ϵ represents the learning rate, and we set ϵ to either 0.01 or 0.001. After this parameter update, we sampled dataset D2 from D1 using the updated parameters. We adopted the parameters obtained after repeating the Gibbs sampling and the parameter update 3000 times.
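The PCD procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the published implementation: the names `frequencies` and `pcd_fit` are hypothetical; the Gibbs conditional is the one implied by the Potts model, $P(x_a = i \mid \text{rest}) \propto \exp(h_a(i) + \sum_{b \neq a} J_{ab}(i, x_b))$; the update is assumed to be a gradient step of the form (data frequency minus sampled frequency minus λ times the parameter) scaled by the learning rate; state 0 is assumed to be the reference state of the lattice-gas gauge; and the defaults (20 chains, 200 iterations) are far smaller than the K = 200 and 3000 repetitions used in this study.

```python
import math
import random
from typing import List, Sequence


def frequencies(samples: Sequence[Sequence[int]], n_states: int):
    """Relative single (f1[a][i]) and pairwise (f2[a][b][i][j], a < b) frequencies."""
    n, L = len(samples), len(samples[0])
    f1 = [[0.0] * n_states for _ in range(L)]
    f2 = [[[[0.0] * n_states for _ in range(n_states)] for _ in range(L)]
          for _ in range(L)]
    for x in samples:
        for a in range(L):
            f1[a][x[a]] += 1.0 / n
            for b in range(a + 1, L):
                f2[a][b][x[a]][x[b]] += 1.0 / n
    return f1, f2


def pcd_fit(data: List[List[int]], n_states: int, n_chains: int = 20,
            n_iter: int = 200, lr: float = 0.05, lam: float = 0.01, seed: int = 0):
    """Fit fields h[a][i] and couplings J[a][b][i][j] by persistent contrastive divergence."""
    rng = random.Random(seed)
    L = len(data[0])
    h = [[0.0] * n_states for _ in range(L)]
    # J is kept symmetric: J[a][b][i][j] == J[b][a][j][i].
    J = [[[[0.0] * n_states for _ in range(n_states)] for _ in range(L)]
         for _ in range(L)]
    f1, f2 = frequencies(data, n_states)
    # Persistent chains, initialised by sampling rows of the data with replacement.
    chains = [list(rng.choice(data)) for _ in range(n_chains)]
    for _ in range(n_iter):
        # Gibbs sampling: L * n_chains single-site updates on the persistent chains.
        for _ in range(L * n_chains):
            x = rng.choice(chains)
            a = rng.randrange(L)
            weights = [math.exp(h[a][i] + sum(J[a][b][i][x[b]]
                                              for b in range(L) if b != a))
                       for i in range(n_states)]
            x[a] = rng.choices(range(n_states), weights=weights)[0]
        m1, m2 = frequencies(chains, n_states)
        # Regularised gradient step; parameters involving state 0 stay at 0.
        for a in range(L):
            for i in range(1, n_states):
                h[a][i] += lr * (f1[a][i] - m1[a][i] - lam * h[a][i])
                for b in range(a + 1, L):
                    for j in range(1, n_states):
                        grad = f2[a][b][i][j] - m2[a][b][i][j] - lam * J[a][b][i][j]
                        J[a][b][i][j] += lr * grad
                        J[b][a][j][i] = J[a][b][i][j]
    return h, J


# Hypothetical toy data: OGs 0 and 1 always co-occur, OG 2 is independent of both.
toy = [[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]] * 5
h, J = pcd_fit(toy, n_states=2)
```

On this toy dataset the learned coupling between the two co-occurring OGs ends up larger than the coupling between either of them and the independent OG.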

2.4 Evaluation method

We assessed the prediction performance of each metric using association scores between two OGs provided in the STRING database (Szklarczyk ). The association scores in the STRING database were calculated by considering gene neighborhood conservation, gene fusion, co-expression, protein interaction experiments, other databases, text mining and occurrence patterns. Because occurrence patterns should not be used in the assessment, we recalculated the association scores by ignoring the occurrence pattern similarities. If the recalculated association score of an OG pair was larger than the threshold th, we regarded the OG pair as positive data; otherwise, we regarded it as negative data. We used the threshold th from 0.5 to 0.9 in 0.1 increments. The sizes of each dataset are listed in Supplementary Table S1. Note that the association scores of 0.7 and 0.9 are the lower limits of high and highest confidences, respectively, in the STRING database. We first investigated the overall discrimination performance of each metric using the area under the receiver operating characteristic curve (AUC) scores. The AUC scores were calculated using the pROC R package (Robin ). We also assessed the prediction accuracy of the OG pairs that were highly ranked by each metric. Specifically, we defined the highly ranked OG pairs as the top M OG pairs in each metric, and calculated the positive predictive values (PPVs) of these pairs (at th = 0.7). We used 100, 500, 1000, 5000 or 10 000 as M. In addition, we evaluated AUPR scores using the PRROC package for the analysis of highly ranked OG pairs (Grau ).
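The PPV of the top M OG pairs can be written compactly. In the Python sketch below, the function name `ppv_at_top_m` and the dictionary-based inputs are illustrative assumptions; following the text, a pair counts as positive when its recalculated association score is strictly larger than th, and pairs without an association score are assumed here to have a score of 0.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def ppv_at_top_m(metric_scores: Dict[Pair, float],
                 association_scores: Dict[Pair, float],
                 m: int,
                 th: float = 0.7) -> float:
    """Fraction of the top-m pairs (by metric score) that are positive."""
    ranked = sorted(metric_scores, key=metric_scores.get, reverse=True)[:m]
    if not ranked:
        raise ValueError("no OG pairs to evaluate")
    # Assumption: pairs missing from association_scores are treated as 0.0.
    positives = sum(1 for pair in ranked
                    if association_scores.get(pair, 0.0) > th)
    return positives / len(ranked)


# Hypothetical toy scores for four OG pairs.
metric = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.3, ("C", "D"): 0.1}
string = {("A", "B"): 0.95, ("A", "C"): 0.40, ("B", "C"): 0.85}
top2 = ppv_at_top_m(metric, string, m=2)
top3 = ppv_at_top_m(metric, string, m=3)
```

Here one of the top two pairs and two of the top three pairs exceed th = 0.7.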

3 Results

3.1 Performances of single metrics

We first assessed the overall discrimination performance of the four metrics (SMI, EMI, SDI and EDI) based on the AUC scores. We investigated 14 combinations of seven λ values and two ϵ values as IPM hyperparameters for calculating the SDI and EDI. In the following analyses, we used the hyperparameters showing the best AUC score for each dataset and each th value. The AUC scores are listed in Supplementary Tables S2–S7. Both hyperparameters had a large impact on the prediction performance. In addition, the optimal hyperparameters differed depending on the dataset and the th value. We also found that the optimal hyperparameter λ was not 0.0 in many cases. This result means that L2-regularization was effective for achieving high discrimination performance. We checked the distribution of each metric after normalizing the maximum value to 1.0, and calculated the skewness (Supplementary Figs S1 and S2). We found that the distribution was skewed to the right in all cases, that is, only a portion of OG pairs obtained high scores in each metric. In addition, we discovered that the consideration of both gene-content evolutionary history and usage of the IPM increases the skewness of the distribution. These results suggest that the biases in SMI were reduced by the reconstruction of the gene content history and the IPM method. Figure 1A–C shows the results of the AUC analyses. We found that EMI outperformed SMI in all cases, which suggests that gene content history reconstruction is highly effective in phylogenetic profiling, which is consistent with previous studies (Barker ; Moi ; Ta ). SDI was always better than SMI, except for one case where similar performances were obtained (th = 0.9 in the micrococcales dataset). These results also suggest that the IPM is valuable for reducing biases containing spurious correlation and evolutionary biases. 
EDI showed the best performance in the archaea and micrococcales datasets, except for the same case where EMI and EDI showed comparable performances. On the other hand, SDI showed the best performance in the fungi dataset. A cause of the worse performance of EDI in the fungi dataset may be insufficient gene annotation. Although the recalculated STRING scores used gene neighborhood conservation and gene fusion, they are not effective in estimating eukaryotic protein functional relationships. We found that the proportion of positive data was much lower for the fungi dataset than for the other datasets (Supplementary Table S1). This suggests that many functionally related OG pairs were not annotated with high association scores in the fungi dataset.

Fig. 1.

(A–C) Overall discrimination performances of each metric using the AUC scores. The x-axis represents the th value, which defines the positive dataset. The y-axis represents the AUROC score. (A), (B) and (C) panels represent results for the archaea, micrococcales and fungi datasets, respectively. (D–F) Prediction performances for highly ranked OG pairs of each metric (th = 0.7). The x-axis represents the M value. The y-axis represents the PPV. (D), (E) and (F) panels represent results for the archaea, micrococcales and fungi datasets, respectively. The yellow, blue, green and red colors represent SMI, EMI, SDI and EDI, respectively.

We next investigated the prediction accuracies of highly ranked (top M) OG pairs for each metric (Fig. 1D–F). In almost all cases, SMI exhibited the worst or near-worst performance. On the other hand, the best-performing metrics depended on the datasets and M. For example, when M was 1000, SDI, EMI and EDI showed the highest PPVs for the archaea, micrococcales and fungi datasets, respectively. We confirmed that AUPR scores, on which the top-scored predictions have large effects, showed a tendency similar to that of the PPV scores (Supplementary Fig. S3). Thus, the reconstruction of gene content history and the IPM method generally increase performance, although whether EMI, SDI or EDI performs the best depends on the case.

3.2 Performances of integrated metrics

Because which of EMI, SDI and EDI performed best for highly ranked OG pairs depended on the conditions, we next investigated whether their integration showed better performance. There are four combination types for the integration: EMI and SDI, EMI and EDI, SDI and EDI, and all three metrics. For the integration, we first ranked the OG pairs in descending order of their scores for each of EMI, SDI and EDI. Then, for each combination, we re-sorted the OG pairs by one of three integration types: the maximum, average or minimum of their ranks across the metrics under consideration. We investigated the AUC, PPV and AUPR performances of 12 integrated metrics comprising four combination types and three integration types (Supplementary Tables S8–S16). We found that the best condition for the integrated metrics depends on the dataset and the threshold (th or M). As a general trend, while the integration by the minimum values showed the highest scores in the AUC analyses, the integration by the average values achieved the highest scores in the PPV and AUPR analyses. In addition, we found that the best integrated metrics performed better than the best single metrics in many cases (Fig. 2 and Supplementary Fig. S4). These results strongly suggest that while EMI, SDI and EDI are good metrics, each also loses useful information for functional estimation in its own way, which could be salvaged by their integration.
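The rank-based integration can be sketched in a few lines of Python. The function name `integrate_ranks` and the input format are hypothetical, rank 1 is assigned to the highest score, and ties are broken by sort order, which is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def integrate_ranks(metric_scores: Dict[str, Dict[Pair, float]],
                    how: str = "average") -> List[Tuple[Pair, float]]:
    """Combine per-metric ranks (1 = highest score) by max, average or min."""
    aggregators = {
        "maximum": max,
        "average": lambda ranks: sum(ranks) / len(ranks),
        "minimum": min,
    }
    aggregate = aggregators[how]
    per_metric_ranks: Dict[str, Dict[Pair, int]] = {}
    for name, scores in metric_scores.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        per_metric_ranks[name] = {pair: r for r, pair in enumerate(ordered, start=1)}
    pairs = next(iter(metric_scores.values())).keys()
    combined = {pair: aggregate([ranks[pair] for ranks in per_metric_ranks.values()])
                for pair in pairs}
    # A smaller combined rank means a stronger integrated prediction.
    return sorted(combined.items(), key=lambda item: item[1])


# Hypothetical toy scores for three OG pairs under three metrics.
p, q, r = ("A", "B"), ("A", "C"), ("B", "C")
scores = {
    "EMI": {p: 0.9, q: 0.5, r: 0.1},
    "SDI": {p: 0.4, q: 0.8, r: 0.2},
    "EDI": {p: 0.7, q: 0.1, r: 0.3},
}
by_average = integrate_ranks(scores, "average")
by_maximum = integrate_ranks(scores, "maximum")
```

With the average integration type, the combined value is the mean rank across the metrics, which is how a non-integer prediction score such as 14.7 can arise when three metrics are combined.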

Fig. 2.

(A–C) Overall discrimination performances of integrated metrics using the AUC scores. The x-axis represents the th value, which defines the positive dataset. The y-axis represents the AUROC score. (A), (B) and (C) panels represent results for the archaea, micrococcales and fungi datasets, respectively. (D–F) Prediction performances for highly ranked OG pairs of integrated metrics (th = 0.7). The x-axis represents the M value. The y-axis represents the PPV. (D), (E) and (F) panels represent results for the archaea, micrococcales and fungi datasets, respectively. The gray and black colors represent the highest single metric and integrated metric, respectively.

3.3 Examples of the detected OG pairs

Finally, as examples of the highly ranked OG pairs, we show lists of the top five ranked OG pairs by the integration of all three metrics (Table 1). We used the average value as the integration type and regarded the value as the prediction score. Except for two cases, these OG pairs had recalculated STRING association scores above 0.9, which means that functional associations had the highest confidence. Most of these gene pairs had known functional relationships. For example, the first rank in the archaea dataset was a pair of ZnuA and ZnuB, which are components of the ABC-type zinc uptake system. As another example, the fifth rank in the micrococcales dataset was a pair of DnaC, which is involved in DNA replication, and COG4584, a transposase.

Table 1.

The lists of the top five OG pairs detected by the combination of all three metrics

Taxonomy	Rank	OG1	OG2	Prediction score	STRING score
Archaea	1	COG0803 (ZnuA)	COG1108 (ZnuB)	10.0	0.992
Archaea	2	COG1203 (Cas3)	COG1688 (Cas5)	14.7	0.996
Archaea	3	COG1108 (ZnuB)	COG1121 (ZnuC)	17.3	0.994
Archaea	4	COG2998 (TupA)	COG4662 (TupA)	21.3	0.999
Archaea	5	COG1336 (Cmr4)	COG1604 (Cmr6)	24.0	0.999
Micrococcales	1	COG3181 (TctC)	COG3333 (TctA)	1.7	0.989
Micrococcales	2	COG1135 (AbcC)	COG2011 (MetP)	7.0	0.995
Micrococcales	3	COG1464 (NlpA)	COG2011 (MetP)	10.7	0.996
Micrococcales	4	COG1135 (AbcC)	COG1464 (NlpA)	12.3	0.987
Micrococcales	5	COG1484 (DnaC)	COG4584	12.7	0.986
Fungi	1	COG0043 (UbiD)	COG0163 (UbiX)	34.3	0.998
Fungi	2	KOG4501	NOG13474	143.3	0.0
Fungi	3	COG5441	COG5564	620.3	0.988
Fungi	4	COG0843 (CyoB)	COG1290 (QcrB)	682.3	0.969
Fungi	5	COG2051 (RPS27A)	KOG3504	774.7	0.0

The first exceptional pair, with a recalculated STRING score of 0.0, was KOG4501 and NOG13474, which was ranked second in the fungi dataset. We further investigated the relationship between these two genes and found that they showed an anti-correlated relationship. An anti-correlated relationship is also a clue for gene-function estimation, as explained earlier, and it should be noted that the recalculated STRING scores, which are based on gene neighborhood conservation, gene fusion, co-expression, protein-interaction experiments, other databases and text mining, cannot detect signals of anti-correlated relationships. While the human gene belonging to KOG4501 has a known function in DNA damage repair (Brickner ), NOG13474 is a function-unknown gene. We argue that NOG13474 may have a DNA damage repair function as a complement of KOG4501. In addition, the second exceptional pair was COG2051 and KOG3504, which was ranked fifth in the fungi dataset. Because both these OGs are ribosomal proteins, the recalculated STRING score of 0.0 may reflect insufficient annotation.

4 Discussion

In this study, we evaluated the effectiveness of IPM in the phylogenetic profiling analysis. We constructed four metrics, SMI, EMI, SDI and EDI, based on whether a phylogenetic tree and the IPM were used. We then investigated the performance of the four metrics using the STRING datasets. We showed that SDI and EDI had the best performances in many cases. In addition, we revealed that predictions based on the combinations of EMI, SDI and EDI showed higher performance than predictions based on a single metric. These results demonstrated that the IPM is a powerful approach in phylogenetic profiling. Although even simple combinations of the metrics yielded good prediction results, more sophisticated methods of combining the metrics may provide better prediction results, for example, machine learning methods. A similar concept was proposed in studies on protein structure prediction based on IPM (Jones ; Wang ). These studies integrated various scores, such as co-evolutionary information using IPM, and predicted solvent accessibility information using supervised machine learning methods, such as deep learning. Theoretically, phylogenetic profiling methods detect any functional relationships regardless of whether they are physical or functional interactions. Thus, to discriminate types of identified relationships, other bioinformatic approaches need to be additionally employed. For example, by taking advantage of the recent breakthroughs of the AlphaFold2 (Jumper ) and AlphaFold-Multimer tools (Evans ), phylogenetic profiling will be used to specifically identify physically interacting protein pairs. We envision combining our method with the accurate protein structure prediction methods in the near future. We assumed that the input phylogenetic tree and gene content evolutionary history were correct when calculating EMI and EDI. However, they were estimations and intrinsically subject to uncertainty. 
Such uncertainty should decrease the accuracy of phylogenetic profiling analysis in general (Hamada, 2014). One solution is to consider the distribution of the estimates by calculating the expected values (instead of counts) of gene gains and losses for each phylogenetic branch. Cohen , 2013) adopted this approach, but a comparison with other methods has not been conducted and further studies are required. Because this extension requires the use of continuous data, the Gaussian graphical model will need to be used for considering spurious correlations, instead of the Potts model for categorical data (Stein ). In this study, we analyzed only the relationships between two OGs; however, many OGs have higher-order functional relationships among three or more OGs (such as multi-protein complexes). Several studies have focused on the logic relationships of three OGs in phylogenetic profiling (Bowers ; Fukunaga and Iwasaki, 2020; Zhang ). An example of a logic relationship is for OGs A, B and C, which means that OG C needs both OGs A and B for its function. To date, logic relationship analysis in phylogenetic profiling has used local metrics; thus, the detection of such higher-order functional relationships based on global metrics is an essential future task. Technically, it is not difficult to extend the Potts model to include ternary or higher-order relationships (Schmidt and Hamacher, 2017), but efficient parameter estimation and construction of large-scale datasets for precise parameter estimation will be difficult.

47 in total

1. Identification of direct residue contacts in protein-protein interaction by message passing.

Authors: Martin Weigt; Robert A White; Hendrik Szurmant; James A Hoch; Terence Hwa
Journal: Proc Natl Acad Sci U S A Date: 2008-12-30 Impact factor: 11.205

2. Fighting against uncertainty: an essential issue in bioinformatics.

Authors: Michiaki Hamada
Journal: Brief Bioinform Date: 2013-06-26 Impact factor: 11.622

3. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life.

Authors: Donovan H Parks; Maria Chuvochina; David W Waite; Christian Rinke; Adam Skarshewski; Pierre-Alain Chaumeil; Philip Hugenholtz
Journal: Nat Biotechnol Date: 2018-08-27 Impact factor: 54.908

4. Mapping global and local coevolution across 600 species to identify novel homologous recombination repair genes.

Authors: Dana Sherill-Rofe; Dolev Rahat; Steven Findlay; Anna Mellul; Irene Guberman; Maya Braun; Idit Bloch; Alon Lalezari; Arash Samiei; Ruslan Sadreyev; Michal Goldberg; Alexandre Orthwein; Aviad Zick; Yuval Tabach
Journal: Genome Res Date: 2019-02-04 Impact factor: 9.043

5. How Pairwise Coevolutionary Models Capture the Collective Residue Variability in Proteins?

Authors: Matteo Figliuzzi; Pierre Barrat-Charlaix; Martin Weigt
Journal: Mol Biol Evol Date: 2018-04-01 Impact factor: 16.240

6. Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models.

Authors: Richard R Stein; Debora S Marks; Chris Sander
Journal: PLoS Comput Biol Date: 2015-07-30 Impact factor: 4.475

7. Protein 3D structure computed from evolutionary sequence variation.

Authors: Debora S Marks; Lucy J Colwell; Robert Sheridan; Thomas A Hopf; Andrea Pagnani; Riccardo Zecchina; Chris Sander
Journal: PLoS One Date: 2011-12-07 Impact factor: 3.240

8. Co-evolution based machine-learning for predicting functional interactions between human genes.

Authors: Doron Stupp; Elad Sharon; Idit Bloch; Marinka Zitnik; Or Zuk; Yuval Tabach
Journal: Nat Commun Date: 2021-11-09 Impact factor: 14.919

9. Highly accurate protein structure prediction with AlphaFold.

Authors: John Jumper; Richard Evans; Alexander Pritzel; Tim Green; Michael Figurnov; Olaf Ronneberger; Kathryn Tunyasuvunakool; Russ Bates; Augustin Žídek; Anna Potapenko; Alex Bridgland; Clemens Meyer; Simon A A Kohl; Andrew J Ballard; Andrew Cowie; Bernardino Romera-Paredes; Stanislav Nikolov; Rishub Jain; Demis Hassabis; Jonas Adler; Trevor Back; Stig Petersen; David Reiman; Ellen Clancy; Michal Zielinski; Martin Steinegger; Michalina Pacholska; Tamas Berghammer; Sebastian Bodenstein; David Silver; Oriol Vinyals; Andrew W Senior; Koray Kavukcuoglu; Pushmeet Kohli
Journal: Nature Date: 2021-07-15 Impact factor: 49.962
