
Embracing Green Computing in Molecular Phylogenetics.

Abstract

Molecular evolutionary analyses require computationally intensive steps such as aligning multiple sequences, optimizing substitution models, inferring evolutionary trees, testing phylogenies by bootstrap analysis, and estimating divergence times. With the rise of large genomic data sets, phylogenomics is imposing a large carbon footprint on the environment, with consequences for the planet's health. Electronic waste and energy usage are major environmental issues. Fortunately, innovative methods and heuristics are available to shrink the carbon footprint, giving researchers opportunities to lower environmental costs and make evolutionary computing greener. Green computing will also enable greater scientific rigor and encourage broader participation in big data analytics.


Keywords: carbon footprint; green computing; molecular evolution; phylogenetics


Year: 2022 PMID: 35243506 PMCID: PMC8894743 DOI: 10.1093/molbev/msac043

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

Many biological disciplines apply computational approaches to investigate evolutionary questions involving the origins of genes, evolutionary relationships of organisms, positive and negative selection, the evolution of biodiversity, and genotype–phenotype connections across the tree of life. The importance of these questions is reflected by the escalating use of software for molecular evolutionary analyses (fig. 1). Paradoxically, the means by which we explore the tree of life negatively impact that evolving tree of life, because computing has environmental costs. A computer's energy usage translates into carbon dioxide emissions. Many scientists are seriously assessing the environmental cost of data analysis and the carbon footprint left by molecular evolutionary studies (Tao et al. 2019; Kumar and Sharma 2021; Álvarez-Carretero et al. 2022; Grealey et al. 2022). In particular, Grealey et al. (2022) have recently assessed the energy utilization and the associated carbon footprint of bioinformatics, including phylogenetic analysis and genome assembly.

Fig. 1.

The use of computational methods in molecular evolution has been increasing quickly, as seen in the annual counts of new research articles citing the use of major software packages for molecular evolutionary and phylogenetic analyses. Citation counts for software packages were obtained from Google Scholar (last accessed January 25, 2022) for 2005–2020. See Supplementary Material online for more details on software versions included.

Strategies are being developed to achieve energy savings in a quest for greener computing in the sciences and a healthier global ecology with health benefits to the general public (Jones 2018; Portegies Zwart 2020; Stevens et al. 2020; Strubell et al. 2020; Bender et al. 2021; Lannelongue, Grealey, Bateman, et al. 2021; Grealey et al. 2022). For example, cloud computing avoids idle time, as partial CPU and memory use in standalone computers wastes energy (Shehabi et al. 2016; Jones 2018). However, speeding up research computing through faster processors and parallelization demands extra energy and, thus, emits more greenhouse gases. Using idle GPUs to assist CPUs can also result in greener computing, but this approach depends on appropriate software implementations (Grealey et al. 2022). Interestingly, energy production has a much smaller carbon footprint in some countries (e.g., Norway and Switzerland), making them better locations for cloud computing (Lannelongue, Grealey, and Inouye 2021). Substantial reduction in energy costs can also be achieved by complementary means, which is the focus of this perspective. Here, I highlight conceptual and technical advances that can organically reduce computational time and memory of phylogenomics. I suggest that researchers choose methods, algorithms, and software practices that demand fewer compute cycles and less computer memory. These choices will diminish the carbon footprint of computational molecular evolution and be aligned with ecologically sound bioinformatic practices. These and future developments of resource-thrifty and accurate methods will amplify the impact of general strategies for greener computing.

Carbon Footprints of Phylogenetic and Phylogenomic Analyses

A standard protocol in molecular phylogeny is first to assemble a set of sequences and subject them to alignment procedures to establish base-by-base homology across sequences from different species and genes (Kumar and Filipski 2007). The resulting multiple sequence alignments (MSAs) become ready for molecular phylogenetics after proper postprocessing, including manual curation (Yang and Rannala 2012; Kapli et al. 2020).

Selecting the Optimal Model

In analyzing an MSA, the usual first step is to select the substitution model that best describes the overall pattern of base changes. This analysis requires evaluating several models of nucleotide (or amino acid) substitution as well as models of rate variation across sites. Maximum likelihood (ML) tests of several nested and non-nested models under the Bayesian information criterion are frequently used. Model selection has a substantial carbon footprint for phylogenomic data sets. For example, an MSA of 1.3 million base pairs from 37 mammalian species took 106 CPU hours and 9.3 gigabytes (GB) of peak memory in ModelFinder to select the optimal model (Kalyaanamoorthy et al. 2017). According to the Green Algorithms (GA) resource (Lannelongue, Grealey, and Inouye 2021), this analysis would require 1.6 kilowatt-hours (kWh) of energy and have a carbon footprint of 0.62 kg CO2e. GA suggests that a tree will take 20 days to scrub the environment of the greenhouse gases emitted (table 1a1)! We can save more than 90% of the energy and, thus, emit less than 10% of the greenhouse gas by using ModelTest-NG (Darriba et al. 2020) or jModelTest (Posada 2008), which produce similar results (table 1a). Recent machine-learning approaches also promise to provide green alternatives (Abadi et al. 2020; Burgstaller-Muehlbacher et al. 2021). Also, a machine-learning method for detecting autocorrelated evolutionary rates in a phylogeny (CorrTest; Tao et al. 2019) requires a small fraction of the energy used by a comparable Bayes factor analysis (table 1b).
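The model comparison under the Bayesian information criterion mentioned above can be illustrated with a minimal Python sketch. It assumes the standard definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of alignment sites, and L the maximized likelihood. The log-likelihoods, parameter counts, and alignment length below are hypothetical values chosen for illustration only; they are not results from this article or output from any of the tools named here. Branch-length parameters are omitted because they would add the same constant to every model fitted on the same tree.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def select_model(candidates, n_sites):
    """Return the name of the lowest-BIC model and the full score table.

    candidates maps a model name to (maximized log-likelihood, free parameters).
    """
    scores = {
        name: bic(log_lik, n_params, n_sites)
        for name, (log_lik, n_params) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits on a hypothetical 1,000-site alignment (illustration only).
candidates = {
    "JC69": (-5230.0, 0),
    "HKY85": (-5105.0, 4),
    "GTR": (-5101.5, 8),
    "GTR+G": (-5040.0, 9),
}

best, scores = select_model(candidates, n_sites=1000)
for name in sorted(scores, key=scores.get):
    print(f"{name}: BIC = {scores[name]:.1f}")
print("Selected model:", best)
```

The cost of this step lies in obtaining each maximized likelihood, not in the comparison itself, which is why tools that evaluate fewer candidate models or fit them more efficiently can lower the footprint.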

Table 1.

Carbon Footprints (gram CO2e) of Molecular Phylogenetic Analyses and Software for an MSA of 37 Mammalian Species and 1.3 Million Sites.

		Computer Resources			Environmental Impact
Function	Method/Tool	Time (h)	Memory (peak, MB)	Energy (kWh)	C-footprint (g)	Trees (days)
(a) Optimal substitution model selection
a1.	ModelFinder	106.0	9,300	1.64	617	20.1
a2.	jModelTest	8.8	3,700	0.12	44	1.5
a3.	ModelTest-NG	8.0	3,700	0.11	41	1.2
(b) Clock rate model selection
b1.	Bayes factor	2,500.0	46,000	51.00	19,220	540.0
b2.	CorrTest	0.2	4,000	<0.01	1	<0.1
(c) Phylogeny inference
c1.	Maximum likelihood	8.1	4,000	0.11	41	1.2
c2.	FastTree	0.7	700	0.01	3	0.1
c3.	Neighbor-joining	0.1	8	<0.01	<1	<0.1
(d) Statistical tests of phylogenies (ML)
d1.	Standard bootstrap	980.0	3,100	13.00	4,850	159.0
d2.	Rapid bootstrap	98.0	3,700	1.00	493	16.2
d3.	Little bootstrap	18.9	100	0.23	86	2.7
d4.	Little+ultrafast-bootstraps	0.9	200	0.01	4	0.1
d5.	Bayesian	857.9	22,000	17.00	6,490	210.0
(e) Relaxed clock dating
e1.	Bayesian (slow)	2,309.5	23,000	46.00	17,460	570.0
e2.	Bayesian (fast)	29.5	909	0.36	135	4.5
e3.	RelTime	0.1	8	<0.01	<1	<0.1

Note.—The C-footprint (Carbon footprint) is the amount (g) of CO2e released in the production of energy (kilowatt-hours, kWh) needed to power computers in the USA, estimated using the Green Algorithms website (Lannelongue, Grealey, and Inouye 2021). Tree days are calculated based on the information that a mature tree can scrub ∼917 g of CO2e per month (Grealey et al. 2022). The Supplementary Material online provides details on software used and the options applied.
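The relationships among the energy, C-footprint, and tree-day columns can be checked with a short Python sketch. The energy and footprint values are transcribed from table 1. Two things are assumptions of this sketch and not statements from the table: the ∼917 g figure is treated as a monthly rate, and a month is taken as 30 days. Under those assumptions the tabulated tree days are recovered to within rounding for the rows used here. The carbon intensity is not given in the text; the function below only back-calculates the value implied by each row.

```python
# (energy in kWh, C-footprint in g CO2e), transcribed from table 1.
TABLE_ROWS = {
    "ModelFinder": (1.64, 617.0),
    "jModelTest": (0.12, 44.0),
    "Standard bootstrap": (13.00, 4850.0),
}

TREE_GRAMS_PER_MONTH = 917.0  # assumption: the ~917 g figure is a monthly rate
DAYS_PER_MONTH = 30.0         # assumption: 30-day month


def implied_carbon_intensity(energy_kwh, footprint_g):
    """Grams of CO2e per kWh implied by one table row."""
    return footprint_g / energy_kwh


def tree_days(footprint_g):
    """Days a mature tree would need to absorb the given footprint."""
    return footprint_g / TREE_GRAMS_PER_MONTH * DAYS_PER_MONTH


def percent_saved(baseline, alternative):
    """Relative saving of the alternative over the baseline, in percent."""
    return 100.0 * (1.0 - alternative / baseline)


for name, (kwh, grams) in TABLE_ROWS.items():
    intensity = implied_carbon_intensity(kwh, grams)
    print(f"{name}: {intensity:.0f} g/kWh implied, {tree_days(grams):.1f} tree days")

energy_saved = percent_saved(TABLE_ROWS["ModelFinder"][0], TABLE_ROWS["jModelTest"][0])
print(f"jModelTest vs. ModelFinder energy saving: {energy_saved:.1f}%")
```

The same `percent_saved` helper can be applied to any pair of rows in the table to compare a standard method with its lower-footprint alternative.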


Building a Molecular Phylogeny

Using an MSA and the best-fit substitution model, we can make a phylogeny representing the evolutionary histories of genes and species. ML and minimum evolution (ME) are two widely used model-based optimality principles for reconstructing phylogenetic trees (Nei and Kumar 2000). The neighbor-joining method (Saitou and Nei 1987), based on the ME principle and used in thousands of studies, has a negligible carbon footprint (table 1c3) compared with popular heuristic searches under the ML optimality criterion (table 1c1). Another approach that combines optimality criteria (FastTree) has an intermediate environmental impact (table 1c2). The accuracy of phylogenies produced by different techniques is comparable for many applications (Rosenberg and Kumar 2001; Price et al. 2010; Yoshida and Nei 2016), so researchers have many excellent options for reducing the environmental impact of their analyses.

Confidence Limits on Inferred Phylogenetic Groupings

Statistical evaluation of the robustness of inferred phylogenetic relationships is essential in evolutionary biology. Felsenstein’s (1985) bootstrap resampling has been the preferred approach, but it is computationally intensive, requiring the inference of hundreds of phylogenetic trees from pseudo-MSAs generated by sampling sites with replacement from the full data set. This analysis has a rather large carbon footprint (table 1d1), as does its Bayesian alternative, which produces posterior probabilities for inferred evolutionary relationships (table 1d5). Many approximate, energy-efficient methods are now available for phylogenomic data sets, including little bootstraps (Sharma and Kumar 2021) for long sequences, and ultrafast bootstrapping (Minh et al. 2013) and rapid bootstrapping (Stamatakis et al. 2008) for data sets containing large numbers of sequences. These approximate methods have much smaller carbon footprints than standard approaches (table 1d). Combining different techniques (Sharma and Kumar 2021) can save more than 99% in time, memory, and energy in testing the robustness of inferred phylogenies (table 1d4).
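The resampling step behind the standard bootstrap, sampling sites with replacement from the full alignment, can be sketched in a few lines of Python. This is only an illustration of how pseudo-MSAs are generated; the toy alignment and the function name are hypothetical, and the expensive part of the procedure (inferring a tree from every pseudo-MSA) is deliberately left out.

```python
import random


def bootstrap_replicate(msa, rng):
    """Return one pseudo-MSA built by sampling alignment columns with replacement.

    msa maps a sequence name to its aligned sequence; all sequences must have
    the same length. The pseudo-MSA has the same number of sites as the input.
    """
    names = list(msa)
    n_sites = len(msa[names[0]])
    if any(len(msa[name]) != n_sites for name in names):
        raise ValueError("all aligned sequences must have the same length")
    # Draw column indices with replacement, then apply them to every sequence
    # so that each sampled site stays intact across taxa.
    sampled = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(msa[name][i] for i in sampled) for name in names}


# Hypothetical toy alignment (illustration only).
toy_msa = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACCTTCGA",
}

rng = random.Random(2022)  # fixed seed so the example is repeatable
replicates = [bootstrap_replicate(toy_msa, rng) for _ in range(100)]
print(len(replicates), "pseudo-MSAs generated; first replicate:")
for name, seq in replicates[0].items():
    print(f"  {name}: {seq}")
```

Because every replicate requires a full tree search on an alignment as long as the original, the cost of the standard bootstrap grows with both the number of replicates and the alignment length, which is what the approximate methods above reduce.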

From Phylogenies to Timetrees

Another common phylogenetic analysis is the estimation of divergence times corresponding to speciations, gene duplications, and the evolution of new strains. Relaxed clock methods have revolutionized this practice (Kumar and Hedges 2016; Tao et al. 2020). Bayesian and RelTime methods produce estimates of similar quality (e.g., Barba-Montoya et al. 2020; Mello et al. 2021), but their energy requirements are dramatically different (table 1e). There is also a large difference in the carbon footprints imposed by slow and fast Bayesian implementations (table 1e). Consequently, researchers have a large spectrum of more environmentally friendly alternatives for molecular dating methods.

Green Software Implementations

Ultimately, efficient software implementation is the key to realizing the potential of all conceptual, methodological, and algorithmic innovations. The software design and resource utilization dictate energy consumption, so implementations that use less computer memory and time have a lower carbon footprint. Availability of software versions that can run on the cloud will also reduce carbon footprints. Another emerging area of improvement lies in creating stopping rules that can detect when further computing will not change the outcome significantly. For example, adaptive rules are being developed to automatically determine the number of bootstrap replicates needed for reliable confidence limits (Stamatakis 2014; Sharma and Kumar 2021). In the future, smarter software will avoid overcomputing, decreasing the carbon footprints of big data analyses.
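The idea of a stopping rule can be illustrated with a deliberately simple Python sketch. It is a generic convergence check and not the rule used in any of the cited tools: replicates are run in batches, and computation stops once the running support estimate for a grouping changes by less than a tolerance between batches. The function name, batch size, tolerance, and the simulated replicate (a random draw that supports the grouping 90% of the time) are all hypothetical.

```python
import random


def adaptive_support(run_replicate, batch_size=50, max_reps=1000, tol=0.01):
    """Estimate support for a grouping, stopping early once it stabilizes.

    run_replicate is a zero-argument callable that returns True when one
    bootstrap replicate recovers the grouping of interest.
    Returns (support estimate, number of replicates actually run).
    """
    hits = 0
    reps = 0
    previous = None
    estimate = 0.0
    while reps < max_reps:
        # Run one batch, never exceeding the overall replicate budget.
        for _ in range(min(batch_size, max_reps - reps)):
            if run_replicate():
                hits += 1
            reps += 1
        estimate = hits / reps
        # Stop when the running estimate barely moved since the last batch.
        if previous is not None and abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate, reps


# Simulated replicate standing in for a real tree inference (illustration only).
rng = random.Random(1)
support, used = adaptive_support(lambda: rng.random() < 0.9)
print(f"Support estimate {support:.2f} after {used} of at most 1000 replicates")
```

Every replicate skipped by such a rule is a tree search that is never run, so the energy saving scales directly with the number of replicates avoided.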

Benefits beyond Environmental Sustainability

Computationally efficient analyses will also enhance the rigor of scientific research, reducing the resources required to assess the robustness of inferences to subsetting of data, choice of substitution models and strategies, and combining multigene data sets. Computationally efficient and economical computing will encourage researchers to evaluate the reproducibility of published results. The currently high computational demands of reproducibility studies put efforts to reproduce research results out of the reach of researchers lacking access to high-performance computing infrastructure. Greener computing is also a key to addressing equity, diversity, and sustainability in scientific research and education. Green computing requires fewer compute cycles and less computer memory. It reduces the expense of computational hardware and the cost of on-demand calculations. Economical computing makes computational research accessible to a broader community, as the research funding for scientific investigations is limited. Greener computing, therefore, will uniquely address economic disparities among researchers due to their local constraints. Greener alternatives for molecular phylogenetic analysis will increase participation by researchers worldwide in molecular evolutionary research and the genomic revolution in biology.

Concluding Remarks

In the Anthropocene, where massive planetary changes are taking place because of human activity, computing is often thought of as a “clean” practice, when in fact, it can be quite the opposite. All branches of biology need to re-evaluate their practices in keeping with the underlying goal of studying life in the first place. For computational analyses, with the routine assembly of big data sets, analytical practices of the past hamper research by the need for excessive computing time and memory. These obstacles hinder both rigorous scientific investigations and wider participation in molecular phylogenetics. Large carbon footprints of many currently popular approaches have negative impacts on the environment, human health, and the sustainability of scientific computing. Fortunately, many accurate and resource-thrifty methods and algorithms are available for molecular phylogenetics. Applying these methods synergistically with computer hardware optimizations will help us achieve greater scientific rigor and broader participation while minimizing financial and environmental costs. I see a bright future for green computing in which conceptual and technical advances will further diminish the carbon footprints of increasingly complex phylogenomic analyses.

Supplementary Material

Supplementary information is available at Molecular Biology and Evolution online.

26 in total

1. Advances in Time Estimation Methods for Molecular Data.

Authors: Sudhir Kumar; S Blair Hedges
Journal: Mol Biol Evol Date: 2016-02-16 Impact factor: 16.240

2. Multiple sequence alignment: in pursuit of homologous DNA positions.

Authors: Sudhir Kumar; Alan Filipski
Journal: Genome Res Date: 2007-02 Impact factor: 9.043

3. A rapid bootstrap algorithm for the RAxML Web servers.

Authors: Alexandros Stamatakis; Paul Hoover; Jacques Rougemont
Journal: Syst Biol Date: 2008-10 Impact factor: 15.683

4. How to stop data centres from gobbling up the world's electricity.

Authors: Nicola Jones
Journal: Nature Date: 2018-09 Impact factor: 49.962

5. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

6. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps.

Authors: Sudip Sharma; Sudhir Kumar
Journal: Nat Comput Sci Date: 2021-09-22

7. ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models.

Authors: Diego Darriba; David Posada; Alexey M Kozlov; Alexandros Stamatakis; Benoit Morel; Tomas Flouri
Journal: Mol Biol Evol Date: 2020-01-01 Impact factor: 16.240

8. The Carbon Footprint of Bioinformatics.

Authors: Jason Grealey; Loïc Lannelongue; Woei-Yuh Saw; Jonathan Marten; Guillaume Méric; Sergio Ruiz-Carmona; Michael Inouye
Journal: Mol Biol Evol Date: 2022-03-02 Impact factor: 16.240

9. Using a GTR+Γ substitution model for dating sequence divergence when stationarity and time-reversibility assumptions are violated.

Authors: Jose Barba-Montoya; Qiqing Tao; Sudhir Kumar
Journal: Bioinformatics Date: 2020-12-30 Impact factor: 6.937

10. A Machine Learning Method for Detecting Autocorrelation of Evolutionary Rates in Large Phylogenies.

Authors: Qiqing Tao; Koichiro Tamura; Fabia U Battistuzzi; Sudhir Kumar
Journal: Mol Biol Evol Date: 2019-04-01 Impact factor: 16.240