Jason Grealey, Loïc Lannelongue, Woei-Yuh Saw, Jonathan Marten, Guillaume Méric, Sergio Ruiz-Carmona, Michael Inouye.
Abstract
Bioinformatic research relies on large-scale computational infrastructures, which have a nonzero carbon footprint, but so far no study has quantified the environmental costs of bioinformatic tools and commonly run analyses. In this work, we estimate the carbon footprint of bioinformatics (in kilograms of CO2 equivalent, kgCO2e) using the freely available Green Algorithms calculator (www.green-algorithms.org, last accessed 2022). We assessed 1) bioinformatic approaches in genome-wide association studies (GWAS), RNA sequencing, genome assembly, metagenomics, phylogenetics, and molecular simulations, as well as 2) computation strategies, such as parallelization, CPU (central processing unit) versus GPU (graphics processing unit), cloud versus local computing infrastructure, and geography. In particular, we found that biobank-scale GWAS emitted substantial kgCO2e and that simple software upgrades could make it greener; for example, upgrading from BOLT-LMM v1 to v2.3 reduced the carbon footprint by 73%. Moreover, switching from the average data center to a more efficient one can reduce the carbon footprint by approximately 34%. Memory over-allocation can also be a substantial contributor to an algorithm's greenhouse gas emissions. The use of faster processors or greater parallelization reduces running time but can lead to a greater carbon footprint. Finally, we provide guidance on how researchers can reduce power consumption and minimize kgCO2e. Overall, this work elucidates the carbon footprint of common analyses in bioinformatics and provides solutions which empower a move toward greener research.
Keywords: bioinformatics; carbon footprint; genomics; green algorithms
Year: 2022 PMID: 35143670 PMCID: PMC8892942 DOI: 10.1093/molbev/msac034
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Carbon Footprint of a Range of Bioinformatic Tasks.
| Task | Tool | Version | Details about the Experiments | Carbon Footprint (kgCO2e) | Tree-months | km in a Car (EU) | Running Time | Memory | Approximate Scaling (if known) |
|---|---|---|---|---|---|---|---|---|---|
| Genome scaffolding | SSPACE | 2.0 | Scaffolding 2.4 million long reads from human chromosome 14 | | 0.0011 | 0.01 | 3 min 21 s | 30 GB | Linearly with number of reads. |
| | SOAPdenovo2 | r223 | | | 0.0016 | 0.01 | 4 min 52 s | 30 GB | |
| | SGA | 0.9.43 | | | 0.032 | 0.17 | 1 h 35 min | 30 GB | |
| Genome scaffolding | SSPACE | 2.0 | Scaffolding 23 million short reads from human chromosome 14 | | 0.0029 | 0.02 | 8 min 40 s | 30 GB | |
| | SOAPdenovo2 | r223 | | | 0.0039 | 0.02 | 1 min 38 s | 30 GB | |
| | SGA | 0.9.43 | | | 0.14 | 0.74 | 7 h 05 min | 30 GB | |
| Genome assembly | Abyss | 2.0 | De novo assembly of a human genome from Illumina sequencing reads | | 12 | 61 | 20 h | 34 GB | |
| | MEGAHIT | 1.0.6 | | | 16 | 86 | 26 h | 197 GB | |
| Metagenome assembly | MetaVelvet k101 | 1.2.01 | Metagenome assembly from 100 soil samples | | 16 | 82 | 1 h 06 min | 130 GB | |
| | MEGAHIT | 1.0.3 | | | 84 | 439 | 15 h 36 min | 12 GB | |
| | metaSPAdes | 3.8.0 | | | 203 | 1,065 | 29 h 24 min | 60 GB | |
| Metagenome classification (short read) | Kraken2 | 2.0.7 | Metagenomic classification of 5 Gb of randomly sampled reads from Zymo mock community (batch ZRC190633), containing yeast, Gram-negative, and positive bacteria | | 0.0057 | 0.03 | 20 min | 21 GB | Linearly with number of reads. |
| | Centrifuge | 1.0.4 | | | 0.014 | 0.07 | 58 min | 12 GB | |
| | Kraken/Bracken | 0.10.5/1.0.0 | | | 0.10 | 0.52 | 1 h 40 min | 154 GB | |
| Metagenome classification (long read) | MetaMaps | — | | | 19.91 | 104.27 | 209 h 53 min | 262 GB | |
| Phylogenetics | BEAST/BEAGLE | 1.8.4/2.1.2 | Codon substitution modeling of extant carnivores and a pangolin group. Nucleotide substitution and phylogeographic modeling of Ebola virus genomes. See | | 0.013–0.33 | 0.069–1.72 | 3 min 30 s to 7 h 45 min | 2–8 GB | Power law with number of loci. |
| Phylogenetics | RAxml/ExaML, PhyML, IQ-TREE, FastTree | 8.2.0/3.0.17, 20160530 1.4.2, 2.1.9 | Over 670,000 tree inferences on about 45,000 single-gene alignments and supermatrices from 19 empirical phylogenomic data sets with thousands of genes and around 200 taxa. | | 3,889 | 20,371 | 300,000 h | 8 GB | |
| Phylogenetics | ExaML | — | A 322-million-bp MULTIZ alignment of putatively orthologous genome regions across all species, comprising approximately 30% of an average assembled avian genome. This corresponded to the maximal orthologous sequence obtainable across all orders of | | 4,769 | 24,983 | 367,920 h | 8 GB | |
| RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to Homo Sapiens hg19 genome | — | 0.0059 | 0.031 | 1 min 48 s | 5 GB | Linearly with number of reads. |
| | STAR | 2.5.0a | | | 0.011 | 0.055 | 6 min 01 s | 35 GB | |
| | TopHat2 | 2.1.0 | | | 0.35 | 1.81 | 2 h 14 min | 16 GB | |
| | Novoalign | 3.02.13 | | | 1.07 | 5.58 | 32 h 12 min | 64 GB | |
| RNA read alignment | HISAT2 | 2.0.0beta | Alignment of 10 million 100-base read pairs to Plasmodium falciparum genome | | 0.0057 | 0.030 | 1 min 44 s | 1 GB | |
| | TopHat2 | 2.1.0 | | | 0.26 | 1.37 | 1 h 25 min | 13 GB | |
| | STAR | 2.5.0a | | | 0.40 | 2.11 | 2 h 27 min | 8 GB | |
| | Novoalign | 3.02.13 | | | 0.73 | 3.83 | 38 h 04 min | 21 GB | |
| RNA-seq QC pipeline | FastQC, TrimGalore, bbmap/clumpify, and STAR | -/v0.6.0/-/v2.7.0e | Quality control analysis of raw reads quality of 392 samples from the Childhood Asthma Study (in-house). | | 59.97 | 314.11 | 485 h 12 min | 8 GB | |
| Transcript isoform abundance estimation | Sailfish 1 core | 0.6.3 | Transcript isoform quantification of 100 million in silico reads generated from Flux Simulator with hg19 genome and GENCODE v19 annotation set | | 0.0088 | 0.046 | 42 min | 7 GB | Linearly with the number of reads. |
| | Sailfish 16 cores | 0.6.3 | | | 0.039 | 0.21 | 14 min | 7 GB | |
| | Cufflinks 1 core | 2.1.1 | | | 0.049 | 0.26 | 3 h 30 min | 11 GB | |
| | Cufflinks 16 cores | 2.1.1 | | | 0.30 | 1.56 | 1 h 45 min | 12 GB | |
| | RSEM 1 core | 1.2.18 | | | 0.63 | 3.28 | 47 h 10 min | 9 GB | |
| | RSEM 16 cores | 1.2.18 | | | 1.53 | 8.00 | 8 h 50 min | 21 GB | |
| GWAS | Bolt-LMM | 2.3 | Analyses of a single trait in UK Biobank | | 5.13 | 26.87 | 60 h 58 min | 100 GB | Linearly with number of variants. |
| | Bolt-LMM | 1.0 | | | 18.86 | 98.81 | 224 h 10 min | 100 GB | |
| Cohort scale eQTL analysis | TensorQTL | 1.0.2 | Cis-eQTL mapping of 10.7 M SNPs against 18,373 genetic features in a cohort of 2,745 individuals (in-house). | | 2.22 | 11.7 | 1 h 14 min | 192 GB | Nonlinearly with the number of traits or the sample size. |
| | LIMIX | 2.0.3 | | | 208.07 | 1,089.9 | 9,705 h | 41–221 GB | |
| Single cis-eQTL gene mapping | TensorQTL | — | Cis-eQTL mapping one gene from skeletal muscle in GTEx (v6p) | | 0.00001 | 0.00004 | 0.11 s | 52 GB | |
| | FastQTL | — | | | 0.0002 | 0.001 | 30 s | 52 GB | |
| Molecular dynamics simulation | AMBER | 18 | Simulation of a Satellite Tobacco Mosaic Virus with 1,066,628 atoms for 100 ns | | 19 | 102 | 75 h | | |
| | NAMD | 2.13 | | | 104 | 544 | 400 h | | |
| Molecular docking | Glide | 57111 | Molecular docking of four DUD systems, scaled to 1 m ligands | | 14 | 74 | 1,027 h 47 min | 0.05 GB | |
| | rDock | — | | | 168 | 878 | 12,250 h | 0.05 GB | |
| | AutoDock Vina | — | | | 561 | 2,938 | 40,972 h | 0.05 GB | |
Note.—Further details for each task are included in supplementary additional file 1, Supplementary Material online.
The two molecular dynamics simulations used different parameters: AMBER 18 (4 fs timestep, 9 Å cut-off) and NAMD (2 fs timestep with rigid bonds, 12 Å cut-off with PME every two steps).
Memory is not included for the molecular dynamics simulations because the information was unavailable.
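The table reports footprints in tree-months and in kilometres driven in a European car. The sketch below shows how such units relate to kgCO2e and how a percentage reduction between two tools is computed. The two conversion constants are assumptions not stated in this section (roughly 11 kg of CO2 sequestered per tree per year and roughly 0.175 kgCO2e emitted per km by an average EU passenger car), and the function names are hypothetical.

```python
# Assumed reference values, not given in this section:
# a tree sequesters about 11 kg CO2 per year; an average EU car emits
# about 0.175 kgCO2e per km.
KG_PER_TREE_MONTH = 11.0 / 12.0
KG_PER_CAR_KM = 0.175


def tree_months_to_kg(tree_months: float) -> float:
    """Convert tree-months of sequestration to kgCO2e."""
    return tree_months * KG_PER_TREE_MONTH


def kg_to_car_km(kg: float) -> float:
    """Convert kgCO2e to the equivalent distance driven in an EU car."""
    return kg / KG_PER_CAR_KM


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (1.0 - after / before)


if __name__ == "__main__":
    # Tree-month values for Bolt-LMM 1.0 and 2.3, copied from the table.
    bolt_v1, bolt_v23 = 18.86, 5.13
    print(f"Bolt-LMM 2.3: {tree_months_to_kg(bolt_v23):.2f} kgCO2e (assumed conversion)")
    print(f"Equivalent car distance: {kg_to_car_km(tree_months_to_kg(bolt_v23)):.1f} km")
    print(f"Reduction from v1.0 to v2.3: {percent_reduction(bolt_v1, bolt_v23):.0f}%")
```

Because the reduction is a ratio, it does not depend on the assumed conversion constants; only the absolute kgCO2e figure does.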
Fig. 1. The effect of hardware choices and parallelization on carbon footprint. The carbon footprint of BEAST/Beagle implemented on multicore CPU or GPUs for three different tasks. The plots on the left detail both the running time and carbon footprint against the number of cores utilized. The plots on the right detail the running time solely against carbon footprint (contextualized with tree-months) for both CPUs and GPUs. The numerical data are available in supplementary table 2, Supplementary Material online.
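The trade-off between running time and footprint can also be checked against the 1-core and 16-core rows of the transcript isoform abundance estimation task in the table. The following sketch only rearranges those published numbers; the dictionary layout and the function name are hypothetical.

```python
# (tool, cores) -> (running time in minutes, footprint in tree-months),
# copied from the transcript isoform abundance estimation rows of the table.
RUNS = {
    ("Sailfish", 1): (42, 0.0088),
    ("Sailfish", 16): (14, 0.039),
    ("Cufflinks", 1): (3 * 60 + 30, 0.049),
    ("Cufflinks", 16): (1 * 60 + 45, 0.30),
    ("RSEM", 1): (47 * 60 + 10, 0.63),
    ("RSEM", 16): (8 * 60 + 50, 1.53),
}


def parallel_tradeoff(tool: str) -> tuple[float, float]:
    """Return (speed-up, footprint ratio) when going from 1 to 16 cores."""
    time_1, footprint_1 = RUNS[(tool, 1)]
    time_16, footprint_16 = RUNS[(tool, 16)]
    return time_1 / time_16, footprint_16 / footprint_1


if __name__ == "__main__":
    for tool in ("Sailfish", "Cufflinks", "RSEM"):
        speedup, ratio = parallel_tradeoff(tool)
        print(f"{tool}: {speedup:.1f}x faster, {ratio:.1f}x the footprint")
```

In all three cases the 16-core run finishes sooner but has a larger footprint than the single-core run, which is the pattern the figure describes for BEAST/Beagle.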
Fig. 2. Impact of location and computational platform on carbon footprint. Carbon footprint (in kgCO2e, tree-months, and European car km) of a biobank-scale, 100-trait GWAS in various locations and on various platforms. Average data centers have a power usage effectiveness (PUE) of 1.67 (Andy 2019), Google Cloud has a PUE of 1.11 (Efficiency – Data Centers – Google n.d.), and the carbon intensity (CI) is 0.88 kgCO2e/kWh in Australia, 0.453 kgCO2e/kWh in the United States, and 0.253 kgCO2e/kWh in the UK (Carbonfootprint.Com – International Electricity Factors 2020).
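The PUE and CI values in this caption can be combined in a short calculation. The sketch below assumes the footprint is the energy drawn by the computing hardware multiplied by PUE and by CI, which follows from how the two quantities are defined; the 1,000 kWh workload and the function names are hypothetical.

```python
# PUE and carbon intensity (kgCO2e/kWh) values quoted in the Fig. 2 caption.
PUE = {"average data center": 1.67, "Google Cloud": 1.11}
CI = {"Australia": 0.88, "United States": 0.453, "UK": 0.253}


def footprint_kg(it_energy_kwh: float, pue: float, ci: float) -> float:
    """Footprint in kgCO2e, assuming footprint = IT energy x PUE x CI."""
    return it_energy_kwh * pue * ci


def percent_change(new: float, old: float) -> float:
    """Signed percentage change from `old` to `new`."""
    return 100.0 * (new / old - 1.0)


if __name__ == "__main__":
    energy_kwh = 1000.0  # hypothetical workload, for illustration only
    for country, ci in CI.items():
        for platform, pue in PUE.items():
            kg = footprint_kg(energy_kwh, pue, ci)
            print(f"{country:14s} {platform:20s} {kg:8.1f} kgCO2e")
    change = percent_change(PUE["Google Cloud"], PUE["average data center"])
    print(f"Average data center -> Google Cloud: {change:.1f}%")
```

With energy use and location held fixed, moving from a PUE of 1.67 to 1.11 lowers the footprint by about a third, in line with the approximately 34% reduction quoted in the abstract.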
Fig. 3. Over-allocating memory increases a given algorithm’s carbon footprint. We modeled how over-allocating the memory for a given algorithm increases its carbon footprint; the effect is larger for algorithms with larger memory requirements. Each plot details the percentage increase in carbon footprint as a function of memory overestimation for a variety of bioinformatic tools and tasks. The numerical data are available in supplementary table 1, Supplementary Material online.
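A minimal sketch of why over-allocation matters more for memory-hungry jobs is given below. It assumes that power draw is the sum of a core term and a term proportional to allocated memory, and that running time, PUE, and carbon intensity are unchanged by over-allocation, so they cancel in the percentage. The memory power of 0.3725 W per GB and the 100 W core power are assumptions for illustration, not values from this section.

```python
def overallocation_increase(
    needed_gb: float,
    factor: float,
    core_power_w: float,
    mem_w_per_gb: float = 0.3725,  # assumed memory power draw per GB
) -> float:
    """Percentage increase in footprint when allocating `factor` times the
    memory actually needed, under the additive power model described above."""
    base_power = core_power_w + needed_gb * mem_w_per_gb
    over_power = core_power_w + needed_gb * factor * mem_w_per_gb
    return 100.0 * (over_power / base_power - 1.0)


if __name__ == "__main__":
    core_power_w = 100.0  # hypothetical total core power
    for needed_gb in (10, 50, 200):
        for factor in (1.5, 2.0, 4.0):
            increase = overallocation_increase(needed_gb, factor, core_power_w)
            print(f"{needed_gb:4d} GB needed, {factor:.1f}x allocated: +{increase:.1f}%")
```

Under these assumptions, the same over-allocation factor produces a larger relative increase for a job that needs more memory, because memory accounts for a larger share of its total power draw.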