Batsirai M Mabvakure1,2,3, Raymond Rott4, Leslie Dobrowsky4, Peter Van Heusden5, Lynn Morris1,2,6, Cathrine Scheepers1,2, Penny L Moore1,2,6.
Abstract
Next-generation sequencing (NGS) technologies have revolutionized biological research by generating genomic data that were once unaffordable to produce with traditional first-generation sequencing technologies. These sequencing methodologies provide an opportunity for in-depth analyses of host and pathogen genomes because they can sequence millions of templates at a time. However, these large datasets can only be explored efficiently using bioinformatics analyses that require large data storage and computational resources adapted for high-performance processing. High-performance computing allows efficient handling of large data and of tasks that may require multi-threading and prolonged computational times, which is not feasible with ordinary computers. However, high-performance computing resources are costly and therefore not always readily available in low-income settings. We describe the establishment of an affordable high-performance computing bioinformatics cluster consisting of 3 nodes, constructed using ordinary desktop computers and open-source software including Linux Fedora, SLURM Workload Manager, and the Conda package manager. For the analysis of large antibody sequence datasets and for complex viral phylodynamic analyses, the cluster outperformed desktop computers. This demonstrates that it is possible to construct high-performance computing capacity capable of analyzing large NGS data from relatively low-cost hardware and entirely free (open-source) software, even in resource-limited settings. Such a cluster design has broad utility beyond bioinformatics, extending to other studies that require high-performance computing.
Keywords: High-performance computing; bioinformatics; cluster; data analysis; large data; low-cost systems; next-generation sequencing
Year: 2019 PMID: 35173421 PMCID: PMC8842485 DOI: 10.1177/1177932219882347
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Bioinformatics cluster architecture. (A) Schematic description of the cluster architecture, storage, and memory. (B) Bioinformatics cluster file system. The storage area is shown in blue, the user area for data analysis in green, and the restricted system files in red.
Figure 2. Access to bioinformatics programs on the cluster. (A) Bioinformatics programs on the cluster. (B) Example of a bash script to run programs on the cluster.
Comparison of the cluster performance with that of a high specification ordinary machine.
| Machine | Average speed and range (hours/million steps) | Total time to complete 400 million steps (hours) | No. of analyses at a time |
|---|---|---|---|
| Cluster | 0.41 (0.39-0.43) | 164 | 18 |
| Ordinary machine: 2.8 GHz Core i7, 16 GB memory | 0.78 (0.25-1.01) | 284 | 1 |
The cluster runs each job faster than the ordinary machine and also runs multiple jobs in parallel, reducing the total time needed to complete multiple analyses.
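The effect of parallel capacity on a batch of analyses can be illustrated with a short calculation. The sketch below is not from the paper: it takes the reported total times and the number of simultaneous analyses from the table, and assumes (hypothetically) that every analysis takes the reported total time and that per-job speed does not drop when all slots are in use. The function name `batch_hours` is illustrative. The speed column is read here as hours per million steps, which the cluster row supports (0.41 × 400 = 164). The ordinary machine's reported total (284 hours) is not equal to its average multiplied by 400, so the reported totals are used directly.

```python
import math

# Values copied from the table: hours to complete one 400-million-step analysis,
# and the number of analyses each machine runs at a time.
TOTAL_HOURS = {"cluster": 164, "ordinary": 284}
PARALLEL_JOBS = {"cluster": 18, "ordinary": 1}


def batch_hours(machine: str, n_analyses: int) -> int:
    """Estimate wall-clock hours to finish n_analyses on the given machine.

    Assumption (not from the source): analyses run in full "waves" of
    PARALLEL_JOBS[machine], and each wave takes the reported total time.
    """
    waves = math.ceil(n_analyses / PARALLEL_JOBS[machine])
    return waves * TOTAL_HOURS[machine]


if __name__ == "__main__":
    for n in (1, 18, 36):
        cluster = batch_hours("cluster", n)
        ordinary = batch_hours("ordinary", n)
        print(f"{n:>2} analyses: cluster {cluster} h, "
              f"ordinary {ordinary} h, ratio {ordinary / cluster:.1f}x")
```

Under these assumptions the advantage for a single analysis is modest, while for a full batch of 18 analyses most of the gain comes from running them side by side and not from per-job speed.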
Figure 3. Cluster performance with large datasets of antibody repertoire data. (A) Total amount of data from the SONAR analyses, followed by the breakdown per participant for donors. (B) Number of antibody sequences analyzed using the bioinformatics cluster per donor and per time point. Heavy and light chain antibody sequence data are shown in black and gray, respectively.
Figure 4. NGS analysis flowchart using the bioinformatics programs on the cluster. The antibody repertoire analysis involves first organizing the data from the sequencing facility (data pre-processing), SONAR analysis, and post-SONAR analysis that involves selection of clonally related sequences. The arrows show the flow of data through the various stages, and in some instances, data can go back and forth between certain stages (shown by the thin arrows).