Anas Oujja1,2, Mohamed Riduan Abid1, Jaouad Boumhidi2, Safae Bourhnane1,3, Asmaa Mourhir1, Fatima Merchant4, Driss Benhaddou4.
Abstract
Genomic data is now one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, which mandates adequate high-performance computing (HPC) platforms for processing. Given the recent unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering), thereby assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
Keywords: RNA; SARS-CoV-2; bioinformatics; data science; high-performance computing; longest common subsequence
Year: 2021 PMID: 35012291 PMCID: PMC8752974 DOI: 10.5808/gi.21056
Source DB: PubMed Journal: Genomics Inform ISSN: 1598-866X
Fig. 1. The general architecture of a high-performance computing service in a private datacenter using Hadoop [25].
Fig. 2. Genomics value chain. (1) Sampling: collecting the DNA/RNA source, (2) Sequencing: determining the order of the nucleotides (A, T, C, G) in the DNA/RNA, (3) Analysis: computing the dissimilarity between sequences using the longest common subsequence algorithm, (4) Interpretation: translating observed results into knowledge, (5) Application: proposing solutions.
Fig. 3. Analysis flowchart. LCS, longest common subsequence.
Fig. 4. Longest common subsequence (LCS) algorithm.
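The LCS-length computation named in Fig. 4 can be illustrated with the textbook dynamic-programming recurrence. The following is a minimal Python sketch, not the authors' Hadoop implementation; the function name `lcs_length` and the two-row memory layout are illustrative assumptions.

```python
def lcs_length(a: str, b: str) -> int:
    """Return the length of the longest common subsequence of a and b."""
    # LCS length is symmetric, so keep the shorter string on the inner
    # dimension: memory is then proportional to the shorter sequence.
    if len(a) < len(b):
        a, b = b, a

    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("AGGTAB", "GXTXAYB"))  # -> 4 ("GTAB")
```

Running time grows with the product of the two sequence lengths, so long genomes make each pairwise comparison expensive, which is why distributing the pairs across worker nodes is attractive.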
Execution time of the comparisons of different portions of data using an ordinary laptop
| No. of sequences | No. of comparisons | Execution time (min) |
|---|---|---|
| 5 | 10 | 2 |
| 10 | 45 | 6 |
| 15 | 105 | 14 |
| 20 | 190 | 26 |
| 25 | 300 | 41 |
| 30 | 435 | 59 |
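The comparison counts in the table equal the number of unordered pairs, n(n-1)/2, for n sequences. The short Python check below, using only values copied from the table, reproduces that column and derives an average time per comparison; the names `measurements` and `pair_count` are hypothetical.

```python
from math import comb

# (number of sequences, execution time in minutes), copied from the table.
measurements = [(5, 2), (10, 6), (15, 14), (20, 26), (25, 41), (30, 59)]


def pair_count(n: int) -> int:
    """Number of unordered sequence pairs, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


if __name__ == "__main__":
    for n, minutes in measurements:
        pairs = pair_count(n)
        # Average minutes spent on a single pairwise comparison.
        print(n, pairs, round(minutes / pairs, 3))
```

Because the number of pairs grows quadratically with the number of sequences, the total time on a single machine grows roughly quadratically as well.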
Fig. 5. Estimated execution time of the longest common subsequence algorithm using different numbers of nodes.
Fig. 6. Range of the longest common subsequence (LCS) lengths. The X and Y axes represent the sequences; the Z axis represents the LCS lengths.
Fig. 7. Dist algorithm.
Fig. 8. UPGMA (unweighted pair group method with arithmetic mean) algorithm.
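The step from LCS lengths to clusters can be sketched in Python as follows. The captions do not give the formula behind the Dist algorithm, so the normalization in `lcs_dissimilarity` (one minus the LCS length divided by the longer sequence length) is an assumption chosen for illustration only. The `upgma` function is a plain textbook version of the method and is not taken from the paper. All function names are hypothetical.

```python
def lcs_dissimilarity(lcs_len: int, len_a: int, len_b: int) -> float:
    """ASSUMED dissimilarity: 1 - LCS length / length of the longer sequence."""
    longest = max(len_a, len_b)
    return 0.0 if longest == 0 else 1.0 - lcs_len / longest


def upgma(dist, labels):
    """Textbook UPGMA; returns merges as (cluster_a, cluster_b, distance)."""
    n = len(labels)
    clusters = {i: (labels[i],) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Merge the two closest clusters.
        (i, j), dij = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[i], clusters[j], dij))
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean = average over all member pairs.
            d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        # Drop every distance that involves one of the merged clusters.
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        merged = clusters.pop(i) + clusters.pop(j)
        clusters[next_id] = merged
        next_id += 1
    return merges


if __name__ == "__main__":
    print(lcs_dissimilarity(3, 4, 6))  # -> 0.5
    toy = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]  # made-up distances, not paper data
    for merge in upgma(toy, ["A", "B", "C"]):
        print(merge)
```

The returned distances are the inter-cluster distances at each merge; a dendrogram would conventionally place each join at half that value.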