Jinxiang Chen, Fuyi Li, Miao Wang, Junlong Li, Tatiana T Marquez-Lago, André Leier, Jerico Revote, Shuqin Li, Quanzhong Liu, Jiangning Song.
Abstract
BACKGROUND: Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. SSRs have been shown to be associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, sequenced genomes often miss several highly repetitive regions, and many non-model species lack a complete genome altogether. With recent advances in next-generation sequencing (NGS) techniques, large-scale sequence reads can be generated rapidly for any species. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. Because the most commonly used NGS platforms (e.g., the Illumina platform) generally provide short paired-end reads, merging overlapping paired-end reads has become a common step prior to the identification of SSR loci. Merging short read pairs and identifying SSRs from large-scale data therefore poses a big data analysis challenge for traditional stand-alone tools.
Keywords: Hadoop; Simple Sequence Repeats (SSR); big data; next-generation sequencing; read pairs
Year: 2022 PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
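The abstract defines SSRs as short tandem repeats of nucleotide sequences. A minimal sketch of that definition, assuming perfect (mismatch-free) repeats, is given below. It is a plain regular-expression scan and not the PERF or BigPERF algorithm; the function name `find_ssrs` and the motif-length and repeat-count thresholds are hypothetical choices made for illustration.

```python
import re


def find_ssrs(seq, min_motif=1, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    All thresholds are illustrative defaults, not values from the paper.
    """
    # The lazy quantifier prefers the shortest motif; \1 demands exact
    # back-to-back copies of that motif.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


print(find_ssrs("TTACGACGACGACGTT"))  # [(2, 'ACG', 4)]
```

Scanning each read independently in this way is what makes the task easy to distribute: every read can be handled by a separate map task without coordination.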
Bioinformatics tools developed based on Big Data technologies for handling large-scale sequence datasets.
| Big Data technology | Tool | Year | Function | | |
|---|---|---|---|---|---|
| Hadoop | BigBWA (Abuín et al.) | 2015 | Alignment | Yes | No |
| Spark | SparkBWA (Abuín et al.) | 2016 | | Yes | No |
| Spark | SparkSW (Zhao et al.) | 2015 | | Yes | No |
| Hadoop | Hadoop-BAM (Niemenmaa et al.) | 2012 | | Yes | No |
| Spark | DSA (Bo et al.) | 2017 | | Yes | No |
| Spark | CloudSW (Bo et al.) | 2017 | | Yes | No |
| Spark | SparkBLAST (Castro et al.) | 2017 | | Yes | No |
| Hadoop | Cloudblast (Matsunaga et al.) | 2008 | | Yes | No |
| Hadoop | HAlign (Zou et al.) | 2015 | | Yes | No |
| Hadoop | HSRA (Expósito et al.) | 2018 | | Yes | No |
| Spark | PASTASpark (Abuín et al.) | 2017 | | Yes | No |
| Hadoop | CloudAligner (Nguyen et al.) | 2011 | | Yes | Yes |
| Hadoop | CloudBurst (Schatz) | 2009 | | Yes | No |
| Hadoop | BioPig (Nordberg et al.) | 2013 | Sequence analysis | Yes | No |
| Hadoop | Halvade (Decap et al.) | 2015 | | Yes | No |
| Hadoop | Halvade-RNA (Decap et al.) | 2017 | | Yes | No |
| Spark | HiGene (Deng et al.) | 2016 | Genome analysis | No | No |
| Spark | GATK-Spark (Li et al.) | 2016 | | No | No |
| Spark | SparkSeq (Wiewiórka et al.) | 2014 | | Yes | No |
| Hadoop | GATK (McKenna et al.) | 2010 | | Yes | No |
| Spark | MEC (Zhao et al.) | 2017 | Error correction | Yes | No |
| Hadoop | MarDRe (Expósito et al.) | 2017 | Removal of duplicate DNA reads | Yes | No |
| Spark | MetaSpark (Zhou et al.) | 2017 | Metagenomic read recruitment | Yes | No |
| Spark | Spaler (Abu-Doleh and Catalyurek) | 2015 | | No | No |
| Hadoop & | SA-BR-MR and SA-BR-Spark (Dong et al.) | 2017 | Sequence assembly | No | No |
| Hadoop & | Falco (Yang et al.) | 2017 | RNA-seq processing | Yes | No |
| Spark | SpaRC (Shi et al.) | 2019 | Clustering analysis | Yes | No |
| Hadoop & Spark | GMQL (Masseroli et al.) | 2019 | NGS tertiary data analysis | Yes | Yes |
| Hadoop | SeqPig (Schumacher et al.) | 2014 | Sequence processing | Yes | No |
Figure 1. The overall framework of the BigFiRSt methodology. BigFiRSt contains two modules. (A) BigFLASH is used to merge short read pairs. (B) BigPERF is used to mine SSRs contained in reads.
Figure 2. The detailed workflow of (A) BigFLASH and (B) BigPERF.
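The figure captions assign read-pair merging to BigFLASH. As a rough illustration of what merging an overlapping pair involves, the sketch below joins a forward read with the reverse complement of its mate when their ends overlap. It is not the FLASH or BigFLASH implementation: the function names, the `min_overlap` and `max_mismatch_ratio` parameters with their defaults, and the longest-overlap-first rule are all assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair, or return None if no acceptable overlap exists.

    The mate is reverse-complemented so both reads lie on the same strand,
    then the 3' end of `forward` is compared with the 5' end of the mate.
    Parameter defaults are hypothetical.
    """
    mate = reverse_complement(reverse)
    longest = min(len(forward), len(mate))
    # Try the longest candidate overlap first and accept the first one
    # whose mismatch fraction is within the tolerance.
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = forward[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches / overlap <= max_mismatch_ratio:
            return forward + mate[overlap:]
    return None
```

Because each pair is merged independently of every other pair, the input can be split into blocks of read pairs and processed by separate map tasks.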
Figure 3. The overall architecture of the Hadoop cluster in the experiment.
Configurations for each machine used in the experiment.
| | | |
|---|---|---|
| CPU in each node | Intel Core Processor (Skylake, IBRS) | Intel Core Processor (Skylake, IBRS) |
| Number of cores in each node | 8 | 32 |
| RAM in each node | 32 GB | 128 GB |
| Disk in each node | 650 GB SSD general-purpose disk | 5 TB SSD general-purpose disk |
Main characteristics of the input datasets for read pairs merging.
| | | | | |
|---|---|---|---|---|
| D1 | SRR642648 | 99356100 | 100 | 52.2 |
| D2 | SRR642751 | 179922078 | 100 | 99.2 |
| D3 | SRR622459 | 1222689201 | 100 | 584.8 |
Experimental results for merging read pairs by the original FLASH algorithm.
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 1141.159 | 1036.956 | 1238.376 | 985.176 | 87,066 | 95,815 | 80,231 | 100,851 | 72.03% |
| D2 | 1579.457 | 1371.293 | 1431.130 | 1594.532 | 113,914 | 131,206 | 125,720 | 112,837 | 29.33% |
| D3 | 9821.888 | 9258.983 | 8867.265 | 9260.385 | 124,486 | 132,054 | 137,888 | 132,034 | 12.7% |
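The column headers of this table did not survive in this copy, but the numbers suggest a relationship: the second group of four values appears to equal the record count of each dataset (third column of the input dataset table) divided by the corresponding run time. The sketch below checks that reading. The names `RECORDS`, `FLASH_TIMES`, and `throughput` are hypothetical, and the interpretation of the columns (including the time unit) is an assumption inferred from the arithmetic, not something stated in the source.

```python
# Record counts and FLASH run times copied from the tables in this section.
RECORDS = {"D1": 99_356_100, "D2": 179_922_078, "D3": 1_222_689_201}
FLASH_TIMES = {
    "D1": [1141.159, 1036.956, 1238.376, 985.176],
    "D2": [1579.457, 1371.293, 1431.130, 1594.532],
    "D3": [9821.888, 9258.983, 8867.265, 9260.385],
}


def throughput(records, elapsed):
    """Records processed per unit of time, rounded to a whole number."""
    return round(records / elapsed)


for name, times in FLASH_TIMES.items():
    print(name, [throughput(RECORDS[name], t) for t in times])
```

For D1 this gives 87066, 95815, 80231, and 100851, matching the values in the D1 row.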
Average execution time for merging read pairs by BigFLASH in the cluster.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| D1 | 433.835 | 293.052 | 200.535 | 165.563 | 2.630 | 3.538 | 6.175 | 5.950 |
| D2 | 946.030 | 482.686 | 335.814 | 283.556 | 1.670 | 2.840 | 4.262 | 5.623 |
| D3 | 5360.550 | 3162.039 | 2354.566 | 1704.687 | 1.832 | 2.928 | 3.766 | 5.432 |
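With the headers missing, the meaning of the last four columns has to be inferred. They appear to be speedups, that is, the FLASH run time from the previous table divided by the BigFLASH time in the matching column. The sketch below recomputes them under that assumption; `FLASH_TIMES`, `BIGFLASH_TIMES`, and `speedups` are hypothetical names, and the column pairing is an inference from the arithmetic.

```python
# Run times copied from the FLASH and BigFLASH tables in this section.
FLASH_TIMES = {
    "D1": [1141.159, 1036.956, 1238.376, 985.176],
    "D2": [1579.457, 1371.293, 1431.130, 1594.532],
    "D3": [9821.888, 9258.983, 8867.265, 9260.385],
}
BIGFLASH_TIMES = {
    "D1": [433.835, 293.052, 200.535, 165.563],
    "D2": [946.030, 482.686, 335.814, 283.556],
    "D3": [5360.550, 3162.039, 2354.566, 1704.687],
}


def speedups(baseline, parallel):
    """Element-wise ratio of baseline time to parallel time."""
    return [b / p for b, p in zip(baseline, parallel)]


for name in FLASH_TIMES:
    ratios = speedups(FLASH_TIMES[name], BIGFLASH_TIMES[name])
    print(name, [f"{r:.3f}" for r in ratios])
```

The D1 ratios come out at roughly 2.630, 3.538, 6.175, and 5.950, in line with the reported row; small differences in the last digit for other rows are consistent with rounding of the tabulated times.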
Execution time of all map tasks of BigFLASH in five experiments.
| | | | | | | |
|---|---|---|---|---|---|---|
| D1 | 8 | 2857.762 | 2722.378 | 2644.078 | 2677.849 | 2626.778 |
| | 16 | 3808.937 | 3912.256 | 4098.451 | 3800.208 | 3839.815 |
| | 24 | 4254.219 | 3953.122 | 3927.272 | 3840.440 | 3914.564 |
| | 32 | 4347.978 | 4218.607 | 4292.595 | 4350.655 | 4452.450 |
| D2 | 8 | 5894.834 | 6289.555 | 6053.989 | 6255.050 | 6087.408 |
| | 16 | 6625.868 | 6716.897 | 6722.830 | 6382.543 | 6497.228 |
| | 24 | 7225.822 | 6868.207 | 6736.306 | 6786.839 | 6720.426 |
| | 32 | 7775.674 | 7731.089 | 7676.768 | 7913.205 | 7626.669 |
| D3 | 8 | 34644.069 | 33557.823 | 34702.026 | 36898.111 | 35523.228 |
| | 16 | 35653.879 | 45796.694 | 45262.908 | 44955.438 | 43740.268 |
| | 24 | 43463.962 | 42863.979 | 45251.582 | 57696.282 | 65181.182 |
| | 32 | 49702.159 | 49089.883 | 49223.694 | 48896.363 | 48600.491 |
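Five repeated measurements per configuration invite a quick look at their spread. The sketch below summarizes the D3 rows with a mean and a coefficient of variation (sample standard deviation divided by the mean). The names `D3_MAP_TIMES` and `summarize` are hypothetical, and treating the second column (8, 16, 24, 32) as a configuration label is an assumption, since the headers are missing from this copy.

```python
from statistics import mean, stdev

# D3 rows copied from the table above, keyed by the second-column value.
D3_MAP_TIMES = {
    8: [34644.069, 33557.823, 34702.026, 36898.111, 35523.228],
    16: [35653.879, 45796.694, 45262.908, 44955.438, 43740.268],
    24: [43463.962, 42863.979, 45251.582, 57696.282, 65181.182],
    32: [49702.159, 49089.883, 49223.694, 48896.363, 48600.491],
}


def summarize(runs):
    """Return (mean, coefficient of variation) for repeated measurements."""
    avg = mean(runs)
    return avg, stdev(runs) / avg


for label, runs in D3_MAP_TIMES.items():
    avg, cv = summarize(runs)
    print(f"{label:>2}: mean={avg:.3f} cv={cv:.3f}")
```

On these values the row labeled 24 shows the widest relative spread, driven by its last two runs, so averages for that configuration deserve more caution than the others.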
Amount of data processed in the Map phase of BigFLASH.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| D1 | 257,041 | 382,931 | 574,468 | 710,922 | 2.952 | 3.997 | 7.160 | 7.049 |
| D2 | 205,922 | 409,591 | 602,577 | 720,183 | 1.808 | 3.122 | 4.793 | 6.383 |
| D3 | 244,084 | 425,709 | 552,586 | 771,923 | 1.961 | 3.224 | 4.007 | 5.846 |
Input datasets for mining SSRs.
| | | | | |
|---|---|---|---|---|
| D1' | MSRR642648 | 71568961 | 100–200 | 14.4 |
| D2' | MSRR642751 | 52777550 | | 12.1 |
| D3' | MSRR622459 | 155236691 | | 30.4 |
Running information of the original PERF algorithm.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| D1' | 16311.800 | 4,388 | 6 | 3 | 5 | 0 | 500 |
| D2' | 12418.600 | 4,250 | | | | | |
| D3' | 44976.800 | 3,451 | | | | | |
Execution time of BigPERF for searching SSRs.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| D1' | 718.982 | 437.236 | 319.176 | 273.860 | 22.687 | 37.307 | 51.106 | 59.562 |
| D2' | 576.286 | 364.331 | 260.621 | 227.397 | 21.549 | 34.086 | 47.650 | 54.611 |
| D3' | 1633.061 | 893.130 | 706.073 | 658.399 | 27.541 | 50.359 | 63.700 | 68.312 |
Execution times of all map tasks of BigPERF in five experiments.
| | | | | | | |
|---|---|---|---|---|---|---|
| D1' | 8 | 4714.821 | 4748.217 | 4734.551 | 4719.849 | 4705.533 |
| | 16 | 5695.032 | 6065.955 | 6049.859 | 6015.885 | 6108.858 |
| | 24 | 6568.444 | 6510.503 | 6434.264 | 6381.585 | 6398.347 |
| | 32 | 7350.969 | 7285.193 | 7149.244 | 7183.113 | 7102.169 |
| D2' | 8 | 3823.328 | 3795.953 | 3752.347 | 3776.384 | 3752.877 |
| | 16 | 4816.881 | 4883.854 | 4815.179 | 4839.188 | 4855.779 |
| | 24 | 5351.771 | 5187.211 | 5184.548 | 5221.581 | 5214.872 |
| | 32 | 6055.070 | 5837.347 | 5917.858 | 5793.088 | 5863.227 |
| D3' | 8 | 11080.473 | 11080.752 | 11109.434 | 11066.792 | 11126.353 |
| | 16 | 12746.333 | 12839.853 | 12649.824 | 12608.740 | 12767.178 |
| | 24 | 15301.268 | 15432.642 | 15299.017 | 15233.146 | 15261.768 |
| | 32 | 20687.485 | 19501.560 | 19232.093 | 17596.133 | 17472.989 |
Amount of data processed by BigPERF in the Map phase.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| D1' | 106,037 | 179,307 | 254,866 | 307,540 | 24.165 | 40.863 | 58.082 | 70.087 |
| D2' | 97,732 | 163,493 | 232,012 | 277,620 | 22.996 | 38.469 | 54.591 | 65.322 |
| D3' | 111,955 | 195,230 | 243,420 | 262,862 | 32.441 | 56.572 | 70.536 | 76.170 |
Figure 4. Runtime performance comparison between BigFLASH and FLASH for merging read pairs.
Figure 5. Runtime performance comparison between BigPERF and PERF for mining SSRs.