Yu Chen1, DongLiang You1, TianJiao Zhang1, GuoHua Wang1,2.
Abstract
In genome assembly, contig assembly is one of the most important steps. It requires processing the overlapping regions of a large number of DNA sequences, and this computation usually takes a long time. The running time of a contig assembly algorithm is therefore an important indicator of its quality. Existing methods for processing overlapping regions of sequences are too slow. We therefore propose SLDMS, a method for processing sequence overlapping regions based on a suffix array and a monotonic stack, which effectively improves the efficiency of overlap processing. SLDMS runs much faster than Canu and Flye when computing sequence overlap intervals, and on data in which most sequencing errors occur at both ends of the sequencing data, its running time is only about one-tenth that of the other two methods.
Keywords: algorithm; application; contig assembly; genome assembly; overlapping regions; sequence analysis
Year: 2022 PMID: 35046988 PMCID: PMC8761809 DOI: 10.3389/fpls.2021.813036
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
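The abstract names two building blocks, a suffix array and a monotonic stack, without giving details in this record. The following is a minimal sketch of how those two structures can be combined to find suffix-prefix overlaps between reads. It is an illustration under stated assumptions, not the SLDMS implementation: the function name `find_overlaps`, the `min_len` parameter, the naive suffix-array construction, and the restriction to exact, proper overlaps are all choices made here for brevity.

```python
def find_overlaps(reads, min_len=1):
    """Return sorted (a, b, n) triples where the last n letters of reads[a]
    equal the first n letters of reads[b] (exact, proper overlaps only).

    Illustrative sketch; not the authors' SLDMS code.
    """
    # Concatenate the reads as integers. Read k ends with the unique
    # terminator -(k + 1), which sorts below every letter and matches nothing.
    text, owner, start = [], [], []
    for k, read in enumerate(reads):
        start.append(len(text))
        text.extend(ord(c) for c in read)
        text.append(-(k + 1))
        owner.extend([k] * (len(read) + 1))
    end = [start[k] + len(reads[k]) for k in range(len(reads))]

    # Naive suffix array: acceptable for a demo, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def lcp(i, j):
        # Longest common prefix of two suffixes; stops at any terminator.
        n = 0
        while (i + n < len(text) and j + n < len(text)
               and text[i + n] == text[j + n]):
            n += 1
        return n

    overlaps = []
    # Stack of (suffix_length, read_index). Lengths are non-decreasing from
    # bottom to top, and every entry is a prefix of the current suffix.
    stack = []
    for rank, pos in enumerate(sa):
        h = lcp(sa[rank - 1], pos) if rank > 0 else 0
        # Suffixes longer than the shared prefix can no longer match.
        while stack and stack[-1][0] > h:
            stack.pop()
        k = owner[pos]
        length = end[k] - pos  # letters between pos and read k's terminator
        if pos == start[k]:
            # A whole read: each stacked suffix is a prefix of it.
            for n, a in stack:
                if a != k and min_len <= n < length:
                    overlaps.append((a, k, n))
        elif length > 0:
            stack.append((length, k))
    return sorted(overlaps)


reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(find_overlaps(reads, min_len=3))
```

Because each terminator sorts below every letter, a read suffix always appears in the suffix array before any read that starts with it, so a single top-to-bottom pass with the stack is enough. A production tool would replace the quadratic suffix sorting and pairwise LCP calls with linear-time constructions and would need to tolerate sequencing errors, which this sketch ignores.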
FIGURE 1. The overall workflow of the SLDMS.
FIGURE 2. The elements stored in three arrays.
FIGURE 3. The position of str2 in the suffix set of str1.
FIGURE 4. (A) Schematic diagram of the monotonic stack maintenance process. (B) Diagram of the process of obtaining overlapping regions information.
FIGURE 5. (A) Maintenance process diagram of the monotonic stack. (B) Flow diagram of overlapping regions information acquisition.
FIGURE 6. Update process diagram of the min-heap: (A) from bottom to top and (B) from top to bottom.
The PacBio-HiFi dataset used in the experiment and its description.
| Dataset | Size (MB) | Description |
|---|---|---|
| | 579 | WGS of |
| | 1,914 | WGS of |
| | 1,131 | WGS of |
| | 2,190 | WGS of |
| | 1,567 | WGS of |
| | 2,760 | WGS of |
FIGURE 7. Time required for different software programs to run PacBio-HiFi datasets to find the overlapping regions.
FIGURE 8. Time required for different software programs to run ultrahigh-accuracy simulation datasets to find overlapping regions.
FIGURE 9. Time required for different software programs to run simulation datasets with errors to find overlapping regions.