Vladimir Nikolić, Amirhossein Afshinfard, Justin Chu, Johnathan Wong, Lauren Coombe, Ka Ming Nip, René L Warren, Inanç Birol.
Abstract
BACKGROUND: De novo genome assembly is essential to modern genomics studies. Because it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k - 1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value involves a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values provide good connectivity in regions with lower coverage and higher values increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes.
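The graph construction described above can be sketched in a few lines of Python. This is a minimal illustration only, not the ABySS or RResolver implementation: it ignores reverse complements, coverage, and sequencing errors, and the function names (`kmers`, `build_graph`, `merge_unambiguous`, `assemble`) are hypothetical.

```python
def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Build a de Bruijn graph whose nodes are the distinct k-mers.

    An edge a -> b exists when the last k - 1 bases of a equal the
    first k - 1 bases of b.
    """
    nodes = {km for read in reads for km in kmers(read, k)}
    succ = {n: set() for n in nodes}
    pred = {n: set() for n in nodes}
    for node in nodes:
        for base in "ACGT":
            nxt = node[1:] + base  # overlaps node by k - 1 bases
            if nxt in nodes:
                succ[node].add(nxt)
                pred[nxt].add(node)
    return nodes, succ, pred


def merge_unambiguous(nodes, succ, pred):
    """Merge nodes along unambiguous walks into longer sequences.

    A walk is extended while the current node has exactly one successor
    and that successor has exactly one predecessor. Isolated cycles have
    no starting node and are not emitted by this sketch.
    """
    def is_start(n):
        if len(pred[n]) != 1:
            return True
        (p,) = pred[n]
        return len(succ[p]) != 1

    contigs = []
    for start in sorted(nodes):
        if not is_start(start):
            continue
        contig, cur = start, start
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if len(pred[nxt]) != 1:
                break
            contig += nxt[-1]  # each merged node adds one new base
            cur = nxt
        contigs.append(contig)
    return contigs


def assemble(reads, k):
    """Convenience wrapper: build the graph, then merge unambiguous walks."""
    return merge_unambiguous(*build_graph(reads, k))
```

For example, `assemble(["ATGGCGT", "ATGGCGA"], 4)` returns `['ATGGCG', 'GCGA', 'GCGT']`: the two reads share the walk ATGG, TGGC, GGCG, which is merged into one sequence, and the graph then branches at the final base, so the two tips remain separate.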
Keywords: Bloom filters; De novo assembly; Repeat resolution; Scalable; Short reads
Year: 2022 PMID: 35729491 PMCID: PMC9215042 DOI: 10.1186/s12859-022-04790-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1. H. sapiens parameter sweep QUAST results. NGA50 and misassembly scaffold metrics with and without RResolver. High-quality assemblies lean towards top left corner, with high contiguity and low misassemblies. The text labels indicate the offset between and used for each data point. Some text labels (for smaller triangles) and overlapping data points are omitted to reduce crowdedness in the plot, while keeping the trends. All RResolver data points have higher NGA50 than the corresponding baseline assembly, and some have fewer misassemblies. For bp datasets, picking the highest increases NGA50 the most while keeping misassembly increase moderate. For datasets, picking the highest is not necessarily optimal as it leads to increased misassemblies, and a is a good empirical choice for balancing NGA50 increase and minimizing misassemblies.
Fig. 2. H. sapiens subsampled coverage QUAST results. NGA50 and misassemblies plots for a bp and a bp human dataset. The text labels indicate the offset between and used for each data point. Each subplot ABySS base assembly uses optimal value. As in Fig. 1, the highest is a good choice for bp datasets, and an offset of 60 works well for bp, giving a good contiguity improvement without increasing misassemblies too much.
Fig. 3. C. elegans and A. thaliana QUAST results. NGA50 and misassembly plots for C. elegans and A. thaliana. The heuristic is used, limited by read size of 110 bp and 151 bp. Both datasets see an improvement in contiguity, with a moderate increase in misassemblies in some cases. For C. elegans , no resolvable repeats were found and hence no change in assembly quality.