Shaun D Jackman1, Lauren Coombe2, Justin Chu2, Rene L Warren2, Benjamin P Vandervalk2, Sarah Yeo2, Zhuyi Xue2, Hamid Mohamadi2, Joerg Bohlmann3, Steven J M Jones2, Inanc Birol2.
Abstract
BACKGROUND: Genome sequencing yields the sequence of many short snippets of DNA (reads) from a genome. Genome assembly attempts to reconstruct the original genome from which these reads were derived. This task is difficult due to gaps and errors in the sequencing data, repetitive sequence in the underlying genome, and heterozygosity. As a result, assembly errors are common. In the absence of a reference genome, these misassemblies may be identified by comparing the sequencing data to the assembly and looking for discrepancies between the two. Once identified, these misassemblies may be corrected, improving the quality of the assembled sequence. Although tools exist to identify and correct misassemblies using Illumina paired-end and mate-pair sequencing, no such tool yet exists that makes use of the long-distance information of the large molecules provided by linked reads, such as those offered by the 10x Genomics Chromium platform. We have developed the tool Tigmint to address this gap.
Keywords: 10x Genomics Chromium; Assembly correction; Genome scaffolding; Genome sequence assembly; Linked reads
Year: 2018 PMID: 30367597 PMCID: PMC6204047 DOI: 10.1186/s12859-018-2425-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A hypothetical genome with two linear chromosomes is assembled into three contigs, one of which is misassembled. In its current misassembled state, this assembly cannot be completed by scaffolding alone. The misassembled contig must first be corrected by cutting it at the location of the misassembly. After the misassembly is corrected, each chromosome may be assembled into a single scaffold
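The correction step described in this caption, cutting a contig at the location of a misassembly, can be illustrated with a minimal Python sketch. This is not Tigmint's implementation: the function name `cut_contig`, the 0-based breakpoint coordinates, and the `-1`, `-2` naming of the resulting pieces are all assumptions made for illustration, and the breakpoint positions are taken as given rather than detected.

```python
def cut_contig(name, sequence, breakpoints):
    """Split a contig at the given 0-based breakpoint positions.

    Hypothetical helper for illustration only. Returns a list of
    (piece_name, subsequence) pairs; piece names carry a 1-based suffix.
    """
    # Keep only positions strictly inside the contig, sorted and de-duplicated.
    cuts = sorted({p for p in breakpoints if 0 < p < len(sequence)})
    bounds = [0] + cuts + [len(sequence)]
    pieces = []
    for i in range(len(bounds) - 1):
        start, end = bounds[i], bounds[i + 1]
        pieces.append((f"{name}-{i + 1}", sequence[start:end]))
    return pieces


# A toy contig that wrongly joins two regions at position 4.
pieces = cut_contig("contig2", "AAAACCCC", [4])
for piece_name, piece_seq in pieces:
    print(piece_name, piece_seq)
```

Once the contig is cut, the resulting pieces can be passed to a scaffolder independently, which is why the corrected assembly in the figure can be completed where the uncorrected one cannot.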
Fig. 2 The block diagram of Tigmint. Input files are shown in parallelograms. Intermediate files are shown in rectangles. Output files are shown in ovals. File formats are shown in parentheses
Genome assemblies of both short and long read sequencing were used to evaluate Tigmint
| Sample | Sequencing Platform | Assembler |
|---|---|---|
| HG004 | Illumina | ABySS |
| HG004 | Illumina | DISCOVARdenovo |
| HG004 | 10x Chromium | Supernova |
| HG004 | PacBio | Falcon |
| NA12878 | Oxford Nanopore | Canu |
The GIAB sample HG004 is also known as NA24143. See “Availability of data and material” to access the sequencing data and assemblies
Fig. 3 Assembly contiguity and correctness metrics of HG004 with and without correction using Tigmint prior to scaffolding with ARCS. The most contiguous and correct assemblies are found in the top-left. Supernova assembled linked reads only, whereas the others used paired-end and mate-pair reads
The assembly contiguity (scaffold NG50 and NGA50) and correctness (number of misassemblies) metrics with and without correction using Tigmint prior to scaffolding with ARCS
| Sample | Assembly | NG50 (Mbp) | NGA50 (Mbp) | Misass. | Reduction |
|---|---|---|---|---|---|
| HG004 | ABySS | 3.65 | 3.09 | 790 | NA |
| | ABySS+Tigmint | 3.47 | 3.09 | 574 | 216 (27.3%) |
| | ABySS+ARCS | 9.91 | 7.86 | 823 | NA |
| | ABySS+Tigmint+ARCS | 26.39 | 16.43 | 641 | 182 (22.1%) |
| HG004 | DISCO+ABySS | 10.55 | 9.04 | 701 | NA |
| | DISCO+ABySS+Tigmint | 10.16 | 9.04 | 666 | 35 (5.0%) |
| | DISCO+ABySS+ARCS | 29.20 | 17.05 | 829 | NA |
| | DISCO+ABySS+Tigmint+ARCS | 35.31 | 23.68 | 804 | 25 (3.0%) |
| HG004 | DISCO+BESST | 7.01 | 6.14 | 568 | NA |
| | DISCO+BESST+Tigmint | 6.77 | 6.14 | 493 | 75 (13.2%) |
| | DISCO+BESST+ARCS | 27.64 | 15.14 | 672 | NA |
| | DISCO+BESST+Tigmint+ARCS | 33.43 | 19.40 | 603 | 69 (10.3%) |
| HG004 | Supernova | 38.48 | 12.65 | 1005 | NA |
| | Supernova+Tigmint | 17.72 | 11.43 | 923 | 82 (8.2%) |
| | Supernova+ARCS | 39.63 | 13.24 | 1052 | NA |
| | Supernova+Tigmint+ARCS | 27.35 | 12.60 | 998 | 54 (5.1%) |
| HG004 | Falcon | 4.56 | 4.21 | 3640 | NA |
| | Falcon+Tigmint | 4.45 | 4.21 | 3444 | 196 (5.4%) |
| | Falcon+ARCS | 18.14 | 9.71 | 3801 | NA |
| | Falcon+Tigmint+ARCS | 22.52 | 11.97 | 3574 | 227 (6.0%) |
| NA12878 | Canu | 7.06 | 5.40 | 1688 | NA |
| | Canu+Tigmint | 6.87 | 5.38 | 1600 | 88 (5.2%) |
| | Canu+ARCS | 19.70 | 10.12 | 1736 | NA |
| | Canu+Tigmint+ARCS | 22.01 | 10.85 | 1626 | 110 (6.3%) |
| Simulated | ABySS | 9.00 | 8.28 | 272 | NA |
| | ABySS+Tigmint | 8.61 | 8.28 | 217 | 55 (20.2%) |
| | ABySS+ARCS | 23.37 | 17.09 | 365 | NA |
| | ABySS+Tigmint+ARCS | 30.24 | 24.98 | 320 | 45 (12.3%) |
ABySS and DISCOVARdenovo are assemblies of Illumina sequencing. Supernova is an assembly of linked read sequencing. Falcon is an assembly of PacBio sequencing. Canu is an assembly of Oxford Nanopore sequencing. Data simulated with LRSim is assembled with ABySS
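The values in the Reduction column are consistent with a simple calculation: the misassembly count without Tigmint minus the count with Tigmint, expressed also as a percentage of the count without Tigmint. The Python sketch below reproduces that arithmetic for rows of the table. The function name `misassembly_reduction` is hypothetical, and the formula is inferred from the tabulated numbers rather than stated in the text.

```python
def misassembly_reduction(without_tigmint, with_tigmint):
    """Return (absolute reduction, percent reduction rounded to 1 decimal).

    Formula inferred from the table: the percentage is taken relative to
    the misassembly count of the assembly that was not corrected.
    """
    reduction = without_tigmint - with_tigmint
    percent = round(100.0 * reduction / without_tigmint, 1)
    return reduction, percent


# Misassembly counts copied from the table rows for HG004.
print(misassembly_reduction(790, 574))    # ABySS vs. ABySS+Tigmint
print(misassembly_reduction(823, 641))    # ABySS+ARCS vs. ABySS+Tigmint+ARCS
print(misassembly_reduction(3801, 3574))  # Falcon+ARCS vs. Falcon+Tigmint+ARCS
```

Note that rows scaffolded with ARCS are compared against the corresponding ARCS row without Tigmint, not against the unscaffolded assembly.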
Fig. 4 Assembly contiguity and correctness metrics of HG004 corrected with NxRepair, which uses mate pairs, and Tigmint, which uses linked reads. The most contiguous and correct assemblies are found in the top-left
Fig. 5 Assemblies of Oxford Nanopore sequencing of NA12878 with Canu and PacBio sequencing of HG004 with Falcon with and without correction using Tigmint prior to scaffolding with ARCS
Fig. 6 The alignments to the reference genome of the ABySS assembly of HG004 before and after Tigmint. The reference chromosomes are on the left in colour, the assembly scaffolds on the right in grey. No translocations are visible after Tigmint
Fig. 7 (panels a–d) Effect of varying the window and span parameters on scaffold NGA50 and misassemblies of three assemblies of HG004