Kun-Tze Chen, Hsin-Ting Shen, Chin Lung Lu.
Abstract
BACKGROUND: Scaffolding is an important step in assembling a genome sequence from short reads: the contigs of a draft genome are ordered and oriented into scaffolds. Several scaffolding tools based on a single reference genome have been developed. However, a single reference genome may not be sufficient for a scaffolder to generate correct scaffolds of a target draft genome, especially when the target and reference genomes are evolutionarily distant or rearrangements have occurred between them. This motivates the development of scaffolding tools that order and orient the contigs of the target genome using multiple reference genomes.
Keywords: Bioinformatics; Contig; Multiple reference genomes; Scaffolding; Sequencing
Year: 2018 PMID: 30598087 PMCID: PMC6311912 DOI: 10.1186/s12918-018-0654-y
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Fig. 1 Pseudo-code description of the multiple reference-based scaffolding algorithm used to implement Multi-CSAR
Fig. 2 Schematic workflow of Multi-CSAR. (a) A target genome T = {c1, c2, c3, c4} and three single reference-derived scaffolds S1 = (+c1, +c2, +c3), S2 = (+c2, +c3, +c4) and S3 = (−c2, −c1, −c4, −c3), assumed to be obtained by applying CSAR on three reference genomes R1, R2 and R3, respectively, each with a weight of one. (b) The contig adjacency graph G constructed from S1, S2 and S3, where the dashed lines denote edges with zero weight. (c) A maximum weighted perfect matching M derived by applying Blossom V on G. (d) By removing the minimum weighted edge from M, we obtain M′ such that M′ ∪ C contains no cycles, where the dotted lines denote the edges in C. (e) The final scaffold (+c1, +c2, +c3, +c4) of T, constructed from the edge connections in M′
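The workflow in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Multi-CSAR implementation: every function name is hypothetical, each contig is modelled as two extremities (head `"h"` and tail `"t"`), C is taken to be the set of edges joining the two extremities of the same contig, and a brute-force search stands in for Blossom V, so it is only practical for toy inputs such as the four-contig example. It also assumes at least two contigs and removes exactly one minimum-weight matching edge per cycle.

```python
from collections import defaultdict


def adjacency_weights(scaffolds, weights):
    """Sum reference weights over the extremity adjacencies each scaffold supports."""
    w = defaultdict(float)
    for scaffold, weight in zip(scaffolds, weights):
        for (a, sign_a), (b, sign_b) in zip(scaffold, scaffold[1:]):
            # A forward (+) contig is left through its tail and entered at its head.
            left = (a, "t" if sign_a > 0 else "h")
            right = (b, "h" if sign_b > 0 else "t")
            w[frozenset((left, right))] += weight
    return w


def max_weight_perfect_matching(vertices, w):
    """Brute-force maximum weighted perfect matching (stand-in for Blossom V)."""
    if not vertices:
        return 0.0, []
    v, rest = vertices[0], vertices[1:]
    best = (float("-inf"), [])
    for i, u in enumerate(rest):
        if u[0] == v[0]:
            continue  # G has no edge between the two ends of one contig
        score, edges = max_weight_perfect_matching(rest[:i] + rest[i + 1:], w)
        score += w.get(frozenset((v, u)), 0.0)  # unsupported edges weigh zero
        if score > best[0]:
            best = (score, edges + [(v, u)])
    return best


def build_scaffolds(contigs, scaffolds, weights):
    """Match extremities, cut each cycle at its lightest matching edge, read off scaffolds."""
    w = adjacency_weights(scaffolds, weights)
    vertices = [(c, end) for c in contigs for end in "ht"]
    _, matching = max_weight_perfect_matching(vertices, w)
    mate = {}
    for u, v in matching:
        mate[u], mate[v] = v, u
    seen, result = set(), []
    for c in contigs:
        if c in seen:
            continue
        # Walk the cycle through c, alternating contig edges (C) and matching edges (M).
        steps, entry = [], (c, "h")
        while True:
            seen.add(entry[0])
            exit_end = (entry[0], "t" if entry[1] == "h" else "h")
            nxt = mate[exit_end]
            steps.append((entry, w.get(frozenset((exit_end, nxt)), 0.0)))
            entry = nxt
            if entry == (c, "h"):
                break
        # Remove the minimum-weight matching edge and start the scaffold just after it.
        cut = min(range(len(steps)), key=lambda i: steps[i][1])
        path = steps[cut + 1:] + steps[:cut + 1]
        result.append([(name, 1 if end == "h" else -1) for (name, end), _ in path])
    return result


# The example of Fig. 2: three reference-derived scaffolds, each with weight one.
S1 = [("c1", 1), ("c2", 1), ("c3", 1)]
S2 = [("c2", 1), ("c3", 1), ("c4", 1)]
S3 = [("c2", -1), ("c1", -1), ("c4", -1), ("c3", -1)]
print(build_scaffolds(["c1", "c2", "c3", "c4"], [S1, S2, S3], [1, 1, 1]))
# prints [[('c1', 1), ('c2', 1), ('c3', 1), ('c4', 1)]]
```

In this example the three adjacencies supported by two scaffolds each get weight 2, the adjacency supported only by S3 gets weight 1, and cutting that lightest edge opens the single cycle into the scaffold (+c1, +c2, +c3, +c4).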
Summary of the five testing datasets
| Organism | No. of replicons | No. of contigs | No. of references | Genome size (Mbp) | GC% |
|---|---|---|---|---|---|
| | 4 | 1,223 | 4 | 8.05 | 65.9 |
| | 1 | 451 | 25 | 4.64 | 50.8 |
| | 1 | 116 | 13 | 4.41 | 65.6 |
| | 7 | 564 | 2 | 4.60 | 67.4 |
| | 3 | 170 | 35 | 2.90 | 32.0 |
Average performance of the evaluated multiple reference-based scaffolders on the five testing datasets
| Scaffolder | Sen. | Pre. | F-score | Cov. | NGA50 | #Scaf. | Time |
|---|---|---|---|---|---|---|---|
| Multi-CSAR (NUCmer) | | 90.8 | | | | 9 | |
| Multi-CSAR (PROmer) | 89.3 | 90.4 | 89.8 | 92.5 | 1,016,308 | | 6.3 |
| Ragout | 79.0 | | 84.4 | 87.4 | 992,966 | 84 | 24.8 |
| MeDuSa | 78.2 | 81.9 | 80.0 | 83.3 | 671,001 | 26 | 3.8 |
The values of sensitivity (abbreviated as ‘Sen.’), precision (abbreviated as ‘Pre.’), F-score and genome coverage (abbreviated as ‘Cov.’) are displayed in percentage (%), and the size of NGA50 in base pairs (bp). The column ‘#Scaf.’ gives the number of scaffolds returned by each scaffolder and the column ‘Time’ displays the running time in minutes. The best result in each column is shown in bold
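As a reading aid, and assuming the F-score here is the usual harmonic mean of sensitivity and precision, the relationship can be written as:

$$
F\text{-score} = \frac{2 \times \text{Sen.} \times \text{Pre.}}{\text{Sen.} + \text{Pre.}}
$$

For the MeDuSa row, with sensitivity 78.2 and precision 81.9, this gives $2 \times 78.2 \times 81.9 / (78.2 + 81.9) \approx 80.0$, which agrees with the value 80.0 reported in that row. Because the table lists averages over five datasets, the identity need not hold exactly for every row.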
Average performance of Multi-CSAR on the five testing datasets when using the sequence identity-based weighting scheme
| Scaffolder | Sen. | Pre. | F-score | Cov. | NGA50 | #Scaf. | Time |
|---|---|---|---|---|---|---|---|
| Multi-CSAR (NUCmer) | | | | | | 10 | |
| Multi-CSAR (PROmer) | 89.4 | 90.5 | 89.9 | 92.8 | 1,045,489 | | 6.3 |
The values of sensitivity (abbreviated as ‘Sen.’), precision (abbreviated as ‘Pre.’), F-score and genome coverage (abbreviated as ‘Cov.’) are displayed in percentage (%), and the size of NGA50 in base pairs (bp). The column ‘#Scaf.’ gives the number of resulting scaffolds and the column ‘Time’ displays the running time in minutes. The best result in each column is shown in bold