Ksenia Khelik, Geir Kjetil Sandve, Alexander Johan Nederbragt, Torbjørn Rognes.
Abstract
BACKGROUND: Advances in whole genome sequencing strategies have provided the opportunity for genomic and comparative genomic analysis of a vast variety of organisms. The analysis results are highly dependent on the quality of the genome assemblies used. Assessment of the assembly accuracy may significantly increase the reliability of the analysis results and is therefore of great importance.
Keywords: Assembly accuracy assessment; Assembly errors; Genome assembly; Illumina paired-end reads; Structural variant detection
Year: 2020 PMID: 32085722 PMCID: PMC7035700 DOI: 10.1186/s12859-020-3414-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Path gap exclusion. The black line represents an assembly. The assembly regions marked in red correspond to repeated regions. The repeated regions are identical or near-identical copies of the same repeat or copies of different repeats. The arrows represent read paths. (a) Exclusion of a path gap fully covered by a read path of the same type and the other orientation. The rectangles between read paths indicate path gaps. Path gap 1 is excluded due to the presence of a required read path. The path gaps marked by number 2 are not excluded and require further analysis. (b) Exclusion of a path gap that appears due to the alternation of paths of different types. The black squares mark the locations of assembly errors. The rectangles between read paths indicate path gaps that are not excluded. The path gaps marked by number 3 are not excluded due to the repetition of read path types (e.g. the Single forward-oriented path is followed by another Single forward-oriented path instead of the Single-Multiple forward-oriented path). The path gaps marked by number 4 are not excluded because one read path type is missing (e.g. the Multiple forward-oriented path is followed by the Single forward-oriented path instead of the Multiple-Single forward-oriented path)
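The exclusion rule in panel (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `ReadPath` and `PathGap` classes, their fields, and the function `is_covered_by_other_orientation` are hypothetical names, coordinates are assumed to be inclusive assembly positions, and a path gap is assumed to carry the type and orientation of the read paths that flank it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPath:
    """Hypothetical read path: an interval on the assembly with a type and orientation."""
    start: int
    end: int
    kind: str          # e.g. "Single", "Multiple", "Single-Multiple"
    orientation: str   # "forward" or "reverse"


@dataclass(frozen=True)
class PathGap:
    """Hypothetical path gap: the uncovered interval between two read paths.

    Assumption: the gap inherits the type and orientation of its flanking paths.
    """
    start: int
    end: int
    kind: str
    orientation: str


def is_covered_by_other_orientation(gap, paths):
    """Return True if some read path of the same type and the other
    orientation spans the whole gap (the rule described for panel (a))."""
    return any(
        p.kind == gap.kind
        and p.orientation != gap.orientation
        and p.start <= gap.start
        and p.end >= gap.end
        for p in paths
    )


# Usage: a forward Single gap spanned by a reverse Single path is excluded.
gap = PathGap(100, 150, "Single", "forward")
paths = [ReadPath(80, 200, "Single", "reverse")]
print(is_covered_by_other_orientation(gap, paths))
```

Gaps for which the function returns False would, as the caption says of the gaps marked 2, remain candidates for further analysis.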
Fig. 2 Error location adjustment. The black line represents an assembly. The arrows represent read paths of any type. The rectangles represent initial path gaps. The red areas in the rectangles in cases a) and b) correspond to the adjusted path gaps with the shortened beginning and end, respectively
Fig. 3 ROC-like plot based on the simulated datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions
Fig. 4 ROC-like plot based on the simulated datasets with varying sequencing coverage. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying sequencing coverage (indicated with different symbols)
Fig. 5 ROC-like plot based on Assemblathon 1 datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for seven tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions
Fig. 6 ROC-like plot based on bacterial genome datasets with varying flanking region size. The sensitivity and false discovery rate (FDR) are plotted for six tools (indicated with different colours) using varying flanking region sizes (indicated with different symbols). The flanking region size corresponds to the amount of slack allowed in the position of correct predictions
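The two quantities plotted in Figs. 3 to 6 can be illustrated with a short Python sketch. It assumes the standard definitions (sensitivity = detected true errors / all true errors; FDR = incorrect predictions / all predictions) and one plausible reading of the flanking region: a predicted interval counts as correct when a true error lies within `flank` bases of it. The function name `sensitivity_and_fdr` and the input layout are hypothetical, and the exact matching procedure used in the article may differ.

```python
def sensitivity_and_fdr(predicted, true_errors, flank):
    """Compute (sensitivity, FDR) for predicted error intervals.

    predicted   : list of (start, end) intervals reported by a tool
    true_errors : list of known error positions
    flank       : slack, in bases, allowed on each side of a prediction
    """
    def matches(interval, position):
        start, end = interval
        return start - flank <= position <= end + flank

    # Predictions that hit at least one true error within the slack.
    correct_predictions = sum(
        any(matches(iv, pos) for pos in true_errors) for iv in predicted
    )
    # True errors that are hit by at least one prediction within the slack.
    detected_errors = sum(
        any(matches(iv, pos) for iv in predicted) for pos in true_errors
    )

    sensitivity = detected_errors / len(true_errors) if true_errors else 0.0
    fdr = (len(predicted) - correct_predictions) / len(predicted) if predicted else 0.0
    return sensitivity, fdr


# Usage: widening the flanking region turns a near miss into a correct prediction.
predicted = [(100, 120), (500, 510), (900, 905)]
true_errors = [130, 505, 2000]
for flank in (0, 10):
    print(flank, sensitivity_and_fdr(predicted, true_errors, flank))
```

With these toy inputs, increasing `flank` from 0 to 10 raises sensitivity and lowers FDR, which is the kind of shift the different symbols in the plots represent for each tool.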