Alina Ott, James C Schnable, Cheng-Ting Yeh, Linjiang Wu, Chao Liu, Heng-Cheng Hu, Clifton L Dalgard, Soumik Sarkar, Patrick S Schnable.
Abstract
BACKGROUND: Short read DNA sequencing technologies have revolutionized genome assembly by providing high accuracy and throughput data at low cost. But it remains challenging to assemble short read data, particularly for large, complex and polyploid genomes. The linked read strategy has the potential to enhance the value of short reads for genome assembly because all reads originating from a single long molecule of DNA share a common barcode. However, the majority of studies to date that have employed linked reads were focused on human haplotype phasing and genome assembly.
Keywords: Genome assembly; Long molecule sequencing; Polyploid assembly
Year: 2018 PMID: 30180802 PMCID: PMC6122573 DOI: 10.1186/s12864-018-5040-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Types of assembled contigs and alignments to REF contigs. a A contig pair is a pair of contigs which are the only contigs originating from a single scaffold. b Some scaffolds contain “N”s that denote scaffolding of contigs from pairs of reads or linked reads with common barcodes. After removal of “N”s, the remaining sequences are termed LR contigs or REF contigs, depending on the origin of the scaffold. Removal of 40 bases from both ends of an LR contig results in a trimmed LR contig. c Trimmed or untrimmed LR contigs are aligned to the REF contigs. Alignments are categorized as fully aligned, where the entire contig aligns to a REF contig; alignments with tails, where a region of the LR contig aligns to a REF contig but a region at either or both ends of the LR contig does not align to the REF contig; or uncategorized, where the LR contig extends past the edge of a REF contig. d LR contigs with tails are divided into two regions: the aligned region and the tail region. Tails can be removed in silico to generate a set of tail-derived contigs. e LR contigs with tails that fully align to a unique location in the genome on the same or a different REF contig are termed chimeric LR contigs.
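The scaffold splitting and trimming steps described in panel b can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names `split_scaffold` and `trim_contig` are hypothetical, and only the 40-base trim length comes from the caption.

```python
import re


def split_scaffold(scaffold):
    """Split a scaffold into contigs at runs of N (assumed gap character)."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


def trim_contig(contig, n=40):
    """Remove n bases from both ends of a contig.

    Returns an empty string when the contig is too short to trim
    (an assumption; the caption does not say how short contigs are handled).
    """
    if len(contig) <= 2 * n:
        return ""
    return contig[n:len(contig) - n]


# Toy scaffold: two contigs joined by a gap of Ns
scaffold = "A" * 50 + "N" * 10 + "C" * 100
contigs = split_scaffold(scaffold)
trimmed = [trim_contig(c) for c in contigs]
print([len(c) for c in contigs], [len(t) for t in trimmed])
```

Each contig loses 80 bases in total, so a contig of 80 bases or fewer yields nothing under this sketch.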
Fig. 2 Illustration of machine learning methodology. A gene sequence is converted to a state sequence that forms a Markov chain; the Markov chain is encoded using a Probabilistic Finite State Automaton (PFSA); the transition matrix of the PFSA is used as an input to the deep convolutional neural network (CNN) for classifying the gene sequence.
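To make the sequence-to-matrix step concrete, the sketch below estimates a first-order transition matrix from a DNA string. It assumes each nucleotide is one state, which the caption does not specify; the authors' state definition may differ, and the CNN itself is not shown. The function name `transition_matrix` is hypothetical.

```python
def transition_matrix(seq, states="ACGT"):
    """Estimate row-normalized transition probabilities between states.

    Assumption: one state per nucleotide. Symbols outside `states`
    (for example N) are skipped.
    """
    index = {s: i for i, s in enumerate(states)}
    k = len(states)
    counts = [[0] * k for _ in range(k)]
    seq = seq.upper()
    # Count each adjacent pair (current state -> next state)
    for a, b in zip(seq, seq[1:]):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1
    # Normalize each row; rows with no observations stay all zero
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


for row in transition_matrix("AACGTAAC"):
    print(row)
```

A matrix like this has a fixed size regardless of gene length, which is one plausible reason to use it as classifier input.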
Fig. 4 Coverage of the pseudomolecule level assembly of foxtail millet by syntenic proso millet scaffolds. Green horizontal lines indicate each of the nine foxtail millet chromosomes. Boxes in red and blue indicate syntenic regions from individual proso millet scaffolds. Boxes are tiled above (blue) and below (red) in such a way as to avoid double coverage of the foxtail millet genome by multiple scaffolds on the same side (Methods).
Alignment of contigs from different data sets to the AGPv2 reference genome using BLAST. To be classified as aligned, contigs must match the reference with ≥95% identity
Values are No. Contigs (%) from each data set.
| No. Alignments | LR Contigs | Trimmed LR Contigs | MAGIs | Sim 1ᵃ | Sim 2ᵇ | Sim 3ᶜ | ABySS |
|---|---|---|---|---|---|---|---|
| 1 (Unique) | 222,531 (95.0) | 222,245 (95.1) | 109,665 (96.1) | 256,255 (98.7) | 262,727 (98.8) | 257,049 (98.7) | 4,614,334 (42.9) |
| 2 | 2145 (0.9) | 2194 (0.9) | 1347 (1.2) | 1985 (0.8) | 2014 (0.8) | 1971 (0.8) | 1,520,610 (14.1) |
| > 2 | 629 (0.3) | 535 (0.2) | 138 (0.1) | 568 (0.2) | 488 (0.2) | 716 (0.3) | 4,268,325 (39.7) |
| 0 | 8848 (3.8) | 8931 (3.8) | 3024 (2.6) | 762 (0.3) | 719 (0.3) | 790 (0.3) | 350,313 (3.26) |
| Total | 234,153 | 233,905 | 114,173 | 259,570 | 265,968 | 260,526 | 10,753,582 |
ᵃ Simulation 1: 50 kb molecule length and 400 M reads
ᵇ Simulation 2: 80 kb molecule length and 400 M reads
ᶜ Simulation 3: 50 kb molecule length and 800 M reads
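The binning used in this table can be sketched as below. The input layout (a list of `(contig_id, percent_identity)` alignment hits) and the function name `bin_by_alignment_count` are assumptions made for illustration; only the ≥95% identity threshold and the bin labels come from the table.

```python
from collections import Counter


def bin_by_alignment_count(contig_ids, hits, min_identity=95.0):
    """Count contigs with 0, 1, 2, or more than 2 qualifying alignments.

    contig_ids: every contig in the data set (so unaligned ones are counted).
    hits: iterable of (contig_id, percent_identity) pairs (hypothetical layout).
    """
    # Number of alignments per contig that pass the identity threshold
    per_contig = Counter(cid for cid, identity in hits if identity >= min_identity)
    bins = Counter()
    for cid in contig_ids:
        n = per_contig.get(cid, 0)
        if n == 0:
            bins["0"] += 1
        elif n == 1:
            bins["1 (Unique)"] += 1
        elif n == 2:
            bins["2"] += 1
        else:
            bins["> 2"] += 1
    return dict(bins)


contig_ids = ["c1", "c2", "c3", "c4"]
hits = [("c1", 99.0), ("c2", 96.0), ("c2", 95.0), ("c3", 90.0),
        ("c4", 97.0), ("c4", 98.0), ("c4", 99.0)]
print(bin_by_alignment_count(contig_ids, hits))
```

In this toy input, `c3` has one hit but falls in the "0" bin because its identity is below the threshold.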
Categorization of contig alignments
Values are No. Contigs Uniquely Aligned to REF (% of total classified).
| Category | LR Contigs | Trimmed LR Contigs | MAGIs | Sim 1ᵃ | Sim 2ᵇ | Sim 3ᶜ | ABySS |
|---|---|---|---|---|---|---|---|
| Fully aligned | 156,841 (79.8) | 179,237 (91.0) | 101,265 (94.1) | 233,396 (94.0) | 240,866 (94.3) | 233,758 (94.2) | 4,532,012 (98.6) |
| With tails | 39,754 (20.2) | 17,812 (9.04) | 6317 (5.87) | 14,785 (5.96) | 14,443 (5.66) | 14,429 (5.81) | 63,859 (1.39) |
| Unclassified | 25,936 | 25,196 | 2083 | 8074 | 7438 | 8862 | 18,463 |
| Total Classified | 196,595 | 197,049 | 107,582 | 248,181 | 255,309 | 248,187 | 4,595,871 |
ᵃ Simulation 1: 50 kb molecule length and 400 M reads
ᵇ Simulation 2: 80 kb molecule length and 400 M reads
ᶜ Simulation 3: 50 kb molecule length and 800 M reads
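The percentages in this table are taken over classified contigs only, so unclassified contigs are excluded from the denominator. The short check below reproduces the LR Contigs column and confirms that classified plus unclassified contigs equal the 222,531 uniquely aligned LR contigs reported in the alignment table. The function name `percent_of_classified` is hypothetical.

```python
def percent_of_classified(fully_aligned, with_tails):
    """Return (% fully aligned, % with tails) out of classified contigs."""
    total_classified = fully_aligned + with_tails
    return (100 * fully_aligned / total_classified,
            100 * with_tails / total_classified)


# LR Contigs column from the categorization table
fully, tails, unclassified = 156_841, 39_754, 25_936

pct_full, pct_tails = percent_of_classified(fully, tails)
print(f"{pct_full:.1f} {pct_tails:.1f}")  # 79.8 20.2

# Classified + unclassified should equal the uniquely aligned LR contigs
print(fully + tails + unclassified)  # 222531
```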
Fig. 3 Conservation of gene order between the foxtail millet reference genome and pairs of scaffolds from the proso millet linked read assembly spanning the same region. The foxtail millet reference genome is shown in the center panel with genes indicated by gray arrows and protein coding exons by green squares. Proso millet scaffolds are shown above and below the foxtail millet genome. Red and blue lines connect gene regions from the foxtail millet genome with homologous sequence in the respective proso millet scaffolds.