Masaomi Hatakeyama, Sirisha Aluri, Mathi Thumilan Balachadran, Sajeevan Radha Sivarajan, Andrea Patrignani, Simon Grüter, Lucy Poveda, Rie Shimizu-Inatsugi, John Baeten, Kees-Jan Francoijs, Karaba N Nataraja, Yellodu A Nanja Reddy, Shamprasad Phadnis, Ramapura L Ravikumar, Ralph Schlapbach, Sheshshayee M Sreeman, Kentaro K Shimizu.
Abstract
Finger millet (Eleusine coracana (L.) Gaertn) is an important crop for food security because of its tolerance to drought, which is expected to be exacerbated by global climate change. Nevertheless, it is often classified as an orphan/underutilized crop because of the paucity of scientific attention. Among several small millets, finger millet is considered an excellent source of essential nutrient elements, such as iron and zinc; hence, it has potential as an alternate coarse cereal. However, high-quality genome sequence data of finger millet are currently not available. One of the major problems encountered in the genome assembly of this species was its polyploidy, which makes assembly harder than for a diploid genome. To overcome this problem, we sequenced its genome using diverse technologies with sufficient coverage and assembled it via a novel multiple hybrid assembly workflow that combines next-generation with single-molecule sequencing, followed by whole-genome optical mapping using the Bionano Irys® system. The total number of scaffolds was 1,897 with an N50 length >2.6 Mb and detection of 96% of the universal single-copy orthologs. The majority of the homeologs were assembled separately. This indicates that the proposed workflow is applicable to the assembly of other allotetraploid genomes.
Keywords: allotetraploid genome; finger millet; hybrid de novo assembly; whole-genome optical mapping
Year: 2018 PMID: 28985356 PMCID: PMC5824816 DOI: 10.1093/dnares/dsx036
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Hybrid de novo assembly workflow. In the figure, white boxes represent processes, gray boxes represent data entities, blue solid arrows show the process flow, and black dashed arrows show the data flow. The hybrid de novo assembly was executed as follows: (1) contig assembly using only paired-end reads (Platanus ver. 1.2.4); (2) hybrid contig assembly using contigs >1 kb from the first step and long reads (DBG2OLC commit id: 7f1712ae76b015b1f4efee81a5d4e1305701197a); (3) polishing of the assembled contigs with paired-end reads (Pilon ver. 1.21), yielding the NGS contigs; (4) use of the polished contigs in two subsequent independent steps: (4-1) scaffolding of the polished contigs with mate-pair reads (NGS scaffolds) and (4-2) conversion of scanned molecules into a CMAP file and optical genome mapping to the NGS contigs; (5) de novo assembly of scanned molecules based on the parameters of the molecular-quality report and autonoise (autodetection of appropriate noise parameters) with the NGS contigs (BNG contigs); (6) hybrid scaffolding with both NGS scaffolds (step 4-1 output) and BNG contigs (step 5 output); and (7) gap closing (GMclose ver. 1.6) of the scaffolds with all the contigs generated in the first step (step 1 output). Finally, the workflow produced the super scaffolds. Raw-data preprocessing steps, such as adapter trimming and removing chimera code from mate-pair reads, are not shown in this workflow.
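The step dependencies described in the Figure 1 caption can be written down as a small dependency graph. The following is only a sketch: the step names are labels invented here for illustration, no assembler is actually invoked, and the edges reflect one reading of the caption (for example, that step 5 consumes both the optical-mapping result and the polished NGS contigs).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a workflow step; its value is the set of steps whose output it needs.
# Step names are hypothetical labels derived from the Figure 1 caption.
WORKFLOW = {
    "1_contig_assembly": set(),                                # Platanus, paired-end reads
    "2_hybrid_contig_assembly": {"1_contig_assembly"},         # DBG2OLC, contigs >1 kb + long reads
    "3_polishing": {"2_hybrid_contig_assembly"},               # Pilon -> NGS contigs
    "4-1_mate_pair_scaffolding": {"3_polishing"},              # -> NGS scaffolds
    "4-2_optical_mapping": {"3_polishing"},                    # CMAP conversion + mapping
    "5_bng_de_novo_assembly": {"3_polishing", "4-2_optical_mapping"},  # -> BNG contigs
    "6_hybrid_scaffolding": {"4-1_mate_pair_scaffolding", "5_bng_de_novo_assembly"},
    "7_gap_closing": {"6_hybrid_scaffolding", "1_contig_assembly"},    # -> super scaffolds
}


def execution_order(workflow):
    """Return one valid order in which the steps could be run."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for step in execution_order(WORKFLOW):
        print(step)
```

Steps 4-1 and 4-2 have no dependency on each other, which matches the caption's description of them as independent; any order that places both after polishing and before hybrid scaffolding is valid.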
Raw sequence data
| Library type | DNA/RNA | Sequencer | Insert length (bases) | Number of bases | Coverage depth |
|---|---|---|---|---|---|
| Paired-end | DNA | Illumina NextSeq | 250 | 93,783,186,606 | 63× |
| Paired-end | DNA | Illumina NextSeq | 550 | 63,519,282,198 | 42× |
| Paired-end | DNA | Illumina NextSeq | 700 | 50,551,135,834 | 34× |
| Mate-pair | DNA | Illumina MiSeq | 3,000 | 78,905,946 | 0.053× |
| Mate-pair | DNA | Illumina MiSeq | 5,000 | 364,070,088 | 0.24× |
| Mate-pair | DNA | Illumina MiSeq | 8,000 | 1,138,183,671 | 0.76× |
| Mate-pair | DNA | Illumina MiSeq | 20,000 | 1,366,910,280 | 0.91× |
| PacBio CLR | DNA | PacBio RS II | – | 25,086,179,201 | 16.7× |
| Paired-end | RNA | Illumina HiSeq4000 | – | 148,856,244,082 | 99.2× |
The number of raw sequenced bases in each library and the corresponding coverage depth, based on a genome size of 1.5 Gb as estimated by flow cytometry, are shown.
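The coverage depths in the table follow from dividing the number of sequenced bases by the estimated genome size. A minimal sketch of that arithmetic, assuming the genome size is exactly 1.5 Gb (the function and variable names are illustrative, not from the paper):

```python
GENOME_SIZE_BP = 1_500_000_000  # 1.5 Gb, the flow-cytometry estimate stated in the table note


def coverage_depth(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Coverage depth = sequenced bases / genome size."""
    return total_bases / genome_size


# Base counts copied from three rows of the table.
libraries = {
    "Paired-end, 250-base insert": 93_783_186_606,
    "Mate-pair, 5,000-base insert": 364_070_088,
    "PacBio CLR": 25_086_179_201,
}

if __name__ == "__main__":
    for name, bases in libraries.items():
        print(f"{name}: {coverage_depth(bases):.2f}x")
```

Rounded to the precision used in the table, these three rows give 63×, 0.24×, and 16.7×.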
Assembly statistics at each assembly step
| Statistics | Contig assembly | Hybrid assembly | Scaffold | Hybrid scaffold |
|---|---|---|---|---|
| Total sequences | 2,812,919 | 6,374 | 2,387 | 1,897 |
| Total bases | 1,307,217,455 | 1,067,045,564 | 1,069,478,541 | 1,188,784,944 |
| Min sequence length | 115 | 3,408 | 3,911 | 1,244 |
| Max sequence length | 27,802 | 2,072,336 | 5,237,708 | 13,553,037 |
| Average sequence length | 464.72 | 167,405.96 | 448,042.96 | 626,665.76 |
| Median sequence length | 154.00 | 102,084.50 | 237,045.00 | 92,178.00 |
| N25 length | 3,419 | 520,874 | 1,581,824 | 5,029,714 |
| N50 length | 1,410 | 285,549 | 905,318 | 2,683,090 |
| N75 length | 311 | 145,770 | 457,768 | 1,232,573 |
| N90 length | 133 | 76,519 | 216,067 | 419,121 |
| G + C content | 43.15% | 43.83% | 43.57% | 40.98% |
| N content | 0.00% | 0.00% | 0.73% | 6.63% |
Assembly statistics at each step are summarized. Contig assembly represents the contigs that were assembled using Platanus with preprocessed paired-end reads. Hybrid assembly represents the contigs that were generated using DBG2OLC with the Platanus contigs (>1 kb long) and PacBio long reads. Scaffold is the result of the SSPACE standard scaffolding with mate-pair reads, and Hybrid scaffold is the result of hybrid scaffolding with both the Bionano Genomics de novo assembled contigs and the SSPACE scaffolds, followed by GMclose gap closing.
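The N25, N50, N75, and N90 rows use the usual Nx definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the total bases. A minimal sketch of that calculation follows; the sequence lengths are a hypothetical toy set, not data from this assembly, and the tool the authors used to compute the statistics is not stated here.

```python
def nx_length(lengths, fraction=0.5):
    """Return the Nx length, e.g. fraction=0.5 gives N50.

    Sequences are taken from longest to shortest until their cumulative
    length reaches `fraction` of the total; the length of the last
    sequence taken is the Nx value.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # reached only if fraction > 1


def average_length(lengths):
    """Average sequence length = total bases / number of sequences."""
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # hypothetical lengths, 32 bases in total
    for label, fraction in [("N25", 0.25), ("N50", 0.5), ("N75", 0.75), ("N90", 0.9)]:
        print(label, nx_length(toy_lengths, fraction))
    print("average", average_length(toy_lengths))
```

Under this definition the Nx length can only decrease as x grows, which is why every column of the table runs N25 ≥ N50 ≥ N75 ≥ N90. The average-length row can be checked the same way: for the hybrid scaffold, 1,188,784,944 bases over 1,897 sequences gives about 626,665.76.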
Figure 2. Annotation edit distance (AED) and benchmark of universal single-copy orthologs. (A) The distribution of AED tagged as a gene by MAKER. (B) Benchmark result of the plant set of universal single-copy orthologs produced by BUSCO. In each bar, the pale-blue color indicates the number of genes that were detected completely as single-copy genes in the genome, the dark-blue color indicates the number of genes that were detected as single-copy genes but were counted more than twice, the yellow color indicates the number of genes that were detected as single-copy genes but not completely, and the red color indicates the number of single-copy genes that were not detected among the plant universal single-copy orthologs. The ‘genome’ is the BUSCO result that was obtained by searching the assembled genome FASTA file, and the ‘transcripts’ and ‘proteins’ constitute the BUSCO result that was obtained by searching the transcript FASTA file and amino acid FASTA file produced by MAKER.