Huilong Du, Ying Yu, Yanfei Ma, Qiang Gao, Yinghao Cao, Zhuo Chen, Bin Ma, Ming Qi, Yan Li, Xianfeng Zhao, Jing Wang, Kunfan Liu, Peng Qin, Xin Yang, Lihuang Zhu, Shigui Li, Chengzhi Liang.
Abstract
A high-quality reference genome is critical for understanding genome structure, genetic variation and evolution of an organism. Here we report the de novo assembly of an indica rice genome Shuhui498 (R498) through the integration of single-molecule sequencing and mapping data, genetic map and fosmid sequence tags. The 390.3 Mb assembly is estimated to cover more than 99% of the R498 genome and is more continuous than the current reference genomes of japonica rice Nipponbare (MSU7) and Arabidopsis thaliana (TAIR10). We annotate high-quality protein-coding genes in R498 and identify genetic variations between R498 and Nipponbare and presence/absence variations by comparing them to 17 draft genomes in cultivated rice and its closest wild relatives. Our results demonstrate how to de novo assemble a highly contiguous and near-complete plant genome through an integrative strategy. The R498 genome will serve as a reference for the discovery of genes and structural variations in rice.
Year: 2017 PMID: 28469237 PMCID: PMC5418594 DOI: 10.1038/ncomms15324
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Figure 1. Schematic depiction of the construction of super-contigs.
(a) Whole-genome assembly from corrected SMRT sequences (thin black lines) generates a set of WGS contigs (c1–c5). (b) Sequence tags (stacked short lines) from each fosmid pool are used to retrieve corrected PacBio sequences (black lines), which are then assembled into fosmid contigs (for example, FC1–FC5). (c) Many of the WGS contigs are grouped together and anchored onto chromosomes based on their linkage relationship (LG1–LG12). For example, c1, c2, c4 and c5 are anchored onto a chromosome, but c3 is not (in the non-LG group). Note that c2 and c4 are close to c1 physically, but c5 is not close to them. (d) A simplified overlap graph is constructed using anchored WGS contigs in LG1, LG2 and LG12 and unanchored WGS contigs in the non-LG group as nodes and fosmid contigs as edges (blue lines). A sequence overlap between a WGS contig and a fosmid contig must be at least 5 kb to form a connection. Two WGS contigs can be connected by multiple fosmid contigs. (e) The three most reliable paths from (d), including two partial ones, are selected as described in the Methods section to build super-contigs. The blue lines represent fosmid contigs. The other coloured lines represent WGS contigs. The WGS contigs (including unanchored ones in green) on each path are connected to the best aligned fosmid contigs to form super-contigs. The overlap between two WGS contigs (for example, nodes 4 and 7) produces a negative edge length, as described in the Methods section. In such cases, the fosmid sequences were used for the overlapping regions to form the super-contig. In all other cases, the WGS contig sequences were used for the overlapping regions. Node 33 represents a chimeric WGS contig, which is split into two parts to be used separately in two super-contigs. Dashed lines represent alignment overhangs.
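The graph construction in panel (d) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_overlap_graph`, the record layout `(wgs_contig, fosmid_contig, overlap_bp)` and the example alignments are hypothetical, and the path selection of panel (e), which is described in the Methods section, is not attempted. The only value taken from the caption is the 5 kb minimum overlap. The sketch also assumes that a fosmid contig overlapping more than two WGS contigs links every pair of them.

```python
from collections import defaultdict

MIN_OVERLAP_BP = 5_000  # minimum WGS/fosmid overlap stated in the caption


def build_overlap_graph(alignments):
    """Build a graph with WGS contigs as nodes and fosmid contigs as edges.

    alignments: iterable of (wgs_contig, fosmid_contig, overlap_bp) records
    (a hypothetical layout chosen for this sketch).
    Returns a dict mapping a sorted pair of WGS contigs to the set of
    fosmid contigs connecting them.
    """
    # Collect, for each fosmid contig, the WGS contigs it overlaps by >= 5 kb.
    by_fosmid = defaultdict(set)
    for wgs, fosmid, overlap_bp in alignments:
        if overlap_bp >= MIN_OVERLAP_BP:
            by_fosmid[fosmid].add(wgs)

    # Every pair of WGS contigs sharing a fosmid contig becomes an edge;
    # several fosmid contigs may support the same edge.
    edges = defaultdict(set)
    for fosmid, contigs in by_fosmid.items():
        ordered = sorted(contigs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                edges[(a, b)].add(fosmid)
    return dict(edges)


# Hypothetical alignment records, for illustration only.
alignments = [
    ("c1", "FC1", 12_000),
    ("c2", "FC1", 8_000),
    ("c1", "FC2", 6_500),
    ("c2", "FC2", 5_000),
    ("c3", "FC3", 4_999),   # below the 5 kb threshold, so no connection
    ("c4", "FC3", 20_000),
]
graph = build_overlap_graph(alignments)
```

In this toy input, c1 and c2 end up connected by two fosmid contigs, while c3 stays unconnected because its only overlap falls just short of 5 kb.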
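Comparison of basic sequence statistics of R498 and Nipponbare MSU7.
| Chr | R498 Len (bp) | Nip Len (bp) | R498 gaps* | Nip gaps† | R498 Tel | Nip Tel |
| --- | --- | --- | --- | --- | --- | --- |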
| Chr1 | 44,361,539 | 43,270,923 | 1 | 12 | Both | Single |
| Chr2 | 37,764,328 | 35,937,250 | 1 | 6 | Both | Both |
| Chr3 | 39,691,490 | 36,413,819 | 1 | 12 | Both | Both |
| Chr4 | 35,849,732 | 35,502,694 | 0 | 46 | Both | Both |
| Chr5 | 31,237,231 | 29,958,434 | 0 | 14 | Both | Single |
| Chr6 | 32,465,040 | 31,248,787 | 0 | 5 | Both | Single |
| Chr7 | 30,277,827 | 29,697,621 | 1 | 9 | Both | Both |
| Chr8 | 29,952,003 | 28,443,022 | 0 | 3 | Both | Single |
| Chr9 | 24,760,661 | 23,012,720 | 0 | 18 | Both | None |
| Chr10 | 25,582,588 | 23,207,287 | 0 | 28 | Both | Single |
| Chr11 | 31,778,392 | 29,021,106 | 1 | 69 | Both | None |
| Chr12 | 26,601,357 | 27,531,856 | 0 | 17 | Both | None |
| mtDNA | 527,116 | 490,520 | 0 | 0 | — | — |
| cpDNA | 134,546 | 134,525 | 0 | 0 | — | — |
| Total** | 390,983,850 | 373,870,564 | 5 | 239 | 24 | 13 |
Chr, chromosome; Len, length; Nip, Nipponbare; Tel, telomere.
**R498 GC content, 43.57%; Nip GC content, 43.53%.
*R498 centromere gap locations: chr1, 17,339,881–17,349,880; chr2, 13,684,078–13,694,077; chr3, 21,193,261–21,203,260; chr7, 12,695,696–12,705,695; chr11, 13,162,594–13,172,593. More potential duplication gaps are listed in Supplementary Table 3.
†The gaps in Nip are those at least 100 bp.
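The totals in the table and the gap coordinates in the footnote can be cross-checked with a short Python sketch. All numbers are copied from the table and footnotes above; the variable names are illustrative, and the gap coordinates are assumed to be 1-based and inclusive.

```python
# (R498 length, Nip length, R498 gaps, Nip gaps), copied from the table.
rows = {
    "Chr1": (44_361_539, 43_270_923, 1, 12),
    "Chr2": (37_764_328, 35_937_250, 1, 6),
    "Chr3": (39_691_490, 36_413_819, 1, 12),
    "Chr4": (35_849_732, 35_502_694, 0, 46),
    "Chr5": (31_237_231, 29_958_434, 0, 14),
    "Chr6": (32_465_040, 31_248_787, 0, 5),
    "Chr7": (30_277_827, 29_697_621, 1, 9),
    "Chr8": (29_952_003, 28_443_022, 0, 3),
    "Chr9": (24_760_661, 23_012_720, 0, 18),
    "Chr10": (25_582_588, 23_207_287, 0, 28),
    "Chr11": (31_778_392, 29_021_106, 1, 69),
    "Chr12": (26_601_357, 27_531_856, 0, 17),
    "mtDNA": (527_116, 490_520, 0, 0),
    "cpDNA": (134_546, 134_525, 0, 0),
}

# Column sums, to compare with the "Total" row.
r498_total = sum(r[0] for r in rows.values())
nip_total = sum(r[1] for r in rows.values())
r498_gaps = sum(r[2] for r in rows.values())
nip_gaps = sum(r[3] for r in rows.values())

# Nuclear chromosomes only (organelle genomes excluded).
r498_nuclear = sum(r[0] for name, r in rows.items() if name.startswith("Chr"))

# R498 centromere gap coordinates from the footnote,
# assumed here to be 1-based and inclusive.
centromere_gaps = {
    "chr1": (17_339_881, 17_349_880),
    "chr2": (13_684_078, 13_694_077),
    "chr3": (21_193_261, 21_203_260),
    "chr7": (12_695_696, 12_705_695),
    "chr11": (13_162_594, 13_172_593),
}
gap_lengths = {c: end - start + 1 for c, (start, end) in centromere_gaps.items()}

print(r498_total, nip_total, r498_gaps, nip_gaps)
print(round(r498_nuclear / 1e6, 1))
print(gap_lengths)
```

Under these assumptions the column sums reproduce the "Total" row, the twelve R498 chromosomes add up to about 390.3 Mb as quoted in the abstract, and each of the five centromere gaps spans 10,000 bp.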
Figure 2. Whole-genome comparison of R498 and Nip.
The aligned regions are represented by the crossing lines between each pair of pseudomolecules. A large inversion is observable on chromosome 6. Red rectangles above or below the pseudomolecules represent PVs (≥500 bp) relative to each other. Black rectangles indicate the position of centromere-surrounding sequences defined in Nip reference genome (http://rice.plantbiology.msu.edu/annotation_pseudo_centromeres.shtml), and blue rectangles represent the regions containing centromere-specific tandem repeats of RCS2 in R498. Left or right black arrows at the end of each pseudomolecule indicate the presence of telomere repeats. Red arrows indicate the locations of rDNAs at the start of chromosomes 9 and 10.
Comparison of genome features between R498 and Nipponbare MSU7.
| Unit | R498 | Nip | |
| --- | --- | --- | --- |
| Number | 38,714 | 36,775 | |
| % | 42.05 | 40.43 | |
| Number | 2,548,071 | 2,548,071 | |
| Number | 226,771 | 239,390 | |
| Base | 1,034,898 | 1,077,080 | |
| Number | 3,432 | 3,411 | |
| Base | 887,602 | 856,902 | |
| Number | 1,301 | 1,328 | |
| Base | 938,805 | 962,372 | |
| Number | 5,407 | 4,973 | |
| Base | 19,316,383 | 16,710,125 | |
| Number | 1,524 | 726 | |
| Base | 28,857,940 | 13,663,983 | |
| Number | 170 | 92 | |
| Base | 16,851,530 | 7,960,212 | |
| Number | 8,402 | 7,119 | |
| Base | 65,964,658 | 39,296,692 |
AV, absence variation; PV, presence variation; SNP, single-nucleotide polymorphism.
An insertion in R498 is equivalent to a deletion in Nip, and vice versa. A PV in R498 is equivalent to an AV in Nip, and vice versa.
Figure 3. Comparison of genes and PAVs among rice genomes.
(a) Line plot showing the number of homologous genes between R498 and Nip. The proteins coded by genes in R498 and Nip were BLASTed against each other, and the mutual sequence coverage and alignment identity of the best-matched protein pairs were computed. The total numbers of genes compared were: R498, 38,714; Nip, 36,775. The x axis indicates the percentage of either coverage or identity as the threshold. The y axis indicates the number of genes in R498 or Nip that are aligned to best-matched homologous genes in the other genome with coverage or identity above the threshold. (b) Synteny view of the PSTOL1 gene region between R498 (Chr12: 14,818,743–15,035,152 bp), Nip and Kasalath (GenBank accession no. AB458444.1). Different amplification or loss of LTR retrotransposons between R498 and Nip is evident. (c) Synteny view of the rice blast resistance gene locus for Piz-t in R498 (Chr6: 10,079,870–10,277,132 bp), Pi2 from indica cultivar C101A51 and Pi9 from O. minuta (GenBank accession nos DQ352453 and DQ285630) to Nip on chromosome 6. This region is composed of multiple NBS-LRR domain genes, transposons and pseudogenes. Several PAVs of 13–65 kb are shown, which contain different types of LTR transposons. (d) Statistics of the PVs on R498 and Nip that are shared by one or more of the 17 draft genomes. The names and data sources of the 17 draft genomes are listed in Supplementary Data 6. Two PVs on a reference genome relative to two draft genomes were defined as shared if at least 75% of their sequences overlap. The y axis indicates the number of PVs shared by each number of draft genomes shown on the x axis.
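The sharing criterion in panel (d) can be sketched in Python as follows. This is one possible reading, not the authors' code: it assumes that the 75% threshold must hold for both PVs (a reciprocal overlap), that PVs are given as half-open `(start, end)` intervals on the same reference chromosome, and that the helper names and example coordinates are purely illustrative.

```python
MIN_SHARED_FRACTION = 0.75  # threshold stated in the caption


def overlap_bp(a, b):
    """Overlap in bp between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_shared(a, b, min_fraction=MIN_SHARED_FRACTION):
    """True if the overlap covers at least min_fraction of BOTH PVs.

    Requiring the threshold for both intervals is an assumption of this
    sketch; the caption does not spell out the exact rule.
    """
    shared = overlap_bp(a, b)
    return (shared >= min_fraction * (a[1] - a[0])
            and shared >= min_fraction * (b[1] - b[0]))


def genomes_sharing(pv, pvs_by_genome):
    """Count the draft genomes that have at least one PV shared with pv."""
    return sum(
        any(is_shared(pv, other) for other in pvs)
        for pvs in pvs_by_genome.values()
    )


# Hypothetical PV intervals on one reference chromosome, one list per draft genome.
pvs_by_genome = {
    "draftA": [(1_000, 3_000)],
    "draftB": [(1_200, 3_100)],   # 1,800 bp overlap with the query below
    "draftC": [(2_600, 5_000)],   # only 400 bp overlap with the query below
}
n_sharing = genomes_sharing((1_000, 3_000), pvs_by_genome)
```

Tallying `genomes_sharing` over all reference PVs and grouping by the resulting count would give a histogram of the kind plotted in panel (d).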