Cong Shi, Wei Li, Qun-Jie Zhang, Yun Zhang, Yan Tong, Kui Li, Yun-Long Liu, Li-Zhi Gao.
Abstract
Exploiting novel gene sources from wild relatives has proven to be an efficient approach to advance crop genetic breeding efforts. Oryza granulata, with the GG genome type, occupies the basal position of the Oryza phylogeny and has the second largest genome (~882 Mb). As an upland wild rice species, it possesses renowned traits that distinguish it from other Oryza species, such as tolerance to shade and drought, immunity to bacterial blight and resistance to the brown planthopper. Here, we generated a 736.66-Mb genome assembly of O. granulata with 40,131 predicted protein-coding genes. With Hi-C data, for the first time, we anchored ~98.2% of the genome assembly to the twelve pseudo-chromosomes. This chromosome-length genome assembly of O. granulata will provide novel insights into rice genome evolution, enhance our efforts to search for new genes for future rice breeding programmes and facilitate the conservation of germplasm of this endangered wild rice species.
Year: 2020 PMID: 32350267 PMCID: PMC7190833 DOI: 10.1038/s41597-020-0470-2
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Libraries and read statistics used for the O. granulata genome assembly.
| Libraries | Illumina sequencer | Insert size (bp) | Read length (bp) | Raw data (Mb) | Raw sequence coverage (×) |
|---|---|---|---|---|---|
| Paired-End | HiSeq 2500 | 260 | 150 | 18,198.87 | 22.98 |
| | HiSeq 2500 | 260 | 150 | 19,450.86 | 24.56 |
| | HiSeq 2500 | 260 | 150 | 18,386.95 | 23.22 |
| Mate Pairs | HiSeq 2000 | 2,940 | 100 | 18,499.36 | 23.36 |
| | HiSeq 2000 | 2,920 | 100 | 16,957.87 | 21.41 |
| | HiSeq 2000 | 2,980 | 100 | 15,201.03 | 19.19 |
| | HiSeq 2500 | 8,960 | 125 | 15,150.44 | 19.13 |
| | HiSeq 2000 | 19,800 | 101 | 11,536.84 | 14.57 |
| Hi-C | HiSeq X Ten | 200–500 | 150 | 109,413.57 | 138.15 |
Note that sequencing coverage was calculated as raw data divided by an estimated genome size of 792 Mb.
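The coverage column can be reproduced directly from the raw data column. The following is a minimal Python sketch, assuming coverage is simply raw data (Mb) divided by the 792 Mb genome size given in the note; the function and variable names are illustrative and do not come from the original study.

```python
# Assumption: fold coverage = raw sequence data (Mb) / genome size (Mb),
# using the 792 Mb genome size stated in the table note.
GENOME_SIZE_MB = 792.0


def coverage(raw_data_mb: float, genome_size_mb: float = GENOME_SIZE_MB) -> float:
    """Return fold coverage for one library."""
    return raw_data_mb / genome_size_mb


# Raw data (Mb) copied from three rows of the library table.
# The dictionary keys are descriptive labels chosen for this example.
libraries = {
    "paired-end, 260 bp insert (first row)": 18198.87,
    "mate-pair, 19,800 bp insert": 11536.84,
    "Hi-C": 109413.57,
}

for name, raw_mb in libraries.items():
    print(f"{name}: {coverage(raw_mb):.2f}x")
```

Rounded to two decimals, the values agree with the coverage figures listed in the table.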
Clean RNA-Seq data of O. granulata from seven tissues.
| RNA source tissues | Read length (bp) | Number of paired-end reads | Clean data (bp) |
|---|---|---|---|
| Panicles at the booting stage | 126 | 15,198,948 | 3,360,266,967 |
| Panicles when flowering | 126 | 14,134,157 | 3,026,281,452 |
| Panicles at the grain-filling stage | 126 | 14,272,224 | 3,105,851,172 |
| Flag leaves | 126 | 15,492,612 | 3,382,030,990 |
| Stem | 126 | 14,072,253 | 3,059,093,232 |
| Shoots of seedlings | 126 | 13,954,519 | 3,029,673,209 |
| Roots of seedlings | 126 | 13,081,148 | 2,845,575,334 |
| Total | | 100,205,861 | 21,808,772,356 |
Fig. 1 Cytogram of the fluorescence intensity of O. sativa ssp. japonica cv. Nipponbare, O. granulata and Z. mays ssp. mays var. B73 nuclei isolated with Otto’s buffer. All CV values were <5%.
Fig. 2 The 17-mer distribution of sequencing reads from O. granulata. The occurrence of 17-mers was calculated using GCE based on the sequencing data from short-insert-size libraries (insert size ≤500 bp) of O. granulata.
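To illustrate what a k-mer occurrence distribution is, here is a toy Python sketch. It is not GCE and does not reproduce its statistical model; the function names are hypothetical, reverse complements are not merged, and the size estimate uses the common rule of thumb (total k-mers divided by the depth of the main peak), which is an assumption rather than the authors' documented procedure.

```python
from collections import Counter


def kmer_counts(reads, k=17):
    """Count every k-mer in the reads (k-mers containing N are skipped)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def depth_histogram(counts):
    """Map occurrence depth -> number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


def estimate_genome_size(counts):
    """Rule-of-thumb estimate: total k-mers / depth of the histogram peak.

    On real data the low-depth k-mers caused by sequencing errors would
    have to be excluded before picking the peak; this toy version does not.
    """
    hist = depth_histogram(counts)
    peak_depth = max(hist, key=hist.get)
    total_kmers = sum(counts.values())
    return total_kmers / peak_depth


# Toy input: the same 6-bp "genome" read three times, counted with k=3.
toy_counts = kmer_counts(["ACGTAC"] * 3, k=3)
print(dict(depth_histogram(toy_counts)))
print(estimate_genome_size(toy_counts))
```

In the toy input each of the four distinct 3-mers occurs three times, so the histogram has a single peak at depth 3.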
Scaffold length distribution of the O. granulata genome.
| Scaffold length class | Number | Total length (bp) | Average length (bp) | Percentage (%) |
|---|---|---|---|---|
| >1 kb | 2,389 | 736,656,379 | 308,353 | 99.99 |
| >10 kb | 1,582 | 734,295,743 | 464,156 | 99.68 |
| >50 kb | 1,321 | 727,772,083 | 550,925 | 98.79 |
| >100 kb | 1,146 | 715,037,410 | 623,941 | 97.06 |
| >200 kb | 925 | 682,041,900 | 737,342 | 92.59 |
| >300 kb | 738 | 635,988,431 | 861,772 | 86.33 |
| >500 kb | 518 | 550,118,292 | 1,062,004 | 74.68 |
| >800 kb | 290 | 403,838,490 | 1,392,546 | 54.82 |
| >1 Mb | 215 | 336,057,751 | 1,563,059 | 45.62 |
Assembly statistics of the O. granulata genome sequence.
| Chromosome ID | Scaffold number | Chromosome length (bp) |
|---|---|---|
| 1 | 134 | 80,745,213 |
| 2 | 137 | 77,995,952 |
| 3 | 154 | 77,713,834 |
| 4 | 108 | 71,071,801 |
| 5 | 95 | 64,488,131 |
| 6 | 115 | 58,352,620 |
| 7 | 111 | 57,695,860 |
| 8 | 102 | 55,414,587 |
| 9 | 79 | 54,629,218 |
| 10 | 66 | 45,527,259 |
| 11 | 82 | 44,217,632 |
| 12 | 82 | 35,346,218 |
| Unmapped | 1,128 | 13,587,283 |
| Total | 2,393 | 736,785,608 |
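The anchoring rate quoted in the abstract (~98.2%) follows from this table. Below is a short Python sketch, assuming the rate is the summed length of the twelve pseudo-chromosomes divided by the total assembly length; the variable names are illustrative.

```python
# Chromosome lengths (bp) for pseudo-chromosomes 1-12, copied from the table.
chromosome_lengths_bp = [
    80745213, 77995952, 77713834, 71071801, 64488131, 58352620,
    57695860, 55414587, 54629218, 45527259, 44217632, 35346218,
]
unmapped_bp = 13587283  # "Unmapped" row

anchored_bp = sum(chromosome_lengths_bp)
total_bp = anchored_bp + unmapped_bp

# Assumption: anchoring rate = anchored length / total assembly length.
anchoring_rate = 100 * anchored_bp / total_bp

print(f"Anchored: {anchored_bp:,} bp of {total_bp:,} bp")
print(f"Anchoring rate: {anchoring_rate:.1f}%")
```

The computed total matches the "Total" row of the table, and the rate rounds to the 98.2% reported in the abstract.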
Fig. 3 Comparisons of gene features among O. granulata and the three other species (A. thaliana, S. bicolor and O. sativa). Gene features include gene length, CDS length, exon length and intron length.
Comparison of gene models among O. granulata and A. thaliana from Capparidales and the three grasses, namely, rice, maize and sorghum.
| Feature | O. granulata | O. sativa | S. bicolor | Z. mays | A. thaliana |
|---|---|---|---|---|---|
| Genome size (Mb) | 737 | 373 | 727 | 2,068 | 120 |
| Gene number (#) | 40,131 | 39,045 | 32,824 | 40,602 | 27,416 |
| Gene models (#) | 49,486 | 49,066 | 39,195 | 40,602 | 35,386 |
| Total gene length (Mb) | 125.53 | 111.45 | 120.86 | 170.84 | 60.48 |
| Coding sequences (Mb) | 35.78 | 41.56 | 38.60 | 44.70 | 33.40 |
| Number of introns (#) | 125,141 | 130,147 | 115,947 | 166,986 | 123,389 |
| Total intron length (Mb) | 78.84 | 54.72 | 51.76 | 105.44 | 11.62 |
| Avg intron length (bp) | 630 | 420 | 446 | 631 | 94 |
Validation and functional annotation of the O. granulata protein-coding genes.
| Category | Methods | Number | Percentage (%) |
|---|---|---|---|
| Validation | Protein supported | 23,871 | 59.48 |
| | RNA-Seq supported | 19,094 | 47.58 |
| | Protein or RNA-Seq supported | 28,823 | 71.82 |
| Functional annotation | Swiss-Prot | 22,348 | 55.69 |
| | KEGG | 7,365 | 18.35 |
| | GO | 19,458 | 48.49 |
| | PFAM | 21,533 | 53.66 |
| | NR | 27,522 | 68.58 |
| | Total annotated | 34,436 | 85.81 |
Conserved non-coding RNA genes in the O. granulata genome.
| ncRNA types | Number | Average length (bp) | Total length (bp) | % of genome |
|---|---|---|---|---|
| tRNA | 1,003 | 74.9 | 75,160 | 0.0103 |
| rRNA (8S) | 181 | 113.5 | 20,548 | 0.0028 |
| rRNA (18S) | 21 | 1,642.3 | 34,487 | 0.0047 |
| rRNA (28S) | 19 | 4,139.9 | 78,659 | 0.0108 |
| snoRNA | 295 | 113.6 | 33,501 | 0.0046 |
| snRNA | 101 | 142.7 | 14,408 | 0.0020 |
| miRNA | 257 | 114.7 | 29,471 | 0.0040 |
Summary of the annotated repeat sequences in the O. granulata genome.
| Transposable elements | Length (bp) | Percentage (%) |
|---|---|---|
| DNA transposons | 72,407,795 | 9.83 |
| | 10,454,636 | 1.42 |
| | 10,685,997 | 1.45 |
| Maverick | 900,730 | 0.12 |
| | 31,123,966 | 4.23 |
| TcMar-Stowaway | 7,334,046 | 1.00 |
| Tourist | 62,565 | 0.01 |
| | 8,611,368 | 1.17 |
| Helitron | 893,911 | 0.12 |
| Others | 2,340,576 | 0.32 |
| RNA transposons | 376,029,522 | 51.05 |
| Non-LTR retrotransposons | 1,584,873 | 0.22 |
| LINE | 1,459,061 | 0.20 |
| SINE | 125,812 | 0.02 |
| LTR retrotransposons | 374,444,649 | 50.83 |
| | 41,126,935 | 5.58 |
| | 278,699,663 | 37.83 |
| Others | 54,618,051 | 7.41 |
| Other repeats | 8,130,028 | 1.10 |
| Low complexity | 764,382 | 0.10 |
| Simple repeats | 3,685,309 | 0.50 |
| Unknown | 3,680,337 | 0.50 |
| Total | 456,567,345 | 61.98 |
Occurrence of simple sequence repeats (SSRs) in the O. granulata genome.
| Repeat type | Number | Average length (bp) | Total length (kb) | Proportion (%) |
|---|---|---|---|---|
| Mono-nucleotide | 9,397 | 14 | 132.89 | 5.14 |
| Di-nucleotide | 35,467 | 16 | 576.31 | 22.28 |
| Tri-nucleotide | 64,905 | 13 | 860.99 | 33.28 |
| Tetra-nucleotide | 53,484 | 13 | 676.26 | 26.14 |
| Penta-nucleotide | 11,720 | 16 | 182.83 | 7.07 |
| Hexa-nucleotide | 8,366 | 19 | 157.78 | 6.10 |
| Total | 183,339 | 14 | 2,587.06 | 100 |
Note that the minimum number of repeat units was set to twelve for mono-nucleotides, six for di-nucleotides, four for tri-nucleotides, and three for tetra- to hexa-nucleotides.
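The thresholds in the note above can be read as minimum numbers of tandem motif copies per motif length. The following is a minimal Python sketch of a regex-based SSR scan under that reading. The function name `find_ssrs`, the demo sequence, and the choice to skip motifs that are themselves repeats of a shorter unit are illustrative assumptions, not the tool or settings used in the original study.

```python
import re

# Minimum number of tandem copies per motif length (1 = mono ... 6 = hexa),
# taken from the table note.
MIN_REPEATS = {1: 12, 2: 6, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, end, motif, copies) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs built from a shorter unit (e.g. "AA" or "ATAT"),
            # so each tract is reported once under its shortest motif.
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            copies = len(m.group(0)) // unit_len
            hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)


# Demo sequence (made up): the (CAG)3 tract is below the tri-nucleotide
# threshold of four copies and is therefore not reported.
demo = "GG" + "A" * 12 + "CC" + "AT" * 6 + "GG" + "CAG" * 3 + "TT" + "GATA" * 3 + "C"
for hit in find_ssrs(demo):
    print(hit)
```

This sketch only finds perfect repeats and can miss a tract that directly abuts a skipped match; dedicated SSR tools handle such cases more carefully.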
Comparison of the two O. granulata genome assemblies.
| Category | Entries | This study | IRGC Acc. No. 102117 |
|---|---|---|---|
| Genome sequencing | Source country | China | India |
| | Genome size (Mb)* | 792 | 785 |
| | Sequencing technology | Illumina; Hi-C | Illumina; PacBio |
| | Raw Illumina data (Gb) | 133.38 | 105.157 |
| | Sequence coverage (×)** | 167 | 131 |
| | Raw Hi-C/PacBio data (Gb) | 109.41 | 16.615 |
| | Sequence coverage (×)** | 137 | 21 |
| Assembly statistics | Assembly size (Mb) | 736.66 | 776.96 |
| | Whole-genome coverage (%) | 93 | 98.1 |
| | Contig N50 (kb) | 43.9 | 262.05 |
| | Contig number (#) | 29,963 | 4,618 |
| | Scaffold N50 (kb) | 916.3 | 262.05 |
| | Scaffold number (#) | 2,393 | 4,618 |
| | Largest scaffold (Mb) | 4.04 | 1.59 |
| | Length of anchored scaffolds (Mb) | 723.2 | - |
| | Anchoring rate (%) | 98.2 | - |
| | GC content (%) | 45.87 | 46.32 |
| Gene annotation | Gene number (#) | 40,131 | 40,116 |
| | Functionally annotated gene number (#) | 34,436 | 33,901 |
| | Complete BUSCO (%) | 96.53 | 95 |
| | Total gene length (Mb) | 125.53 | 102.74 |
| | Average gene length (bp) | 3,152 | 2,561.19 |
| | Total CDS length (Mb) | 35.78 | 40.61 |
| | Average CDS length (bp) | 892 | 1,012.28 |
| | Number of exons (#) | 165,272 | 162,369 |
| | Average exon length (bp) | 283 | 250.1 |
| | Average exons per gene | 4.1 | 4.05 |
| | Total intron length (Mb) | 78.84 | 62.14 |
| | Number of introns (#) | 125,141 | 122,253 |
| | Average intron length (bp) | 630 | 508 |
| ncRNA annotation | tRNA length (bp) | 75,160 | 82,079 |
| | rRNA length (bp) | 133,694 | 99,297 |
| | miRNA length (bp) | 29,471 | 30,787 |
| Repeat sequence annotation | Total repeat length (Mb) | 456.567 | 528.04 |
| | Repeat percentage (%) | 61.98 | 67.96 |
| | DNA transposon length (bp) | 72,407,795 | 68,393,246 |
| | LINE length (bp) | 1,459,061 | 7,169,231 |
| | SINE length (bp) | 125,812 | 59,741 |
| | LTR length (bp) | 374,444,649 | 460,976,797 |
| | | 41,126,935 | 54,901,814 |
| | | 278,699,663 | 407,036,517 |
| | Others (bp) | 54,618,051 | 28,158,459 |
| | Other length (bp) | 8,130,028 | 1,696,213 |
*The genome size was estimated by the k-mer method.
**Sequence coverage was calculated using an estimated genome size of 800 Mb.
| Measurement(s) | DNA • RNA • transcriptome • genome coverage • sequence_assembly • sequence feature annotation |
| Technology Type(s) | DNA sequencing • RNA sequencing • flow cytometry method • computational modeling technique • sequence assembly process • sequence annotation |
| Sample Characteristic - Organism | Oryza granulata |