António Marcos Ramos, Ana Usié, Pedro Barbosa, Pedro M Barros, Tiago Capote, Inês Chaves, Fernanda Simões, Isabel Abreu, Isabel Carrasquinho, Carlos Faro, Joana B Guimarães, Diogo Mendonça, Filomena Nóbrega, Leandra Rodrigues, Nelson J M Saibo, Maria Carolina Varela, Conceição Egas, José Matos, Célia M Miguel, M Margarida Oliveira, Cândido P Ricardo, Sónia Gonçalves.
Abstract
Cork oak (Quercus suber) is native to southwest Europe and northwest Africa, where it plays a crucial environmental and economic role. To tackle the challenges facing cork oak production and industry, advanced research is imperative but depends on the availability of a sequenced genome. To address this, we produced the first draft version of the cork oak genome. We followed a de novo assembly strategy based on high-throughput sequence data, which generated a draft genome comprising 23,347 scaffolds and 953.3 Mb in size. A total of 79,752 genes and 83,814 transcripts were predicted, including 33,658 high-confidence genes. An InterPro signature assignment was detected for 69,218 transcripts, which represented 82.6% of the total. Validation studies demonstrated the completeness of the genome assembly and annotation and highlighted the usefulness of the draft genome for read mapping of high-throughput sequence data generated using different protocols. All data generated are available through the public databases where they were deposited and are therefore ready for use by the academic and industry communities working on cork oak and/or related species.
Year: 2018 PMID: 29786699 PMCID: PMC5963338 DOI: 10.1038/sdata.2018.69
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Microsatellites used for genotyping the 28 cork oak individuals for selection of the tree used in genome sequencing.
For each microsatellite the motif, as well as the expected and observed sizes, are indicated.

| Reference | Microsatellite | Motif | Expected size (bp) | Observed size (bp) |
|---|---|---|---|---|
| Ueno and Tsumura | QrOST1 (DN950446) | (AG)19 | 149–171 | 134–152 |
| Ueno and Tsumura | QpD12 (CR627959) | (GCA)7 | 243–251 | 240–254 |
| Steinkellner | QpZAG110 | (AG)15 | 206–262 | 200–260 |
| Steinkellner | QpZAG9 | (AG)12 | 182–210 | 223–249 |
| Steinkellner | QpZAG15 | (AG)23 | 108–152 | 101–135 |
| Steinkellner | QpZAG36 | (AG)19 | 210–236 | 205–231 |
| Steinkellner | QpZAG46 | (AG)13 | 190–222 | 180–200 |
| Kampfer | QrZAG20 | (TC)18 | 160–200 | 161–179 |
| Kampfer | QrZAG7 | (TC)17 | 115–153 | 115–133 |
| Kampfer | QrZAG11 | (TC)22 | 238–263 | 255–281 |
| Dow | MSQ4 | (AG)17 | 203–227 | 186–218 |
| Dow | MSQ13 | (TC)11 | 222–246 | 218–230 |
| Isagi and Suhandono | QM3-50/QM50-3M | Composite | 253 | 273–289 |
| Sebastiani | Cmcs1 | (AT)7 | 104–108 | 104–123 |
| CEBAL | CB01 | (TC)2 | - | 88–106 |
| INIAV | D8 | (CA)20 | 141–151 | 139–155 |

Illumina DNA sequencing metrics, before and after preprocessing.
When necessary, the reads were trimmed, using Sickle’s sliding window approach, to a minimum length of 120, 80 and 40 nucleotides for the PE150, PE100 and MP libraries, respectively. The minimum quality over the set window size for each library type was 20.

| Library type | Insert size (bp) | Platform | Read length (nt) | | Reads before preprocessing | Reads after preprocessing | Reads retained (%) |
|---|---|---|---|---|---|---|---|
| Paired-End | 170 | HiSeq 2000 | 100 | 3 | 983,306,498 | 833,615,836 | 84.8 |
| Paired-End | 300 | HiSeq X Ten | 150 | 6 | 5,648,976,124 | 4,637,949,574 | 82.1 |
| Paired-End | 500 | HiSeq 2000 | 100 | 3 | 924,063,928 | 750,835,264 | 81.3 |
| Paired-End | 800 | HiSeq 2000 | 100 | 3 | 622,761,676 | 467,933,516 | 75.1 |
| Mate-Pair | 2000 | HiSeq 2000 | 49 | 6 | 1,419,111,502 | 1,166,743,086 | 82.2 |
| Mate-Pair | 5000 | HiSeq 2000 | 49 | 3 | 501,967,462 | 404,751,694 | 80.6 |
| Mate-Pair | 10000 | HiSeq 4000 | 49 | 3 | 202,568,340 | 124,339,922 | 61.4 |
| Mate-Pair | 20000 | HiSeq 4000 | 49 | 3 | 258,232,918 | 157,943,444 | 61.2 |

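The sliding-window trimming mentioned in the table note can be illustrated with a minimal sketch. This is a simplified version of the general technique, not Sickle's actual implementation: it scans windows from the 5' end, cuts the read at the first window whose mean quality falls below the threshold, and discards reads that end up shorter than the minimum length. The function name and the window size used in the example are hypothetical; only the quality threshold (20) and the minimum lengths (120, 80 and 40) come from the table note.

```python
from typing import Optional


def sliding_window_trim(seq: str, quals: list[int], window: int,
                        min_quality: float = 20.0,
                        min_length: int = 80) -> Optional[str]:
    """Trim the 3' end of a read at the first low-quality window.

    Simplified illustration only; not Sickle's implementation.
    Returns the trimmed sequence, or None if it is shorter than min_length.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    # Never use a window longer than the read itself.
    window = max(1, min(window, len(seq)))
    cut = len(seq)
    for start in range(0, len(seq) - window + 1):
        mean_q = sum(quals[start:start + window]) / window
        if mean_q < min_quality:
            cut = start  # drop everything from this window onwards
            break
    trimmed = seq[:cut]
    return trimmed if len(trimmed) >= min_length else None


# Hypothetical 100-nt read whose last 10 bases have low quality.
read = "A" * 100
quals = [30] * 90 + [5] * 10
kept = sliding_window_trim(read, quals, window=10, min_length=80)
```

With the PE100 minimum length of 80 the example read survives trimming, whereas the same read would be discarded under the PE150 minimum length of 120.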
NCBI Reference Sequence numbers for the chloroplast and mitochondrion genomes used in the preprocessing step.
A total of 16 genomes were used, eight from each organelle, for a total of 15 distinct plant species.

| Chloroplast | Mitochondrion |
|---|---|
| NC_000932.1 | NC_001284.2 |
| NC_014674.1 | NC_014043.1 |
| NC_023801.1 | NC_016005.1 |
| NC_009143.1 | NC_014050.1 |
| NC_026790.1 | NC_016743.2 |
| NC_026913.1 | NC_018554.1 |
| NC_026907.1 | NC_016742.1 |
| NC_023959.1 | NC_028096.1 |

Figure 1. Illumina DNA sequence data pre-processing workflow.
The pipeline included removal of low-quality reads, as well as reads containing adapter sequences or undetermined nucleotides. The remaining reads were then mapped to a set of chloroplast and mitochondrion genomes to remove reads derived from these organelle genomes.
Illumina RNA sequencing metrics, before and after preprocessing.
The reads were trimmed, when required, to a minimum length of 80 nucleotides, using Sickle.

| Tissue | | Reads before preprocessing | Reads after preprocessing | Reads retained (%) |
|---|---|---|---|---|
| Pollen | 1 | 197,725,257 | 192,418,402 | 97.3 |
| Leaf | 2 | 299,960,018 | 280,640,162 | 93.6 |
| Xylem | 2 | 361,255,569 | 338,218,694 | 93.6 |
| Inner bark | 2 | 311,378,053 | 291,162,581 | 93.5 |
| Phellem | 2 | 360,128,704 | 335,696,318 | 93.2 |

Figure 2. K-mer distribution used for the estimation of genome size.
The distribution was determined with Jellyfish using a k-mer size of 23.
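As a rough illustration of how a k-mer distribution can lead to a genome size estimate, the sketch below counts k-mers in a set of reads, builds the multiplicity histogram, and divides the total number of k-mers by the depth of the main peak. This is a commonly used approximation and an assumption here; the source only states that Jellyfish was used with a k-mer size of 23, not how the estimate was derived. The function names and the `min_depth` cutoff are hypothetical, and unlike Jellyfish's canonical mode the sketch does not merge reverse-complement k-mers.

```python
from collections import Counter


def kmer_histogram(reads, k=23):
    """Return {multiplicity: number of distinct k-mers with that multiplicity}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as total k-mers divided by the main peak depth.

    min_depth skips low-multiplicity k-mers, which usually come from
    sequencing errors (hypothetical cutoff, chosen for illustration).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy example: the same 8-nt sequence "sequenced" five times, with k=4.
toy_hist = kmer_histogram(["ACGTTGCA"] * 5, k=4)
toy_size = estimate_genome_size(toy_hist)
```

In the toy example every one of the five distinct 4-mers is seen five times, so the estimate equals the number of k-mer positions in the toy sequence.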
Metrics after integration of the paired-end assemblies generated during the process of producing the draft cork oak genome.
The integration of the two paired-end assemblies was performed using GARM.

| Sequence length (bp) | Number of sequences | Total length (bp) | Assembly (%) |
|---|---|---|---|
| ≥ 1,000 | 168,041 | 939,042,321 | 100 |
| ≥ 2,000 | 126,359 | 877,345,132 | 93.4 |
| ≥ 3,000 | 97,522 | 806,202,144 | 85.9 |
| ≥ 4,000 | 77,322 | 736,006,133 | 78.4 |
| ≥ 5,000 | 62,344 | 668,885,815 | 71.2 |
| ≥ 6,000 | 50,853 | 605,984,813 | 64.5 |
| ≥ 7,000 | 41,920 | 548,098,241 | 58.4 |
| ≥ 8,000 | 34,715 | 494,194,371 | 52.6 |
| ≥ 10,000 | 24,215 | 400,472,101 | 42.6 |
| ≥ 12,500 | 15,612 | 304,434,896 | 32.4 |
| ≥ 25,000 | 2,384 | 84,635,745 | 9.0 |
| ≥ 50,000 | 238 | 16,928,971 | 1.8 |
| ≥ 75,000 | 74 | 7,341,458 | 0.8 |
| ≥ 100,000 | 30 | 3,667,156 | 0.4 |

Assembly metrics for the draft cork oak genome.
The number of scaffolds is indicated for different size ranges, which also include the total length and the percentage of the genome assembly for each size range.

| Scaffold length (bp) | Number of scaffolds | Total length (bp) | Genome assembly (%) |
|---|---|---|---|
| ≥ 1,000 | 23,344 | 953,298,672 | 100 |
| ≥ 2,000 | 15,058 | 940,958,981 | 98.7 |
| ≥ 2,500 | 11,728 | 933,487,254 | 97.9 |
| ≥ 10,000 | 4,730 | 901,545,014 | 94.6 |
| ≥ 50,000 | 2,449 | 855,598,422 | 89.8 |
| ≥ 100,000 | 2,022 | 823,714,113 | 86.4 |
| ≥ 250,000 | 1,207 | 687,189,587 | 72.1 |
| ≥ 500,000 | 539 | 445,055,144 | 46.7 |
| ≥ 750,000 | 249 | 268,162,306 | 28.1 |
| ≥ 1,000,000 | 119 | 157,303,135 | 16.5 |
| ≥ 1,500,000 | 28 | 48,679,102 | 5.1 |
| ≥ 2,000,000 | 5 | 10,685,375 | 1.1 |

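A table of this kind can be derived from a plain list of scaffold lengths. The sketch below is one possible way to do it, not the authors' script: for each length threshold it reports the number of scaffolds at or above the threshold, their summed length, and that sum as a percentage of the total length of all scaffolds supplied. The function name and the toy lengths are hypothetical.

```python
def length_threshold_table(lengths, thresholds):
    """Return (threshold, count, total length, percent of all input) rows."""
    total = sum(lengths)
    rows = []
    for t in thresholds:
        kept = [n for n in lengths if n >= t]
        kept_len = sum(kept)
        pct = round(100 * kept_len / total, 1) if total else 0.0
        rows.append((t, len(kept), kept_len, pct))
    return rows


# Toy input: four hypothetical scaffolds.
rows = length_threshold_table([1000, 2000, 3000, 4000], [1000, 2500])
for threshold, count, length, pct in rows:
    print(f">= {threshold:,}\t{count}\t{length:,}\t{pct}")
```

Because the percentage is taken relative to the whole input, the first row reaches 100 only when every supplied scaffold meets the smallest threshold, as is the case for the ≥ 1,000 row of the table.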
Functional annotation results of the 83,814 predicted cork oak transcripts.
The results obtained using four different databases are presented, including the percentage of transcripts functionally annotated.

| Database | Annotated transcripts | Annotated transcripts (%) |
|---|---|---|
| NCBI-nr-plants | 56,496 | 67.4 |
| SwissProt | 46,602 | 55.6 |
| eggNOG Viridiplantae | 49,518 | 59.1 |
| InterPro | 69,218 | 82.6 |

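The percentages in this table follow directly from the transcript counts and the total of 83,814 predicted transcripts. A short consistency check, written here only as an illustration (the variable names are arbitrary), recomputes them:

```python
TOTAL_TRANSCRIPTS = 83_814

# Annotated transcript counts as reported in the table above.
annotated = {
    "NCBI-nr-plants": 56_496,
    "SwissProt": 46_602,
    "eggNOG Viridiplantae": 49_518,
    "InterPro": 69_218,
}

# Percentage of all predicted transcripts annotated by each database.
percent_annotated = {
    db: round(100 * n / TOTAL_TRANSCRIPTS, 1) for db, n in annotated.items()
}

for db, pct in percent_annotated.items():
    print(f"{db}: {pct}%")
```

The InterPro value reproduces the 82.6% quoted in the abstract.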
Results obtained with RepeatMasker for the cork oak draft genome.
The whole set of scaffolds (23,344) was used in the run, with a total sequence length of 953.3 Mb.

| Repeat class | Number of elements | Length occupied (bp) | Percentage of sequence |
|---|---|---|---|
| Retroelements | 96,642 | 72,329,781 | 7.59 |
| SINEs: | 596 | 71,956 | 0.01 |
| Penelope | 3 | 438 | 0.00 |
| LINEs: | 24,465 | 12,440,942 | 1.31 |
| CRE/SLACS | 35 | 1,803 | 0.00 |
| L2/CR1/Rex | 0 | 0 | 0.00 |
| R1/LOA/Jockey | 0 | 0 | 0.00 |
| R2/R4/NeSL | 0 | 0 | 0.00 |
| RTE/Bov-B | 2,551 | 734,240 | 0.08 |
| L1/CIN4 | 21,858 | 11,703,615 | 1.23 |
| LTR elements: | 71,581 | 59,816,883 | 6.27 |
| BEL/Pao | 0 | 0 | 0.00 |
| Ty1/Copia | 31,107 | 25,406,691 | 2.67 |
| Gypsy/DIRS1 | 36,363 | 32,397,852 | 3.40 |
| Retroviral | 0 | 0 | 0.00 |
| DNA transposons | 32,632 | 7,066,773 | 0.74 |
| Hobo-Activator | 12,038 | 3,473,326 | 0.36 |
| Tc1-IS630-Pogo | 280 | 28,138 | 0.00 |
| En-Spm | 0 | 0 | 0.00 |
| MuDR-IS905 | 0 | 0 | 0.00 |
| PiggyBac | 0 | 0 | 0.00 |
| Tourist/Harbinger | 2,785 | 708,244 | 0.07 |
| Other (Mirage P-element, P-element, Transib) | 0 | 0 | 0.00 |
| Rolling-circles | 0 | 0 | 0.00 |
| Unclassified | 4,448 | 1,927,308 | 0.20 |
| Total interspersed repeats | | 81,323,862 | 8.53 |
| Small RNA | 1,303 | 233,092 | 0.02 |
| Satellites | 321 | 36,450 | 0.00 |
| Simple repeats | 700,496 | 25,912,290 | 2.72 |
| Low complexity | 127,202 | 6,580,532 | 0.69 |

Summary statistics for the mapping of several read datasets against the cork oak draft genome.
The reads were downloaded from NCBI’s Sequence Read Archive and mapped to the draft genome using BWA-mem.

| Dataset | SRA accession | Total reads | Mapped reads | Mapped reads (%) |
|---|---|---|---|---|
| Magalhães | | | | |
| Medium drought | SRR1812375 | 587,184 | 439,362 | 74.8 |
| Severe drought | SRR1812376 | 473,117 | 342,671 | 72.4 |
| Well-watered | SRR1812377 | 645,761 | 484,889 | 75.1 |
| Rocheta | | | | |
| Male flower | SRR1609152 | 659,399 | 493,298 | 74.8 |
| Female flower | SRR1609153 | 535,665 | 401,025 | 74.9 |
| Teixeira | | | | |
| Good cork quality | SRR1009171 | 573,548 | 409,169 | 71.3 |
| Bad cork quality | SRR1009172 | 600,102 | 417,493 | 69.6 |
| Sebastiana | | | | |
| ECM roots | SRR1012033 | 1,159,845 | 769,423 | 66.3 |
| Non-symbiotic roots | SRR1012034 | 969,271 | 636,048 | 65.6 |
| Somatic embryogenesis (Illumina PE) | | | | |
| Embryo globular stage | SRX2239661 | 71,706,998 | 62,998,568 | 87.9 |
| Embryo heart/torpedo stage | SRX2239662 | 71,964,732 | 64,487,926 | 89.6 |
| Embryo immature cotyledonary stage | SRX2239663 | 84,482,546 | 73,969,188 | 87.6 |
| Embryo mature cotyledonary stage | SRX2239664 | 84,022,498 | 73,715,150 | 87.7 |
| Chaves | | | | |
| Leaf | SRR988108 | 16,838,439 | 13,900,006 | 82.5 |
| Cork | SRR988109 | 9,333,712 | 7,223,415 | 77.4 |
| HL8 RNA-Seq (Illumina PE) | | | | |
| Pollen | SRR5986741 | 192,418,402 | 161,872,000 | 84.1 |
| Leaf | SRR5986739 | 299,960,018 | 245,400,286 | 81.8 |
| Xylem | SRR5986738 | 361,255,569 | 296,903,800 | 82.2 |
| Inner bark | SRR5986740 | 311,378,053 | 251,537,756 | 80.8 |
| Phellem | SRR5986737 | 353,687,405 | 285,006,172 | 80.6 |