Elena Barghini, Lucia Natali, Rosa Maria Cossu, Tommaso Giordani, Massimo Pindo, Federica Cattonaro, Simone Scalabrin, Riccardo Velasco, Michele Morgante, Andrea Cavallini.
Abstract
Analyzing genome structure in different species provides insight into the evolution of plant genome size. Olive (Olea europaea L.) has a medium-sized haploid genome of 1.4 Gb, whose structure is largely uncharacterized despite the growing importance of this tree as an oil crop. Next-generation sequencing technologies and different computational procedures were used to study the composition of the olive genome and its repetitive fraction. A total of 2.03 and 2.3 genome equivalents of Illumina and 454 reads from genomic DNA, respectively, were assembled following different procedures, producing more than 200,000 contigs of differing redundancy, with a mean length greater than 1,000 nt. Illumina reads were mapped onto the assembled sequences to estimate their redundancy. The genome data set was subdivided into highly redundant, medium redundant, and nonredundant contigs. By combining identification and mapping of repeated sequences, it was established that tandem repeats represent a very large portion of the olive genome (∼31% of the whole genome), consisting of six main families of different length, two of which were first discovered in these experiments. The other large redundant class in the olive genome is represented by transposable elements (especially long terminal repeat-retrotransposons). Overall, our analyses reveal the peculiar landscape of the olive genome, shaped by a massive amplification of tandem repeats that exceeds that reported for any other sequenced plant genome.
Keywords: Olea europaea; assembly of NGS reads; genome landscape; repetitive DNA; retrotransposons; tandem repeats
Year: 2014 PMID: 24671744 PMCID: PMC4007544 DOI: 10.1093/gbe/evu058
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Figure: Repeat abundance based on one genome equivalent of Illumina (top) and 454 reads (bottom) clustered using RepeatExplorer (see Materials and Methods). Each bar in the histograms shows the individual size (height) of each cluster and the size relative to the total (width). The composition of each cluster is indicated by color, and single-copy, unclustered sequences are reflected to the right of the vertical bar. For the most redundant clusters, the annotation is reported within the bar.
Figure: The assembly pipeline followed in these experiments to obtain a WGSAS.
Characteristics of Assembled Sequence Sets Obtained by CLC-BIO Genomic Workbench and Minimus 2 Assemblies after Different Splitting of Illumina Reads
| Split | Number of Subpackages | Subpackage Coverage | Number of Assembled Supercontigs | Mean Length | Mean Number of Mapped Reads | Average Coverage | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.03× | 106,364 | 281.6 | 852.7 | 293.5 | 315 |
| 16 | 16 | 0.13× | 104,468 | 255.6 | 924.7 | 288.3 | 266 |
| 64 | 64 | 0.03× | 56,610 | 281.8 | 1601.0 | 645.9 | 299 |
| 256 | 256 | 0.008× | 29,825 | 301.1 | 2809.9 | 708.3 | 342 |
| 512 | 512 | 0.004× | 28,343 | 278.3 | 2873.9 | 779.6 | 313 |
^a Supercontigs are contigs (as assembled by CLC-BIO) assembled to other contigs by Minimus 2.
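The subpackage coverages in the table are consistent with dividing the total Illumina coverage (2.03×) evenly among the subpackages. The following minimal sketch reproduces that arithmetic; the even split is an assumption inferred from the table values, and the function name is illustrative.

```python
# Total Illumina coverage in genome equivalents, taken from the "Split 0" row.
TOTAL_COVERAGE = 2.03


def subpackage_coverage(total_coverage: float, n_subpackages: int) -> float:
    """Coverage of a single subpackage, assuming reads are split evenly."""
    if n_subpackages < 1:
        raise ValueError("n_subpackages must be at least 1")
    return total_coverage / n_subpackages


# Number of subpackages listed in the table.
for n in (1, 16, 64, 256, 512):
    print(f"{n:>3} subpackages: {subpackage_coverage(TOTAL_COVERAGE, n):.4f}x each")
```

Rounded to the precision used in the table, these values match the reported 0.13×, 0.03×, 0.008×, and 0.004×.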
Statistics of Partial Assemblies and of Final Assembly (WGSAS)
| Assembly | Number of Assembled Supercontigs and Contigs | Mean Length (nt) | Contig Length Range (nt) | |
|---|---|---|---|---|
| Illumina (Split 0) | 1,564,223 | 177.3 | 80–30,891 | 190 |
| Illumina (Split 16) | 592,971 | 155.2 | 80–22,320 | 164 |
| Illumina (Split 64) | 354,698 | 159.9 | 80–7,439 | 174 |
| Illumina (Split 256) | 289,536 | 162.6 | 80–4,768 | 181 |
| Illumina (Split 512) | 287,453 | 159.8 | 80–4,529 | 178 |
| Illumina (Total) | 1,949,661 | 195.6 | 80–30,891 | 214 |
| 454 | 1,096,975 | 445.0 | 80–12,778 | 723 |
| 454 + Illumina | 123,849 | 798.4 | 80–12,208 | 1,180 |
| WGSAS | 210,068 | 1,167.0 | 80–30,891 | 1,505 |
^a Supercontigs are contigs (as assembled by CLC-BIO) assembled to other contigs by Minimus 2; contigs are sequences assembled by CLC-BIO that remained singletons after Minimus 2 assembly.
^b Made by assembling contigs obtained with differently sized packages of Illumina reads.
Figure: Distribution of mapped reads in the final assembly of the olive whole-genome database. Sequences were subdivided into redundant (average coverage > 16.2) and nonredundant (average coverage < 16.2).
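The redundancy split described in this caption amounts to a single threshold on average coverage. A minimal sketch follows; the contig names and two of the coverage values are hypothetical, and treating a coverage of exactly 16.2 as nonredundant is an assumption, since the caption does not say how ties are handled.

```python
# Average-coverage cutoff stated in the figure caption.
REDUNDANCY_THRESHOLD = 16.2


def classify_contig(average_coverage: float) -> str:
    """Label a contig as redundant or nonredundant by its average coverage.

    Assumption: a contig at exactly the threshold is called nonredundant.
    """
    if average_coverage > REDUNDANCY_THRESHOLD:
        return "redundant"
    return "nonredundant"


# Hypothetical contigs, for illustration only.
example_contigs = {"contig_A": 293.5, "contig_B": 3.1, "contig_C": 16.2}
labels = {name: classify_contig(cov) for name, cov in example_contigs.items()}
print(labels)
```

The further split of redundant contigs into highly and medium redundant sets is not sketched here because the corresponding cutoff is not given in this record.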
Figure: Sequence composition of the OLEAREP database (HR and MR sequences).
The Largest Gene Families Represented in the Olive WGSAS
| Protein Encoded by the Gene Family | Number of Sequences |
|---|---|
| NBS-LRR disease resistance protein | 176 |
| Protein kinase domain-containing protein | 75 |
| Serine/threonine protein kinase | 54 |
| Pentatricopeptide repeat-containing protein | 31 |
| Cytochrome P450 | 25 |
| NB-ARC disease resistance protein | 22 |
| Ankyrin | 21 |
| Tyrosine kinase | 20 |
| ABC transporter F family | 20 |
| WD40 repeat-containing protein | 20 |
| Myb transcription factor | 17 |
| Glycosyltransferase | 15 |
| Glycosyl hydrolase | 15 |
Statistics of Mapping of Illumina Reads to the Whole-Sequence Data Set
| Sequence Data Set | Number of Reads | % of Genomic Reads |
|---|---|---|
| Matched genomic reads | 124,445,343 | 89.70 |
| RC (HR) | 53,587,657 | 38.62 |
| RC (MR) | 47,388,283 | 34.16 |
| NRC | 23,469,403 | 16.92 |
| Not matched genomic reads | 14,296,611 | 10.30 |
| Total genomic reads | 138,741,954 | 100.00 |
| Organellar reads | 13,203,073 | |
| Total | 151,945,027 | |
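The read counts in this table are additive, which can be checked directly. The sketch below recomputes the subtotals and percentages from the reported counts; the variable and function names are illustrative, not taken from the original analysis.

```python
# Read counts copied from the mapping statistics table.
reads = {
    "HR": 53_587_657,           # highly redundant contigs
    "MR": 47_388_283,           # medium redundant contigs
    "NRC": 23_469_403,          # nonredundant contigs
    "not_matched": 14_296_611,  # genomic reads that did not map
}
organellar_reads = 13_203_073

matched_reads = reads["HR"] + reads["MR"] + reads["NRC"]
total_genomic_reads = matched_reads + reads["not_matched"]
total_reads = total_genomic_reads + organellar_reads


def percent_of_genomic(count: int) -> float:
    """Share of the genomic (non-organellar) reads, in percent."""
    return round(100 * count / total_genomic_reads, 2)


print("matched:", matched_reads, percent_of_genomic(matched_reads))
for name, count in reads.items():
    print(name, count, percent_of_genomic(count))
print("total genomic:", total_genomic_reads, "total:", total_reads)
```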
Percentage Distribution of Repeat Classes in the Olive Genome
| Sequence Type | Order | Superfamily | Number of Contigs | Number of Matched Reads | Percentage |
|---|---|---|---|---|---|
| Retrotransposons | Unclassified | | 42 | 34,017 | 0.025 |
| (Class I) | LTR | | 54,110 | 24,725,640 | 17.821 |
| | | | 47,920 | 28,884,342 | 20.819 |
| | | Retrovirus | 101 | 74,960 | 0.054 |
| | | Endogenous retrovirus | 4 | 6,314 | 0.005 |
| | | Solo-LTR | 52 | 18,355 | 0.013 |
| | | Unknown | 189 | 174,016 | 0.125 |
| | LINE | L1 | 2,384 | 1,739,119 | 1.253 |
| | | RTE | 453 | 123,845 | 0.089 |
| | | Unknown | 38 | 20,591 | 0.015 |
| | Short-interspersed elements | tRNA | 268 | 64,093 | 0.046 |
| | Total | | | | 40.265 |
| DNA transposons | Unclassified | | 67 | 32,668 | 0.024 |
| (Class II, subclass I) | TIR | Tc1-Mariner | 217 | 74,711 | 0.054 |
| | | hAT | 7,187 | 2,784,674 | 2.007 |
| | | Mutator | 5,790 | 3,335,678 | 2.404 |
| | | PiggyBac | 1 | 34 | 0.000 |
| | | PIF-Harbinger | 754 | 250,771 | 0.181 |
| | | CACTA | 1,212 | 496,957 | 0.358 |
| | Crypton | Crypton | 7 | 2,054 | 0.001 |
| (Class II, subclass II) | Helitron | Helitron | 1,297 | 672,682 | 0.485 |
| | Total | | | | 5.514 |
| Tandem repeats | | | 11,260 | 43,233,770 | 31.161 |
| rDNA | | | 356 | 1,932,081 | 1.393 |
| Unknown | | | 308 | 179,225 | 0.129 |
| No hits found | | | 74,292 | 14,584,090 | 10.512 |
| Total reads excluding organellar ones | | | | 138,741,954 | |
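Each percentage in this table corresponds to the number of matched reads divided by the total number of non-organellar reads. As a hedged illustration, the sketch below recomputes a few entries and the two class subtotals from the values in the table; the names are illustrative and the code is not the authors' pipeline.

```python
# Total reads excluding organellar ones, from the last row of the table.
TOTAL_GENOMIC_READS = 138_741_954


def genome_percentage(matched: int, total: int = TOTAL_GENOMIC_READS) -> float:
    """Estimated share of the genome, in percent, from matched read counts."""
    return round(100 * matched / total, 3)


# Matched read counts for three rows of the table.
matched_reads = {
    "Tandem repeats": 43_233_770,
    "rDNA": 1_932_081,
    "No hits found": 14_584_090,
}
for repeat_class, count in matched_reads.items():
    print(repeat_class, genome_percentage(count))

# Per-row percentages listed under each transposable element class.
retrotransposon_pcts = [0.025, 17.821, 20.819, 0.054, 0.005, 0.013,
                        0.125, 1.253, 0.089, 0.015, 0.046]
dna_transposon_pcts = [0.024, 0.054, 2.007, 2.404, 0.000, 0.181,
                       0.358, 0.001, 0.485]

retrotransposon_total = round(sum(retrotransposon_pcts), 3)
dna_transposon_total = round(sum(dna_transposon_pcts), 3)
print("Retrotransposons:", retrotransposon_total)
print("DNA transposons:", dna_transposon_total)
```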
Composition of the Sanger-Sequenced Small Insert Library
| Sequence Type | Order/Superfamily | Number of Sequences | Percentage |
|---|---|---|---|
| DNA Transposons | Unclassified | 4 | 0.06 |
| | Subclass I | 321 | 5.16 |
| | Subclass II | 49 | 0.79 |
| | Total | 374 | 6.01 |
| Retrotransposons | Unclassified | 1 | 0.02 |
| | LTR/ | 1,110 | 17.83 |
| | LTR/ | 1,277 | 20.51 |
| | LTR/retrovirus | 32 | 0.51 |
| | LINE | 59 | 0.95 |
| | Total | 2,479 | 39.82 |
| Tandem repeats | | 1,504 | 24.16 |
| rDNA | | 103 | 1.65 |
| Similarity to genes | | 513 | 8.24 |
| Unknown repeats | | 80 | 1.31 |
| Unknown | | 36 | 0.56 |
| No hits found | | 1,137 | 18.26 |
| Total nuclear genomic sequences | | 6,226 | |
| Chloroplast | | 149 | |
| Mitochondrion | | 33 | |
| Total sequences | | 6,408 | |

^a Unknown sequences that are assembled using CAP3 (see text).
Characteristics of the Main Tandem Repeat Families Observed in the Olive Genome
| Repeat Family | Already Known as | Length (nt) | GC Content (%) | Estimated % in the Genome |
|---|---|---|---|---|
| Oe80 | OeTaq80 | 80 | 45.4 | 10.33 |
| Oe178 | OeTaq178 | 178 | 43.2 | 9.69 |
| Oe86 | OeGEM86 | 86 | 36.0 | 4.91 |
| Oe179 | Not known | 179 | 36.0 | 4.39 |
| Oe218 | pOS218 | 218 | 41.8 | 4.29 |
| Oe51 | Not known | 51 | 33.5 | 0.78 |
^a According to the number of matching Illumina reads.
Figure: Composition of the tandem repeat class in the olive genome, based on the number of Illumina reads that map to the OLEAREP database.
Figure: Distance tree of olive tandem repeats (100 sequences per family); bootstrap values higher than 0.4 are shown. Bar represents the nucleotide distance.
Figure: Nucleotide diversity (the number of nucleotide substitutions per site) of six tandem repeat families, calculated by aligning 100 “real” sequences per family (the 100 sequences most similar to the consensus). Histograms labeled with the same letter are not significantly different (P > 0.05).
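Nucleotide diversity, as defined in the last caption, can be illustrated as the mean proportion of differing sites over all pairs of aligned sequences. The sketch below uses this uncorrected pairwise estimator as an assumption; the caption does not state which estimator or substitution model was actually used, and the toy sequences are invented for illustration.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Alignment columns with a gap ("-") in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a != base_b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def nucleotide_diversity(aligned_seqs: list[str]) -> float:
    """Mean pairwise p-distance (substitutions per site) over all pairs."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical sequences, not olive repeats).
toy_alignment = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTAC"]
print(round(nucleotide_diversity(toy_alignment), 4))
```

For a real family, the input would be the 100 aligned repeat copies mentioned in the caption rather than this toy alignment.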
Average Coverage of a Sample of Olive LTR-Retrotransposons Measured Separately on LTR and Inter-LTR Regions
| Superfamily | Cluster Number | Average Coverage (LTR) | Average Coverage (Inter-LTR) | LTR to Inter-LTR Ratio |
|---|---|---|---|---|
| | 24 | 1320.5 | 3816.5 | 0.346 |
| | 39 | 7107.8 | 5380.4 | 1.321 |
| | 48 | 3161.2 | 3119.2 | 1.013 |
| | 63 | 1451.9 | 1668.7 | 0.870 |
| | 66 | 2874.1 | 2186.8 | 1.314 |
| | 72 | 3068.2 | 1570.2 | 1.954 |
| | 86 | 1557.3 | 2444.8 | 0.637 |
| | 90 | 418.3 | 1475.4 | 0.284 |
| | 102 | 1422.8 | 1348.1 | 1.055 |
| | 108 | 507.1 | 1101.8 | 0.460 |
| | 112 | 1414.5 | 917.7 | 1.541 |
| | 114 | 1306.8 | 1211.7 | 1.078 |
| | 142 | 1098.0 | 1096.2 | 1.002 |
| | 165 | 744.8 | 797.5 | 0.934 |
| | 172 | 409.1 | 561.3 | 0.729 |
| | 178 | 1148.4 | 652.8 | 1.759 |
| | 212 | 983.4 | 520.4 | 1.890 |
| | 213 | 674.7 | 450.8 | 1.497 |
| | 239 | 509.9 | 497.6 | 1.025 |
| | 262 | 343.4 | 418.3 | 0.821 |
| | Mean | | | 1.077 |
| | 45 | 5434.3 | 3318.0 | 1.638 |
| | 69 | 3669.1 | 1455.0 | 2.522 |
| | 146 | 10393.6 | 914.3 | 11.368 |
| | 149 | 1338.2 | 2626.0 | 0.510 |
| | 157 | 38338.0 | 869.2 | 44.107 |
| | 180 | 1208.5 | 658.0 | 1.837 |
| | Mean | | | 10.330 |
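The ratio column is the average coverage of the LTR region divided by that of the inter-LTR region, and each "Mean" row averages the ratios of the preceding group. A minimal sketch using the second group of clusters from the table follows; the dictionary layout and function name are illustrative.

```python
# (LTR coverage, inter-LTR coverage) per cluster, from the second group of rows.
coverage_by_cluster = {
    45: (5434.3, 3318.0),
    69: (3669.1, 1455.0),
    146: (10393.6, 914.3),
    149: (1338.2, 2626.0),
    157: (38338.0, 869.2),
    180: (1208.5, 658.0),
}


def ltr_to_inter_ltr_ratio(ltr_coverage: float, inter_ltr_coverage: float) -> float:
    """Ratio of average coverage on the LTRs to that on the inter-LTR region."""
    if inter_ltr_coverage <= 0:
        raise ValueError("inter-LTR coverage must be positive")
    return ltr_coverage / inter_ltr_coverage


ratios = {
    cluster: ltr_to_inter_ltr_ratio(ltr, inter)
    for cluster, (ltr, inter) in coverage_by_cluster.items()
}
group_mean = sum(ratios.values()) / len(ratios)

for cluster, ratio in ratios.items():
    print(f"cluster {cluster}: {ratio:.3f}")
print(f"group mean: {group_mean:.3f}")
```

The group mean is strongly influenced by clusters 146 and 157, whose LTR coverage far exceeds their inter-LTR coverage.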