Fernando Cruz, Irene Julca, Jèssica Gómez-Garrido, Damian Loska, Marina Marcet-Houben, Emilio Cano, Beatriz Galán, Leonor Frias, Paolo Ribeca, Sophia Derdak, Marta Gut, Manuel Sánchez-Fernández, Jose Luis García, Ivo G Gut, Pablo Vargas, Tyler S Alioto, Toni Gabaldón.
Abstract
BACKGROUND: The Mediterranean olive tree (Olea europaea subsp. europaea) was one of the first trees to be domesticated and is currently of major agricultural importance in the Mediterranean region as the source of olive oil. The molecular bases underlying the phenotypic differences among domesticated cultivars, or between domesticated olive trees and their wild relatives, remain poorly understood. Both wild and cultivated olive trees have 46 chromosomes (2n).
Keywords: Annotation; Assembly; Genomics; Olive tree genome
Year: 2016 PMID: 27346392 PMCID: PMC4922053 DOI: 10.1186/s13742-016-0134-5
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Sequencing libraries and respective yields used for whole genome shotgun sequencing and fosmid pools
| Library | Mode | Name | Yield (Gb) |
|---|---|---|---|
| PE400 | 2*262 | 837G_B | 8.3 |
| PE400 | 2*312 | 837G_B | 68.0 |
| PE400 | 2*255 | 837G_B | 8.2 |
| PE560 | 2*312 | 846G_D | 33.9 |
| PE560 | 2*151 | 846G_D | 99.2 |
| PE560 | 2*500 | 846G_E_PCR | 14.1 |
| PE560 | 2*151 | 846G_E_PCR | 46.8 |
| PE725 | 2*151 | 837G_E_PCR | 96.3 |
| PE725 | 1*625 | 837G_E_PCR_2 | 15.2 |
| MP3k | 2*151 | T587 | 33.9 |
| MP5k | 2*151 | T586 | 40.3 |
| MP7k | 2*151 | T585 | 37.6 |
| MP10k | 2*151 | T584 | 42.7 |
| FP PE350 | 2*151 | 1FP to 96FP | 11.3* |
*mean yield
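As a rough cross-check of the table above, the whole-genome shotgun yields can be summed and divided by the genome size to obtain a nominal sequencing depth. The sketch below is illustrative only: it assumes the 1.32 Gb gce estimate quoted in the Fig. 1 caption as the denominator, leaves out the fosmid-pool row (whose yield is a per-pool mean), and uses hypothetical variable and function names.

```python
# Yields (Gb) copied from the table above.
# PE-library runs (PE400, PE560, PE725) and mate-pair runs (MP3k to MP10k).
pe_yields_gb = [8.3, 68.0, 8.2, 33.9, 99.2, 14.1, 46.8, 96.3, 15.2]
mp_yields_gb = [33.9, 40.3, 37.6, 42.7]

# Assumption: genome size taken from the gce estimate in the Fig. 1 caption.
GENOME_SIZE_GB = 1.32


def nominal_depth(yields_gb, genome_size_gb=GENOME_SIZE_GB):
    """Raw sequenced bases divided by genome size (no filtering or trimming)."""
    return sum(yields_gb) / genome_size_gb


total_yield_gb = sum(pe_yields_gb) + sum(mp_yields_gb)
pe_depth = nominal_depth(pe_yields_gb)
mp_depth = nominal_depth(mp_yields_gb)

print(f"total WGS yield: {total_yield_gb:.1f} Gb")
print(f"nominal PE depth: {pe_depth:.0f}x, nominal MP depth: {mp_depth:.0f}x")
```

These are raw-yield figures; the depth actually used by an assembler would be lower after quality trimming and contaminant removal.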
Fig. 1. K-mer spectrum. Using Jellyfish v1.1.10, 17-mers were counted in a subset of whole genome shotgun paired-end reads corresponding to the PE560 2 x 150 sequencing run. The density plot shows the number of unique k-mer species (y axis) for each k-mer frequency (x axis). The homozygous peak is observed at a multiplicity (k-mer coverage) of 52x, while the heterozygous peak is observed at 26x. The tail extending to the right represents repetitive sequences. The total number of k-mers present in this subset was 71,902,584,399. From these data, the Genome Character Estimator (gce) estimates the genome size to be 1.32 Gb.
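The numbers in this caption allow a first-order sanity check: dividing the total k-mer count by the homozygous peak depth gives a naive genome size. The sketch below is that simplification only, not the gce algorithm; gce fits a model to the whole spectrum, and the naive figure also counts error k-mers, so it is expected to come out somewhat above the reported 1.32 Gb. Variable names are illustrative.

```python
# Values quoted in the Fig. 1 caption.
TOTAL_KMERS = 71_902_584_399   # 17-mers counted in the read subset
HOMOZYGOUS_PEAK = 52           # k-mer coverage of the homozygous peak
HETEROZYGOUS_PEAK = 26         # k-mer coverage of the heterozygous peak

# Naive estimate: every genomic position contributes ~HOMOZYGOUS_PEAK k-mers.
naive_genome_size_bp = TOTAL_KMERS / HOMOZYGOUS_PEAK
naive_genome_size_gb = naive_genome_size_bp / 1e9

# The heterozygous peak sits at half the homozygous depth, as expected for
# k-mers present on only one of the two haplotypes.
peak_ratio = HOMOZYGOUS_PEAK / HETEROZYGOUS_PEAK

print(f"naive genome size: {naive_genome_size_gb:.2f} Gb")
print(f"homozygous / heterozygous peak ratio: {peak_ratio:.1f}")
```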
Fig. 2. Comparison of fosmid insert and fosmid-pool scaffold size distributions. Fosmid clone insert size estimates (black continuous line) were obtained by mapping fosmid end sequences to our merged fosmid pool (FP) assembly. The fosmid end sequencing of only 155,000 unique clones resulted in a very high sequencing depth, so we set a lower threshold of 100x for the number of times a given length was seen and counted each length only once. This procedure underestimates the amplitude of the density peak and likely overestimates the standard deviation, but both the shape of the distribution and the mean insert size (36.7 kb) should be unaffected. The distribution of scaffold lengths from the 96 fosmid pool assemblies is given by the blue dashed line (scaffolds smaller than 2.5 kb were discarded to avoid noise).
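The thresholding described in this caption can be sketched as follows. This is one reading of the procedure, not the authors' code: an insert length is kept only if it was observed at least 100 times, and each surviving length then contributes once to the summary statistics. The function name and the toy input are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def insert_size_summary(lengths, min_count=100):
    """Summarise insert sizes, keeping lengths seen >= min_count times.

    Each retained length is counted once, regardless of how often it was seen.
    Returns (mean, population standard deviation, sorted retained lengths).
    """
    counts = Counter(lengths)
    supported = sorted(length for length, n in counts.items() if n >= min_count)
    if not supported:
        raise ValueError("no insert length reached the support threshold")
    return mean(supported), pstdev(supported), supported


# Toy data: two well-supported lengths and one rare, probably spurious, length.
toy_lengths = [36_000] * 150 + [37_000] * 120 + [5_000] * 3
toy_mean, toy_sd, toy_supported = insert_size_summary(toy_lengths)
print(toy_mean, toy_sd, toy_supported)
```

Counting each length once flattens the peak, which is consistent with the caption's remark that the peak amplitude is underestimated.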
Fig. 3. Fosmid pool assembly pipeline. For each fosmid pool, a single paired-end (PE) library sequenced at 2 x 150 bp was first filtered and trimmed of pNGS vector sequences, as well as those of Escherichia coli and other common contaminants, including Olea europaea chloroplast sequences. Reads were assembled with ABySS, gapfilled with GapFiller, and contaminants removed using a BLAST homology search. A consistency check was performed, breaking the assemblies at any point inconsistent with the proper insert size and orientation of fosmid pool PE reads. The resulting contigs were scaffolded using whole genome shotgun (WGS) data, followed by another round of gapfilling, decontamination and consistency checking, this time including the new WGS data. To repair the assembly broken by the consistency checks, a final round of scaffolding, gapfilling and decontamination was performed.
Fig. 4. Overview of the complete assembly pipeline. The basic flow chart starting with the 96 fosmid pool assemblies is shown. Assemblies are shown in orange rounded rectangles. All computational steps are shown as octagons.
Summary statistics of the Oe6 assembly
| Oe6 assembly | Length (bp) | N10 (bp) | N50 (bp) | N90 (bp) | CEGMA complete | CEGMA partial |
|---|---|---|---|---|---|---|
| Contigs | 1,264,682,749 (59,457) | 138,917 (695) | 52,353 (7,085) | 11,476 (25,802) | − | − |
| Scaffolds | 1,318,652,350 (11,038) | 1,088,680 (94) | 443,100 (901) | 110,965 (3,099) | 98.8 % | 98.8 % |
Numbers of contigs/scaffolds are shown in parentheses
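For readers unfamiliar with the contiguity columns, the sketch below shows the standard definition of Nx: the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches x % of the assembly, together with the number of sequences needed to get there (the value in parentheses in the table). The function name and toy lengths are illustrative and are not taken from the Oe6 assembly.

```python
def nx(lengths, x):
    """Return (Nx length, number of sequences) for a list of sequence lengths.

    Sequences are sorted from longest to shortest; Nx is the length of the
    sequence at which the cumulative sum first reaches x % of the total.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= total * x:
            return length, count
    raise AssertionError("unreachable: cumulative sum always reaches the total")


toy_scaffolds = [100, 80, 60, 40, 20]   # total length 300
n10 = nx(toy_scaffolds, 10)
n50 = nx(toy_scaffolds, 50)
n90 = nx(toy_scaffolds, 90)
print(n10, n50, n90)
```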
Fig. 5. Overview of the annotation pipeline. Input data for annotation are shown at the top of the flow chart. Computational steps are shown in light blue and intermediate data are shown in white.
RNA-Seq samples used for annotation
| Accession | Tissue | Varietal |
|---|---|---|
| ERS1146989 | Immature olives | Farga |
| ERS1146988 | Roots | Farga |
| ERS1135096 | Old leaves | Farga |
| ERS1135095 | Young leaves | Farga |
| ERS1135094 | Flowers | Farga |
| ERS1135093 | Flower buds | Farga |
| ERS1135092 | Green olives | Farga |
| SRP000653 | Fruits | Coratina |
| SRP005630 | Buds | Picual, Arbequina |
| SRP044780 | Leaves, Roots | Picual |
| SRP016074 | Fruits, leaves, stems and seeds | Picual x Arbequina |
| SRP017846 | Fruits | Istrska belica |
| SRP024265 | Leaves, Roots | Kalamon |
Weights given to each source of evidence when running Evidence Modeler r2012-06-25
| Type of evidence | Program | Weight |
|---|---|---|
| ABINITIO_PREDICTION | GeneMark | 1 |
| ABINITIO_PREDICTION | Augustus | 1 |
| ABINITIO_PREDICTION | geneid_v1.4 | 1 |
| ABINITIO_PREDICTION | GlimmerHMM | 1 |
| ABINITIO_PREDICTION | geneid_introns | 2 |
| ABINITIO_PREDICTION | Augustus_introns | 2 |
| ABINITIO_PREDICTION | GeneMark-ET | 2 |
| OTHER_PREDICTION | transdecoder | 2 |
| TRANSCRIPT | PASA | 10 |
| PROTEIN | SPALN | 10 |
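EVidenceModeler reads these weights from a plain-text file with three tab-separated columns: evidence class, source label and weight. The sketch below renders the table above in that layout. It assumes that the program names in the table are exactly the source labels used in the corresponding GFF3 inputs, which the table does not state; the function name is hypothetical.

```python
# (evidence class, source label, weight) rows copied from the table above.
# Assumption: source labels match column 2 of the GFF3 evidence files.
EVM_WEIGHTS = [
    ("ABINITIO_PREDICTION", "GeneMark", 1),
    ("ABINITIO_PREDICTION", "Augustus", 1),
    ("ABINITIO_PREDICTION", "geneid_v1.4", 1),
    ("ABINITIO_PREDICTION", "GlimmerHMM", 1),
    ("ABINITIO_PREDICTION", "geneid_introns", 2),
    ("ABINITIO_PREDICTION", "Augustus_introns", 2),
    ("ABINITIO_PREDICTION", "GeneMark-ET", 2),
    ("OTHER_PREDICTION", "transdecoder", 2),
    ("TRANSCRIPT", "PASA", 10),
    ("PROTEIN", "SPALN", 10),
]


def format_weights(rows):
    """Render weight rows as tab-separated lines, one evidence source per line."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in rows)


weights_text = format_weights(EVM_WEIGHTS)
print(weights_text, end="")
```

The tenfold weight on PASA transcript alignments and SPALN protein alignments means that, where they disagree with ab initio predictions, the aligned evidence dominates the consensus gene model.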
Comparison of O. europaea with other plant species
| Species | Number of proteins | Average transcript length (bp) | Average coding sequence length (bp) | Average exons per transcript | Average exon length (bp) | Proteins with homologs in O. europaea | O. europaea proteins with homologs in species |
|---|---|---|---|---|---|---|---|
| O. europaea | 56,349 | 3,953 | 1,050 | 4.54 | 315 | 56,349 (100 %) | 56,349 (100 %) |
| A. thaliana | 35,378 | 2,341 | 1,234 | 5.89 | 261 | 23,106 (65.3 %) | 32,796 (58.2 %) |
| E. guttata | 31,861 | 3,378 | 1,351 | 5.77 | 300 | 24,373 (76.5 %) | 42,458 (75.3 %) |
| S. lycopersicum | 36,148 | 5,626 | 1,389 | 6.48 | 288 | 27,778 (76.8 %) | 38,448 (68.2 %) |
| R. communis | 27,998 | 4,323 | 1,390 | 6.53 | 287 | 21,990 (78.5 %) | 37,264 (66.1 %) |
Average transcript length, coding sequence length, exons per transcript and exon length for the O. europaea, Arabidopsis thaliana, Erythranthe guttata, Solanum lycopersicum and Ricinus communis proteomes, together with the number of proteins in each species with at least one homolog in O. europaea and the number of O. europaea proteins with at least one homolog in that species. The longest protein isoform per gene was used for the homology search.
Fig. 6. Distribution of exons per coding sequence in the analyzed species. The number of exons per CDS feature (UTRs were ignored) was counted and the distribution plotted for the olive and each of the other four species for which we compared annotations. Similar distributions were observed for all species.
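The counting behind such a distribution can be sketched as below, assuming the annotations are available as GFF3 in which each coding exon is one `CDS` line linked to its transcript through the `Parent` attribute. The function names and the toy records are hypothetical and do not come from the olive annotation.

```python
from collections import Counter


def cds_segments_per_transcript(gff3_lines):
    """Count CDS lines (coding exons) per parent transcript; UTRs are ignored."""
    per_transcript = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # A CDS segment may list several parents, separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                per_transcript[parent] += 1
    return per_transcript


def exon_number_distribution(per_transcript):
    """Map 'number of coding exons' -> 'number of transcripts with that many'."""
    return Counter(per_transcript.values())


def _row(ftype, start, end, attrs):
    # Helper that builds one toy GFF3 record on a made-up scaffold.
    return "\t".join(["scf1", "toy", ftype, str(start), str(end), ".", "+", "0", attrs])


toy_gff3 = [
    "##gff-version 3",
    _row("mRNA", 1, 900, "ID=t1;Parent=g1"),
    _row("five_prime_UTR", 1, 50, "Parent=t1"),
    _row("CDS", 51, 200, "ID=cds1;Parent=t1"),
    _row("CDS", 300, 450, "ID=cds1;Parent=t1"),
    _row("CDS", 600, 900, "ID=cds1;Parent=t1"),
    _row("mRNA", 2000, 2600, "ID=t2;Parent=g2"),
    _row("CDS", 2000, 2600, "ID=cds2;Parent=t2"),
]

toy_counts = cds_segments_per_transcript(toy_gff3)
toy_distribution = exon_number_distribution(toy_counts)
print(dict(toy_counts), dict(toy_distribution))
```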