Dan M Bolser, Arnaud Kerhornou, Brandon Walts, Paul Kersey.
Abstract
Recent developments in DNA sequencing have enabled the large and complex genomes of many crop species to be determined for the first time, even those previously intractable due to their polyploid nature. Indeed, over the course of the last 2 years, the genome sequences of several commercially important cereals, notably barley and bread wheat, have become available, as well as those of related wild species. While these sequences are still incomplete, comparison with other, more completely assembled species suggests that coverage of genic regions is likely to be high. Ensembl Plants (http://plants.ensembl.org) is an integrative resource organizing, analyzing and visualizing genome-scale information for important crop and model plants. Available data include reference genome sequence, variant loci, gene models and functional annotation. For variant loci, individual and population genotypes, linkage information and, where available, phenotypic information are shown. Comparative analyses are performed on DNA and protein sequence alignments. The resulting genome alignments and gene trees, representing the implied evolutionary history of the gene family, are made available for visualization and analysis. Driven by the case of bread wheat, specific extensions to the analysis pipelines and web interface have recently been developed to support polyploid genomes. Data in Ensembl Plants are accessible through a genome browser incorporating various specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These interfaces are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests and pollinators, facilitating the study of the plant in its environment.
Keywords: Comparative genomics; Functional genomics; Genetic variation; Genome browser; Transcriptomics; Triticeae
Year: 2014 PMID: 25432969 PMCID: PMC4301745 DOI: 10.1093/pcp/pcu183
Source DB: PubMed Journal: Plant Cell Physiol ISSN: 0032-0781 Impact factor: 4.927
Fig. 1. Visualizing the bread wheat genome through the Ensembl Genomes browser interface. The user can view many layers of genome annotation in a highly customizable way. Tracks shown include (A) gene models, (B) assemblies and interhomoeologous variations from Brenchley et al. (2012), (C) RNA-Seq data, (D) variations from the AXIOM array and (E) transcript assemblies from T. turgidum. Additional tracks are shown for T. aestivum ESTs and UniGenes (purple and green), alignment blocks to O. sativa and B. distachyon (pink), repeats (gray) and GC content.
Fig. 2. Detailed view of a gene tree in Ensembl Plants. The tree shows the inferred evolutionary history of the sucrose-6F-phosphate phosphohydrolase family protein in H. vulgare. The gene tree (left) shows the expected phylogenetic relationship for the gene between the species shown. Note that the sequence identifier of each wheat gene includes the name of the chromosome arm to which it belongs, e.g. 5DS for the short arm of chromosome 5 in the D-genome. Red squares indicate inferred duplication events in the history of the gene, and shaded gray triangles indicate collapsed branches. A pictographic representation of the underlying multiple sequence alignment is included on the gene tree pages (right).
Fig. 5. View of the whole-genome alignment between wheat, rice and Brachypodium in Ensembl Plants. Pink bars and green blocks indicate aligned blocks between the rice and wheat (upper) and wheat and Brachypodium (lower) pairs of genomes. Transcripts are shown in red, but the genomic features shown on each track are configurable.
Fig. 6. Polyploid view of the whole-genome alignment within the bread wheat A, B and D component genomes. The image is defined as in Fig. 5. An additional feature track shows repeats annotated in all three genomes.
Fig. 3. The transcript variation image for the H. vulgare MLOC_42.1 protein-coding transcript in Ensembl Plants. The image gives an overview of all the variants within the transcript in the context of the functional domains assigned to the protein. Upper boxes highlight the amino acid change, where applicable, and lower boxes give the alleles. Variants are color coded according to their consequence type: missense, synonymous or positional. A full list of consequence types is given at http://www.ensembl.org/info/genome/variation/predicted_data.html. The transcripts, features and variations can be clicked to explore more information about each object.
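To illustrate how a missense change differs from a synonymous one, the following minimal Python sketch classifies a single codon substitution using the standard genetic code. It is a simplified illustration only: the function name `coding_consequence` and the four labels it returns are assumptions made for this example, and the sketch does not reproduce the consequence prediction actually run by Ensembl Plants.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon, alt_codon):
    """Classify a codon substitution (hypothetical helper, simplified labels)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid encoded
    if alt_aa == "*":
        return "stop_gained"  # variant introduces a stop codon
    if ref_aa == "*":
        return "stop_lost"    # variant removes a stop codon
    return "missense"         # a different amino acid is encoded
```

For example, `coding_consequence("GAA", "GAG")` leaves glutamate unchanged, whereas `coding_consequence("GAA", "GTA")` replaces it with valine.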
Triticeae genomes in Ensembl Plants
| Species (strain) | Description | Estimated genome size (Mb) | Assembly size (Mb) | Genes |
|---|---|---|---|---|
| Barley | Barley is an economically important crop and an important model of environmental diversity for development of wheat | 5,100 | 4,706 | 24,211 |
| Bread wheat | An economically important food crop, accounting for >20% of global agricultural production | 16,974 | 4,460 | 108,569 |
| A-genome progenitor | An einkorn wheat and the diploid progenitor of the bread wheat A-genome | 4,940 | 3,747 | 34,843 |
| D-genome progenitor | The diploid progenitor of the bread wheat D-genome | 4,360 | 3,314 | 33,849 |
Some gene model statistics
| Species | Contig N50 (kbp) | Average gene length | Average exon number | % complete genes | % InterPro coverage |
|---|---|---|---|---|---|
| | | 3,090 | 2.5 | 77 | 68 |
| | | 3,582 | 5.6 | 99 | 86 |
| | 0.9/3.2 | 2,812 | 5.4 | 76 | 84 |
| | 3.4/5.8 | 3,208 | 4.7 | 78 | 78 |
| | 4.5/6.2 | 2,935 | 4.9 | 100 | 77 |
| | 2.4/6.3 | 2,197 | 3.8 | 56 | 84 |
Contig N50 is reported twice: once for the complete assembly and once for only the gene-containing contigs (complete assembly/gene-containing contigs).
Complete genes are defined as those starting with a methionine and ending in a stop codon.
InterPro coverage consists only of structural protein domains and functional motifs, excluding low complexity, coiled-coil, transmembrane and signal motifs.
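The first two footnote definitions can be made concrete with a minimal Python sketch. The function names `n50` and `is_complete_cds` are hypothetical. The sketch assumes the standard definition of N50 (the length of the contig at which the cumulative length, summed from the longest contig downward, first reaches half of the total) and a coding sequence read in frame with an ATG start codon and a TAA, TAG or TGA stop codon. The source does not describe the exact procedure used to produce the table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def n50(lengths):
    """Return the N50 of a collection of contig lengths (standard definition)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def is_complete_cds(cds):
    """True if the coding sequence starts with ATG and ends in a stop codon.

    Assumes the sequence is given in frame, so its length is a multiple of 3.
    """
    cds = cds.upper()
    return (
        len(cds) >= 6
        and len(cds) % 3 == 0
        and cds.startswith("ATG")
        and cds[-3:] in STOP_CODONS
    )
```

Reporting two N50 values per species, as in the table, would then correspond to calling `n50` once on all contig lengths and once on the lengths of the gene-containing contigs only.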
A list of the standard computational analyses that are routinely run over all genomes in Ensembl Plants
| Pipeline name | Summary |
|---|---|
| Repeat feature annotation | Three repeat annotation tools are run: RepeatMasker, Dust and TRF. RepeatMasker is run with repeat libraries from Repbase as well as Triticeae-specific repeats from TREP. |
| Non-coding RNA annotation | tRNAs and rRNAs are predicted using tRNAScan-SE ( |
| Feature density calculation | Feature density is calculated by chunking the genome into bins, and counting features of each type in each bin. |
| Annotation of external cross-references | Database cross references are loaded from a predefined set of sources for each species, using either direct mappings or by sequence alignment. |
| Ontology annotation | In addition to database cross references, ontology annotations are imported from external sources ( |
| Protein feature annotation | Translations are run through InterProScan ( |
| Gene trees | The peptide comparative genomics (Compara) pipeline ( |
| Whole-genome alignment | Whole-genome alignments are provided for closely related pairs of species based on LastZ or translated BLAT results. Where appropriate, |
| Short read alignment | Short reads are automatically downloaded from the SRA by study accession and are aligned to a given reference by using BWA ( |
| Variation coding consequences | For those species with data for known variations, the coding consequences of those variations are computed for each protein-coding transcript ( |
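The feature density calculation summarized in the table (chunking the genome into bins and counting features of each type in each bin) can be sketched in Python as follows. The sketch assumes fixed-width bins, 1-based inclusive feature coordinates that lie within the sequence, and that a feature is counted once in every bin it overlaps. The function name `feature_density` and these conventions are illustrative assumptions, not the pipeline's documented behavior.

```python
def feature_density(seq_length, features, bin_size):
    """Count features of each type per fixed-width bin.

    features: iterable of (start, end, feature_type), 1-based inclusive.
    Returns a dict mapping feature_type to a list of per-bin counts.
    """
    n_bins = -(-seq_length // bin_size)  # ceiling division
    density = {}
    for start, end, feature_type in features:
        counts = density.setdefault(feature_type, [0] * n_bins)
        first_bin = (start - 1) // bin_size
        last_bin = min((end - 1) // bin_size, n_bins - 1)
        # A feature spanning a bin boundary is counted in each bin it touches.
        for b in range(first_bin, last_bin + 1):
            counts[b] += 1
    return density
```

Counting a boundary-spanning feature in both bins is one possible choice; assigning each feature to the bin containing its midpoint would be an equally simple alternative.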
Fig. 4. The matrix of whole-genome alignments between pairs of monocot genomes in Ensembl Plants. Cyan indicates that an alignment exists for the pair. Only one representative rice, O. sativa, is shown, although each of the 10 rice genomes was aligned against each other (not shown).