T J P Hubbard, B L Aken, K Beal, B Ballester, M Caccamo, Y Chen, L Clarke, G Coates, F Cunningham, T Cutts, T Down, S C Dyer, S Fitzgerald, J Fernandez-Banet, S Graf, S Haider, M Hammond, J Herrero, R Holland, K Howe, K Howe, N Johnson, A Kahari, D Keefe, F Kokocinski, E Kulesha, D Lawson, I Longden, C Melsopp, K Megy, P Meidl, B Ouverdin, A Parker, A Prlic, S Rice, D Rios, M Schuster, I Sealy, J Severin, G Slater, D Smedley, G Spudich, S Trevanion, A Vilella, J Vogel, S White, M Wood, T Cox, V Curwen, R Durbin, X M Fernandez-Suarez, P Flicek, A Kasprzyk, G Proctor, S Searle, J Smith, A Ureta-Vidal, E Birney.
Abstract
The Ensembl (http://www.ensembl.org/) project provides a comprehensive and integrated source of annotation of chordate genome sequences. Over the past year the number of genomes available from Ensembl has increased from 15 to 33, with the addition of sites for the mammalian genomes of elephant, rabbit, armadillo, tenrec, platypus, pig, cat, bush baby, common shrew, microbat and European hedgehog; the fish genomes of stickleback and medaka; and the second example of the genomes of the sea squirt (Ciona savignyi) and the mosquito (Aedes aegypti). Some of the major features added during the year include the first complete gene sets for genomes with low-sequence coverage, the introduction of new strain variation data and the introduction of new orthology/paralogy annotations based on gene trees.
Year: 2006 PMID: 17148474 PMCID: PMC1761443 DOI: 10.1093/nar/gkl996
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Growth in the number of genomes provided by Ensembl over the past 5 years. The discontinuities at the start of 2006 and 2005 represent the removal of the honeybee (Apis mellifera) and nematode (Caenorhabditis briggsae) Ensembl sites, respectively. The black line shows all genomes and the red line shows mammalian genomes.
Figure 2. Screenshot of part of an AlignSliceView web page from the elephant (Loxodonta africana) genome, as an example of the output from the Ensembl gene build system when applied to low-coverage shotgun genomes. The top panel shows elephant genome sequence and the bottom panel shows the region of human genome sequence that aligns to it. In the DNA(contigs) track, blue regions indicate sequence and blank regions indicate gaps. The track for elephant gives an idea of the fragmentation of the genome assembly (the gaps in the track for human indicate gaps in the alignment between elephant and human, not gaps in the genome). Elephant DNA contigs have been organized into ‘gene-scaffolds’ based on whole genome alignments (WGA) to a reference genome, in this case human (see text). Elephant transcripts, such as the reverse strand transcript ENSLAFT00000011080 shown here, are built by projecting the protein-coding part of human transcripts through the WGA. In this case the elephant transcript has been built by projecting the annotated transcript C9orf138; however, there is no WGA for the third exon of this transcript [the third exon from the right is positioned against a gap between contigs in the DNA(contigs) track]. As a result this exon is missing from the view of human transcripts, a fact that is indicated by the green dotted link joining exons 2 and 4 (CI138_HUMAN). The elephant transcript ENSLAFT00000011080 does contain this exon; however, because of the gap in the elephant sequence, only the exon length can be inferred from the corresponding human transcript, so the exon sequence is composed entirely of ‘N’s in the transcript and ‘X’s in the corresponding translation. Interestingly, in human a shorter alternative transcript is also annotated with a missing third exon (Q5VZT6_HUMAN); however, the form with the third exon appears to be conserved across mouse, rat and dog, suggesting that it is likely to be conserved in elephant too.
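The ‘N’ and ‘X’ padding described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the Ensembl gene build code: the function names, the toy exon sequences and the choice to represent a missing exon by its inferred length alone are all assumptions made for illustration. An exon whose sequence falls in an assembly gap is written as a run of ‘N’s, and any codon containing an ‘N’ translates to ‘X’.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice_with_gaps(exons):
    """Join exons into a transcript sequence.

    Each exon is either a DNA string or an int giving the length of an
    exon whose sequence is unknown (hypothetical representation); the
    latter is padded with 'N'.
    """
    return "".join("N" * e if isinstance(e, int) else e.upper() for e in exons)


def translate(cds):
    """Translate a coding sequence; codons with unknown bases become 'X'."""
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy example: the middle exon lies in a gap, only its length (6 bp) is known.
transcript = splice_with_gaps(["ATGGCC", 6, "GATTAA"])
protein = translate(transcript)
```

In this toy case `transcript` is `ATGGCCNNNNNNGATTAA` and `protein` is `MAXXD*`. Real exons need not start on codon boundaries, so codons that only partly overlap the gap would also translate to ‘X’.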
Completeness of gene builds on low sequence coverage genomes
| Genome | Raw unfiltered: base pairs covered (%) | Raw unfiltered: exons missed (%) | Filtered chained (final gene set): base pairs covered (%) | Filtered chained (final gene set): exons missed (%) |
|---|---|---|---|---|
| Cow (3×) | 91.4 | 6.8 | 80.0 | 15.5 |
| Elephant | 80.0 | 17.6 | 64.8 | 30.3 |
| Rabbit | 82.0 | 16.7 | 65.7 | 30.8 |
| Armadillo | 77.0 | 21.3 | 59.4 | 36.6 |
| Tenrec | 83.2 | 15.1 | 69.1 | 27.0 |
The table shows the fraction of base pairs of human Ensembl gene set exons covered by raw alignments to WGS scaffolds (first column) and by the filtered, chained gene-scaffolds presented as the final gene set (third column). The fraction of exons completely missed in each case is also shown (second and fourth columns, respectively). All genomes are 2× WGS assemblies, except for cow, which is 3× (see text).
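The two statistics in the table can be sketched in Python as follows. This is an illustrative calculation only, not the procedure used to produce the published figures: it assumes that exons and alignment blocks are given as half-open `(start, end)` intervals on a single reference sequence, that the exons do not overlap one another, and the function names and toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exon_coverage(exons, aligned_blocks):
    """Return (% of exon base pairs covered, % of exons completely missed)."""
    blocks = merge_intervals(aligned_blocks)
    total_bp = covered_bp = missed = 0
    for ex_start, ex_end in exons:
        # Sum the overlap of this exon with every merged alignment block.
        overlap = sum(
            max(0, min(ex_end, b_end) - max(ex_start, b_start))
            for b_start, b_end in blocks
        )
        total_bp += ex_end - ex_start
        covered_bp += overlap
        if overlap == 0:
            missed += 1
    return 100.0 * covered_bp / total_bp, 100.0 * missed / len(exons)


# Toy data: four 100 bp exons and three alignment blocks.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
aligned = [(90, 210), (350, 450), (700, 760)]
bp_covered_pct, exons_missed_pct = exon_coverage(exons, aligned)
```

With these toy coordinates 210 of 400 exon base pairs are covered (52.5%) and one exon of four has no overlap at all (25%). The same function applied to raw alignments and to filtered, chained gene-scaffolds would give the two pairs of columns.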
Figure 3. Sequence variation across mouse strains for the transcript ENSMUST00000006949 in the new view TranscriptSNPView. Strain-specific SNPs were calculated by aligning mouse reads from different strains against the reference genome as described previously (25). This gene-centric view collapses the size of introns to focus on variation within exons, which are shown separately for each strain with the consequences of any SNPs on their coding sequence. The extent of resequencing coverage is also shown for each strain.
Figure 4. The gene tree panel from the GeneTreeView web page for the human FOXJ3 gene, generated by the Ensembl gene orthology/paralogy prediction pipeline. Most of the ortholog relationships are one to one; however, there is a one-to-many relationship to the fish lineage, where the gene appears to have duplicated. The relationship to the paralogous gene FOXJ2 can also be seen, where the orthologs in the fish lineage appear to have been lost. The full web page (not shown) includes links to view the tree structure in the Java applet ATV and to view the protein sequence alignment upon which it is based in the Java applet Jalview. The green bars represent the alignments of the protein translations upon which the tree is based, where shaded blocks represent aligned regions. Poor and fragmented alignments can cause erroneous placement of genes in the tree, so the visualization of the alignment is useful when interpreting the tree.
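The one-to-one and one-to-many labels mentioned in the Figure 4 caption can be illustrated by counting partners in a list of ortholog pairs between two species. The sketch below is a simplification made for illustration: the Ensembl pipeline derives these relationships from the gene tree itself, and the function name and gene labels here are hypothetical placeholders rather than real identifiers.

```python
from collections import defaultdict


def classify_orthologs(pairs):
    """Label each (gene_a, gene_b) ortholog pair between two species.

    A pair is 'one-to-one' if neither gene has another partner,
    'one-to-many' if exactly one side has several partners, and
    'many-to-many' if both sides do.
    """
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    for gene_a, gene_b in pairs:
        partners_of_a[gene_a].add(gene_b)
        partners_of_b[gene_b].add(gene_a)

    labels = {}
    for gene_a, gene_b in pairs:
        a_has_many = len(partners_of_a[gene_a]) > 1
        b_has_many = len(partners_of_b[gene_b]) > 1
        if a_has_many and b_has_many:
            labels[(gene_a, gene_b)] = "many-to-many"
        elif a_has_many or b_has_many:
            labels[(gene_a, gene_b)] = "one-to-many"
        else:
            labels[(gene_a, gene_b)] = "one-to-one"
    return labels


# Placeholder labels: one human gene with a single mouse ortholog,
# and the same human gene with two copies in a fish species.
human_mouse = classify_orthologs([("human_gene", "mouse_gene")])
human_fish = classify_orthologs(
    [("human_gene", "fish_copy_a"), ("human_gene", "fish_copy_b")]
)
```

Here the human/mouse pair is labelled `one-to-one`, while both human/fish pairs are labelled `one-to-many`, matching the pattern of a lineage-specific duplication.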
Figure 5. A portion of the human genome in ContigView showing the transcript Q6PEX7_HUMAN. ContigView has a ‘Basepair view’ panel allowing DNA sequence and six-frame translation to be examined, which is shown centered on the second exon of this transcript. The introduction of AJAX functionality to ContigView greatly simplifies navigation to precise locations in ‘Basepair view’. Clicking and dragging in any ContigView panel draws a red box, and a popup appears when the mouse button is released. In this example a region around the start of translation has been selected in ‘Detailed view’. The ‘click…drag…release’ mouse gesture that the user needs to perform is shown by the annotation on the figure. Clicking on the first option in the popup would reposition ‘Basepair view’ around this feature. This functionality greatly improves the interactivity of the web interface and will be progressively incorporated into other Ensembl views.