Magali Ruffier1, Andreas Kähäri1, Monika Komorowska1, Stephen Keenan1, Matthew Laird1, Ian Longden1, Glenn Proctor1, Steve Searle2, Daniel Staines1, Kieron Taylor1, Alessandro Vullo1, Andrew Yates1, Daniel Zerbino1, Paul Flicek1,2.
Abstract
The Ensembl software resources are a stable infrastructure to store, access and manipulate genome assemblies and their functional annotations. The Ensembl 'Core' database and Application Programming Interface (API) was our first major piece of software infrastructure and remains at the centre of all of our genome resources. Since its initial design more than fifteen years ago, the number of publicly available genomic, transcriptomic and proteomic datasets has grown enormously, accelerated by continuous advances in DNA-sequencing technology. Our framework was initially intended to provide annotation for the reference human genome; we have since extended it to support the genomes of all species as well as richer assembly models. Cross-referenced links to other informatics resources facilitate searching our database with a variety of popular identifiers such as UniProt and RefSeq. Our comprehensive and robust framework, which stores a large diversity of genome annotations in one location, serves as a platform for other groups to generate and maintain their own tailored annotation. We welcome reuse and contributions: our databases and APIs are publicly available, all of our source code is released with a permissive Apache v2.0 licence at http://github.com/Ensembl and we have an active developer mailing list (http://www.ensembl.org/info/about/contact/index.html). Database URL: http://www.ensembl.org
Year: 2017 PMID: 28365736 PMCID: PMC5467575 DOI: 10.1093/database/bax020
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The core assembly schema.
Figure 2. The Ensembl web browser can display the differences between a patch region and its equivalent in the primary assembly. Genes that are present in both regions are identified as alt_alleles. (http://e87.ensembl.org/Homo_sapiens/Location/Multi?db=core;g=ENSG00000175164;r=CHR_HG2030_PATCH:133174055-133504218;r1=9:133173980-133504143:1;s1=Homo_sapiens–9).
Figure 3. Efficient searching of genomic features. To find all features between coordinates s and e (i.e. C, D and E) in the situation where the maximum length of a feature for this coordinate system is m (i.e. the length of feature F), we extract all features whose start lies between s – m and e, then exclude B, since it ends before s.
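The search described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Ensembl implementation: the `Feature` class, the `find_overlapping` function and the example coordinates for features A to F are hypothetical, coordinates are assumed to be inclusive, and the features are assumed to be held in a list sorted by start position (standing in for an indexed start column).

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical stand-in for a genomic feature with inclusive coordinates."""
    name: str
    start: int
    end: int


def find_overlapping(features, s, e, m):
    """Return the features overlapping [s, e].

    `features` must be sorted by start; `m` is the maximum feature length
    for this coordinate system.
    """
    starts = [f.start for f in features]
    # Step 1: no feature starting before s - m can reach s, so restrict the
    # candidates to features whose start lies between s - m and e.
    lo = bisect_left(starts, s - m)
    hi = bisect_right(starts, e)
    candidates = features[lo:hi]
    # Step 2: drop candidates that end before s (feature B in the caption).
    return [f for f in candidates if f.end >= s]


# Invented coordinates that mimic the layout described in the caption.
features = [
    Feature("A", 10, 30),
    Feature("B", 50, 80),
    Feature("C", 90, 120),
    Feature("D", 130, 150),
    Feature("E", 190, 230),
    Feature("F", 300, 359),  # longest feature, so it defines m
]
m = max(f.end - f.start + 1 for f in features)
hits = find_overlapping(features, s=100, e=200, m=m)
hit_names = [f.name for f in hits]
```

Bounding the start position on both sides is what lets a sorted index narrow the candidates; without the lower bound s – m, every feature starting before e would have to be checked against s.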
Figure 4. This ID History Map for the SCARN4 gene (http://e87.ensembl.org/Homo_sapiens/Gene/Idhistory?g=ENSG00000281516) aligns Ensembl release numbers, genomic assembly versions, and version numbers of that gene across multiple Ensembl IDs. The different updates in the version ID are represented as a chain of small nodes, connected by lines. The colour of the line reflects how well consecutive versions match, for recent releases. If a score was not calculated (typically in older releases), the line is grey.
A selection of resources mapped by the Ensembl cross-reference system, indicating the methods used to map each resource, the Ensembl feature it is mapped to, any additional resources brought in by this association and whether the resource is used to name genes.
| Resource | Mapping method | Linked to | Transitive resources | Used for naming |
|---|---|---|---|---|
| HGNC | Direct | Genes | | Yes |
| UniProt Knowledgebase (UniProtKB) | Direct, alignment | Proteins | PDBe | Yes |
| RefSeq mRNA | Location overlap, alignment | Transcripts | EntrezGene | Yes |
| RefSeq Proteins | Alignment | Proteins | | No |
| UniProt Archive (UniParc) | Checksum | Proteins | | No |
| RNACentral | Checksum | Non-coding transcripts | | No |
| MGI | Direct | Genes | | Yes |
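
The "Checksum" mapping method listed in the table can be illustrated with a short sketch: two sequences are linked when their checksums are identical, so no alignment is needed. The table does not say which hash is used; MD5 is an assumption here, and the function names and all identifiers and sequences below are invented for illustration.

```python
import hashlib


def sequence_checksum(sequence):
    """Checksum of a sequence; MD5 over the upper-cased text is an assumption."""
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


def map_by_checksum(local_sequences, external_sequences):
    """Link each local identifier to the external accessions with an identical sequence."""
    # Index the external resource by checksum once, then look up each local sequence.
    index = {}
    for accession, sequence in external_sequences.items():
        index.setdefault(sequence_checksum(sequence), []).append(accession)
    return {
        local_id: index.get(sequence_checksum(sequence), [])
        for local_id, sequence in local_sequences.items()
    }


# Invented identifiers and sequences, for illustration only.
local = {"local_protein_1": "MKTAYIAKQR", "local_protein_2": "MSSHEGGKKK"}
external = {"external_A": "mktayiakqr", "external_B": "MKTAYIAKQW"}
links = map_by_checksum(local, external)
```

Because only exact matches are linked, a single-residue difference (as between `local_protein_1` and `external_B` here) produces no cross-reference, which is the trade-off against alignment-based mapping.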
Figure 5. Sequences from external sources are aligned against Ensembl features. For ENST00000315596.14 (http://e87.ensembl.org/Homo_sapiens/Transcript/Similarity?db=core;g=ENSG00000083642;r=13:1-50000000;t=ENST00000315596), a number of predicted RefSeq peptide sequences have been aligned with small mismatches. The curated RefSeq peptide NP_055847 aligns perfectly but the corresponding mRNA sequence, NM_015032.3, does not, indicating that there is a difference in the UTR sequence of this transcript.