Reece K Hart, Andreas Prlić.
Abstract
MOTIVATION: Access to biological sequence data, such as genome, transcript, or protein sequence, is at the core of many bioinformatics analysis workflows. The National Center for Biotechnology Information (NCBI), Ensembl, and other sequence database maintainers provide methods to access sequences through network connections. For many users, the convenience and currency of remotely managed data are compelling, and the network latency is inconsequential. However, for high-throughput and clinical applications, local sequence collections are essential for performance, stability, privacy, and reproducibility.
Year: 2020 PMID: 33270643 PMCID: PMC7714221 DOI: 10.1371/journal.pone.0239883
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. The sha512t24u digest in Python.
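The code shown in the figure is not reproduced here. As a minimal sketch, assuming the GA4GH definition of sha512t24u (the SHA-512 digest truncated to its first 24 bytes and encoded as URL-safe base64), the digest can be computed with the Python standard library alone:

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return the sha512t24u digest of a byte string.

    Assumes the GA4GH definition: SHA-512, truncated to the first
    24 bytes, then base64url-encoded. 24 bytes encode to exactly
    32 characters, so no '=' padding appears in the result.
    """
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


if __name__ == "__main__":
    # Digest of the empty sequence
    print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

Because the digest depends only on the sequence bytes, two collections that hold the same sequence derive the same identifier without coordinating with each other.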
Summary of operations provided by the Python interface, REST interface, and refget protocol interface.
| Python interface | SeqRepo REST interface | refget protocol |
|---|---|---|
| N/A | /v1/ping | /sequence/service-info |
| store(seq, identifiers[]) | N/A | N/A |
| fetch(ir, [start], [end]); sr[ir][start:end] | /v1/sequence/:ir ("start" and "end" query parameters optional) | /sequence/:digest ("start" and "end" query parameters optional) |
| store(seq, identifiers[]) | N/A | N/A |
| translate_identifier(ir) | /v1/metadata/:ir | /sequence/:digest/metadata |
All interfaces support rapid access to slices of chromosome-sized sequences. The SeqRepo Python interface provides two mechanisms to fetch sequence slices: a fetch() method, and dict-style access that lets a SeqRepo instance be indexed like a Python dictionary. The SeqRepo REST interface and the refget protocol are read-only. "ir" denotes an identifier of the form namespace:alias; an alias may be used without its namespace if it is globally unique. The refget protocol itself currently requires digests for queries.
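To make the two access styles concrete, the following toy in-memory class mimics the method names listed in the table. It is a hypothetical stand-in written for illustration, not SeqRepo's implementation; the class name `ToySeqStore` and the identifiers `demo:seq1` and `other:s1` are invented.

```python
from typing import Dict, Iterable, List, Optional


class ToySeqStore:
    """Hypothetical in-memory stand-in; NOT the SeqRepo implementation."""

    def __init__(self) -> None:
        self._seqs: Dict[int, str] = {}     # internal id -> sequence
        self._aliases: Dict[str, int] = {}  # "namespace:alias" -> internal id

    def store(self, seq: str, identifiers: Iterable[str]) -> None:
        """Store a sequence under one or more namespace:alias identifiers."""
        seq_id = len(self._seqs)
        self._seqs[seq_id] = seq
        for ir in identifiers:
            self._aliases[ir] = seq_id

    def _resolve(self, ir: str) -> int:
        if ir in self._aliases:
            return self._aliases[ir]
        # Bare alias: accept it only if it maps to exactly one sequence.
        matches = {
            sid for key, sid in self._aliases.items()
            if key.split(":", 1)[-1] == ir
        }
        if len(matches) != 1:
            raise KeyError(ir)
        return matches.pop()

    def fetch(self, ir: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str:
        """Method-style access: return the slice [start:end]."""
        return self._seqs[self._resolve(ir)][start:end]

    def __getitem__(self, ir: str) -> str:
        """Dict-style access: sr[ir][start:end].

        The toy returns the whole string; an implementation meant for
        chromosome-sized sequences would want to avoid that.
        """
        return self._seqs[self._resolve(ir)]

    def translate_identifier(self, ir: str) -> List[str]:
        """Return every identifier stored for the same sequence."""
        seq_id = self._resolve(ir)
        return sorted(k for k, v in self._aliases.items() if v == seq_id)


if __name__ == "__main__":
    sr = ToySeqStore()
    sr.store("ACGTACGTAC", ["demo:seq1", "other:s1"])
    print(sr.fetch("demo:seq1", 2, 6))  # GTAC
    print(sr["seq1"][2:6])              # GTAC (bare alias is unique here)
```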
Fig 2. Examples of the native Python interface.
SeqRepo retrieves sequences and metadata using conventional identifiers (i.e., from NCBI, Ensembl, GRCh, LRG, and other sources) and from digest identifiers (i.e., sha512t24u, ga4gh, md5, SEGUID). Identifiers are namespaced, and generally written as "
Fig 3. Examples of the SeqRepo REST API.
See the SeqRepo repository for installation instructions and S2 Code for example details.
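The figure's examples are not reproduced here. As a rough sketch based only on the endpoint patterns in the operations table (/v1/sequence/:ir with optional "start" and "end" query parameters, and /v1/metadata/:ir), request URLs could be assembled as follows. The base URL, the helper names, and the identifier `demo:seq1` are assumptions for illustration; check the SeqRepo repository for the address and paths of an actual deployment.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Hypothetical address of a locally running SeqRepo REST service.
BASE_URL = "http://localhost:5000"


def sequence_url(ir: str, start: Optional[int] = None,
                 end: Optional[int] = None, base_url: str = BASE_URL) -> str:
    """Build a /v1/sequence/:ir URL; start and end are optional."""
    params = {k: v for k, v in (("start", start), ("end", end))
              if v is not None}
    # Keep ':' unescaped so namespace:alias stays readable in the path.
    url = f"{base_url}/v1/sequence/{quote(ir, safe=':')}"
    return f"{url}?{urlencode(params)}" if params else url


def metadata_url(ir: str, base_url: str = BASE_URL) -> str:
    """Build a /v1/metadata/:ir URL."""
    return f"{base_url}/v1/metadata/{quote(ir, safe=':')}"


if __name__ == "__main__":
    print(sequence_url("demo:seq1", start=10, end=20))
    print(metadata_url("demo:seq1"))
```

The resulting strings can be passed to any HTTP client, which is what makes the REST interface usable from languages other than Python.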
Timing results for remote and local sequence sources.
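| | NCBI E-Utilities | ENA refget | SeqRepo Python API | SeqRepo REST API |
|---|---|---|---|---|
| Elapsed time (1,000 lookups) | 892 | 245 | 0.663 | 1.16 |
| Throughput (1,000 ÷ elapsed time) | 1.12 | 4.08 | 1508 | 858 |
| Relative throughput (NCBI E-Utilities ≡ 1) | ≡1 | 3.64 | 1346 | 766 |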
Timings are for 1,000 sequence lookups from (1) NCBI nucleotide sequences using the E-utilities interface, (2) the European Nucleotide Archive using the refget protocol, (3) a local SeqRepo using the native Python interface, and (4) a local SeqRepo accessed through the SeqRepo REST API. Local SeqRepo access offers the best performance; using the SeqRepo REST API adds overhead, but enables access from other programming languages. Details are provided in S3 Code.
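The benchmark code itself is in S3 Code and is not reproduced here. A comparison of this kind can be sketched with a small harness that times a batch of lookups against any fetch function and reports throughput. The function names below are hypothetical, and a plain dictionary stands in for a real sequence source.

```python
import time
from typing import Callable, Dict, Iterable


def time_lookups(fetch: Callable[[str], str],
                 identifiers: Iterable[str]) -> Dict[str, float]:
    """Time one pass of fetch() over the identifiers."""
    ids = list(identifiers)
    started = time.perf_counter()
    for ir in ids:
        fetch(ir)
    elapsed = time.perf_counter() - started
    return {
        "n": len(ids),
        "elapsed_s": elapsed,
        # Guard against a zero reading on very coarse clocks.
        "lookups_per_s": len(ids) / elapsed if elapsed > 0 else float("inf"),
    }


def relative_throughput(stats: Dict[str, float],
                        baseline: Dict[str, float]) -> float:
    """Throughput of one source relative to a baseline source."""
    return stats["lookups_per_s"] / baseline["lookups_per_s"]


if __name__ == "__main__":
    # Stand-in data source: a dict of made-up identifiers and sequences.
    fake_db = {f"demo:seq{i}": "ACGT" * 10 for i in range(1000)}
    stats = time_lookups(fake_db.__getitem__, fake_db.keys())
    print(f"{stats['n']} lookups in {stats['elapsed_s']:.6f} s")
```

For remote sources, network conditions dominate the result, so repeated runs would be needed before reading much into any single measurement.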