Andrew D Yates1,2, Jeremy Adams3,2, Somesh Chaturvedi1,4, Robert M Davies2,5, Matthew Laird1, Rasko Leinonen1, Rishi Nag1,2, Nathan C Sheffield6, Oliver Hofmann2,7, Thomas M Keane1,2. 1. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK. 2. Global Alliance for Genomics and Health. 3. Ontario Institute for Cancer Research, Toronto, ON, CA. 4. Google Summer of Code. 5. Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, UK. 6. Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA, 22903, USA. 7. University of Melbourne Centre for Cancer Research, University of Melbourne, Melbourne, Victoria, Australia.
Abstract
MOTIVATION: Reference sequences are essential in creating a baseline of knowledge for many common bioinformatics methods, especially those using genomic sequencing. RESULTS: We have created refget, a Global Alliance for Genomics and Health API specification to access reference sequences and sub-sequences using an identifier derived from the sequence itself. We present four reference implementations across in-house and cloud infrastructure, a compliance suite and a web report used to ensure specification conformity across implementations. AVAILABILITY: The Refget specification can be found at: https://w3id.org/ga4gh/refget. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
Reference genome sequences are central to genomic interpretation and to defining the baseline of knowledge on which our understanding of biological systems, phenotypes and variation is based. The ability to interpret such data is essential for the delivery of genomics into the clinic. As precision medicine becomes mainstream in healthcare systems, organizations such as the Global Alliance for Genomics and Health (GA4GH) are developing interoperable standards to ensure the discovery and provenance of baseline knowledge.
Reference sequences suffer from two issues: sequence identity and non-standardized access. Genomic analysis, such as read alignment, typically uses a FASTA-formatted collection of sequences downloaded from a provider, and providers name sequences inconsistently, e.g. INSDC (Karsch-Mizrachi), Ensembl (Yates) or UCSC (Lee). For example, chromosome 1 from GRCh38 (GCA_000001405.15) is known as chr1 from hg38 (UCSC), 1 from GRCh38 (Ensembl) or CM000663.2 (INSDC). When it is critical to identify an underlying reference sequence unambiguously, it is better to use an identifier derived from the sequence itself, such as a cryptographic checksum digest. This method is employed in the CRAM format (Hsi-Yang Fritz), which uses an MD5 digest to identify the correct reference during read reconstitution. The European Nucleotide Archive (ENA) developed the CRAM reference registry (CRR) to retrieve reference sequences by their MD5 sequence checksum. Similar ideas have been employed by tximeta to aid reproducible RNA-seq analysis (Love).
This manuscript describes a new application programming interface (API), called refget, which enables retrieval of full-length sequences or sub-sequences via a checksum identifier, returns metadata associated with an identifier and maintains compatibility with the CRR. Our API operates over HTTP(S) and so is accessible from all mainstream programming languages. We also present four implementations of the refget specification deployed across in-house and cloud infrastructures, and a toolkit to assess implementation compliance.
2 Results
In the refget protocol, a client provides a supported digest identifier, optionally with a linear or circular genomic coordinate range specified as URL parameters or a Range header, via an HTTP(S) GET request. An implementation responds with an unbroken stream of sequence characters. Users may request a metadata JSON document, which provides information about the sequence length, topology, known digests and any other known aliases. Finally, clients can request a JSON document of server capabilities, allowing them to adapt to possible limitations of an implementation. Implementations are not restricted to serving a single type of reference sequence and can provide DNA, mRNA, cDNA, CDS or peptide sequences. An implementation that wishes to provide a CRAM reference registry (CRR) compatible deployment must mirror reference sequences as found in ENA.
Refget defines three supported identifier algorithms: MD5, TRUNC512 and GA4GH Identifier. All three algorithms normalize sequences by stripping all whitespace characters and restricting to characters in the range A-Z. We chose this as a compromise between the methods and requirements employed by CRAM, ENA and the Variation Representation Specification (VRS). MD5 is supported to maintain compatibility with the CRAM format. However, hash collisions are a known weakness of MD5; to mitigate this concern, we have used the sha512t24u identifier scheme (Hart and Prlić, 2020) (also known as the GA4GH identifier) as employed by the VRS standard. In addition, we created a parallel format called TRUNC512, which represents sha512t24 as a hex string to maintain a similar representation to MD5. These schemes are described in Figure 1. Of the two SHA-512-based forms, sha512t24u is the preferred representation due to its use in VRS. We tested sha512t24u against the MGnify (Mitchell) May 2019 protein database of ∼1 billion entries and found no collisions (see Supplementary Material).
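The three identifier schemes described above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the reference implementation: the function names are hypothetical, upper-casing the input before checking for A-Z is an assumption of the sketch, and the `ga4gh:SQ.` prefix follows the GA4GH identifier convention rather than anything stated in the text above.

```python
import base64
import hashlib
import re


def normalize(sequence: str) -> str:
    """Strip all whitespace, then require that only A-Z remains."""
    # Assumption: input is upper-cased before the A-Z check.
    cleaned = "".join(sequence.split()).upper()
    if not re.fullmatch(r"[A-Z]*", cleaned):
        raise ValueError("sequence contains characters outside A-Z")
    return cleaned


def md5_id(sequence: str) -> str:
    """MD5 identifier: hex digest of the normalized sequence."""
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


def _sha512_truncated(sequence: str) -> bytes:
    """First 24 bytes of the SHA-512 digest of the normalized sequence."""
    return hashlib.sha512(normalize(sequence).encode("ascii")).digest()[:24]


def trunc512_id(sequence: str) -> str:
    """TRUNC512 identifier: the truncated digest as a 48-character hex string."""
    return _sha512_truncated(sequence).hex()


def ga4gh_id(sequence: str) -> str:
    """GA4GH identifier: the truncated digest, base64url-encoded (sha512t24u)."""
    encoded = base64.urlsafe_b64encode(_sha512_truncated(sequence)).decode("ascii")
    # Assumption: the "ga4gh:SQ." prefix marks a sequence identifier.
    return "ga4gh:SQ." + encoded
```

Because TRUNC512 and the GA4GH identifier encode the same 24 bytes, one can be converted to the other without access to the sequence, whereas the MD5 identifier can only be computed from the sequence itself.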
To retrieve a reference sequence, a client constructs a URL such as https://www.ebi.ac.uk/ena/cram/sequence/3332ed720ac7eaa9b3655c06f6b9e196, sets the acceptable media type to text/plain and uses an HTTP library such as Python’s requests package to issue the request (see Supplementary Material for additional examples).
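As a rough sketch of the request described above, the following Python builds (but does not send) a GET request for a sequence or sub-sequence using the standard library. The function name is hypothetical. The `start` and `end` parameter names, their 0-based end-exclusive interpretation and the translation to an inclusive `bytes=` Range header are assumptions drawn from the refget specification and HTTP range semantics, not from the text above.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_sequence_request(
    base_url: str,
    digest: str,
    start: Optional[int] = None,
    end: Optional[int] = None,
    use_range_header: bool = False,
) -> Request:
    """Build a GET request for a sequence, optionally limited to a range."""
    url = f"{base_url.rstrip('/')}/sequence/{digest}"
    headers = {"Accept": "text/plain"}
    if use_range_header:
        if start is None or end is None:
            raise ValueError("a Range header needs both start and end")
        # HTTP byte ranges are inclusive, so end-exclusive `end` becomes end - 1.
        headers["Range"] = f"bytes={start}-{end - 1}"
    else:
        params = {k: v for k, v in (("start", start), ("end", end)) if v is not None}
        if params:
            url += "?" + urlencode(params)
    return Request(url, headers=headers, method="GET")


# Example: the first ten characters of the sequence named in the text.
request = build_sequence_request(
    "https://www.ebi.ac.uk/ena/cram",
    "3332ed720ac7eaa9b3655c06f6b9e196",
    start=0,
    end=10,
)
# Sending it would look like: urllib.request.urlopen(request).read().decode("ascii")
```

Keeping request construction separate from sending makes the URL and headers easy to inspect, and the same object can be passed to `urllib.request.urlopen` when a live server is available.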
Fig. 1.
Summary of the sequence normalization and algorithm used to generate checksum identifiers for TRUNC512 and GA4GH Identifier. All methods move through the same normalization process but differ in their choice of checksum algorithm (MD5 versus SHA-512)
3 Implementation and compliance
Four implementations exist across a diverse range of providers including ENA, Amazon Web Services (AWS) and Heroku (see Supplementary Materials). We developed refget compliance documentation (https://compliancedoc.readthedocs.io/) and a compliance library suite (https://pypi.org/project/refget-compliance/) to ensure implementation compatibility. The compliance toolkit requires an implementation to host three sequences: Enterobacteria phage phiX174 sensu lato (NC_001422.1) and Saccharomyces cerevisiae S288C chromosomes I (BK006935.2) and IV (BK006938.2). Tests that an implementation cannot pass by design are skipped, e.g. we do not test circular sequence retrieval if a server declares that it does not support circular sequences. Tests are run daily against all known implementations and a report is published at https://w3id.org/ga4gh/refget/compliance.
We have also implemented a local Python interface in the refget Python package, hosted at PyPI (https://pypi.org/project/refget/). This package provides a local implementation of the refget protocol with SQLite or MongoDB back-ends, and can connect to a remote API to provide local caching of retrieved results, improving performance for applications that require repeated lookups. It also provides Python functions to compute refget identifiers from raw sequences or FASTA files. Use of this package is shown below.
    import refget
    srv = "..."  # base URL of a refget server
    url = srv + "ga4gh/refget/reference/sequence/"
    # Create a client bound to the server's sequence endpoint
    rgc = refget.RefGetClient(url)
    # Retrieve the sub-sequence from position 0 to 10 for the given MD5 identifier
    rgc.refget("6681ac2f62509cfc220d78751b8dc524", start=0, end=10)
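The local caching idea described in this section can be illustrated with a small standard-library sketch. It is not the refget package's implementation: the class name, table layout and the `fetch` callable (standing in for a call to a remote refget server) are all hypothetical.

```python
import sqlite3
from typing import Callable, Optional


class CachedSequenceStore:
    """Serve sequences from a local SQLite cache, fetching each digest once."""

    def __init__(self, fetch: Callable[[str], str], db_path: str = ":memory:"):
        # `fetch` is a hypothetical callable that returns the full sequence
        # for a digest, e.g. by querying a remote refget server.
        self._fetch = fetch
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS sequences ("
            "digest TEXT PRIMARY KEY, sequence TEXT NOT NULL)"
        )

    def get(self, digest: str, start: Optional[int] = None,
            end: Optional[int] = None) -> str:
        """Return the sequence (or the start:end slice) for a digest."""
        row = self._db.execute(
            "SELECT sequence FROM sequences WHERE digest = ?", (digest,)
        ).fetchone()
        if row is None:
            # Cache miss: fetch remotely once and store the result locally.
            sequence = self._fetch(digest)
            self._db.execute(
                "INSERT INTO sequences (digest, sequence) VALUES (?, ?)",
                (digest, sequence),
            )
            self._db.commit()
        else:
            sequence = row[0]
        return sequence[start:end]
```

Because the identifier is derived from the sequence, a cached entry never becomes stale: a given digest always refers to the same sequence, so repeated lookups can safely be answered locally.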
4 Discussion
Reference sequences are fundamental for providing a stable method of describing genomic variation and annotation. Refget formalizes a method for generating identifiers from reference sequences and specifies an API to retrieve sequences, sub-sequences and metadata. The specification is easy to implement and comes with a mechanism to assert specification compliance. Refget can host any type of reference sequence, allows deployments to implement subsets of its functionality and provides a mechanism for deployments to declare this programmatically. Future work includes defining a reference sequence collection, e.g. a genome, using checksums and sequence metadata, and providing a way to convert known reference sequence names to refget identifiers.
Authors: Christopher M Lee; Galt P Barber; Jonathan Casper; Hiram Clawson; Mark Diekhans; Jairo Navarro Gonzalez; Angie S Hinrichs; Brian T Lee; Luis R Nassar; Conner C Powell; Brian J Raney; Kate R Rosenbloom; Daniel Schmelter; Matthew L Speir; Ann S Zweig; David Haussler; Maximilian Haeussler; Robert M Kuhn; W James Kent Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971
Authors: Andrew D Yates; Premanand Achuthan; Wasiu Akanni; James Allen; Jamie Allen; Jorge Alvarez-Jarreta; M Ridwan Amode; Irina M Armean; Andrey G Azov; Ruth Bennett; Jyothish Bhai; Konstantinos Billis; Sanjay Boddu; José Carlos Marugán; Carla Cummins; Claire Davidson; Kamalkumar Dodiya; Reham Fatima; Astrid Gall; Carlos Garcia Giron; Laurent Gil; Tiago Grego; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Mike Kay; Ilias Lavidas; Tuan Le; Diana Lemos; Jose Gonzalez Martinez; Thomas Maurel; Mark McDowall; Aoife McMahon; Shamika Mohanan; Benjamin Moore; Michael Nuhn; Denye N Oheh; Anne Parker; Andrew Parton; Mateus Patricio; Manoj Pandian Sakthivel; Ahamed Imran Abdul Salam; Bianca M Schmitt; Helen Schuilenburg; Dan Sheppard; Mira Sycheva; Marek Szuba; Kieron Taylor; Anja Thormann; Glen Threadgold; Alessandro Vullo; Brandon Walts; Andrea Winterbottom; Amonida Zadissa; Marc Chakiachvili; Bethany Flint; Adam Frankish; Sarah E Hunt; Garth IIsley; Myrto Kostadima; Nick Langridge; Jane E Loveland; Fergal J Martin; Joannella Morales; Jonathan M Mudge; Matthieu Muffato; Emily Perry; Magali Ruffier; Stephen J Trevanion; Fiona Cunningham; Kevin L Howe; Daniel R Zerbino; Paul Flicek Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971
Authors: Alex L Mitchell; Maxim Scheremetjew; Hubert Denise; Simon Potter; Aleksandra Tarkowska; Matloob Qureshi; Gustavo A Salazar; Sebastien Pesseat; Miguel A Boland; Fiona M I Hunter; Petra Ten Hoopen; Blaise Alako; Clara Amid; Darren J Wilkinson; Thomas P Curtis; Guy Cochrane; Robert D Finn Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971
Authors: Michael I Love; Charlotte Soneson; Peter F Hickey; Lisa K Johnson; N Tessa Pierce; Lori Shepherd; Martin Morgan; Rob Patro Journal: PLoS Comput Biol Date: 2020-02-25 Impact factor: 4.475
Authors: Timothe Cezard; Fiona Cunningham; Sarah E Hunt; Baron Koylass; Nitin Kumar; Gary Saunders; April Shen; Andres F Silva; Kirill Tsukanov; Sundararaman Venkataraman; Paul Flicek; Helen Parkinson; Thomas M Keane Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971
Authors: Heidi L Rehm; Angela J H Page; Lindsay Smith; Jeremy B Adams; Gil Alterovitz; Lawrence J Babb; Maxmillian P Barkley; Michael Baudis; Michael J S Beauvais; Tim Beck; Jacques S Beckmann; Sergi Beltran; David Bernick; Alexander Bernier; James K Bonfield; Tiffany F Boughtwood; Guillaume Bourque; Sarion R Bowers; Anthony J Brookes; Michael Brudno; Matthew H Brush; David Bujold; Tony Burdett; Orion J Buske; Moran N Cabili; Daniel L Cameron; Robert J Carroll; Esmeralda Casas-Silva; Debyani Chakravarty; Bimal P Chaudhari; Shu Hui Chen; J Michael Cherry; Justina Chung; Melissa Cline; Hayley L Clissold; Robert M Cook-Deegan; Mélanie Courtot; Fiona Cunningham; Miro Cupak; Robert M Davies; Danielle Denisko; Megan J Doerr; Lena I Dolman; Edward S Dove; L Jonathan Dursi; Stephanie O M Dyke; James A Eddy; Karen Eilbeck; Kyle P Ellrott; Susan Fairley; Khalid A Fakhro; Helen V Firth; Michael S Fitzsimons; Marc Fiume; Paul Flicek; Ian M Fore; Mallory A Freeberg; Robert R Freimuth; Lauren A Fromont; Jonathan Fuerth; Clara L Gaff; Weiniu Gan; Elena M Ghanaim; David Glazer; Robert C Green; Malachi Griffith; Obi L Griffith; Robert L Grossman; Tudor Groza; Jaime M Guidry Auvil; Roderic Guigó; Dipayan Gupta; Melissa A Haendel; Ada Hamosh; David P Hansen; Reece K Hart; Dean Mitchell Hartley; David Haussler; Rachele M Hendricks-Sturrup; Calvin W L Ho; Ashley E Hobb; Michael M Hoffman; Oliver M Hofmann; Petr Holub; Jacob Shujui Hsu; Jean-Pierre Hubaux; Sarah E Hunt; Ammar Husami; Julius O Jacobsen; Saumya S Jamuar; Elizabeth L Janes; Francis Jeanson; Aina Jené; Amber L Johns; Yann Joly; Steven J M Jones; Alexander Kanitz; Kazuto Kato; Thomas M Keane; Kristina Kekesi-Lafrance; Jerome Kelleher; Giselle Kerry; Seik-Soon Khor; Bartha M Knoppers; Melissa A Konopko; Kenjiro Kosaki; Martin Kuba; Jonathan Lawson; Rasko Leinonen; Stephanie Li; Michael F Lin; Mikael Linden; Xianglin Liu; Isuru Udara Liyanage; Javier Lopez; Anneke M Lucassen; Michael Lukowski; Alice L Mann; John Marshall; 
Michele Mattioni; Alejandro Metke-Jimenez; Anna Middleton; Richard J Milne; Fruzsina Molnár-Gábor; Nicola Mulder; Monica C Munoz-Torres; Rishi Nag; Hidewaki Nakagawa; Jamal Nasir; Arcadi Navarro; Tristan H Nelson; Ania Niewielska; Amy Nisselle; Jeffrey Niu; Tommi H Nyrönen; Brian D O'Connor; Sabine Oesterle; Soichi Ogishima; Vivian Ota Wang; Laura A D Paglione; Emilio Palumbo; Helen E Parkinson; Anthony A Philippakis; Angel D Pizarro; Andreas Prlic; Jordi Rambla; Augusto Rendon; Renee A Rider; Peter N Robinson; Kurt W Rodarmer; Laura Lyman Rodriguez; Alan F Rubin; Manuel Rueda; Gregory A Rushton; Rosalyn S Ryan; Gary I Saunders; Helen Schuilenburg; Torsten Schwede; Serena Scollen; Alexander Senf; Nathan C Sheffield; Neerjah Skantharajah; Albert V Smith; Heidi J Sofia; Dylan Spalding; Amanda B Spurdle; Zornitza Stark; Lincoln D Stein; Makoto Suematsu; Patrick Tan; Jonathan A Tedds; Alastair A Thomson; Adrian Thorogood; Timothy L Tickle; Katsushi Tokunaga; Juha Törnroos; David Torrents; Sean Upchurch; Alfonso Valencia; Roman Valls Guimera; Jessica Vamathevan; Susheel Varma; Danya F Vears; Coby Viner; Craig Voisin; Alex H Wagner; Susan E Wallace; Brian P Walsh; Marc S Williams; Eva C Winkler; Barbara J Wold; Grant M Wood; J Patrick Woolley; Chisato Yamasaki; Andrew D Yates; Christina K Yung; Lyndon J Zass; Ksenia Zaytseva; Junjun Zhang; Peter Goodhand; Kathryn North; Ewan Birney Journal: Cell Genom Date: 2021-11-10
Authors: Fiona Cunningham; James E Allen; Jamie Allen; Jorge Alvarez-Jarreta; M Ridwan Amode; Irina M Armean; Olanrewaju Austine-Orimoloye; Andrey G Azov; If Barnes; Ruth Bennett; Andrew Berry; Jyothish Bhai; Alexandra Bignell; Konstantinos Billis; Sanjay Boddu; Lucy Brooks; Mehrnaz Charkhchi; Carla Cummins; Luca Da Rin Fioretto; Claire Davidson; Kamalkumar Dodiya; Sarah Donaldson; Bilal El Houdaigui; Tamara El Naboulsi; Reham Fatima; Carlos Garcia Giron; Thiago Genez; Jose Gonzalez Martinez; Cristina Guijarro-Clarke; Arthur Gymer; Matthew Hardy; Zoe Hollis; Thibaut Hourlier; Toby Hunt; Thomas Juettemann; Vinay Kaikala; Mike Kay; Ilias Lavidas; Tuan Le; Diana Lemos; José Carlos Marugán; Shamika Mohanan; Aleena Mushtaq; Marc Naven; Denye N Ogeh; Anne Parker; Andrew Parton; Malcolm Perry; Ivana Piližota; Irina Prosovetskaia; Manoj Pandian Sakthivel; Ahamed Imran Abdul Salam; Bianca M Schmitt; Helen Schuilenburg; Dan Sheppard; José G Pérez-Silva; William Stark; Emily Steed; Kyösti Sutinen; Ranjit Sukumaran; Dulika Sumathipala; Marie-Marthe Suner; Michal Szpak; Anja Thormann; Francesca Floriana Tricomi; David Urbina-Gómez; Andres Veidenberg; Thomas A Walsh; Brandon Walts; Natalie Willhoft; Andrea Winterbottom; Elizabeth Wass; Marc Chakiachvili; Bethany Flint; Adam Frankish; Stefano Giorgetti; Leanne Haggerty; Sarah E Hunt; Garth R IIsley; Jane E Loveland; Fergal J Martin; Benjamin Moore; Jonathan M Mudge; Matthieu Muffato; Emily Perry; Magali Ruffier; John Tate; David Thybert; Stephen J Trevanion; Sarah Dyer; Peter W Harrison; Kevin L Howe; Andrew D Yates; Daniel R Zerbino; Paul Flicek Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971
Authors: Alex H Wagner; Lawrence Babb; Gil Alterovitz; Michael Baudis; Matthew Brush; Daniel L Cameron; Melissa Cline; Malachi Griffith; Obi L Griffith; Sarah E Hunt; David Kreda; Jennifer M Lee; Stephanie Li; Javier Lopez; Eric Moyer; Tristan Nelson; Ronak Y Patel; Kevin Riehle; Peter N Robinson; Shawn Rynearson; Helen Schuilenburg; Kirill Tsukanov; Brian Walsh; Melissa Konopko; Heidi L Rehm; Andrew D Yates; Robert R Freimuth; Reece K Hart Journal: Cell Genom Date: 2021-11-10
Authors: Nathan C Sheffield; Vivien R Bonazzi; Philip E Bourne; Tony Burdett; Timothy Clark; Robert L Grossman; Ola Spjuth; Andrew D Yates Journal: Sci Data Date: 2022-09-08 Impact factor: 8.501