Fabrice P A David1,2, Jacques Rougemont3,2, Bart Deplancke4,5. 1. Bioinformatics and Biostatistics Core Facility, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland. 2. Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland. 3. Bioinformatics and Biostatistics Core Facility, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland bart.deplancke@epfl.ch. 4. Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland Jacques.rougemont@epfl.ch. 5. Laboratory of Systems Biology and Genetics, Institute of Bio-engineering, School of Life Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland.
Abstract
GETPrime (http://bbcftools.epfl.ch/getprime) is a database with a web frontend providing gene- and transcript-specific, pre-computed qPCR primer pairs. The primers have been optimized for genome-wide specificity and for allowing the selective amplification of one or several splice variants of most known genes. To ease selection, primers have also been ranked according to defined criteria such as genome-wide specificity (with BLAST), amplicon size, and isoform coverage. Here, we report a major upgrade (2.0) of the database: eight new species (yeast, chicken, macaque, chimpanzee, rat, platypus, pufferfish, and Anolis carolinensis) now complement the five already included in the previous version (human, mouse, zebrafish, fly, and worm). Furthermore, the genomic reference has been updated to Ensembl v81 (while keeping earlier versions for backward compatibility) as a result of re-designing the back-end database and automating the import of relevant sections of the Ensembl database in species-independent fashion. This also allowed us to map known polymorphisms to the primers (on average three per primer for human), with the aim of reducing experimental error when targeting specific strains or individuals. Another consequence is that the inclusion of future Ensembl releases and other species has now become a relatively straightforward task.
Genome-scale experiments have accumulated massive amounts of information over recent years and have greatly contributed to our understanding of gene expression and its regulatory mechanisms. These experiments have clearly revealed the ubiquitous nature of alternative splicing and isoform dosage effects (1,2). It is therefore key to perform precise, quantitative measurements of selected genes and transcripts to assess specific expression patterns or functions. Such experiments typically involve the quantitative real-time polymerase chain reaction (qPCR), and the value of these qPCR assays depends in large part on the quality of the primer pair selected for the targeted transcription unit (3).
We have therefore undertaken the systematic design of primer pairs for every known gene and transcript in organisms with well-annotated genome references, with in silico verification of optimal specificity. The design of these primer pairs follows the pipeline described in (4), which we briefly recall here. To design gene- or transcript-specific primer pairs, the exon junctions that are included in, respectively, the largest or smallest number of isoforms of each gene are first identified. The corresponding transcript is then processed with PerlPrimer (5) to find the best primer set that overlaps these junctions. Candidate primers are filtered according to (i) genome-wide specificity (running BLAST with an E-value of 100) and (ii) not spanning 5′ or 3′ untranslated regions (UTR), and are then ranked according to the number of isoforms they cover, amplicon length, and other primer quality parameters that were previously discussed (3,4). The top three primer pairs are retained and displayed in the database with a star-based quality flag corresponding to their rank in this list.
If no pair passes the filters, then the original primer design constraints are progressively relaxed until a candidate pair emerges, hence the warnings associated with some primers (the ‘warnings’ column that can be observed in Figure 1).
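The filter-then-rank step recalled above can be illustrated with a minimal Python sketch. This is not the GETPrime pipeline itself: the `Candidate` fields, the rule that a pair passes only with zero off-target BLAST hits, and the preference for shorter amplicons as a tie-breaker are all assumptions made for illustration, and the "other primer quality parameters" are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate primer pair."""
    pair_id: str
    isoforms_covered: int  # isoforms of the gene amplified by this pair
    amplicon_length: int   # in base pairs
    off_target_hits: int   # BLAST hits outside the intended target (assumed metric)
    spans_utr: bool        # True if either primer lies in a 5' or 3' UTR


def select_top_pairs(candidates, gene_specific=True, keep=3):
    """Filter candidates, rank the survivors, and return (rank, candidate) tuples."""
    # Filters (i) and (ii): genome-wide specificity and no UTR overlap.
    passing = [c for c in candidates
               if c.off_target_hits == 0 and not c.spans_utr]
    # Gene-specific pairs should cover as many isoforms as possible,
    # transcript-specific pairs as few as possible.
    sign = -1 if gene_specific else 1
    ranked = sorted(
        passing,
        key=lambda c: (sign * c.isoforms_covered, c.amplicon_length, c.pair_id),
    )
    # Rank 1 is the best pair; the rank would map to the star-based flag.
    return list(enumerate(ranked[:keep], start=1))
```

If `select_top_pairs` returns an empty list, the progressive relaxation of the design constraints described above would be the next step; that loop is not modeled here.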
Figure 1.
The GETPrime 2.0 search interface and tabular display. The figure shows several of the 30 primer pairs found for human gene MDM1. Results can be downloaded in tab-separated format through the ‘Download’ link. The search is restricted to an organism, an Ensembl release, and a maximum number of lines (the smaller the number, the faster the query). Each result line corresponds to a single primer pair and displays its unique ID, the gene and transcript(s) it targets, its star-based rank (among the best three pairs found for the gene), the fraction of isoforms it covers, the amplicon length, the primer sequences and their respective melting temperatures, and the Ensembl annotation for the gene (KNOWN or NOVEL). The last two columns provide, respectively, warnings if the primer search did not work with standard parameters and a link to the primer pair-specific page shown in Figure 3.
Figure 3.
The GETPrime 2.0 primer details page. All information about one particular primer pair is summarized in this page: gene and transcript IDs, GETPrime warnings, and detailed information about each forward and reverse primer. Particularly relevant are the indication of SNP positions (in red) and whether a primer spans an intron as well as the UCSC display link.
Since its inception in 2011, the database has been used continuously and access statistics show a large user base. For example, the GETPrime web interface received nearly 1800 visits (by 1000 users) over the first 6 months of 2016 alone. Individual users also provided constructive feedback to further improve GETPrime, which in large part prompted the major update of the database (2.0) that is presented here.
Data integration
GETPrime 2.0 cross-references a number of data sources to document gene structures, transcript sequences, genome sequences, and annotated variants. The database now incorporates data from three versions of Ensembl (6): 50 (July 2008), 61 (February 2011), and 81 (July 2015). The earlier versions are kept for backward compatibility with the first release of GETPrime, and updates will be performed on a regular basis. Relevant data from Ensembl are automatically imported into our PostgreSQL database (https://www.postgresql.org). Thanks to the uniform structure of the Ensembl database across species, we can now easily add species, and we currently host yeast, chicken, macaque, chimpanzee, rat, platypus, pufferfish, and Anolis carolinensis next to the previously established primer pairs for human, mouse, zebrafish, fly, and worm. Compared to version 1.0 (4), the database schema has been re-designed to improve the speed of queries via the web user interface and to provide two new interaction modes: a batch download capability and a programmatic interface (RESTful API).
User interface
The user interface of GETPrime 2.0 has been re-designed to make it faster, friendlier, and richer. It is based on a new 3-tier Ruby on Rails (RoR) (http://rubyonrails.org) application. Among many other features, this framework improves the efficiency of database queries and simplifies the rendering of web pages. It also implements a RESTful API that allows programmers to access the data directly (see documentation at http://bbcftools.epfl.ch/getprime/api_documentation). A new search engine allows searching by gene name, Ensembl gene ID, transcript ID, or directly by the internal primer pair ID (Figure 1). The search box accepts up to 10 identifiers per search. When only one identifier is provided and does not match perfectly, a regular expression search is performed. This search tool uses the jQuery (mostly the Ajax method) and datatables.js JavaScript libraries. Ajax is used to update portions of the web pages following user selections without reloading the whole page, which improves the responsiveness and flexibility of the display.
Primers are linked to a view in the UCSC genome browser (7) where they are displayed in their genomic context. In the UCSC view, primer pairs are identified by a unique numeric ID, by the gene and transcript they target, and by their rank in the list of candidates (Figure 2). This UCSC display is generated by uploading a single custom track (as a BED file) generated for each organism and Ensembl version. The BED file can be downloaded directly, as can the full database as TAB-separated files. Each primer pair is clickable and linked back to the GETPrime website, and more specifically to the page containing details about the primer. This page contains more information than in the previous version of GETPrime. For example, next to the position of the primer sequences in the genome, the position and the length of the introns are reported when applicable.
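The search behavior described above (at most 10 identifiers, four identifier types, and a regular expression fallback for a single non-matching term) can be sketched as follows. This is only an illustration in Python, not the Ruby on Rails code behind GETPrime: the function names are hypothetical, and the ID patterns are an assumption that covers Ensembl-style vertebrate identifiers such as ENSG00000111554 but not the native gene IDs of fly, worm, or yeast.

```python
import re

MAX_IDENTIFIERS = 10  # limit stated for the search box

# Assumed patterns (hypothetical, vertebrate Ensembl stable IDs only).
GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}$")
TRANSCRIPT_ID = re.compile(r"^ENS[A-Z]*T\d{11}$")
PRIMER_PAIR_ID = re.compile(r"^\d+$")


def classify(identifier):
    """Guess which kind of identifier the user typed."""
    if GENE_ID.match(identifier):
        return "gene_id"
    if TRANSCRIPT_ID.match(identifier):
        return "transcript_id"
    if PRIMER_PAIR_ID.match(identifier):
        return "primer_pair_id"
    return "gene_name"


def parse_query(text):
    """Split the search box content and tag each identifier with its type."""
    tokens = [t for t in re.split(r"[\s,;]+", text.strip()) if t]
    if len(tokens) > MAX_IDENTIFIERS:
        raise ValueError(f"at most {MAX_IDENTIFIERS} identifiers per search")
    return [(t, classify(t)) for t in tokens]


def search_gene_names(query, gene_names):
    """Exact match first; otherwise treat the single query as a regular expression."""
    exact = [name for name in gene_names if name == query]
    if exact:
        return exact
    pattern = re.compile(query, re.IGNORECASE)
    return [name for name in gene_names if pattern.search(name)]
```

For instance, `search_gene_names("MDM", ["MDM1", "MDM2", "TP53"])` finds no exact match and falls back to the pattern search, which returns both MDM genes.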
Figure 2.
The UCSC view of GETPrime 2.0 primer pairs. The two primers (in black) of each pair are displayed as thick bars connected by thin arrows revealing on which strand the pair of primers will amplify DNA. They are also mapped to their genomic coordinates, including the intron(s) that each primer potentially spans. In this example, six primer pairs are displayed. For the first three, both forward and reverse primers span an intron, whereas for the three other pairs, only the reverse primer spans an intron. Note that the format of the displayed identifier is the following: GETPrimeID|Ensembl-gene-ID_GETPrime-rank (e.g. 2111376|ENSG00000111554_3) and that the other primer pairs for MDM1 are not visible within this screenshot.
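To make the custom track display concrete, the following Python sketch writes one primer pair as a 12-column BED line, with each genomic segment of the two primers as a block and the identifier in the GETPrimeID|Ensembl-gene-ID_GETPrime-rank format. How GETPrime actually builds its BED file is not described here, so the function name, the choice of BED12, and the coordinates in the usage example are assumptions made for illustration.

```python
def bed12_line(chrom, strand, segments, pair_id, gene_id, rank):
    """Format one primer pair as a BED12 line.

    `segments` lists the genomic pieces of both primers as 0-based,
    half-open (start, end) intervals; a primer spanning an intron
    contributes two pieces.
    """
    blocks = sorted(segments)
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        if next_start < prev_end:
            raise ValueError("primer segments must not overlap")
    start, end = blocks[0][0], blocks[-1][1]
    name = f"{pair_id}|{gene_id}_{rank}"
    block_sizes = ",".join(str(e - s) for s, e in blocks)
    block_starts = ",".join(str(s - start) for s, _ in blocks)  # relative to start
    fields = [
        chrom, start, end, name,
        0,            # score (unused)
        strand,       # drives the direction of the connecting arrows
        start, end,   # thickStart/thickEnd: draw every block as a thick bar
        "0,0,0",      # itemRgb: black
        len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(f) for f in fields)


# Hypothetical coordinates: the forward primer is contiguous, the reverse
# primer spans an intron and is therefore split into two blocks.
line = bed12_line("chr12", "-", [(1000, 1020), (1500, 1508), (1700, 1712)],
                  2111376, "ENSG00000111554", 3)
```

In a genome browser, the thin line between blocks carries the strand arrows, which matches the thick-bar and thin-arrow rendering described in the caption.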
Sequence polymorphisms
Our knowledge of genomic variation within species and how such variants drive molecular and organismal diversity is rapidly increasing (8–12). One of the benefits of these advances is that we are now able to incorporate variant information (when available) in genomic experiments since such genetic variants may be an important source of experimental variability or even failure (13,14). Thus, to reduce experimental error, we decided to start displaying the presence of known SNPs within the GETPrime 2.0 primers to aid users in the design and interpretation of their experiments. So far, we were able to cover SNPs for human and mouse by importing them from dbSNP v145 (15) and to map these SNPs to the primers that overlap them. Corresponding positions in the primer sequences are then highlighted (Figure 3) and a link to the dbSNP-based evidence allows a more detailed evaluation of the nature and relevance of the polymorphism(s).
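Mapping SNPs to primers amounts to an interval overlap test followed by a conversion from genomic coordinates to positions within the primer sequence. The Python sketch below shows one way to do this; it is not the GETPrime implementation. It assumes 1-based inclusive genomic coordinates, a primer that maps to a single contiguous genomic interval (no intron), and uses square brackets in place of the red highlighting on the details page. The function names are hypothetical.

```python
def snp_offsets(primer_start, primer_end, strand, snp_positions):
    """Return 0-based offsets (5'->3' along the primer) of overlapping SNPs.

    Coordinates are 1-based and inclusive; `strand` is "+" or "-".
    """
    offsets = []
    for pos in snp_positions:
        if primer_start <= pos <= primer_end:
            if strand == "+":
                offsets.append(pos - primer_start)
            else:
                # On the minus strand the primer reads from the right end.
                offsets.append(primer_end - pos)
    return sorted(offsets)


def highlight(sequence, offsets):
    """Mark the bases at the given offsets with square brackets."""
    marked = set(offsets)
    return "".join(f"[{base}]" if i in marked else base
                   for i, base in enumerate(sequence))
```

For example, `highlight("ACGT", [1])` yields `A[C]GT`. A primer that spans an intron would need the same test applied to each of its genomic segments, with offsets accumulated across segments.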
Database content
The GETPrime 2.0 database currently contains a total of 1 175 874 primer pairs (444 256 in human, 268 855 in mouse), corresponding to an average of six pairs per covered gene (across 13 species). In human, there are more than 20 pairs per gene and 12 in mouse. On average, 92% of Ensembl protein-coding genes are covered by our database, the remainder corresponding to non-unique sequences for which specific primers could not be designed (Table 1). Importantly, for human and mouse, this number exceeds 98%. However, some species are still only partially covered due to differences in the Ensembl annotation compared to the human database. In particular, for A.carolinensis or macaque, only a fraction of the annotated genes were processed in the pipeline (Table 1). Moreover, the incomplete status of the macaque assembly led to a high failure rate of the pipeline probably due to the repetitive nature of unassembled contigs (Table 1). We plan to resolve both issues in a next release. Regarding polymorphisms, a total of 2 864 885 variants were mapped to human primers (492 968 in mouse), indicating that more than 80% of human primers overlap a documented variant, with an average of about three SNPs per primer. This illustrates the importance of considering this information when designing or using primers.
Table 1.
Global statistics of GETPrime 2.0 for each of the 13 included species
Species                     Genes in Ensembl v81   Genes covered (% of total)   Primer pairs   Variants
Anolis carolinensis         19                     19 (100%)                    57
Caenorhabditis elegans      20 447                 20 412 (99.8%)               104 810
Danio rerio                 22 337                 21 805 (97.6%)               121 576
Drosophila melanogaster     13 918                 13 911 (99.9%)               99 032
Gallus gallus               5222                   5204 (99.6%)                 18 791
Homo sapiens                22 017                 21 653 (98.3%)               444 256        2 864 885
Macaca mulatta              8693                   1154 (13.2%)                 5345
Mus musculus                22 155                 21 835 (98.6%)               268 855        492 968
Ornithorhynchus anatinus    170                    149 (87.6%)                  606
Pan troglodytes             140                    140 (100%)                   474
Rattus norvegicus           21 470                 20 841 (97.0%)               88 311
Saccharomyces cerevisiae    6692                   6620 (98.9%)                 19 923
Tetraodon nigroviridis      1130                   1125 (99.6%)                 3838
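The summary figures quoted in the text can be recomputed from Table 1 with a few lines of Python. The sketch below assumes that the averages are unweighted means over the 13 species and that the SNP rate is the number of mapped human variants divided by twice the number of human primer pairs (two primers per pair); the text does not state how the averages were obtained, so these are our assumptions.

```python
from statistics import mean

# Species: (genes in Ensembl v81, genes covered, primer pairs), from Table 1.
TABLE_1 = {
    "Anolis carolinensis": (19, 19, 57),
    "Caenorhabditis elegans": (20447, 20412, 104810),
    "Danio rerio": (22337, 21805, 121576),
    "Drosophila melanogaster": (13918, 13911, 99032),
    "Gallus gallus": (5222, 5204, 18791),
    "Homo sapiens": (22017, 21653, 444256),
    "Macaca mulatta": (8693, 1154, 5345),
    "Mus musculus": (22155, 21835, 268855),
    "Ornithorhynchus anatinus": (170, 149, 606),
    "Pan troglodytes": (140, 140, 474),
    "Rattus norvegicus": (21470, 20841, 88311),
    "Saccharomyces cerevisiae": (6692, 6620, 19923),
    "Tetraodon nigroviridis": (1130, 1125, 3838),
}
HUMAN_VARIANTS = 2_864_885

# Total number of primer pairs in the database.
total_pairs = sum(pairs for _, _, pairs in TABLE_1.values())

# Unweighted mean over species of primer pairs per covered gene.
mean_pairs_per_gene = mean(pairs / covered
                           for _, covered, pairs in TABLE_1.values())

# Unweighted mean over species of the percentage of genes covered.
mean_coverage = mean(100 * covered / genes
                     for genes, covered, _ in TABLE_1.values())

# Human variants per primer, assuming two primers per pair.
human_pairs = TABLE_1["Homo sapiens"][2]
snps_per_primer = HUMAN_VARIANTS / (2 * human_pairs)

print(total_pairs, round(mean_pairs_per_gene, 1),
      round(mean_coverage, 1), round(snps_per_primer, 1))
```

Under these assumptions the script gives a total of 1 175 874 primer pairs, roughly six pairs per covered gene, about 92% mean coverage, and about three SNPs per human primer, in line with the figures quoted in the text.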
CONCLUSION AND PERSPECTIVE
The steady access statistics of the GETPrime database indicate that the embedded primer information is useful, and the release of GETPrime 2.0 responds to the user feedback that we have received, namely to update the genomic data, extend the database to new species, and cross-reference new types of genomic data (polymorphisms). Our plan for the future is to maintain the availability of the database, keep it up to date, and add new species when possible. In addition, we intend for GETPrime to closely follow and reflect the growth of genomic data resources at Ensembl and elsewhere. One additional important aspect would be a broader experimental validation of our in silico-designed primers. One way to achieve this would be to accommodate user feedback. We intend to implement a system that would allow the flagging of primers that have been successfully (or possibly even unsuccessfully) used in experiments, including links to the respective papers.