Patho-Genes.org: a website dedicated to gene sequences of potential bioterror bacteria and PCR primers used to amplify them.

Julien Gardès, Dipankar Bachar, Olivier Croce, Richard Christen.

Abstract

Pathogenic agents can be very hard to detect, and usually they do not cause illness for several hours or days. To improve the speed and accuracy of detection tests and to meet the need for early diagnosis, molecular biology methods such as PCR are now used. However, selecting a proper target gene and designing good primers is often not easy. We present a dedicated website, http://patho-genes.org, where we provide every sequence, functional annotation, published primer and relevant article for every annotated gene of the major pathogenic bacterial species listed as key agents that could be used in a bioterrorism attack. Each published primer was analysed to determine its melting temperature, its specificity and its coverage (i.e. its sensitivity against every allele of its target gene). The data generated have been organized as one data sheet per gene, available through multiple browser panels and query systems. Published 2012. This article is a U.S. Government work and is in the public domain in the USA.

Year:  2012        PMID: 22681780      PMCID: PMC3815871          DOI: 10.1111/j.1751-7915.2012.00353.x

Source DB:  PubMed          Journal:  Microb Biotechnol        ISSN: 1751-7915            Impact factor:   5.813


Introduction

Pathogen detection protocols are traditionally based on cell cultures and biochemical tests (or the use of antibodies), which are inexpensive and largely automated. However, these procedures often take a long time. The advent of the polymerase chain reaction (PCR) allowed the emergence of molecular diagnostics. Faster [7 min (Belgrader et al., 1999) or less (http://www.nhdiag.com/profile_one.shtml)] and often more accurate (they can characterize which alleles are present), molecular detection methods are progressively replacing cell cultures and biochemical tests (Christen, 2008; Fricke et al., 2009), allowing early diagnosis (Saravolatz et al., 2003) and therapy monitoring (Zaph, 2010). Nevertheless, developing a molecular technique requires selecting the proper target gene and using good primers. A valid primer should be specific to the target species and should cover every intraspecific genetic variant of the target gene. Its melting temperature (Tm) should be 55°C or above for optimized amplification (Arun and Saurabha, 2003). In silico analyses make it possible, in theory, to verify the specificity, the sensitivity and the thermodynamic conditions of a molecular protocol before running the actual experiment. Retrieving every sequence for a given gene is not always easy. Two strategies are commonly used: retrieval by sequence similarity or by gene name (keywords). The most popular program for searching sequences by similarity is BLAST (Altschul et al., 1990). However, selecting the threshold (expect value, e‐value) that avoids noise (similar but different genes) may be quite difficult: the e‐value depends on the size of the database, and the proper value is not known a priori. When retrieving sequences using keywords, the difficulty lies in non‐annotated gene sequences, variations in naming, or wrong annotations. Finally, retrieving PCR primers from the literature is extremely tedious because hundreds of articles have to be read carefully.
As a result, the primers used are often taken from a recent publication, with no real guarantee of specificity or coverage. We propose a web resource, http://patho‐genes.org, which includes all the genetic information required for developing or using molecular detection: every sequence, functional annotation, relevant article and published primer for each pathogenicity gene of the major potential bioterror bacteria. For each published primer, an in silico analysis was performed to determine its specificity, coverage and melting temperature. Navigation panels and query systems allow these datasets to be retrieved.

Results and discussion

Patho‐Genes.org: Content

Data provided by this website are freely available. From the homepage, one can choose a pathogenic bacterium by selecting a species or by providing primers. Table 1 lists the main information available. A browsing system gives access to the data sheet of a given gene (Fig. 1). Genes can be selected from an alphabetical list of gene or protein names, or from a ranked list of the most sequenced or the most divergent genes. The ranking of most sequenced genes is based on the number of sequences available in the INSDC databases (the International Nucleotide Sequence Database Collaboration between the Japanese, European and American nucleotide databases: DDBJ, ENA and GenBank, respectively). The ranking of most divergent genes is based on the number of unique sequences for a gene, which gives a rough idea of its intraspecific divergence.
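
As a rough illustration of the two rankings described above, the following Python sketch counts, for each gene, the number of sequences and the number of unique sequences. The input format, the function names and the toy sequences are assumptions made for this example. The de‐replication is a simplified version of the one described in the Methods, not the authors' implementation.

```python
from collections import defaultdict


def dereplicate(sequences):
    """Return unique sequences, dropping exact duplicates and any sequence
    that is fully contained in a longer one (simplified de-replication)."""
    # Longest first, so each sequence is only compared with sequences
    # that are at least as long and have already been kept.
    candidates = sorted({s.upper() for s in sequences}, key=lambda s: (-len(s), s))
    kept = []
    for seq in candidates:
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


def rank_genes(records):
    """records: iterable of (gene_name, sequence) pairs (hypothetical input).
    Returns two lists of (gene, count) sorted by decreasing count:
    most sequenced genes, then most divergent genes."""
    by_gene = defaultdict(list)
    for gene, seq in records:
        by_gene[gene].append(seq)
    most_sequenced = sorted(
        ((gene, len(seqs)) for gene, seqs in by_gene.items()),
        key=lambda item: (-item[1], item[0]),
    )
    most_divergent = sorted(
        ((gene, len(dereplicate(seqs))) for gene, seqs in by_gene.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return most_sequenced, most_divergent


# Toy data, invented for the example.
records = [
    ("gyrA", "ATGAGCGACCTT"),
    ("gyrA", "ATGAGCGACCTT"),  # exact duplicate
    ("gyrA", "GAGCGACC"),      # contained in the first sequence
    ("gyrA", "ATGAGCGATCTT"),  # one SNP: a second variant
    ("rpoB", "ATGGTTTACTCC"),
    ("rpoB", "ATGGTCTACTCC"),
    ("rpoB", "ATGGTATACTCC"),
]
most_sequenced, most_divergent = rank_genes(records)
print(most_sequenced)
print(most_divergent)
```

In this toy dataset gyrA has more sequences, but rpoB has more unique sequences, so the two rankings differ.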
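Table 1. Major potential bioterror bacteria and statistics at Patho‐Genes.org

| Organism | Public databases: CDS | Public databases: CDS/genome | Public databases: GC% | Patho‐Genes.org: Entries | Patho‐Genes.org: CDS | Patho‐Genes.org: Published primers | Patho‐Genes.org: Articles |
|---|---|---|---|---|---|---|---|
| Bacillus anthracis | 28 835 | 5288 | 34.8 | 1190 | 6 481 | 204 | 3 934 |
| Burkholderia mallei | 21 292 | 5213* | 68.6* | 1426 | 6 413 | 0 | 58 |
| Burkholderia pseudomallei | 31 752 | 6152* | 68.2* | 2171 | 11 403 | 0 | 725 |
| Chlamydophila psittaci | 2 248 | 963 | 38.9 | 412 | 1 124 | 103 | 224 |
| Coxiella burnetii | 11 590 | 1903 | 42.6 | 846 | 5 047 | 0 | 38 |
| Francisella tularensis | 14 470 | 1698 | 32.2 | 886 | 8 135 | 117 | 1 965 |
| Mycobacterium tuberculosis | 22 828 | 4046 | 65.6 | 1814 | 11 886 | 0 | 13 632 |
| Rickettsia prowazekii | 2 067 | 893 | 29.0 | 836 | 1 932 | 0 | 178 |
| Yersinia pestis | 43 032 | 3863 | 46.9 | 1056 | 31 700 | 187 | 3 307 |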

Entries of Patho‐Genes.org correspond to annotated genes for each species. CDS (public databases): number of CDS sequences available in the EMBL database. CDS/genome: average number of CDS per genome. Entries: number of gene data sheets at Patho‐Genes.org. CDS (Patho‐Genes.org): number of CDS sequences in Patho‐Genes.org. Published primers: number of published primers collected from the literature. Articles: number of relevant articles found. B. mallei and B. pseudomallei genomes are composed of two chromosomes. Asterisks indicate that the CDS/genome value for these species is the sum of the averages of CDS/chromosome (GC% is an average over the two chromosomes).

Figure 1

Example of resources available in Patho‐Genes. After selecting a species, users can find a gene via the browsing panel located at the top of the species homepage. Genes can be selected by gene name, protein name, or by rank among the most sequenced or most divergent genes. Two query systems accept either keywords or primers to access the data sheet. From the data sheet, users can download sequences in FASTA format [every sequence (accession numbers), unique sequences (genetic variants) or aligned sequences], published primers (Tm and localization) or relevant articles.


Patho‐Genes.org: Utility

Patho‐Genes aims to provide a user‐friendly web resource that eases the study of genes and the design of PCR protocols for detecting major pathogenic bacteria (Table 1). We restricted our analysis to protein CoDing Sequences (CDS): non‐coding regions are less conserved than CDS, and primers placed in them are likely to achieve lower coverage. We standardized the display to a single gene name and a single protein name. This nomenclature is kept consistent between species, so that the same gene name and the same protein name are used whatever the selected species (e.g. gyrA for the DNA gyrase subunit A), as the HUGO Gene Nomenclature Committee does for human genes (Seal et al., 2011). A table containing all unified gene and protein names and their lists of synonyms is available (and queries can use any of these alternate names). For each gene we also provide every published PCR primer extracted from the scientific literature. Users have at their disposal all the data (sequences, unique sequences and alignments in FASTA format) needed to design new PCR primers or to test the validity of known PCR primers, for example via Primer3 (Untergasser et al., 2007), Prifi (Fredslund et al., 2005) or the web application OHM (Croce et al., 2008). The rankings of most sequenced and most divergent genes are also useful for starting a de novo primer design or for re‐orienting a detection strategy. They can ease the selection of a target gene, for example by selecting the most studied and most conserved genes to obtain a higher coverage. However, genes present in several copies in the same genome can bias these rankings. Indeed, our method of collecting sequences by similarity gathers, in a single cluster, identical or nearly identical genes whose loci are different. This is typically the case for housekeeping genes of transposons and other selfish genetic elements (e.g. transposases), which are known to be present in several copies in bacterial genomes (Gibert et al., 1990; Beuzón et al., 2004).
Although ranks may be overestimated, this type of clustering was retained because these different loci correspond to copies of the same gene and the differences between sequences are limited to a few single nucleotide polymorphisms (SNPs). From the data sheet of a gene (Fig. 1), users have access to several resources, such as the strains sequenced for the gene, links to the main biological databases (KEGG, GO, Interpro, PDB and Uniprot), links to the NCBI sequence viewer and relevant publications. It is also possible to download FASTA files containing every sequence for a gene, only the unique sequences, the aligned sequences, or the sequences with published primers. A text file, compatible with standard spreadsheet software, lists the coordinates and the strain name of each sequence. Each published PCR primer is displayed with Tms computed using five different methods, its position within the aligned sequences, estimates of its coverage and specificity, and links to the publications in which it was found. For some genes, the published primers found were mostly forward or mostly reverse, or only one primer was found, although a minimum of two is required for PCR amplification. These results seemingly arose when the missing primer was designed in a non‐coding region, when a restriction site added to the primer caused our automated retrieval process to fail, or when a larger genomic fragment was amplified with primers located within two different genes (Gardès et al., 2012). However, when several primers are available, users can combine two of them using the alignment map of primers against their target gene.
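
To give an idea of why several Tm values are reported per primer, the sketch below implements two simple, widely quoted rule‐of‐thumb formulas: a basic one that depends only on base composition, and a salt‐adjusted one that adds a sodium concentration term. These formulas, the default Na+ concentration of 50 mM and the example primer are assumptions chosen for illustration. They may differ from the exact formulas applied on the website, and the three nearest‐neighbour methods are not covered.

```python
import math


def _gc_and_length(primer):
    """Return (number of G+C bases, length) for an unambiguous DNA primer."""
    primer = primer.upper()
    if not primer or set(primer) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    return primer.count("G") + primer.count("C"), len(primer)


def tm_basic(primer):
    """Basic Tm in degrees Celsius, from base composition only."""
    gc, n = _gc_and_length(primer)
    if n < 14:
        # Wallace rule for short oligonucleotides: 2 per A/T, 4 per G/C.
        return 2.0 * (n - gc) + 4.0 * gc
    return 64.9 + 41.0 * (gc - 16.4) / n


def tm_salt_adjusted(primer, na_molar=0.05):
    """Salt-adjusted Tm in degrees Celsius.
    This form is usually quoted for oligonucleotides of 14 nt or more;
    na_molar is the Na+ concentration in mol/L (assumed default: 50 mM)."""
    gc, n = _gc_and_length(primer)
    return 100.5 + 41.0 * gc / n - 820.0 / n + 16.6 * math.log10(na_molar)


# Arbitrary 20-mer used only as an illustration.
primer = "AGAGTTTGATCCTGGCTCAG"
print(round(tm_basic(primer), 2), round(tm_salt_adjusted(primer), 2))
```

The two estimates differ by several degrees for the same primer, which is the reason for comparing methods rather than relying on a single value.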

Patho‐Genes.org: Tools

From the navigation panel, two query systems are available in Patho‐Genes. The first searches for genes by gene, protein or author name, optionally restricted to a species or a group of species. With the option ‘include alternate names’, every sequence of a gene can be found using any alternate name used to describe that gene. To select only genes that have published primers, users can tick the ‘published primer’ box. Ticking this box while leaving the gene, protein and author name fields empty displays the list of genes having published primers for a species or for the whole database. Unlike current online keyword‐based retrieval systems, our query system retrieves even non‐annotated sequences of a gene, because this database was built by combining similarity and keywords. The second query system takes the sequence of one primer or of a primer pair and returns the target genes, the amplicon sequences and the positions of the primers in the aligned sequences.
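
The primer‐pair query can be pictured with the following minimal Python sketch, which locates a forward primer and the reverse complement of a reverse primer in a target sequence and returns the amplicon between them. It assumes exact matches on a single unaligned sequence. The function names and the toy sequences are hypothetical, and the actual query program is not reproduced here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_amplicon(target, forward, reverse):
    """Return (start, end, amplicon) for the first forward-primer site that is
    followed by a reverse-primer site, or None if the pair does not match.
    Coordinates are 0-based, with end exclusive."""
    target = target.upper()
    start = target.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we look for its reverse complement downstream of the forward site.
    site = reverse_complement(reverse)
    site_pos = target.find(site, start + len(forward))
    if site_pos == -1:
        return None
    end = site_pos + len(site)
    return start, end, target[start:end]


# Toy target and primers, invented for the example.
target = "TTGACCATGGCTAGCAAGGTCCATTGACGGA"
result = find_amplicon(target, "ACCATGG", "GTCAATG")
print(result)
```

A production query would also have to handle primers given in either orientation, multiple binding sites and degenerate bases.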

Patho‐Genes.org: Conclusions

Patho‐Genes.org is a website containing all the information useful for developing PCR‐based molecular detection tests and for studying genes in major potential bioterror bacteria. In contrast to Multilocus Sequence Typing (MLST) databases (e.g. http://www.mlst.net/ or http://pubmlst.org/), which focus on seven housekeeping genes per species and provide every allele, primer pairs and MLST profiles, Patho‐Genes.org targets every annotated gene of a species but does not provide any MLST profile. In particular, our platform offers an analysis of the quality of published primers. Indeed, in the case of Vibrio cholerae, we discovered that only one‐third of published PCR primers are able to amplify every allele of their target gene (Gardès et al., 2012). This tool thus allows users to rapidly check the main characteristics of primers (specificity, coverage and thermodynamic parameters). Presently, primers are available for only four species. Primers for the other species of Table 1 will be available soon.
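
The notion of coverage used in this analysis can be illustrated with a short Python sketch that reports the fraction of unique sequences (alleles) of a gene containing a primer on either strand. It assumes exact matching, whereas a real PCR may tolerate some mismatches, and the criterion used on the website may differ. All names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_hits(primer, sequence):
    """True if the primer matches the sequence exactly on either strand."""
    primer, sequence = primer.upper(), sequence.upper()
    return primer in sequence or reverse_complement(primer) in sequence


def coverage(primer, alleles):
    """Fraction of alleles (unique sequences of one gene) matched by the primer."""
    if not alleles:
        raise ValueError("at least one allele is required")
    return sum(primer_hits(primer, allele) for allele in alleles) / len(alleles)


# Three toy alleles of one gene; the third carries a SNP in the primer site.
alleles = [
    "TTATGGCTAGCAAGG",
    "CCATGGCTAGCTAGG",
    "TTATGGCTTGCAAGG",
]
print(coverage("GGCTAGC", alleles))
```

A primer with coverage below 1.0, as in this toy case, would miss at least one known allele of its target gene.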

Methods

Ethics Statement: this study did not involve any living being or biological sample

Protein‐coding DNA sequences belonging to each potential pathogenic bacterium were collected using the ACNUC database and its retrieval system (Gouy and Delmotte, 2008). For each species, the occurrence of gene names was computed to obtain a raw list of the most frequent names. From this list, tBLASTx analyses were performed to cluster similar sequences. Every annotation of such similar sequences was collected, and the gene was internally described using the most frequently used gene and protein names. The relevance and consistency of each term were checked across species. For each gene, sequences were then ‘de‐replicated’: identical sequences and sequences contained in a longer sequence were removed in order to obtain a set of unique sequences. This set was aligned with MUSCLE version 3.8.31 (Edgar, 2004). Using the species name and the annotations of similar sequences, queries were run with Entrez at NCBI (PubMed), Jane (Schuemie and Kors, 2008) and eTBLAST (Errami et al., 2007) in order to retrieve a combined list of relevant PubMed IDentification numbers (PMID) for each gene. Some queries yielded up to hundreds of publications. Each article was downloaded in PDF format and relevant short nucleic acid sequences were extracted from each file using regular expressions. Oligomers found at least once in the set of unique sequences were considered as published primers. Because different methods of calculating a Tm can give different results, the basic (Mann et al., 2009) (bas), the salt‐adjusted (Howley et al., 1979) (Sal) and three nearest‐neighbour (Breslauer et al., 1986; Sugimoto et al., 1996; SantaLucia, 1998) (Bre, San and Sug) methods were used to compute the Tm of published primers via dnaMATE (Panjkovich et al., 2005). An Apache2 web server hosts Patho‐Genes. From the navigation panel, two query systems are available. The first one is written in Python and allows searching for genes using keywords: gene, protein or author names.
It is possible to specify whether the search should take into account every annotation used in INSDC databases, or be limited to genes having published primers. The program checks that the submitted keywords exist and selects the corresponding entries. The second query system is based on a C program developed by our team, which returns genes containing a primer or a set of primers. Unlike BLAST, this software allows the use of the IUPAC code for degenerate positions.
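
Support for degenerate positions can be sketched in Python by translating each IUPAC nucleotide code into a regular‐expression character class. This is only an illustration of the idea: the program described above is written in C and its algorithm is not detailed here, so the function name, the regular‐expression approach and the toy sequence are assumptions.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def find_degenerate(primer, sequence):
    """Return the 0-based start positions where a possibly degenerate primer
    matches the sequence (given strand only, overlapping matches included)."""
    try:
        body = "".join(IUPAC[base] for base in primer.upper())
    except KeyError as exc:
        raise ValueError(f"not an IUPAC nucleotide code: {exc.args[0]!r}") from exc
    # A lookahead makes the regex engine report overlapping matches too.
    lookahead = re.compile(f"(?={body})")
    return [match.start() for match in lookahead.finditer(sequence.upper())]


# Toy example: Y stands for C or T, S stands for C or G.
sequence = "TTGGCTAGAAGGTTACGG"
print(find_degenerate("GGYTAS", sequence))
print(find_degenerate("GGCTAG", sequence))
```

The degenerate primer matches two sites in the toy sequence, whereas the non‐degenerate version matches only one of them.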

Future directions

Patho‐Genes will be updated regularly to incorporate the latest release of the public databases. In the future, other organisms, such as pulmonary pathogens or newly emerging pathogens, will be included. The main goal of this project is to provide an exhaustive and user‐friendly web resource for DNA‐based technologies for detecting pathogenic bacteria.

References (23 in total)

1.  PCR detection of bacteria in seven minutes.

Authors:  P Belgrader; W Benett; D Hadley; J Richards; P Stratton; R Mariella; F Milanovich
Journal:  Science       Date:  1999-04-16       Impact factor: 47.728

2.  Broad-range bacterial polymerase chain reaction for early detection of bacterial meningitis.

Authors:  Louis D Saravolatz; Odette Manzor; Nancy VanderVelde; Joan Pawlak; Bradley Belian
Journal:  Clin Infect Dis       Date:  2002-12-12       Impact factor: 9.079

3.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

4.  Distribution of insertion sequence IS200 in Salmonella and Shigella.

Authors:  I Gibert; J Barbé; J Casadesús
Journal:  J Gen Microbiol       Date:  1990-12

5.  Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes.

Authors:  N Sugimoto; S Nakano; M Yoneyama; K Honda
Journal:  Nucleic Acids Res       Date:  1996-11-15       Impact factor: 16.971

6.  A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics.

Authors:  J SantaLucia
Journal:  Proc Natl Acad Sci U S A       Date:  1998-02-17       Impact factor: 11.205

7.  Predicting DNA duplex stability from the base sequence.

Authors:  K J Breslauer; R Frank; H Blöcker; L A Marky
Journal:  Proc Natl Acad Sci U S A       Date:  1986-06       Impact factor: 11.205

8.  A rapid method for detecting and mapping homology between heterologous DNAs. Evaluation of polyomavirus genomes.

Authors:  P M Howley; M A Israel; M F Law; M A Martin
Journal:  J Biol Chem       Date:  1979-06-10       Impact factor: 5.157

9.  In silico analyses of primers used to detect the pathogenicity genes of Vibrio cholerae.

Authors:  Julien Gardès; Olivier Croce; Richard Christen
Journal:  Microbes Environ       Date:  2012-05-17       Impact factor: 2.912

10.  MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors:  Robert C Edgar
Journal:  BMC Bioinformatics       Date:  2004-08-19       Impact factor: 3.169
