
PSSRdb: a relational database of polymorphic simple sequence repeats extracted from prokaryotic genomes.

Pankaj Kumar¹, Pasumarthy S Chaitanya, Hampapathalu A Nagarajaram.

Abstract

PSSRdb (Polymorphic Simple Sequence Repeats database) (http://www.cdfd.org.in/PSSRdb/) is a relational database of polymorphic simple sequence repeats (PSSRs) extracted from 85 different species of prokaryotes. Simple sequence repeats (SSRs) are tandem repeats of nucleotide motifs 1-6 bp in size and are highly polymorphic. SSR mutations in and around coding regions affect transcription and translation of genes. Such changes underpin the phase variation and antigenic variation seen in some bacteria. Although SSR-mediated phase variation and antigenic variation have been well studied in some bacteria, many other prokaryotic species have yet to be investigated for SSR-mediated adaptive and other evolutionary advantages. As part of our ongoing studies on SSR polymorphism in prokaryotes, we compared the genome sequences of the various strains and isolates available for 85 different species of prokaryotes, extracted the SSRs showing length variation and compiled them into a relational database called PSSRdb. The database provides information such as the location of PSSRs in genomes, their length variation across genomes and the regions harboring them. This information should be useful for further research and analysis of SSRs in prokaryotes.

Year: 2010 PMID： 21112874 PMCID： PMC3013739 DOI： 10.1093/nar/gkq1198

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Simple sequence repeats (SSRs), also known as microsatellites, are repetitive nucleotide sequences ubiquitously present in all known genomes (1–9). These sequences characteristically comprise mono- to hexa-nucleotide repeats arranged in tandem. SSRs undergo high rates of insertion and deletion (INDEL) mutations of their repeat units as a consequence of slipped mispairing of the nascent and template strands during replication, and hence exhibit high polymorphism (10,11). INDEL mutations of repeat units in SSRs occur at high frequencies, ranging from 10^-6 to 10^-2 per generation, which is much higher than base substitution rates (6,11–13). Mutations in SSRs have different effects depending on the location of the SSRs relative to the organization of genes (6,14). SSRs located far from coding regions may evolve neutrally and have no effect on the structure and function of genes. On the other hand, mutations of SSRs in coding regions or near the regulatory regions of genes could produce considerable effects on translation or transcription. Furthermore, the severity of the effect in coding regions depends on the repeat type and the repeat location (11). Polymorphic SSRs with a repeating motif length of 3 or 6 nt in coding regions cause in-frame mutations, which translate into insertion or deletion of amino acid residues, whereas polymorphic SSRs of non-triplet repeats (mono-, di-, tetra- and penta-nucleotide) cause frame-shift mutations. The abundance and length distribution of SSRs in genomes give the impression that SSRs are suppressed in prokaryotic genomes as compared to eukaryotic genomes (9). Nonetheless, some SSRs do show polymorphism, and such SSRs are known to confer beneficial effects on prokaryotes [reviewed in (6,8,14)].
The best-documented effects are SSR-mediated phase variation and antigenic variation, which many pathogens exploit to evade the challenges posed by host immune systems and which have been studied in some bacteria (15). Our group has been analyzing polymorphic SSRs in known prokaryotic genomes to understand the SSR-mediated evolution of pathogens. During the course of our studies, we identified and extracted SSRs that show length variation among the different strains and isolates available for 85 different prokaryotic species. All the data pertaining to these polymorphic SSRs (PSSRs) have been compiled in the form of a relational database called PSSRdb. The present communication gives the details of this database.
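The in-frame versus frame-shift distinction described above can be illustrated with a minimal sketch. The function name `indel_effect` and its arguments are hypothetical and are not part of PSSRdb or PSSRFinder; the sketch simply assumes that the net number of inserted or deleted bases decides the outcome.

```python
def indel_effect(motif: str, units_changed: int) -> str:
    """Classify an SSR indel inside a coding region.

    motif         -- the repeat unit, e.g. "A" or "CAG" (1-6 nt)
    units_changed -- repeat units gained (+) or lost (-)
    """
    if units_changed == 0:
        return "no change"
    # Net bases inserted or deleted; the reading frame is kept
    # only if this is a multiple of three.
    net_bases = len(motif) * abs(units_changed)
    return "in-frame" if net_bases % 3 == 0 else "frame-shift"


for motif in ("A", "AT", "CAG", "ACGT", "ACGTA", "ACGTAC"):
    print(len(motif), motif, indel_effect(motif, 1))
```

For a change of a single repeat unit, only the 3 nt and 6 nt motifs are reported as in-frame, matching the description in the text.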

EXTRACTION OF THE DATA PERTAINING TO PSSRS

The complete genome sequences of species with at least two sequenced strains were downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). PSSRs were extracted with an in-house tool called PSSRFinder (Kumar, P. and Nagarajaram, H.A., unpublished data), whose workflow is shown in Figure 1. Essentially, PSSRFinder runs BLASTN (16) to identify equivalent SSRs (SSRs having very similar or identical flanking sequences at least 50 bp long) among all the genomes available for a species. Some essential details of the method are given below:

Figure 1.

Schematic representation of PSSRFinder. C_PSSRF and NC_PSSRF are the two PERL programs which parse coding and non-coding PSSRs respectively from the BLAST output.

1. Identification of SSRs in the given genomes using SSRF (17), which reports the SSR motif, the motif repeat count, the co-ordinates of the SSR tract in the genome and its location relative to coding and non-coding regions.
2. Identification of equivalent SSRs, along with their conserved flanking segments, among the various strains and isolates by BLASTN searches with the following parameters: E-value ≤10^-20; X drop-off value for final gapped alignment = 1000; repeat masking filter = off.
3. Identification of PSSRs by comparing the tract lengths of equivalent SSRs found in all the given genomes. If the equivalent polymorphic SSRs lie in non-coding regions in all the genomes, the PSSR is annotated as a non-coding PSSR. If it lies in a coding region in even one of the genomes, it is referred to as a coding PSSR.
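The first and third steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PSSRFinder implementation: a regular expression stands in for SSRF, the `min_units` threshold is arbitrary, the BLASTN-based matching of equivalent SSRs is skipped, and the names `find_ssrs` and `classify_pssr` are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def find_ssrs(seq: str, min_units: int = 3) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for perfect tandem repeats
    of 1-6 nt motifs. A regex stand-in for SSRF, for illustration only."""
    # Lazy {1,6}? tries the shortest motif first at each position.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_units - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


def classify_pssr(tracts: Dict[str, Tuple[int, bool]]) -> Optional[str]:
    """tracts maps strain -> (repeat_count, lies_in_coding_region) for one
    set of equivalent SSRs. Returns None if the tract length does not vary."""
    if len({count for count, _ in tracts.values()}) < 2:
        return None  # same length in every genome: not polymorphic
    # Coding if the SSR lies in a coding region in even one genome.
    if any(in_coding for _, in_coding in tracts.values()):
        return "coding"
    return "non-coding"


print(find_ssrs("TTGACAGCAGCAGCAGTTC"))
print(classify_pssr({"strain_A": (7, False), "strain_B": (9, True)}))
```

The strain names and sequences here are toy inputs chosen only to exercise the two functions.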

STRUCTURE OF THE DATABASE

PSSRdb has been developed using MySQL (www.mysql.com). PSSRs found in coding and non-coding regions are stored separately in two logically connected databases. The coding and non-coding databases each contain 357 tables, each of which holds information pertaining to PSSRs, viz. motif types, repeat copy numbers of SSRs, genomic locations of SSRs and information on the coding regions harboring or flanking the PSSRs. The structures of the relational tables in the coding and non-coding PSSR databases are given in Tables 1 and 2, respectively.

Table 1.

Structure of MySQL table which is used for storing coding PSSR information

Information	Field	Type	Null	Key	Default	Extra
PSSR number	P_n	int(11)	NO	PRI	NULL	auto_increment
Strain name	Strn	varchar(90)	YES		NULL
PSSR	mf	varchar(8)	YES		NULL
Repeat length	rpt	int(11)	YES		NULL
Start of repeat	strt_rpt	varchar(20)	YES		NULL
End of repeat	end_rpt	varchar(20)	YES		NULL
Mutation point	mut_pnt	varchar(20)	YES		NULL
Sequence	seq	varchar(50)	YES		NULL
Strand type	strnd_type	varchar(5)	YES		NULL
Protein length	prtn_len	bigint(20)	YES		NULL
Protein ID	prtn_id	varchar(20)	YES		NULL
ORF	orf_name	varchar(20)	YES		NULL
Protein function	prtn_func	varchar(150)	YES		NULL
DNA sequence of length 400 nucleotides	seq_link	varchar(550)	YES		NULL

Table 2.

Structure of MySQL table which is used for storing non-coding PSSR information

Information	Field	Type	Null	Key	Default	Extra
PSSR number	P_n	int(11)	NO	PRI	NULL	auto_increment
Strain name	Strn	varchar(90)	YES		NULL
PSSR	mf	varchar(8)	YES		NULL
Repeat length	rpt	int(11)	YES		NULL
Start of repeat	s_rpt	varchar(20)	YES		NULL
End of repeat	e_rpt	varchar(20)	YES		NULL
Mutation point	mut_pnt	varchar(20)	YES		NULL
Sequence	seq	varchar(50)	YES		NULL
Distance from left ORF	L_D	varchar(10)	YES		NULL
Left strand type	U_S_T	varchar(5)	YES		NULL
Left protein length	U_P_L	bigint(20)	YES		NULL
Left protein ID	U_P_I	varchar(20)	YES		NULL
Left ORF	U_orf	varchar(20)	YES		NULL
Distance from right ORF	R_D	varchar(10)	YES		NULL
Right strand type	D_S_T	varchar(5)	YES		NULL
Right protein length	D_P_L	bigint(20)	YES		NULL
Right protein ID	D_P_I	varchar(20)	YES		NULL
Right ORF	D_orf	varchar(20)	YES		NULL
DNA sequence of 400 nucleotide length	seq_link	varchar(550)	YES		NULL
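To show how a table with this layout might be queried, the sketch below rebuilds a few of the Table 1 fields in an in-memory SQLite database. Everything beyond the field names is an assumption: PSSRdb itself runs on MySQL, the table name `coding_pssr` is hypothetical, `mf` is assumed to hold the repeat motif, and the rows are toy values, not records from the database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table using a subset of the Table 1 field names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE coding_pssr (
               P_n      INTEGER PRIMARY KEY AUTOINCREMENT,
               Strn     TEXT,
               mf       TEXT,
               rpt      INTEGER,
               strt_rpt TEXT,
               end_rpt  TEXT
           )"""
    )
    toy_rows = [  # placeholder values for illustration only
        ("strain_A", "G", 7, "1001", "1007"),
        ("strain_B", "G", 9, "1001", "1009"),
        ("strain_A", "CAG", 4, "5000", "5011"),
    ]
    conn.executemany(
        "INSERT INTO coding_pssr (Strn, mf, rpt, strt_rpt, end_rpt) "
        "VALUES (?, ?, ?, ?, ?)",
        toy_rows,
    )
    return conn


def pssrs_by_motif_length(conn: sqlite3.Connection, motif_len: int):
    """List (strain, motif, repeat length) for motifs of a given size."""
    cur = conn.execute(
        "SELECT Strn, mf, rpt FROM coding_pssr "
        "WHERE length(mf) = ? ORDER BY rpt DESC",
        (motif_len,),
    )
    return cur.fetchall()


conn = build_demo_db()
print(pssrs_by_motif_length(conn, 1))
```

The same SELECT statement should carry over to MySQL with little change, since `length()` and parameterized queries are available there as well, though the placeholder syntax depends on the client library.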


OVERVIEW OF THE DATABASE AND ITS USAGE FOR DATA EXTRACTION

An overview of the database is shown in Figure 2. The main page contains a pull-down menu listing the names of all 85 species. Once a species is selected, the page is updated with the list of all available strains belonging to that species. One can select two or more of the listed strains to query for PSSRs found in the selected set of strains. Separate options are provided to query for PSSRs found in coding regions and in non-coding regions. A query leads to a page that gives the number of PSSRs found in the selected species. The numbers are clickable links that lead to pages containing detailed information on the corresponding PSSRs. The displayed information includes the sequence of the repeat motif, its genomic location and the details of the regions harboring that repeat motif. On this page, each listed PSSR is also hyperlinked to primer design using PRIMER3 (14). The coding regions harboring or flanking the PSSRs are hyperlinked to their respective annotations at the NCBI site (http://www.ncbi.nlm.nih.gov/).

Figure 2.

Overview of PSSRdb shown using screen-shots of various pages. (A) Main page containing species name which can be selected; (B) PSSRs found in the selected species; (C) Table containing the useful details of the selected coding PSSRs found in the selected species; (D) Table containing the useful details of the selected non-coding PSSRs found in the selected species; (E) Sequence alignment of a selected PSSR (in this case G tract).

As mentioned earlier, the PSSRs stored in PSSRdb have been identified species-wise and correspond to SSRs that show length variation among the different strains and isolates available for each of the 85 species. In this respect, a word of caution is in order. Although all the prokaryotic genomes have >10× coverage, sequencing or assembly mistakes cannot be completely ruled out. Some SSRs may qualify as PSSRs as a consequence of sequencing errors or of mistakes made during assembly of the genome sequences, and such artifacts are very difficult to identify. Nonetheless, we believe the data in PSSRdb make a good starting point for further exploratory investigations of SSR polymorphism in prokaryotes.

Identifying the PSSRs in a species has potential applications that depend on the region in which they occur. A strain-specific PSSR (one whose SSR length varies in only one strain) could be used to identify that strain and is of importance in making diagnostic kits. Genes harboring PSSRs are good candidates for studying the functional role of genes in pathogenesis and virulence.
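The notion of a strain-specific PSSR can be made concrete with a small sketch. It assumes the tract lengths of one set of equivalent SSRs are already known for each strain; the function name `strain_specific` and the requirement of at least three strains are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, Optional


def strain_specific(lengths: Dict[str, int]) -> Optional[str]:
    """Return the one strain whose tract length differs from all the
    others, or None if the PSSR is not strain-specific.

    lengths maps strain name -> tract length of one equivalent SSR.
    """
    if len(lengths) < 3:
        return None  # with two strains the variant one is ambiguous
    counts = Counter(lengths.values())
    if len(counts) != 2:
        return None  # no variation, or more than two length classes
    (_, _), (rare_len, n_rare) = counts.most_common(2)
    if n_rare != 1:
        return None  # the minority length is shared by several strains
    return next(s for s, length in lengths.items() if length == rare_len)


print(strain_specific({"strain_A": 7, "strain_B": 7, "strain_C": 9}))
```

A locus that passes this check would still need experimental confirmation, given the caution about sequencing and assembly errors noted above.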

FUTURE DIRECTION

A hyperlink will be provided to query for the multiple sequence alignment of the PSSRs along with their flanking regions, so that users can select the number of base pairs of upstream and downstream sequence and run the multiple sequence alignment on the fly. The database will be updated regularly as whole genome sequences of new prokaryotes become available.

FUNDING

The work as well as the publication costs were supported by the Core fund of Centre for DNA Fingerprinting and Diagnostics (CDFD).

Conflict of interest statement. None declared.

17 in total

1. MICAS: a fully automated web server for microsatellite extraction and analysis from prokaryote and viral genomic sequences.

Authors: Vattipally B Sreenu; Gundu Ranjitkumar; Sugavanam Swaminathan; Sasidharan Priya; Buddhaditta Bose; Mogili N Pavan; Geeta Thanu; Javaregowda Nagaraju; Hampapathalu A Nagarajaram
Journal: Appl Bioinformatics Date: 2003

Review 2. Phase and antigenic variation in bacteria.

Authors: Marjan W van der Woude; Andreas J Bäumler
Journal: Clin Microbiol Rev Date: 2004-07 Impact factor: 26.132

Review 3. DNA replication fidelity.

Authors: Thomas A Kunkel
Journal: J Biol Chem Date: 2004-02-26 Impact factor: 5.157

4. Simple sequence repeats in prokaryotic genomes.

Authors: Jan Mrázek; Xiangxue Guo; Apurva Shah
Journal: Proc Natl Acad Sci U S A Date: 2007-05-07 Impact factor: 11.205

5. Slippage synthesis of simple sequence DNA.

Authors: C Schlötterer; D Tautz
Journal: Nucleic Acids Res Date: 1992-01-25 Impact factor: 16.971

Review 6. Notes on the definition and nomenclature of tandemly repetitive DNA sequences.

Authors: D Tautz
Journal: EXS Date: 1993

Review 7. Bacterial antigenic variation, host immune response, and pathogen-host coevolution.

Authors: R C Brunham; F A Plummer; R S Stephens
Journal: Infect Immun Date: 1993-06 Impact factor: 3.441

Review 8. Adaptive evolution of highly mutable loci in pathogenic bacteria.

Authors: E R Moxon; P B Rainey; M A Nowak; R E Lenski
Journal: Curr Biol Date: 1994-01-01 Impact factor: 10.834

Review 9. Simple sequences.

Authors: D Tautz
Journal: Curr Opin Genet Dev Date: 1994-12 Impact factor: 5.578

Review 10. Slipped-strand mispairing: a major mechanism for DNA sequence evolution.

Authors: G Levinson; G A Gutman
Journal: Mol Biol Evol Date: 1987-05 Impact factor: 16.240
