GMATo: A novel tool for the identification and analysis of microsatellites in large genomes.

Xuewen Wang, Peng Lu, Zhaopeng Luo.

Abstract

Simple sequence repeats (SSRs), also called microsatellites, are very useful for genetic marker development and genome applications. The growing number of fully sequenced large genomes provides a rich source for in silico SSR mining. However, existing SSR mining tools cannot process large genomes efficiently and produce few or no statistics. Genome-wide Microsatellite Analyzing Tool (GMATo) is a novel tool for SSR mining and statistics at the genome level. It is faster and more accurate than the existing tools SSR Locator and MISA. A DNA sequence that is too long is split into short chunks of several Mb each; motifs are then generated and searched using Perl's pattern-matching functions. Matched loci from each chunk are then merged to produce the final SSR loci information. Only one input file, containing raw FASTA DNA sequences, is required; the output files, in tabular format, list all SSR loci and their statistical distribution under four classifications. GMATo was programmed in Java and Perl with both graphical and command line interfaces, each of which can be run on its own in a platform-independent manner with full control over parameters. GMATo is a powerful tool for complete SSR characterization in genomes of any size. AVAILABILITY: GMATo is freely available at http://sourceforge.net/projects/gmato/files/?source=navbar or on request.

Keywords:  Genome; Marker development; Microsatellite; SSR; Software

Year:  2013        PMID: 23861572      PMCID: PMC3705631          DOI: 10.6026/97320630009541

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

Simple sequence repeats (SSRs), or microsatellites, are relatively short tandem repeats of DNA [1, 2]. Their length polymorphism is species-specific and heritable, which makes SSRs very useful for developing genetic markers that are widely used for linking genome sequence with traits, diversity studies, map-based cloning and molecular breeding [2]. Several useful software tools have been developed for in silico SSR discovery and marker development. However, they were designed before the era of large genomes and have two major limitations: i) their sequence-processing capacity and speed are too low for large genomes, as pointed out by Sharma et al. [2], even though large genomes, e.g. those of crops, are becoming important sources for SSR characterization thanks to advances in next-generation sequencing; ii) their statistical functions are absent or simple, as in TROLL [3]. In addition, some tools are platform-dependent, e.g. SSR Locator [4] and SciRoKo [5]. Most command line tools, e.g. MISA [6, 7], have no graphical interface and offer very few other functions. To overcome these limitations, a novel software tool named GMATo was developed for faster and more accurate SSR discovery and comprehensive statistical analysis, especially for large genomes, running on multiple platforms with both graphical and command line interfaces.

Methodology

GMATo was written in Perl and Java. Java was used to develop the graphical interface; Perl was used to discover microsatellites and perform the statistical analysis. In GMATo, DNA sequences are formatted first, and a long DNA sequence is split into small chunks of several Mb each for easier processing. All microsatellite motifs composed of the nucleotides A, T, G and C, at a user-controlled length, are generated using Perl metacharacters and regular expression patterns. All motifs are searched greedily through each DNA chunk using Perl's pattern-matching functions. The returned values are used to generate SSR loci information for each chunk, and the data from all chunks are merged to give the final SSR loci for a chromosome. In theory, this method allows efficient microsatellite discovery in a genome of any size. Statistical classification and summarization are performed at four levels, i.e. motif length, motif composition, grouped complementary motifs and chromosome/scaffold. A flowchart is shown in Figure 1.
Figure 1. Flowchart of the GMATo software.
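The chunk-search-merge idea described above can be sketched in Python. This is an illustration only, not the GMATo Perl code: the function names, the default parameter values and the particular regular expression (a lazily matched motif followed by a back-reference) are assumptions. For simplicity the sketch keeps the whole sequence in memory and slices chunks from it, so it shows the merging logic rather than the memory saving.

```python
import re


def ssr_pattern(min_len=2, max_len=6, min_repeats=5):
    """Regex for a perfect SSR: a motif of min_len..max_len bases followed
    by at least (min_repeats - 1) further tandem copies of itself."""
    return re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_len, max_len, min_repeats - 1)
    )


def find_ssrs(seq, min_len=2, max_len=6, min_repeats=5, chunk_size=1_000_000):
    """Return (start, end, motif, repeats) tuples, 1-based inclusive,
    scanning the sequence one chunk at a time."""
    pattern = ssr_pattern(min_len, max_len, min_repeats)
    # Longest stretch before a cut in which a shortest SSR may be incomplete.
    tail = max_len * min_repeats - 1
    if chunk_size <= tail:
        raise ValueError("chunk_size must be at least max_len * min_repeats")
    seq = seq.upper()
    loci, pos, n = [], 0, len(seq)
    while pos < n:
        end = min(pos + chunk_size, n)
        is_last = end == n
        next_pos = end if is_last else end - tail
        for m in pattern.finditer(seq[pos:end]):
            start, stop, motif = pos + m.start(), pos + m.end(), m.group(1)
            if not is_last and start >= end - tail:
                break  # too close to the cut: leave it for the next chunk
            # Merge step: extend across the chunk boundary by whole motif copies.
            while seq.startswith(motif, stop):
                stop += len(motif)
            loci.append((start + 1, stop, motif, (stop - start) // len(motif)))
            next_pos = max(next_pos, stop)
        pos = next_pos
    return loci


print(find_ssrs("GATTACA" + "AG" * 7 + "CTTGA"))
# [(8, 21, 'AG', 7)]
```

Restarting each chunk slightly before the previous cut, and extending any hit that runs into the cut, is one way to make the chunked result agree with a scan of the unsplit sequence; other merging strategies are possible.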

Validation

Microsatellite identification in the recently published whole genome of Setaria italica [8] showed that GMATo ran faster than either of the two most widely used tools, SSR Locator and MISA, on all three platforms (Table 1, see supplementary material). It was also easy to mine SSRs from the genome on an ordinary computer, because processing one chunk at a time requires less memory. In total, 46,739, 46,625 and 46,782 microsatellite loci were identified by GMATo, SSR Locator and MISA, respectively (Table 1), suggesting that GMATo mines SSRs more accurately than SSR Locator. Manual comparison of these loci revealed that the extra loci reported by MISA were mined redundantly from overlapping microsatellites.
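A comparison of this kind can also be scripted. The Python sketch below is hypothetical: the `Locus` fields, the function names and the toy coordinates are made up for illustration and are not taken from Table 1 or from any tool's real output. It reports which loci are unique to each of two result sets and flags neighbouring loci whose coordinates overlap.

```python
from collections import namedtuple

# Hypothetical record layout for one reported locus.
Locus = namedtuple("Locus", "seq_id start end motif")


def compare_loci(a, b):
    """Split two loci collections into shared and tool-specific sets."""
    a, b = set(a), set(b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


def overlapping_neighbours(loci):
    """Return pairs of loci that are adjacent in sorted order, lie on the
    same sequence and overlap in their coordinates."""
    ordered = sorted(loci, key=lambda l: (l.seq_id, l.start, l.end))
    return [
        (prev, cur)
        for prev, cur in zip(ordered, ordered[1:])
        if prev.seq_id == cur.seq_id and cur.start <= prev.end
    ]


# Toy data only.
tool_a = [Locus("chr1", 101, 120, "AG"), Locus("chr1", 400, 414, "AAG")]
tool_b = tool_a + [Locus("chr1", 112, 126, "GAG")]

result = compare_loci(tool_a, tool_b)
print(len(result["shared"]), len(result["only_b"]))  # 2 1
print(overlapping_neighbours(tool_b))
```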

Software input

GMATo provides both a graphical user interface and a command line interface, each of which can be run independently on Windows, Linux or Mac OS. With the default parameters, only one input file containing DNA sequence(s) in (raw) (multi-)FASTA format is required; it is chosen in graphical mode or typed in command line mode. The parameters are the motif length range, the minimum number of repeats and an option for highlighting microsatellites (Figure 2A). The motif length can be set to any range, rather than the 1-10 bp offered by most SSR mining tools.
Figure 2. Input and output of GMATo: (A) graphical input interface; (B) SSR loci information produced by GMATo; (C) SSR statistical data produced by GMATo.
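As a rough illustration of the input side, the following Python sketch reads (multi-)FASTA records and bundles the three parameters named above. The class name, field names and default values are assumptions made for this example and do not reflect GMATo's actual defaults or option names; the "unnamed" placeholder ID for a header-less raw sequence is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class SearchParams:
    # All defaults below are illustrative assumptions, not GMATo's.
    min_motif_len: int = 2
    max_motif_len: int = 10
    min_repeats: int = 5
    highlight: bool = False


def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of text lines.

    Works for single or multi-FASTA input; a raw sequence without a
    header line is returned under the placeholder ID "unnamed".
    """
    seq_id, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(parts)
            # Keep only the first word of the header as the ID.
            seq_id, parts = (line[1:].split() or ["unnamed"])[0], []
        else:
            if seq_id is None:
                seq_id = "unnamed"
            parts.append(line.upper())
    if seq_id is not None:
        yield seq_id, "".join(parts)


example = [">chr1 toy record", "acgt", "AC", "", ">chr2", "GGTT"]
print(list(read_fasta(example)))
# [('chr1', 'ACGTAC'), ('chr2', 'GGTT')]
```

Because `read_fasta` accepts any iterable of lines, an open file handle can be passed directly, e.g. `with open(path) as fh: records = list(read_fasta(fh))`.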

Software output

GMATo generates three output files: a formatting report, a file containing SSR loci information and a file containing the statistical distribution of SSRs. All output files are in tab-delimited plain text format for easy import into other applications, e.g. a spreadsheet, for viewing or further manipulation (Figure 2B, C). The formatting report summarizes the input sequence(s). The SSR loci file lists the input sequence ID and its length, the start and end positions of each microsatellite, the number of repeats and the motif sequence. The statistical distribution file provides genome-level statistics under four classifications, with a summary of totals at the end of each classification. Classification I is the motif length statistics, giving an overview of motif types and their abundance in ranked order. Classification II is the motif statistics based on sequence composition, i.e. each motif and its occurrence in ranked order. Classification III is the statistics of grouped complementary motifs, giving distribution data for complementary motifs such as TC/GA as one group and their occurrence in ranked order. Classification IV is the chromosome-level distribution, giving the total occurrence of motif(s) and the SSR frequency (loci/Mb) for each chromosome or super-scaffold.
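The four classifications can be approximated from a loci table with a few counters. The Python sketch below is an assumption-laden illustration rather than GMATo's own code: the `Locus` field names, the dictionary keys and the group label format are invented here, and complementary motifs are grouped by reverse complement only (so TC and GA fall into one group), which may differ from the tool's exact grouping rule.

```python
from collections import Counter, namedtuple

# Field order follows the loci file described above; the names are assumptions.
Locus = namedtuple("Locus", "seq_id seq_len start end repeats motif")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_group(motif):
    """Label a motif together with its reverse complement, e.g. TC -> 'GA/TC'."""
    partner = motif.translate(_COMPLEMENT)[::-1]
    return "/".join(sorted({motif, partner}))


def summarize(loci):
    """Return the four rankings, each ordered from most to least frequent."""
    per_seq = Counter(l.seq_id for l in loci)
    seq_len = {l.seq_id: l.seq_len for l in loci}
    return {
        "motif_length": Counter(len(l.motif) for l in loci).most_common(),
        "motif": Counter(l.motif for l in loci).most_common(),
        "complementary_group": Counter(
            complementary_group(l.motif) for l in loci
        ).most_common(),
        # (sequence ID, loci count, loci per Mb)
        "sequence": [
            (sid, count, count / (seq_len[sid] / 1e6))
            for sid, count in per_seq.most_common()
        ],
    }


def loci_to_tsv(loci):
    """Render loci as tab-delimited text, one locus per line."""
    return "\n".join("\t".join(str(field) for field in l) for l in loci)


# Toy loci for illustration only.
loci = [
    Locus("chr1", 2_000_000, 10, 29, 10, "TC"),
    Locus("chr1", 2_000_000, 100, 119, 10, "GA"),
    Locus("chr1", 2_000_000, 500, 514, 5, "AAG"),
    Locus("chr2", 1_000_000, 5, 16, 6, "AT"),
]
stats = summarize(loci)
print(stats["complementary_group"][0])  # ('GA/TC', 2)
print(stats["sequence"][0])             # ('chr1', 3, 1.5)
```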

Utility

GMATo can be used for efficient and fast identification of microsatellite sequences in any given DNA sequence or genome of any size. The detailed statistical distribution of microsatellites can be used for genome analysis.

Caveat and future development

The current version reports information for each perfect SSR locus. Compound and long imperfect microsatellites can be derived from the SSR loci output using an additional script. In future development, more functions will be added, including graphical display of statistical data, primer design, marker generation and electronic mapping of markers onto a genome. The final goal is to develop a powerful integrated toolkit that facilitates microsatellite characterization and marker development in large genomes.
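One way such an additional script might work is to chain perfect loci that lie close together on the same sequence. The Python sketch below is hypothetical: the `Locus` fields, the function name and the `max_gap` threshold of 10 bp are assumptions chosen for illustration, and definitions of compound and imperfect microsatellites vary between studies.

```python
from collections import namedtuple

# Hypothetical record layout for one perfect SSR locus.
Locus = namedtuple("Locus", "seq_id start end motif")


def group_compound(loci, max_gap=10):
    """Chain perfect SSR loci on the same sequence that are separated by
    at most max_gap bases; return only chains of two or more loci."""
    groups = []
    for locus in sorted(loci, key=lambda l: (l.seq_id, l.start)):
        prev = groups[-1][-1] if groups else None
        if (
            prev is not None
            and prev.seq_id == locus.seq_id
            and locus.start - prev.end - 1 <= max_gap
        ):
            groups[-1].append(locus)
        else:
            groups.append([locus])
    return [g for g in groups if len(g) > 1]


# Toy loci for illustration only.
loci = [
    Locus("chr1", 10, 29, "TC"),
    Locus("chr1", 33, 47, "AAG"),
    Locus("chr1", 200, 211, "AT"),
    Locus("chr2", 35, 50, "GA"),
]
for chain in group_compound(loci):
    print(chain[0].seq_id, chain[0].start, chain[-1].end,
          "+".join(l.motif for l in chain))
# chr1 10 47 TC+AAG
```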
  8 in total

1.  TROLL--tandem repeat occurrence locator.

Authors:  Adalberto T Castelo; Wellington Martins; Guang R Gao
Journal:  Bioinformatics       Date:  2002-04       Impact factor: 6.937

Review 2.  Microsatellites: simple sequences with complex evolution.

Authors:  Hans Ellegren
Journal:  Nat Rev Genet       Date:  2004-06       Impact factor: 53.242

Review 3.  Mining microsatellites in eukaryotic genomes.

Authors:  Prakash C Sharma; Atul Grover; Günter Kahl
Journal:  Trends Biotechnol       Date:  2007-10-22       Impact factor: 19.536

4.  SciRoKo: a new tool for whole genome microsatellite search and investigation.

Authors:  Robert Kofler; Christian Schlötterer; Tamas Lelley
Journal:  Bioinformatics       Date:  2007-04-26       Impact factor: 6.937

5.  Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.).

Authors:  T Thiel; W Michalek; R K Varshney; A Graner
Journal:  Theor Appl Genet       Date:  2002-09-14       Impact factor: 5.699

6.  Reference genome sequence of the model plant Setaria.

Authors:  Jeffrey L Bennetzen; Jeremy Schmutz; Hao Wang; Ryan Percifield; Jennifer Hawkins; Ana C Pontaroli; Matt Estep; Liang Feng; Justin N Vaughn; Jane Grimwood; Jerry Jenkins; Kerrie Barry; Erika Lindquist; Uffe Hellsten; Shweta Deshpande; Xuewen Wang; Xiaomei Wu; Therese Mitros; Jimmy Triplett; Xiaohan Yang; Chu-Yu Ye; Margarita Mauro-Herrera; Lin Wang; Pinghua Li; Manoj Sharma; Rita Sharma; Pamela C Ronald; Olivier Panaud; Elizabeth A Kellogg; Thomas P Brutnell; Andrew N Doust; Gerald A Tuskan; Daniel Rokhsar; Katrien M Devos
Journal:  Nat Biotechnol       Date:  2012-05-13       Impact factor: 54.908

7.  Genome-wide distribution and organization of microsatellites in plants: an insight into marker development in Brachypodium.

Authors:  Humira Sonah; Rupesh K Deshmukh; Anshul Sharma; Vinay P Singh; Deepak K Gupta; Raju N Gacche; Jai C Rana; Nagendra K Singh; Tilak R Sharma
Journal:  PLoS One       Date:  2011-06-21       Impact factor: 3.240

8.  SSR Locator: Tool for Simple Sequence Repeat Discovery Integrated with Primer Design and PCR Simulation.

Authors:  Luciano Carlos da Maia; Dario Abel Palmieri; Velci Queiroz de Souza; Mauricio Marini Kopp; Fernando Irajá Félix de Carvalho; Antonio Costa de Oliveira
Journal:  Int J Plant Genomics       Date:  2008
  37 in total

Review 1.  The Genome 10K Project: a way forward.

Authors:  Klaus-Peter Koepfli; Benedict Paten; Stephen J O'Brien
Journal:  Annu Rev Anim Biosci       Date:  2015       Impact factor: 8.923

2.  Siberian larch (Larix sibirica Ledeb.) chloroplast genome and development of polymorphic chloroplast markers.

Authors:  Eugeniya I Bondar; Yuliya A Putintseva; Nataliya V Oreshkova; Konstantin V Krutovsky
Journal:  BMC Bioinformatics       Date:  2019-02-05       Impact factor: 3.169

3.  ImtRDB: a database and software for mitochondrial imperfect interspersed repeats annotation.

Authors:  Viktor A Shamanskiy; Valeria N Timonina; Konstantin Yu Popadin; Konstantin V Gunbin
Journal:  BMC Genomics       Date:  2019-05-08       Impact factor: 3.969

4.  Genome-wide mining of potentially-hypervariable microsatellites and validation of markers in Momordica charantia L.

Authors:  Lavale Shivaji Ajinath; Deepu Mathew
Journal:  Genetica       Date:  2021-11-25       Impact factor: 1.082

5.  Pipeline for developing polymorphic microsatellites in species without reference genomes.

Authors:  Kai Liu; Nan Xie
Journal:  3 Biotech       Date:  2022-08-26       Impact factor: 2.893

6.  ProGeRF: proteome and genome repeat finder utilizing a fast parallel hash function.

Authors:  Robson da Silva Lopes; Walas Jhony Lopes Moraes; Thiago de Souza Rodrigues; Daniella Castanheira Bartholomeu
Journal:  Biomed Res Int       Date:  2015-02-25       Impact factor: 3.411

7.  The report of my death was an exaggeration: A review for researchers using microsatellites in the 21st century.

Authors:  Richard G J Hodel; M Claudia Segovia-Salcedo; Jacob B Landis; Andrew A Crowl; Miao Sun; Xiaoxian Liu; Matthew A Gitzendanner; Norman A Douglas; Charlotte C Germain-Aubrey; Shichao Chen; Douglas E Soltis; Pamela S Soltis
Journal:  Appl Plant Sci       Date:  2016-06-16       Impact factor: 1.936

8.  GMATA: An Integrated Software Package for Genome-Scale SSR Mining, Marker Development and Viewing.

Authors:  Xuewen Wang; Le Wang
Journal:  Front Plant Sci       Date:  2016-09-13       Impact factor: 5.753

9.  Identification of SNP and SSR Markers in Finger Millet Using Next Generation Sequencing Technologies.

Authors:  Davis Gimode; Damaris A Odeny; Etienne P de Villiers; Solomon Wanyonyi; Mathews M Dida; Emmarold E Mneney; Alice Muchugi; Jesse Machuka; Santie M de Villiers
Journal:  PLoS One       Date:  2016-07-25       Impact factor: 3.240

10.  Complete plastid genome sequence of Primula sinensis (Primulaceae): structure comparison, sequence variation and evidence for accD transfer to nucleus.

Authors:  Tong-Jian Liu; Cai-Yun Zhang; Hai-Fei Yan; Lu Zhang; Xue-Jun Ge; Gang Hao
Journal:  PeerJ       Date:  2016-06-28       Impact factor: 2.984
