
Evolutionary insights from suffix array-based genome sequence analysis.

Anindya Poddar, Nagasuma Chandra, Madhavi Ganapathiraju, K Sekar, Judith Klein-Seetharaman, Raj Reddy, N Balakrishnan.

Abstract

Gene and protein sequence analyses, central components of studies in modern biology, are readily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are also highly suited to sequence analysis. Here we report an improvement to the construction of suffix arrays. The enhanced versatility and scalability enabled by this approach are demonstrated using real-life examples. The scalability of the algorithm to whole genomes makes it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bigrams and higher n-grams, which indicates that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; the analysis indicates that as much as a quarter of the total space at the tetrapeptide level is left unsampled in prokaryotic organisms, although almost all tripeptides can be seen in one protein or another in a proteome. In addition, distinct patterns begin to emerge in the counts of particular tetrapeptides and higher peptides, indicative of a 'meaning' for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
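
The kinds of analysis named in the abstract (n-gram counting, peptide-space coverage, repeat detection) can be illustrated with a minimal Python sketch. It is not the construction method reported in the paper, which the abstract does not describe. It assumes a naive sort-based suffix array over a single sequence, and all function names are hypothetical. A whole-proteome analysis would also need separators between proteins and a scalable construction algorithm.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_suffix_array(text):
    """Naive construction (assumption, not the paper's method):
    sort suffix start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def ngram_counts(text, sa, n):
    """Count n-grams by walking the suffix array.

    Suffixes that share an n-gram prefix are adjacent in the array,
    so the resulting dict is filled in sorted n-gram order.
    """
    counts = {}
    for start in sa:
        gram = text[start:start + n]
        if len(gram) < n:  # suffix too short to hold a full n-gram
            continue
        counts[gram] = counts.get(gram, 0) + 1
    return counts


def peptide_space_coverage(counts, n):
    """Fraction of the 20**n possible n-peptides that were observed."""
    observed = sum(
        1 for gram in counts if all(c in AMINO_ACIDS for c in gram)
    )
    return observed / len(AMINO_ACIDS) ** n


def longest_repeat(text, sa):
    """Longest substring occurring at least twice.

    It is the longest common prefix of some pair of suffixes that are
    neighbours in the suffix array.
    """
    best = ""
    for a, b in zip(sa, sa[1:]):
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best


if __name__ == "__main__":
    seq = "MKVLAMKVLG"  # toy sequence, not real data
    sa = build_suffix_array(seq)
    print(ngram_counts(seq, sa, 2))
    print(peptide_space_coverage(ngram_counts(seq, sa, 1), 1))
    print(longest_repeat(seq, sa))
```

The sort-based construction slices every suffix, which is acceptable for a toy sequence but far too slow and memory-hungry for a genome; it is used here only to keep the example short.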

Year:  2007        PMID: 17914229     DOI: 10.1007/s12038-007-0087-z

Source DB:  PubMed          Journal:  J Biosci        ISSN: 0250-5991            Impact factor:   1.826


  10 in total

Review 1.  Chance favors the prepared genome.

Authors:  L H Caporale
Journal:  Ann N Y Acad Sci       Date:  1999-05-18       Impact factor: 5.691

2.  Variations on probabilistic suffix trees: statistical modeling and prediction of protein families.

Authors:  G Bejerano; G Yona
Journal:  Bioinformatics       Date:  2001-01       Impact factor: 6.937

Review 3.  The evolution of mycobacterial pathogenicity: clues from comparative genomics.

Authors:  R Brosch; A S Pym; S V Gordon; S T Cole
Journal:  Trends Microbiol       Date:  2001-09       Impact factor: 17.079

4.  Fast sequence clustering using a suffix array algorithm.

Authors:  Ketil Malde; Eivind Coward; Inge Jonassen
Journal:  Bioinformatics       Date:  2003-07-01       Impact factor: 6.937

5.  BLMT: statistical sequence analysis using N-grams.

Authors:  Madhavi Ganapathiraju; Vijayalaxmi Manoharan; Judith Klein-Seetharaman
Journal:  Appl Bioinformatics       Date:  2004

6.  Massive gene decay in the leprosy bacillus.

Authors:  S T Cole; K Eiglmeier; J Parkhill; K D James; N R Thomson; P R Wheeler; N Honoré; T Garnier; C Churcher; D Harris; K Mungall; D Basham; D Brown; T Chillingworth; R Connor; R M Davies; K Devlin; S Duthoy; T Feltwell; A Fraser; N Hamlin; S Holroyd; T Hornsby; K Jagels; C Lacroix; J Maclean; S Moule; L Murphy; K Oliver; M A Quail; M A Rajandream; K M Rutherford; S Rutter; K Seeger; S Simon; M Simmonds; J Skelton; R Squares; S Squares; K Stevens; K Taylor; S Whitehead; J R Woodward; B G Barrell
Journal:  Nature       Date:  2001-02-22       Impact factor: 49.962

7.  Alignment of whole genomes.

Authors:  A L Delcher; S Kasif; R D Fleischmann; J Peterson; O White; S L Salzberg
Journal:  Nucleic Acids Res       Date:  1999-06-01       Impact factor: 16.971

8.  Characterization of IS1547, a new member of the IS900 family in the Mycobacterium tuberculosis complex, and its association with IS6110.

Authors:  Z Fang; C Doig; N Morrison; B Watt; K J Forbes
Journal:  J Bacteriol       Date:  1999-02       Impact factor: 3.490

9.  Genome sequence of the human malaria parasite Plasmodium falciparum.

Authors:  Malcolm J Gardner; Neil Hall; Eula Fung; Owen White; Matthew Berriman; Richard W Hyman; Jane M Carlton; Arnab Pain; Karen E Nelson; Sharen Bowman; Ian T Paulsen; Keith James; Jonathan A Eisen; Kim Rutherford; Steven L Salzberg; Alister Craig; Sue Kyes; Man-Suen Chan; Vishvanath Nene; Shamira J Shallom; Bernard Suh; Jeremy Peterson; Sam Angiuoli; Mihaela Pertea; Jonathan Allen; Jeremy Selengut; Daniel Haft; Michael W Mather; Akhil B Vaidya; David M A Martin; Alan H Fairlamb; Martin J Fraunholz; David S Roos; Stuart A Ralph; Geoffrey I McFadden; Leda M Cummings; G Mani Subramanian; Chris Mungall; J Craig Venter; Daniel J Carucci; Stephen L Hoffman; Chris Newbold; Ronald W Davis; Claire M Fraser; Bart Barrell
Journal:  Nature       Date:  2002-10-03       Impact factor: 49.962

10.  Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence.

Authors:  S T Cole; R Brosch; J Parkhill; T Garnier; C Churcher; D Harris; S V Gordon; K Eiglmeier; S Gas; C E Barry; F Tekaia; K Badcock; D Basham; D Brown; T Chillingworth; R Connor; R Davies; K Devlin; T Feltwell; S Gentles; N Hamlin; S Holroyd; T Hornsby; K Jagels; A Krogh; J McLean; S Moule; L Murphy; K Oliver; J Osborne; M A Quail; M A Rajandream; J Rogers; S Rutter; K Seeger; J Skelton; R Squares; S Squares; J E Sulston; K Taylor; S Whitehead; B G Barrell
Journal:  Nature       Date:  1998-06-11       Impact factor: 49.962

  2 in total

1.  Direct mapping of symbolic DNA sequence into frequency domain in global repeat map algorithm.

Authors:  Matko Glunčić; Vladimir Paar
Journal:  Nucleic Acids Res       Date:  2012-09-12       Impact factor: 16.971

2.  N-gram analysis of 970 microbial organisms reveals presence of biological language models.

Authors:  Hatice Ulku Osmanbeyoglu; Madhavi K Ganapathiraju
Journal:  BMC Bioinformatics       Date:  2011-01-10       Impact factor: 3.169

