
BLMT: statistical sequence analysis using N-grams.

Madhavi Ganapathiraju, Vijayalaxmi Manoharan, Judith Klein-Seetharaman.

Abstract

UNLABELLED: Statistical analysis of amino acid and nucleotide sequences, especially sequence alignment, is one of the most commonly performed tasks in modern molecular biology. However, for many tasks in bioinformatics, the requirement that the features in an alignment be consecutive is restrictive, and "n-grams" (also known as k-tuples) have been used as features instead. N-grams are usually short nucleotide or amino acid sequences of length n, but the unit for a gram may be chosen arbitrarily. The n-gram concept is borrowed from language technologies, where n-grams of words form the fundamental units in statistical language models. Despite the demonstrated utility of n-gram statistics in the biology domain, there is currently no publicly accessible generic tool for the efficient calculation of such statistics. Most sequence analysis tools disregard matches to short sequences because such matches lack statistical significance. This article presents the integrated Biological Language Modeling Toolkit (BLMT), which allows efficient calculation of n-gram statistics for arbitrary sequence datasets.

AVAILABILITY: BLMT can be downloaded from http://www.cs.cmu.edu/~blmt/source and installed for standalone use on any Unix platform or Unix shell emulation such as Cygwin on the Windows platform. Specific tools and usage details are described in a "readme" file. The n-gram computations carried out by the BLMT are part of a broader set of tools borrowed from language technologies and modified for statistical analysis of biological sequences; these are available at http://flan.blm.cs.cmu.edu/.
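
To make the n-gram idea concrete, the following is a minimal Python sketch of counting overlapping n-grams in a single sequence, with single characters (residues or bases) as the gram unit. It is an illustration only and is not taken from BLMT; the function names `ngram_counts` and `ngram_frequencies` are hypothetical, and no claim is made about how BLMT computes or stores its statistics.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (k-tuples) in one sequence.

    Illustrative helper, not part of BLMT. The gram unit is assumed
    to be a single character (one residue or one base).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    seq = sequence.upper()
    # Slide a window of width n across the sequence, one position at a time.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_frequencies(sequence: str, n: int) -> dict:
    """Convert n-gram counts into relative frequencies."""
    counts = ngram_counts(sequence, n)
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than n: there are no n-grams to report.
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Toy nucleotide sequence chosen only for illustration.
    print(ngram_counts("ATGATGA", 3).most_common())
```

For the toy sequence `ATGATGA` and n = 3, the windows are ATG, TGA, GAT, ATG and TGA, so ATG and TGA are each counted twice and GAT once. A sequence shorter than n yields no n-grams.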

Year:  2004        PMID: 15693744     DOI: 10.2165/00822942-200403020-00013

Source DB:  PubMed          Journal:  Appl Bioinformatics        ISSN: 1175-5636


  6 in total

1.  Mitochondrial Redox Opto-Lipidomics Reveals Mono-Oxygenated Cardiolipins as Pro-Apoptotic Death Signals.

Authors:  Gaowei Mao; Feng Qu; Claudette M St Croix; Yulia Y Tyurina; Joan Planas-Iglesias; Jianfei Jiang; Zhentai Huang; Andrew A Amoscato; Vladimir A Tyurin; Alexandr A Kapralov; Amin Cheikhi; John Maguire; Judith Klein-Seetharaman; Hülya Bayır; Valerian E Kagan
Journal:  ACS Chem Biol       Date:  2016-01-05       Impact factor: 5.100

2.  Evolutionary insights from suffix array-based genome sequence analysis.

Authors:  Anindya Poddar; Nagasuma Chandra; Madhavi Ganapathiraju; K Sekar; Judith Klein-Seetharaman; Raj Reddy; N Balakrishnan
Journal:  J Biosci       Date:  2007-08       Impact factor: 1.826

3.  AL-Base: a visual platform analysis tool for the study of amyloidogenic immunoglobulin light chain sequences.

Authors:  Kip Bodi; Tatiana Prokaeva; Brian Spencer; Maurya Eberhard; Lawreen H Connors; David C Seldin
Journal:  Amyloid       Date:  2009-03       Impact factor: 7.141

4.  N-gram analysis of 970 microbial organisms reveals presence of biological language models.

Authors:  Hatice Ulku Osmanbeyoglu; Madhavi K Ganapathiraju
Journal:  BMC Bioinformatics       Date:  2011-01-10       Impact factor: 3.169

5.  Mining for class-specific motifs in protein sequence classification.

Authors:  Satish M Srinivasan; Suleyman Vural; Brian R King; Chittibabu Guda
Journal:  BMC Bioinformatics       Date:  2013-03-15       Impact factor: 3.169

6.  Comparison of phosphorylation patterns across eukaryotes by discriminative N-gram analysis.

Authors:  Itziar Frades; Svante Resjö; Erik Andreasson
Journal:  BMC Bioinformatics       Date:  2015-07-30       Impact factor: 3.169

