
Finding Alu in primate genomes with AF-1.

Ravi Shankar, Bhavesh Kataria, Mitali Mukerji.

Abstract

Repetitive sequences occupy more than 40% of the human genome, far more than the 2% occupied by coding DNA. Among these, Alu elements are the second largest class of repeats, occupying nearly 10% of the whole genome. Alus have been implicated in many genomic processes, sometimes giving rise to aberrations and often acting as silent players in genomic and regulatory evolution. Here we present a web server, AF1, developed exclusively for finding Alu-like elements. Besides an alignment-based methodology, the server uses probabilistic scanning to find more diverged elements, and it classifies elements more precisely by weighting sequence positions unequally through sequence encoding. AVAILABILITY: AF1 is freely available at http://software.iiar.res.in/af1/. A standalone version is also available for download.


Keywords: Alu; algorithm; non‐coding; primate; repetitive element

Year: 2009 PMID: 19293991 PMCID: PMC2652563 DOI: 10.6026/97320630003287

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Alus are short interspersed nuclear elements (SINEs), which comprise about 10% of the human genome (International Human Genome Sequencing Consortium, 2001). An intact Alu has two monomeric units with approximately 67% identity, linked through an A-rich region. The average length of these elements is estimated to be 282 base pairs, excluding the variable-length 3' poly-A tail. These repeats harbor regulatory sites and contribute to the regulatory repertoire of the genome [1,2]. Alu repeats have been implicated in alternative splicing and in coding for proteins [3]. Many monogenic diseases, such as acute myelocytic leukemia, Tay-Sachs disease and hemophilia, are associated with Alu transpositions [4]. Alu repeats have also been very useful markers in phylogenetic and evolutionary studies [5,6]. So far, two general-purpose repeat-finding programs have been used extensively for Alu finding: RepeatMasker and Censor [7]. Both have BLAST at their core and find a broad spectrum of repeats, complex as well as simple, and both depend on a common database, Repbase [8]. Here we find Alu-like elements by combining an alignment-based method with probabilistic modeling that uses matrices designed specifically on Alu sequences. The aim is to analyze Alu elements in the ever-increasing volume of primate sequence, given the growing interest in non-coding genomic sequences, the majority of which were earlier tagged as junk. Another important feature is the classification step, which weights positions unequally to minimize the impact of non-diagnostic positions in determining the class. AF1 would also be the first server dedicated exclusively to Alu elements.

Input and output

The basic working principle of AF1 is shown in Figure 1. The AF1 server has three main components: (1) a restricted alignment-based module, (2) a probabilistic modeler and (3) a classifier. The user supplies an input sequence in FASTA format, either by pasting the sequence or by loading a sequence file. The server has been designed to take a single large sequence.

The input query sequence is searched for exact-match seeds using a library of overlapping words generated from an Alu prototype sequence, unlike other database search tools, which instead break the query into words to scan databases. For every hit, only the flanking 300 bp regions on both sides are taken for further analysis through alignment. Each of these subsequences is first scanned for the longest possible region of continuous match, which nucleates the alignment. Unlike other well-known methods that detect multiple nuclei, here only one nucleus needs to be located, and the alignment is extended around it. The matrices used are specific to Alu, derived from 5000 Alu sequences.

If no Alu is detected by this alignment, the entire alignment is scanned for a small subregion having reasonable identity. If one is present, the aligning sequence is subjected to probabilistic scanning, in which a position weight matrix (PWM) derived from an alignment of 5000 Alu sequences is applied with an overlapping window of 32 over both the matrix and the sequence, treating each position as a possible start position in the matrix as well as in the sequence. The score is compared with a random score obtained from a randomized matrix of the same dimensions and composition, and evaluated against a threshold value for identification as an Alu. Probabilistic approaches work well when sequences are not very close, which for Alus is the case when they are old and highly diverged.

The Alu repeats found are reported in both directions, each through a link. Clicking a link gives tabulated results with the start and end positions of each Alu found in that region.
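The seeding step described above can be sketched roughly as follows. This is a minimal illustration, not the AF1 implementation: the word size, the merging of overlapping windows, the function names and the toy prototype string are all assumptions made for the example; only the idea of a word library built from the prototype and the flanking-region extraction come from the text.

```python
from typing import Dict, List, Tuple


def build_word_library(prototype: str, word_size: int) -> Dict[str, List[int]]:
    """Map every overlapping word of the prototype to its start offsets."""
    library: Dict[str, List[int]] = {}
    for i in range(len(prototype) - word_size + 1):
        library.setdefault(prototype[i:i + word_size], []).append(i)
    return library


def find_seed_windows(
    query: str,
    library: Dict[str, List[int]],
    word_size: int,
    flank: int = 300,
) -> List[Tuple[int, int]]:
    """Return (start, end) query windows around exact word hits.

    Each hit is extended by `flank` bases on both sides (clipped to the
    query). Overlapping windows are merged, which is an assumption of
    this sketch rather than something stated in the text.
    """
    windows: List[Tuple[int, int]] = []
    for pos in range(len(query) - word_size + 1):
        if query[pos:pos + word_size] in library:
            start = max(0, pos - flank)
            end = min(len(query), pos + word_size + flank)
            if windows and start <= windows[-1][1]:
                # Extend the previous window instead of adding a new one.
                windows[-1] = (windows[-1][0], max(windows[-1][1], end))
            else:
                windows.append((start, end))
    return windows


# Toy data: a made-up "prototype", not a real Alu consensus.
toy_prototype = "ACGTTGCAAGGCT"
toy_library = build_word_library(toy_prototype, word_size=8)
toy_query = "T" * 10 + "GTTGCAAG" + "T" * 10
toy_windows = find_seed_windows(toy_query, toy_library, word_size=8, flank=3)
```

Only the returned windows would then be passed on to the alignment stage, so the cost of alignment depends on the number of seed hits rather than on the full query length.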

Figure 1

AF‐1 working principle
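The windowed probabilistic scan can be illustrated with a small sketch. It assumes a log-odds matrix stored as one dictionary per column, scores a fixed-length window at every pair of matrix and sequence start positions, and compares the best score with that of a column-shuffled matrix. Column shuffling as the randomization, the zero score for unknown bases and all function names are assumptions of this example; the text states only that the randomized matrix has the same dimensions and composition.

```python
import math
import random
from typing import Dict, List

Column = Dict[str, float]  # log-odds score per base for one matrix position


def window_score(pwm: List[Column], m_start: int,
                 seq: str, s_start: int, window: int) -> float:
    """Score seq[s_start:s_start+window] against pwm[m_start:m_start+window]."""
    # Bases missing from a column (e.g. N) contribute 0.0 (an assumption).
    return sum(pwm[m_start + k].get(seq[s_start + k], 0.0)
               for k in range(window))


def best_window_score(pwm: List[Column], seq: str, window: int = 32) -> float:
    """Best score over every (matrix start, sequence start) pair.

    Returns -inf if the matrix or the sequence is shorter than the window.
    """
    best = -math.inf
    for m_start in range(len(pwm) - window + 1):
        for s_start in range(len(seq) - window + 1):
            best = max(best, window_score(pwm, m_start, seq, s_start, window))
    return best


def shuffled_matrix(pwm: List[Column], rng: random.Random) -> List[Column]:
    """Randomized matrix with the same dimensions and the same columns."""
    columns = list(pwm)
    rng.shuffle(columns)
    return columns


def score_margin(pwm: List[Column], seq: str,
                 window: int = 32, seed: int = 0) -> float:
    """Best real score minus best score from the shuffled matrix.

    Assumes both the matrix and the sequence are at least `window` long.
    """
    real = best_window_score(pwm, seq, window)
    rand = best_window_score(shuffled_matrix(pwm, random.Random(seed)),
                             seq, window)
    return real - rand
```

A caller would compare `score_margin` with a threshold chosen on validation data; the threshold used by AF1 is not given in the text, so none is suggested here.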

The last stage is classification, in which the query is converted into an encoded sequence via alignment with the Alu Sx prototype. The same is done for all known Alu subfamilies. Finally, the encoded query is aligned to the encoded subfamily library, where only diagnostic positions are allowed to guide the alignment and determine the classification. The classification option runs automatically once the first step of Alu identification is complete. The output of the classification step is the start of the region, the end of the region, the classified subfamily and the sequence. The entire server has been implemented in Tomcat with JSP, while the core programs have been written in C++, Python and Perl. Details, comparisons and the algorithm of the program are available on the server page. The program achieved sensitivity and specificity above 0.9 when validated against experimental data from various sources. These data have also been made available on the server.
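The idea of letting only diagnostic positions decide the class can be sketched as below. The sketch assumes that the query and every subfamily have already been encoded onto the same prototype coordinates, so that all strings have equal length. It treats a position as diagnostic when the subfamily encodings disagree there and gives it weight 1, with weight 0 elsewhere. The 0/1 weighting, the function names and the subfamily labels `famA`, `famB` and `famC` are hypothetical; the actual encoding and weights used by AF1 are described on the server page, not here.

```python
from typing import Dict, List


def diagnostic_weights(subfamilies: Dict[str, str]) -> List[float]:
    """Weight 1.0 where subfamily encodings disagree, 0.0 where all agree."""
    encodings = list(subfamilies.values())
    length = len(encodings[0])
    return [1.0 if len({enc[i] for enc in encodings}) > 1 else 0.0
            for i in range(length)]


def weighted_identity(query: str, encoding: str, weights: List[float]) -> float:
    """Sum of the weights at positions where query and encoding agree."""
    return sum(w for q, s, w in zip(query, encoding, weights) if q == s)


def classify(query: str, subfamilies: Dict[str, str],
             weights: List[float]) -> str:
    """Return the subfamily with the highest weighted identity to the query."""
    return max(subfamilies,
               key=lambda name: weighted_identity(query, subfamilies[name],
                                                  weights))


# Toy encodings on a shared 8-position coordinate system (made up).
toy_subfamilies = {
    "famA": "ACGTACGT",
    "famB": "ACGTTCGT",
    "famC": "ACCTACGA",
}
toy_weights = diagnostic_weights(toy_subfamilies)
# The query differs from every subfamily at several non-diagnostic
# positions, but these carry zero weight and do not affect the call.
toy_call = classify("TTGTTGGT", toy_subfamilies, toy_weights)
```

With equal weighting, random mutations at non-diagnostic positions would add noise to every comparison; zeroing them out leaves only the positions that actually separate the subfamilies.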

Caveats and future development

The web server version limits the query size, since very large queries take considerable time to process. This is a limitation of our server, which we are trying to address by converting our code for parallel computing and running the server on a 64-node cluster. A possible issue is the time taken by the classification step; this part too could be made faster in future. At present, the standalone version of the software is available in the download section, and users can easily install it on their own systems. More complex models can be incorporated for probabilistic scanning in future to get better results for highly weathered elements, and the entire methodology can be extended to other transposons or retroelements. Work on the server will continue in order to keep it up to date and refined.

8 in total

1. Origin and phylogenetic distribution of Alu DNA repeats: irreversible events in the evolution of primates.

Authors: H Hamdi; H Nishio; R Zielinski; A Dugaiczyk
Journal: J Mol Biol Date: 1999-06-18 Impact factor: 5.469

2. Differential binding of human nuclear proteins to Alu subfamilies.

Authors: N V Tomilin; V M Bozhkov; E M Bradbury; C W Schmid
Journal: Nucleic Acids Res Date: 1992-06-25 Impact factor: 16.971

3. Repbase Update, a database of eukaryotic repetitive elements. (Review)

Authors: J Jurka; V V Kapitonov; A Pavlicek; P Klonowski; O Kohany; J Walichiewicz
Journal: Cytogenet Genome Res Date: 2005 Impact factor: 1.636

4. Non-random genomic divergence in repetitive sequences of human and chimpanzee in genes of different functional categories.

Authors: Ravi Shankar; Amit Chaurasia; Biswaroop Ghosh; Dmitry Chekmenev; Evgeny Cheremushkin; Alexander Kel; Mitali Mukerji
Journal: Mol Genet Genomics Date: 2007-03-09 Impact factor: 3.291

5. CENSOR--a program for identification and elimination of repetitive elements from DNA sequences.

Authors: J Jurka; P Klonowski; V Dagman; P Pelton
Journal: Comput Chem Date: 1996-03

6. Alu sequences in the coding regions of mRNA: a source of protein variability.

Authors: W Makałowski; G A Mitchell; D Labuda
Journal: Trends Genet Date: 1994-06 Impact factor: 11.639

7. Alu repeats and human disease. (Review)

Authors: P L Deininger; M A Batzer
Journal: Mol Genet Metab Date: 1999-07 Impact factor: 4.797

8. Evolution and distribution of RNA polymerase II regulatory sites from RNA polymerase III dependant mobile Alu elements.

Authors: Ravi Shankar; Deepak Grover; Samir K Brahmachari; Mitali Mukerji
Journal: BMC Evol Biol Date: 2004-10-04 Impact factor: 3.260
