
INDELSCAN: a web server for comparative identification of species-specific and non-species-specific insertion/deletion events.

Feng-Chi Chen, Chueng-Jong Chen, Trees-Juen Chuang.

Abstract

Insertion and deletion (indel) events usually have dramatic effects on genome structure and gene function. Species-specific indels have been demonstrated to be associated with species-unique traits. Currently, indel identification relies mainly on pair-wise sequence alignments (the 'pair-wise indels'), which can neither discriminate species specificity nor distinguish insertions from deletions. Also, there is no freely accessible web server for genome-wide identification of indels. Therefore, we have developed a web server, INDELSCAN, to identify four types of indels using multiple sequence alignments that include sequences from one target, one subject and ≥1 out-group species. The four types of indels identified encompass target species-specific, subject species-specific, non-species-specific and target-subject pair-wise indels. Insertions and deletions are discriminated with reference to out-group sequences. The genomic locations (5'UTR, intron, CDS, 3'UTR and intergenic region) of these indels are also provided for functional analysis. INDELSCAN provides genomic sequences and gene annotations from a wide spectrum of taxa for users to select from, including nine target species (human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis familiaris), opossum (Monodelphis domestica), chicken (Gallus gallus), zebrafish (Danio rerio), fly (Drosophila melanogaster) and yeast (Saccharomyces cerevisiae)) and >35 subject/out-group species, ranging from yeasts to mammals. The server also provides analytic figures and supports indel identification from user-uploaded alignments/annotations. INDELSCAN is freely accessible at http://indelscan.genomics.sinica.edu.tw/IndelScan/.


Year: 2007 PMID: 17517762 PMCID: PMC1933116 DOI: 10.1093/nar/gkm350

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Insertions and deletions (indels) represent a major force of genome evolution. It has been shown that indels occur at a surprisingly high frequency and contribute more to sequence divergence than nucleotide substitutions do (1–4) between very closely related species, such as human and chimpanzee (3, 5–7). Indels that occur after speciation (i.e. species-specific indels) can lead to significant changes in phenotype (8–10) in light of their dramatic effects on genome structure and gene function (6). Therefore, genome-wide analysis of species-specific indels and non-species-specific indels (‘NSS’ indels, i.e. indels for which species specificity is not observed) may shed some light on the mechanisms of genome evolution and functional divergence. However, currently no freely accessible web-based server is available for genome-wide identification of such indels. Hence, it is essential to develop a computational platform for this purpose. Here we develop a web server (‘INDELSCAN’) to infer species-specific and NSS indels using multiple sequence alignments from at least three species. The compared species should include one target species, one subject species and at least one out-group species. Note that the selection of out-group species affects the resolution with which INDELSCAN infers insertions and deletions. The differentiation between insertions and deletions is evolutionarily important, and is usually impossible in pair-wise genome comparisons. In general, the out-group species should be more distantly related to both the target and subject species than they are to each other. Yet, an out-group very distant from both compared species may yield minimal resolution in the comparison. The identified target species-specific indels (‘TSS’ indels, i.e. indels that are specific to the target species), which are supported by out-group sequences, should have considerably higher accuracy than indels inferred from target-subject pair-wise comparisons if the subject genome remains a draft. Moreover, by incorporating annotations of the target genome, the web server can display the distributions of indels in different genomic regions [including coding sequence (CDS), untranslated region (UTR), intron and intergenic region]. The genomic sequences and annotations of 9 target species and more than 35 subject/out-group species (see Table 1) are available for the current version of INDELSCAN for analysis. The server also provides the target-subject pair-wise indels for comparison.

Table 1.

Currently available target, subject and out-group species at INDELSCAN (downloaded from the UCSC genome browser at http://hgdownload.cse.ucsc.edu/downloads.html).

Target	Subject/out-group
Human (Homo sapiens)	Chimpanzee (Pan troglodytes), macaque (Macaca mulatta), mouse (Mus musculus), rat (Rattus norvegicus), rabbit (Oryctolagus cuniculus), dog (Canis familiaris), cow (Bos taurus), armadillo (Dasypus novemcinctus), elephant (Loxodonta africana), tenrec (Echinops telfairi), opossum (Monodelphis domestica), chicken (Gallus gallus), frog (Xenopus tropicalis), zebrafish (Danio rerio), tetraodon (Tetraodon nigroviridis) and fugu (Fugu rubripes)
Mouse	Human, chimpanzee, macaque, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum, chicken, frog, zebrafish, tetraodon and fugu
Rat	Human, mouse, dog, cow, opossum, chicken, frog and zebrafish
Dog	Human and mouse
Opossum	Human, mouse, rat, chicken, frog and zebrafish
Chicken	Human, mouse, rat, opossum, frog and zebrafish
Zebrafish	Human, mouse, opossum, tetraodon, fugu, and frog
Fly (Drosophila melanogaster)	Drosophila simulans, Drosophila sechellia, Drosophila yakuba, Drosophila erecta, Drosophila ananassae, Drosophila pseudoobscura, Drosophila persimilis, Drosophila willistoni, Drosophila virilis, Drosophila mojavensis, Drosophila grimshawi, Anopheles gambiae, Apis mellifera and Tribolium castaneum
Yeast (Saccharomyces cerevisiae)	Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces kudriavzevii, Saccharomyces bayanus, Saccharomyces castellii and Saccharomyces kluyveri


MATERIALS AND METHODS

Process flow of INDELSCAN

The system flow of TSS/NSS indel identification and categorization is described below. First, the user can upload multiple sequence alignments of at least three species in the UCSC (University of California, Santa Cruz) multiple alignment format (described at http://genome.ucsc.edu/goldenPath/help/maf.html) and specify the target, subject and out-group species. The user can also select the compared species from the INDELSCAN-provided species list, which is linked to pre-stored genomic sequences downloaded from the UCSC genome browser (http://hgdownload.cse.ucsc.edu/downloads.html). To reduce potential errors, only indels that occur within continuously alignable sequences are considered. Second, overlapping alignments (i.e. one target sequence segment is aligned to two or more genomic sequences in the compared species) are filtered out in the system to eliminate potentially spurious indels. Third, three indel types (also see Figure 1), TSS (Events 1 and 2), subject species-specific (‘SSS’; Events 3 and 4) and NSS (Events 5 and 6) indels, are identified. The TSS indels are classified into TSS insertions and TSS deletions based on comparison with the out-group genomic sequence(s) (described below). Fourth, using the user-input or INDELSCAN-provided annotations of the target genome, the genomic locations (i.e. 5′UTR, intron, CDS, 3′UTR and intergenic region) of the identified indels are determined. Finally, the system adjusts the UCSC pair-wise sequence alignments (described below) and provides pair-wise indels between the target and subject genomes for comparison. Analytic figures/tables that compare the numbers and rates of TSS, SSS, NSS and pair-wise indels are also provided.
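The first two steps (reading MAF blocks and discarding overlapping alignments) can be illustrated with a minimal Python sketch. It is not the server's ASP.NET code. The names `MafRow`, `parse_maf` and `drop_overlapping` and the toy alignment are hypothetical. The sketch assumes that the first "s" row of each block is the target sequence and that two blocks overlap when their target intervals on the same source sequence intersect; both overlapping blocks are then dropped.

```python
from typing import List, NamedTuple


class MafRow(NamedTuple):
    src: str       # source name, e.g. "target.chr1"
    start: int     # zero-based start in the source sequence
    size: int      # number of non-gap bases in this row
    strand: str
    src_size: int
    text: str      # aligned text; "-" marks a gap


def parse_maf(lines) -> List[List[MafRow]]:
    """Collect the "s" rows of every alignment block (each opened by an "a" line)."""
    blocks, current = [], None
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == "a":
            current = []
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            src, start, size, strand, src_size, text = fields[1:7]
            current.append(MafRow(src, int(start), int(size), strand,
                                  int(src_size), text))
    return [block for block in blocks if block]


def drop_overlapping(blocks):
    """Drop every block whose target interval overlaps that of another block."""
    order = sorted(range(len(blocks)),
                   key=lambda i: (blocks[i][0].src, blocks[i][0].start))
    bad = set()
    prev_src, max_end, max_idx = None, -1, None
    for i in order:
        target = blocks[i][0]          # assumption: first row is the target
        if target.src != prev_src:
            prev_src, max_end, max_idx = target.src, -1, None
        if target.start < max_end:     # overlaps the furthest-reaching earlier block
            bad.update((i, max_idx))
        if target.start + target.size > max_end:
            max_end, max_idx = target.start + target.size, i
    return [block for i, block in enumerate(blocks) if i not in bad]


# Hypothetical three-block alignment; blocks 2 and 3 overlap on the target.
EXAMPLE = """##maf version=1
a score=100
s target.chr1 0 10 + 1000 ACGTACGTAC
s subject.chr1 5 10 + 900 ACGTACGTAC
s outgroup.chr1 7 10 + 800 ACGTACGTAC

a score=90
s target.chr1 20 8 + 1000 ACGTACGT
s subject.chr1 30 8 + 900 ACGTACGT
s outgroup.chr1 40 8 + 800 ACGTACGT

a score=80
s target.chr1 24 8 + 1000 ACGTACGT
s subject.chr2 10 8 + 900 ACGTACGT
s outgroup.chr1 60 8 + 800 ACGTACGT
"""

blocks = parse_maf(EXAMPLE.splitlines())
kept = drop_overlapping(blocks)
```

In this toy input only the first block survives, because the second and third blocks both cover target positions 24 to 27.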

Figure 1.

Three types of INDELSCAN-identified indels: TSS indels (Events 1 and 2), SSS indels (Events 3 and 4) and NSS indels (Events 5 and 6), with odd number events representing insertions and even number events, deletions.

Discrimination between TSS insertions and TSS deletions

As stated above (also see Figure 1), the inclusion of out-group genomic sequences enables the system to distinguish between insertions and deletions in TSS indels. Here, a TSS insertion is defined as a DNA segment (or a single base) of the target species genome that is not only absent in the orthologous genomic sequence of the subject species but also absent or partially absent in the out-group species (e.g. Event 1). A TSS deletion is defined in a similar way (e.g. Event 2). If the subject species genome compared is still in the draft stage, then many indels (especially one- or two-bp indels) inferred from target-subject pair-wise sequence alignments may be false positives that result from sequencing or assembly errors. The TSS indels identified by INDELSCAN are therefore more reliable than indels inferred from pair-wise comparisons because the former are supported by the genomic sequences of the out-group species.
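The definitions above can be turned into a small decision rule. The following Python sketch is one possible reading rather than the server's actual implementation, and the function names are hypothetical. It assumes that all rows come from the same alignment block and have equal length, that a deletion call requires the segment to be fully present in every out-group, and that an insertion call requires it to be absent or partially absent in every out-group. Cases in which the out-groups disagree are reported as "unresolved"; NSS events are not handled because the text does not spell out their rule.

```python
import re


def span_state(row, start, end):
    """Describe an aligned row over columns [start, end): present, absent or partial."""
    segment = row[start:end]
    gaps = segment.count("-")
    if gaps == 0:
        return "present"
    return "absent" if gaps == len(segment) else "partial"


def classify_indels(target, subject, outgroups):
    """Label each target-subject gap run by consulting the out-group rows."""
    if not outgroups:
        raise ValueError("at least one out-group row is required")
    cases = (
        # (row carrying the gap, row carrying bases,
        #  label if out-groups lack the segment, label if out-groups carry it)
        (subject, target, "TSS insertion", "SSS deletion"),
        (target, subject, "SSS insertion", "TSS deletion"),
    )
    events = []
    for gapped, other, insertion, deletion in cases:
        for match in re.finditer(r"-+", gapped):
            start, end = match.span()
            if span_state(other, start, end) != "present":
                continue  # the other row is gapped here too: skip
            states = {span_state(row, start, end) for row in outgroups}
            if states == {"present"}:
                label = deletion
            elif "present" not in states:
                label = insertion
            else:
                label = "unresolved"  # out-groups disagree
            events.append((start, end, label))
    return sorted(events)


# Hypothetical toy rows: the target carries TTT that subject and out-group lack.
example = classify_indels("ACGTTTACGA", "ACG---ACGA", ["ACG---ACGA"])
```

Coordinates in the returned tuples are alignment columns; converting them to genomic coordinates would require the start offsets of the rows.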

Update of UCSC multiple sequence alignments of vertebrate genomes

The chimpanzee genome used in the current UCSC multiple alignments of vertebrate genomes is Build 1 Version 1 (or UCSC version panTro1). To include the up-to-date chimpanzee genome (Build 2 Version 1 or UCSC version panTro2), three processes were performed. First, in the UCSC human-genome-based (hg18) multiple alignments, we replaced panTro1 with panTro2 transplanted from the UCSC human–chimpanzee pair-wise alignments (hg18 versus panTro2). Second, we used the MUSCLE package (11,12) to realign the updated sequences. Finally, the new alignments were transformed to the UCSC multiple alignment format.

Justification of pair-wise sequence alignments

In the UCSC alignments, there exist a considerable number of potentially artificial alignment gaps, e.g.

aagcatgcat—gaatcggata
aagcatg—agttgaatcggata

Such alignments might result in false indel inferences. To eliminate these gaps, we realigned and closed neighboring UCSC alignment gap pairs that satisfied both of these conditions: (i) one gap occurs in the subject genome and the other in the target genome and (ii) the distance between these two gaps is not larger than three bases. The adjustment process continues until no gap pairs satisfy the conditions. The alignment shown above is adjusted as follows:

aagcatgcat-gaatcggata
aagcatgagttgaatcggata
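A minimal Python sketch of this gap-closing rule is given here as an illustration only. The function names are hypothetical, and the realignment step is simplified: within the window spanning both gaps, the bases of each row are left-justified and the shorter row is padded with gaps. The gap lengths in the example input (four in the first row, three in the second) are an assumption, chosen so that both rows have equal length and the result matches the adjusted alignment shown above.

```python
import re


def _first_closable_pair(target, subject, max_distance):
    """Return the column window spanning the first target/subject gap pair to close."""
    for t_start, t_end in (m.span() for m in re.finditer(r"-+", target)):
        for s_start, s_end in (m.span() for m in re.finditer(r"-+", subject)):
            # Number of columns strictly between the two gap runs.
            distance = max(t_start, s_start) - min(t_end, s_end)
            if 0 <= distance <= max_distance:
                return min(t_start, s_start), max(t_end, s_end)
    return None


def close_gap_pairs(target, subject, max_distance=3):
    """Repeat the adjustment until no qualifying gap pair remains."""
    if len(target) != len(subject):
        raise ValueError("aligned rows must have equal length")
    while True:
        window = _first_closable_pair(target, subject, max_distance)
        if window is None:
            return target, subject
        lo, hi = window
        t_bases = target[lo:hi].replace("-", "")
        s_bases = subject[lo:hi].replace("-", "")
        # Simplified realignment: left-justify, pad the shorter row with gaps.
        width = max(len(t_bases), len(s_bases))
        target = target[:lo] + t_bases.ljust(width, "-") + target[hi:]
        subject = subject[:lo] + s_bases.ljust(width, "-") + subject[hi:]


adjusted = close_gap_pairs("aagcatgcat----gaatcggata",
                           "aagcatg---agttgaatcggata")
# adjusted == ("aagcatgcat-gaatcggata", "aagcatgagttgaatcggata")
```

Each pass shortens the alignment by at least one column, so the loop always terminates.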

Implementation and run time

Our server is implemented in ASP.NET on the server end and JavaScript on the client end. There are four major steps in this program: scanning the multiple sequence alignments and filtering out overlapping alignments; inferring TSS, SSS and NSS indels from the multiple sequence alignments; determining the genomic locations of the identified indels; and identifying target-subject pair-wise indels. As an example of run time, analysis of indels in human chromosomes 21 and 22 using human as target, chimpanzee as subject and macaque, mouse and rat as out-group species takes approximately 3 min (see Figure 2).
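The third step, assigning each indel to a genomic region, amounts to an interval lookup. The Python sketch below shows one way to do it with a sorted index and binary search. The annotation layout (half-open, non-overlapping `(start, end, label)` tuples on one chromosome), the coordinates and the function names are all hypothetical and do not describe the annotation format that INDELSCAN accepts.

```python
from bisect import bisect_right


def build_index(features):
    """Sort (start, end, label) features and keep their starts for binary search."""
    ordered = sorted(features)
    starts = [feature[0] for feature in ordered]
    return starts, ordered


def locate(position, index):
    """Return the label of the feature containing position, else "intergenic"."""
    starts, ordered = index
    i = bisect_right(starts, position) - 1
    if i >= 0:
        start, end, label = ordered[i]
        if start <= position < end:
            return label
    return "intergenic"


# Hypothetical annotation of a single two-exon gene.
gene_index = build_index([
    (100, 150, "5'UTR"),
    (150, 300, "CDS"),
    (300, 500, "intron"),
    (500, 650, "CDS"),
    (650, 700, "3'UTR"),
])
region = locate(400, gene_index)
```

An indel that spans a region boundary would need a policy of its own (for example, reporting every region it touches), which this sketch leaves out.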

Figure 2.

The INDELSCAN web server. (A) Users can choose from pre-stored genomic sequences for comparison or (B) upload their own multiple sequence alignments and annotations. (C) The server displays the elapsed time and the request link after the job is submitted. (D) When the job is completed, the server gives the links from which the results can be downloaded. (E) Also provided are analytical figures that show the numbers of indels, indel rates and genomic distributions of indels (results for human chromosome 22 are not shown in the figure).

WEB SERVER DESCRIPTION

Input

INDELSCAN supports two schemes for inputting multiple alignments and annotation information in the target genome: users can use the INDELSCAN-provided alignments/annotation (Figure 2A) or upload their own data (Figure 2B). For the former scheme, users can select one target, one subject and at least one out-group species in the system. The web server allows users to choose one out of nine target species, including human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), dog (Canis familiaris), opossum (Monodelphis domestica), chicken (Gallus gallus), zebrafish (Danio rerio), fly (Drosophila melanogaster) and yeast (Saccharomyces cerevisiae). Table 1 lists the available subject and out-group species for each target species. As shown in Figure 2A, users can also choose to perform indel analysis for single or multiple chromosomes or only in user-specified region(s)/gene(s). For the latter scheme, users must upload three files to the system: description of compared species, multiple sequence alignments and target genome annotation (Figure 2B). Note that INDELSCAN by default takes the first sequence in the uploaded multiple alignments as the target-species sequence and the second as the subject-species sequence, whereas the others are regarded as out-group sequences. Also note that only the species specified in the description file will be processed in the server. For each submitted job, INDELSCAN will automatically assign a request link for the user to retrieve the results (Figure 2C).

Output

When the submitted job is completed, users can use the request link to visualize or download their results. Two text outputs (indels inferred from multiple sequence alignments and pair-wise indels) are also downloadable from the system (Figure 2D), including the coordinates of the identified indels in the target and subject genomes, indel lengths, genomic locations of indels and the IDs of indel-affected genes. The result page also shows analytic figures for comparisons of the numbers and rates of TSS, SSS, NSS and pair-wise indels for each user-specified region (Figure 2E). The results of each query will be retained in the system for 48 h.

CONCLUSION

The INDELSCAN web interface identifies TSS, SSS and NSS indels using multiple sequence alignments. So far, such comparisons have remained scarce due to the lack of suitable analysis tools and high-quality genomic sequences. With the rapidly increasing number of available genomes, multi-genome comparisons will soon become the norm. INDELSCAN will therefore be helpful for analysis of species-specific and non-species-specific indels. Moreover, the server also reports the genomic locations of the identified indels, which helps biologists study these indels functionally. Note that the web interface can identify indels on a genome-wide scale or in individual genes or shorter sequences of interest. Individual genes and sporadic sequences are more readily accessible than complete genome sequences. It is relatively easy to compare shorter homologous sequences of a large number of species for inference of species specificity, which would otherwise be limited to only a few compared species in genome-scale comparisons. Such large-number-species comparisons can provide high resolution for inference of species specificity and paths of indel evolution through the phylogenetic tree of the compared species. Currently, the INDELSCAN-provided multiple sequence alignments are downloaded from the UCSC genome browser, which deals with gaps by assigning gaps to branches in the phylogenetic tree using out-group information (13). For multiple sequence alignment tools, measuring the cost of a multiple alignment and choosing gap costs consistent with that measure remain challenging (14–16). Many tools have been proposed and have performed well. For example, CLUSTALW (17) dynamically adjusts the gap penalties in a position- and residue-specific manner. T-Coffee (18) improves accuracy at the expense of efficiency by first building a library of both local and global alignments for each pair of sequences and then using a library-based scoring scheme for progressive alignment. MUSCLE (11,12) and MAFFT (19) enhance both the accuracy and efficiency by initially building a progressive alignment and then horizontally refining the phylogenetic tree to improve the objective score. In our server, indel analyses are not limited to the sequence alignments available from UCSC, but can be performed on alignments produced by any multiple sequence alignment tool. Moreover, INDELSCAN can be applied to the detection of lineage-specific indels, such as primate-, rodent-, mammal-, avian- or fish-specific indels. Results thus obtained may bring functional and evolutionary insights and help focus experimental studies. Finally, analyses of indel events can be combined with other species-specific events, such as nucleotide substitutions, frame-shift mutations (20), duplications (21) and pseudogenizations (22), to further our understanding of the mechanisms of speciation and functional divergence.

22 in total

1. T-Coffee: A novel method for fast and accurate multiple sequence alignment.

Authors: C Notredame; D G Higgins; J Heringa
Journal: J Mol Biol Date: 2000-09-08 Impact factor: 5.469

2. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

3. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Authors: Kazutaka Katoh; Kazuharu Misawa; Kei-ichi Kuma; Takashi Miyata
Journal: Nucleic Acids Res Date: 2002-07-15 Impact factor: 16.971

4. Aligning multiple genomic sequences with the threaded blockset aligner.

Authors: Mathieu Blanchette; W James Kent; Cathy Riemer; Laura Elnitski; Arian F A Smit; Krishna M Roskin; Robert Baertsch; Kate Rosenbloom; Hiram Clawson; Eric D Green; David Haussler; Webb Miller
Journal: Genome Res Date: 2004-04 Impact factor: 9.043

5. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

6. Alu-mediated inactivation of the human CMP- N-acetylneuraminic acid hydroxylase gene.

Authors: T Hayakawa; Y Satta; P Gagneux; A Varki; N Takahata
Journal: Proc Natl Acad Sci U S A Date: 2001-09-18 Impact factor: 11.205

7. Human-specific insertions and deletions inferred from mammalian genome sequences.

Authors: Feng-Chi Chen; Chueng-Jong Chen; Wen-Hsiung Li; Trees-Juen Chuang
Journal: Genome Res Date: 2006-11-09 Impact factor: 9.043

8. Genomic DNA insertions and deletions occur frequently between humans and nonhuman primates.

Authors: Kelly A Frazer; Xiyin Chen; David A Hinds; P V Krishna Pant; Nila Patil; David R Cox
Journal: Genome Res Date: 2003-03 Impact factor: 9.043

9. Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence.

Authors: Tatsuya Anzai; Takashi Shiina; Natsuki Kimura; Kazuyo Yanagiya; Sakae Kohara; Atsuko Shigenari; Tetsushi Yamagata; Jerzy K Kulski; Taeko K Naruse; Yoshifumi Fujimori; Yasuhito Fukuzumi; Masaaki Yamazaki; Hiroyuki Tashiro; Chie Iwamoto; Yumi Umehara; Tadashi Imanishi; Alice Meyer; Kazuho Ikeo; Takashi Gojobori; Seiamak Bahram; Hidetoshi Inoko
Journal: Proc Natl Acad Sci U S A Date: 2003-06-10 Impact factor: 11.205

10. Myosin gene mutation correlates with anatomical changes in the human lineage.

Authors: Hansell H Stedman; Benjamin W Kozyak; Anthony Nelson; Danielle M Thesier; Leonard T Su; David W Low; Charles R Bridges; Joseph B Shrager; Nancy Minugh-Purvis; Marilyn A Mitchell
Journal: Nature Date: 2004-03-25 Impact factor: 49.962
