MouseIndelDB: a database integrating genomic indel polymorphisms that distinguish mouse strains.

Keiko Akagi, Robert M Stephens, Jingfeng Li, Evgenji Evdokimov, Michael R Kuehn, Natalia Volfovsky, David E Symer.

Abstract

MouseIndelDB is an integrated database resource containing thousands of previously unreported mouse genomic indel (insertion and deletion) polymorphisms ranging from approximately 100 nt to 10 Kb in size. The database currently includes polymorphisms identified from our alignment of 26 million whole-genome shotgun sequence traces from four laboratory mouse strains mapped against the reference C57BL/6J genome using GMAP. They can be queried on a local level by chromosomal coordinates, nearby gene names or other genomic feature identifiers, or in bulk format using categories including mouse strain(s), class of polymorphism(s) and chromosome number. The results of such queries are presented either as a custom track on the UCSC mouse genome browser or in tabular format. We anticipate that the MouseIndelDB database will be widely useful for research in mammalian genetics, genomics, and evolutionary biology. Access to the MouseIndelDB database is freely available at: http://variation.osu.edu/.

Year:  2009        PMID: 19933259      PMCID: PMC2808983          DOI: 10.1093/nar/gkp1046

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

An ultimate goal of genetics research is to link phenotypic differences with different genomic variants, and vice versa. Hundreds of distinct mouse strains are characterized by wide-ranging functional differences. This extensive phenotypic variation has helped to make the mouse a premier model organism, mimicking many aspects of human diversity and diseases. Understanding the genomic differences that distinguish different mouse strains and species will improve the usefulness of different mouse lineages as model organisms, facilitate further evolutionary analysis of ancestral relationships for mouse species and strains and shed new light on the genetic basis for variation among human individuals and in human diseases (1,2). Recently, much attention has been given to the types of variation that exist within or between mammalian species (3–5), particularly short variations such as single nucleotide polymorphisms (SNPs) (6,7). Identification and analysis of such variants has been accomplished by many groups, as exemplified by the HapMap project compiling human data (8). These studies have helped to facilitate the recent discovery of genes associated with certain diseases by genome-wide linkage analyses. In addition to SNPs, insertion/deletion (indel) polymorphisms are another important form of variation (9–15). Indels are comprised of blocks of nucleotides that are present in one individual, strain or lineage, but absent at the orthologous locus in another. In addition to being useful in genotyping studies, indel polymorphisms can have direct functional consequences. As they are longer than SNPs, and may introduce or alter promoters, terminators, alternative splice sites and/or other determinants of transcriptional variation (16–19), indel polymorphisms could contribute significantly both to differences in gene structure and expression, and to various disease processes. 
In addition to indel polymorphisms, other important forms of structural variation, including copy number variants and polymorphic segmental duplications, also have been studied extensively (5,20–22). A rich potential source of information about genomic variation exists in unassembled, conventional whole-genome shotgun (WGS) sequence traces obtained from different individuals within or between species. Recently, such traces have been used to identify human SNPs (23,24) and simple tandem repeat (STR) and short indel polymorphisms (10,11,25), as tools to identify such polymorphisms from sequence traces have been developed (26). To identify intermediate length (101–10 000 nt) indels distinguishing between mouse lineages, we recently aligned ∼26 million WGS traces from four unassembled mouse strains to the C57BL/6J reference genome assembly (19,27). Most mouse indels in this intermediate length range are made up of repetitive elements. An overwhelming majority of these polymorphisms appears to have resulted from endogenous retrotransposon integration events (19), a pattern clearly distinct from that of human indels (12,25,28,29). Several genome browsers and databases now provide data on SNPs, STRs and other forms of variation (23,24,30–33). These browsers are mostly focused on human variants, although resources for other species including mouse have been developed (34). Other databases tabulate forms of structural variation that distinguish human individuals or populations, including polymorphic transposon integrants and other indels in humans, but in some cases lack contextual information about neighboring genomic features (25,35,36). By contrast, MouseIndelDB is an integrated searchable database that presents high-resolution information about indel polymorphisms that distinguish inbred mouse strains.
Through their presentation as a custom track on the UCSC mouse genome browser, these mouse indel data now can be visualized easily in the context of many other important and regularly updated genomic features including annotated genes and expressed sequence tags, CpG islands, other variants, including SNP polymorphisms, and conserved regions (37). These data are freely available for user-initiated queries, either focused upon local features or in categories according to mouse strain, class of indel polymorphism, and chromosome number, at: http://variation.osu.edu/. Included in this report is an example of indel polymorphism data found in MouseIndelDB that was used to screen for a nearby, linked recombinant gene trap cassette, thereby illustrating how a new genotyping assay to distinguish between inbred mouse strains can be developed using MouseIndelDB.

DATABASE DEVELOPMENT

Data sources and software

Conventional WGS sequence traces from four unassembled mouse strains (A/J, DBA/2J, 129S1/SvImJ and 129X1/SvJ), generated at Celera, were downloaded from the National Center for Biotechnology Information (NCBI) trace archive database (4,19). After removing sequence traces containing <300 bases of quality >Q20, ∼26 million raw traces with an average length of 800 nt remained. Thus, a total of ∼18 billion nucleotides were available for alignment to the C57BL/6J (C57) reference genome. Genomic sequences for the C57 reference mouse genome (release 36.1/mm8, Mar. 2006) were downloaded from NCBI (27). MySQL v5.05 was used for all relational tables. RepeatMasker output from the mouse reference genome was downloaded from the UCSC website, and RepeatMasker Open-3.0 was downloaded from http://www.repeatmasker.org/ (38).
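
The quality filter described above (removing traces with fewer than 300 bases of quality above Q20) can be illustrated with a minimal Python sketch. The function names and the representation of a trace as a list of Phred scores are assumptions made for illustration; the original filtering code is not described in the text.

```python
from typing import Sequence

MIN_QUALITY = 20      # a base counts only if its Phred score exceeds Q20 (from the text)
MIN_GOOD_BASES = 300  # traces with fewer such bases are removed (from the text)


def count_good_bases(qualities: Sequence[int], min_quality: int = MIN_QUALITY) -> int:
    """Count bases whose Phred score is strictly greater than min_quality."""
    return sum(1 for q in qualities if q > min_quality)


def keep_trace(qualities: Sequence[int]) -> bool:
    """Return True if the trace has at least MIN_GOOD_BASES high-quality bases."""
    return count_good_bases(qualities) >= MIN_GOOD_BASES
```

A trace is kept only when at least 300 of its per-base scores exceed Q20; bases at exactly Q20 do not count under this reading of the threshold.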

Sequence trace alignments

We previously aligned inbred laboratory mouse WGS sequence traces to the mm8 mouse reference genome (Supplementary Figure S1) using GMAP (39) and in some cases Blat as described below (40). A custom Perl script was used to categorize them (Supplementary Figure S2) (19). Further details are available from the authors upon request. Weakly aligned traces were set aside, including those with shorter anchor alignment lengths or lower identities (Supplementary Figure S2). Our analysis resulted in the identification of two distinct categories of indel polymorphisms. In the first group, the aligned sequence traces identified polymorphic insertions that are present in the reference C57 genome but absent from at least one of the unassembled strains’ genomes (Supplementary Figures S1 and S2) (19). In this category, the WGS traces aligned to the reference genome with >90% identity and >200 nt anchoring sequence at each end (both 5′ and 3′), where the inserted sequence length is of intermediate size, i.e. between ∼100 nt and 10 Kb (19). The second group of WGS sequence traces (∼8% of the total) aligned well, but only at one end. We found that a large majority of these traces contain poor quality sequences at the unaligned end. A small number of traces that align well only at one end identify a polymorphic insertion present in the unassembled genome, but absent from the C57 reference genome. These sequence traces were filtered into strong and weak alignment groups based on their alignment scores and other criteria (‘polymorphism in strain X’, Supplementary Figures S1 and S2). Since we previously found that most polymorphic integrants present in the C57 reference genome are caused by endogenous retrotransposition by LINE (L1), SINE and ERV-K LTR retrotransposons, with L1 variants found most frequently (19), we used RepeatMasker (38) to identify mouse L1 retrotransposon sequences within such sequence traces (Supplementary Figure S2). 
This approach is comparable to a recently published strategy (41). Repeat sequences contained within the traces were then masked, and the remaining nonrepetitive sequences were re-aligned to the reference genome using Blat (40). The resulting alignments were used to categorize and map portions of polymorphic L1 integrants present in the unassembled strains but absent from the reference genome at orthologous loci. Information about each trace in these two groups, including its categorization and mapped chromosomal coordinates (mm8), was loaded into relational databases (MySQL v5.05). We used the UCSC ‘liftOver’ tool to map these indel traces to the mm9 mouse reference genome (42).
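
The categorization criteria stated in this section (>90% identity, >200 nt anchoring sequence at each end, and an intervening reference segment of 101 nt to 10 Kb) can be sketched as a small Python function. This is not the authors' Perl script: the `TraceAlignment` fields, the category labels, and the decision to apply the same anchor threshold to traces anchored at only one end are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_IDENTITY = 90.0  # percent identity must exceed this value (from the text)
MIN_ANCHOR = 200     # nt anchored at each end must exceed this value (from the text)
MIN_INSERT = 101     # lower bound of the intermediate size range (from the text)
MAX_INSERT = 10_000  # upper bound of the intermediate size range (from the text)


@dataclass(frozen=True)
class TraceAlignment:
    """Hypothetical summary of one trace aligned to the reference genome."""
    identity: float    # percent identity over the aligned portions
    left_anchor: int   # nt aligned at the 5' end of the trace
    right_anchor: int  # nt aligned at the 3' end of the trace
    ref_gap: int       # reference nt skipped between the two anchors


def classify(aln: TraceAlignment) -> str:
    """Assign an illustrative category label to an aligned trace."""
    if aln.identity <= MIN_IDENTITY:
        return "weak"
    left_ok = aln.left_anchor > MIN_ANCHOR
    right_ok = aln.right_anchor > MIN_ANCHOR
    if left_ok and right_ok:
        # Both ends anchored: the skipped reference segment is a candidate
        # insertion present in the reference but absent from the strain.
        if MIN_INSERT <= aln.ref_gap <= MAX_INSERT:
            return "ref_in"
        return "other"
    if left_ok != right_ok:
        # Anchored at one end only: a candidate for the RepeatMasker and
        # Blat follow-up described above (assumed threshold).
        return "one_end_candidate"
    return "weak"
```

In this sketch a one-end candidate is only a starting point; as the text notes, most such traces have poor-quality sequence at the unaligned end, so further filtering would still be required.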

DATABASE CONTENT AND WEB INTERFACE

Overview of MouseIndelDB content

A total of 12 951 unique insertions between 101 nt and 10 Kb were identified that are present in the C57 reference genome but absent from at least one of the other four mouse strains studied (Table 1). Most of these reference genome insertions are repetitive elements, particularly retrotransposon integrants (19), while the rest are simple repeats. In many cases, individual insertional polymorphisms were identified by more than one aligned WGS sequence trace, so they were clustered into unique integrants (19). An additional 9193 previously unreported L1 retrotransposon insertions, present in at least one of the four unassembled mouse strain genomes but absent from the C57 reference genome, have been incorporated into the MouseIndelDB database. In total, the indels currently in MouseIndelDB were identified by 37 500 WGS sequence traces (Table 1).

Table 1. Mapped sequence traces and unique polymorphic loci currently in MouseIndelDB

Genotype              RepeatMasker   No. of loci  No. of traces
Insert in alt-strain  LINE           9193         14 025
Insert in C57 ref.    LINE           5564         9394
Insert in C57 ref.    SINE           2912         6193
Insert in C57 ref.    LTR            3314         6363
Insert in C57 ref.    Simple repeat  1161         1525
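
The clustering of multiple traces into unique loci mentioned above can be illustrated with a standard interval-merging sketch in Python. The actual clustering rule used for MouseIndelDB is not specified in the text; merging any overlapping or abutting intervals on the same chromosome is an assumption, as are the function name and the tuple layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]    # (chromosome, start, end)
Locus = Tuple[str, int, int, int]  # (chromosome, start, end, number of traces)


def cluster_traces(intervals: Iterable[Interval]) -> List[Locus]:
    """Merge overlapping or abutting intervals per chromosome into loci."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    loci: List[Locus] = []
    for chrom in sorted(by_chrom):
        merged: List[List[int]] = []  # each entry is [start, end, trace_count]
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Extend the current locus and count the supporting trace.
                merged[-1][1] = max(merged[-1][1], end)
                merged[-1][2] += 1
            else:
                merged.append([start, end, 1])
        loci.extend((chrom, s, e, n) for s, e, n in merged)
    return loci
```

Keeping the per-locus trace count mirrors the two columns of Table 1, where the number of traces exceeds the number of loci in every row.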

User queries

Users can query the mouse indel polymorphisms in MouseIndelDB using two query modes available on the home page at http://variation.osu.edu/ (Figure 1). They can either focus on local features of interest or search the database by category, according to mouse strain(s), class of indel polymorphism(s) and chromosome number. For local feature queries, users can enter a GenBank accession number, gene symbol or chromosomal coordinates in the format ‘chr:start-end’. The maximal range for chromosomal coordinates is 5 Mb. Examples of these inputs are provided on the home page (Figure 1). Users can choose to display outputs via a custom track at the UCSC mouse genome browser, or in a table (see below). A choice is provided to search for polymorphisms mapped to the mm8 mouse genome assembly or to the more recent mm9 assembly (July 2007). For category queries, users can choose one or more of the mouse strains 129S1/SvImJ, 129X1/SvJ, A/J and DBA/2J, one or more of the polymorphic elements, including L1 retrotransposons, SINEs, LTR retrotransposons and simple repeats, and a chromosome number. These category searches result in tabular output.
Figure 1.

MouseIndelDB user interface. Two query modes are available at the home page of the MouseIndelDB web interface. Users can alternatively focus upon local features of interest, or search the database in categories according to mouse strain(s), class of indel polymorphism(s) and chromosome number. Default entries and other examples of optional user inputs are provided.
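
The ‘chr:start-end’ coordinate format and the 5-megabase limit on local feature queries can be illustrated with a short Python validation sketch. The server-side implementation is not described in the text, so the function name, the accepted chromosome names and the error messages below are assumptions.

```python
import re
from typing import Tuple

MAX_SPAN = 5_000_000  # maximal queryable range stated in the text (5 megabases)

# Assumed pattern: 'chr' plus a number or X/Y/M, then start-end as plain digits.
_REGION = re.compile(r"^(chr(?:[0-9]+|X|Y|M)):([0-9]+)-([0-9]+)$")


def parse_region(text: str) -> Tuple[str, int, int]:
    """Parse 'chr:start-end' into (chromosome, start, end) or raise ValueError."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"expected 'chr:start-end', got {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    if end - start > MAX_SPAN:
        raise ValueError("range exceeds the 5-megabase limit")
    return chrom, start, end
```

For example, `parse_region("chr1:1000000-2000000")` yields a chromosome name and two integer coordinates, while a 10-megabase span or a malformed string raises `ValueError`.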

Custom track at UCSC mouse genome browser

We implemented a custom track at the UCSC mouse genome browser (43) to display content of the MouseIndelDB database in the context of other annotated genomic features presented alternatively according to the mm8 and mm9 reference mouse genome assemblies. In each case, a temporary Browser Extensible Data (BED) file containing indel polymorphisms up to 500 Kb upstream and 500 Kb downstream of a specified chromosomal locus is uploaded to the UCSC genome browser website. A screen-shot of the MouseIndelDB custom track on the UCSC browser is presented in Figure 2. Examples of intermediate-sized indel variants (100 nt–10 Kb) are presented here, including three WGS sequence traces from DBA/2J mice that skip over a single polymorphic SINE retrotransposon present in the reference genome but absent from DBA, while a nearby mapped sequence trace from the 129X1 strain indicates an insertional polymorphism present in that strain but absent from the reference C57 genome. Polymorphisms are color-coded, as red indels indicate integrants present in the reference genome (Ref-IN), while blue indels indicate those present in an alternative strain (Alt-IN). In cases where a polymorphic integrant is present in an alternative strain (Alt-IN, Figure 2), we also added a 50-nt thin projection on one side or the other of anchored sequence traces to indicate its genomic junction and relative position. Following conventions established on the host browser, users can also click on each feature for additional information including primary sequences and WGS trace alignment information, and can scale the chromosomal region displayed up to a limit of 500 Kb upstream and 500 Kb downstream of the original locus while depicting all indels in this region.
Figure 2.

MouseIndelDB custom track at UCSC mouse genome browser. Polymorphic SINE and L1 retrotransposons at a region of mouse chromosome 1 are shown on the MouseIndelDB custom track displayed at the UCSC mouse genome browser (mm9 assembly). Three WGS sequence traces from DBA/2J mice each skip over a single polymorphic SINE retrotransposon, present in the reference (designated as REF-IN) genome but absent from DBA (top, labeled as REF-IN, red rectangles, trace IDs 1090731691, 1101700443 and 109988639). The reference SINE element can be seen on the conventional RepeatMasker track (bottom). In addition, a nearby mapped sequence trace from the 129X1 strain indicates an insertional polymorphism present in that alternative (ALT-IN) strain but absent from the reference C57 genome (blue rectangle, labeled as ALT-IN, trace ID 1073424193). In the latter case, a 50-nt thin projection (thin blue rectangle, right) indicates the genomic junction and relative position of the previously unreported L1 retrotransposon integrant in the 129X1 genome. Superimposed arrows indicate the direction of sequence trace alignments compared with the reference genome.
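
The custom track described in this section (color-coded features, a 50-nt thin projection for Alt-IN integrants, and a window of 500 Kb on either side of the queried locus) can be sketched as BED output in Python. The nine BED columns and the `itemRgb="On"` track attribute are standard UCSC conventions, but the way MouseIndelDB actually encodes its features is not given in the text: the `Indel` class, the track name, the exact RGB values and the use of thickStart/thickEnd to draw the thin projection are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable

PROJECTION = 50  # length of the thin projection for Alt-IN features (from the text)
FLANK = 500_000  # window on each side of the queried locus (from the text)
COLORS = {"REF-IN": "255,0,0", "ALT-IN": "0,0,255"}  # red and blue (assumed RGB values)


@dataclass(frozen=True)
class Indel:
    """Hypothetical record for one anchored trace alignment (0-based, half-open)."""
    chrom: str
    start: int
    end: int
    name: str
    strand: str
    kind: str                     # "REF-IN" or "ALT-IN"
    junction_side: str = "right"  # side of the projection, used for ALT-IN only


def bed_line(indel: Indel) -> str:
    """Format one indel as a 9-column BED line with an itemRgb color."""
    chrom_start, chrom_end = indel.start, indel.end
    if indel.kind == "ALT-IN":
        # Extend the feature by the projection; the thick part stays on the trace.
        if indel.junction_side == "right":
            chrom_end += PROJECTION
        else:
            chrom_start = max(0, chrom_start - PROJECTION)
    fields = [indel.chrom, chrom_start, chrom_end, indel.name, 0, indel.strand,
              indel.start, indel.end, COLORS[indel.kind]]
    return "\t".join(str(f) for f in fields)


def bed_track(indels: Iterable[Indel], chrom: str, center: int) -> str:
    """Build custom-track text for indels within FLANK of the given locus."""
    lo, hi = max(0, center - FLANK), center + FLANK
    lines = ['track name="MouseIndelDB_sketch" '
             'description="Indel polymorphisms (illustrative)" itemRgb="On"']
    for indel in indels:
        if indel.chrom == chrom and indel.end > lo and indel.start < hi:
            lines.append(bed_line(indel))
    return "\n".join(lines) + "\n"
```

In BED, the region between thickStart and thickEnd is drawn thick and the remainder of the feature thin, which is one plausible way to render the projection shown in Figure 2.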

Tabular display

Based on chromosomal coordinates entered by the user, a list of polymorphisms can be retrieved from the MouseIndelDB database. The range of chromosomal coordinates that can be queried is limited to 5 Mb. Each aligned sequence trace is linked to the sequence trace ID, providing additional alignment data and a link to the indel polymorphism custom track at UCSC.
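
A coordinate-based retrieval of this kind can be sketched as a range query over a relational table. MouseIndelDB uses MySQL, and its schema is not described in the text; the sketch below substitutes Python's built-in `sqlite3` module so that it is self-contained, and the table name, column names and demo rows are all hypothetical.

```python
import sqlite3
from typing import List, Optional, Sequence, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with made-up rows (not real MouseIndelDB data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE indel (trace_id TEXT, strain TEXT, repeat_class TEXT, "
        "genotype TEXT, chrom TEXT, chrom_start INTEGER, chrom_end INTEGER)"
    )
    conn.executemany(
        "INSERT INTO indel VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("demo-1", "DBA/2J", "SINE", "REF-IN", "chr1", 1000, 1300),
            ("demo-2", "129X1/SvJ", "LINE", "ALT-IN", "chr1", 2500, 3300),
            ("demo-3", "A/J", "LTR", "REF-IN", "chr2", 1000, 7000),
        ],
    )
    return conn


def query_region(conn: sqlite3.Connection, chrom: str, start: int, end: int,
                 strains: Optional[Sequence[str]] = None) -> List[Tuple]:
    """Return rows overlapping chrom:start-end, optionally limited to some strains."""
    sql = ("SELECT trace_id, strain, repeat_class, genotype, chrom, "
           "chrom_start, chrom_end FROM indel "
           "WHERE chrom = ? AND chrom_end > ? AND chrom_start < ?")
    params: list = [chrom, start, end]
    if strains:
        # One placeholder per strain keeps the query fully parameterized.
        sql += " AND strain IN (%s)" % ",".join("?" * len(strains))
        params.extend(strains)
    sql += " ORDER BY chrom_start"
    return conn.execute(sql, params).fetchall()
```

The same overlap condition, combined with filters on strain and repeat class, would also cover the category queries, although how the production database indexes these columns is not stated.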

DISCUSSION AND FUTURE PLANS

Our goal in developing the MouseIndelDB database and web interface has been to identify and provide detailed access to tens of thousands of indel polymorphisms that distinguish mouse strains. The data can be queried either according to local features or in a bulk, category mode. The resulting data have been linked to a custom track at the UCSC mouse genome browser, facilitating the visualization of previously unreported indel polymorphisms in the context of other annotated features available with the mm8 and mm9 reference mouse genome assemblies. Resulting data can be downloaded in tabular format, and large data sets will be made available to users upon request. In developing this database, we focused on mouse strains and subspecies, since to our knowledge no integrated indel polymorphism database has been described previously for mouse strains, and since millions of high-quality WGS sequence traces are available for alignment to the reference C57 genome. As several hundred distinct mouse strains have many distinct phenotypes including behavioral differences, predisposition to many different diseases and cancers, and other quantifiable characteristics (1), we expect that MouseIndelDB will prove useful in genetic and evolutionary studies addressing various forms of variation (including but not merely limited to SNPs) within and between the strains. To highlight how the MouseIndelDB database can be queried for indel polymorphisms near a local feature of interest, we studied variants closely linked to a transgene insertion in the Sumo1 locus (Supplementary Data and Supplementary Figure S3). We and others currently are generating additional mouse genome sequence data from other currently unsequenced strains and murine species. We plan to update MouseIndelDB frequently to include more information about various forms of polymorphisms as they become available. 
In particular, we plan to add more information about STR polymorphisms distinguishing between mouse lineages as it becomes available. In addition, we now are studying how some classes of indel polymorphisms are related to transcriptional variation in different strains, tissues, developmental time points, etc. Resulting novel fusion transcript data also will be incorporated together with these genomic variants in additional tracks and data available via future versions of MouseIndelDB. Using information from the Mouse Genome Database and related databases with phenotypic information (31), we plan to identify those genes to which strain-specific phenotypes have been mapped, to facilitate correlations between the various types of genomic polymorphisms available in MouseIndelDB and such variable phenotypes. Through a merging method similar to that used to consolidate overlapping indel traces (Supplementary Figures S1 and S2), we also plan to flag those short indels, SNPs and other genomic variants represented in multiple WGS sequence traces to add an evidence statistic to them. We also plan to make our polymorphism data collection available directly through on-demand tracks at the UCSC mouse genome browser (http://genome.ucsc.edu/) and through the Mouse Genome Database website at the Jackson Laboratory (http://www.informatics.jax.org/) (34).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

The National Cancer Institute, National Institutes of Health (under contract N01-CO-12400); the Intramural Research Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health; and The Ohio State University Comprehensive Cancer Center. Funding for open access charge: The Ohio State University Comprehensive Cancer Center. Conflict of interest statement. None declared.