Ian Sudbery, Jim Stalker, Jared T Simpson, Thomas Keane, Alistair G Rust, Matthew E Hurles, Klaudia Walter, Dee Lynch, Lydia Teboul, Steve D Brown, Heng Li, Zemin Ning, Joseph H Nadeau, Colleen M Croniger, Richard Durbin, David J Adams.
Abstract
Genome sequences are essential tools for comparative and mutational analyses. Here we present the short read sequence of mouse chromosome 17 from the Mus musculus domesticus derived strain A/J, and the Mus musculus castaneus derived strain CAST/Ei. We describe approaches for the accurate identification of nucleotide and structural variation in the genomes of vertebrate experimental organisms, and show how these techniques can be applied to help prioritize candidate genes within quantitative trait loci.
Year: 2009 PMID: 19825173 PMCID: PMC2784327 DOI: 10.1186/gb-2009-10-10-r112
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Sequencing statistics
| | A/J | CAST/Ei |
|---|---|---|
| Number of Illumina lanes | 10 | 10 |
| Total number of bases (bp) | 3,828,787,991 | 5,476,237,877 |
| Total number of reads | 112,046,098 | 173,021,348 |
| % Mapped (in pairs) | 91.9% (87.9%) | 88.3% (80.0%) |
| % Duplicate read pairs | 22.1% | 18.2% |
| Modal insert size (± 1 SD) | 130 (± 24) bp | 118 (± 12) bp |
SD, standard deviation.
Figure 1. Mapping of short read sequence to the mouse genome. The MAQ algorithm was used to map the short read sequences to the NCBI 37 mouse genome assembly. (a) The percentage of reads that map to chromosome 17, to other mouse chromosomes, or not at all to the C57BL/6J reference assembly. (b) The actual average sequence depth over chromosome 17 after duplicate sequence reads have been removed, plotted against the nominal depth, that is, the depth expected if all reads were unique and mapped to chromosome 17. (c) The number of contiguous blocks of sequence, defined as a stretch of sequence where all bases have non-zero sequencing depth over them, plotted against nominal depth (as defined in (b)).
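The two quantities plotted in panels (b) and (c) can be illustrated with a minimal sketch. It assumes per-base depth is available as a plain list of integers; the function names `contiguous_blocks` and `nominal_depth` and the toy depth values are hypothetical and are not taken from the paper's pipeline.

```python
from typing import List, Tuple


def contiguous_blocks(depth: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals in which every base has depth > 0."""
    blocks = []
    start = None  # start of the block currently being extended, if any
    for pos, d in enumerate(depth):
        if d > 0 and start is None:
            start = pos                  # a new covered stretch begins
        elif d == 0 and start is not None:
            blocks.append((start, pos))  # a zero-depth base closes the block
            start = None
    if start is not None:
        blocks.append((start, len(depth)))  # block runs to the end of the sequence
    return blocks


def nominal_depth(n_reads: int, read_length: int, target_length: int) -> float:
    """Depth expected if every read were unique and mapped to the target."""
    return n_reads * read_length / target_length


# Toy per-base depth values (illustrative only, not data from the paper).
toy_depth = [0, 2, 3, 1, 0, 0, 4, 4, 0, 1]
print(contiguous_blocks(toy_depth))  # [(1, 4), (6, 8), (9, 10)]
print(nominal_depth(10, 36, 120))    # 3.0
```

As nominal depth rises, the number of blocks would be expected to rise at first (more isolated covered stretches) and then fall as neighbouring blocks merge.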
Figure 2. Read depth and SNP density for A/J on chromosome 17. Plot shows depth (black) and density of homozygous (red) and heterozygous (green) SNPs compared to the C57BL/6J reference along chromosome 17. Gray or hashed bars are gaps in the reference assembly.
Figure 3. Read depth and SNP density for CAST/Ei on chromosome 17. Plot shows depth (black) and density of homozygous (red) and heterozygous (green) SNPs compared to the C57BL/6J reference along chromosome 17. Gray or hashed bars are gaps in the reference assembly.
Figure 4. Analysis of SNPs found in A/J sequence. SNPs were called by the MAQ algorithm and then filtered using the MAQ SNP filter. The overlap between the SNPs called by MAQ using the A/J Illumina data (MAQ Calls), those present in dbSNP, and those present in the 'Mouse HapMap'.
Figure 5. Analysis of SNPs found in CAST/Ei sequence. SNPs were called by the MAQ algorithm and then filtered using the MAQ SNP filter. The overlap between the SNPs called by MAQ using the CAST/Ei Illumina data (MAQ Calls), those present in dbSNP, and those present in the 'Mouse HapMap'.
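The three-way overlap shown in Figures 4 and 5 amounts to a set comparison. The sketch below assumes each SNP is reduced to a hashable key (here a bare chromosome 17 position); the function name `venn_counts` and the toy positions are hypothetical and are not drawn from the paper's data.

```python
from typing import Dict, Set


def venn_counts(maq: Set[int], dbsnp: Set[int], hapmap: Set[int]) -> Dict[str, int]:
    """Count SNPs in each region of a three-set Venn diagram."""
    return {
        "maq_only": len(maq - dbsnp - hapmap),        # novel calls
        "dbsnp_only": len(dbsnp - maq - hapmap),
        "hapmap_only": len(hapmap - maq - dbsnp),
        "maq_dbsnp": len((maq & dbsnp) - hapmap),
        "maq_hapmap": len((maq & hapmap) - dbsnp),
        "dbsnp_hapmap": len((dbsnp & hapmap) - maq),  # known SNPs missed by MAQ
        "all_three": len(maq & dbsnp & hapmap),
    }


# Toy positions (illustrative only).
counts = venn_counts({1, 2, 3, 4}, {2, 3, 5}, {3, 4, 6})
print(counts["maq_only"], counts["all_three"])  # 1 1
```

In practice a key would likely also need the alternate allele, so that two different substitutions at the same position are not counted as the same SNP.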
Figure 6. Quality control analysis of novel A/J SNPs. A sample of novel SNPs was genotyped using the Sequenom platform. (a) Plot shows the proportion of calls confirmed by genotyping for differing qualities of SNP. (b) Quality control of SNPs on the basis of mapping depth as well as quality. Confirmation data were used to calculate a score for each SNP based on quality and depth. Plot shows estimated sensitivity and false discovery rate (FDR) based on using different thresholds of p-score. (c) SNPs missed by MAQ. A sample of SNPs present in dbSNP but absent from the MAQ calls was genotyped. The reason for the absence of each SNP is shown. 'Filtered' SNPs were called by MAQ but filtered out as being of low quality. 'dbSNP errors' are SNPs where our genotyping agrees with the MAQ call but not dbSNP. 'No base call SNPs' are SNPs for which MAQ did not make a base call (generally due to zero depth). 'MAQ errors' are bases where our genotyping agreed with the dbSNP call.
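The trade-off in panel (b) can be sketched as follows. This is a minimal illustration, assuming each genotyped SNP carries a numeric score and a confirmed/not-confirmed flag; how the paper's p-score is actually computed from quality and depth is not given here, so the scores, the function name `sensitivity_and_fdr`, and the toy values are all hypothetical.

```python
from typing import List, Tuple


def sensitivity_and_fdr(
    calls: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Estimate sensitivity and FDR when keeping calls with score >= threshold.

    calls: (score, confirmed_by_genotyping) for each validated SNP.
    Sensitivity is measured relative to the confirmed SNPs in this sample only.
    """
    passed = [confirmed for score, confirmed in calls if score >= threshold]
    true_total = sum(1 for _, confirmed in calls if confirmed)
    true_pos = sum(passed)               # confirmed calls that pass the threshold
    false_pos = len(passed) - true_pos   # unconfirmed calls that pass
    sensitivity = true_pos / true_total if true_total else 0.0
    fdr = false_pos / len(passed) if passed else 0.0
    return sensitivity, fdr


# Toy validation sample (illustrative only).
toy_calls = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
for t in (0.1, 0.5, 0.85):
    print(t, sensitivity_and_fdr(toy_calls, t))
```

Raising the threshold lowers the FDR at the cost of sensitivity, which is the curve that panel (b) describes.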
Figure 7. Structural variants on chromosome 17. Copy number variants (CNVs) were called in the sequence using a hidden Markov model (HMM), using both depth information and density of heterozygous SNPs (for amplifications). Deletions were called using aberrantly mapping pairs of reads. (a) Density of CNVs and deletions across chromosome 17 of A/J. CNVs called by the HMM are shown in dark red/green on the left-hand axis. Deletions called from aberrantly mapping read pairs are shown in bright red on the right-hand axis. (b) As (a) but for CAST/Ei chromosome 17. Cut-out shows an example of output from the HMM for a region with two amplifications and a loss.
Stop codons gained and lost relative to reference
| Gene | Description | Codon |
|---|---|---|
| A/J | ||
| Stops gained | ||
| | ATP-binding cassette sub-family A member 3 | 1069/1704 |
| | Transmembrane protein 12 | 182/188 |
| ENSMUSG00000073427 | Adult male hypothalamus cDNA | 109/132 |
| ENSMUSG00000002791 | | 109/311 |
| | Butyrophilin-like 7 | 570/580 |
| | H-2 class I histocompatibility antigen, D-37 alpha chain precursor | 218/357 |
| | Ribonuclease P protein subunit p21 | 70/71 |
| ENSMUSG00000073395 | Adult male epididymis cDNA | 158/163 |
| Stops lost | ||
| | Response to metastatic cancers 2 gene | 263/263 |
| ENSMUSG00000073373 | 12 days embryo female mullerian duct includes surrounding region cDNA | 42/42 |
| CAST/Ei | ||
| Stops gained | ||
| | Poly(A) binding protein, cytoplasmic 3 gene | 631/644 |
| ENSMUSG00000073464 | Putative uncharacterized protein | 107/140 |
| | Sperm motility kinase 2B gene | 96/505 |
| | Mesothelin-like gene | 673/686 |
| | UHRF1 (ICBP90) binding protein 1 gene | 89/93 |
| | RIKEN cDNA C920016K16 gene | 13/468 |
| ENSMUSG00000067203 | | 3/385 |
| | RIKEN cDNA 2610110G12 gene | 398/399 |
| | Histocompatibility 2, T region locus 22 gene | 126/380 |
| | Ribonuclease P 21 subunit (human) gene | 70/72 |
| ENSMUSG00000044538 | | 305/320 |
| | Olfactory receptor 138 gene | 203/313 |
| 9130008F23Rik ENSMUSG00000054951 | RIKEN cDNA 9130008F23 gene | 55+106/182 |
| | Triggering receptor expressed on myeloid cells 2 gene | 148/250 |
| | Predicted gene, EG328839 | 13/115 |
| | Yip1 domain family, member 4 gene | 30/290 |
| ENSMUSG00000079336 | | 3/13 |
| ENSMUSG00000066938 | Putative uncharacterized protein fragment | 94/108 |
| | Thyroid adenoma associated gene | 1791/1950 |
| ENSMUSG00000071036 | Putative uncharacterized protein MCG125396 | 53/120 |
| Stops lost | ||
| | Response to metastatic cancers 2 gene | 263/263 |
| | Histocompatibility 2, blastocyst gene | 263/263 |
| | Forkhead box N2 gene | 212/212 |
| ENSMUSG00000073373 | Putative uncharacterized protein | 42/42 |
Genes and transcripts affected by deletions predicted from aberrantly mapping sequence pairs
| Gene | Transcript | Strain | Exons |
|---|---|---|---|
| | ENSMUST00000073143 | CAST/Ei | 1 |
| | ENSMUST00000086639 | A/J | 19 |
| | ENSMUST00000095208 | CAST/Ei | 7 |
| | ENSMUST00000052440 | CAST/Ei | 1, 2 |
| | ENSMUST00000077420 | CAST/Ei | 1 |
| | ENSMUST00000079363 | CAST/Ei | 1 |
| | ENSMUST00000086423 | CAST/Ei | 1 |
| ENSMUSG00000060087 | ENSMUST00000077584 | A/J | 1, 2 |
| | ENSMUST00000024858 | CAST/Ei | 15 |
| | ENSMUST00000046839 | A/J | 3 |
| | ENSMUST00000046839 | CAST/Ei | 3 |
| | ENSMUST00000102263 | A/J | 1 |
| | ENSMUST00000102263 | CAST/Ei | 1 |
| | ENSMUST00000112168 | CAST/Ei | 1 |
| | ENSMUST00000015267 | A/J | 1-7 |
| | ENSMUST00000087940 | A/J | 1 |
| | ENSMUST00000087940 | CAST/Ei | 1 |
| | ENSMUST00000040474 | A/J | 9 |
| | ENSMUST00000040474 | CAST/Ei | 9 |
| | ENSMUST00000097376 | CAST/Ei | 10 |
| | ENSMUST00000102026 | A/J | 1 |
| | ENSMUST00000077301 | CAST/Ei | 1 |
| | ENSMUST00000087699 | A/J | 4 |
Genes and transcripts associated with the MHC complex affected by CNVs as predicted by a hidden Markov model
| Gene | Transcript | Type | Strain | Exons |
|---|---|---|---|---|
| | ENSMUST00000090537 | Gain | A/J | 1-7 |
| | ENSMUST00000090537 | Loss | CAST/Ei | 1-6 |
| | ENSMUST00000087244 | Gain | CAST/Ei | 1-5 |
| | ENSMUST00000087173 | Gain | A/J | 1-8 |
| | ENSMUST00000078966 | Gain | A/J | 1-7 |
| | ENSMUST00000087173 | Gain | CAST/Ei | 1-8 |
| | ENSMUST00000078966 | Gain | CAST/Ei | 1-7 |
| | ENSMUST00000114232 | Gain | A/J | 1-6 |
| | ENSMUST00000114311 | Gain | A/J | 1-7 |
| | ENSMUST00000087189 | Gain | A/J | 1-9 |
| | ENSMUST00000025181 | Gain | A/J | 1-8 |
| | ENSMUST00000046131 | Gain | A/J | 1-7 |
| | ENSMUST00000114311 | Gain | CAST/Ei | 1-7 |
| | ENSMUST00000087189 | Gain | CAST/Ei | 1-9 |
| | ENSMUST00000025181 | Gain | CAST/Ei | 1-8 |
| | ENSMUST00000046131 | Gain | CAST/Ei | 1-7 |
| | ENSMUST00000041531 | Gain | A/J | 1 |
| | ENSMUST00000041398 | Gain | CAST/Ei | 1, 2 |
| | ENSMUST00000105041 | Gain | A/J | 1-3 |
| | ENSMUST00000073208 | Gain | A/J | 1-8 |
| | ENSMUST00000074806 | Gain | A/J | 1-7 |
| | ENSMUST00000078205 | Loss | A/J | 1-8 |
| | ENSMUST00000113887 | Loss | A/J | 6, 7 |
| | ENSMUST00000081435 | Loss | A/J | 5, 6 |
| | ENSMUST00000078205 | Gain | CAST/Ei | 1-3 |
| | ENSMUST00000105041 | Gain | CAST/Ei | 2 |
| | ENSMUST00000073208 | Gain | CAST/Ei | 4-8 |
| | ENSMUST00000113887 | Gain | CAST/Ei | 1-4 |
| | ENSMUST00000081435 | Gain | CAST/Ei | 1-3 |
| | ENSMUST00000074806 | Gain | CAST/Ei | 1-7 |
| | ENSMUST00000078205 | Loss | CAST/Ei | 1-8 |
| | ENSMUST00000113887 | Loss | CAST/Ei | 6, 7 |
| | ENSMUST00000081435 | Loss | CAST/Ei | 5, 6 |
| | ENSMUST00000056774 | Gain | A/J | 1-4 |
| | ENSMUST00000068291 | Gain | A/J | 1-3 |
| | ENSMUST00000040279 | Gain | A/J | 1-3 |
| | ENSMUST00000056774 | Gain | CAST/Ei | 1-4 |
| | ENSMUST00000068291 | Gain | CAST/Ei | 1-3 |
| | ENSMUST00000040279 | Gain | CAST/Ei | 1-3 |
| | ENSMUST00000040240 | Gain | A/J | 1-6 |
| | ENSMUST00000071951 | Gain | A/J | 7 |
| | ENSMUST00000076256 | Gain | A/J | 7, 8 |
| | ENSMUST00000074201 | Loss | A/J | 1-10 |
| | ENSMUST00000025312 | Gain | A/J | 1-6 |
| | ENSMUST00000095300 | Gain | A/J | 1-4 |
| | ENSMUST00000097329 | Gain | A/J | 1, 2 |
| | ENSMUST00000113714 | Gain | A/J | 1-4 |
| | ENSMUST00000102675 | Gain | A/J | 1-6 |
| | ENSMUST00000025312 | Gain | CAST/Ei | 1-6 |
| | ENSMUST00000095300 | Gain | CAST/Ei | 1-4 |
| | ENSMUST00000097329 | Gain | CAST/Ei | 1, 2 |
| | ENSMUST00000113714 | Gain | CAST/Ei | 1-4 |
| | ENSMUST00000102675 | Gain | CAST/Ei | 1-6 |
| | ENSMUST00000058801 | Loss | A/J | 1-9 |
| | ENSMUST00000077960 | Loss | A/J | 1-10 |
| | ENSMUST00000080015 | Loss | A/J | 1-9 |
| | ENSMUST00000064686 | Gain | A/J | 1-9 |
| | ENSMUST00000064686 | Gain | CAST/Ei | 1-9 |
Assembly statistics
| Assembler | Sequencing | Strains | Insert size | Read length (bp) | Number of reads | Raw read coverage | Assembled bases (Mb) | Contig coverage | Contig N50 |
|---|---|---|---|---|---|---|---|---|---|
| FuzzyPath | Illumina | CAST/Ei | 200 bp | 36 | 173 million | 65× | 76.03 | 80.0% | 2,315 |
| Velvet* | Illumina | CAST/Ei | 200 bp | 36 | 173 million | 65× | 58.09 | 61.15% | 391 |
| ABySS | Illumina | CAST/Ei | 200 bp | 36 | 173 million | 65× | 72.80 | 76.63% | 1,022 |
| FuzzyPath | Illumina | A/J | 200 bp | 36 | 112 million | 42× | 55.87 | 58.8% | 959 |
| Phusion | Capillary | A/J | 3-5 kb | 400-900 | 357,100 | 2.5× | 81.2 | 85.5% | 3,377 |
| FuzzyPath | Hybrid | A/J† | Hybrid† | Hybrid† | Hybrid† | Hybrid† | 85.54 | 90.0% | 5,793 |
*Version 0.4. We also tried the latest version, 0.7, which had been much improved in terms of contig length, but it ran out of memory on a computer with 192 GB RAM. †Assembly incorporating the A/J Illumina reads and Celera capillary reads.
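For readers unfamiliar with the Contig N50 column, the statistic can be computed as in the sketch below. It uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases); whether the table uses the assembled total or the chromosome length as the denominator is not stated here, so that choice is an assumption. The function name `n50` and the toy contig lengths are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # at least half of all assembled bases reached
            return length
    return 0


# Toy contig lengths (illustrative only): total 54, half is 27, 10 + 9 + 8 = 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A larger N50 indicates that more of the assembly sits in long contigs, which is why the hybrid assembly's value stands out in the table.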
Figure 8. Analysis of SNPs in candidate genes within the Obrq13 QTL region of mouse chromosome 17 that has a protective effect on liver triglyceride levels. Shown is an example of histology of the liver of a C57BL/6J male and a consomic mouse carrying the Obrq13 region from A/J, which has a protective effect on the accumulation of liver triglycerides when animals are placed on a high-fat diet [16]. Using the sequence of A/J chromosome 17, we called SNPs against the reference C57BL/6J genome and positioned them in candidate genes within the Obrq13 region. Non-coding SNPs are shown as red circles, non-synonymous SNPs are shown in green, and synonymous SNPs are shown in yellow, while the truncating and essential splice site SNPs found in a transcript of Lmf1 are shown as an open circle and an orange circle, respectively. The orientation of each gene relative to the forward strand is shown above the gene name as an arrow, and genes are grouped together based on size (a scale is shown above each group of genes). Genes are displayed with 5 kb of genomic sequence 5' and 3'.