Hongbo M Xie, Juan C Perin, Theodore G Schurr, Matthew C Dulik, Sergey I Zhadanov, Joseph A Baur, Michael P King, Emily Place, Colleen Clarke, Michael Grauer, Jonathan Schug, Avni Santani, Anthony Albano, Cecilia Kim, Vincent Procaccio, Hakon Hakonarson, Xiaowu Gai, Marni J Falk.
Abstract
BACKGROUND: Mitochondrial genome sequence analysis is critical to the diagnostic evaluation of mitochondrial disease. Existing methodologies differ widely in throughput, complexity, cost efficiency, and sensitivity of heteroplasmy detection. Affymetrix MitoChip v2.0, which uses a sequencing-by-genotyping technology, allows potentially accurate and high-throughput sequencing of the entire human mitochondrial genome to be completed in a cost-effective fashion. However, the relatively low call rate achieved using existing software tools has limited the wide adoption of this platform for either clinical or research applications. Here, we report the design and development of a custom bioinformatics software pipeline that achieves a much improved call rate and accuracy for the Affymetrix MitoChip v2.0 platform. We used this custom pipeline to analyze MitoChip v2.0 data from 24 DNA samples representing a broad range of tissue types (18 whole blood, 3 skeletal muscle, 3 cell lines), mutations (a 5.8 kilobase pair deletion and 6 known heteroplasmic mutations), and haplogroup origins. All results were compared to those obtained by at least one other mitochondrial DNA sequence analysis method, including Sanger sequencing, denaturing HPLC-based heteroduplex analysis, and/or the Illumina Genome Analyzer II next generation sequencing platform.
Year: 2011 PMID: 22011106 PMCID: PMC3234255 DOI: 10.1186/1471-2105-12-402
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Figure 1. MitoChip filtering protocol (MFP) workflow diagram.
Sample Characteristics.
| Sample ID | Group | Tissue origin | Whole mtDNA sequencing previously performed | Comparative sequencing method | mtDNA haplogroup | MFP predicted haplogroup | Unique or pathogenic known feature | Heteroplasmic variant levels | MitoChip detection of a priori known feature(s) | MFP MitoChip v2.0 call rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Clinical | Blood | No | Common point mutation panel (Baylor) | H | H | None | | Yes | 99.5 |
| 2 | Clinical | Blood | No | Common point mutation panel (Baylor) | I2 | I | Heteroplasmic 3243A > G | 84% | Yes | 99.7 |
| 3 | Clinical | Blood | No | DHPLC (Transgenomics) | B2b | R* | None | | Yes | 99.6 |
| 4 | Clinical | Blood | No | DHPLC (Transgenomics) | R* | R* | None | | Yes | 99.5 |
| 5 | Clinical | Muscle | No | DHPLC (Transgenomics) | N1b2 | N1* | None | | Yes | 99.7 |
| 6 | Clinical | Muscle | Yes | Sanger (Baylor) | J1c | J | Homoplasmic 10845C > T; Heteroplasmic 5049C > T | Not reported | Yes | 99.8 |
| 7 | Clinical | Muscle& | Yes | Sanger (Baylor) | J1c | J | Homoplasmic 12264C > T | 100% | Yes | 99.6 |
| 8 | Clinical | Blood& | No | qPCR of heteroplasmic variant (Baylor) | J1c | J | Heteroplasmic 12264C > T | 30% | Yes | 99.6 |
| 9 | Clinical | Blood | Yes | Sanger (Baylor) | N1a | N1* | Homoplasmic T insertion between 5537 and 5538 | | Yes | 99.8 |
| 10 | Clinical | Blood | Yes | Sanger (Baylor) | N1b2 | N1* | None | | Yes | 99.7 |
| 11 | Clinical | Blood | Yes | Sanger (Baylor) | W1c | W | Homoplasmic 11204T > C | | Yes | 99.6 |
| 12 | Clinical | Blood | Yes | Sanger (Baylor) | L1b1a | L0/L1 | Homoplasmic 11778G > A | | Yes | 99.5 |
| 13 | Clinical | Fibroblast Cell Line | Yes | Sanger (Baylor) | H | H | None | | Yes | 99.6 |
| 14 | Research | Cell line | Yes | Sanger (MPK) | K | L3 | 5 Kb deletion | | Yes | N/A |
| 15 | Research | HeLa Cell Line | Yes | Illumina GAII (JAB) | L3b1a1 | L3 | None | | No | 99.7 |
| 16 | Research | Blood | Yes | Sanger (TGS) | V7 | V | None | | Yes | 99.7 |
| 17 | Research | Blood | Yes | Sanger (TGS) | H11 | H | Heteroplasmy 9966G > A | 20% | No | 98.5 |
| 18 | Research | Blood | Yes | Sanger (TGS) | U4a | U* | Heteroplasmy 1706A > G | 25% | No | 98.9 |
| 19 | Research | Blood | Yes | Sanger (TGS) | D5a | L3 | None | | Yes | 99.7 |
| 20 | Research | Blood | Yes | Sanger (TGS) | D5a | L3 | None | | Yes | 99.6 |
| 21 | Research | Blood | Yes | Sanger (TGS) | J1c2 | J | Heteroplasmy 12879C > T | 45% | Yes | 99.7 |
| 22 | Research | Blood | Yes | Sanger (TGS) | T1a | T | None | | Yes | 99.6 |
| 23 | Research | Blood | Yes | Sanger (TGS) | U4b3 | U* | None | | Yes | 99.8 |
| 24 | Research | Blood | Yes | Sanger (TGS) | D5c | L3 | None | | Yes | 99.6 |
Thirteen clinical samples and 11 research samples were analyzed by MitoChip v2.0. Tissue origin, comparative mtDNA genome sequencing methodologies, unique or pathogenic features that characterize particular samples based on a priori sequencing knowledge, variant heteroplasmy levels, as well as MitoChip performance in terms of ability to detect known variants and the call rate achieved using MitoChip Filtering Protocol (MFP) are detailed. The 'mtDNA haplogroup' column details the manually curated haplogroup based on full sequence analysis for each sample. 'MitoSNP predicted haplogroup' was based on a subset of 22 mtDNA positions and generally agreed with manual curation, with the exception of haplogroups B, K, and D, which were not properly identified by MitoSNP prediction. &, muscle and blood samples originated from the same subject.
Figure 2. Quality score distribution for all 24 samples. Quality score distribution is shown for all bases of all 24 samples. Sample #14 has a uniquely low quality score distribution.
Figure 3. Correlation heat map for all 24 samples. The heat map plots the correlation coefficient score between any two samples. Samples #4, #14, #17, and #18 were clear outliers relative to the other samples.
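The pairwise comparison behind such a heat map can be sketched as follows. This is a minimal illustration only: the caption does not state which correlation coefficient was used, so Pearson correlation is assumed here, and the function names (`pearson`, `correlation_matrix`) and the input layout (one list of per-probe quality scores per sample) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists.

    Assumption: the source only says "correlation coefficient score";
    Pearson is used here purely for illustration.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(samples):
    """Return {(id_a, id_b): r} for every ordered pair of samples.

    `samples` maps a sample ID to its list of per-probe scores
    (hypothetical layout, not taken from the paper).
    """
    ids = sorted(samples)
    return {(a, b): pearson(samples[a], samples[b]) for a in ids for b in ids}
```

A sample whose row of coefficients is consistently low against all other samples would stand out in the heat map in the way described for samples #4, #14, #17, and #18.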
Figure 4. Residual sum of squares (RSS) plot for all 24 samples. Using the mean quality score across all samples for each probe as the baseline, the sum of squares of the difference between quality score for each sample and the baseline was determined. Samples #4, #14, #17, and #18 were outliers as had been shown by the generalized extreme studentized deviate (GSD) many-outlier procedure performed in Figure 2 [9].
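The RSS calculation described in this caption can be sketched as below. The per-probe baseline and the squared-difference sum follow the caption; the function name `residual_sum_of_squares` and the dictionary input layout are assumptions made for illustration, not the authors' implementation.

```python
def residual_sum_of_squares(quality):
    """Per-sample RSS against the across-sample mean quality score.

    `quality` maps a sample ID to a list of quality scores, one per probe,
    with all lists the same length (hypothetical layout).
    """
    ids = list(quality)
    if not ids:
        raise ValueError("at least one sample is required")
    n_probes = len(quality[ids[0]])
    if any(len(quality[s]) != n_probes for s in ids):
        raise ValueError("all samples must cover the same probes")

    # Baseline: mean quality score across all samples for each probe.
    baseline = [
        sum(quality[s][i] for s in ids) / len(ids) for i in range(n_probes)
    ]

    # RSS: sum over probes of (sample score - baseline) squared.
    return {
        s: sum((q - b) ** 2 for q, b in zip(quality[s], baseline))
        for s in ids
    }
```

Samples with an RSS far above the rest would then be candidates for the outlier test cited in the caption; that test itself is not reproduced here.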
Figure 5. Quality score analysis. (A) (Top) Average quality score plots for all 24 samples, using a 25 bp moving window. (Bottom) Highest intensity value plots for all 24 samples, using a 50 bp moving window. (B) An average quality score plot is shown for a single sample (#14) using a 25 bp moving window.
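A moving-window average of the kind plotted here can be sketched as follows. The caption gives only the window sizes (25 bp and 50 bp), so the details are assumptions: the window is treated as a simple run of consecutive positions over a linear coordinate list, with no wrap-around for the circular genome, and the function name `moving_average` is hypothetical.

```python
def moving_average(values, window=25):
    """Mean of each run of `window` consecutive values.

    Returns len(values) - window + 1 averages. Windows do not wrap
    around the end of the list (an assumption; the source does not say
    how the circular mtDNA coordinate was handled).
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window one position: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages
```

For example, `moving_average(scores, window=25)` would smooth a per-base quality score list as in panel A (top), and `window=50` would match the window size quoted for the intensity plot.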
Figure 6. Comparison of Call Rate by MFP and GSEQ 4.1. Call rate comparison for each sample processed by MFP and GSEQ 4.1 alone.
Figure 7. Improvement fraction in call rate by gene. Average call rate comparison by gene for each sample processed with MFP over GSEQ 4.1 alone.
Variant discrepancy for calls made on MitoChip v2.0 with MFP bioinformatic analysis compared to other methods of whole mitochondrial genome sequencing.
| Sample ID | Total discrepant calls (missed by MFP / extra by MFP) |
|---|---|
| 6 | 5 (0/5) |
| 7 | 2 (1/1) |
| 9 | 1 (1/0) |
| 10 | 0 (0/0) |
| 11 | 0 (0/0) |
| 12 | 1 (0/1) |
| 13 | 0 (0/0) |
| 15 | 0 (0/0) |
| 16 | 0 (0/0) |
| 19 | 3 (2/1) |
| 20 | 10 (2/8) |
| 21 | 0 (0/0) |
| 22 | 0 (0/0) |
| 23 | 3 (1/2) |
| 24 | 4 (1/3) |
Discrepant calls between MitoChip v2.0 analyzed with our custom MFP bioinformatics algorithm and standard sequencing are summarized for each of the 15 study samples for which full mtDNA genome sequencing had been previously performed (samples #14, #17 and #18 were excluded from this comparative analysis due to consistent findings of poor sample quality on multiple analyses). The total number of discrepant calls is given for each sample; the numbers in parentheses give, first, the calls missed by MFP-based MitoChip analysis that had been made by standard sequencing and, second, the extra calls made by MFP-based MitoChip analysis that had not been detected by standard sequencing. The specific discordant variants from each sample are catalogued in Additional File 4.
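The "total (missed/extra)" tally used in this table amounts to two set differences between the variant calls of the two methods. A minimal sketch follows, assuming each method's calls are available as a collection of variant labels; the function names and the example variant strings are hypothetical and are not data from the study.

```python
def discrepancy(mfp_calls, reference_calls):
    """Return (total, missed, extra) discrepant call counts.

    missed: calls made by the reference method but not by MFP.
    extra:  calls made by MFP but not by the reference method.
    """
    mfp = set(mfp_calls)
    reference = set(reference_calls)
    missed = len(reference - mfp)
    extra = len(mfp - reference)
    return missed + extra, missed, extra


def format_cell(total, missed, extra):
    """Render the counts in the table's 'total (missed/extra)' style."""
    return f"{total} ({missed}/{extra})"


# Hypothetical variant labels, for illustration only.
example_mfp = {"73G", "263G", "750G"}
example_reference = {"73G", "263G", "1438G"}
example_cell = format_cell(*discrepancy(example_mfp, example_reference))
```

With these made-up inputs, one call is missed by MFP and one is extra, so the cell would carry a total of two discrepancies.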
Figure 8. Signal intensity pattern of heteroplasmic sites. GSEQ intensity plots demonstrating identification of a known heteroplasmic site (yellow highlight) on both forward (left panel) and reverse (right panel) strands that had been originally demonstrated by Sanger sequencing. The peak at each position corresponds to the signal intensity of one of 4 probes (A,C,G,T) in a probe set.
Figure 9. Venn diagram comparison of heteroplasmy calls made by MFP and GSEQ 4.1.
Heteroplasmy detection levels made using the MFP analysis algorithm.
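| Sample ID | Variant | Heteroplasmy level by comparative method | Heteroplasmy level by MFP |
|---|---|---|---|
| 2 | A3243G | 84% | 61% |
| 6 | C5049T | N.D. | 34% |
| 8 | C12264T | 30% | 43% |
| 9 | 5537 | 5537_5538 insT (100%) | 58% |
| 9 | 5538 | 5537_5538 insT (100%) | 45% |
| 17 | G9966A | 20% | Sample excluded |
| 18 | A1709G | 25% | Sample excluded |
| 21 | C12879T | 45% | 55% |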
Samples #2, #6, #8, and #21 had heteroplasmic variants detected both by Sanger sequencing and MFP analysis of MitoChip v2.0 data. Of interest, the 5537_5538insT in tRNA-TRP identified by Sanger sequencing in sample #9 was interpreted in MFP as a heteroplasmic mutation at both positions 5537 and 5538. Although a heteroplasmic A1709G site was detected by Sanger sequencing in sample #18, this sample was excluded from MFP analysis due to poor sample quality. "N.D." = not determined.