Mingkun Li, Roland Schroeder, Albert Ko, Mark Stoneking.
Abstract
Enriching target sequences in sequencing libraries via capture hybridization to baits/probes is an efficient means of leveraging the capabilities of next-generation sequencing to obtain sequence data from target regions of interest. However, homologous sequences from non-target regions may also be enriched by such methods. Here we investigate the fidelity of capture enrichment for complete mitochondrial DNA (mtDNA) genome sequencing by analyzing sequence data for nuclear copies of mtDNA (NUMTs). Using capture-enriched sequencing data from a mitochondria-free cell line and the parental cell line, and from samples previously sequenced from long-range PCR products, we demonstrate that NUMT alleles are indeed present in capture-enriched sequence data, but at levels too low to influence calling of the authentic mtDNA genome sequence. However, distinguishing NUMT alleles from true low-level mutations (e.g. heteroplasmy) is more challenging. We develop here a computational method to distinguish NUMT alleles from heteroplasmies, using sequence data from artificial mixtures to optimize the method.
Year: 2012 PMID: 22649055 PMCID: PMC3467033 DOI: 10.1093/nar/gks499
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Mapping results for the mt-free cell line and the parental cell line
| Cell line (a) | Number of reads | Percent mapped reads (HG19) (b) | Percent mtDNA reads (HG19) (c) | Percent mtDNA reads (MT) (d) |
|---|---|---|---|---|
| RHO1 | 459 888 | 87.72 | 0.007 (34) | 0.15 (692) |
| RHO2 | 219 470 | 74.12 | 0.017 (38) | 0.28 (620) |
| WT1 | 928 428 | 93.31 | 41.3 | 45.7 |
| WT2 | 1 071 782 | 85.04 | 41.5 | 50.8 |
(a) Both cell lines were sequenced twice (76-bp paired-end reads with double indexes), from independent sequencing libraries. RHO, mt-free cell line; WT, parental cell line.
(b) Percentage of reads mapped to the entire genome (nuclear DNA + mtDNA).
(c) Percentage of reads mapped to mtDNA when using the entire genome (nuclear DNA + mtDNA) as the mapping reference (number of reads in parentheses).
(d) Percentage of reads mapped to mtDNA when using only the mtDNA genome as the mapping reference (number of reads in parentheses).
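The read counts given in parentheses for the RHO rows make it possible to recheck the reported percentages. The following is a minimal Python sketch of that arithmetic; the function name `percent_of_reads` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def percent_of_reads(n_mapped: int, n_total: int) -> float:
    """Return n_mapped as a percentage of n_total."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_mapped / n_total


# Counts copied from the table above:
# (total reads, mtDNA reads with HG19 reference, mtDNA reads with MT reference)
rho_counts = {
    "RHO1": (459_888, 34, 692),
    "RHO2": (219_470, 38, 620),
}

for name, (total, mt_vs_hg19, mt_vs_mt) in rho_counts.items():
    pct_hg19 = percent_of_reads(mt_vs_hg19, total)
    pct_mt = percent_of_reads(mt_vs_mt, total)
    print(f"{name}: {pct_hg19:.3f}% (HG19), {pct_mt:.2f}% (MT)")
```

Rounded to the precision used in the table, the computed values agree with the tabulated percentages for both mt-free libraries.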
Reads mapping to potential NUMTs in sequence data from the mt-free cell line and the parental cell line
| Comparison | Ref | POS (a), major mt-free allele = major parental allele | Minor mt-free = minor parental: present (b) | Minor mt-free = minor parental: absent (b) | Minor mt-free ≠ minor parental: present | Minor mt-free ≠ minor parental: absent | POS, major mt-free allele ≠ major parental allele | Major mt-free = minor parental: present | Major mt-free = minor parental: absent | Major mt-free ≠ minor parental: present | Major mt-free ≠ minor parental: absent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RHO1 versus WT1 | MT | 9061 | 38 | 8 | 29 | 14 | 203 | 142 | 11 | 44 | 6 |
| RHO1 versus WT1 | HG19 | 540 | 0 | 0 | 0 | 0 | 7 | 5 | 1 | 1 | 0 |
| RHO2 versus WT2 | MT | 8025 | 13 | 15 | 2 | 9 | 164 | 120 | 28 | 14 | 2 |
| RHO2 versus WT2 | HG19 | 898 | 1 | 1 | 0 | 0 | 17 | 1 | 11 | 5 | 0 |
(a) Number of positions in the mtDNA genome included in reads from the mt-free cell line.
(b) Present means the mt-free allele is present in the NUMTs database and absent means the mt-free allele is not in the NUMTs database.
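The column grouping in this table amounts to a small decision procedure per position. The sketch below is one possible reading of it in Python, under stated assumptions: the NUMTs database is represented as a hypothetical set of `(position, allele)` pairs, the allele tested against the database is taken to be the mt-free minor allele when the major alleles agree and the mt-free major allele when they differ, and positions with no mt-free minor allele are left unclassified. None of these names or representations come from the paper.

```python
from typing import Optional, Set, Tuple

# Hypothetical representation of the NUMTs database: (position, allele) pairs.
NumtDb = Set[Tuple[int, str]]


def classify_position(
    pos: int,
    rho_major: str,
    rho_minor: Optional[str],
    wt_major: str,
    wt_minor: Optional[str],
    numt_db: NumtDb,
) -> Optional[Tuple[str, str]]:
    """Assign a position to one of the table's column groups.

    Returns (group, "present" | "absent"), or None when the major alleles
    agree and the mt-free data show no minor allele at this position.
    """
    if rho_major == wt_major:
        if rho_minor is None:
            return None  # nothing to compare; position only counts toward POS
        group = "minor_equal" if rho_minor == wt_minor else "minor_differ"
        tested = rho_minor  # assumption: the minor mt-free allele is looked up
    else:
        group = (
            "major_is_parental_minor"
            if rho_major == wt_minor
            else "major_not_parental_minor"
        )
        tested = rho_major  # assumption: the major mt-free allele is looked up
    status = "present" if (pos, tested) in numt_db else "absent"
    return group, status


# Toy example with made-up positions and alleles.
toy_db: NumtDb = {(100, "T"), (200, "G")}
print(classify_position(100, "C", "T", "C", "T", toy_db))
print(classify_position(200, "G", None, "A", "G", toy_db))
```

Under this reading, the four counts in the second column group sum to its POS value in every row of the table (for example 142 + 11 + 44 + 6 = 203), whereas the first group counts only the subset of positions that carry a minor allele.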
Minor allele profile in different sequencing libraries (after quality filter) mapped with either the mtDNA genome (MT) or entire genome (HG19) as the mapping reference
| Method | Ref | Minor allele count | In NUMTs database | Not in NUMTs database | In RHO94 database |
|---|---|---|---|---|---|
| LR-PCR | MT | 250 | 58 | 192 | 7 |
| Capture | MT | 406 | 202 | 204 | 63 |
| Shotgun | MT | 33 | 28 | 5 | 8 |
| LR-PCR | HG19 | 278 | 61 | 217 | 1 |
| Capture | HG19 | 227 | 44 | 183 | 0 |
| Shotgun | HG19 | 4 | 2 | 2 | 0 |
Figure 1. Correlation between read loss and major allele frequency change. Read loss was calculated as the percentage of reads that could be mapped when using the mtDNA genome as the reference (mapping quality score ≥20) but discarded when mapped to the entire genome (mapping quality score <20). Major allele frequency change was calculated as the frequency change of the correct allele (defined as the allele obtained when using mtDNA as the reference). Each dot represents one position in the mtDNA sequence of one of 14 samples (13 samples for the shotgun data). Blue dots indicate positions with consensus alleles that are not included in the NUMTs database; color intensity is proportional to the number of dots. Red dots represent positions with consensus alleles that differ from the reference mtDNA and are the same as a NUMT allele. The circled dots indicate positions whose major alleles changed when mapping to different reference genomes (mtDNA alone versus the entire genome). (A) long-range PCR data; (B) capture-enriched data; (C) shotgun data.
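The two quantities plotted in Figure 1 can be sketched in Python as follows. This is only an illustration of the definitions in the caption: the per-read mapping-quality dictionaries, the choice of denominator (reads passing the threshold against the mtDNA reference), and the sign convention (frequency under HG19 minus frequency under MT) are assumptions, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

MIN_MAPQ = 20  # mapping quality threshold stated in the caption


def read_loss_percent(mapq_mt: Dict[str, int], mapq_hg19: Dict[str, int]) -> float:
    """Percentage of reads kept with the MT reference but lost with HG19.

    Both arguments map read IDs to mapping quality for reads covering one
    position. Reads missing from mapq_hg19 are treated as unmapped (quality 0).
    """
    kept_mt = [read for read, q in mapq_mt.items() if q >= MIN_MAPQ]
    if not kept_mt:
        return 0.0
    lost = [read for read in kept_mt if mapq_hg19.get(read, 0) < MIN_MAPQ]
    return 100.0 * len(lost) / len(kept_mt)


def major_allele_frequency_change(bases_mt: List[str], bases_hg19: List[str]) -> float:
    """Change in frequency of the 'correct' allele between the two mappings.

    The correct allele is the most frequent base under the MT mapping
    (ties are resolved by first occurrence, an arbitrary choice here).
    """
    if not bases_mt:
        raise ValueError("no bases observed with the MT reference")
    correct = Counter(bases_mt).most_common(1)[0][0]
    freq_mt = bases_mt.count(correct) / len(bases_mt)
    freq_hg19 = bases_hg19.count(correct) / len(bases_hg19) if bases_hg19 else 0.0
    return freq_hg19 - freq_mt


# Toy data for a single position (values are made up).
mt_q = {"r1": 60, "r2": 37, "r3": 25, "r4": 40}
hg19_q = {"r1": 60, "r2": 0, "r3": 25, "r4": 40}
print(read_loss_percent(mt_q, hg19_q))
print(major_allele_frequency_change(["A", "A", "A", "G"], ["A", "G"]))
```

In the toy data one of four reads drops below the threshold when the nuclear genome is added to the reference, and the frequency of the MT-defined major allele falls accordingly.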
Figure 2. False-negative rates and false discovery rates under different thresholds of minor allele frequency (MAF) and DREEP quality score (DQS). (A and C) False-negative rate; (B and D) false-positive rate. A and B are results when using mtDNA as the mapping reference; C and D are results when using the entire genome (HG19) as the mapping reference. Empty bins in B and D represent no false positives. The basic thresholds used here are as follows: coverage ≥100; minor allele count ≥3 on each strand; minor allele count (number of distinct reads) ≥3 on each strand; and position not located in C-stretch or STR regions (303–315, 512–525, 16 181–16 195).
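The basic thresholds listed in the caption, combined with a MAF and an optional DQS cutoff, can be expressed as a simple per-site filter. The Python sketch below is a hedged illustration rather than the authors' implementation: the `SiteStats` record and its field names are hypothetical, the excluded regions are treated as inclusive intervals, and the DQS value is assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional

# Regions excluded in the caption (C-stretch and STR regions), assumed inclusive.
EXCLUDED_REGIONS = ((303, 315), (512, 525), (16181, 16195))
MIN_COVERAGE = 100
MIN_MINOR_PER_STRAND = 3


@dataclass(frozen=True)
class SiteStats:
    """Hypothetical per-position summary; field names are illustrative."""
    pos: int
    coverage: int
    minor_fwd: int            # minor allele count, forward strand
    minor_rev: int            # minor allele count, reverse strand
    distinct_minor_fwd: int   # distinct reads with minor allele, forward strand
    distinct_minor_rev: int   # distinct reads with minor allele, reverse strand
    maf: float                # minor allele frequency
    dqs: float                # DREEP quality score, computed elsewhere


def in_excluded_region(pos: int) -> bool:
    return any(start <= pos <= end for start, end in EXCLUDED_REGIONS)


def passes_basic_thresholds(site: SiteStats) -> bool:
    """Apply the coverage, per-strand count and region filters."""
    counts = (
        site.minor_fwd,
        site.minor_rev,
        site.distinct_minor_fwd,
        site.distinct_minor_rev,
    )
    return (
        site.coverage >= MIN_COVERAGE
        and all(c >= MIN_MINOR_PER_STRAND for c in counts)
        and not in_excluded_region(site.pos)
    )


def is_candidate_heteroplasmy(
    site: SiteStats, min_maf: float, min_dqs: Optional[float] = None
) -> bool:
    """Basic thresholds plus a MAF cutoff and, if given, a DQS cutoff."""
    if not passes_basic_thresholds(site) or site.maf < min_maf:
        return False
    return min_dqs is None or site.dqs >= min_dqs


site = SiteStats(pos=1000, coverage=500, minor_fwd=6, minor_rev=5,
                 distinct_minor_fwd=5, distinct_minor_rev=4, maf=0.022, dqs=6.0)
print(is_candidate_heteroplasmy(site, min_maf=0.015, min_dqs=4))
print(is_candidate_heteroplasmy(site, min_maf=0.02, min_dqs=10))
```

Keeping the basic filters separate from the MAF and DQS cutoffs makes it straightforward to scan a grid of threshold pairs, as the figure does.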
False-negative rates under different mapping strategies and different mixture levels
| Thresholds (a) | Ref. | Mixture level 0.5 | Mixture level 0.25 | Mixture level 0.1 | Mixture level 0.05 | Mixture level 0.025 | All levels |
|---|---|---|---|---|---|---|---|
| MAF ≥0.055 | MT | 0 | 0 | 0.022 | 0.885 | 0.984 | 0.389 |
| MAF ≥0.015, DQS ≥4 | MT | 0 | 0 | 0 | 0.005 | 0.322 | 0.068 |
| MAF ≥0.02, DQS ≥10 | MT | 0 | 0 | 0 | 0.033 | 0.639 | 0.138 |
| MAF ≥0.02 | HG19 | 0.066 | 0.066 | 0.066 | 0.082 | 0.508 | 0.159 |
(a) All thresholds result in no false positives.
MAF: minor allele frequency; DQS: DREEP quality score.
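For orientation, the rates reported for the artificial mixtures can be computed from two sets of positions: those expected to be heteroplasmic in a mixture and those actually called. The Python sketch below uses the standard definitions of false-negative rate and false discovery rate; it is assumed, not confirmed by the source, that the table's values follow exactly these definitions, and the function names are illustrative.

```python
from typing import Set


def false_negative_rate(expected: Set[int], called: Set[int]) -> float:
    """Fraction of expected heteroplasmic positions that were not called."""
    if not expected:
        raise ValueError("expected set must not be empty")
    return len(expected - called) / len(expected)


def false_discovery_rate(expected: Set[int], called: Set[int]) -> float:
    """Fraction of called positions that were not expected (0.0 if no calls)."""
    if not called:
        return 0.0
    return len(called - expected) / len(called)


# Toy example with made-up positions.
expected_sites = {73, 263, 750, 1438}
called_sites = {73, 263, 9000}
print(false_negative_rate(expected_sites, called_sites))
print(false_discovery_rate(expected_sites, called_sites))
```

A threshold set with no false positives, as stated in footnote (a), corresponds to a false discovery rate of zero in this formulation.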