Steven L Salzberg, Mihaela Pertea, Jill A Fahrner, Nara Sobreira.
Abstract
DNA sequencing has become a powerful method to discover the genetic basis of disease. Standard, widely used protocols for analysis usually begin by comparing each individual to the human reference genome. When applied to a set of related individuals, this approach reveals millions of differences, most of which are shared among the individuals and unrelated to the disease being investigated. We have developed a novel algorithm for variant detection, one that compares DNA sequences directly to one another, without aligning them to the reference genome. When used to find de novo mutations in exome sequences from family trios, or to compare normal and diseased samples from the same individual, the new method, direct alignment for mutation discovery (DIAMUND), produces a dramatically smaller list of candidate mutations than previous methods, without losing sensitivity to detect the true cause of a genetic disease. We demonstrate our results on several example cases, including two family trios in which it correctly found the disease-causing variant while excluding thousands of harmless variants that standard methods had identified.
Keywords: bioinformatics; computational biology; exome sequencing; sequence alignment; variant detection
Year: 2014 PMID: 24375697 PMCID: PMC4031744 DOI: 10.1002/humu.22503
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
Figure 1. Outline of initial steps in the DIAMUND algorithm, which identifies all k-mers unique to an affected proband and missing from both unaffected parents. The first step identifies k-mers, after which the proband data are filtered to remove k-mers resulting from sequencing errors. Intersecting all three sets identifies k-mers that are unique to the proband.
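The set logic outlined in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy reads, and the `min_count` threshold (a stand-in for the sequencing-error filter) are all hypothetical, and the sketch ignores reverse complements. The default k = 27 matches the 27-mers reported in the filtering table; the toy example uses k = 4 so the result is easy to inspect.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def proband_unique_kmers(proband_reads, parent1_reads, parent2_reads,
                         k=27, min_count=2):
    """Return proband k-mers that are absent from both parents.

    min_count is an assumed error filter: k-mers seen fewer than
    min_count times in the proband are treated as sequencing errors.
    """
    proband_counts = count_kmers(proband_reads, k)
    solid = {kmer for kmer, n in proband_counts.items() if n >= min_count}
    parental = (set(count_kmers(parent1_reads, k))
                | set(count_kmers(parent2_reads, k)))
    return solid - parental


# Toy data (hypothetical): the proband carries an A>T change at position 4,
# supported by two reads, plus one read with a singleton sequencing error.
parent1 = ["ACGTACGTAC", "ACGTACGTAC"]
parent2 = ["ACGTACGTAC", "ACGTACGTAC"]
proband = ["ACGTTCGTAC", "ACGTTCGTAC", "ACGGACGTAC"]

unique = proband_unique_kmers(proband, parent1, parent2, k=4, min_count=2)
print(sorted(unique))
```

In this toy case the surviving k-mers are exactly those that overlap the mutated base, while the k-mers from the singleton error read are dropped by the count filter.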
Illustration of the Data Reduction at Each Step from Raw Reads to a Final Set of Mutated Loci
Data remaining at the end of each step:
| Filtering step | Disease/normal pair | Family trio BH1019 | Family trio BH2041 | Family trio BH2688 |
|---|---|---|---|---|
| Number of reads from proband/diseased tissue | 118,414,556 | 84,201,820 | 75,877,750 | 103,527,644 |
| Number of 27-mers in proband/diseased tissue | 911,738,627 | 795,477,167 | 517,272,851 | 1,088,610,020 |
| Number of | 77,903,885 | 61,805,320 | 64,719,150 | 113,066,951 |
| Remove vector sequence | 77,898,848 | 61,800,798 | 64,713,995 | 113,062,417 |
| Eliminate | 17,821,359 | 9,385,347 | 10,730,208 | 50,535,681 |
| Eliminate | 10,568 | 65,352 | 20,130 | 2,006 |
| Identify reads containing | 32,829 reads | 148,496 | 46,454 | 4,404 |
| Remove reads containing vector | 15,260 | 125,648 | 38,799 | 2,760 |
| Number of contigs after assembly | 2,147 | 13,189 | 3,755 | 359 |
| Number of contigs with >3 reads after merging contigs | 279 contigs | 1,437 | 701 | 71 |
| Identify variants covered by reads from normal tissue | 55 contigs | 5 | 6 | 2 |
| Keep variants with >5% coverage | 42 variants | 5 | 6 | 2 |
| Find variants in coding regions | 14 variants | 3 | 3 | 1 |
| Remove synonymous SNPs | 10 variants | 2 | 3 | 1 |
Comparison of the Number of De Novo Mutations Found by DIAMUND and GATK when Comparing Exomes from Family Trios and Exomes from Diseased and Normal Cultured Fibroblasts from the Same Individual
| Method | Disease/normal pair | Family BH1019 | Family BH2041 | Family BH2688 |
|---|---|---|---|---|
| DIAMUND | 42 | 5 | 6 | 2 |
|  | 62,962 | 60,173 | 67,034 | 85,226 |
| GATK: variants in affected individual/diseased tissue but not in unaffected | 1,644 | 7,726 | 5,621 | 953 |
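To put the "dramatically smaller list" into numbers, the short calculation below divides the filtered GATK counts (last row of the table) by the DIAMUND counts (42, 5, 6, 2) for each sample. It is only an arithmetic illustration using the values reported in the table; the variable names are hypothetical.

```python
# Candidate counts copied from the comparison table.
diamund = {
    "Disease/normal pair": 42,
    "Family BH1019": 5,
    "Family BH2041": 6,
    "Family BH2688": 2,
}
gatk_filtered = {
    "Disease/normal pair": 1644,
    "Family BH1019": 7726,
    "Family BH2041": 5621,
    "Family BH2688": 953,
}

# Ratio of GATK candidates to DIAMUND candidates for each sample.
fold = {sample: gatk_filtered[sample] / diamund[sample] for sample in diamund}

for sample, ratio in fold.items():
    print(f"{sample}: GATK list is {ratio:.1f}x longer than the DIAMUND list")
```

The ratios range from roughly 39x for the disease/normal pair to more than 1,500x for family BH1019.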