Brock A Peters, Bahram G Kermani, Andrew B Sparks, Oleg Alferov, Peter Hong, Andrei Alexeev, Yuan Jiang, Fredrik Dahl, Y Tom Tang, Juergen Haas, Kimberly Robasky, Alexander Wait Zaranek, Je-Hyuk Lee, Madeleine Price Ball, Joseph E Peterson, Helena Perazich, George Yeung, Jia Liu, Linsu Chen, Michael I Kennemer, Kaliprasad Pothuraju, Karel Konvicka, Mike Tsoupko-Sitnikov, Krishna P Pant, Jessica C Ebert, Geoffrey B Nilsen, Jonathan Baccash, Aaron L Halpern, George M Church, Radoje Drmanac.
Abstract
Recent advances in whole-genome sequencing have brought the vision of personal genomics and genomic medicine closer to reality. However, current methods lack clinical accuracy and the ability to describe the context (haplotypes) in which genome variants co-occur in a cost-effective manner. Here we describe a low-cost DNA sequencing and haplotyping process, long fragment read (LFR) technology, which is similar to sequencing long single DNA molecules without cloning or separation of metaphase chromosomes. In this study, ten LFR libraries were made using only ∼100 picograms of human DNA per sample. Up to 97% of the heterozygous single nucleotide variants were assembled into long haplotype contigs. Removal of false positive single nucleotide variants not phased by multiple LFR haplotypes resulted in a final genome error rate of 1 in 10 megabases. Cost-effective and accurate genome sequencing and haplotyping from 10-20 human cells, as demonstrated here, will enable comprehensive genetic studies and diverse clinical applications.
Year: 2012 PMID: 22785314 PMCID: PMC3397394 DOI: 10.1038/nature11236
Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962
Figure 1. Overview of the Long Fragment Read (LFR) technology and controlled random enzymatic (CoRE) fragmenting
(a) 100-130 pg of high molecular weight DNA is physically separated into 384 distinct wells; (b) through several steps, all within the same well and without intervening purifications, the genomic DNA is amplified, fragmented, and ligated to unique barcode adapters; (c) all 384 wells are combined, purified, and introduced into Complete Genomics’ sequencing platform[10]; (d) mate-paired reads are mapped to the genome using a custom alignment program, and barcode sequences are used to group tags into haplotype contigs; (e) the final result is a diploid genome sequence.
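The physical separation in step (a) matters because two fragments that cover the same locus but come from different parental chromosomes rarely end up in the same well. The following is a minimal sketch of that arithmetic, not a calculation taken from the paper. It assumes that fragments are dispersed uniformly at random, that every fragment is captured, and that half of the haploid fragment coverage at a locus comes from each parent. The function name and its parameters are hypothetical.

```python
def opposite_haplotype_collision_probability(haploid_coverage, wells=384):
    """Chance that a fragment shares its well with a fragment of the other parent.

    Assumption (not from the source): fragments covering a locus are spread
    uniformly at random over the wells, and half of the haploid fragment
    coverage originates from each parental chromosome.
    """
    if haploid_coverage < 0 or wells < 1:
        raise ValueError("coverage must be non-negative and wells positive")
    fragments_from_other_parent = haploid_coverage / 2
    # Probability that none of the other parent's fragments lands in this well.
    p_no_collision = (1.0 - 1.0 / wells) ** fragments_from_other_parent
    return 1.0 - p_no_collision


if __name__ == "__main__":
    for coverage in (40, 80, 120):
        p = opposite_haplotype_collision_probability(coverage)
        print(f"haploid coverage {coverage}: collision probability {p:.3f}")
```

Under these assumptions, a haploid fragment coverage of 40 gives a collision probability of roughly 5%, so most wells carry a single parental haplotype at any given locus.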
Comparison of haplotyping performance between different genome assemblies
Variant calls for standard and LFR assembled libraries were combined and used as loci for phasing except where specified. Two samples were run with Complete Genomics’ pipeline 2.0 algorithms, which are expected to call more heterozygous SNPs; the remaining samples were analysed with previous versions (1.7-1.8) of Complete Genomics’ algorithms. The LFR phasing rate was based on a calculation of parental phased heterozygous SNPs (Supplementary Table 4).
| Sample | Ethnicity | Number of | LFR Phasing Rate | Haploid Fragment Coverage (Cells) | Fragment Size for | Fragment Size for | DNA Bases | N50 Contig |
|---|---|---|---|---|---|---|---|---|
| NA19240-Replicate 1 | Yoruban | 2,386,741 | 91% | 38 (9.4) | 64 | 84 | 237+176 | 1,210 |
| NA19240-Replicate 2 | Yoruban | 2,433,621 | 91% | 51 (12.7) | 66 | 96 | 313+176 | 1,010 |
| NA19240-10 cell pipeline 2.0 | Yoruban | 2,369,433 | 89% | 54.3 (13.6) | 80 | 120 | 308+176 | 943 |
| NA19240-Replicate 1 High Coverage | Yoruban | 2,578,903 | 96% | 48 (11.9) | 82 | 116 | 509+176 | 1,429 |
| NA19240-Replicates 1&2 combined | Yoruban | 2,646,352 | 97% | 89 (22.1) | 65 | 90 | 550+176 | 1,577 |
| NA19240-Replicate 1 LFR only pipeline 2.0 | Yoruban | 2,031,514 | 91% | 38 (9.4) | 64 | 84 | 237 | 1,036 |
| NA19240-Replicate 1 High Coverage LFR only | Yoruban | 2,274,696 | 95% | 48 (11.9) | 82 | 116 | 509 | 1,282 |
| NA12877-Replicate 1 | European | 1,831,032 | 93% | 65 (16.3) | 74 | 104 | 258+218 | 530 |
| NA12877-Replicate 2 | European | 1,810,540 | 92% | 51 (12.7) | 76 | 106 | 238+218 | 535 |
| NA12877-Replicates 1&2 combined | European | 1,946,089 | 97% | 116 (29) | 75 | 105 | 496+218 | 600 |
| NA12885 | European | 1,850,409 | 92% | 46 (11.6) | 72 | 98 | 272+221 | 528 |
| NA12886 | European | 1,854,360 | 93% | 44 (11) | 66 | 88 | 293+216 | 535 |
| NA12891 | European | 1,825,427 | 90% | 46 (11.6) | 80 | 112 | 280+246 | 545 |
| NA12892 | European | 1,917,442 | 93% | 93 (23.3) | 94 | 138 | 285+213 | 553 |
| NA12892 LFR only | European | 1,720,750 | 97% | 93 (23.3) | 94 | 138 | 285 | 525 |
| NA20431 High Coverage | European | 1,703,047 | 84% | 30 (7.4) | 94 | 142 | 514+189 | 411 |
For individuals without parental genome data (NA12891, NA12892, and NA20431), the phasing rate was calculated by dividing the number of phased heterozygous SNPs by the number of heterozygous SNPs expected to be real (the number of SNPs for which phasing was attempted minus 50,000 expected errors). N50 calculations are based on the total assembled length of all contigs against the NCBI build 36 human reference genome (build 37 for the NA19240 10 cell and high coverage samples and for NA20431 high coverage). Haploid fragment coverage is four times the number of cells because all DNA is denatured to single strands before being dispersed across a 384-well plate. An insufficient amount of starting DNA explains the lower phasing efficiency in the NA20431 genome.
The 10 cell sample was measured by individual well coverage to contain more than 10 cells; this is likely the result of the cells being in various stages of the cell cycle during collection.
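The three quantities defined in these notes can be written out as a short sketch. It illustrates the stated definitions only and is not the authors' code. The function names are hypothetical, the inputs in the usage lines are made-up numbers rather than values from the table, and the N50 function uses the common definition (the contig length at which the cumulative length of contigs, taken longest first, reaches half of the total assembled length).

```python
def phasing_rate_without_parents(phased_snps, attempted_snps, expected_errors=50_000):
    """Phased heterozygous SNPs divided by the SNPs expected to be real."""
    expected_real = attempted_snps - expected_errors
    if expected_real <= 0:
        raise ValueError("attempted_snps must exceed expected_errors")
    return phased_snps / expected_real


def haploid_fragment_coverage(cells):
    """Four single-stranded copies of each locus per diploid cell."""
    return 4 * cells


def n50(contig_lengths):
    """Contig length at which the longest contigs cover half the assembled length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the functions.
    print(phasing_rate_without_parents(900_000, 1_050_000))
    print(haploid_fragment_coverage(10))
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The default of 50,000 expected errors is the figure quoted in the note above; everything else is supplied by the caller.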
Figure 2. LFR haplotyping algorithm
(a) Variation extraction: Variations are extracted from the aliquot tagged reads. The 10-base Reed-Solomon codes enable tag recovery via error correction. (b) Heterozygous SNP-pair connectivity evaluation: The matrix of shared aliquots is computed for each heterozygous SNP-pair within a certain neighbourhood. Loop1 is over all the heterozygous SNPs on one chromosome. Loop2 is over all the heterozygous SNPs on the chromosome which are in the neighbourhood of the heterozygous SNPs in Loop1. This neighbourhood is constrained by the expected number of heterozygous SNPs and the expected fragment lengths. (c) Graph generation: An undirected graph is made, with nodes corresponding to the heterozygous SNPs and the connections corresponding to the orientation and the strength of the best hypothesis for the relationship between those SNPs. The orientation is binary and is shown in the figure with a colour. Red and green depict a flipped and unflipped relationship between heterozygous SNP pairs, respectively. The strength is defined by employing fuzzy logic operations on the elements of the shared aliquot matrix. (d) Graph optimization: The graph is optimized via a minimum spanning tree operation. (e) Contig generation: Each sub-tree is reduced to a contig by keeping the first heterozygous SNP unchanged, and flipping or not flipping the other heterozygous SNPs on the sub-tree, based on their paths to the first heterozygous SNP. The designation of Parent 1 (P1) and Parent 2 (P2) to each contig is arbitrary. The gaps in the chromosome-wide tree define the boundaries for different sub-trees/contigs on that chromosome. (f) Optional mapping of LFR contigs to parental chromosomes: Using parental information, a Mom or Dad label is placed on the P1 and P2 haplotypes of each contig.
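Steps (b) to (e) can be illustrated with a toy sketch. It is a simplified stand-in and not the published implementation: the fuzzy-logic strength is replaced by a plain normalised difference of co-occurrence counts, the spanning tree is built with Kruskal's algorithm on the cost `1 - strength`, and the neighbourhood is a fixed distance cutoff. All names (`shared_aliquot_matrices`, `score_edge`, `phase`, `max_distance`) and the input layout are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_aliquot_matrices(wells, max_distance):
    """Step (b): 2x2 co-occurrence counts for nearby heterozygous SNP pairs.

    wells is a list of {snp_position: allele} dicts, one per aliquot, with
    alleles coded 0 or 1.
    """
    matrices = defaultdict(lambda: [[0, 0], [0, 0]])
    for calls in wells:
        for i, j in combinations(sorted(calls), 2):
            if j - i <= max_distance:
                matrices[(i, j)][calls[i]][calls[j]] += 1
    return matrices


def score_edge(matrix):
    """Step (c): orientation and strength (simple stand-in for fuzzy logic)."""
    unflipped = matrix[0][0] + matrix[1][1]
    flipped = matrix[0][1] + matrix[1][0]
    strength = abs(unflipped - flipped) / (unflipped + flipped)
    return flipped > unflipped, strength


def phase(wells, max_distance=100_000):
    """Steps (d) and (e): spanning tree, then one flip map per contig."""
    edges = []
    for (i, j), matrix in shared_aliquot_matrices(wells, max_distance).items():
        is_flipped, strength = score_edge(matrix)
        if strength > 0:  # drop pairs with no orientation evidence
            edges.append((1.0 - strength, i, j, is_flipped))

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = defaultdict(list)
    for _cost, i, j, is_flipped in sorted(edges):  # Kruskal: cheapest first
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree[i].append((j, is_flipped))
            tree[j].append((i, is_flipped))

    contigs, seen = [], set()
    for start in sorted(tree):
        if start in seen:
            continue
        flips = {start: False}  # the first SNP of a contig is kept unchanged
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour, is_flipped in tree[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    flips[neighbour] = flips[node] ^ is_flipped
                    stack.append(neighbour)
        contigs.append(dict(sorted(flips.items())))
    return contigs


if __name__ == "__main__":
    toy_wells = [
        {100: 0, 200: 1}, {200: 1, 300: 0}, {100: 1, 200: 0},
        {200: 0, 300: 1}, {100: 0, 200: 1, 300: 0},
        {1_000_000: 0, 1_000_100: 0}, {1_000_000: 1, 1_000_100: 1},
    ]
    for contig in phase(toy_wells):
        print(contig)
```

Each returned dictionary is one contig. A value of `True` means the SNP is flipped relative to the first SNP of that contig, so if one haplotype carries allele 0 at the first SNP it carries allele 1 at every flipped SNP. SNPs that never share a well within the distance cutoff fall into separate contigs, which mirrors the gaps described in (e).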
Number of genes with multiple detrimental variations in each analysed sample
All phased SNPs were analysed by PolyPhen2[34] and a custom splice site detection algorithm (Supplementary Methods) to find variants with a high probability of coding for non-functional proteins. Only variants that were contained within the same contig for each gene were examined. Because LFR contigs are very long (N50>500 kb), very few variants were excluded based on this criterion. For each gene, 5 kb of the regulatory region upstream of the transcription start site and 1 kb downstream were scanned for SNVs that significantly altered over 300 transcription factor binding sites (TFBS)[38,39]. These potentially detrimental variations in TFBSs were also phased with coding SNPs to create a more comprehensive list of genes whose function and/or expression might be altered in these individuals (Supplementary Methods).
| Sample | Ethnicity | Coding Only: Both Alleles | Coding Only: One Allele | Coding and TFBS: Both Alleles | Coding and TFBS: One Allele |
|---|---|---|---|---|---|
| NA19240 Replicate 1 | Yoruban | 47 | 79 | 182 | 162 |
| NA19240 Replicate 2 | Yoruban | 55 | 85 | 207 | 174 |
| NA19240-10 cell pipeline 2.0 | Yoruban | 62 | 86 | 197 | 156 |
| NA19240-Replicate 1 High Coverage | Yoruban | 65 | 95 | 235 | 185 |
| NA19240-Replicates 1&2 combined | Yoruban | 65 | 99 | 241 | 197 |
| NA12877 Replicate 1 | European | 45 | 78 | 144 | 144 |
| NA12877 Replicate 2 | European | 44 | 82 | 146 | 141 |
| NA12877-Replicates 1&2 combined | European | 49 | 96 | 167 | 168 |
| NA12885 | European | 34 | 79 | 143 | 141 |
| NA12886 | European | 32 | 101 | 140 | 168 |
| NA12891 | European | 36 | 69 | 130 | 140 |
| NA12892 | European | 37 | 65 | 125 | 136 |
| NA20431 High Coverage | European | 36 | 70 | 115 | 127 |
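The counting behind a table like this can be sketched as follows. This is a hedged illustration rather than the authors' pipeline. It assumes that "Both Alleles" means a gene carries at least one detrimental variant on each phased haplotype, and that "One Allele" means two or more detrimental variants all sit on the same haplotype. The function name, the tuple layout, and the gene and contig labels are hypothetical. For simplicity, a gene whose variants are split across contigs is evaluated only on the contig holding most of them.

```python
from collections import defaultdict


def classify_genes(variants):
    """Classify genes with multiple phased detrimental variants.

    variants is an iterable of (gene, contig_id, haplotype) tuples, where
    haplotype is "P1" or "P2" within that contig.
    """
    by_gene = defaultdict(lambda: defaultdict(list))
    for gene, contig_id, haplotype in variants:
        by_gene[gene][contig_id].append(haplotype)

    result = {}
    for gene, contigs in by_gene.items():
        # Haplotype labels are only comparable within a single contig.
        haplotypes = max(contigs.values(), key=len)
        if len(haplotypes) < 2:
            continue  # fewer than two variants that can be compared
        if len(set(haplotypes)) == 2:
            result[gene] = "both alleles"
        else:
            result[gene] = "one allele"
    return result


if __name__ == "__main__":
    toy_variants = [
        ("GENE_A", "contig1", "P1"), ("GENE_A", "contig1", "P2"),
        ("GENE_B", "contig1", "P1"), ("GENE_B", "contig1", "P1"),
        ("GENE_C", "contig2", "P2"),
        ("GENE_D", "contig3", "P1"), ("GENE_D", "contig4", "P2"),
    ]
    print(classify_genes(toy_variants))
```

In the toy input, `GENE_D` is skipped because its two variants lie in different contigs, where the P1 and P2 labels are assigned independently and cannot be compared.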