| Literature DB >> 23741343 |
Julien Lajugie1, Rituparna Mukhopadhyay, Michael Schizas, Nathalie Lailler, Nicolas Fourel, Eric E Bouhassira.
Abstract
Phased genome maps are important to understand genetic and epigenetic regulation and disease mechanisms, particularly parental imprinting defects. Phasing is also critical to assess the functional consequences of genetic variants, and to allow precise definition of haplotype blocks which is useful to understand gene-flow and genotype-phenotype association at the population level. Transmission phasing by analysis of a family quartet allows the phasing of 95% of all variants as the uniformly heterozygous positions cannot be phased. Here, we report a phasing method based on a combination of transmission analysis, physical phasing by pair-end sequencing of libraries of staggered sizes and population-based analysis. Sequencing of a healthy Caucasians quartet at 120x coverage and combination of physical and transmission phasing yielded the phased genotypes of about 99.8% of the SNPs, indels and structural variants present in the quartet, a phasing rate significantly higher than what can be achieved using any single phasing method. A false positive SNP error rate below 10*E-7 per genome and per base was obtained using a combination of filters. We provide a complete list of SNPs, indels and structural variants, an analysis of haplotype block sizes, and an analysis of the false positive and negative variant calling error rates. Improved genome phasing and family sequencing will increase the power of genome-wide sequencing as a clinical diagnosis tool and has myriad basic science applications.Entities:
Mesh:
Year: 2013 PMID: 23741343 PMCID: PMC3669306 DOI: 10.1371/journal.pone.0064571
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1Sequencing strategy.
Libraries of three different sizes were prepared from each member of the quartet from family FNY01 and sequenced on an Illumina HiSeq 2000. Reads were then aligned using BWA and Novoalign and variants were called and filtrated. Then, MIEs were detected. After that, variants were phased by transmission and errors were called. Phasing was then refined by physical and population-based approaches. Finally, phasing from all approaches was merged and recombination blocks and error analysis were refined. Positions called as MIE, SCE and uncalled or partially called positions were imputed by Beagle.
Figure 2Phasing Strategy.
(A) Trio Phasing: the quartet was divided into 2 trios and all phasable positions (i.e. positions with at least one homozygous member) were phased. The children genotypes were arbitrarily ordered as follow: Maternal Allele | Paternal Allele. The parent genotypes were arbitrarily ordered as follow: Transmitted | Not Transmitted Allele. For each phasable position, trios were phased in three steps: First, we marked all heterozygous genotypes as phased. Then, we use the heterozygous variant to phase a parent-child pair. Finally, we phased the second parent-child pair. (B) Blocks of Inheritance: X-Y scatter plots were created for each parent by assigning a value of +1 to each phased variant where the children inherited the same parental chromosome and a value of -1 to each phased variant where the children inherited different chromosomes. These graphs were used to define the preliminary blocks of inheritance (see Figure S2 for more details).
Phasing rate.
| Sample | Phased # | Phased % | Unphased # | Unphased % | Het. Phased # | Het. Phased % | Het. Unphased % | |
|
|
|
|
|
|
|
|
|
|
| FNY01_2_2 | 1,645,173 | 91.43% | 8.57% | |||||
| FNY01_2_5 | 1,568,428 | 91.05% | 8.95% | |||||
| FNY01_3_2 | 1,647,335 | 91.44% | 8.56% | |||||
| FNY01_3_3 | 1,624,837 | 91.33% | 8.67% | |||||
|
|
|
|
|
|
|
|
|
|
| FNY01_2_2 | 2,509,520 | 64.65% | 1,372,351 | 35.35% | 438,386 | 24.21% | 75.79% | |
| FNY01_2_5 | 2,822,590 | 72.71% | 1,059,281 | 27.29% | 674,398 | 38.90% | 61.10% | |
| FNY01_3_2 | 2,860,926 | 73.70% | 1,020,945 | 26.30% | 790,581 | 43.64% | 56.36% | |
| FNY01_3_3 | 2,874,595 | 74.05% | 1,007,276 | 25.95% | 781,847 | 43.70% | 56.30% | |
|
|
|
|
|
|
|
|
|
|
| FNY01_2_2 | 1,711,834 | 95.13% | 4.87% | |||||
| FNY01_2_5 | 1,635,089 | 94.91% | 5.09% | |||||
| FNY01_3_2 | 1,713,996 | 95.14% | 4.86% | |||||
| FNY01_3_3 | 1,691,498 | 95.08% | 4.92% |
Statistics are based on phasing of the stringently filtered list of 3, 900,000 SNPs (see text).
Combined: transmission and read-backed phasing methods combined.
Phased # and %: Number and percentage of SNPs phased.
Unphased # and %: Number and percentage of SNPs that could not be phased. All unphased positions are positions where the four members of the quartet are heterozygous.
Het. Phased # and %: Number and percentage of heterozygous SNPs that could be phased.
Het. Unphased %: Percentage of heterozygous SNPs that could not be phased.
Error analysis, SNPs.
| SNP # | MIE # | MIE % | SCE # | SCE % | GDE-corrected | False positive | False negative | Ti/Tv | Ti/Tv Known | Ti/Tv Novel | DbSNP % | |
| GATK Calls | 5,163,231 | 153,552 | 2.91% | 227,866 | 4.32% | 4,781,813 | 2.66% | not applicable; | 2.01 | 2.07 | 1.16 | 95.55% |
| Read Depth Filtration | 4,958,825 | 139,757 | 2.76% | 152,586 | 3.01% | 4,666,482 | 2.07% | 2.47% | 2.03 | 2.08 | 1.15 | 95.91% |
| Recalibration (99 = PASS) | 4,269,078 | 35,465 | 0.83% | 25,935 | 0.36% | 4,207,678 | 0.48% | 13.64% | 2.12 | 2.13 | 2.02 | 98.14% |
| PL Filtration >20 | 3,892,936 | 5,318 | 0.14% | 5,408 | 0.07% | 3,882,210 | 0.09% | 23.17% | 2.12 | 2.13 | 2.01 | 98.13% |
| Hemizygous Correction | 3,892,936 | 4,713 | 0.12% | 5,408 | 0.14% | 3,882,815 | 0.09% | 23.15% | 2.12 | 2.13 | 2.01 | 98.13% |
SNP # = Number of SNP called (total # of SNPs – # of partially called SNPs).
GDE corrected = Number of SNPs that are neither MIE nor SCE.
#of False positive: 1/3 of GDEs (MIEs+SCEs); FALSE positive % = 100*0.33*(MIE+SCE)/GDE-corrected SNP.
False negative %: 100*[4,781,813– [GDEs - corrected SNPs] (after filter)]/[GDEs - corrected SNPs] (after filter)]].
Ti/Tv known = transition to transversion ratio of SNPs found in dbSNP.
Ti/Tv novel = transition to transversion ratio of SNPs not found in dbSNP.
Heterozygous phasing and MIE, SCE and uncalled SNPs imputation with Beagle.
| GATK (filter: RDF) | GATK (Filter: RDF, PASS, PL>20 ) | |
|
| ||
| Total # of phased SNP1 | 4,666,482 | 3,882,210 |
| SNP’s in HapMap | 4,490,885 | 3,738,248 |
| Number of unphased SNPs (quad. Het.)2 | 134,957 | 88,280 |
| Number of unphased SNPs (quad. Het.) in HapMap | 106,050 | 86,860 |
| Phasing rate (transmission plus physical phasing) | 93.1% | 97.80% |
| Phased by Beagle3 | 101,687 | 83,409 |
| Phasing rate (transmission plus physical phasing plus beagle)4 | 96.3% | 99.8% |
|
| ||
| Number of MIEs | 139,757 | 5,847 |
| Number of MIEs in HapMap | 61,531 | 4,688 |
| Number of MIEs imputed by Beagle | 54,887 | 4,433 |
| Number of SCEs | 152,586 | 5,408 |
| Number of SCEs in HapMap | 39,775 | 3,941 |
| Number SCE imputed by Beagle | 34,934 | 3,661 |
| Number of uncalled SNPs | 104,500 | 14,416 |
| Number of uncalled SNPs in HapMap | 36,392 | 10,990 |
| Number of uncalled SNP imputed by Beagle | 31,772 | 9,853 |
| Total # of Imputed positions | 121,593 | 17,947 |
1: GDE-corrected SNP number minus number of variants filtered with RDF (See Table 2).
2: All unphased SNPs are quadruple heterozygous SNPs.
3: inconsistencies between trios were not included. Position that were MIE, SCE or uncalled after GATK analysis and that were in the HapMap are not included,
4: SNPs were first phased by transmission, then by physical phasing and then using Beagle.
Figure 3Haplotypes.
Graphs illustrating haplotype structure in the quartet. A 550 Kb region of chromosome 5 is shown (chr 5∶97,464,122-98,064,122. Homozygous SNPs (red), heterozygous SNPs on chromosome A (blue) or heterozygous SNPs on chromosome B (green) were plotted and haplotypes were called using the island finder function of GenPlay. Regions in which islands were found only in the 1|1 track were named 1|1 blocks, regions in which islands were called only in either the 0|1 or 1|0 tracks were named 0|1 or 1|0 blocks. Regions where islands were found in either 1|1 and 0|1 or 1|1 and 1|0 or 0|1 and 1|0 or in all three tracks at the same time were named mixed-blocks. Regions in which no islands were found were named SNP-poor blocks.
Haplotype structure.
| 0|1 Blocks | 1|0 Blocks | 1|1 Blocks | Mixed Blocks | SNP Poor Blocks | Genome Wide | |
| Block # | 8,027 | 8,800 | 10,723 | 10,186 | 13,930 | 51,666 |
| SNP # | 512,526 | 573,193 | 819,557 | 979,448 | 263,673 | 3,148,397 |
| Total Length (bp) | 379,895,027 | 427,998,919 | 629,719,719 | 577,878,545 | 1,133,640,956 | 3,149,133,166 |
| Average Length (bp) | 47,333 | 48,642 | 58,726 | 56,733 | 81,387 | 58,564 |
| % Genome in Block | 12.06% | 13.59% | 20.00% | 18.35% | 36.00% | 100.00% |
| SNP Density (SNP/Kb) | 1.39 | 1.38 | 1.34 | 2.15 | 0.36 | 1 |
0|1 Blocks = Blocks rich in heterozygous SNPs on the second allele.
1|0 Blocks = Blocks rich in heterozygous SNPs on the first allele.
1|1 Blocks = Blocks rich in homozygous SNPs.
Mixed Blocks = Blocks rich in heterozygous on either allele and rich in homozygous SNPs.
SNP Poor Blocks = Regions poor in SNPs.
Error analysis, indels.
| Indel # | MIE # | MIE % | SCE # | SCE % | GDE-corrected | False positives | False negatives | Ins/Del | Ins/Del Known | Ins/Del Novel | Mills&1000G % | |
| GATK Calls | 714,192 | 31,858 | 4.46% | 91,431 | 12.80% | 590,903 | 6.88% | not applicable | 0.94 | 0.86 | 1.05 | 56.31% |
| Read Depth Filtration | 707,637 | 31,713 | 4.48% | 89,079 | 12.50% | 586,845 | 6.79% | 0.69% | 0.94 | 0.86 | 1.05 | 56.56% |
| Recalibration (99 = PASS) | 668,869 | 29,068 | 4.35% | 77,994 | 11.66% | 561,807 | 6.28% | 5.18% | 0.93 | 0.86 | 1.06 | 58.66% |
| PL Filtration >20 | 448,066 | 3,016 | 0.67% | 20,254 | 4.52% | 424,796 | 1.80% | 39.10% | 0.89 | 0.84 | 1.01 | 65.64% |
| Hemizygous Correction | 448,066 | 2,965 | 0.66% | 20,253 | 4.52% | 424,848 | 1.80% | 39.09% | 0.89 | 0.84 | 1.01 | 65.64% |
Indel # = Number of indels called.
GDE corrected = Number of indels that are neither MIE nor SCE.
False positive = 1/3 of GDE (MIE #+SCE #).
False negative %: 100*[590,903– [GDEs - corrected] (after filter)]/[GDEs - corrected] (after filter)]].
Ins/Del = ratio of insertion to deletion.
Ins/Del Known = ratio of insertion to deletion in indels found in Mills and 1000G.
Ins/Del Novel = ratio of insertion to deletion in indels not found in Mills and 1000G.
Structural variant numbers.
| Deletion # | Duplication # | |
| Total | 1,459 | 72 |
| SVs >100 Kb | 12 | 15 |
| SVs between 10 Kb–100 Kb | 102 | 25 |
| SVs between 1 Kb–10 Kb | 654 | 18 |
| SVs between 100 bp–1 Kb | 803 | 14 |
| Known SVs | 1,252 | 60 |
| Novel SVs | 319 | 17 |
| SVs that do not affect a gene | 879 | 21 |
| SVs that affect a gene | 580 | 51 |
| SVs that affect at least one protein coding gene | 546 | 38 |
| SV that affect at least 1 exon in protein coding gene | 125 | 38 |
Variant Genotypes.
| FNY01_2_2 | FNY01_2_5 | FNY01_3_2 | FNY01_3_3 | Quartet (Avg. ± Stdv) | 2_2 Relative to 2_5 | ||
| SNPs | Homozygous | 1,120,406 | 1,129,224 | 1,110,198 | 1,119,615 | 1,119,861±6,734 | 328,222 |
| Heterozygous | 1,799,927 | 1,723,224 | 1,802,157 | 1,779,490 | 1,776,200±31,836 | 2,313,081 | |
| Deletions (Indel) | Homozygous | 59,068 | 58,980 | 58,046 | 58,329 | 58,606±431 | 19,001 |
| Heterozygous | 109,928 | 106,840 | 114,241 | 112,672 | 110,920±2,816 | 131,983 | |
| Deletions (SV) | Homozygous | 445 | 428 | 499 | 549 | 480±47 | 151 |
| Heterozygous | 926 | 920 | 960 | 928 | 934±15 | 1,183 | |
| Insertions (Indels) | Homozygous | 63,670 | 63,121 | 61,885 | 62,489 | 62,791±670 | 16,607 |
| Heterozygous | 93,239 | 90,284 | 97,666 | 96,504 | 94,423±2,889 | 107,849 | |
| Insertions (SV) | Homozygous | 15 | 18 | 14 | 11 | 15±3 | 7 |
| Heterozygous | 36 | 32 | 38 | 39 | 36±3 | 60 |
2_2 Relative to 2_5 = Number of SNPs found in FNY01_2_2 when using FNY01_2_5 as reference.
Variant effects.
| FNY01_ 2_2 | FNY01_2_5 | FNY01_3_2 | FNY01_3_3 | Quartet average | |||
| SNPs | High impact | All genotypes | 641 | 601 | 627 | 633 | 625.5±15.0 |
| Moderate impact | All genotypes | 23,948 | 23,325 | 24,082 | 24,091 | 23861.5±314.9 | |
| Modifier | All genotypes | 31,474 | 31,193 | 31,852 | 32,022 | 31635.25±323.23 | |
| High impact (homozygous) | Homozygotes | 113 | 113 | 110 | 116 | 113±2.1 | |
| Double-heterozygotes | 3 | 1 | 1 | 1 | 1.5±0.9 | ||
| Double heterozygotes in trans | 2 | 0 | 0 | 0 | 0.5±0.9 | ||
| Double heterozygotes in cis | 1 | 1 | 1 | 1 | 1±0.0 | ||
| Phasing undetermined | 0 | 0 | 0 | 0 | 0 | ||
| High impact+Moderate | Homozygotes | 2266 | 2222 | 2231 | 2261 | 2245±18.9 | |
| Double-heterozygotes | 388 | 365 | 440 | 424 | 404.25±29.5 | ||
| Double heterozygotes in trans | 107 | 88 | 121 | 113 | 107.25±12.2 | ||
| Double heterozygotes in cis | 281 | 277 | 319 | 311 | 297±18.3 | ||
| Phasing undetermined | 55 | 64 | 62 | 51 | 58±5.2 | ||
| Percent in trans | 27.58% | 24.11% | 27.50% | 26.65% | 26.53% | ||
| Percent in cis | 72.42% | 75.89% | 72.50% | 73.35% | 73.47% | ||
| Indels | High impact | All genotypes | 444 | 417 | 482 | 422 | 441.25±25.6 |
| Moderate impact | All genotypes | 288 | 304 | 338 | 319 | 312.25±18.5 | |
| Double (moderate/high impact) | 38 | 38 | 40 | 37 | 38.25±1.1 | ||
| High Impact | Homozygotes | 97 | 96 | 99 | 95 | 96.75±1.5 | |
| Double-heterozygotes | 1 | 2 | 1 | 1 | 1.25±0.4 | ||
| Double heterozygotes in trans | 0 | 0 | 0 | 1 | 0.25±0.4 | ||
| Double heterozygotes in cis | 0 | 1 | 0 | 0 | 0.25±0.4 | ||
| Phasing undetermined | 1 | 1 | 1 | 1 | 1 | ||
| Percent in trans | 0.00% | 0.00% | 0.00% | 100.00% | 50% ±0.4 | ||
| Percent in cis | 0.00% | 50.00% | 0.00% | 0.00% | 50% ±0.2 |
High-impact variants were defined as variants that (1) formed or deleted a stop codon, (2) that altered a splice donor or acceptor site, (3) that caused a frameshift, or (4) that deleted one or more exon. Moderate-impact variants were defined as variants that altered a single amino-acid (non-synonymous substitution, codon insertion or deletion) or that deleted at least part of an exon.