Kimberly Pelak, Kevin V Shianna, Dongliang Ge, Jessica M Maia, Mingfu Zhu, Jason P Smith, Elizabeth T Cirulli, Jacques Fellay, Samuel P Dickson, Curtis E Gumbs, Erin L Heinzen, Anna C Need, Elizabeth K Ruzzo, Abanish Singh, C Ryan Campbell, Linda K Hong, Katharina A Lornsen, Alexander M McKenzie, Nara L M Sobreira, Julie E Hoover-Fong, Joshua D Milner, Ruth Ottman, Barton F Haynes, James J Goedert, David B Goldstein.
Abstract
We present the analysis of twenty human genomes to evaluate the prospects for identifying rare functional variants that contribute to a phenotype of interest. We sequenced at high coverage ten "case" genomes from individuals with severe hemophilia A and ten "control" genomes. We summarize the number of genetic variants emerging from a study of this magnitude, and provide a proof of concept for the identification of rare and highly penetrant functional variants by confirming that the cause of hemophilia A is easily recognizable in this data set. We also show that the number of novel single nucleotide variants (SNVs) discovered per genome seems to stabilize at about 144,000 new variants per genome, after the first 15 individuals have been sequenced. Finally, we find that, on average, each genome carries 165 homozygous protein-truncating or stop loss variants in genes representing a diverse set of pathways.
Year: 2010 PMID: 20838461 PMCID: PMC2936541 DOI: 10.1371/journal.pgen.1001111
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Summary of genomic and exonic coverage in the twenty sequenced genomes.
| Individual ID | Covered Genomic Bases (autosome) | % Covered Genomic Bases (autosome) | Genomic Coverage (autosome) | Covered Exonic Bases (autosome) | % Covered Exonic Bases (autosome) | Exonic Coverage (autosome) |
| hemo0001 | 2,670,818,496 | 99.61% | 30.4× | 63,569,350 | 97.10% | 26.6× |
| hemo0004 | 2,546,304,227 | 94.97% | 23.0× | 63,738,611 | 97.35% | 26.2× |
| hemo0005 | 2,575,871,501 | 96.07% | 36.2× | 64,204,173 | 98.06% | 48.4× |
| hemo0006 | 2,626,377,905 | 97.95% | 51.0× | 64,757,369 | 98.91% | 67.7× |
| hemo0007 | 2,545,912,138 | 94.95% | 34.2× | 63,830,877 | 97.49% | 44.6× |
| hemo0011 | 2,479,945,673 | 92.49% | 31.6× | 63,061,145 | 96.32% | 46.3× |
| hemo0017 | 2,636,817,553 | 98.34% | 33.4× | 64,562,534 | 98.61% | 38.6× |
| hemo0019 | 2,523,147,259 | 94.10% | 20.2× | 62,931,312 | 96.12% | 22.7× |
| hemo0020 | 2,616,985,451 | 97.60% | 36.4× | 64,558,006 | 98.61% | 45.2× |
| hemo0022 | 2,589,531,152 | 96.58% | 38.7× | 64,325,506 | 98.25% | 49.0× |
| Control 1 | 2,672,018,880 | 99.65% | 32.3× | 64,680,869 | 98.79% | 32.5× |
| Control 2 | 2,669,408,111 | 99.56% | 28.0× | 64,622,306 | 98.70% | 28.4× |
| Control 3 | 2,636,454,707 | 98.33% | 23.6× | 62,016,800 | 94.72% | 21.7× |
| Control 4 | 2,665,796,091 | 99.42% | 30.5× | 64,770,818 | 98.93% | 32.9× |
| Control 5 | 2,626,655,211 | 97.96% | 27.4× | 63,742,412 | 97.36% | 27.9× |
| Control 6 | 2,599,845,577 | 96.96% | 24.9× | 63,583,260 | 97.12% | 26.0× |
| Control 7 | 2,637,966,004 | 98.38% | 23.3× | 62,924,185 | 96.11% | 21.6× |
| Control 8 | 2,621,651,349 | 97.78% | 26.9× | 64,136,845 | 97.96% | 27.0× |
| Control 9 | 2,672,035,152 | 99.65% | 39.0× | 63,992,418 | 97.74% | 35.3× |
| Control 10 | 2,642,657,667 | 98.56% | 31.4× | 63,867,622 | 97.55% | 31.8× |
Coverage was defined as the percentage of bases in the genome/exome that have at least 5 reads with a Phred-like consensus score of greater than zero at that position. The total size for the autosomal genome is 2,681,301,098 bp, which is the total reference length of the autosomes (NCBI human genome assembly build 36) minus the reference sequence gaps (‘N’ calls in reference sequence). The total size for the autosomal exons is 65,471,109 bp, which is the total length of the autosomal exons, defined as all protein coding gene entries in Ensembl core database version 50 [13]. The Ensembl database version 50 is based on the NCBI human genome assembly build 36 as well as its annotations (GenBank).
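The percentage columns in the table follow from dividing the covered bases by the two totals stated above. A minimal sketch of that arithmetic, using the hemo0001 row; the function name `percent_covered` is illustrative and not taken from the paper:

```python
# Denominators as stated in the table footnote (NCBI build 36, Ensembl v50).
AUTOSOMAL_GENOME_BP = 2_681_301_098
AUTOSOMAL_EXON_BP = 65_471_109


def percent_covered(covered_bases: int, total_bases: int) -> float:
    """Return covered bases as a percentage of the total."""
    return 100.0 * covered_bases / total_bases


# Covered-base counts from the hemo0001 row of the coverage table.
genomic = percent_covered(2_670_818_496, AUTOSOMAL_GENOME_BP)
exonic = percent_covered(63_569_350, AUTOSOMAL_EXON_BP)
print(f"{genomic:.2f}% genomic, {exonic:.2f}% exonic")
```

Rounded to two decimals, these reproduce the 99.61% and 97.10% reported for hemo0001.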
Figure 1. Average per-genome overlap between SNVs in genomic databases and SNVs identified by whole-genome sequencing.
On average, 3,473,639 SNVs were observed in each genome (Table S2). A per-genome average of 87.28% of these SNVs were present in the dbSNP database (version 129, validated) (Table S3).
Figure 2. Concordance between sequencing and genotyping calls.
The sequenced samples were also run on either the Illumina Human 1M-Duo v3 BeadChip or the Illumina 610-Quad BeadChip. The concordance rate between the sequencing and the Illumina BeadChip genotype calls is plotted against sequencing coverage of the autosomes. A data point is plotted for each of the twenty genomes.
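The caption does not spell out how the concordance rate was computed. One common reading, sketched here as an assumption, is the fraction of sites genotyped on both platforms at which the two calls agree. The function name, site identifiers, and genotypes below are invented for illustration:

```python
def concordance_rate(seq_calls: dict, chip_calls: dict) -> float:
    """Fraction of shared sites where both platforms report the same genotype.

    Genotypes are compared as unordered allele pairs, so "AG" equals "GA".
    Sites missing from either platform are skipped.
    """
    shared = seq_calls.keys() & chip_calls.keys()
    if not shared:
        raise ValueError("no sites genotyped on both platforms")
    matches = sum(
        sorted(seq_calls[site]) == sorted(chip_calls[site]) for site in shared
    )
    return matches / len(shared)


# Toy data only: three shared sites, two of which agree.
seq = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": "AC"}
chip = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs5": "GG"}
print(concordance_rate(seq, chip))
```

Plotting this value against each genome's autosomal coverage would give one point per genome, as in the figure.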
Figure 3. Coding indel length distribution.
Shown is a side-by-side comparison of the lengths of the coding indels in this study and in a previous publication [26]. (A) Indel lengths observed in J.C. Venter's exome [26] versus (B) indel lengths observed in this study. The data from our study have been restricted to the canonical genes or transcripts that are captured by the Agilent SureSelect Targeted Enrichment system. Indels that are a multiple of 3 bp in length are marked in green.
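Indels whose length is a multiple of 3 bp leave the reading frame intact, which is why the figure highlights them. A minimal sketch of the tally behind such a plot; the function name and the example lengths are hypothetical, not data from the study:

```python
from collections import Counter


def summarize_indel_lengths(lengths):
    """Count indels per length and return the fraction that are multiples of 3 bp."""
    if not lengths:
        raise ValueError("no indel lengths supplied")
    counts = Counter(lengths)
    in_frame = sum(n for length, n in counts.items() if length % 3 == 0)
    return counts, in_frame / len(lengths)


# Invented indel lengths in bp, for illustration only.
example_lengths = [1, 1, 2, 3, 3, 6, 4, 9]
counts, in_frame_fraction = summarize_indel_lengths(example_lengths)
print(sorted(counts.items()), in_frame_fraction)
```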
Prioritization of protein-truncating or stop loss variants enriched in hemophilia samples.
| Rank | Gene | # controls het (homo, but with low coverage) | SNV Count | Indel Count | Total Count | Comment |
| 1 | | 0 | 1 | 4 | 5 | Visual inspection of the alignment shows a sixth sample that also has a deletion in |
| 2 | | 2 | 0 | 5 | 5 | |
| 3 | | 4 | 0 | 4 | 4 | |
| 4 | | 2 (+1) | 4 | 0 | 4 | |
| 5 | | 2 (+1) & 0 | 0 | 3 | 3 | 2 variants. Both occur in the same cases, with the same zygosity |
Only canonical genes (with a defined HUGO Gene Nomenclature Committee (HGNC) database entry [26]) were included.
See Table S14 for a full list of genes.
Figure 4. Rank of the F8 gene as the number of control genomes increases.
The gene ranking was ordered by the number of case genomes that carried protein-truncating or stop loss variants, in homozygous form or on the X-chromosome, that were not present in control genomes in homozygous form. Ranking was performed with a “gene prioritization” function implemented in the SVA software tool [21] (Text S1). Protein-truncating variants were defined as SNVs that cause a premature stop codon, and insertions or deletions that cause a frameshift coding change. The ranks represent an average taken from five permutations. When comparing 10 hemophilia cases to just one control, F8 ranks in the top 40 genes. Once 5 or more controls are available, it ranks in the top 5 genes.
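The ranking rule in this caption can be sketched as follows. This is not the SVA implementation, whose internals are not described here; it is a simplified illustration in which each genome is a set of `(gene, variant_id)` pairs for its homozygous or X-linked protein-truncating or stop loss variants. All names and the toy data are assumptions, and ties are broken alphabetically, which is an arbitrary choice:

```python
import random
from collections import Counter


def rank_genes(cases, controls):
    """Rank genes by how many cases carry a qualifying variant absent from controls."""
    control_variants = set().union(*controls)
    counts = Counter()
    for genome in cases:
        # Count each gene at most once per case genome.
        genes = {gene for gene, variant in genome
                 if (gene, variant) not in control_variants}
        counts.update(genes)
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def mean_rank(gene, cases, controls, n_controls, n_permutations=5, seed=0):
    """Average a gene's rank over random control subsets of a fixed size."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_permutations):
        subset = rng.sample(controls, n_controls)
        ranking = rank_genes(cases, subset)
        # A gene with no qualifying case is placed just below the ranked genes.
        ranks.append(ranking.get(gene, len(ranking) + 1))
    return sum(ranks) / len(ranks)


# Toy data: COMMON1 and GENE_B are invented gene names.
cases = [
    {("F8", "v1"), ("COMMON1", "c1")},
    {("F8", "v2"), ("COMMON1", "c1")},
    {("F8", "v3"), ("COMMON1", "c1"), ("GENE_B", "b1")},
]
controls = [{("COMMON1", "c1")}, {("GENE_B", "b1")}]

print(rank_genes(cases, []))
print(rank_genes(cases, controls))
print(mean_rank("F8", cases, controls, n_controls=2))
```

In the toy data, a variant shared by every case ties with F8 until a control carrying it is added, which mirrors how additional controls filter out common variants and move the causal gene up the ranking.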
Figure 5. Number of novel SNVs and novel knocked-out genes as the number of genomes increases.
The total number of novel variants, and the total number of novel genes containing protein-truncating or stop loss variants, continue to drop as additional genomes are added to the analysis. Shown are the number of unique SNVs (A) and unique genes carrying a homozygous protein-truncating or stop loss variant (B) per genome, as a function of the number of genomes already considered. The genomes were added in a random order to both analyses, and 1000 permutations were performed and averaged.
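The permutation procedure described in this caption can be sketched as below, assuming each genome is represented as a set of variant (or gene) identifiers. The function name and the three toy genomes are hypothetical and do not come from the study:

```python
import random


def mean_novel_per_genome(genomes, n_permutations=1000, seed=0):
    """Average count of previously unseen items at each position of a random ordering.

    Entry k of the result is the mean number of items in the (k+1)-th genome
    added that were not present in any genome added before it.
    """
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        seen = set()
        for k, variants in enumerate(order):
            totals[k] += len(variants - seen)
            seen |= variants
    return [t / n_permutations for t in totals]


# Toy genomes with overlapping variant identifiers.
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
curve = mean_novel_per_genome(toy_genomes)
print(curve)
```

The first entry always equals the size of a single genome's set, and later entries shrink as more of each new genome has already been seen, which is the trend the figure plots.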