The first Irish genome and ways of improving sequence accuracy.

Young Seok Ju, Yun Joo Yoo, Jong-Il Kim, Jeong-Sun Seo.

Abstract

Whole-genome sequencing of an Irish person reveals hundreds of thousands of novel genomic variants. Imputation using previous known information improves the accuracy of low-read-depth sequencing.

Year: 2010 PMID: 20815917 PMCID: PMC2965376 DOI: 10.1186/gb-2010-11-9-132

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Research highlight

In the past 10 years, numerous human genomic variants have been discovered and catalogued, mostly through the efforts of the International HapMap project and personal genome studies [1]. Information on human genomic variants may serve as a valuable resource for developing personalized medicine because some of these variants could predispose humans to complex diseases. The 2009 version (version 130) of the dbSNP database included approximately 13.9 million (13.9 M) single nucleotide polymorphisms (SNPs) and 4.5 M small insertions and deletions (indels). However, many issues need to be addressed before personalized medicine becomes a reality. These include an understanding of: the kind and number of variants that exist in the entire human genome; the number of populations and individuals needed to detect most, if not all, human genomic variants with efficiency and accuracy; the frequency of common and rare variants in an individual genome; and finally the number of variants that influence human diseases.

Recent advancements in next-generation sequencing technology have dramatically decreased both the cost and the time required for sequencing [1]. Consequently, in the past few years, sequencing of individual genomes (personal genomes) has gained in popularity. Currently, whole-genome sequencing is thought to be the best way to detect human genomic variations because it can detect novel variants [2]. Excluding cancer genomes, at least 15 personal genomes have so far been sequenced and analyzed using various platforms (Figure 1).

In this issue of Genome Biology, Tong and colleagues [3] present data on the first whole-genome sequence of an Irish person, obtained using the Illumina Genome Analyzer platform. As the authors [3] suggest, the Irish population could be a good candidate for genomic studies because it is isolated and located on the western fringes of Europe, and thus may possess many polymorphisms unique to this population.

Figure 1

Personal genomes sequenced so far. Each male (yellow) or female (pink) sequenced is shown in the approximate geographical position of their ethnic origin. The name or codename and ethnicity of the individual sequenced, the date of publication, the sequencing platform used, the coverage and the reference are given for each genome. Platforms: 3730xl (Applied Biosystems by Life Technologies, Carlsbad, CA, USA); Complete Genomics Analysis Platform (CGA) (Complete Genomics, Mountain View, CA, USA); FLX (Roche, Basel, Switzerland); Genome Analyzer (GA) (Illumina, San Diego, CA, USA); Heliscope (Helicos BioSciences, Cambridge, MA, USA); SOLiD (Applied Biosystems by Life Technologies).

The authors [3] generated 440 M short reads from the Irish genome and obtained 11X sequencing coverage genome-wide. Despite the lower read-depth compared with other personal genomes (Figure 1), they discovered more than 3 M SNPs. Approximately 13% of these SNPs (0.4 M, approximately 3% of the total number of SNPs catalogued in dbSNP version 130) may be designated as new variants, as they had not previously been deposited in the SNP database. They also found more than 20,000 potentially disease-related new SNPs. For example, they identified a new non-synonymous SNP in the macrophage-stimulating 1 (MST1) gene, which may have a functional role in inflammatory bowel disease. In addition, the authors detected about 200,000 short indel polymorphisms, half of which have not been reported before. Their results [3] clearly suggest that the human genome still harbors a tremendous number of undetected and often population-specific variants, and they provide justification for more personal genome sequencing studies from worldwide populations.

Sequencing read-depth: improving accuracy by imputation

Despite these interesting results from the Irish genome analysis [3], its low sequencing read-depth (11X) must be examined in some detail. With the exception of the first two personal genomes, which were sequenced using relatively longer reads, most of the other human whole-genome analyses were carried out using more than 20X sequencing coverage [1]. Low coverage may dramatically reduce the accuracy of genome sequencing because it risks misclassification of heterozygous variants as wild type (missing the variant; this is called undercalling) or misclassification of heterozygous variants as homozygous ones (missing the wild-type allele; overcalling). Consequently, in low-depth sequencing, both the sensitivity of detection and the positive predictive value of genomic variants are compromised [4]. When we look at the data from individual genomes analyzed by high-depth sequencing using the Illumina platform, most of the personal genomes have more than 3.4 M SNPs, approximately 0.3 M more than the number for the Irish genome (Figure 2a) [4-7]. Recently published personal genomes of European origin also have more than 3.4 M SNPs (NA10851 and Lupski) [8,9]. From this perspective, if the Irish genome were sequenced at higher depth, it could potentially reveal an additional 0.3 M SNPs (Figure 2a). Many of the additional variants might be heterozygous and novel (Figure 2b). Furthermore, higher-depth sequencing would not only increase the total number of genome variants, but would also bring the false discovery rate down to less than 0.1% from the current 1.4%.
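The undercalling risk described above can be made concrete with a simplified model. The sketch below is not the method used by Tong and colleagues [3]. It assumes that each read at a heterozygous site samples either allele with probability 0.5, that read-depth across sites is Poisson-distributed around the mean coverage, and that a variant is called only when at least `min_alt_reads` reads carry it. The function names and the two-read threshold are illustrative assumptions, and sequencing errors and mapping bias are ignored.

```python
from math import comb, exp


def prob_undercall(depth: int, min_alt_reads: int = 2) -> float:
    """Chance that a heterozygous site covered by `depth` reads shows
    fewer than `min_alt_reads` reads carrying the variant allele."""
    # Each read carries the variant allele with probability 0.5 (assumption),
    # so the number of variant reads is Binomial(depth, 0.5).
    too_few = sum(comb(depth, k) for k in range(min_alt_reads))
    return too_few / 2 ** depth


def prob_undercall_poisson(mean_depth: float, min_alt_reads: int = 2,
                           max_depth: int = 200) -> float:
    """Average the per-site risk over Poisson-distributed read-depth."""
    p_depth = exp(-mean_depth)  # Poisson probability of depth 0
    total = p_depth * prob_undercall(0, min_alt_reads)
    for depth in range(1, max_depth + 1):
        p_depth *= mean_depth / depth  # Poisson recurrence P(d) = P(d-1) * m / d
        total += p_depth * prob_undercall(depth, min_alt_reads)
    return total


if __name__ == "__main__":
    for coverage in (5, 11, 20, 30):
        risk = prob_undercall_poisson(coverage)
        print(f"{coverage:>2}X mean coverage: undercall risk {risk:.2e}")
```

Under these assumptions, a heterozygous site covered by exactly 11 reads is undercalled with probability 12/2048 (about 0.6%), and the depth-averaged risk falls steeply as mean coverage rises from 11X to 30X. Real variant callers use base qualities and more elaborate thresholds, so the numbers are indicative only.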

Figure 2

The influence of read-depth on discovering personal genomic variants. Sequences included in our analysis are represented by blue squares, with names or codenames as shown in Figure 1. The lines indicate correlations; the YH genome (red square) was not included in the correlation analysis because it is thought to be an outlier, possibly owing to its higher percentage of single-end-read sequencing. (a) Higher read-depth sequencing can reveal more SNPs, as shown by the positive correlation between read-depth and number of SNPs detected. (b) The ratio of heterozygous to homozygous SNPs shows a positive relationship with read-depth of sequencing, suggesting that high-depth sequencing can detect heterozygous SNPs better.

We could also consider the relative merits of personal genome sequencing from another perspective. Do we want all personal genome sequencing to exceed 99.9% accuracy? If personal genome data are not used for diagnostic purposes, why should we invest a lot of resources, time and effort in doing additional 10X to 20X sequencing to boost the accuracy from 99% to 99.9%? With limited resources, precise estimation of an individual's genetic variation is in direct conflict with analyzing as many individual genomes as possible to obtain a broader picture of the genomic architecture of a given population. For instance, if one is not interested in understanding the detailed genomic architecture of a specific person, but only in gaining a broader understanding of genomic characteristics of an ethnic group, then it would be more prudent to sequence many individuals with lower depth than a limited number with high depth. One of the attractive features of the study by Tong and colleagues [3] is that they have suggested ways to improve the precision of low-coverage sequencing without investing additional resources.
The authors [3] have demonstrated that the accuracy of genotype calls at known SNPs in low-depth sequencing can be dramatically improved by integrating the genotype and haplotype data previously assembled for European populations by the HapMap and 1000 Genomes projects. They have shown that imputation based on these external datasets can achieve over 99% accuracy with only 5X sequencing coverage. What is even more interesting is that just 2X sequencing can provide genotype calls with over 95% accuracy. These tantalizing observations suggest that even low-depth sequencing can be effective when detailed prior information on related genomes is available. In addition, with accurate genomic data on Irish genomes, the power of imputation methods could be even better, and this would also be the case for other populations. These predictions further emphasize the need for additional personal whole-genome sequencing of a large number of individuals from diverse ethnic groups.

Evidence for positive selection

In the past, many investigators have reported signatures of selection in the human genome. Tong and colleagues [3] have used an interesting approach to study positive selection in the human genome using the Irish genome and the available sequence data on nine personal genomes from previous studies. Despite the small sample size and varied sequencing methods used in previous studies, this attempt can be considered as an initial step toward developing an 'official' whole-genome population genetics study. Thus, this study may give a taste of future insights into population genetics research and of some of the challenges specific to whole human genome data. This study has shown evidence for balancing selection at the sites related to olfactory and taste receptors, mostly confirming the previous results from genome-wide SNP studies. Also, their analysis of ten genomes reveals elevated positive selection in fairly recently duplicated genes. Taken together, these results [3] clearly show that whole-genome analysis can shed new light on the field of human evolutionary genetics.

Some population statistics based on haplotype patterns would benefit from complete sequence data, since relatively accurate haplotype phase can be inferred. Recently, Higasa and colleagues [10] showed that errors of population-based haplotype inference affected the results of some statistics for positive selection more than others. Currently available haplotype inference software may have limitations depending on the data size and availability of external information. Development of accurate haplotyping and haplotype inference methods suitable for genome sequence data will be a key to successful population genetics studies using haplotype information.

Ten years have passed since the first drafts of human genome sequences were published, and we now have at least 15 individual whole-genome sequences, thanks to the dramatic progress in sequencing technology. However, many questions about human genome diversity remain unanswered. To expand our understanding, we need more personal genome data from worldwide populations. With limited resources, quality (accuracy) and quantity (number of individuals) are always difficult to balance. The study by Tong and colleagues [3] is the first attempt to tackle this question. This approach is the first of its kind and will likely be improved on in the near future as other researchers see its potential. Nevertheless, the method and the findings will help to open new prospects in the field of human genome research.
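The balance between quality and quantity can be explored with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions rather than an analysis from the article: a fixed sequencing budget is expressed in units of genome coverage, each chromosome carries a given variant independently with probability equal to its population frequency, carriers are treated as heterozygous, variant-supporting reads at a carrier site follow a Poisson distribution with mean equal to half the depth, and a variant is detected when at least `min_alt_reads` reads carry it. All names and numbers are hypothetical.

```python
from math import exp


def detect_prob(depth: float, min_alt_reads: int = 2) -> float:
    """Chance of detecting a variant in a heterozygous carrier at this depth."""
    mean_alt = depth / 2          # expected reads carrying the variant
    term = exp(-mean_alt)         # Poisson probability of 0 variant reads
    miss = 0.0
    for j in range(min_alt_reads):
        miss += term
        term *= mean_alt / (j + 1)  # step to the next Poisson term
    return 1 - miss


def discovery_prob(freq: float, budget: float, depth: float,
                   min_alt_reads: int = 2) -> float:
    """Chance that a variant of population frequency `freq` is found at least
    once when `budget` units of coverage are split into genomes of `depth`."""
    n_individuals = int(budget // depth)
    per_chromosome = freq * detect_prob(depth, min_alt_reads)
    return 1 - (1 - per_chromosome) ** (2 * n_individuals)


if __name__ == "__main__":
    budget = 300  # total coverage units, e.g. 10 genomes at 30X
    for depth in (5, 10, 30):
        found = discovery_prob(freq=0.01, budget=budget, depth=depth)
        print(f"{budget // depth:>2} genomes at {depth:>2}X: "
              f"P(discover 1% variant) = {found:.2f}")
```

In this toy model, spreading the budget over many low-depth genomes finds a 1% variant more often than concentrating it on a few deep genomes, at the cost of less reliable genotypes in each individual. The model ignores false positives, which grow at low depth, so it shows only one side of the trade-off.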

9 in total

1. Personal genome sequencing: current approaches and challenges.

Authors: Michael Snyder; Jiang Du; Mark Gerstein
Journal: Genes Dev Date: 2010-03-01 Impact factor: 11.361

2. Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy.

Authors: James R Lupski; Jeffrey G Reid; Claudia Gonzaga-Jauregui; David Rio Deiros; David C Y Chen; Lynne Nazareth; Matthew Bainbridge; Huyen Dinh; Chyn Jing; David A Wheeler; Amy L McGuire; Feng Zhang; Pawel Stankiewicz; John J Halperin; Chengyong Yang; Curtis Gehman; Danwei Guo; Rola K Irikat; Warren Tom; Nick J Fantin; Donna M Muzny; Richard A Gibbs
Journal: N Engl J Med Date: 2010-03-10 Impact factor: 91.245

3. The diploid genome sequence of an Asian individual.

Authors: Jun Wang; Wei Wang; Ruiqiang Li; Yingrui Li; Geng Tian; Laurie Goodman; Wei Fan; Junqing Zhang; Jun Li; Juanbin Zhang; Yiran Guo; Binxiao Feng; Heng Li; Yao Lu; Xiaodong Fang; Huiqing Liang; Zhenglin Du; Dong Li; Yiqing Zhao; Yujie Hu; Zhenzhen Yang; Hancheng Zheng; Ines Hellmann; Michael Inouye; John Pool; Xin Yi; Jing Zhao; Jinjie Duan; Yan Zhou; Junjie Qin; Lijia Ma; Guoqing Li; Zhentao Yang; Guojie Zhang; Bin Yang; Chang Yu; Fang Liang; Wenjie Li; Shaochuan Li; Dawei Li; Peixiang Ni; Jue Ruan; Qibin Li; Hongmei Zhu; Dongyuan Liu; Zhike Lu; Ning Li; Guangwu Guo; Jianguo Zhang; Jia Ye; Lin Fang; Qin Hao; Quan Chen; Yu Liang; Yeyang Su; A San; Cuo Ping; Shuang Yang; Fang Chen; Li Li; Ke Zhou; Hongkun Zheng; Yuanyuan Ren; Ling Yang; Yang Gao; Guohua Yang; Zhuo Li; Xiaoli Feng; Karsten Kristiansen; Gane Ka-Shu Wong; Rasmus Nielsen; Richard Durbin; Lars Bolund; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang
Journal: Nature Date: 2008-11-06 Impact factor: 49.962

4. Reference-unbiased copy number variant analysis using CGH microarrays.

Authors: Young Seok Ju; Dongwan Hong; Sheehyun Kim; Sung-Soo Park; Sujung Kim; Seungbok Lee; Hansoo Park; Jong-Il Kim; Jeong-Sun Seo
Journal: Nucleic Acids Res Date: 2010-08-27 Impact factor: 16.971

5. A highly annotated whole-genome sequence of a Korean individual.

Authors: Jong-Il Kim; Young Seok Ju; Hansoo Park; Sheehyun Kim; Seonwook Lee; Jae-Hyuk Yi; Joann Mudge; Neil A Miller; Dongwan Hong; Callum J Bell; Hye-Sun Kim; In-Soon Chung; Woo-Chung Lee; Ji-Sun Lee; Seung-Hyun Seo; Ji-Young Yun; Hyun Nyun Woo; Heewook Lee; Dongwhan Suh; Seungbok Lee; Hyun-Jin Kim; Maryam Yavartanoo; Minhye Kwak; Ying Zheng; Mi Kyeong Lee; Hyunjun Park; Jeong Yeon Kim; Omer Gokcumen; Ryan E Mills; Alexander Wait Zaranek; Joseph Thakuria; Xiaodi Wu; Ryan W Kim; Jim J Huntley; Shujun Luo; Gary P Schroth; Thomas D Wu; HyeRan Kim; Kap-Seok Yang; Woong-Yang Park; Hyungtae Kim; George M Church; Charles Lee; Stephen F Kingsmore; Jeong-Sun Seo
Journal: Nature Date: 2009-07-08 Impact factor: 49.962

6. Complete Khoisan and Bantu genomes from southern Africa.

Authors: Stephan C Schuster; Webb Miller; Aakrosh Ratan; Lynn P Tomsho; Belinda Giardine; Lindsay R Kasson; Robert S Harris; Desiree C Petersen; Fangqing Zhao; Ji Qi; Can Alkan; Jeffrey M Kidd; Yazhou Sun; Daniela I Drautz; Pascal Bouffard; Donna M Muzny; Jeffrey G Reid; Lynne V Nazareth; Qingyu Wang; Richard Burhans; Cathy Riemer; Nicola E Wittekindt; Priya Moorjani; Elizabeth A Tindall; Charles G Danko; Wee Siang Teo; Anne M Buboltz; Zhenhai Zhang; Qianyi Ma; Arno Oosthuysen; Abraham W Steenkamp; Hermann Oostuisen; Philippus Venter; John Gajewski; Yu Zhang; B Franklin Pugh; Kateryna D Makova; Anton Nekrutenko; Elaine R Mardis; Nick Patterson; Tom H Pringle; Francesca Chiaromonte; James C Mullikin; Evan E Eichler; Ross C Hardison; Richard A Gibbs; Timothy T Harkins; Vanessa M Hayes
Journal: Nature Date: 2010-02-18 Impact factor: 49.962

7. Sequencing and analysis of an Irish human genome.

Authors: Pin Tong; James G D Prendergast; Amanda J Lohan; Susan M Farrington; Simon Cronin; Nial Friel; Dan G Bradley; Orla Hardiman; Alex Evans; James F Wilson; Brendan Loftus
Journal: Genome Biol Date: 2010-09-07 Impact factor: 13.583

8. Evaluation of haplotype inference using definitive haplotype data obtained from complete hydatidiform moles, and its significance for the analyses of positively selected regions.

Authors: Koichiro Higasa; Yoji Kukita; Kiyoko Kato; Norio Wake; Tomoko Tahira; Kenshi Hayashi
Journal: PLoS Genet Date: 2009-05-08 Impact factor: 5.917

9. Accurate whole human genome sequencing using reversible terminator chemistry.

Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal: Nature Date: 2008-11-06 Impact factor: 49.962
