TIARA: a database for accurate analysis of multiple personal genomes based on cross-technology.

Dongwan Hong, Sung-Soo Park, Young Seok Ju, Sheehyun Kim, Jong-Yeon Shin, Sujung Kim, Saet-Byeol Yu, Won-Chul Lee, Seungbok Lee, Hansoo Park, Jong-Il Kim, Jeong-Sun Seo.

Abstract

High-throughput genomic technologies have been used to explore personal human genomes for the past few years. Although the integration of technologies is important for high-accuracy detection of personal genomic variations, no databases have been prepared to systematically archive genomes and to facilitate the comparison of personal genomic data sets prepared using a variety of experimental platforms. We describe here the Total Integrated Archive of Short-Read and Array (TIARA; http://tiara.gmi.ac.kr) database, which contains personal genomic information obtained from next generation sequencing (NGS) techniques and ultra-high-resolution comparative genomic hybridization (CGH) arrays. This database improves the accuracy of detecting personal genomic variations, such as SNPs, short indels and structural variants (SVs). At present, 36 individual genomes have been archived and may be displayed in the database. TIARA supports a user-friendly genome browser, which retrieves read-depths (RDs) and log2 ratios from NGS and CGH arrays, respectively. In addition, this database provides information on all genomic variants and the raw data, including short reads and feature-level CGH data, through anonymous file transfer protocol. More personal genomes will be archived as more individuals are analyzed by NGS or CGH array. TIARA provides a new approach to the accurate interpretation of personal genomes for genome research.

Year: 2010 PMID: 21051338 PMCID: PMC3013693 DOI: 10.1093/nar/gkq1101

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recently developed high-throughput DNA technologies have revolutionized human genomics. Massively parallel sequencing—next generation sequencing (NGS)—has been used to analyze nearly 20 personal genomes (1–10). The cost of sequencing a single genome is decreasing dramatically, and we are now approaching an era in which personal genomic sequencing will cost US$1000. The sequencing of a large number of individual genomes, possibly more than 1000, is expected to be complete within the next year (http://www.1000genomes.org). Current sequencing technologies, which provide sufficient read depth (RD), enable the detection of genome-wide SNPs and short indels with >99.9% accuracy (3,4). Comparative genomic hybridization (CGH) arrays have been used to detect copy number variants (CNVs), a major type of structural variant (SV) in the human genome (11–16). CNVs are irregular in size and often reside in ambiguous regions (e.g. repetitive sequences) making them difficult to detect by NGS technologies alone. Although several sequencing approaches have attempted to detect CNVs (6,17–19), CGH arrays remain a standard approach to CNV detection (11–16). Human genomic variants are believed to have important functional impacts on human biology and medicine. To evaluate the potential biological functions of the large number of variants, it is essential to develop intuitive methods for comparing multiple genomes using raw-level data generated by diverse technologies. Moreover, the cooperative integration of different genomic technologies is necessary for high-accuracy detection of variants, especially of CNVs (6,10,16). Although many genomic databases and browsers have been developed (20–24), the comparison and integration of genomic data sets from different platforms is not yet feasible. We describe here a new database, the Total Integrated Archive of Short-Read and Array (TIARA; http://tiara.gmi.ac.kr) database, integrated with a genome browser. 
This database accumulates raw-level personal genomic data from whole genome NGS and CGH arrays. At present, it contains 36 individual genomic data sets that have been analyzed and reported by the Genomic Medicine Institute (GMI) at Seoul National University (6,10,16). To retrieve the large quantities of genomic data in real-time on TIARA, we have implemented an efficient index using Apache Lucene (http://lucene.apache.org) along with client-side Asynchronous JavaScript and XML (AJAX) scripts that reduce the volume of data exchange and processing within the web server.

MATERIALS AND METHODS

Massively parallel sequencing data

TIARA contains massively parallel sequencing data from five individuals, three of whom, AK1 (6), AK2 (16) and NA10851 (10,16), have been described previously. The other two genomes deposited in TIARA, AK4 and AK6, were sequenced using the Illumina Genome Analyzer. The average aligned RDs for the five individuals ranged from 22.3x to 27.8x (per-sample values are given in Table 1). The details of the whole genome sequencing process have been described previously (6,10,16). Briefly, short reads from the Illumina Genome Analyzer and AB SOLiD were aligned to the human reference genome build 36.3 using the GSNAP and BioScope alignment tools, respectively (6,16,25). The RDs were then adjusted for the effects of GC content as described previously (10,18).
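The GC adjustment mentioned above can be illustrated with a minimal sketch. It assumes a median-ratio correction over fixed-size windows, which is one widely used form of GC correction; the source does not spell out the exact procedure of references (10,18), so the function name, the window representation and the binning by whole GC percent are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median

def gc_adjust_read_depth(windows):
    """Adjust per-window read depth for GC content (illustrative sketch).

    windows: list of (read_depth, gc_percent) tuples, one per genomic window.
    Each depth is scaled by (overall median depth) / (median depth of
    windows that share the same rounded GC percentage).
    """
    overall_median = median(rd for rd, _ in windows)

    # Group read depths by GC percentage rounded to a whole percent (assumption).
    depths_by_gc = defaultdict(list)
    for rd, gc in windows:
        depths_by_gc[round(gc)].append(rd)
    gc_median = {gc: median(rds) for gc, rds in depths_by_gc.items()}

    adjusted = []
    for rd, gc in windows:
        m_gc = gc_median[round(gc)]
        # Guard against GC bins with zero median depth.
        adjusted.append(rd * overall_median / m_gc if m_gc > 0 else 0.0)
    return adjusted

# Toy input: two low-GC windows with low depth, two high-GC windows with high depth.
toy_windows = [(10, 30), (20, 30), (40, 50), (60, 50)]
adjusted_depths = gc_adjust_read_depth(toy_windows)
```

After the correction, the depth in each GC bin is centred on the overall median, so differences between windows reflect copy number more than base composition.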

High-resolution CGH array data

CGH array data from 33 individuals (11 Koreans, 10 Chinese, 10 Japanese, 1 European and 1 West African) were obtained using a whole genome tiling CGH array comprising 24 million probes (16) (Supplementary Table S1). In addition to the conventional CGH data, which depend on a comparison with a reference sample (NA10851), absolute (reference-free) CGH array data are also provided.

Genome variants

SNPs and indels were discovered by applying conservative filter criteria to the NGS data as described elsewhere (6). Briefly, four matches from uniquely aligned short reads with a quality score ≥20 were required for SNP identification. CNVs were identified in the CGH array using the ADM2 algorithm (16,26) in the Agilent Genomic Workbench Standard Edition 5.0.14. The summary statistics of each individual genome are provided in Table 1.
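The SNP criterion summarized above (at least four supporting bases from uniquely aligned reads, each with a quality score of 20 or more) can be sketched as follows. The `ReadBase` record and the function name are hypothetical, and the full filter set of reference (6) is not reproduced here.

```python
from dataclasses import dataclass

MIN_SUPPORTING_READS = 4  # "four matches", from the text
MIN_QUALITY = 20          # "quality score >= 20", from the text

@dataclass
class ReadBase:
    """One aligned read's base call at a candidate position (hypothetical record)."""
    base: str
    quality: int
    uniquely_aligned: bool

def passes_snp_filter(pileup, alt_allele):
    """Return True if enough high-quality, uniquely aligned reads support alt_allele."""
    support = sum(
        1
        for read in pileup
        if read.base == alt_allele
        and read.uniquely_aligned
        and read.quality >= MIN_QUALITY
    )
    return support >= MIN_SUPPORTING_READS

# Four good supporting reads pass; replacing one with a low-quality read fails.
good = [ReadBase("T", 30, True)] * 4
weak = [ReadBase("T", 30, True)] * 3 + [ReadBase("T", 10, True)]
```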

Table 1.

Summary of massively parallel sequencing data in TIARA

Sample name	Technology	Read length (in bp)	Insert size	Number of reads	Total bases	Sequencing coverage	Aligned coverage	SNPs	Indels	CNV (region)
AK1	Illumina Genome Analyzer	1 × 36; 2 × 36; 2 × 88; 2 × 106	200	519 486 218; 1 646 543 336; 123 322 768; 177 416 122	18 701 503 848; 59 275 560 096; 10 852 403 584; 18 806 108 932	35.9x	27.8x	3 453 653	170 202	1237 (24 193 059)
AK2	AB SOLiD	2 × 25; 2 × 50	1500; 4700	6 371 995 780; 3 390 922 334	159 299 894 500; 169 546 116 700	109.6x	27.5x	3 586 271	213 718	607 (9 248 044)
AK4	Illumina Genome Analyzer	2 × 76; 2 × 101	500	444 312 562; 430 032 812	33 767 754 712; 43 433 314 012	25.7x	23.1x	3 630 428	429 258	696 (8 463 889)
AK6	Illumina Genome Analyzer	2 × 36; 2 × 76; 2 × 101	500	55 752 362; 540 079 624; 301 478 526	2 007 085 032; 41 046 051 424; 30 449 331 126	24.5x	22.3x	3 558 703	413 949	706 (11 958 848)
NA10851	Illumina Genome Analyzer	2 × 36; 2 × 76; 2 × 101	500	1 114 121 056; 318 924 496; 203 842 434	40 108 358 016; 24 238 261 696; 20 588 085 834	28.3x	25.0x	3 683 016	319 266	1309 (23 198 937)
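
The relationship between the read counts, total bases and sequencing coverage in Table 1 can be checked with a short calculation. The sketch below uses the AK4 row and assumes a genome size of 3.0 Gb as the denominator. The source does not state this value, but it reproduces the reported coverage. The names are illustrative.

```python
GENOME_SIZE_BP = 3.0e9  # assumption: not stated in the source

# (number of reads, read length in bp) for the two AK4 libraries in Table 1
AK4_LIBRARIES = [(444_312_562, 76), (430_032_812, 101)]

def total_bases(libraries):
    """Sum of reads x read length over all libraries of one sample."""
    return sum(n_reads * read_length for n_reads, read_length in libraries)

def sequencing_coverage(libraries, genome_size=GENOME_SIZE_BP):
    """Raw sequencing coverage: total sequenced bases divided by genome size."""
    return total_bases(libraries) / genome_size

ak4_bases = total_bases(AK4_LIBRARIES)
ak4_coverage = round(sequencing_coverage(AK4_LIBRARIES), 1)
print(ak4_bases, ak4_coverage)
```

The per-library products match the "Total bases" column (33 767 754 712 and 43 433 314 012), and the rounded result matches the 25.7x reported for AK4. Aligned coverage is lower because only aligned bases are counted.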

RESULTS

System configuration

The TIARA system mainly consists of a ‘genome data repository’ and a ‘genome browser’ (Figure 1). The genome data repository has three types of storage archive: (i) a ‘Lucene index file system’, (ii) a ‘MySQL database’ and (iii) an ‘anonymous file transfer protocol (FTP) archive’. These archives were built on a virtualization file system designed to support high-performance computing clusters. The Lucene index file system includes inverted index files for real-time query processing of genomic data, including SNPs, indels, RDs and log2 ratios. Inverted index files are generated using an ‘index build module’. The MySQL database stores information about the aligned short reads, such as read length, alignment position and quality. The anonymous FTP archive enables downloading of the raw CGH and short read data as well as the filtered genome variants, including SNPs, non-synonymous SNPs, indels and CNVs.
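
The idea of an inverted index for region queries can be illustrated with a toy example. This is not Lucene's API and not the TIARA index build module. It is a plain-Python sketch in which each (chromosome, bin) key lists the records falling inside that bin, and the bin size, class name and payload format are all hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # hypothetical bin width in bp

class RegionIndex:
    """Toy inverted index: (chromosome, bin) -> list of (position, sample, payload)."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self._postings = defaultdict(list)

    def add(self, sample, chrom, pos, payload):
        """Index one record (e.g. a SNP genotype) under its position bin."""
        self._postings[(chrom, pos // self.bin_size)].append((pos, sample, payload))

    def query(self, chrom, start, end, samples=None):
        """Return records in [start, end], optionally restricted to given samples."""
        hits = []
        for b in range(start // self.bin_size, end // self.bin_size + 1):
            # .get avoids creating empty posting lists for unvisited bins
            for pos, sample, payload in self._postings.get((chrom, b), []):
                if start <= pos <= end and (samples is None or sample in samples):
                    hits.append((pos, sample, payload))
        return sorted(hits, key=lambda h: (h[0], h[1]))

index = RegionIndex()
index.add("AK1", "chr14", 74_583_581, "homozygous SNP")
index.add("AK2", "chr14", 74_583_581, "heterozygous SNP")
region_hits = index.query("chr14", 74_580_000, 74_590_000)
```

Only the bins overlapping the requested region are visited, which is why this kind of lookup stays fast as more genomes are added.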

Figure 1.

System configuration of TIARA.

The genome browser consists of: (i) a ‘genome data search engine’ and (ii) a ‘genome data display engine’. The genome data search engine retrieves genomic data from the genome data repository. The genome data display engine visualizes the SNPs, indels, RDs and log2 ratios obtained from the genome data search engine in the genome browser of TIARA. Genome data are exchanged between modules as XML and JavaScript Object Notation (JSON) files. The modules in each engine are described in detail in Supplementary Information 1.

User interface of TIARA

In this section, we describe the user interface of TIARA, the structure of which is displayed in Figure 2a. In area (A) of Figure 2a, the user can specify the genomic region and the individuals of interest for browsing. Areas (B), (C), (D) and (E) present, respectively, the RefSeq genes, SNPs, indels and RDs from the high-throughput sequencing data. Areas (F) and (G) present the CNV regions and log2 ratios from the high-resolution CGH array data, respectively. Once the user selects or deselects an individual genome data set, the personal genome data are displayed in or removed from areas (C), (D), (E) and (G). The ‘GeneSearch’ button allows the user to browse the genome data for a specific gene. For example, the user can browse the TP53 gene locus (Figure 2b). The ‘XMLDownload’ button exports an XML document that contains structured information describing the SNPs, indels, RDs and log2 ratios visualized in the genome browser. The downloaded XML document permits analysis of the selected genomic region using other genome browsers or the user’s own scripts. The schema of each XML document is shown in Supplementary Figures 1 and 2.

Figure 2.

User interface of TIARA. (a) The user interface consists of areas (A) control panel, (B) RefSeq gene, (C) SNPs, (D) indels, (E) RD display window from high-throughput sequencing, (F) CNV regions and (G) log2 ratio display window for high-resolution CGH array data. (b) Retrieval of genomic data for the TP53 gene using the gene name search function. (c) An example of heterozygous and homozygous SNPs at the same position in several selected individuals. These SNPs are related to colorectal and endometrial cancer. (d) An example of the popup window displaying the common SNPs.

High-throughput sequencing data

SNP display window (see area (C) of Figure 2a)

All SNPs detected across multiple individuals for a selected genomic region are displayed as points in the SNP display window. Homozygous and heterozygous SNPs are colored in blue and red, respectively. Users may click on one of these SNPs to receive information about the short read data. For example, Figure 2c shows the information on short reads for the SNP at position 74 583 581 bp of chromosome 14. A comparison of multiple genomes provides clues as to the functional impact of each variant. For example, the SNP shown in Figure 2c appears to be common, because all five sequenced individuals carry it (homozygous in AK1 and AK6; heterozygous in AK2, AK4 and NA10851). Although the SNP is associated with colorectal and endometrial cancers according to reports in the Online Mendelian Inheritance in Man (OMIM) database, its high allele frequency suggests that its disease effect may be limited. In addition, a popup window displays the SNPs common to multiple individuals (Figure 2d).

Indel display window (see area (D) of Figure 2a)

The indel start position is marked with a circle. As with the SNPs, homozygous and heterozygous indels are colored in blue and red, respectively. Insertions are indicated by a filled circle and deletions by an open circle.

RD display window (see area (E) of Figure 2a)

The coverage graph for the genomic region selected by the user is drawn in the RD display window, the size of which is adjusted according to the amount of data extracted from that region of the genome.

High-resolution CGH array data

CNV region display window (see area (F) of Figure 2a)

To support CNV studies, the CNV regions reported by Conrad et al. (15) are displayed in area (F). A line indicates each CNV region from its start position to its stop position.

Log2 ratio display window (see area (G) of Figure 2a)

The log2 ratios of the CGH array for individuals selected by the user may be visualized in the log2 ratio display window. Both the conventional log2 ratio (relative to the CGH reference DNA, NA10851) and the reference-free log2 ratio (10) are displayed if the ‘Absolute called’ checkbox is selected. By combining the RDs, relative log2 ratios and absolute log2 ratios, the copy number status of a selected region in an individual genome can be identified accurately. For example, in the region shown in Figure 2a, the relative log2 ratio of AK1 suggests a gain (G). However, both the RD (E) and the absolute log2 ratio (G) indicate that the CNV lies in the reference, NA10851, and not in AK1.
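
The reasoning used in the log2 ratio display window, in which the RD, the relative log2 ratio and the absolute log2 ratio are combined to decide whether a signal belongs to the sample or to the reference NA10851, can be written as a small decision rule. This is a sketch only: the cutoff of 0.3, the function names and the example values are hypothetical and are not taken from TIARA or from Figure 2a. The RD is assumed to be supplied as log2(RD / genome-wide mean RD).

```python
GAIN, LOSS, NORMAL, AMBIGUOUS = "gain", "loss", "normal", "ambiguous"
LOG2_CUTOFF = 0.3  # hypothetical threshold for calling a gain or loss

def call_state(log2_value, cutoff=LOG2_CUTOFF):
    """Classify one log2-scale signal as gain, loss or normal."""
    if log2_value >= cutoff:
        return GAIN
    if log2_value <= -cutoff:
        return LOSS
    return NORMAL

def interpret_region(relative_log2, absolute_log2, rd_log2):
    """Return (sample_state, inferred_reference_state) for one region.

    The sample state is taken from the two reference-independent signals
    (absolute log2 ratio and read depth) when they agree. The reference
    state is inferred only when the sample itself looks normal: a relative
    gain then points to a loss in the reference, and vice versa.
    """
    relative = call_state(relative_log2)
    absolute = call_state(absolute_log2)
    depth = call_state(rd_log2)

    sample_state = absolute if absolute == depth else AMBIGUOUS
    reference_state = None
    if sample_state == NORMAL:
        reference_state = {GAIN: LOSS, LOSS: GAIN, NORMAL: NORMAL}[relative]
    return sample_state, reference_state

# Illustrative values mimicking the AK1 situation described above:
# apparent relative gain, but normal absolute ratio and normal read depth.
ak1_like = interpret_region(relative_log2=0.8, absolute_log2=0.02, rd_log2=-0.05)
```

With these illustrative inputs, the sample is called normal and the deviation is attributed to a loss in the reference, which is the interpretation given in the text for AK1 and NA10851.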

DISCUSSION

We have described the development of the TIARA genome database, into which massively parallel sequencing data, high-resolution array CGH data and genomic variants of human whole genomes have been deposited. The TIARA genome browser is a unique visualization tool that facilitates multi-individual and cross-technology analysis of complex human genomic variations. TIARA will be upgraded to improve the efficiency of genome research by developing advanced genome browser functions and by adding more personal genomes. GMI-SNU has recently completed sequencing of the entire genomes of 10 Korean individuals using NGS and high-resolution CGH arrays. Our group plans to analyze 1000 Asian genomes and release the data through TIARA before the end of the next year. We believe that TIARA and the genomic data will prove to be an invaluable resource for human genome research.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Korean Ministry of Knowledge Economy (grant number 0411-20100061); Korean Ministry of Education, Science and Technology (grant number 2010-0013662); Green Cross Therapeutics (0411-20080023). Funding for open access charge: Korean Ministry of Education, Science and Technology (grant number 2010-0013662).

Conflict of interest statement. None declared.

26 in total

1. Detection of large-scale variation in the human genome.

Authors: A John Iafrate; Lars Feuk; Miguel N Rivera; Marc L Listewnik; Patricia K Donahoe; Ying Qi; Stephen W Scherer; Charles Lee
Journal: Nat Genet Date: 2004-08-01 Impact factor: 38.330

2. Efficient calculation of interval scores for DNA copy number data analysis.

Authors: Doron Lipson; Yonatan Aumann; Amir Ben-Dor; Nathan Linial; Zohar Yakhini
Journal: J Comput Biol Date: 2006-03 Impact factor: 1.479

Review 3. Copy-number variation and association studies of human disease.

Authors: Steven A McCarroll; David M Altshuler
Journal: Nat Genet Date: 2007-07 Impact factor: 38.330

4. Personal genome sequencing: current approaches and challenges.

Authors: Michael Snyder; Jiang Du; Mark Gerstein
Journal: Genes Dev Date: 2010-03-01 Impact factor: 11.361

5. Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing.

Authors: Hansoo Park; Jong-Il Kim; Young Seok Ju; Omer Gokcumen; Ryan E Mills; Sheehyun Kim; Seungbok Lee; Dongwhan Suh; Dongwan Hong; Hyunseok Peter Kang; Yun Joo Yoo; Jong-Yeon Shin; Hyun-Jin Kim; Maryam Yavartanoo; Young Wha Chang; Jung-Sook Ha; Wilson Chong; Ga-Ram Hwang; Katayoon Darvishi; Hyeran Kim; Song Ju Yang; Kap-Seok Yang; Hyungtae Kim; Matthew E Hurles; Stephen W Scherer; Nigel P Carter; Chris Tyler-Smith; Charles Lee; Jeong-Sun Seo
Journal: Nat Genet Date: 2010-04-04 Impact factor: 38.330

6. Global variation in copy number in the human genome.

Authors: Richard Redon; Shumpei Ishikawa; Karen R Fitch; Lars Feuk; George H Perry; T Daniel Andrews; Heike Fiegler; Michael H Shapero; Andrew R Carson; Wenwei Chen; Eun Kyung Cho; Stephanie Dallaire; Jennifer L Freeman; Juan R González; Mònica Gratacòs; Jing Huang; Dimitrios Kalaitzopoulos; Daisuke Komura; Jeffrey R MacDonald; Christian R Marshall; Rui Mei; Lyndal Montgomery; Kunihiro Nishimura; Kohji Okamura; Fan Shen; Martin J Somerville; Joelle Tchinda; Armand Valsesia; Cara Woodwark; Fengtang Yang; Junjun Zhang; Tatiana Zerjal; Jane Zhang; Lluis Armengol; Donald F Conrad; Xavier Estivill; Chris Tyler-Smith; Nigel P Carter; Hiroyuki Aburatani; Charles Lee; Keith W Jones; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2006-11-23 Impact factor: 49.962

7. Fast and SNP-tolerant detection of complex variants and splicing in short reads.

Authors: Thomas D Wu; Serban Nacu
Journal: Bioinformatics Date: 2010-02-10 Impact factor: 6.937

8. Reference-unbiased copy number variant analysis using CGH microarrays.

Authors: Young Seok Ju; Dongwan Hong; Sheehyun Kim; Sung-Soo Park; Sujung Kim; Seungbok Lee; Hansoo Park; Jong-Il Kim; Jeong-Sun Seo
Journal: Nucleic Acids Res Date: 2010-08-27 Impact factor: 16.971

9. Origins and functional impact of copy number variation in the human genome.

Authors: Donald F Conrad; Dalila Pinto; Richard Redon; Lars Feuk; Omer Gokcumen; Yujun Zhang; Jan Aerts; T Daniel Andrews; Chris Barnes; Peter Campbell; Tomas Fitzgerald; Min Hu; Chun Hwa Ihm; Kati Kristiansson; Daniel G Macarthur; Jeffrey R Macdonald; Ifejinelo Onyiah; Andy Wing Chun Pang; Sam Robson; Kathy Stirrups; Armand Valsesia; Klaudia Walter; John Wei; Chris Tyler-Smith; Nigel P Carter; Charles Lee; Stephen W Scherer; Matthew E Hurles
Journal: Nature Date: 2009-10-07 Impact factor: 49.962

10. Paired-end mapping reveals extensive structural variation in the human genome.

Authors: Jan O Korbel; Alexander Eckehart Urban; Jason P Affourtit; Brian Godwin; Fabian Grubert; Jan Fredrik Simons; Philip M Kim; Dean Palejev; Nicholas J Carriero; Lei Du; Bruce E Taillon; Zhoutao Chen; Andrea Tanzer; A C Eugenia Saunders; Jianxiang Chi; Fengtang Yang; Nigel P Carter; Matthew E Hurles; Sherman M Weissman; Timothy T Harkins; Mark B Gerstein; Michael Egholm; Michael Snyder
Journal: Science Date: 2007-09-27 Impact factor: 47.728
