Introduction to bioinformatics: sequencing technology.

Abstract

Bioinformatics, the study of integrating high-throughput biological data and statistical models through intensive computation, has attracted great interest in recent years, and sequencing is at its very center. The large amount of information obtained from sequencing has deepened our fundamental understanding of organisms. This review provides a brief summary of new sequencing technology, current issues, and projects focused on medical applications. The article is organized in three parts. Part I explains common terminology and the background of sequencing technology, Part II compares distinct features of currently available platforms, and Part III covers applications in various medical fields.

Keywords: Bioinformatics; Next-generation sequencing; Massively parallel sequencing

Year: 2011 PMID: 22053303 PMCID: PMC3206250 DOI: 10.5415/apallergy.2011.1.2.93

Source DB: PubMed Journal: Asia Pac Allergy ISSN: 2233-8276

Part I. Introduction to sequencing technology

On April 25, 1953, Francis Crick and James Watson proposed the double-helical structure of DNA [1]. Since then, methods have been devised to determine the sequence of DNA residues, which serves as the blueprint of an organism. Conventional sequencing refers to Sanger-type sequencing, which is capillary-based and labor-intensive. The Human Genome Project (HGP) accelerated progress in sequencing, but sequencing remained a cumbersome procedure despite its importance. The first commercial next-generation sequencing (NGS) platform was launched in 2005 [2], about 50 years after the discovery of the DNA structure by Crick and Watson. Although NGS is the current generation of sequencing, it does not yet have a prototype model or platform. Generally it refers to a type of sequencing that does not need bacterial artificial chromosome cloning but runs automatically based on enzymatic amplification [3]. Alternatively, the terms high-throughput sequencing and massively parallel sequencing are often used to emphasize the extremely large volume of output. The Roche/454 FLX Titanium and the Illumina/HiSeq2000 are the most commonly available platforms. Recently a new sequencing method termed next-NGS (NNGS) or no-amplification NGS has been developed. Unlike NGS, NNGS requires no amplification step and is therefore a single-molecule-based technology; it is often described as third-generation sequencing, succeeding the first-generation automated Sanger method and the second-generation NGS [4]. Helicos offered the first commercially available NNGS platform, but Pacific Biosciences is emerging as a new pathfinder. An even newer fourth-generation sequencer has been developed, which enables sequencing without imaging. Up to the third generation, sequencers depend on an enzymatic cascade and optical (fluorescence) detection, which is costly and inefficient.
Ion Torrent is the first fourth-generation sequencer, making a breakthrough by using digital detection on chips. The main driving force behind the development of new generations of sequencers has been cost reduction. "The thousand-dollar human genome" has been a catchphrase for the new sequencing technology. DNA sequencing costs have been decreasing at a very high rate [5]. The HGP cost $3 billion, whereas sequencing a human genome is estimated to cost less than $1,000 within a few years. Nonetheless, sequencing the whole genome is still expensive and not cost-effective. An alternative approach is to sequence specific regions of DNA rather than whole genomes, a strategy called targeted sequencing. For example, exome (targeted) sequencing, which focuses on the exome (the exonic regions, ~5% of the human genome), has been popular because of its effectiveness in identifying potential mutations. The genetic basis of more than 20 rare Mendelian disorders has been identified so far using this method, including Miller syndrome (family-based design) [6], Kabuki syndrome (unrelated individuals) [7], and even a set of ion channel mutations believed to cause Mendelian disorders. See [8] for a review of recent work. From the technological point of view, the in vitro cloning step, clonally amplified polymerase chain reaction (PCR), enabled NGS by saving processing time. The most common methods are emulsion PCR (emPCR) and bridge PCR. emPCR amplifies DNA in a water-in-oil emulsion, so that amplification proceeds without loss of DNA molecules. Bridge PCR is performed on a slide where affixed primers support successive rounds of DNA amplification. Clonally amplified PCR opened a new era in NGS but, ironically, it is destined to be discontinued as single-molecule-based sequencing technology develops. For a comprehensive review of sequencing techniques, please refer to [9-11].
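The appeal of targeted sequencing can be illustrated with rough arithmetic. The sketch below is not taken from the review: the genome length of about 3 billion bases and the 30-fold depth are assumptions, the 5% exome share is the figure quoted in the text, and the function name `bases_required` is hypothetical. It also ignores capture inefficiency, so real savings are likely to be smaller.

```python
# Illustrative arithmetic only; the constants below are assumptions, not measurements.
GENOME_SIZE = 3_000_000_000  # approximate human genome length in bases (assumed)
EXOME_PERCENT = 5            # exome share of the genome, as quoted in the text
TARGET_DEPTH = 30            # assumed average sequencing depth


def bases_required(target_size: int, depth: int) -> int:
    """Total bases that must be sequenced to reach `depth` over `target_size` bases."""
    return target_size * depth


exome_size = GENOME_SIZE * EXOME_PERCENT // 100
whole_genome_bases = bases_required(GENOME_SIZE, TARGET_DEPTH)
exome_bases = bases_required(exome_size, TARGET_DEPTH)
fold_reduction = whole_genome_bases // exome_bases

print(f"whole genome: {whole_genome_bases:,} bases")
print(f"exome only:   {exome_bases:,} bases")
print(f"reduction:    {fold_reduction}-fold")
```

Under these assumptions the sequencing load scales directly with the size of the targeted region, which is why restricting attention to the exome reduces the amount of sequencing needed.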
Although different platforms use diverse tactics to reduce time and cost, the only product obtained from sequencing is the read. A read, or sequence tag, is a very short DNA sequence that is assumed to be a copy of the true genome sequence. In practice, however, it also contains footprints of individual variants and systematic errors of current platforms. The short length of the read was recognized as one of the obstacles of early NGS technology. The first reads from Solexa (Illumina, USA) were 20-30 bases on average, which could be likened to assembling a 10^8-piece puzzle to reconstruct the human template. Even though the Sanger method still produces relatively long reads (300-1,000 bases), short read length is no longer a barrier for some platforms: the 454 FLX Titanium (Roche, USA) produces reads of 400 bases, and the PacBio RS (Pacific Biosciences, USA) produces reads of 1,000 bases on average. Also, Illumina (USA) has successfully shown that rather short sequences (e.g., 50 bases) are sufficient for re-sequencing and plausible even for de novo sequencing [12]. Currently most reads from NGS are single-end, but there are modifications for longer read length and more accurate alignment, such as mate-paired, paired-end, and strobed reads. Paired-end and strobed reads carry local positional information that indicates which reads lie nearby, whereas a single-end read is aligned to one position on the genome, like an island. There are subtle differences between mate-paired (library) and paired-end (sequencing) reads in terms of how the library is made [13]. Besides read length, another aspect to consider is the number of reads. The (average) coverage is defined as the average number of times a position in the genome is sequenced. In rare cases, percent coverage is used to represent the percentage of positions in the genome that are sequenced at least once.
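The two coverage definitions above can be made concrete with a toy calculation. The following is a minimal sketch, not a method from the review: the read placements, the 10-base "genome", and all function names are hypothetical. It also shows how many reads are needed to tile a genome once, assuming a genome of about 3 × 10^9 bases and 30-base reads.

```python
from typing import List, Tuple


def depth_per_position(reads: List[Tuple[int, int]], genome_size: int) -> List[int]:
    """Count how many reads cover each position; a read is (start, length), 0-based."""
    depth = [0] * genome_size
    for start, length in reads:
        for pos in range(start, min(start + length, genome_size)):
            depth[pos] += 1
    return depth


def average_coverage(depth: List[int]) -> float:
    """Average number of times a position is sequenced."""
    return sum(depth) / len(depth)


def percent_coverage(depth: List[int]) -> float:
    """Percentage of positions sequenced at least once."""
    return 100.0 * sum(1 for d in depth if d > 0) / len(depth)


def reads_needed(genome_size: int, read_length: int, coverage: int) -> int:
    """Number of reads whose total length reaches the requested average coverage."""
    return (genome_size * coverage + read_length - 1) // read_length  # ceiling division


# Hypothetical toy data: three reads placed on a 10-base genome.
toy_reads = [(0, 4), (2, 4), (7, 3)]
toy_depth = depth_per_position(toy_reads, 10)

print(toy_depth)                    # per-position depth
print(average_coverage(toy_depth))  # mean depth over all positions
print(percent_coverage(toy_depth))  # share of positions covered at least once
print(reads_needed(3_000_000_000, 30, 1))  # reads to tile an assumed 3-Gb genome once
```

In the toy data one position is never sequenced, so the percent coverage is below 100 even though the average coverage exceeds 1. This is why the two measures are reported separately.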
Reads are generated from randomly sheared fragments, shotgun style, and the genome contains repeated regions such as retrotransposons. It is, therefore, generally agreed that at least 20- to 30-fold coverage is required to re-sequence the human genome acceptably. It should be noted that coverage is not uniformly distributed, possibly for several reasons: fragments are not sheared randomly, and DNA molecules are not amplified uniformly because of differences in genomic sequence composition [14] or chromatin status [15]. Having sufficient reads is not the only important issue in sequencing and bioinformatics. After reads are obtained, proper modeling plays an essential role in dissecting the data and extracting important results from a vast amount of sequencing output. For example, assembly and alignment are the key procedures for matching a read to its true location in the genome. See [16] for computational resources such as cloud computing and [17, 18] for sequence-specific analysis and integrative approaches.
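The idea of matching a read to its location in the genome can be sketched with a brute-force search. This is only an illustration under simplifying assumptions, not how any platform's software works: the reference string, the reads, and the function names are hypothetical, only substitutions are tolerated (no insertions or deletions), and every position is scanned, which would be far too slow for a real genome.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str, max_mismatches: int = 0) -> List[Tuple[int, str, int]]:
    """Scan both strands and return (position, strand, mismatches) for every hit."""
    hits = []
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for pos in range(len(reference) - len(query) + 1):
            window = reference[pos:pos + len(query)]
            n = mismatches(query, window)
            if n <= max_mismatches:
                hits.append((pos, strand, n))
    return hits


# Hypothetical toy reference and reads.
reference = "ACGTTGCAAGGCTA"
print(align_read("TGCAAG", reference))  # read taken from the forward strand
print(align_read("TAGCC", reference))   # read taken from the reverse strand
```

A read that returns more than one hit is ambiguous, which is the difficulty that repeated regions pose for short reads.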

Part II. Platform comparisons

As of July 2011, six sequencing platforms are commercially available. Table 1 summarizes the distinctive attributes of some of them. See [13] for the most up-to-date comparison of technical aspects.

Table 1

Comparison of sequencing platforms

Each sequencing platform has its advantages and deficiencies. It is sometimes recommended to combine data from different platforms to overcome limitations and maximize efficiency [19].

Box 1. Sequencing applications

Depending on input materials: genome sequencing, RNA-seq (transcriptome, exome), ChIP-seq (methylome, transcript-DNA binding), small-RNA-seq (miRNA, piRNA, siRNA), and CLIP-seq (transcript-RNA binding) [20].

Depending on purpose: de novo sequencing, (targeted) re-sequencing for detection of variants and structural variants, transcriptome analysis, epigenetic changes and methylation patterns, metagenomics including microbial diversity and paleogenomics, and so on.

Part III. Applications

Medical interest mostly focuses on re-sequencing to find variants linked with diseases or specific phenotypes. Some large-scale projects are listed below.

The Cancer Genome Atlas project was launched by the National Cancer Institute and the National Human Genome Research Institute to reveal the genetic underpinnings of cancer through extensive sequencing. Its data are open to the research community. Recently it released detailed ovarian cancer data confirming that mutations of TP53, BRCA1 and BRCA2 are highly associated with ovarian cancer [21].

The 1000 Genomes Project was started to provide an understanding of human genome variants (SNPs, structural variants, and their haplotypes) from population-scale sequencing through international collaboration. In 2010, the project reported its pilot phase, and updated data are released monthly. For details, refer to [22], which presents an overview of human genome sequence variation studies and the pilot phase of the project.

The Metagenomics of the Human Intestinal Tract project aims to link human health and intestinal microbiota. The human gut has the potential to serve as an indicator of an individual's health, and a large study defined the minimal gut metagenome of 124 European individuals [23].

The International Rare Disease Research Consortium, launched in early 2011 with the aim of diagnosing rare diseases by 2020, announced a €100-million (US $140-million) call for research proposals [24].

There are also several next-generation studies based on new techniques. For example, a web-based genome-wide association study was carried out by a direct-to-consumer (DTC) company. In many cases, medical science is slow: a multicenter genetic study on Parkinson's disease spanned 6 years [25]. Such a study can now be reproduced within about 1 year once a cohort is constructed from the customer database of a personal genetics company [26]. This study was approved by an external institutional review board (IRB).
Another good example is the integration of genome-sequencing analysis and social-network analysis [27]. This study is a convergence of the classical and the modern: combining sequencing and epidemiology uncovered a tuberculosis outbreak. The analysis comprised relatively small-scale bacterial sequencing, social-network questionnaires from 41 patients, and a linked-network model.

Apart from the academic challenges described in the previous paragraphs, business models such as the personal genome industry, or DTC genetic testing, have also been using sequencing. If you want to know more about yourself, DTC companies such as 23andMe, deCODEme and Navigenics will calculate a set of disease risks for under $500 once you provide a DNA sample, such as saliva or a cheek swab. However, you might want to read J. Craig Venter's opinion on current limitations [28].

Detection of human leukocyte antigen (HLA) polymorphisms is a specific application of sequencing for diagnosis. Many genes located in the major histocompatibility complex (MHC) on chromosome 6 are related to immunological functions such as HLA expression. Clinically, matching HLA haplotypes is essential for bone marrow transplantation. A study on HLA polymorphism compared first-generation Sanger sequencing with second-generation NGS [29]. Comprehensive comparison of selected genes (class I and II genes) showed that NGS outperforms the conventional Sanger method in terms of time and cost, but considerable issues remain regarding notable sequence alignment errors and the need for intensive computational support. Another effort to sequence the MHC region showed that direct sequencing of whole gene regions can reveal several variants in tubulin beta in patients with acute myeloid leukaemia undergoing HLA-matched allogeneic hematopoietic stem cell transplantation [30].

Summary

Massively parallel sequencing is guiding us to a new phase of medical research and application. For example, detection of rare single nucleotide variants is a direct clinical use of new sequencing technology. It also enables detection of structural variants such as chromosomal rearrangements, or of genome-wide variants in somatic diseases like cancer [31, 32]. There is no standard platform, and researchers should consider the capacities of different platforms depending on the aim of their research. As the cost of sequencing gradually decreases and more integrative work is developed, we will gain a deeper understanding of human biology. A new paradigm of clinical study is about to begin.

30 in total

Review 1. Next-generation genomics: an integrative approach.

Authors: R David Hawkins; Gary C Hon; Bing Ren
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

2. Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid.

Authors: J D Watson; F H Crick
Journal: Nature Date: 1953-04-25 Impact factor: 49.962

Review 3. Next-generation DNA sequencing methods.

Authors: Elaine R Mardis
Journal: Annu Rev Genomics Hum Genet Date: 2008 Impact factor: 8.929

4. An agenda for personalized medicine.

Authors: Pauline C Ng; Sarah S Murray; Samuel Levy; J Craig Venter
Journal: Nature Date: 2009-10-08 Impact factor: 49.962

Review 5. Sequencing technologies - the next generation.

Authors: Michael L Metzker
Journal: Nat Rev Genet Date: 2009-12-08 Impact factor: 53.242

6. Whole-genome sequencing and social-network analysis of a tuberculosis outbreak.

Authors: Jennifer L Gardy; James C Johnston; Shannan J Ho Sui; Victoria J Cook; Lena Shah; Elizabeth Brodkin; Shirley Rempel; Richard Moore; Yongjun Zhao; Robert Holt; Richard Varhol; Inanc Birol; Marcus Lem; Meenu K Sharma; Kevin Elwood; Steven J M Jones; Fiona S L Brinkman; Robert C Brunham; Patrick Tang
Journal: N Engl J Med Date: 2011-02-24 Impact factor: 91.245

Review 7. Computational solutions to large-scale data management and analysis.

Authors: Eric E Schadt; Michael D Linderman; Jon Sorenson; Lawrence Lee; Garry P Nolan
Journal: Nat Rev Genet Date: 2010-09 Impact factor: 53.242

8. Sequence capture and next generation resequencing of the MHC region highlights potential transplantation determinants in HLA identical haematopoietic stem cell transplantation.

Authors: Johannes Pröll; Martin Danzer; Stephanie Stabentheiner; Norbert Niklas; Christa Hackl; Katja Hofer; Sabine Atzmüller; Peter Hufnagl; Christian Gülly; Hanns Hauser; Otto Krieger; Christian Gabriel
Journal: DNA Res Date: 2011-05-28 Impact factor: 4.458

9. Transcriptome sequencing to detect gene fusions in cancer.

Authors: Christopher A Maher; Chandan Kumar-Sinha; Xuhong Cao; Shanker Kalyana-Sundaram; Bo Han; Xiaojun Jing; Lee Sam; Terrence Barrette; Nallasivam Palanisamy; Arul M Chinnaiyan
Journal: Nature Date: 2009-01-11 Impact factor: 49.962

10. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.

Authors: Juliane C Dohm; Claudio Lottaz; Tatiana Borodina; Heinz Himmelbauer
Journal: Nucleic Acids Res Date: 2008-07-26 Impact factor: 16.971