We believe the field of bioinformatics for genetic analysis will be one of
the biggest areas of disruptive innovation in life science tools over the next
few years.While the term “bioinformatics” originated from Paulien Hogeweg, a Dutch theoretical
biologist, in 1970 [2], the
groundwork was laid by the founding fathers of information theory and computer science,
Claude Shannon and Alan Turing.Shannon, widely known for his Master’s thesis work in the application of Boolean logic to
digital design, outlined the basis for bioinformatics in his 1940 PhD thesis entitled
“An Algebra for Theoretical Genetics” [3,4]. However, as Shannon’s
main interest was in communication theory and cryptanalysis, he never pursued this work
any further [4].In 1943, Turing visited Shannon at Bell Labs to share British code-breaking methods, and
the two exchanged ideas. Later on, near the end of his life, Turing applied his
computing acumen to biology, penning his famous 1952 work in spatial modeling of
morphogenesis [5]. Building on
work by D’Arcy Thompson, Turing, essentially employing bioinformatics, modeled
biological growth in fir cones, using the Manchester University Mark I computing machine
[4].From the beginning, bioinformatics has been a field driven by big data. It is now defined
as the interdisciplinary toolset for applying computer science, mathematics, and
statistics to the classification and analysis of biological information. At the highest
level, bioinformatics is used to analyze large datasets to help answer biological
questions, both in fundamental biology and in the biology that underlies disease.As such, developments in this field have been driven by our ever-increasing ability to
accrue large amounts of biological data, and they ultimately trace back to the
underlying data generation technology. The first major effort that propelled this
technology was the International Human Genome Project (HGP) [6,7],
which was enabled by the introduction of automated DNA sequencing instruments in the
mid-1980s. This resulted in the unprecedented development of sequence analysis
techniques for assembling our genome.This project turned into a 10-year, multinational, multi-billion dollar effort to
generate the sequence of a single reference human genome. While this project was a
necessary impetus for the development of the field, such large-scale projects are not
easily sustained. Nevertheless, it led to the democratization of data generation.The first democratizing technology was the DNA microarray, which was introduced in the
mid-1990s [8]. Microarrays enabled
researchers to perform a single-day, low-cost experiment that surveyed the gene activity
of all genes in a cell sample — an effort that would have otherwise taken several
man-years to produce. This thousand-fold advance resulted in many labs producing
enormous datasets of gene activity, and it stimulated a great deal of bioinformatics
development in the areas of deducing gene function and classifying disease conditions.
Microarray technology was further extended to provide sequence information at a large
number of variable marker sites in the genome. This enabled large-scale statistical
studies of the association between gene variants and the risk for disease [9]. In these gene expression and genotyping
roles, and in the hands of many researchers and labs, DNA microarray data was a major
driving force for bioinformatics development between 1998 and 2008.The democratization of data generation was further propelled by the introduction of
massively parallel DNA sequencers in 2005 [10]. These provided a major performance advance over the automated
sequencers that powered the HGP and ushered in a new wave of diverse DNA sequence data.
As a display of the power of this new technology, it was used to sequence the first
genome of a specific individual, that of Dr. James Watson, co-discoverer of the double
helical structure of DNA and Director of the original HGP [11]. Taking merely a few months, this process cost less
than $1 million and reproduced the entire HGP on a scale that could be performed by a
single lab. Such sequencers facilitated developments in personal genome and
next-generation sequencing technology [12]. Eventually, genome-scale sequencing became accessible to local
research centers and larger labs, as costs dropped to $5,000-$10,000 genomes,
deliverable in 1 to 2 weeks. Ultimately, sequencing-based approaches began to supplant
DNA microarrays for many research applications.Currently, and in the foreseeable future, the explosion of data emanating from human
genome sequencing will drive bioinformatics. However, the field will undergo a critical
inflection, as more of this data will be generated in translational and clinical
settings. This change was recently reflected at the highest levels of government
planning, when, earlier this year, the NIH created the National Center for Advancing
Translational Sciences with a first-year budget of $575 million and a mandate to develop
this new field of genomic medicine. This landmark initiative is the first national
funding shift in genomics since the inception of the original HGP and parallels the
events of 1989, when the NIH created the National Center for Human Genome Research (now
NHGRI) as the organizing and funding agency that would initiate and drive the HGP. While
thousands of research subject genomes are currently being sequenced per year, it is
anticipated that millions of patient genomes will be sequenced annually, in combination
with medical records. This, in turn, will result in unprecedented power and need to use
bioinformatics to resolve the genetic components of human diseases.In order for this goal to be realized, continued advancements in data creation must be
made. Such advances are needed to deliver the $1,000 same-day human genome, a
performance landmark that has long been considered the threshold for when genome
sequencing can enter the realm of routine medical diagnostics [13,14]. While such advanced technology will most likely propel the creation
of an entirely new industry, the first platform capable of this was introduced this year
[15,16]. The technological advance employed was to perform the
sequencing entirely using a standard semiconductor sensor chip, thereby placing
sequencing on the same technological foundation as computers, digital cameras, and
smartphones. This enables sequencing to directly leverage the trillion-dollar investment
that the semiconductor industry has made in chip manufacturing capability and to
directly benefit from Moore’s Law ― the doubling of chip performance every two years. To
highlight the importance of this historical scaling, the first human genome sequenced
with this new chip-based technology was that of Gordon Moore [15], founder of Intel Corporation and
originator of Moore’s Law [17].
Such sequencing technology will prove vital in driving the next wave of clinical genome
sequencing.Yale has played an important role in this technological progression, as alumnus Jonathan
Rothberg (PhD Yale, ’91, Biology) introduced the first next-generation sequencing
platform, the 454 system [10].
Most recently, through Ion Torrent [15], Rothberg introduced the first platform that could provide the
$1,000 genome.As a university with one of the leading research medical centers and departments of Human
Genetics, Yale is ideally positioned at the forefront of this next wave of clinical
genome sequencing. Yale’s Center for Genomic Medicine, under the direction of Richard
Lifton, Chair of Human Genetics and Professor of Medicine, is pioneering the use of
clinical genome sequencing with the latest Ion Torrent technology.Because bioinformatics is data-driven biology, it is highly interdisciplinary and
requires the application of mathematical knowledge, computer science, and statistical
skills for working with biological data. Constructing a sound education in this field is
challenging, as there are no time-honored curricula to follow.In the first article, Bagga describes the challenges of developing a well-structured
bioinformatics undergraduate degree program. This piece is an invaluable summary of a
decade of experience in the creation and evolution of such a curriculum. Auerbach then
addresses these issues from the perspective of a bioinformatics graduate student, with a
helpful guide to navigating complex educational choices.As the future of bioinformatics will increasingly be driven by the clinical use of genome
sequencing and patient data handling, bioinformaticians who work in clinical
environments and physicians familiar with bioinformatics techniques will be in high
demand. This raises further educational issues, and Rubinstein outlines the ideal
educational path for those who will be working at this critical nexus between clinical
medicine and large-scale data analysis.The ultimate bioinformatics challenge in medicine is to elucidate the genetic basis and
treatment of cancer. The diversity and complexity of tumor genetics and biology,
combined with the severity and ubiquity of cancer, create unmatched potential for
bioinformatics applications. Two articles illustrate the status and challenge of cancer
to bioinformatics.The review article by Fendler et al. summarizes what we have learned from years of DNA
microarray research and early efforts at sequencing cancer genomes. It also highlights
many of the remaining challenges and opportunities.The original research article by Parisi et al. provides an excellent illustration of the
complexity of cancer biology, particularly the complexity arising from heterogeneous
cell populations within a single tumor. The paper then shows how this complexity can be
teased apart by integrative bioinformatics analysis drawing on technologies such as
targeted genome sequencing via DNA microarray assessment of SNP markers and DNA
microarray assessment of genome-wide gene expression.The major target for bioinformatics in the clinical arena will be massive databases of
electronic health care records, combined with genome sequence data. The creation of such
databases is essential to discovering genetic risk factors for disease causation and
treatment response. A pair of articles by Kerr et al. and Ronquillo describes the
factors that will support the creation of such databases, the barriers to their
creation, and the ultimate utility they can provide in personalized medicine.We hope that this special issue of YJBM provides both inspiration and
guidance to those interested in becoming bioinformaticians or merely learning more about
this exciting field. The new era of clinical genome sequencing is just beginning, and
there has never been a better time for bioinformaticians to make a real and lasting
impact on human health.
Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki Journal: Nature Date: 2001-02-15 Impact factor: 49.962
Authors: David A Wheeler; Maithreyan Srinivasan; Michael Egholm; Yufeng Shen; Lei Chen; Amy McGuire; Wen He; Yi-Ju Chen; Vinod Makhijani; G Thomas Roth; Xavier Gomes; Karrie Tartaro; Faheem Niazi; Cynthia L Turcotte; Gerard P Irzyk; James R Lupski; Craig Chinault; Xing-zhi Song; Yue Liu; Ye Yuan; Lynne Nazareth; Xiang Qin; Donna M Muzny; Marcel Margulies; George M Weinstock; Richard A Gibbs; Jonathan M Rothberg Journal: Nature Date: 2008-04-17 Impact factor: 49.962