Michael A Beer, Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, 733 N. Broadway, Baltimore, MD 21205.
Quantitative and computational biomedical research continues to evolve rapidly, and there remains a significant need for rigorous modeling in the tradition of the physical sciences. The fields of computational genomics and interdisciplinary biology are addressing fundamental scientific questions in cellular and developmental biology, evolution, and genetics. The robust and continuing development of these fields requires additional talented researchers trained in physics, mathematics, statistics, and computer science. Previous articles have addressed many factors contributing to successful biomedical research in general [1]. However, this essay focuses specifically on encouraging career transitions into biomedical research and will attempt to point out examples demonstrating areas of greatest need and likelihood of success. Although there is a long tradition of physicists and chemists making seminal contributions to the life sciences (Schrödinger, Delbrück, Crick, etc.), earlier reviews on alternative career paths for physicists tend not to emphasize the very real possibility of transitioning to biomedical research [2]. In this essay, I will describe specific elements of training and personal interest that anticipate a successful transition from physics or engineering into computational biology and genomics, and other areas of quantitative biology. The aim is to reduce some of the uncertainty associated with entry into biomedical fields and to encourage talented researchers to consider such a move.
These comments are largely informed by my own experience. I transitioned from a Ph.D. in theoretical plasma physics to computational biology and genomics just before the completion of the human genome sequence [3,4]. While deciding on a research focus for graduate school, I was interested both in biophysics and in plasma physics and fusion science, and I was drawn to the fact that both fields had a significant potential positive societal impact.
In the end, I chose computational plasma physics because, at the time, biophysics was much less quantitative. Theoretical physics was the field that interested me most, and choosing it served me well. While day-to-day research often involves sitting in front of a computer for long periods of time solving difficult problems, working on the type of problem that you intrinsically find most interesting can turn this potentially tedious activity into an exciting intellectual endeavor. Although the specific problems you study may change over time, it is most important to develop and hone a core set of analytical and problem-solving tools that you enjoy applying. Although I eventually left plasma physics, I do not regret my choice of training at all, since my research interests continue to be very mathematical and I learned how to work as a computational scientist from some of the best in the field. If I had entered biophysics earlier, my research would likely have settled on less computational problems. My doctoral and postdoctoral training in computational plasma physics in the area of controlled fusion energy research was excellent preparation in almost all respects for my eventual transition to biomedical research. Based on this personal experience and observations of my colleagues, I feel that the types of quantitative training that can most readily be applied to research in quantitative biology share common characteristics that can be distilled into four major themes.
STRONG INTERPLAY BETWEEN MODELING AND EXPERIMENTS
Although biology has historically been descriptive, experiments are now largely driven by technological advances (e.g., DNA sequencing-based assays [5] and CRISPR [6]), and the field is entering an era where predictive and theoretical modeling will play an ever-increasing role. Training in fields of physics and engineering where experimental results and theoretical models are both evolving rapidly and informing each other will be needed. As an example, in the mid- and late 1990s, several major fusion experiments were operating for the first time with a deuterium-tritium fuel mixture and generating record levels of fusion power [7,8]. These experiments were also measuring transport and turbulent fluctuations with unprecedented accuracy, and theories and sophisticated nonlinear computer simulations of these fluctuations were being quantitatively compared with the measurements. Initially, the theories contradicted experiments in important ways, but as they increased in complexity and evolved, they eventually began to agree with many important aspects of the observations for the first time [9-11]. The comparisons and agreement between simulations and experiments [9,10,12] dramatically increased confidence in projections of future fusion experiment performance. Although these new models ultimately did not yield optimistic projections for fusion’s economic competitiveness, there was a feeling that this hypothesis-driven modeling could now more accurately constrain future fusion reactor designs, and I learned first-hand the excitement and real-world impact that computational and theoretical modeling could have. The interplay between stellar evolution modeling and experimental astrophysics, cosmology, and even particle physics (e.g., neutrino oscillations [13]) is another excellent example. Experience in complex research areas where measurements and nonlinear computational models and simulations are being rigorously tested against each other is invaluable for future success in biomedical research.
To be successful working in an environment where technology and theory are evolving rapidly, one must develop the ability to quickly and reliably identify which questions are worth asking, which datasets are worth modeling, and which datasets should be most heavily weighted when they do not all agree. This most useful skill is immediately transferable to genomics and biomedical research.
In genomics, technologies are always evolving rapidly, and understanding how to internally check the consistency of a dataset is extremely useful. When I first started working in genomics, my first project involved analyzing microarray gene expression data. After performing PCA on the data matrix, I realized that the two “t=0” time points in two different time courses were actually just identical copies of one experiment. In addition to this type of statistical analysis, it is also useful to have a set of genes that can be used as a positive control to assess the quality of the data. On similar microarray datasets, I could test the overall quality of each dataset by comparing the correlation of a set of genes I knew should be tightly co-regulated across the conditions, like the ribosomal proteins or other relevant protein complexes. Working in any data-rich field leads to the development of similar sets of analysis tricks and tools, which will be valuable in biomedical research.
Another benefit of working in an active field is exposure to a wider range of the psychological dimensions of research. The proponents of each model and experimental system have a vested interest in their success, and these factors often influence how results are reported. Knowing how to detect these factors helps one efficiently navigate the exponentially growing literature and build a more accurate reflection of reality. The behavior of the underlying complex biological system clearly is not affected by such bias, however slight it may be. Carl Sagan called this “having a good baloney detector,” and it is also a completely transferable skill across disciplines.
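The two dataset consistency checks described in this section (spotting accidentally duplicated samples, and using a co-regulated gene set as a positive control) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the expression matrix is synthetic, the "module" of genes 0-19 merely stands in for something like the ribosomal proteins, and the function names and the 0.999 correlation threshold are hypothetical choices, not the analysis actually performed on the original microarray data.

```python
import numpy as np

def find_duplicate_samples(X, threshold=0.999):
    """Return index pairs of samples (columns) with near-identical profiles."""
    corr = np.corrcoef(X.T)  # samples x samples Pearson correlation
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] >= threshold]

def pca_scores(X, n_components=2):
    """Project samples onto the top principal components using an SVD."""
    centered = X - X.mean(axis=1, keepdims=True)  # center each gene across samples
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (s[:n_components, None] * vt[:n_components]).T  # samples x components

def mean_pairwise_correlation(X, gene_idx):
    """Average Pearson correlation across samples over all pairs of the given genes."""
    corr = np.corrcoef(X[gene_idx, :])  # genes x genes
    upper = np.triu_indices(len(gene_idx), k=1)
    return float(corr[upper].mean())

# Synthetic stand-in for a genes x samples expression matrix (not real data).
rng = np.random.default_rng(0)
n_genes, n_samples = 200, 8
X = rng.normal(size=(n_genes, n_samples))

# Hypothetical co-regulated module: genes 0-19 follow one shared profile plus noise.
module = np.arange(20)
shared_profile = np.linspace(-1.0, 1.0, n_samples)
X[module] = shared_profile + 0.1 * rng.normal(size=(module.size, n_samples))

# Simulate the accidental duplication: sample 4 is a copy of sample 0.
X[:, 4] = X[:, 0]

duplicates = find_duplicate_samples(X)
scores = pca_scores(X)
module_corr = mean_pairwise_correlation(X, module)
background_corr = mean_pairwise_correlation(X, np.arange(100, 120))

print("duplicate sample pairs:", duplicates)
print("PCA scores identical for samples 0 and 4:", np.allclose(scores[0], scores[4]))
print("module correlation:", round(module_corr, 3))
print("background correlation:", round(background_corr, 3))
```

In a PCA plot, duplicated samples land on exactly the same point, which is how such a copy tends to reveal itself. The positive-control check is a comparison rather than an absolute threshold: a dataset in which the known co-regulated set is no more correlated than arbitrary genes deserves suspicion.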
THE ART OF BUILDING AND ANALYZING APPROXIMATE PHYSICAL AND BIOPHYSICAL MODELS
Genomics is becoming increasingly quantitative and is looking more like physics every day. For example, we are beginning to model cell-type or cell-state transitions in development and disease (such as cancer) with dynamical models [14-16], which admit bifurcations and oscillations quite similar to those in other highly nonlinear physical systems, such as the suppression of turbulence by the generation of sheared flows or pressure gradients [17-19]. Predictive quantitative models of how these cellular systems transition between normal and disease states will necessarily be built upon higher-level abstractions of the biochemical interactions between protein complexes and gene regulatory states. While the underlying elements are often combinations of DNA or protein sequence motifs [20,21], their interactions are best described by biophysical and/or polymer models [22,23]. A thorough understanding of the statistical mechanics behind such biophysical modeling gives investigators a significant advantage.
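To make the idea of a bifurcation in a cell-state model concrete, here is a minimal sketch using a generic textbook toy: a single gene that activates its own expression, dx/dt = basal + beta*x^2/(1 + x^2) - gamma*x. This is not one of the models in the cited work [14-16]; the equation, the parameter values, and the function names are all assumptions chosen for illustration. The sketch counts stable steady states as the degradation rate gamma changes.

```python
import numpy as np

def steady_states(gamma, basal=0.02, beta=1.0):
    """Return (x, is_stable) pairs for dx/dt = basal + beta*x^2/(1+x^2) - gamma*x."""
    # Multiplying dx/dt by (1 + x^2) > 0 gives a cubic with the same roots and sign:
    # -gamma*x^3 + (basal + beta)*x^2 - gamma*x + basal
    coeffs = [-gamma, basal + beta, -gamma, basal]
    roots = np.roots(coeffs)
    real = np.sort(roots[np.abs(roots.imag) < 1e-9].real)
    real = real[real >= 0]  # concentrations cannot be negative
    # A steady state is stable when the rate decreases through zero there.
    slope = np.polyval(np.polyder(coeffs), real)
    return [(float(x), bool(s < 0)) for x, s in zip(real, slope)]

def count_stable(gamma):
    """Number of stable steady states at a given degradation rate."""
    return sum(stable for _, stable in steady_states(gamma))

for gamma in (0.4, 0.7):
    print(f"gamma={gamma}: {count_stable(gamma)} stable state(s)",
          [round(x, 3) for x, _ in steady_states(gamma)])
```

With these illustrative parameters the system has a low and a high stable expression state separated by an unstable one at gamma = 0.4, and only the low state at gamma = 0.7. The disappearance of the high state as gamma increases is a saddle-node bifurcation, the same mathematical structure a physicist would recognize from other nonlinear systems.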
UNDERSTANDING THE MATHEMATICS BEHIND COMPUTATION, MACHINE LEARNING, AND STATISTICAL MODELING
Statistical and machine learning will continue to play an important role in the development of such quantitative cellular and organismal models. Historically, artificial intelligence has suffered from exaggeration of its impact, and it seems increasingly likely that the biomedical field is currently experiencing an inflated expectation bubble of this type regarding “deep learning.” Deep learning uses multiple network layers to detect and build hierarchical models of complex features built out of simpler ones. This is an appealing approach, but other algorithms typically have quite similar overall accuracy, and any differences should be understood to develop improved models. Rigorous and systematic experimental and computational evaluation of these methods is essential to be able to use machine learning to make robust and accurate predictions. Systematic testing of computational models is best done in blind assessments of multiple models submitted prior to scoring on datasets explicitly reserved for model evaluation, as has recently been performed for several regulatory genomics predictions [24-26]. Inferring the direction of causal relationships is generally more challenging than inferring statistical dependencies between categorical data [27], and dynamical models whose structure is designed to be consistent with real physical constraints will likely be needed to infer causality in complex biological systems and to make more accurate predictions. Training in the physical sciences and engineering will likely continue to be among the best preparations to contribute to these important questions.
Applying machine learning and statistical model inference correctly requires a thorough understanding of the mathematics behind the algorithms. For statistical learning, graduate-level linear algebra [28] and multivariable calculus are among the most important courses. Even some well-cited genomics papers [29] were eventually found to be of little or no predictive value because of a lack of understanding of potential errors in statistical learning and model evaluation [30,31]. Especially important is an understanding of overfitting, which is a potential problem in the exponentially large parameter space of deep learning algorithms. While many techniques to manage overfitting have been developed in the machine learning community (e.g., weight decay, weight sharing, dropout, and data augmentation), the application of these methods to genomic training datasets of biologically limited size and correlation structure has not yet been fully explored.
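The gap between training accuracy and accuracy on data reserved for evaluation can be seen even in a very small linear model. The sketch below is an illustration under stated assumptions, not the evaluation protocol of the cited assessments [24-26]: the data are synthetic, the "ground truth" weights are hypothetical, and weight decay is implemented as a closed-form L2 (ridge) penalty on a least-squares fit rather than inside a deep network.

```python
import numpy as np

def fit_ridge(X, y, weight_decay):
    """Closed-form least squares with an L2 penalty (weight decay) on the weights."""
    n_features = X.shape[1]
    gram = X.T @ X + weight_decay * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

def mse(X, y, w):
    """Mean squared prediction error of weights w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(1)
n_train, n_test, n_features = 40, 1000, 30  # few samples relative to parameters

# Hypothetical ground truth: only 3 of the 30 features carry signal.
true_w = np.zeros(n_features)
true_w[:3] = [1.5, -2.0, 1.0]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))  # reserved for evaluation only
y_train = X_train @ true_w + rng.normal(size=n_train)
y_test = X_test @ true_w + rng.normal(size=n_test)

results = {}
for weight_decay in (0.0, 10.0):
    w = fit_ridge(X_train, y_train, weight_decay)
    results[weight_decay] = (mse(X_train, y_train, w), mse(X_test, y_test, w))
    print(f"weight_decay={weight_decay}: train MSE={results[weight_decay][0]:.2f}, "
          f"held-out MSE={results[weight_decay][1]:.2f}")
```

Without a penalty, the fit matches the training data better than the noise level should allow, and the held-out error exposes this. Adding weight decay can only raise the training error, so any benefit it brings is visible solely on the reserved data, which is why the evaluation set must never influence fitting or model selection.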
WORKING WITH LARGE CONSORTIA AND PUBLIC DATASETS
Much progress has been made by focusing a broad set of researchers with complementary experimental technologies on a standardized set of biological substrates in consortium efforts [5,32,33]. Learning how to access, model, combine, and leverage these large datasets is itself an important skill. Working in large groups also requires a distinct set of communication, presentation, and organizational abilities, and experience gained in one consortium largely transfers to another. Genomics has moved strongly toward open access to these large datasets and compendia, so researchers with experience in similar fields will be needed and welcome.
CONCLUSION
In summary, the data-rich fields of genomics and computational interdisciplinary biology remain in great need of talented researchers from the quantitative fields of physics, mathematics, statistics, and computer science. One daunting aspect of biological research is the relatively large amount of jargon and detail of biological systems, but since there are ~20,000 genes in the human genome [3,4], there is a bound to this complexity at a level comparable to learning a language. If willing to take on this challenge, one will be pleased to find that most other common characteristics of successful training in quantitative fields transfer surprisingly directly to biomedical research.
Authors: E S Lander; L M Linton; B Birren; et al. Journal: Nature Date: 2001-02-15
Authors: Anat Kreimer; Haoyang Zeng; Matthew D Edwards; et al. Journal: Hum Mutat Date: 2017-03-09
Authors: J C Venter; M D Adams; E W Myers; et al. Journal: Science Date: 2001-02-16