
Quality, quantity and harmony: the DataSHaPER approach to integrating data across bioclinical studies.

Isabel Fortier, Paul R Burton, Paula J Robson, Vincent Ferretti, Julian Little, Francois L'Heureux, Mylène Deschênes, Bartha M Knoppers, Dany Doiron, Joost C Keers, Pamela Linksted, Jennifer R Harris, Geneviève Lachance, Catherine Boileau, Nancy L Pedersen, Carol M Hamilton, Kristian Hveem, Marilyn J Borugian, Richard P Gallagher, John McLaughlin, Louise Parker, John D Potter, John Gallacher, Rudolf Kaaks, Bette Liu, Tim Sprosen, Anne Vilain, Susan A Atkinson, Andrea Rengifo, Robin Morton, Andres Metspalu, H Erich Wichmann, Mark Tremblay, Rex L Chisholm, Andrés Garcia-Montero, Hans Hillege, Jan-Eric Litton, Lyle J Palmer, Markus Perola, Bruce H R Wolffenbuttel, Leena Peltonen, Thomas J Hudson.

Abstract

BACKGROUND: Vast sample sizes are often essential in the quest to disentangle the complex interplay of the genetic, lifestyle, environmental and social factors that determine the aetiology and progression of chronic diseases. The pooling of information between studies is therefore of central importance to contemporary bioscience. However, there are many technical, ethico-legal and scientific challenges to be overcome if an effective, valid, pooled analysis is to be achieved. Perhaps most critically, any data that are to be analysed in this way must be adequately 'harmonized'. This implies that the collection and recording of information and data must be done in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place.
METHODS: This conceptual article describes the origins, purpose and scientific foundations of the DataSHaPER (DataSchema and Harmonization Platform for Epidemiological Research; http://www.datashaper.org), which has been created by a multidisciplinary consortium of experts that was pulled together and coordinated by three international organizations: P³G (Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow Project).
RESULTS: The DataSHaPER provides a flexible, structured approach to the harmonization and pooling of information between studies. Its two primary components, the 'DataSchema' and 'Harmonization Platforms', together support the preparation of effective data-collection protocols and provide a central reference to facilitate harmonization. The DataSHaPER supports both 'prospective' and 'retrospective' harmonization.
CONCLUSION: It is hoped that this article will encourage readers to investigate the project further: the more the research groups and studies are actively involved, the more effective the DataSHaPER programme will ultimately be.

Year: 2010 PMID: 20813861 PMCID: PMC2972444 DOI: 10.1093/ije/dyq139

Source DB: PubMed Journal: Int J Epidemiol ISSN: 0300-5771 Impact factor: 7.196

Introduction

Scientific developments in the wake of the Human Genome and HapMap projects are helping to shape the future of public health and clinical medicine. However, while dramatic progress has been made in detecting genetic associations with complex diseases, the role of genetic determinants is only a part of a much larger picture. The role of lifestyle, environmental and social factors in modulating the risk and/or progression of chronic diseases has been recognized and explored for many years. This is entirely logical even from the perspective of functional genomics: the concept of ‘fitness’ that is central to natural selection and human evolution has, as its fundamental basis, the interaction between prevailing environment and the genome. This implies that causal pathways leading to disease should be ‘expected’ to involve gene–environment interactions. It is therefore clear that bioscience needs access to studies that incorporate social, environmental and lifestyle factors as well as genetic determinants. Provided that the quality of the information that such studies generate is adequate and that the statistical power of key analyses can be rendered sufficient, it will then be possible to successfully pursue a comprehensive investigation of the direct and interactive effects of a broader range of relevant classes of aetiological determinants. However, in the real world, the attainment of adequate statistical power presents a serious challenge. 
For example, when appropriate account is taken of assessment errors in both determinants and outcomes, sample-size estimates for analyses involving gene–environment interactions comparable in magnitude with the direct genetic effects that have so far been replicated typically indicate a requirement for ‘tens of thousands of cases’. This means that even the largest, and best measured, of contemporary studies will only be able to generate enough cases, or subjects, for the commonest of complex diseases. This in turn implies that the analysis of synthesized data across several studies is set to become increasingly important. Such harmonization may be used to support targeted scientific projects, and to facilitate synthesis of information among studies or data portals. Fortunately, extensive experience already exists in the synthesis of epidemiological studies. For example, data synthesis was pivotal to the success of the EPIC study (the European Prospective Investigation into Cancer and Nutrition) which, starting in the 1990s, recruited more than 500 000 participants via (initially) 22 centres across nine European countries. EPIC’s focus on nutrition placed heavy demands on sample size, and effective data synthesis across all centres was therefore critical to many of its principal analyses. Although EPIC was designed prospectively as a coordinated consortium of studies, centre-specific questionnaires were used. In such a setting, the data synthesis was constrained by the quality of the underlying data and by their compatibility. One of the important achievements of the EPIC project was the development of methods and tools (e.g. EPIC SOFT) to enable calibration and pooling of data that had been collected under different protocols in different centres, so that data synthesis was rendered valid. However, in common with other major epidemiological consortia (e.g. the GenomEUtwin project and the EURALIM project), EPIC demonstrated that information synthesis is far from easy. 
It demands time, resources and rigour. Furthermore, as scientific ambitions and capacities have extended, the sample-size challenge continues to grow, and the requirement for effective data synthesis has now become a regular necessity. Moreover, as different sets of outcome and exposure variables are required for different analyses, and no single study can afford to capture ‘all’ desired measures, individual studies are necessarily being pooled with different combinations of other studies, as demonstrated, e.g., by the number of different consortia involving studies such as the Avon Longitudinal Study of Parents and Children (ALSPAC), EPIC-Norfolk and the 1958 Birth Cohort. This implies that it would be beneficial to supplement consortium-specific approaches to harmonization, calibration and synthesis with more generic methods. However, the scientific utility of data synthesis is always constrained by the quantity and quality of the underlying data and by their compatibility between studies. The latter implies that the collection and recording of information and data must be carried out in a manner that is sufficiently similar in the different studies to allow valid synthesis to take place. When this is so, ‘harmonization’ may be said to exist. The fundamental challenge might therefore be viewed as being to increase sample size by synthesizing over an adequate number of studies, but to restrict that synthesis to those studies that are satisfactorily harmonized for the specific outcomes, genetic, environmental and lifestyle factors targeted. Two complementary approaches may be adopted to support effective data synthesis. The first one principally targets ‘what’ is to be synthesized, whereas the other one focuses on ‘how’ to collect the required information. 
Thus: (i) core sets of information may be identified to serve as the foundation for a flexible approach to harmonization; or (ii) standard collection devices (questionnaires and standard operating procedures) may be suggested as a required basis for collection of information. It is with all of these considerations in mind that the DataSHaPER project (DataSchema and Harmonization Platform for Epidemiological Research) has been launched. The DataSHaPER (http://www.datashaper.org) offers free access to questionnaires and core sets of variables that can be used to support the development of data-collection tools for emerging studies or to serve as a central reference for harmonization between pre-existing studies. The DataSHaPER is an international project that is being developed under the joint umbrellas of P³G (the Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and CPT (Canadian Partnership for Tomorrow), in collaboration with more than 50 major studies from around the world. This conceptual article describes the motivation, aims and scientific foundation of the DataSHaPER project.

Harmonization

Standardization is a sine qua non of information pooling. However, scientific, technological, ethical, cultural and other constraints make it difficult to impose identical infrastructures and uniform procedures across studies. Furthermore, it is important to recognize that it is not always necessary to use precisely the same methods and tools for data collection in order to achieve valid data integration across studies. Rather, what ‘is’ crucial is that the information conveyed by each data set is ‘inferentially equivalent’. If the ‘quality’ of the data to be integrated is also adequate, inferential equivalence greatly increases the potential for collaboration between studies and, therefore, the scientific opportunities. The definition of equivalence will vary with the scientific context and must take into account both the primary information collected (e.g. serum cholesterol level) and the qualifying factors that can influence the interpretation of that information (e.g. whether the participant had been fasting prior to sample collection). In some situations, even a small change in the way information is collected can substantially modify scientific compatibility, whereas in others, considerable flexibility can be allowed. Formally, a valid balance must be struck between the use of precisely uniform specifications that render pooling straightforward (e.g. identical questions asked under identical conditions), and the acceptance of greater flexibility and diversity that may be appropriate and more realistic in a collaborative context (e.g. similar questions, but asked by an interviewer in one study and completed by the participant in another). 
In an ideal world, information would be ‘prospectively harmonized’: emerging studies would make use, where possible, of harmonized questionnaires and standard operating procedures. This enhances the potential for future pooling but entails significant challenges, ahead of time, in developing and agreeing to common assessment protocols. However, at the same time, it is important to increase the utility of existing studies by ‘retrospectively harmonizing’ data that have already been collected, to optimize the subset of information that may legitimately be pooled. Here, the quantity and quality of information that can be pooled is limited by the heterogeneity intrinsic to the pre-existing differences in study design and conduct.

The DataSHaPER

Concept

The DataSHaPER is both a scientific approach and a practical tool. Originally, the plan had been to develop a standardized questionnaire (or set of questionnaires) with the primary aim of facilitating prospective harmonization of future biobanks. But after some months of work, it became clear that complete standardization was too restrictive and would be of limited applicability to retrospective harmonization. This resulted in a fundamental change of approach that led to the piloting of the concept that is now known as the DataSHaPER. In order to understand the DataSHaPER, an important distinction must be drawn between core ‘variables’—the primary units of interest in a statistical analysis—and the specific ‘assessment items’ that are collected by individual studies (e.g. questions in questionnaires). It is a pre-defined set of ‘variables’ that serves as the reference for harmonization between studies. This approach provides an appropriate level of flexibility, because a given variable may potentially be constructed using different assessment items in different studies. It is important to note that this does not imply a reduction in scientific rigour; the specific information collected by a given study can only be viewed as harmonized to a particular DataSHaPER variable if the assessment items in that study can be used to generate a ‘valid’ equivalent to the required variable. This entails a formal scientific evaluation and validation process. Structurally, the DataSHaPER is a dynamically evolving entity that is built upon two primary components: the DataSchema Platform and the Harmonization Platform. The former incorporates and documents sets of core variables. The latter reflects a step-by-step approach that facilitates estimation of the potential for harmonization and pooling between studies for defined scientific purposes. The web-based application was developed by the DataSHaPER team with the support of experts in ontologies and open-source software. 
It is written in Java and uses open-source libraries (Spring Framework, Hibernate, Google Web Toolkit, Sesame/Elmo, etc.). The user interface makes extensive use of Asynchronous JavaScript and XML (AJAX) technologies, which provide for enhanced usability. Wherever possible, standard formats and ontologies are used. Thus, key information bearing objects (e.g. DataSchemas and component elements of the Harmonization Platform) are stored using a recognized ontological format to facilitate exchange with other applications. Where relevant, the Generic DataSchema makes use of terms defined in the National Cancer Institute Thesaurus ontology (published on the National Center for Biomedical Ontology BioPortal). Access to the DataSHaPER application and content is open and free (http://www.datashaper.org). The public website presents the published DataSchemas and offers links to their ontology files. However, to access the DataSchemas under development as well as the results generated by the pre-existing Harmonization Platforms, users need to respond to specific criteria, be authenticated by the DataSHaPER Team and use a username and password.

DataSchema Platform

A DataSchema is a hierarchical structure composed of variables nested within domains, themes and modules (Figure 1). Each DataSchema on the Platform is made up of variables that may be derived from: interview administration; health and risk-factor questionnaires; physical and cognitive measures; medical files; sample collection, handling, processing and banking; biochemical measures; registries (e.g. databases containing deaths, hospitalization episodes and environmental variables); and others. Variables may be of primary scientific interest in their own right or qualifying factors that contribute to the interpretation of other information of primary interest. A variable may be complete in itself [e.g. current smoker (yes/no) or measured weight] or it may derive from one or several others (e.g. body mass index).

Figure 1

Hierarchical structure of the module, theme and variables related to the ‘household status’ domain in the Generic DataSchema

The DataSchema platform on the DataSHaPER website contains a comprehensive description of available schemas. Each may include: a list of variables with their definitions and formats; links to relevant ontologies; and access to reference questions/questionnaires, indexes and operating procedures that have been selected or developed. Where possible, variables have been defined such that they can reliably be constructed from standard questionnaires and classifications (e.g. The International Physical Activity Questionnaire for physical activity). Although a DataSchema aims primarily to provide a template for prospective and retrospective harmonization, it also provides a guide to help emerging projects select suitable assessment items and sample collection tools, even when data pooling is not planned. The ‘Generic DataSchema’ is the first schema to have been developed under the DataSHaPER project. It is aimed at supporting the construction of general-purpose baseline questionnaires for use in large cohorts enrolling middle-aged participants. Its construction was a collaborative effort involving investigators from more than 25 international cohorts in 14 countries. Its structure and contents were determined at a series of international consensus workshops held over 2 years (2006–08) with iterative rounds of comments and feedback between meetings. The contents were chosen so as to provide a core data set with broad international applicability. Ethnic and cultural specificity was therefore minimized and the schema was chosen so as to be simple enough to encourage widespread use, yet comprehensive enough to support meaningful research. Detailed selection criteria for individual variables are listed in Box 1. A fundamental aim was to restrict the Generic DataSchema to a limited number of variables identified as key by consensus. 
Box 1

– The variable is of substantial relevance to genomic or public health research.
– The variable may potentially be used for a wide range of research questions in a variety of populations.
– Each level of response to a categorical variable is of high enough prevalence to ensure that sufficient power can potentially be obtained.
– The assessment items required to generate the variable can be obtained in a valid way and reliably be assessed, where possible using standard questions and/or indexes.
– The assessment items required to generate the variable can be collected in a manner that entails no major burden to participants.
– The assessment items required to generate the variable can be collected at acceptable cost.

A variable may be selected if it is of primary interest in its own right, is a qualifying variable that modifies the interpretation of other variables, or is viewed as being a potentially important confounder, or indicator of potential bias.

The Generic DataSchema contains 3 modules, 13 themes, 45 domains and more than 180 variables. As an illustrative example of its content (Figure 1), the theme ‘sociodemographic characteristics’ contains the domain ‘household status’ (defined as a social unit comprised of one or more individuals living together in the same dwelling, all of whom need not be related) which in turn includes three variables: (i) ‘marital status (currently married; yes/no)’; (ii) ‘living with a partner in a common household (yes/no)’; and (iii) ‘number of people who live with the participant in the same household (number)’. Early versions of the Generic DataSchema were used by several large population-based studies to help create their data-collection tools. These included the LifeLines (The Netherlands) and LifeGene (Sweden) Projects as well as the five cohorts in the CPT Project and the Canadian Longitudinal Study on Aging.
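
The nesting of variables within domains, themes and modules can be pictured as a small tree. The following Python sketch is only an illustration of that hierarchy, using the ‘household status’ example; the class names, the field names and the module label are assumptions introduced here and do not describe how the DataSHaPER application (which is written in Java) stores its schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Variable:
    name: str
    fmt: str  # response format, e.g. "yes/no" or "number"


@dataclass
class Domain:
    name: str
    variables: List[Variable] = field(default_factory=list)


@dataclass
class Theme:
    name: str
    domains: List[Domain] = field(default_factory=list)


@dataclass
class Module:
    name: str
    themes: List[Theme] = field(default_factory=list)


# Content taken from the 'household status' example in the text.
household_status = Domain(
    name="household status",
    variables=[
        Variable("marital status (currently married)", "yes/no"),
        Variable("living with a partner in a common household", "yes/no"),
        Variable("number of people who live with the participant in the same household", "number"),
    ],
)
sociodemographic = Theme("sociodemographic characteristics", [household_status])

# The text does not name the enclosing module, so this label is a placeholder.
module = Module("unnamed module (placeholder)", [sociodemographic])


def count_variables(m: Module) -> int:
    """Count the variables nested under every theme and domain of a module."""
    return sum(len(d.variables) for t in m.themes for d in t.domains)


def variable_paths(m: Module) -> List[str]:
    """List each variable with its full module > theme > domain path."""
    return [
        f"{m.name} > {t.name} > {d.name} > {v.name}"
        for t in m.themes
        for d in t.domains
        for v in d.variables
    ]
```

Calling `variable_paths(module)` lists the three household-status variables, each prefixed by its theme and domain.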

Harmonization Platform

It is the Harmonization Platform that enables a DataSchema to be used as a basis for harmonization in a specific scientific context. It provides a rigorous approach to a three-step process that entails: (i) the development of rules providing a formal assessment of the potential for each individual study to generate each of the variables in the DataSchema; (ii) the application of these rules to determine and tabulate the ability of each study to generate each variable, thereby identifying the information that ‘can’ be shared; (iii) where a variable can be constructed by a given study, the development and application of a processing algorithm enabling that study to generate the required variable in an appropriate form. The compatibility of variables is formally assessed on a three-level scale of matching quality: ‘complete’, ‘partial’ or ‘impossible’ (e.g. see Table 1). This process is referred to as ‘pairing’. Rules generated for variable pairing are context specific and will vary according to each harmonization project. Rule creation and pairing are both systematic processes based on protocols involving iteration between domain experts, research assistants and a validation panel. The whole procedure is subject to appropriate quality assurance.
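
Step (iii), the processing algorithm, can be illustrated with a minimal Python sketch. It assumes two hypothetical studies that recorded marital status in different ways and maps both onto the yes/no target variable ‘marital status (currently married)’. The study names, codings and function names are invented for illustration and are not taken from any participating study.

```python
from typing import Optional

# Hypothetical study-specific codings (assumptions, not from the article).
# Study X asked a categorical marital-status question.
STUDY_X_MARRIED_CODES = {"married"}
# Study Y stored a 0/1 "currently married" flag.
STUDY_Y_MARRIED_FLAG = {1: True, 0: False}


def married_from_study_x(marital_status: Optional[str]) -> Optional[bool]:
    """Collapse Study X's categorical answer to the yes/no target variable."""
    if marital_status is None:
        return None  # a missing answer stays missing
    return marital_status.strip().lower() in STUDY_X_MARRIED_CODES


def married_from_study_y(flag: Optional[int]) -> Optional[bool]:
    """Translate Study Y's 0/1 flag; unknown or missing codes become None."""
    return STUDY_Y_MARRIED_FLAG.get(flag)
```

Both functions return the same three-valued result (yes, no or missing), so the harmonized variable has one format regardless of how each study collected the underlying item.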

Table 1

Example of pairing results for the variables under the domain ‘Individual history of diabetes’

Individual history of diabetes	Study A	Study B	Study C	Study D	Study E
Occurrence of diabetes	Complete	Complete	Complete	Complete	Complete
Type of diabetes	Impossible	Impossible	Complete	Partial	Impossible
Onset of diabetes	Partial	Complete	Complete	Complete	Impossible
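
A pairing table of this kind lends itself to simple queries. The sketch below encodes the Table 1 results as a Python dictionary and returns the studies that provide a complete match for every variable requested; the dictionary layout and the function name are illustrative assumptions, not part of the DataSHaPER software.

```python
# Pairing results transcribed from Table 1 (studies A to E).
PAIRING = {
    "Occurrence of diabetes": {
        "A": "complete", "B": "complete", "C": "complete",
        "D": "complete", "E": "complete",
    },
    "Type of diabetes": {
        "A": "impossible", "B": "impossible", "C": "complete",
        "D": "partial", "E": "impossible",
    },
    "Onset of diabetes": {
        "A": "partial", "B": "complete", "C": "complete",
        "D": "complete", "E": "impossible",
    },
}


def studies_with_complete_match(variables):
    """Return the studies whose pairing is 'complete' for every requested variable."""
    studies = None
    for variable in variables:
        complete = {s for s, status in PAIRING[variable].items() if status == "complete"}
        studies = complete if studies is None else studies & complete
    return sorted(studies or [])
```

With these data, an analysis needing only occurrence and onset of diabetes could draw on studies B, C and D, whereas one that also needs the type of diabetes is restricted to study C.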

The first use of the Harmonization Platform was in association with the Generic DataSchema. Pairing rules were therefore developed for all the variables in that schema. As an illustrative example, Table 2 details the rules created for the variable ‘Current quantity of red wine consumed (number of glasses of red wine/week)’. Using such pairing rules, the potential to harmonize 50 large population-based studies (each including at least 10 000 healthy participants) has now been explored for ‘all’ variables in the DataSchema: additional studies joined the collaboration to enable this formal evaluation to take place. In combination, these 50 collaborating studies have recruited or plan to recruit a total of approximately 5.4 million participants.

Table 2

Pairing rules used for the variable ‘Current quantity of red wine consumed’ and specific questions asked by exemplar studies

The detailed results of the full pairing analysis will form the basis of a second paper to follow. For the purposes of the present conceptual article, we will therefore do no more than provide a brief illustration of the nature of the results to be anticipated. For example, using the specific variable considered in Table 2 (‘Average number of glasses of red wine consumed by the participant per week’), 7 (14%) of the 50 studies generated a complete match, 3 (6%) a partial match and 38 (76%) an impossible match. In the particular case being considered, therefore, information from approximately 873 900 participants might potentially be co-analysed for the variable of interest; i.e. from those studies that provide a complete match. In contrast, when the variable ‘Current quantity of wine consumed’ was considered (with no specification of red or white wine), 21 (42%) studies provided a complete match (1.8 million participants). As another example, when the variable ‘measured weight’ was investigated, 36 (72%) studies (3.6 million participants) provided a complete match. According to the pairing rules in this setting, in order that it might be considered a ‘complete match’, the weight of the participant had necessarily to be ‘measured’ at least once by a trained nurse/interviewer with a standard device. Where weight was ‘reported’ by the participant, it was viewed only as a ‘partial match’. However, in order to answer a real scientific question, the pairing statuses of more than one variable must usually be considered simultaneously. For example, if harmonized information is required on ‘Current quantity of wine consumed’, ‘Body Mass Index’ and ‘Current Tobacco Smoker’, a total of 12 studies provide a complete match for all three variables (approximately 1 million participants). At the same time, additional issues must also be taken into account. 
These include ethico-legal constraints on access to data or samples, the compatibility of different study designs and protocols, and the distribution of missing data. Consideration of such issues is fundamental to scientific rigour in using the DataSHaPER.
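
The ‘measured weight’ rule described above, and the way a pooled sample size follows from it, can be sketched in Python. The studies and their sizes below are invented for illustration, and situations the text does not cover (for example, weight measured without a standard device) are treated here as ‘impossible’ purely as an assumption.

```python
from dataclasses import dataclass


@dataclass
class WeightAssessment:
    measured_by_trained_staff: bool  # trained nurse/interviewer, at least once
    standard_device: bool
    self_reported: bool


def pair_measured_weight(a: WeightAssessment) -> str:
    """Apply the pairing rule for 'measured weight' as described in the text."""
    if a.measured_by_trained_staff and a.standard_device:
        return "complete"
    if a.self_reported:
        return "partial"
    return "impossible"  # assumption: anything else cannot be paired


# Hypothetical studies: (how weight was assessed, number of participants).
STUDIES = {
    "Study P": (WeightAssessment(True, True, False), 20000),
    "Study Q": (WeightAssessment(False, False, True), 15000),
    "Study R": (WeightAssessment(False, False, False), 12000),
    "Study S": (WeightAssessment(True, True, True), 30000),
}


def pooled_participants(studies) -> int:
    """Sum participants over the studies that provide a complete match."""
    return sum(
        n for assessment, n in studies.values()
        if pair_measured_weight(assessment) == "complete"
    )
```

In this toy example only Study P and Study S contribute, so `pooled_participants(STUDIES)` gives 50 000 participants. The figure is an upper bound, because ethico-legal constraints, study design and missing data, as noted above, may reduce what can actually be pooled.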

Discussion

The DataSHaPER was originally launched under the P³G and PHOEBE initiatives in response to requests from the members of both consortia for guidelines to support the construction of questionnaires to facilitate prospective harmonization of large population-based studies. But the overall focus evolved, and rapidly subsumed the critical need for tools to support retrospective synthesis of information between existing/legacy studies. As the nascent project progressed, it became clear that one of the primary needs of the scientific community was to have access to comprehensive documentation of the potential to synthesize data across subgroups of studies. It was also recognized that such documentation needed to include descriptions of the procedures used to collect data and to target both generic and specialized data collected by studies using various designs. The network of the DataSHaPER collaborators has therefore extended over time and now includes, e.g., scientists working in disease-oriented networks of studies such as Genecure (chronic kidney diseases). Clearly, the ongoing development of new DataSchemas and Harmonization Platforms will reflect the interests and needs of the scientific teams using and developing them. As illustrative examples, future DataSHaPERs may focus on particular conditions (e.g. stroke, type 2 diabetes), social and lifestyle factors (e.g. nutrition, environmental pollutants), or specific population subgroups (e.g. newborn, elderly). Documenting the potential to synthesize information across studies is critical and should foster collaboration, but it is only a step in the process leading to the final statistical analyses making use of synthesized data sets. In its recent development, the structure and web interface of the DataSHaPER are thus being consolidated in order to facilitate complementarity with other tools and approaches to harmonization, data access, processing, pooling and analysis (e.g. PhenX, dbGaP, DataSHIELD, OBiBa and SAIL). 
It is the access to such integrated suites of tools that will ultimately facilitate the generation of new scientific discoveries using large-scale synthesized data sets across networks of studies. The question ‘What would constitute the ultimate proof of success or failure of the DataSHaPER approach?’ needs to be addressed. Such proof will necessarily accumulate over time, and will involve two fundamental elements: (i) ratification of the basic DataSHaPER approach; and (ii) confirmation of the quality of each individual DataSHaPER as it is developed and/or extended. An important indication of the former would be provided by the widespread use of our tools. However, the ultimate proof of principle will necessarily be based on the generation of replicable scientific findings by researchers using the approach. But for such evidence to accumulate, it will be essential to assure the quality of each individual DataSHaPER (see Box 2). Even if the fundamental approach is sound, its success will depend critically on how individual DataSHaPERs are constructed and used. It seems likely that if consistency and quality are to be assured in the global development of the approach, it will be necessary for new DataSHaPERs to be formally endorsed by a central advisory team.

Box 2

DataSchema
– Process: the development of a DataSchema should be scientifically driven and based upon iterative review and consensus methodologies.
– Outcome (list of core variables): the structure and specific content of each schema have to be formally evaluated. This involves, e.g., the assessment of the content by external groups of experts and systematic comparison with current practice or relevant gold standards and guidelines. Furthermore, for the schemas underpinning data-collection devices (e.g. questionnaires), formal validation of these resultant tools should be undertaken.

Harmonization Platform
– Process: methodical quality control should be implemented throughout the harmonization process. This should include systematic validation of the pairing rules that have been developed and analysis of the agreement between the pairing classification achieved by different staff and an independent control panel.
– Outcome (database including pairing results): formal assessment of the impact of the characteristics of participating studies and of variables on the pairing results should be undertaken to define factors influencing the potential for synthesis. Factors targeted include: (i) study characteristics such as design of the study (e.g. cohort or case–control study), nature of the population (e.g. minimal age at recruitment, sex, place of residence) and the procedural methods used to collect information (e.g. paper- or computer-based questionnaire); and (ii) variable characteristics such as whether the variable has a quantitative or a categorical format, and whether the information defining a particular variable relates to the participant or to his/her family.

Final set of synthesized ‘data’
– As for any database generated by a stand-alone study, once it has been created, the final product of the harmonization process (a synthesized database including data from all participating studies) should be subjected to standard data validation procedures including appropriate range checking and tests of internal validity.

The novelty of the DataSHaPER is not in the scientific challenges or solutions being addressed and proposed: similar projects have been embarked upon before. However, the DataSHaPER provides access to useful tools (see Box 3) and has critical advantages. The approach aims to be generic and flexible, and can be used both prospectively and retrospectively. Furthermore, the web interface can easily be updated as new DataSchemas and Harmonization Platforms are added and thus provides good potential for constant improvement of the content. 
Finally, the DataSHaPER has emerged as a common approach to the concrete need to document the potential to synthesize data across biobanks and cohort studies. However, the scientific utility of any synthesized data set depends on the quality of data to be pooled and on the rigour of the harmonization and synthesis process. The DataSHaPER can make a valuable contribution. However, if it is to be successful, it must continue to evolve and it must be used both widely and wisely.

Box 3

For emerging studies
– Lists of core variables useful in the development of information collection tools relevant in specific scientific contexts.
– Exemplar questionnaires and standard operating procedures enabling collection of these core variables.

For networks of studies to be prospectively or retrospectively harmonized
– A scientific method, web-based platform and provision of expertise for: (i) the definition of the core set of variables to be shared; and (ii) the development and application of the harmonization process.

Funding

Genome Canada and Genome Quebec (The Public Population Project in Genomics); Canadian Partnership Against Cancer (CPT); European FP6 (LSHG-CT-2006-518418 to Promoting Harmonization of Epidemiological Biobanks in Europe); Medical Research Council Project Grant (G0601625; methods programme in genetic epidemiology at the University of Leicester that focuses on genetic statistics and large-scale data harmonization and pooling); Wellcome Trust Supplementary Grant (086160/Z/08/A); Leverhulme Trust Research Fellowship (RF/9/RFG/2009/0062); National Institute for Health Research (Leicester Biomedical Research Unit in Cardiovascular Science); German Federal Ministry of Education and Research (BMBF) in the context of the German National Genome Research Network (NGFN-2 and NGFN-plus) (to E.W.); German Federal Ministry of Education and Research (BMBF) (Model attempt for networking in German research consortia: development of a common concept for biobanks); European Framework 7 (Biobanking and Biomolecular Resources Research Infrastructure); J.L. is a Canada Research Chair in Human Genome Epidemiology.
