Ola Spjuth1,2, Maria Krestyaninova3,4, Janna Hastings3, Huei-Yi Shen5, Jani Heikkinen5, Melanie Waldenberger6,7, Arnulf Langhammer8, Claes Ladenvall9,10, Tõnu Esko11, Mats-Åke Persson9,10, Jon Heggland8, Joern Dietrich3, Sandra Ose3, Christian Gieger6,7, Janina S Ried12, Annette Peters6,7, Isabel Fortier13, Eco J C de Geus14, Janis Klovins15, Linda Zaharenko15, Gonneke Willemsen14, Jouke-Jan Hottenga14, Jan-Eric Litton1,16, Juha Karvanen17,18, Dorret I Boomsma14, Leif Groop5,9, Johan Rung3,19, Juni Palmgren1,5, Nancy L Pedersen1, Mark I McCarthy20,21,22, Cornelia M van Duijn23, Kristian Hveem8, Andres Metspalu11, Samuli Ripatti5,24,25, Inga Prokopenko20,21,26, Jennifer R Harris27.
Abstract
A wealth of biospecimen samples are stored in modern globally distributed biobanks. Biomedical researchers worldwide need to be able to combine the available resources to improve the power of large-scale studies. A prerequisite for this effort is the ability to search and access, in an integrated manner, phenotypic, clinical and other information about samples that are currently stored at biobanks. However, privacy issues together with heterogeneous information systems and the lack of agreed-upon vocabularies have made specimen searching across multiple biobanks extremely challenging. We describe three case studies where we have linked samples and sample descriptions in order to facilitate global searching of available samples for research. The use cases include the ENGAGE (European Network for Genetic and Genomic Epidemiology) consortium comprising at least 39 cohorts, the SUMMIT (surrogate markers for micro- and macro-vascular hard endpoints for innovative diabetes tools) consortium and a pilot for data integration between a Swedish clinical health registry and a biobank. We used the Sample avAILability (SAIL) method for data linking: first, we created harmonised variables; then we annotated and made searchable information on the number of specimens available in individual biobanks for various phenotypic categories. By operating on this categorised availability data we sidestep many obstacles related to privacy that arise when handling real values. We show that harmonised and annotated records about data availability across disparate biomedical archives provide a key methodological advance in pre-analysis exchange of information between biobanks, that is, during the project planning phase.
Year: 2015 PMID: 26306643 PMCID: PMC4929882 DOI: 10.1038/ejhg.2015.165
Source DB: PubMed Journal: Eur J Hum Genet ISSN: 1018-4813 Impact factor: 4.246
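The core idea in the abstract, sharing counts per phenotypic category instead of real values, can be illustrated with a minimal Python sketch. It is not the SAIL implementation: the function names, the record fields (`bmi`, `has_dna`) and the BMI cut-offs are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bmi_category(bmi):
    """Map a real BMI value to a coarse category (cut-offs are assumed)."""
    if bmi is None:
        return "not recorded"
    if bmi < 25:
        return "<25"
    if bmi < 30:
        return "25-30"
    return ">=30"


def availability(records):
    """Reduce individual-level records to counts per (variable, category).

    Only these counts would leave the biobank; the real values stay local.
    """
    counts = Counter()
    for rec in records:
        counts[("bmi", bmi_category(rec.get("bmi")))] += 1
        counts[("dna_sample", "yes" if rec.get("has_dna") else "no")] += 1
    return dict(counts)


# Simulated individual-level data held inside one biobank.
biobank_a = [
    {"bmi": 22.5, "has_dna": True},
    {"bmi": 31.0, "has_dna": True},
    {"bmi": None, "has_dna": False},
]

summary_a = availability(biobank_a)
```

A researcher planning a study could then ask how many individuals in `biobank_a` fall into a given category without ever seeing an individual measurement.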
Terminology used in this manuscript
| Term | Definition |
| Specimen | An individual portion of human, animal, plant, mineral or other material used for a scientific research project |
| Biospecimen | An individual portion of a substance of biological origin, for example, a tissue sample, blood sample or saliva sample, derived from a single participant at a specific time and intended to be used for a scientific research project, which in the context of this study is stored in a biobank |
| Sample | (1) A synonym for ‘biospecimen’, also called ‘biosample’, meaning, for example, a blood sample, tissue sample or urine sample. (2) A number of biospecimens selected for a particular scientific research project and intended to be representative of a given population. For example, an experimental sample might contain 200 cancerous tissue biospecimens from various individuals across Europe or 1000 biospecimens of blood taken from various individuals within the United Kingdom |
| Biomedical data archive or data bank | A storage and retrieval facility or service for biological and medical data. All data archives have three primary functions: the collection, storage and preservation of data |
| Phenotypic variables | Characteristics that vary across a population of interest, for example, height, weight, eye colour, blood pressure and the presence or absence of various clinical conditions such as diabetes |
| VOI | A phenotypic or genotypic variable that is relevant for a particular research project. A selection of such variables is referred to as the VOIs for the research project |
| HV | A single unified vocabulary that has been compiled from several individual vocabulary sources. Where there is partial overlap in the meaning of terms from separate vocabularies but with different exact labels used, synonyms from each of the underlying vocabularies are preserved in the resulting HV |
| Metadata | Information about, or description of, data. The metadata describing a biospecimen sample collection might include, for example, the number of specimens stored in the collection and summary statistics about the population from which the specimens were collected |
| CV | A list of words and phrases intended for marking up or indexing data, selected such that each unit in the vocabulary is unique and unambiguous within the overall vocabulary; the use of a controlled vocabulary thereby ensures consistency in annotation |
| GWAS | A study that examines genetic variants, such as SNPs, across the genome in various individuals to see whether any variant is associated with a phenotype, for example, a disease such as diabetes |
Abbreviations: CV, controlled vocabulary; GWAS, genome-wide association study; HV, harmonised vocabulary; SNP, single-nucleotide polymorphism; VOI, variables of interest.
These definitions have been synthesised and modified from various sources and discussed among the authors, in order to achieve consistency across the manuscript. Many of these terms are used in different ways in different contexts.
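The definition of a harmonised vocabulary (HV) above, in which overlapping terms are unified while their source labels are preserved as synonyms, can be sketched as follows. This is an illustrative reading of the definition, not the authors' tooling; `build_harmonised_vocabulary`, the cohort names and the term labels are all hypothetical.

```python
def build_harmonised_vocabulary(sources, equivalences):
    """Merge several vocabularies into one, keeping source labels as synonyms.

    sources:      {source_name: [term labels]}
    equivalences: {harmonised_label: {source_name: source_label}}
                  (curated by hand; an assumption of this sketch)
    Returns {harmonised_label: sorted list of synonyms}.
    """
    hv = {}
    mapped = set()
    for label, members in equivalences.items():
        synonyms = set()
        for source, term in members.items():
            if term not in sources[source]:
                raise KeyError(f"{term!r} is not a term of {source!r}")
            mapped.add((source, term))
            if term != label:
                synonyms.add(term)
        hv[label] = sorted(synonyms)
    # Terms that no curator has mapped are carried over unchanged.
    for source, terms in sources.items():
        for term in terms:
            if (source, term) not in mapped:
                hv.setdefault(term, [])
    return hv


sources = {
    "cohort_x": ["Body mass index", "Systolic BP"],
    "cohort_y": ["BMI", "Fasting glucose"],
}
equivalences = {
    "Body mass index": {"cohort_x": "Body mass index", "cohort_y": "BMI"},
}

hv = build_harmonised_vocabulary(sources, equivalences)
```

Keeping the synonyms means a search for either source label can still be resolved to the single harmonised term.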
Figure 1. Data harmonisation proceeds on two levels: first, indexing of biospecimens in harmonised terms and, second, harmonisation of variables and descriptors. The left side of the image shows the process of collecting sample information or sample availability information from several resources, that is, from biobanks, into a database. The right side of the image shows the format for such data submission, defined by harmonising variables. So-called ‘original’ vocabularies are descriptors and terms that are used for annotating samples at the biobanks and collections (for the format, see the Methods section). ‘Harmonised’ vocabularies are used as a common representation of several varieties of original sample descriptors, and these are used as the submission format and as the configuration of an online resource discovery tool, Sample Availability Information System.
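The translation from ‘original’ descriptors to the harmonised submission format described in this caption can be sketched as below. The row layout, the function `to_submission` and all descriptor names are assumptions made for illustration; the actual submission format is defined in the Methods section of the paper.

```python
def to_submission(collection, original_counts, mapping):
    """Translate a collection's availability counts into harmonised rows.

    original_counts: {original_descriptor: number of individuals with data}
    mapping:         {original_descriptor: harmonised_variable}
    Returns (rows, unmapped), where unmapped lists descriptors that still
    need curation before they can be submitted.
    """
    rows, unmapped = [], []
    for original, n in original_counts.items():
        harmonised = mapping.get(original)
        if harmonised is None:
            unmapped.append(original)
            continue
        rows.append({
            "collection": collection,
            "variable": harmonised,
            "original": original,
            "available": n,
        })
    return rows, unmapped


# Hypothetical local descriptors and counts for one collection.
mapping = {"BMI_kg_m2": "Body mass index", "SBP": "Systolic blood pressure"}
original_counts = {"BMI_kg_m2": 950, "SBP": 870, "shoe_size": 12}

rows, unmapped = to_submission("collection_a", original_counts, mapping)
```

Returning the unmapped descriptors separately gives the curation teams a concrete list to resolve in the next harmonisation round.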
Figure 2. Workflow and responsibilities for the iterative harmonisation process in the SAIL method, involving multiple curation teams and facilitated by a web-based application. (1) Providing a description of the available data from individual cohorts and the requirements from the researcher. (2) Based on these data, the system administrator creates pre-definitions of a possible parameter setting. (3) Testing and verifying the parameter setting by the data provider. (4) Offering a pilot instance to the consortium user for checking and verifying the system or for getting any feedback for further alterations to the configuration. (5) Feedback is provided by the researcher and the entire process is iterated.
Summary of three applications where the SAIL method was applied
| | ENGAGE | Karolinska Institutet | SUMMIT |
| Number of linked individuals/samples | 184 000 | 1000 | 30 494 |
| Number of linked collections | 15 | 2 (1 Biobank+1 health registry) | 15 |
| Number of harmonised variables | 92 | 13 | 43 |
| Availability software used | SAIL | SAIL | SAIL |
| Key purpose | Sharing and analysing the data from 39 cohorts among 18 consortium partners | Identify subsets in health registry for which there are biobank data available | Assistance in design of GWAS meta-studies for complications in diabetic patients |
| Vocabulary | MetS ( | bbqr ( | Summit ( |
| Web address | Public: sail.simbioms.org | Public (simulated data): sail.simbioms.org/bbqr Private: restricted access user: sailuser, pwd: karolinska | Public (simulated data): sail.simbioms.org/summit Private: restricted access |
Abbreviations: ENGAGE, European Network for Genetic and Genomic Epidemiology; GWAS, genome-wide association study; MetS, metabolic syndrome; SAIL, sample availability; SUMMIT, surrogate markers for micro- and macro-vascular hard endpoints for innovative diabetes tools.
For the Karolinska Institutet and SUMMIT projects there is a public instance of the SAIL software with simulated data, whereas the private instance with real data has restricted access.
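The purpose of the registry pilot, identifying subsets of a health registry for which biobank data are available, amounts to counting overlap per category. The sketch below assumes that both sources share a person identifier and that the registry holds one diagnosis category per person; both are simplifications, and all names and data are simulated.

```python
from collections import Counter


def registry_biobank_overlap(registry, biobank_ids):
    """Count registry individuals per (category, has_biobank_sample)."""
    counts = Counter()
    for person_id, category in registry.items():
        counts[(category, person_id in biobank_ids)] += 1
    return counts


# Simulated registry (identifier -> diagnosis category) and biobank holdings.
registry = {
    "p1": "type 2 diabetes",
    "p2": "type 2 diabetes",
    "p3": "hypertension",
}
biobank_ids = {"p1", "p3", "p9"}

overlap = registry_biobank_overlap(registry, biobank_ids)
```

Only the aggregated `overlap` counts would need to be published in a search interface; the identifier-level join stays with the data holders.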
Data contributors and institutions that participated in mapping activities and data submission for the ENGAGE application harmonised with the SAIL method
| Cohort or collection | Institution |
| MolOBB | Oxford Centre for Diabetes, Endocrinology and Metabolism, Churchill Hospital, Old Road, Oxford OX3 7LJ, UK |
| NFBC66, Genmets case, Genmets control | FIMM, THL and University of Helsinki, Biomedicum Helsinki 2U, 00014 Helsinki, Finland |
| UK-twin | King's College London, UK |
| ERF | Department of Epidemiology and Biostatistics, Erasmus University Medical School, 3000 DR Rotterdam, The Netherlands |
| DGI | Lund University Diabetes Centre, Malmö, Sweden |
| EGCUT | The Estonian Genome Center of University of Tartu |
| KORAF3, KORAF4 | Helmholtz Zentrum München German Research Center for Environmental Health (GmbH) |
| STR | Karolinska Institutet (Karolinska) |
| HUNT1, HUNT2, HUNT3 | HUNT Research Centre, Norwegian University of Science and Technology (NTNU), Trondheim, Norway |
| Latvian Genome Data Base (LGDB) | Genome Centre, Latvian Biomedical Research and Study Centre, Ratsupites 1, Riga LV-1067, Latvia |
Abbreviations: ENGAGE, European Network for Genetic and Genomic Epidemiology; SAIL, sample availability.