Abstract
We review traits of reusable clinical data and offer a typology of clinical repositories with a range of known examples. Sources of clinical data suitable for research can be classified into types reflecting the data's institutional origin, original purpose, level of integration and governance. Primary data nearly always come from research studies and electronic medical records. Registries collect data on focused populations primarily to track outcomes, often using observational research methods. Warehouses are institutional information utilities repackaging clinical care data. Collections organize data from more organizations than a data warehouse, and more original data sources than a registry. Therefore, even if they are heavily curated, their level of internal integration, and thus ease of use, can be less than other types. Federations are like collections except that physical control over data is distributed among donor organizations. Federations sometimes federate, giving a second level of organization. While the size, in number of patients, varies widely within each type of data source, populations over 10 K are relatively numerous, and much larger populations can be seen in warehouses and federations. One imagined ideal structure for research progress has been called an "Information Commons". It would have longitudinal, multi-leveled (environmental through molecular) data on a large population of identified, consenting individuals. These are qualities whose achievement would require long-term commitment on the part of many data donors, including a willingness to make their data public.
Keywords: Big data; Data warehouse; Federated database; Information commons; Observational research; Registry
Year: 2014; PMID: 25825668; PMCID: PMC4340801; DOI: 10.1186/2047-2501-2-4
Source DB: PubMed; Journal: Health Inf Sci Syst; ISSN: 2047-2501
Data repository traits that are relevant to data reuse
| Repository trait | Definition |
|---|---|
| Sample size | Number of patients represented. |
| Data generations from source | Number of times data or access methods were modified, where generation 1 is original source data. |
| Level of integration | Extent of structuring that organizes data for query. |
| Longitudinal observations | Containing observations over multiple times per patient. |
| Personally identified | Capable of delivering direct patient identifiers to research projects. |
| Research accessibility | Extent to which data are accessible to researchers, whether within or outside of a home institution. |
| Data quality | Accuracy, completeness and consistency of data expression. |
| Linked biosamples | Having available biosamples linked to phenotypic information. |
| Biomolecular data | Having biomolecular/omics data linked to phenotypic information. |
Figure 1. Biomedical repository types and sizes. Each type has exemplars with size or range of sizes shown as the log10 of the number of distinct patients represented. When a cell has a number, it is the coefficient of the log: e.g., a 2.7 in the 2 column means 2.7 × 10^2. A filled cell with no number is either part of a known range, or part of an order of magnitude estimated range. Generations from source refers to generations of modification of data or access methods, where the original source data is generation 1. Types and exemplars are discussed in the text. Specific exemplars only appear if data for estimating their size are available.
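The cell encoding described in the Figure 1 caption can be made concrete with a small sketch. The function names `patients_from_cell` and `cell_from_patients` are hypothetical and are not taken from the article; the sketch assumes only the rule stated in the caption, that a number in a given column is a coefficient multiplying ten raised to that column's exponent.

```python
import math


def patients_from_cell(coefficient: float, column_exponent: int) -> int:
    """Decode a Figure 1 cell: coefficient x 10^column_exponent patients."""
    return round(coefficient * 10 ** column_exponent)


def cell_from_patients(n_patients: int) -> tuple[float, int]:
    """Encode a patient count as (coefficient, column exponent).

    The coefficient is rounded to one decimal place, matching the
    precision of the caption's example (2.7).
    """
    if n_patients < 1:
        raise ValueError("patient count must be at least 1")
    exponent = math.floor(math.log10(n_patients))
    coefficient = round(n_patients / 10 ** exponent, 1)
    # Rounding can push the coefficient to 10.0 (e.g. 9,960 patients);
    # in that case move to the next column.
    if coefficient >= 10:
        coefficient = round(coefficient / 10, 1)
        exponent += 1
    return coefficient, exponent


# The caption's own example: a 2.7 in the "2" column.
print(patients_from_cell(2.7, 2))
print(cell_from_patients(270))
```

Under this reading, the abstract's threshold of 10 K patients corresponds to a cell in the 4 column.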
Types of clinical data repositories
| Repository type | Definition |
|---|---|
| Study | A database that collects observations for a specific clinical research study. |
| EHR | A database of observations made as a result of direct health care. |
| Registry | Observations collected and organized for the purpose of studying or guiding particular outcomes on a defined population. Associated studies are either multiple or long-term and evolving over time. |
| Warehouse | A repository that adds levels of integration and quality to the primary (research or clinical) data of a single institution, to support flexible queries for multiple uses. Is broader in application than a registry. |
| Collection | A library of heterogeneous data sets from more organizations than a warehouse or more sources than a registry. Organized to help users find a particular data set, but not to query for data combined across data sets. |
| Federation | A repository distributed across multiple locations, where each location retains control over access to its own data, and is responsible for making the data comparable with the data of other locations. |
Repositories vary based on the purpose, origin, control and integration of their data.
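The distinction the table draws between a federation and the other types, namely that each location keeps control over access to its own data, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Site` class, its `allows_queries` flag, and `federated_count` are hypothetical names and do not describe any system discussed in the article. The sketch also assumes each site has already mapped its records to a shared vocabulary, which the table describes as the site's own responsibility.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical record shape: one patient as a plain dict whose keys
# already follow a vocabulary shared by all sites.
Record = dict
Predicate = Callable[[Record], bool]


@dataclass
class Site:
    """One location in a federation; its data never leave this object."""
    name: str
    records: list
    allows_queries: bool = True  # the site's own access decision

    def count(self, predicate: Predicate) -> Optional[int]:
        """Run the query locally and return only an aggregate count."""
        if not self.allows_queries:
            return None  # the site declines; no data are exposed
        return sum(1 for r in self.records if predicate(r))


def federated_count(sites: list, predicate: Predicate) -> dict:
    """Send the same query to every site and collect per-site answers."""
    return {site.name: site.count(predicate) for site in sites}


sites = [
    Site("hospital_a", [{"age": 70, "dx": "T2DM"}, {"age": 45, "dx": "HTN"}]),
    Site("hospital_b", [{"age": 66, "dx": "T2DM"}]),
    Site("hospital_c", [{"age": 59, "dx": "T2DM"}], allows_queries=False),
]
result = federated_count(sites, lambda r: r["dx"] == "T2DM")
print(result)
```

A warehouse, by contrast, would copy all records into one integrated store and run the predicate once; the design choice in a federation is that only query results, not records, cross site boundaries.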