Mary Mallappallil, Jacob Sabu, Angelika Gruessner, Moro Salifu.
Abstract
Since the 1980s, the volume of data has increased universally, with the collection rate doubling every 40 months. "Big data" is a term that was introduced in the 1990s to include data sets too large to be used with common software. Medicine is a major field predicted to increase the use of big data in 2025. Big data in medicine may be used by commercial, academic, government, and public sectors. It includes biologic, biometric, and electronic health data. Examples of biologic data include biobanks; biometric data may include individual wellness data from devices; electronic health data include the medical record; and other data include demographics and images. Big data has also contributed to changes in research methodology. Changes in the clinical research paradigm have been fueled by large-scale biological data harvesting (biobanks), which is developed, analyzed, and managed by cheaper computing technology (big data), supported by greater flexibility in study design (real-world data) and the relationships between industry, government regulators, and academics. Cultural changes, along with easy access to information via the Internet, facilitate participation by more people. Current needs demand quick answers, which may be supplied by big data, biobanks, and greater flexibility in study design. Big data can reveal health patterns and promises to provide solutions that have previously been out of society's grasp; however, the murkiness of international laws, questions of data ownership, public ignorance, and privacy and security concerns are slowing down the progress that could otherwise be achieved by the use of big data. The goal of this descriptive review is to create awareness of the ramifications of big data and to encourage readers to view this trend as positive and likely to lead to better clinical solutions, although caution must be exercised to reduce harm.
Keywords: Big data; COVID-19; epidemiology/public health; medical research; real-world evidence; research paradigm
Year: 2020 PMID: 32637104 PMCID: PMC7323266 DOI: 10.1177/2050312120934839
Source DB: PubMed Journal: SAGE Open Med ISSN: 2050-3121
Figure 1. Big data in medicine.
Examples of big data and new research design trials.
| Input data | Population | Possible prediction/conclusion |
|---|---|---|
| PIK3CA mutation used as a molecular pathology marker. | Patients with colorectal cancer. | Candidate for aspirin therapy. |
| DNA and RNA collected to determine early biomarkers, in addition to any over-the-counter or prescription drugs, vitamins, or herbs taken by the participant. | Family members of those with Alzheimer’s disease (AD). | To determine who would have early onset AD. |
| A computational pathology model of breast cancer analyzed with AI found 6642 quantitated morphological features. | Patients with breast cancer. | Accurately predicted negative outcomes; in addition, found previously unknown negative prognostic determinants, that is, stromal morphologic structure. |
| 99,693 suicide-related documents from 163 social media sites, taken from 2.35 billion posts over 2 years. Additional variables, including quality of life, were used. | Korean adolescents. | Researchers concluded that academic pressure was the biggest contributor to Korean adolescent suicide risk. |
| A nonrandomized real-world data study used propensity score (PS) matching to balance >120 confounders and determined 24,131 PS-matched pairs of linagliptin and glimepiride initiators. | Type 2 diabetes patients at risk for cardiovascular disease, drawn from Medicare and two commercial insurance data sets. | Researchers concluded that linagliptin has noninferior risk of a composite cardiovascular outcome compared with glimepiride. |
| Lung-MAP is an umbrella-design trial protocol for phase II/III. | Patients with recurrent or metastatic non-small cell lung cancer. | To determine the optimal therapy, whether matched targeted or non-matched. |
AI: artificial intelligence.
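The propensity score (PS) matching mentioned in the table can be illustrated with a minimal sketch. It assumes the propensity scores have already been estimated (for example, by a regression on the measured confounders) and shows only a greedy 1:1 nearest-neighbor match within a caliper. The function name `greedy_ps_match`, the caliper of 0.05, and all patient IDs and scores are hypothetical and invented for illustration; this is not the procedure used in the cited study.

```python
from typing import Dict, List, Tuple


def greedy_ps_match(
    treated: Dict[str, float],
    comparison: Dict[str, float],
    caliper: float = 0.05,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.

    Both arguments map a patient ID to an already-estimated propensity score.
    A pair is kept only if the two scores differ by at most `caliper`.
    """
    available = dict(comparison)  # copy so the caller's dict is left untouched
    pairs: List[Tuple[str, str]] = []
    # Visit treated patients in a fixed order (ascending score) for reproducibility.
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest comparator that has not been matched yet.
        c_id = min(available, key=lambda cid: abs(available[cid] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # each comparator is used at most once
    return pairs


# Hypothetical scores, invented for illustration only.
treated_scores = {"T1": 0.30, "T2": 0.62, "T3": 0.95}
comparison_scores = {"C1": 0.28, "C2": 0.60, "C3": 0.33, "C4": 0.10}

matched_pairs = greedy_ps_match(treated_scores, comparison_scores)
print(matched_pairs)
```

In this toy input, the treated patient with a score of 0.95 has no comparator within the caliper and is left unmatched, which is how matching trades sample size for balance between the groups.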
Big data technology with examples of systems in use.
| | Operational | Analytical |
|---|---|---|
| Advantages | Allows for real-time capture and storage of data. | Allows data to undergo complex analysis rapidly to provide answers. |
| System format | NoSQL | Systems are designed for high throughput (measured in results/unit of time). |
| Data forms | May not be in the usual tabular relational form. It is faster and less expensive than the usual relational databases, and can use the cloud to perform quicker big-volume computations, making big data implementation practical. | Examples include: |
| Computer network capability | Works across many clusters. | Works across many clusters. |
NoSQL: non-structured query language; MPP: massively parallel processing; SQL: structured query language.
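The contrast between tabular relational data and data that "may not be in the usual tabular" form can be sketched as follows. This is a minimal illustration using only the Python standard library: `sqlite3` stands in for a relational (SQL) store, and a list of JSON documents stands in for a NoSQL-style document store. The table name, field names, and values are hypothetical, and no real big data system is being modeled.

```python
import json
import sqlite3

# Relational form: every record must fit the same fixed set of columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (patient_id TEXT, visit_date TEXT, systolic_bp INTEGER)"
)
conn.execute("INSERT INTO visits VALUES (?, ?, ?)", ("P001", "2020-01-15", 128))

# Document form: each record carries its own fields, so a clinic visit and a
# wearable-device reading can be stored side by side without a shared schema.
documents = [
    {"patient_id": "P001", "source": "clinic", "visit_date": "2020-01-15", "systolic_bp": 128},
    {"patient_id": "P001", "source": "wearable", "heart_rate": [72, 75, 71], "steps": 8450},
]
serialized = [json.dumps(doc) for doc in documents]

# A simple analytical question asked of the relational table.
row = conn.execute(
    "SELECT AVG(systolic_bp) FROM visits WHERE patient_id = ?", ("P001",)
).fetchone()
mean_bp = row[0]

# The union of fields across the heterogeneous documents.
fields_seen = sorted({key for doc in map(json.loads, serialized) for key in doc})

print(mean_bp, fields_seen)
conn.close()
```

The document form accepts the wearable record without any schema change, whereas the relational table would need new columns or a new table before it could store heart rate or step counts.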
Figure 2. Big data security.
Weaknesses and consequences faced by big data in the changing research landscape.
| Weakness | Consequences | Examples |
|---|---|---|
| Big data is heterogeneous in nature. | Information may not be readily accessible. | Health fair data, local hospital data, non-electronic data, wearable monitoring devices, and specimens. |
| Limited insight into content and procedures | Imbalance in power between large complex systems, such as international technology firms, and the public. | In Internet-based genetic studies, participants think the product they are paying for is the test kit and services, and could be unaware that the real product is the data from their DNA. |
| Data systems may not be compatible or integrated with others. | Information silo: data remain isolated within a data set and are not adequately shared. RCTs, regulators, biobanks, and participants may be disconnected. | Repeated consent may be needed for the same goals. In Internet-based studies, informed consent forms may not be ideal. |
| Big data is vast and is not yet regulated under privacy laws. | Loss of privacy for participants or providers. | An encryption breach of provider data occurred in an Australian study. |
| | Loss of privacy for biological relatives | Indirect loss of privacy was noted in the case of a relative of an ancestry seeker who was arrested for a serious crime. His discarded DNA was matched to his relative’s DNA, which had been sold to a third party and was accessed by law enforcement legally without a court order. |
| Rushed preemptive release of drugs | Interim results of the phase 3 BELLINI trial showed a greater risk of death in the treatment arm than in the placebo arm: venetoclax, a BCL-2 inhibitor, given with bortezomib and steroids for the treatment of multiple myeloma, was inferior to placebo with regard to mortality, and the FDA stopped clinical trial enrollment. | Highlighted the need for caution when using a therapy for a specific clinical indication; the drug was safely used for other cancers. |
| Insufficient vetting process of technology | In the Theranos example, the technology used for laboratory testing was not verified; instead, direct-to-consumer advertising attracted investors. | Re-emphasized the need to test the product/technology adequately. |
| AI can predict patterns and associations. | Raises the ethical question of whether health insurance companies can charge more for insurance to those identified as at risk by these predictions. | Including labeling those at risk as having a preexisting condition. |
| Data ownership ambiguity | HeLa cells were used for decades; the Supreme Court ruled that no one can own a patent on the human genome. | Myriad Genetics cannot patent technology involving genes that affect breast cancer, which were held as a trade secret; it is unclear who regulates ownership and whether government intervention may partially repudiate the Bayh-Dole Act of 1980, which allowed non-government agencies, including universities, to own patents on discoveries made with federal funding. |
| Finances | Questions about finances and bankruptcy challenged ownership of genetic material. After many successful studies, deCODE Genetics, the company in Iceland that held the country’s biobank, went bankrupt, leaving ownership of the data unclear. | The creation and maintenance of a biobank need to include a fundamentally sound economic model, including an understanding of the market and the value chain for sustaining costs; a “total life cycle cost of ownership model” (TLCO) has been put forth by the National Cancer Institute for the human biobank. |
| Biobanks need publicity | Lack of public awareness | Limited general public information seems to be the norm, despite the presence of as many as 280 biobanks in Europe. |
RCTs: randomized controlled trials; FDA: Food and Drug Administration; AI: artificial intelligence.