Meisha Mandal1, Josh Levy2, Cataia Ives1, Stephen Hwang1, Yi-Hui Zhou3,4, Alison Motsinger-Reif5, Huaqin Pan1, Wayne Huggins1, Carol Hamilton1, Fred Wright3,4, Stephen Edwards1.
Abstract
The need to test chemicals in a timely and cost-effective manner has driven the development of new alternative methods (NAMs) that utilize in silico and in vitro approaches for toxicity prediction. There is a wealth of existing data from human studies that can aid in understanding the ability of NAMs to support chemical safety assessment. This study aims to streamline the integration of data from existing human cohorts by programmatically identifying related variables within each study. Study variables from the Atherosclerosis Risk in Communities (ARIC) study were clustered based on their correlation within the study. The quality of the clusters was evaluated via a combination of manual review and natural language processing (NLP). We identified 391 clusters including 3,285 variables. Manual review of the clusters containing more than one variable determined that human reviewers considered 95% of the clusters related to some degree. To evaluate potential bias in the human reviewers, clusters were also scored via NLP, which showed a high concordance with the human classification. Clusters were further consolidated into cluster groups using the Louvain community finding algorithm. Manual review of the cluster groups confirmed that clusters within a group were more related than clusters from different groups. Our data-driven approach can facilitate data harmonization and curation efforts by providing human annotators with groups of related variables reflecting the themes present in the data. Reviewing groups of related variables should increase the efficiency of human review, and the number of variables reviewed can be reduced by focusing curator attention on variable groups whose theme is relevant for the topic being studied.
Keywords: ARIC; cardiovascular disease; cluster analysis; meta-analysis as topic; systems biology
Year: 2022 PMID: 35899108 PMCID: PMC9310100 DOI: 10.3389/fphar.2022.883433
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.988
FIGURE 1. (A) The correlation, filtering, and clustering process applied to the 14,425 variables in the ARIC study. The variable correlations were calculated and then multiple filtering steps were performed, including filtering by cutoff and N values, correcting for multiple testing, and excluding specific categories of variables. The variables were organized into clusters of interconnected nodes based on the filtered correlation values, resulting in 391 variable clusters. The average distance between variable clusters was calculated and clusters were grouped using community finding algorithms. The cluster groups were manually sorted into categories based on the main goals of the ARIC study. (B) Visual representations of the different levels of organization used in this study. (C) A chart showing definitions and examples for the different levels of organization used in this study.
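The correlate, filter, and cluster steps in panel (A) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the cutoff values (`r_cutoff`, `min_n`), the toy variable names, and the use of Pearson correlation are all assumptions, and the multiple-testing correction and category exclusions mentioned in the caption are omitted. Clusters are taken to be the connected components of the filtered correlation graph.

```python
from itertools import combinations
from math import sqrt


def pearson_pairwise(x, y):
    """Pearson r over positions where both values are present; returns (r, n)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return 0.0, n
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0, n
    return sxy / sqrt(sxx * syy), n


def cluster_variables(data, r_cutoff=0.8, min_n=4):
    """Connected components of the graph whose edges pass both filters.

    r_cutoff and min_n are hypothetical values chosen for the toy data.
    """
    parent = {name: name for name in data}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in combinations(data, 2):
        r, n = pearson_pairwise(data[a], data[b])
        if abs(r) >= r_cutoff and n >= min_n:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for name in data:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: None marks a missing observation for a participant.
toy = {
    "sbp_1": [120, 135, 110, 150, 128, None],
    "sbp_2": [122, 133, 112, 148, 130, 140],
    "beer_days": [0, 3, 1, 0, 5, 2],
    "alcohol_g": [1, 30, 9, 0, 52, 21],
}
clusters = cluster_variables(toy)
print(clusters)
```

On the toy data the two blood pressure readings fall into one cluster and the two alcohol variables into another, because only those pairs pass both the correlation cutoff and the minimum-N filter.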
Examples of human and programmatic evaluation of variable clusters. The table includes the relatedness category from manual review (Category), working definition of the category used by reviewers (Definition), examples of types of relationships in the category (General Examples), examples of ARIC variables that fit each relationship type (Study Variables), a cluster identifier (Cluster Number), and calculated relatedness score from the NLP analysis (Score). The scoring process is described in further detail in the methods section. Examples were selected to demonstrate different types of variable relationships that exist among ARIC variables and the associated relatedness category. See Supplemental Table S1 for all clusters.
| Category | Definition | General examples | Study variables | Cluster Number | Score |
|---|---|---|---|---|---|
| Unrelated | Clusters where a human reviewer would not expect correlation between the variables in the cluster. | Clusters whose variables relate to a topic, such as MRI exclusion criteria, but are disparate and would not be expected to correlate | “Do you have a cardiac pacemaker or a heart valve prosthesis?” and “Do you have metal fragments in your eyes, brain, or spinal cord?” | 269 | 8.5 |
| | | | “Enter code and specify brand and form below” and “What kind of fat do you usually use for baking?” | 213 | 7.9 |
| Related | Clusters where the variables would be expected to be correlated, but not highly. | Clusters where the variables all relate to the same broad topic, such as history of cardiovascular disease | “Medications which secondarily affect cholesterol,” “Average mean arterial blood pressure,” and “Carotid Distensibility” | 1 | 10.5 |
| | | Clusters relating dietary intake of a nutrient and blood level of that nutrient | “In the past year, how often on average did you consume... Dark meat fish, such as salmon, mackerel, swordfish, sardines, bluefish” and “Omega fatty acid W20:5 and W22:6 [g]” | 38 | 11.6 |
| Highly Related | Clusters where a human reviewer would expect a high degree of correlation between the variables. | Clusters where one variable depends on the other | “Ever had emphysema” and “Age emphysema started” | 16 | 35.1 |
| | | Clusters where the variables all relate to the same narrow topic, such as consumption of alcoholic beverages or a history of wheezing | “How many drinks of hard liquor do you usually have per week?,” “How many days in a week do you usually drink beer?” and “Alcohol intake [g] per day” | 46 | 17.7 |
| | | | “[Wheezing]. Ever have to stop for breath when walking at your own pace on the level?” and “[Wheezing]. Ever stop for breath after walking about 100 yards (or after a few minutes) on the level?” | 248 | 40.5 |
| Exact | Clusters where a human reviewer would expect almost complete correlation between the variables. | Clusters with variables that are repeat measurements during the same exam | First, second, and third sitting blood pressure measurement at exam 2 | 58 | 44.2 |
| | | Clusters with variables that ask the same question, potentially in different ways | “I have a fiery temper,” “I am hotheaded,” and “I am quick tempered” | 86 | 32.2 |
| | | | “Have you ever been diagnosed by a doctor as having a polyp or noncancerous tumor of the colon or rectum?” and “Has a doctor ever told you that you had adenoma or polyp of the colon (large intestine)?” | 175 | 32.2 |
| | | Clusters with variables that are the same measurement at different time points | White blood cell count at exam 3 and white blood cell count at exam 4 | 226 | 47.2 |
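As a rough way to read the Score column against the Category column, the example rows in the table can be summarized per category. The sketch below uses only the eleven (category, score) pairs listed in the table; because these rows were hand-picked as examples, the medians say nothing about the full set of 391 clusters, and the helper name `median_by_category` is hypothetical.

```python
from statistics import median

# (category, NLP score) pairs copied from the example rows in the table.
examples = [
    ("Unrelated", 8.5), ("Unrelated", 7.9),
    ("Related", 10.5), ("Related", 11.6),
    ("Highly Related", 35.1), ("Highly Related", 17.7), ("Highly Related", 40.5),
    ("Exact", 44.2), ("Exact", 32.2), ("Exact", 32.2), ("Exact", 47.2),
]

# Categories in increasing order of expected relatedness.
CATEGORY_ORDER = ["Unrelated", "Related", "Highly Related", "Exact"]


def median_by_category(rows):
    """Median score for each manual-review category."""
    scores = {}
    for category, score in rows:
        scores.setdefault(category, []).append(score)
    return {category: median(values) for category, values in scores.items()}


medians = median_by_category(examples)
for category in CATEGORY_ORDER:
    print(f"{category}: {medians[category]:.2f}")
```

For these examples the median score rises from one category to the next, which is the pattern one would expect if the NLP score tracks the reviewers' categories.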
FIGURE 4. Plots of single cluster groups demonstrating cluster cohesiveness around a central theme. Each node is a variable cluster that is a member of the cluster group, and the group of interconnected nodes is one cluster group. (A) An example of a cluster group with clusters relating to maternal health history using a threshold of 0.12 for pruning prior to community finding and 0.07 for viewing. (B) An example of a cluster group with clusters relating to cigarette smoking and lung health using a threshold of 0.18 for pruning prior to community finding and 0.12 for viewing.
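Louvain community finding greedily searches for a partition with high modularity, so the effect of pruning followed by grouping can be illustrated by pruning weak edges and then scoring candidate partitions. The sketch below is an illustration under stated assumptions, not the authors' code: edge weights are treated as similarities (higher means more related), edges below the threshold are dropped, and the node names and weights are invented. It computes modularity only and does not implement the Louvain search itself.

```python
def prune(edges, threshold):
    """Keep edges whose weight is at least the threshold (assumed similarity scale)."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]


def modularity(edges, group_of):
    """Newman modularity Q = sum over groups of L_c/m - (d_c/(2m))^2.

    L_c is the total weight of edges inside group c, d_c the summed weighted
    degree of its nodes, and m the total edge weight of the graph.
    """
    total = sum(w for _, _, w in edges)
    internal = {}
    group_degree = {}
    for u, v, w in edges:
        for node in (u, v):
            g = group_of[node]
            group_degree[g] = group_degree.get(g, 0.0) + w
        if group_of[u] == group_of[v]:
            g = group_of[u]
            internal[g] = internal.get(g, 0.0) + w
    return sum(
        internal.get(g, 0.0) / total - (degree / (2 * total)) ** 2
        for g, degree in group_degree.items()
    )


# Hypothetical cluster-level graph: two tight triangles joined by one edge,
# plus one weak edge that the pruning step removes.
raw_edges = [
    ("a", "b", 0.3), ("b", "c", 0.3), ("a", "c", 0.3),
    ("d", "e", 0.3), ("e", "f", 0.3), ("d", "f", 0.3),
    ("c", "d", 0.3),
    ("a", "f", 0.05),
]
kept = prune(raw_edges, threshold=0.12)

two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_two = modularity(kept, two_groups)
q_one = modularity(kept, one_group)
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the kind of partition a modularity-based method would favor.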
FIGURE 5. Plots of multiple cluster groups demonstrating interconnectivity between cluster groups. Each node is a variable cluster, with cluster groups identified by node color. Black lines are intra-cluster edges and red lines are inter-cluster edges. The threshold for intra-cluster edges is 0.12 and for inter-cluster edges is 0.05. (A) Three interconnected cluster groups related to health history. The green (paternal health history-PHH) and blue (maternal health history-MHH) clusters are linked through the red clusters (family health history-FHH). (B) Three interconnected cluster groups. The green (physical activity) and blue (history of lung diseases) cluster groups are linked through the red cluster group (history of wheezing and breathlessness) but are not directly connected to each other.
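The two-threshold view described in this caption can be sketched as a simple edge classification. The sketch assumes that edge weights are similarities and that an edge is drawn when its weight meets the relevant threshold; the caption does not state the direction of the comparison, so this is an assumption. The cluster identifiers, weights, and the `classify_edges` helper are hypothetical.

```python
def classify_edges(edges, group_of, intra_min=0.12, inter_min=0.05):
    """Split weighted edges into within-group and between-group lists.

    An edge is kept only if it meets the threshold for its type
    (0.12 within a group, 0.05 between groups, as in the caption).
    """
    intra, inter = [], []
    for u, v, w in edges:
        same_group = group_of[u] == group_of[v]
        if same_group and w >= intra_min:
            intra.append((u, v, w))
        elif not same_group and w >= inter_min:
            inter.append((u, v, w))
    return intra, inter


# Hypothetical clusters assigned to three cluster groups.
group_of = {"c1": "MHH", "c2": "MHH", "c5": "MHH", "c3": "FHH", "c4": "PHH"}
edges = [
    ("c1", "c2", 0.30),  # strong edge inside MHH
    ("c2", "c5", 0.10),  # inside MHH but below the intra threshold
    ("c1", "c3", 0.08),  # MHH to FHH
    ("c3", "c4", 0.06),  # FHH to PHH
    ("c1", "c4", 0.02),  # MHH to PHH, below the inter threshold
]

intra, inter = classify_edges(edges, group_of)

# Which pairs of groups are directly linked by a retained inter-group edge?
linked = {tuple(sorted((group_of[u], group_of[v]))) for u, v, _ in inter}
print(intra, inter, linked)
```

In this toy graph the MHH and PHH groups are each linked to FHH but not to each other, mirroring the arrangement described for panel (A).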
FIGURE 2. Clusters selected to demonstrate successful clustering by the variable correlation analysis. (A) A cluster of variables related to maternal history of heart disease. (B) A cluster of variables related to coughing symptoms, frequency, and duration.
FIGURE 3. (A) A depiction of the NLP-based cluster scoring process. (B) Pie chart of the manual scoring of the 391 variable clusters. (C) Plot of cluster scores for clusters in the different relatedness categories.
Examples of clusters whose manual review category agrees (67, 14, 341, 213) or disagrees (70, 49, 403) with their programmatically generated scores. The table includes a cluster identifier (Cluster Number), calculated relatedness score from the NLP analysis (Score), relatedness category from manual review (Category), description of the overarching theme of the cluster (Description), and the ARIC variables in the cluster (Variables). Clusters were selected to highlight cases of agreement and disagreement between programmatic scoring and reviewer category assignment. See Supplemental Table S1 for all clusters.
| Cluster Number | Score | Category | Description | Variables |
|---|---|---|---|---|
| 67 | 42.0 | Exact | lung health history (lung disease) | Has a doctor ever said that you had any of the following: chronic lung disease, such as chronic bronchitis, or emphysema? Q10g [Home Interview, exam 1] |
| | | | | [Medical care]. Has a doctor ever said you had any of the following: (read each disease name and code N if No or Never Tested). Q5. Chronic lung disease, such as chronic bronchitis, or emphysema. Q5E [Health/Medical History, exam 2] |
| 14 | 24.7 | Highly Related | lung health history (asthma) | [Asthma]. Ever had asthma? Q35 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | [Asthma]. Age asthma started Q37 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | [Asthma]. Age asthma stopped. Q39 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | [Wheezing]. Age at first attack. Q18 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | Has a doctor ever said that you had any of the following: asthma? Q10h [Home Interview, exam 1] |
| | | | | [Medical care]. Has a doctor ever said you had any of the following: (read each disease name and code N if No or Never Tested). Q5. Asthma. Q5F [Health/Medical History, exam 2] |
| | | | | [Medical care]. Has a doctor ever said you had any of the following? Asthma. Q6e [Personal History form, exam 4] |
| | | | | [Asthma]. Still have asthma? Q38 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | [Wheezing]. Short Of Breath Wheezing Attack? Q17 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| 341 | 16.4 | Related | lung health history (cough/wheezing) | [Wheezing]. Number years this wheezy or whistling sound been present. Q16 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| | | | | [Cough]. Number years had trouble with phlegm. Q12 [Respiratory Symptoms and Physical Activity Form, exam 1] |
| 213 | 7.8 | Unrelated | diet | [Other dietary items]. Enter code and specify brand and form below. Q78 [Dietary Intake Form (DTIC), exam 3] |
| | | | | [Other dietary items]. What kind of fat do you usually use for baking? Q77 [Dietary Intake Form (DTIC), exam 3] |
| 70 | 6.0 | Exact | medication (cholesterol lowering) | Cholesterol lowering medication W/in 2 weeks.: using 2004 Med. code, visit 2 [Cohort, Exam 2] |
| | | | | Used statin (at visit 2) last 2 weeks (0 = no, 1 = yes) based on 2004 Med. code [Cohort, Exam 2] |
| 49 | 5.7 | Exact | blood pressure measurements (ankle brachial) | Ankle Brachial Index, visit 1, definition 4 [Ankle Brachial Index Data, exam 1] |
| | | | | Ankle-Brachial index return [Ankle Brachial BP (Blood Pressure - ultrasound work station), exam 1] |
| 403 | 50.0 | Related | medication | [Medication records]. Medication code number. Q12B [Medication Survey Form, exam 2] |
| | | | | [Medication records]. Medication code number. Q11B [Medication Survey Form, exam 2] |
FIGURE 6. Cluster groups organized into topics based on the goals of the ARIC study. The outer black and white boxes are topics (e.g., Clinical Outcome and Medical Care), and each topic contains multiple cluster groups (e.g., Stroke and Lung Diseases), shown as blue boxes. Listed under the manually assigned label for each cluster group are bullets representing the clusters that are members of that group. If multiple clusters within a group share the same name, the name is followed by an “x” and the number of times that cluster appears. For example, Anger x3 means there are three clusters in that group with the name Anger. Abbreviations: MHH, Maternal Health History; PHH, Paternal Health History; MH, Medical History.