Ming Huang, Nilay D Shah, Lixia Yao.
Abstract
BACKGROUND: Sequence alignment is a way of arranging sequences (e.g., DNA, RNA, protein, natural language, financial data, or medical events) to identify regions of similarity and the relatedness between two or more sequences. For Electronic Health Records (EHR) data, sequence alignment helps to identify patients with similar disease trajectories, supporting more relevant and precise prognosis, diagnosis, and treatment.
Keywords: Dynamic time warping; Electronic health record; Needleman-Wunsch algorithm; Patient similarity; Sequence alignment; Smith-Waterman algorithm; Temporal sequence
Year: 2019 PMID: 31856819 PMCID: PMC6921442 DOI: 10.1186/s12911-019-0965-y
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1. An illustration demonstrating the significance of sequence alignment. (A) Two simplified temporal event sequences; (B) the scoring function used to calculate the pairwise patient similarity; (C, D) the global sequence alignment algorithms DTW (C) and NWA (D); (E, F) the local sequence alignment algorithms DTWL (E) and SWA (F). The shapes with light blue fill and dashed borders are extra medical events inserted by DTW or DTWL during sequence alignment. “_” is a gap inserted by NWA or SWA during sequence alignment. The different shapes (e.g., diamond, triangle, and circle) represent different medical events. J denotes the Jaccard index.
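The global alignment with gaps described in the caption can be sketched as a Needleman-Wunsch recurrence over daily events, where each daily event is a set of diagnoses. The scoring function of panel (B) is not reproduced in this record, so the match score `2 * J - 1` and the gap penalty of `-1` below are assumptions chosen for illustration, as are the function names.

```python
def jaccard(a, b):
    """Jaccard index J between two daily events (sets of diagnoses)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def needleman_wunsch(seq1, seq2, gap=-1.0):
    """Global alignment score of two sequences of daily events.

    Assumed scoring (not taken from the source): aligning two events
    scores 2 * J - 1, and each inserted gap scores `gap`.
    """
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = score[i - 1][j - 1] + 2 * jaccard(seq1[i - 1], seq2[j - 1]) - 1
            score[i][j] = max(match, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]


seed = [{"influenza"}, {"pneumonia"}, {"type II diabetes"}]
other = [{"influenza"}, {"type II diabetes"}]
raw = needleman_wunsch(seed, other)   # two exact matches and one gap
normalized = raw / len(seed)          # dividing by the seed length N is also an assumption
```

Under these assumptions, deleting one daily event from a seed sequence costs one missed match plus one gap penalty, so the raw score drops by 2 relative to a perfect self-alignment.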
Table 1. Different scenarios of patient clinical encounters on a single day
| Daily event | Medical scenario | Description | Diagnosis record (a) |
|---|---|---|---|
| Single visit on a single day | (i) Single diagnosis | A patient went to see a primary care doctor and received a single diagnosis. | 01/01/2019 (b): Influenza |
| | (ii) Multiple diagnoses | A patient went to see a primary care doctor and received multiple diagnoses. | 01/01/2018: Influenza \| Pneumonia |
| Multiple visits on a single day | (iii) Single and same diagnosis for multiple visits | A patient went to see a primary care doctor and then got transferred to Emergency Room immediately. | 01/01/2019: Acute myocarditis |
| | | | 01/01/2019: Acute myocarditis |
| | (iv) Multiple diagnoses for multiple visits | A patient went to see a primary care doctor for flu. He also visited an endocrinologist for a routine follow-up for type II diabetes. | 01/01/2019: Influenza with pneumonia \| Acute myocarditis |
| | | | 01/01/2019: Type II diabetes \| Benign essential hypertension |
a. For better readability, the diagnosis codes are not listed
b. 01/01/2019 is a hypothetical date used for illustrative purposes
Fig. 2. The distribution of medical record length, measured as the count of unique dates, for patients in the REP database with influenza (acute disease) or type II diabetes (chronic disease) and with three or more types of clinical encounters on a single day (specified in Table 1).
Operations of Deleting, Updating, and Switching on Daily Event and Multi-day Event Block
| Operation | Daily event level | Event block level |
|---|---|---|
| Deleting | Deleting a daily event | Deleting multiple consecutive daily events |
| Updating | Randomly changing a diagnosis in a daily event, or randomly removing a diagnosis if the daily event contains more than one diagnosis | Changing a block of daily events |
| Switching | Switching all the diagnoses in two randomly selected daily events | Switching all the daily events between two selected daily event blocks of the same length |
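The deleting, updating, and switching operations in the table above can be sketched as small functions on a list of daily events. This is a minimal illustration, assuming each daily event is a `frozenset` of diagnosis names; the function names, the example seed sequence, and the use of a seeded `random.Random` to pick positions are hypothetical and not taken from the source. Block-level updating is omitted because the table does not specify how a block is changed.

```python
import random


def delete_event(seq, i):
    """Delete the daily event at index i."""
    return seq[:i] + seq[i + 1:]


def update_event(seq, i, old, new=None):
    """Change diagnosis `old` to `new` in event i, or remove it if `new` is None.

    Removal is only allowed when the event holds more than one diagnosis.
    """
    event = set(seq[i])
    if new is None and len(event) <= 1:
        raise ValueError("cannot remove the only diagnosis of a daily event")
    event.discard(old)
    if new is not None:
        event.add(new)
    return seq[:i] + [frozenset(event)] + seq[i + 1:]


def switch_events(seq, i, j):
    """Switch all diagnoses between daily events i and j."""
    out = list(seq)
    out[i], out[j] = out[j], out[i]
    return out


def delete_block(seq, start, length):
    """Delete `length` consecutive daily events beginning at `start`."""
    return seq[:start] + seq[start + length:]


def switch_blocks(seq, i, j, length):
    """Switch two non-overlapping blocks of equal length starting at i and j."""
    if i > j:
        i, j = j, i
    if i + length > j:
        raise ValueError("blocks overlap")
    return (seq[:i] + seq[j:j + length] + seq[i + length:j]
            + seq[i:i + length] + seq[j + length:])


seed = [
    frozenset({"influenza"}),
    frozenset({"influenza", "pneumonia"}),
    frozenset({"type II diabetes"}),
    frozenset({"hypertension"}),
]
rng = random.Random(0)  # seeded so a synthetic patient can be regenerated
synthetic = delete_event(seed, rng.randrange(len(seed)))
```

Each function returns a new list and leaves the seed sequence untouched, so several synthetic patients can be derived from the same seed.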
Similarity scores of pairwise global sequence alignments
| ID | Operation | Seed Patient 1: DTW | Seed Patient 1: NWA | Seed Patient 1: REF | Seed Patient 2: DTW | Seed Patient 2: NWA | Seed Patient 2: REF | Seed Patient 3: DTW | Seed Patient 3: NWA | Seed Patient 3: REF | Seed Patient 4: DTW | Seed Patient 4: NWA | Seed Patient 4: REF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | x | 0.819 | 0.778 | 0.778 | 0.980 | 0.976 | 0.976 | 0.991 | 0.991 | 0.991 | 0.998 | 0.996 | 0.996 |
| 2 | x x | 0.597 | 0.556 | 0.556 | 0.976 | 0.952 | 0.952 | 0.987 | 0.982 | 0.982 | 0.994 | 0.991 | 0.991 |
| 3 | u | 0.852 | 0.852 | 0.852 | 0.996 | 0.996 | 0.996 | 0.996 | 0.996 | 0.996 | |||
| 4 | u u | 0.714 | 0.714 | 0.714 | 0.952 | 0.952 | 0.952 | 0.984 | 0.984 | 0.984 | 0.993 | 0.993 | 0.993 |
| 5 | s | 0.556 | 0.556 | 0.556 | 0.952 | 0.952 | 0.952 | 0.991 | 0.991 | 0.991 | 0.991 | 0.991 | 0.991 |
| 6 | s s | 0.286 | 0.286 | 0.175 | 0.905 | 0.905 | 0.905 | 0.988 | 0.988 | 0.988 | 0.983 | 0.983 | 0.983 |
| 7 | x u | 0.556 | 0.556 | 0.556 | 0.964 | 0.952 | 0.952 | 0.988 | 0.987 | 0.987 | 0.993 | 0.991 | 0.991 |
| 8 | x s | 0.611 | 0.556 | 0.556 | 0.929 | 0.929 | 0.929 | 0.978 | 0.973 | 0.973 | 0.989 | 0.987 | 0.987 |
| 9 | u s | 0.457 | 0.457 | 0.457 | 0.929 | 0.929 | 0.929 | 0.981 | 0.981 | 0.981 | 0.987 | 0.987 | 0.987 |
| 10 | x u s | 0.363 | 0.289 | 0.289 | 0.905 | 0.905 | 0.905 | 0.969 | 0.964 | 0.964 | 0.984 | 0.983 | 0.983 |
| 11 | X | 0.590 | 0.556 | 0.556 | 0.869 | 0.810 | 0.810 | 0.877 | 0.804 | 0.804 | 0.821 | 0.808 | 0.808 |
| 12 | X X | 0.179 | 0.111 | 0.111 | 0.702 | 0.667 | 0.667 | 0.657 | 0.633 | 0.633 | |||
| 13 | U | 0.667 | 0.667 | 0.667 | 0.821 | 0.821 | 0.821 | 0.832 | 0.832 | 0.832 | 0.831 | 0.831 | 0.831 |
| 14 | U U | 0.551 | 0.551 | 0.551 | 0.786 | 0.786 | 0.786 | 0.711 | 0.709 | 0.709 | 0.695 | 0.695 | 0.695 |
| 15 | S | 0.401 | 0.333 | 0.160 | 0.637 | 0.631 | 0.631 | 0.729 | 0.700 | 0.679 | 0.623 | 0.622 | 0.622 |
| 16 | S S | 0.319 | 0.310 | 0.310 | 0.405 | 0.393 | 0.351 | 0.278 | 0.266 | 0.262 | |||
| 17 | X U | 0.185 | 0.185 | 0.185 | 0.702 | 0.676 | 0.676 | 0.679 | 0.668 | 0.668 | 0.716 | 0.704 | 0.704 |
| 18 | X S | −0.204 | −0.289 | −0.333 | 0.539 | 0.530 | 0.530 | 0.577 | 0.552 | 0.495 | 0.509 | 0.501 | 0.474 |
| 19 | U S | −0.204 | −0.204 | −0.204 | 0.526 | 0.518 | 0.518 | 0.689 | 0.685 | 0.664 | 0.646 | 0.640 | 0.636 |
| 20 | X U S | −0.530 | −0.530 | −0.530 | 0.611 | 0.592 | 0.592 | 0.571 | 0.536 | 0.528 | 0.627 | 0.624 | 0.624 |
a. ID is the synthetic patient index. N is the number of daily events in a seed patient sequence
b. DTW, NWA, and REF refer to Dynamic Time Warping, the Needleman-Wunsch Algorithm, and the baseline reference, respectively
c. The lower case letters “x”, “u”, and “s” denote deleting, updating and switching a daily event, respectively. The upper case letters “X”, “U”, and “S” stand for deleting, updating and switching multi-day events (event block)
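To make the DTW column easier to interpret, here is a minimal sketch of a similarity-maximizing Dynamic Time Warping recurrence. Instead of inserting gaps, it lets one daily event be aligned to several consecutive events in the other sequence. The pairwise score `2 * J - 1` is an assumption, as are the function names; this sketch is not claimed to reproduce the values in the table.

```python
def jaccard(a, b):
    """Jaccard index J between two daily events (sets of diagnoses)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dtw_similarity(seq1, seq2):
    """Best cumulative similarity over all monotone warping paths.

    Assumed scoring (not taken from the source): each aligned pair of
    daily events contributes 2 * J - 1. No gap penalty exists because
    DTW repeats events rather than inserting gaps.
    """
    n, m = len(seq1), len(seq2)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 2 * jaccard(seq1[i - 1], seq2[j - 1]) - 1
            best[i][j] = s + max(best[i - 1][j - 1],  # advance both sequences
                                 best[i - 1][j],      # repeat event j of seq2
                                 best[i][j - 1])      # repeat event i of seq1
    return best[n][m]


seed = [{"influenza"}, {"influenza", "pneumonia"}, {"type II diabetes"}]
other = [{"influenza"}, {"type II diabetes"}]
raw = dtw_similarity(seed, other)
```

In this example the middle seed event is aligned to a repeated "influenza" event (J = 0.5, contributing 0) rather than to a gap, which is why a DTW-style score can be at least as high as a gap-based score for the same pair.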
Similarity scores of pairwise local sequence alignments
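| ID | Operation | Seed Patient 1: DTWL C | Seed Patient 1: DTWL Sn | Seed Patient 1: SWA C | Seed Patient 1: SWA Sn | Seed Patient 1: REF C (Sn) | Seed Patient 2: DTWL C | Seed Patient 2: DTWL Sn | Seed Patient 2: SWA C | Seed Patient 2: SWA Sn | Seed Patient 2: REF C (Sn) | Seed Patient 3: DTWL C | Seed Patient 3: DTWL Sn | Seed Patient 3: SWA C | Seed Patient 3: SWA Sn | Seed Patient 3: REF C (Sn) | Seed Patient 4: DTWL C | Seed Patient 4: DTWL Sn | Seed Patient 4: SWA C | Seed Patient 4: SWA Sn | Seed Patient 4: REF C (Sn) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|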
| 1 | x | 1.000 | 0.980 | 1.000 | 0.976 | 0.738 | 1.000 | 0.991 | 1.000 | 0.991 | 0.571 | 1.000 | 0.998 | 1.000 | 0.996 | 0.500 | |||||
| 2 | x x | 1.000 | 0.597 | 1.000 | 0.556 | 0.333 | 1.000 | 0.976 | 1.000 | 0.952 | 0.452 | 1.000 | 0.987 | 1.000 | 0.982 | 0.509 | 1.000 | 0.994 | 1.000 | 0.991 | 0.511 |
| 3 | u | 1.000 | 0.852 | 1.000 | 0.852 | 0.444 | 1.000 | 0.988 | 1.000 | 0.988 | 0.964 | 1.000 | 0.996 | 1.000 | 0.996 | 0.741 | 1.000 | 0.996 | 1.000 | 0.996 | 0.917 |
| 4 | u u | 1.000 | 0.714 | 1.000 | 0.714 | 0.333 | 1.000 | 0.984 | 1.000 | 0.984 | 0.549 | 1.000 | 0.993 | 1.000 | 0.993 | 0.502 | |||||
| 5 | s | 0.778 | 0.556 | 0.778 | 0.556 | 0.556 | 0.976 | 0.976 | 0.976 | 0.976 | 0.976 | 1.000 | 0.991 | 1.000 | 0.991 | 0.665 | 1.000 | 0.991 | 1.000 | 0.991 | 0.472 |
| 6 | s s | 1.000 | 0.286 | 1.000 | 0.286 | 0.222 | 1.000 | 0.905 | 1.000 | 0.905 | 0.333 | 1.000 | 0.988 | 1.000 | 0.988 | 0.969 | 0.998 | 0.985 | 0.998 | 0.985 | 0.469 |
| 7 | x u | 0.889 | 0.667 | 0.889 | 0.667 | 0.556 | 1.000 | 0.964 | 1.000 | 0.952 | 0.405 | 1.000 | 0.988 | 1.000 | 0.987 | 0.411 | 1.000 | 0.993 | 1.000 | 0.991 | 0.541 |
| 8 | x s | 1.000 | 0.611 | 1.000 | 0.556 | 0.444 | 1.000 | 0.929 | 1.000 | 0.929 | 0.369 | 1.000 | 0.978 | 1.000 | 0.973 | 0.652 | 1.000 | 0.989 | 1.000 | 0.987 | 0.526 |
| 9 | u s | 1.000 | 0.457 | 1.000 | 0.457 | 0.333 | 1.000 | 0.929 | 1.000 | 0.929 | 0.893 | 1.000 | 0.981 | 1.000 | 0.981 | 0.487 | 1.000 | 0.987 | 1.000 | 0.987 | 0.373 |
| 10 | x u s | 0.444 | 0.400 | 0.444 | 0.400 | 0.222 | 1.000 | 0.905 | 1.000 | 0.905 | 0.560 | 1.000 | 0.969 | 1.000 | 0.964 | 0.335 | 1.000 | 0.984 | 1.000 | 0.983 | 0.648 |
| 11 | X | 0.778 | 0.778 | 0.778 | 0.778 | 0.778 | 1.000 | 0.869 | 1.000 | 0.810 | 0.560 | 0.891 | 0.891 | 0.891 | 0.891 | 0.891 | |||||
| 12 | X X | 0.333 | 0.333 | 0.333 | 0.333 | 0.333 | 1.000 | 0.702 | 1.000 | 0.667 | 0.619 | 1.000 | 0.708 | 1.000 | 0.625 | 0.509 | 1.000 | 0.657 | 1.000 | 0.633 | 0.618 |
| 13 | U | 1.000 | 0.821 | 1.000 | 0.821 | 0.762 | 0.835 | 0.835 | 0.835 | 0.835 | 0.835 | 0.893 | 0.893 | 0.893 | 0.893 | 0.893 | |||||
| 14 | U U | 1.000 | 0.551 | 1.000 | 0.551 | 0.222 | 0.952 | 0.826 | 0.952 | 0.826 | 0.821 | 1.000 | 0.711 | 1.000 | 0.709 | 0.634 | 1.000 | 0.695 | 1.000 | 0.695 | 0.445 |
| 15 | S | 0.893 | 0.714 | 0.893 | 0.708 | 0.548 | 1.000 | 0.729 | 0.862 | 0.711 | 0.545 | 0.838 | 0.650 | 0.838 | 0.650 | 0.478 | |||||
| 16 | S S | 0.556 | 0.389 | 0.556 | 0.333 | 0.222 | 0.690 | 0.500 | 0.690 | 0.500 | 0.452 | 0.728 | 0.463 | 0.728 | 0.460 | 0.335 | |||||
| 17 | X U | 0.556 | 0.556 | 0.556 | 0.556 | 0.556 | 1.000 | 0.702 | 0.679 | 0.679 | 0.679 | 1.000 | 0.679 | 1.000 | 0.668 | 0.317 | 0.939 | 0.756 | 0.939 | 0.745 | 0.535 |
| 18 | X S | 0.222 | 0.222 | 0.222 | 0.222 | 0.222 | 1.000 | 0.539 | 1.000 | 0.530 | 0.262 | 1.000 | 0.577 | 0.821 | 0.561 | 0.362 | 0.734 | 0.569 | 0.734 | 0.567 | 0.476 |
| 19 | U S | 0.222 | 0.222 | 0.222 | 0.222 | 0.111 | 1.000 | 0.526 | 0.833 | 0.518 | 0.310 | 1.000 | 0.689 | 1.000 | 0.685 | 0.625 | 1.000 | 0.646 | 1.000 | 0.640 | 0.498 |
| 20 | X U S | 0.222 | 0.222 | 0.222 | 0.222 | 0.111 | 0.857 | 0.706 | 0.857 | 0.706 | 0.571 | 0.795 | 0.608 | 0.795 | 0.598 | 0.576 | 0.893 | 0.731 | 0.893 | 0.731 | 0.618 |
a. ID is the synthetic patient index. N is the number of daily events in a seed patient sequence
b. DTWL, SWA, and REF refer to modified Dynamic Time Warping for Local alignment, the Smith-Waterman Algorithm, and the baseline reference, respectively. C is the coverage of the seed patient sequence aligned to a synthetic patient sequence. Sn is the normalized highest alignment score (i.e., the highest alignment score divided by N). C (Sn) denotes that Sn = C
c. The lower case letters “x”, “u”, and “s” denote deleting, updating and switching a daily event, respectively. The upper case letters “X”, “U”, and “S” stand for deleting, updating and switching multi-day events (event block)
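The coverage C and normalized score Sn defined in footnote b can be illustrated with a small Smith-Waterman sketch. The scoring (`2 * J - 1` per aligned pair, gap penalty `-1`) is an assumption, and coverage is interpreted here as the fraction of seed daily events spanned by the best local alignment; the source does not spell out either detail, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index J between two daily events (sets of diagnoses)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def smith_waterman(seed, other, gap=-1.0):
    """Return (C, Sn) for the best local alignment of `seed` against `other`.

    Assumptions (not taken from the source): pair score 2 * J - 1,
    gap penalty `gap`, and C = spanned seed events / N.
    """
    n, m = len(seed), len(other)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = h[i - 1][j - 1] + 2 * jaccard(seed[i - 1], other[j - 1]) - 1
            h[i][j] = max(0.0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score reaches zero.
    i, j = best_pos
    end = i
    while i > 0 and j > 0 and h[i][j] > 0:
        s = 2 * jaccard(seed[i - 1], other[j - 1]) - 1
        if h[i][j] == h[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    coverage = (end - i) / n
    return coverage, best / n


seed = [{"a"}, {"b"}, {"c"}, {"d"}]
other = [{"x"}, {"b"}, {"c"}, {"y"}]
c, sn = smith_waterman(seed, other)
```

Here only the middle two seed events align, so both C and Sn come out at one half. With a perfect local match, Sn equals C, which matches the footnote's definition of the REF column.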
Fig. 3. Scenarios of global sequence alignment: (a) Deleting, (b) Updating, and (c) Switching. REF, DTW, and NWA refer to the reference alignment, alignment with Dynamic Time Warping, and alignment with the Needleman-Wunsch Algorithm, respectively. In each pair, the seed sequence is listed on top and the aligned synthetic sequence on the bottom. The similarity scores (Sn) between the seed sequence and the synthetic sequence are listed on the right side of each pair.
Fig. 4. Scenarios of local sequence alignment: (a, b) Deleting, (c, d) Updating, and (e, f) Switching. REF, DTWL, and SWA refer to the reference alignment, alignment with modified Dynamic Time Warping for Local alignment, and alignment with the Smith-Waterman Algorithm, respectively. In each pair, the seed sequence is listed on top and the aligned synthetic sequence on the bottom. The coverage (C) and similarity scores (Sn) between the seed sequence and the synthetic sequence are listed on the right side of each pair.