Feichen Shen, Sijia Liu, Yanshan Wang, Andrew Wen, Liwei Wang, Hongfang Liu.
Abstract
BACKGROUND: In the United States, a rare disease is defined as one affecting no more than 200,000 patients at a given time. Patients with rare diseases are often misdiagnosed or left undiagnosed, possibly because clinical practitioners have insufficient knowledge of or experience with the disease. As the volume of electronically accessible medical data grows exponentially, a large amount of information on thousands of rare diseases and their potentially associated diagnostic information is buried in electronic medical records (EMRs) and the medical literature.
Keywords: electronic medical record; literature; rare diseases; text mining
Year: 2018; PMID: 30305261; PMCID: PMC6231873; DOI: 10.2196/11301
Source DB: PubMed; Journal: JMIR Med Inform
Figure 1. System workflow. EMR: electronic medical record; UMLS: Unified Medical Language System.
Statistics for prepared datasets.
| Datasets | EMRa only, n | EMR and literature (EMR+L), n | EMR and pruned literature (EMR+PL), n |
| Patients or literature sources | 38,607 | 40,241 | 39,677 |
| Phenotypes | 3271 | 3818 | 3271 |
| Rare diseases | 1074 | 1634 | 1074 |
| Phenotype-disease associations | 141,036 | 154,802 | 141,036 |
| GARDb categories covered | 28 | 31 | 28 |
aEMR: electronic medical record.
bGARD: Genetic and Rare Diseases Information Center.
Figure 2. Root mean square error (RMSE) for k-nearest neighbors (KNN) with four similarity measurements. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; L: literature; LL: log likelihood ratio similarity; OL: overlap coefficient similarity; PL: pruned literature; TANI: Tanimoto coefficient similarity.
Figure 3. Root mean square error (RMSE) for threshold patient neighbor (TPN) with four similarity measurements. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; L: literature; LL: log likelihood ratio similarity; OL: overlap coefficient similarity; PL: pruned literature; TANI: Tanimoto coefficient similarity.
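For readers unfamiliar with the set-based measures named in Figures 2 and 3, the following is a minimal sketch of two of them, the Tanimoto and overlap coefficients, using their standard set definitions. It assumes each patient is represented as a set of phenotype terms; the function names and the example phenotypes are hypothetical and are not taken from the study's implementation.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical phenotype sets for two patients (illustrative only).
patient_x = {"seizure", "ataxia", "chorea"}
patient_y = {"ataxia", "chorea", "dementia", "dysphagia"}

print(tanimoto(patient_x, patient_y))  # 2 shared / 5 distinct = 0.4
print(overlap(patient_x, patient_y))   # 2 shared / 3 in the smaller set
```

The overlap coefficient normalizes by the smaller set, so it tends to score sparse patient profiles higher than the Tanimoto coefficient does for the same number of shared phenotypes.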
Optimal thresholds for different evaluation groups.
| Optimal parameters | TANIa | | | LLb | | | OLc | | | FMGd | | |
| | EMRe | EMR+Lf | EMR+PLg | EMR | EMR+L | EMR+PL | EMR | EMR+L | EMR+PL | EMR | EMR+L | EMR+PL |
| Optimal k for KNNh | 11 | 10 | 9 | 4 | 4 | 4 | 4 | 4 | 4 | 7 | 6 | 6 |
| Optimal threshold for TPNi | 0.19 | 0.19 | 0.2 | 0.72 | 0.73 | 0.76 | 0.51 | 0.49 | 0.51 | 0.12 | 0.11 | 0.12 |
aTANI: Tanimoto coefficient similarity.
bLL: log likelihood ratio similarity.
cOL: overlap coefficient similarity.
dFMG: Fager and McGowan coefficient similarity.
eEMR: electronic medical record.
fL: literature.
gPL: pruned literature.
hKNN: k-nearest neighbors.
iTPN: threshold patient neighbor.
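The two neighbor-selection strategies named in the footnotes above, k-nearest neighbors (KNN) and threshold patient neighbor (TPN), can be contrasted with a short sketch. It assumes that KNN keeps a fixed number of the most similar patients, that TPN keeps every patient whose similarity reaches a cutoff, and that the cutoff is inclusive; the function names and the similarity scores are hypothetical.

```python
def knn_neighbors(similarities: dict, k: int) -> list:
    """Return the k most similar patient IDs (ties broken by ID)."""
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, _ in ranked[:k]]


def threshold_neighbors(similarities: dict, threshold: float) -> list:
    """Return every patient ID whose similarity is at least `threshold`."""
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, score in ranked if score >= threshold]


# Hypothetical similarity scores between one query patient and five others.
scores = {"p1": 0.42, "p2": 0.18, "p3": 0.25, "p4": 0.05, "p5": 0.19}

print(knn_neighbors(scores, k=2))         # ['p1', 'p3']
print(threshold_neighbors(scores, 0.19))  # ['p1', 'p3', 'p5']
```

With KNN the neighborhood size is constant across patients, whereas with a threshold it varies with how many similar patients exist, which is one reason the two strategies need to be tuned separately.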
Figure 4. Precision-recall curves and area under the precision-recall curve (PRAUC) for Tanimoto coefficient similarity (TANI) with k-nearest neighbors (KNN) and threshold patient neighbors (TPN). EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; SNOMED: systematized nomenclature of medicine; TANI: Tanimoto coefficient similarity.
Figure 5. Precision-recall curves and area under the precision-recall curve (PRAUC) for log likelihood ratio similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; LL: log likelihood ratio similarity; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Figure 6. Precision-recall curves and area under the precision-recall curve (PRAUC) for overlap coefficient similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; OL: overlap coefficient similarity; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Figure 7. Precision-recall curves and area under the precision-recall curve (PRAUC) for Fager and McGowan coefficient similarity with k-nearest neighbors and threshold patient neighbors. EMR: electronic medical record; FMG: Fager and McGowan coefficient similarity; GARD: Genetic and Rare Diseases Information Center; KNN: k-nearest neighbors; SNOMED: systematized nomenclature of medicine; TPN: threshold patient neighbor.
Mean average precision for TANIa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
| Matching criterion | EMR | EMR+L | EMR+PL | |||
| KNNe | TPNf | KNN | TPN | KNN | TPN | |
| String | 0.435 | 0.441 | 0.436 | 0.445 | ||
| SNOMEDg | 0.469 | 0.475 | 0.474 | 0.479 | ||
| GARDh | 0.739 | 0.742 | 0.742 | 0.745 | ||
aTANI: Tanimoto coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
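As a reminder of how a mean average precision figure is typically obtained, here is a minimal sketch under stated assumptions: each query yields a ranked list of recommended diseases, the matching criterion decides which of them count as relevant, and average precision is normalized by the number of relevant items actually retrieved. The function names and the toy rankings are hypothetical; the study's exact evaluation code is not shown in this text.

```python
def average_precision(ranked: list, relevant: set) -> float:
    """Mean of precision@i over the ranks i at which a relevant item appears."""
    hits = 0
    precisions = []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs: list) -> float:
    """Average the per-query AP over (ranked list, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with ranked disease recommendations.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 3))  # 0.667
```

A looser matching criterion enlarges the relevant set for each query, which generally raises the score for the same ranking.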
Mean average precision for FMGa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
| Matching criterion | EMR | EMR+L | EMR+PL | |||
| KNNe | TPNf | KNN | TPN | KNN | TPN | |
| String | 0.18 | 0.205 | 0.264 | 0.27 | ||
| SNOMEDg | 0.192 | 0.221 | 0.288 | 0.297 | ||
| GARDh | 0.568 | 0.584 | 0.651 | |||
aFMG: Fager and McGowan coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
Figure 8. Prediction performance for rare diseases. GARD: Genetic and Rare Diseases Information Center; SNOMED: systematized nomenclature of medicine.
Recommendation performance for selected rare diseases (3 high, 3 medium to high, 3 low).
| Approaches and top diseases | Number of patients affected | Precision | Recall | F1 score | |
| Holoprosencephaly | <10 | 0.75 | 1 | 0.86 | |
| Huntington disease | <10 | 1 | 0.67 | 0.8 | |
| Juvenile polyposis syndrome | <10 | 0.91 | 0.71 | 0.8 | |
| Sacrococcygeal teratoma | 15 | 0.83 | 0.67 | 0.74 | |
| Frontotemporal dementia | 202 | 0.69 | 0.58 | 0.63 | |
| Polycystic liver disease | 72 | 0.64 | 0.58 | 0.61 | |
| Hemicrania continua | 36 | 0.08 | 0.25 | 0.12 | |
| Intrahepatic cholangiocarcinoma | 94 | 0.08 | 0.22 | 0.12 | |
| Neuromyelitis optica | 50 | 0.16 | 0.1 | 0.12 | |
| Myxoid liposarcoma | 37 | 0.94 | 0.89 | 0.91 | |
| Linear scleroderma | 16 | 0.91 | 0.71 | 0.8 | |
| Migraine with brainstem aura | 15 | 0.75 | 1 | 0.86 | |
| Hypophosphatemic rickets | <10 | 0.83 | 0.75 | 0.79 | |
| Congenital radio ulnar synostosis | 14 | 0.67 | 0.86 | 0.75 | |
| Spasmodic dysphonia | 177 | 0.83 | 0.67 | 0.74 | |
| Acute graft-versus-host disease | 20 | 0.1 | 0.5 | 0.15 | |
| Cryptogenic organizing pneumonia | 37 | 0.14 | 0.17 | 0.15 | |
| Cerebellar degeneration | 29 | 0.14 | 0.17 | 0.15 | |
| Acrospiroma | <10 | 1 | 1 | 1 | |
| Birt-Hogg-Dube syndrome | <10 | 1 | 1 | 1 | |
| Dendritic cell tumor | <10 | 1 | 1 | 1 | |
| Acute promyelocytic leukemia | 15 | 0.97 | 0.93 | 0.95 | |
| Migraine with brainstem aura | 15 | 1 | 0.88 | 0.93 | |
| Thyroid cancer, anaplastic | 30 | 1 | 0.86 | 0.92 | |
| Addison disease | 34 | 0.88 | 0.45 | 0.6 | |
| Encephalocele | 56 | 0.4 | 0.59 | 0.48 | |
| Mixed connective tissue disease | 78 | 0.4 | 0.48 | 0.43 | |
aLL: log likelihood ratio similarity.
bKNN: k-nearest neighbors.
cSNOMED: Systematized Nomenclature of Medicine.
dGARD: Genetic and Rare Diseases Information Center.
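The final numeric column of the table above is consistent, up to rounding of the reported inputs, with the harmonic mean of precision and recall (the F1 score). The short check below recomputes it for three rows copied from the table; the function name `f1_score` is an illustrative choice, not something taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (disease, precision, recall, reported final column), copied from the table.
rows = [
    ("Holoprosencephaly", 0.75, 1.0, 0.86),
    ("Huntington disease", 1.0, 0.67, 0.8),
    ("Sacrococcygeal teratoma", 0.83, 0.67, 0.74),
]

for name, p, r, reported in rows:
    # Print the recomputed value next to the reported one for comparison.
    print(name, round(f1_score(p, r), 2), reported)
```

Because the table reports precision and recall to two decimals, a recomputed value can differ from the reported one in the last digit for some rows.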
Mean average precision for LLa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
| Matching criterion | EMR | EMR+L | EMR+PL | |||
| KNNe | TPNf | KNN | TPN | KNN | TPN | |
| String | 0.368 | 0.351 | 0.46 | 0.391 | ||
| SNOMEDg | 0.405 | 0.386 | 0.495 | 0.426 | ||
| GARDh | 0.683 | 0.67 | 0.745 | 0.71 | ||
aLL: log likelihood ratio similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.
Mean average precision for OLa with EMRb, EMR+Lc, and EMR+PLd (optimal in italics).
| Matching criterion | EMR | EMR+L | EMR+PL | |||
| KNNe | TPNf | KNN | TPN | KNN | TPN | |
| String | 0.117 | 0.11 | 0.167 | 0.148 | ||
| SNOMEDg | 0.126 | 0.122 | 0.179 | 0.162 | ||
| GARDh | 0.457 | 0.48 | 0.505 | 0.509 | ||
aOL: overlap coefficient similarity.
bEMR: electronic medical record.
cL: literature.
dPL: pruned literature.
eKNN: k-nearest neighbors.
fTPN: threshold patient neighbor.
gSNOMED: systematized nomenclature of medicine.
hGARD: Genetic and Rare Diseases Information Center.