David A Hanauer, Qiaozhu Mei, V G Vinod Vydiswaran, Karandeep Singh, Zach Landis-Lewis, Chunhua Weng.
Abstract
BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes.
Keywords: Information retrieval; Lexical variation; Natural language processing
Year: 2019 PMID: 30944012 PMCID: PMC6448181 DOI: 10.1186/s12911-019-0784-1
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Lexical Variants Included in this Paper
| Lexical Variant Category | Examples |
|---|---|
| Positive integers | ‘three’, ‘thirty-three’, ‘seventy-three’ |
| Negative integers | ‘minus three’, ‘minus 3’ |
| Fractions | ‘one third’, ‘one thirds’, ‘six eights’ |
| Dimensions | ‘one by three’, ‘two by four’ |
| Ranges/odds | ‘one to three’, ‘two to four’ |
| Dates, including invalid | ‘January 35’, ‘June 31’, ‘September 38’ |
| Roman numerals | ‘X’, ‘XV’, ‘XXIV’, ‘XXVIII’, ‘XXXV’ |
| Medical classifications | ‘1A’, ‘IID’, ‘type 2’, ‘type II’, ‘class III’ |
| Ages, including implausible values | ‘135 year old’, ‘septuagenarian’ |
| Expressions of quantity | ‘billions’, ‘octillion’, ‘gobs of’ |
| Ordering/ranking | ‘1st’, ‘1rd’, ‘firstly’, ‘1stly’, ‘primary’ |
| Tuples | ‘single’, ‘double’, ‘triple’, ‘quadruple’ |
Negative Integers
| minus one | minus two | minus three | minus four | minus five | minus six | minus seven | minus eight | minus nine | minus ten |
|---|---|---|---|---|---|---|---|---|---|
| minus 1 | minus 2 | minus 3 | minus 4 | minus 5 | minus 6 | minus 7 | minus 8 | minus 9 | minus 10 |
| negative one | negative two | negative three | negative four | negative five | negative six | negative seven | negative eight | negative nine | negative ten |
| negative 1 | negative 2 | negative 3 | negative 4 | negative 5 | negative 6 | negative 7 | negative 8 | negative 9 | negative 10 |
Fractions
| | half(s)/halve(s) | third(s) | fourth(s) | fifth(s) | sixth(s) | seventh(s) | eighth(s) | ninth(s) | tenth(s) |
|---|---|---|---|---|---|---|---|---|---|
| one | 287,671 | 57,040 | 4389 | 5454 | 177 | 48 | 1455 | 4 | 588 |
| two | 824 | 35,220 | 64 | 1112 | 6 | 21 | 9 | 1 | 182 |
| three | 2609 | 58 | 3347 | 286 | 6 | 19 | 287 | 0 | 91 |
| four | 1335 | 485 | 10 | 177 | 3 | 24 | 4 | 0 | 40 |
| five | 712 | 1 | 9 | 27 | 10 | 14 | 52 | 0 | 19 |
| six | 186 | 1 | 1 | 4 | 0 | 19 | 1 | 2 | 33 |
| seven | 89 | 0 | 0 | 7 | 0 | 0 | 33 | 0 | 19 |
| eight | 52 | 0 | 1 | 3 | 1 | 0 | 3 | 20 | 25 |
| nine | 36 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 48 |
| ten | 14 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Dimensions
| | one | two | three | four | five | six | seven | eight | nine |
|---|---|---|---|---|---|---|---|---|---|
| one by | 2332 | 12 | 7 | 1 | 1 | 2 | 0 | 1 | 0 |
| two by | 13 | 51 | 23 | 59 | 1 | 1 | 0 | 0 | 0 |
| three by | 1 | 8 | 20 | 8 | 5 | 0 | 0 | 0 | 0 |
| four by | 1 | 4 | 13 | 76 | 3 | 1 | 0 | 15 | 0 |
| five by | 0 | 3 | 2 | 5 | 5 | 1 | 1 | 1 | 1 |
| six by | 5 | 2 | 2 | 1 | 0 | 3 | 0 | 2 | 2 |
| seven by | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 |
| eight by | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 2 | 0 |
| nine by | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ten by | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Ranges or Odds
| | one | two | three | four | five | six | seven | eight | nine |
|---|---|---|---|---|---|---|---|---|---|
| one to | 24,976 | 599,217 | 25,720 | 5151 | 3848 | 3964 | 496 | 170 | 40 |
| two to | 493 | 2456 | 510,983 | 100,399 | 4602 | 3196 | 476 | 522 | 46 |
| three to | 91 | 206 | 651 | 363,750 | 41,499 | 25,572 | 1904 | 985 | 192 |
| four to | 55 | 63 | 90 | 176 | 125,943 | 2,284,611 | 1897 | 5972 | 99 |
| five to | 19 | 31 | 54 | 44 | 97 | 59,322 | 22,705 | 2157 | 353 |
| six to | 12 | 22 | 30 | 62 | 33 | 86 | 27,403 | 538,729 | 7200 |
| seven to | 3 | 6 | 10 | 16 | 13 | 25 | 65 | 15,433 | 1650 |
| eight to | 12 | 5 | 9 | 15 | 20 | 28 | 12 | 41 | 8379 |
| nine to | 8 | 3 | 5 | 3 | 17 | 15 | 5 | 2 | 27 |
| ten to | 18 | 17 | 13 | 14 | 20 | 10 | 17 | 9 | 9 |
Invalid Datesa
| | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|
| January | 55,596b | 7 | 11 | 3 | 11 | 6 | 3 | 5 | 8 |
| February | 30 | 5 | 6 | 2 | 4 | 1 | 5 | 0 | 3 |
| March | 56,701b | 23 | 7 | 12 | 113 | 1 | 12 | 9 | 5 |
| April | 285 | 6 | 8 | 4 | 4 | 0 | 4 | 2 | 8 |
| May | 50,884b | 19 | 9 | 18 | 4 | 4 | 16 | 8 | 11 |
| June | 31 | 273 | 10 | 5 | 6 | 5 | 3 | 5 | 15 |
| July | 59,207b | 9 | 7 | 11 | 7 | 8 | 4 | 1 | 3 |
| August | 57,896b | 5 | 10 | 6 | 8 | 8 | 5 | 5 | 7 |
| September | 257 | 6 | 0 | 5 | 6 | 4 | 1 | 4 | 5 |
| October | 59,150b | 13 | 10 | 4 | 2 | 3 | 5 | 5 | 3 |
| November | 234 | 6 | 2 | 3 | 10 | 7 | 1 | 5 | 3 |
| December | 25,840b | 7 | 10 | 6 | 2 | 3 | 2 | 4 | 3 |
a The cell in the upper right corner would be ‘January 39’. Not included in this table is ‘February 30’, which appeared in 117 documents. Total number of invalid date instances (the invalid cells in this table plus ‘February 30’): 1917
b The 31st days of January, March, May, July, August, October, and December are, of course, valid
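The invalid-date check behind a table like this can be sketched in a few lines of Python. This is only an illustration, not the authors' method: the function name `find_invalid_dates`, the regular expression, and the choice to accept February 29 by default are all assumptions made for the example.

```python
import calendar
import re

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_NUMBER = {name.lower(): i for i, name in enumerate(MONTH_NAMES, start=1)}

# Matches a month name followed by a one- or two-digit day, e.g. "June 31".
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTH_NAMES) + r")\s+(\d{1,2})\b", re.IGNORECASE
)


def find_invalid_dates(text, leap_year=True):
    """Return 'Month day' mentions whose day does not exist in that month."""
    # The year only decides whether February 29 counts as valid (assumption).
    year = 2024 if leap_year else 2023
    invalid = []
    for match in DATE_RE.finditer(text):
        month = MONTH_NUMBER[match.group(1).lower()]
        day = int(match.group(2))
        days_in_month = calendar.monthrange(year, month)[1]
        if day < 1 or day > days_in_month:
            invalid.append(match.group(0))
    return invalid


print(find_invalid_dates("Seen on January 35 and June 31, then July 31."))
```

A pattern this simple will also flag ordinary uses of the word "may" followed by a number, so real notes would need additional context checks.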
Roman Numerals
| I (34,856,243) | II (4,814,592) | III (3,467,400) | IIII (487) | IIIII (62) | IIIIII (5) | IIIIIII (3) | IIIIIIII (2) | IIIIIIIII (1) | |
|---|---|---|---|---|---|---|---|---|---|
| IV | V | VI | VII | VIII | IX | ||||
| X | XI | XII | XIII | XIV | XV | XVI | XVII | XVIII | XIX |
| XX | XXI | XXII | XXIII | XXIV | XXV | XXVI | XXVII | XXVIII | XXIX |
| XXX | XXXI | XXXII | XXXIII | XXXIV | XXXV | XXXVI | XXXVII | XXXVIII | XXXIX |
Medical Categorizationsa
| | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 298,397 | 162,822 | 92,512 | 64,856 | 49,791 | 40,990 | 223,638 | 173,504 | 17,135 | 15,441 |
| 2 | 143,858 | 70,087 | 29,521 | 335,947 | 15,212 | 18,362 | 219,114 | 156,211 | 3232 | 2898 |
| 3 | 66,477 | 27,332 | 24,692 | 314,058 | 14,396 | 14,528 | 55,856 | 147,656 | 1874 | 1714 |
| 4 | 171,463 | 159,144 | 138,104 | 33,191 | 12,352 | 19,792 | 58,001 | 217,040 | 1146 | 1081 |
| 5 | 194,432 | 93,058 | 151,822 | 101,684 | 14,428 | 34,077 | 130,574 | 149,902 | 673 | 946 |
| I | 93,721 | 75,347 | 159,150 | 13,964,384 | 497,302 | 27,699,212 | 39,540 | 45,987 | 4,814,592 | 434,416 |
| II | 56,631 | 43,207 | 4846 | 274 | 372 | 2500 | 53 | 2158 | 3,467,400 | 2 |
| III | 65,347 | 45,687 | 33,381 | 60 | 97 | 9 | 5 | 21 | 487 | 2 |
| IV | 41,830 | 15,552 | 509,947 | 2695 | 40,328 | 909,986 | 576 | 62,302 | 533 | 108 |
| V | 295,868 | 54,862 | 103,848 | 9929 | 158,751 | 106,698 | 9271 | 595,776 | 577,732 | 328 |
a The term in the upper left would be ‘1A’. These are often used in classifying disorders such as Hyperlipoproteinemia Type IIA or Stage 3B Lung Cancer. Note that some of the terms with Roman numerals could be confused with other medical abbreviations (e.g., VA: Veterans Affairs; 1G: 1 g; 3D: three-dimensional; IC: intracardiac; ID: infectious diseases). ‘IF’ is also a common English word (case-sensitive searches were not conducted for this analysis)
Additional Categorization Variationsa
| | 1 | I | 2 | II | 3 | III | 4 | IV | IIII | 5 | V |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | 674,898 | 231,183 | 1,588,852 | 421,332 | 196,961 | 47,794 | 167,557 | 15,068 | 5 | 161,395 | 1673 |
| phase | 88,407 | 39,641 | 125,204 | 53,863 | 36,978 | 8975 | 1750 | 431 | 1 | 28,526 | 61 |
| grade | 639,287 | 184,486 | 426,407 | 155,115 | 221,568 | 94,407 | 55,841 | 30,020 | 23 | 20,740 | 5251 |
| stage | 149,938 | 357,732 | 169,038 | 273,244 | 3,322,767 | 274,993 | 90,336 | 285,535 | 31 | 36,419 | 55,780 |
| class | 72,731 | 298,391 | 94,568 | 173,749 | 112,243 | 128,196 | 27,082 | 36,450 | 26 | 36,759 | 5707 |
| score | 171,243 | 15,607 | 107,100 | 266 | 121,064 | 246 | 100,209 | 133 | 0 | 112,719 | 100 |
a Additional variations in how some categorizations in medicine are represented with either Arabic or Roman numerals. The cell in the upper right-hand corner represents ‘type V’, whereas the lower left is ‘score 1’
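Because the same category can be written with either numeral system, a search can be widened by generating the counterpart phrase automatically. The sketch below is a hypothetical helper (`expand_query` is not from the paper) that covers only the six category words in this table and the values 1 to 5 / I to V; the non-standard form ‘IIII’ is deliberately left out.

```python
import re

ARABIC_TO_ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}
ROMAN_TO_ARABIC = {roman: arabic for arabic, roman in ARABIC_TO_ROMAN.items()}
CATEGORY_WORDS = ("type", "phase", "grade", "stage", "class", "score")

# A category word followed by an Arabic (1-5) or Roman (I-V) numeral.
PATTERN = re.compile(
    r"\b(" + "|".join(CATEGORY_WORDS) + r")\s+(III|II|IV|I|V|[1-5])\b",
    re.IGNORECASE,
)


def expand_query(phrase):
    """Return the phrase plus its Arabic/Roman counterpart, if one exists."""

    def swap(match):
        word, numeral = match.group(1), match.group(2)
        other = ARABIC_TO_ROMAN.get(numeral) or ROMAN_TO_ARABIC[numeral.upper()]
        return f"{word} {other}"

    swapped = PATTERN.sub(swap, phrase)
    return [phrase] if swapped == phrase else [phrase, swapped]


print(expand_query("type 2 diabetes mellitus"))
print(expand_query("Tanner Stage III"))
```

Suffixed forms such as ‘stage IIIB’ are not rewritten by this pattern, since the trailing letter prevents a whole-token match.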
Diabetes Terminology Variations
| Phrase | n |
|---|---|
| Type I diabetes | 41,007 |
| Type II diabetes | 109,739 |
| Type III diabetes | 6 |
| Type IV diabetes | 8 |
| TIDM | 607 |
| TIIDM | 992 |
| Type III DM | 2 |
| Type IV DM | 1 |
| T1DM | 12,725 |
| T2DM | 70,314 |
| T21DM | 5 |
| T12DM | 2 |
| Type 1 diabetes | 271,541 |
| Type 2 diabetes | 871,228 |
| Type 21 diabetes | 4 |
| Type 12 diabetes | 2 |
| DM1 | 17,166 |
| DM 1 | 7238 |
| DM2 | 167,534 |
| DM 2 | 25,407 |
| DMI | 79,253 |
| DM I | 8317 |
| DMII | 56,942 |
| DM II | 44,983 |
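One way to fold the common variants above into a single concept is a pattern that recognizes the spelled-out and abbreviated forms and maps each to a diabetes type. This is a minimal sketch under stated assumptions: the names `DIABETES_RE` and `diabetes_types` are made up for the example, and rare forms such as ‘Type III diabetes’ or ‘T21DM’ are intentionally left unmatched.

```python
import re

# Spelled-out phrases ("type II diabetes", "type 1 DM") and abbreviations
# ("T2DM", "TIDM", "DM1", "DM II"); only types 1/I and 2/II are recognized.
DIABETES_RE = re.compile(
    r"\b(?:type\s+(?P<phrase>1|2|II?)\s+(?:diabetes|DM)"
    r"|T(?P<t_abbrev>1|2|II?)DM"
    r"|DM\s?(?P<dm_abbrev>1|2|II?))\b",
    re.IGNORECASE,
)
NUMERAL_TO_TYPE = {"1": 1, "2": 2, "I": 1, "II": 2}


def diabetes_types(text):
    """Return the diabetes type (1 or 2) for each recognized mention, in order."""
    types = []
    for m in DIABETES_RE.finditer(text):
        numeral = m.group("phrase") or m.group("t_abbrev") or m.group("dm_abbrev")
        types.append(NUMERAL_TO_TYPE[numeral.upper()])
    return types


print(diabetes_types("Hx of DMII; sister has T1DM, mother has type II diabetes"))
```

Short abbreviations such as ‘DMI’ may carry other meanings in clinical text, so matches from this pattern would still need review before being counted.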
Biologically Implausible Ages
| Phrase | n |
|---|---|
| 123 year old | 3 |
| 124 year old | 1 |
| 125 year old | 22 |
| 126 year old | 2 |
| 127 year old | 4 |
| 128 year old | 2 |
| 129 year old | 2 |
| 130 year old | 55 |
| 131 year old | 1 |
| 132 year old | 2 |
| 133 year old | 2 |
| 134 year old | 3 |
| 135 year old | 4 |
| 136 year old | 2 |
| 137 year old | 29 |
| 138 year old | 4 |
| 139 year old | 1 |
| 140 year old | 29 |
| 150 year old | 128 |
| 160 year old | 13 |
| 170 year old | 3 |
| 180 year old | 5 |
| 190 year old | 3 |
| 200 year old | 23 |
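Mentions like these can be flagged with a simple pattern plus a plausibility threshold. The sketch below is illustrative only: the function name and regular expression are assumptions, and the cutoff of 122 is chosen here simply because the table above starts at 123.

```python
import re

# "135 year old", "67-year-old", "200 Year Old", and similar spellings.
AGE_RE = re.compile(r"\b(\d{1,3})[\s-]+year[\s-]+old\b", re.IGNORECASE)

MAX_PLAUSIBLE_AGE = 122  # assumption: the table above begins at 123


def implausible_ages(text, max_age=MAX_PLAUSIBLE_AGE):
    """Return every stated age in the text that exceeds max_age."""
    ages = (int(m.group(1)) for m in AGE_RE.finditer(text))
    return [age for age in ages if age > max_age]


print(implausible_ages("A 135 year old male and a 67-year-old female."))
```

The pattern does not check what the age refers to, so it cannot tell a typo in a patient's age from a description of something else that happens to be that old.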
Age Groups by Decade
| Phrase | n |
|---|---|
| quinquagenarian | 0 |
| sexagenarian | 1 |
| septuagenarian | 112 |
| octogenarian | 239 |
| nonagenarian | 45 |
| centenarian | 16 |
| supercentenarian | 0 |
Ordering and Rankinga
| | st | nd | rd | th |
|---|---|---|---|---|
| 1 | 862,447b | 79 | 7 | 299 |
| 2 | 282 | 801,375b | 360 | 270 |
| 3 | 27 | 617 | 626,822b | 694 |
| 4 | 17 | 46 | 432 | 442,238b |
| 5 | 16 | 16 | 54 | 481,412b |
a Ways in which ordering and ranking are described. As an example, the cell in the upper right corner is the term ‘1th’
b Cells containing valid expressions
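Mismatched suffixes such as ‘1th’ or ‘2rd’ can be normalized by recomputing the suffix from the number itself. The following is a minimal sketch using the standard English ordinal rule; the function names are hypothetical and nothing here is taken from the paper's own tooling.

```python
import re

ORDINAL_RE = re.compile(r"\b(\d+)(st|nd|rd|th)\b", re.IGNORECASE)


def expected_suffix(n):
    """Return the standard English ordinal suffix for a non-negative integer."""
    if 10 <= n % 100 <= 20:  # 11th, 12th, 13th, 111th, ...
        return "th"
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def normalize_ordinals(text):
    """Rewrite every digit+suffix token so that it carries the expected suffix."""
    return ORDINAL_RE.sub(
        lambda m: m.group(1) + expected_suffix(int(m.group(1))), text
    )


print(normalize_ordinals("1th visit, 2rd dose, 3rd day"))
```

Normalizing at indexing or query time lets a search for ‘1st’ also retrieve the small number of notes that contain ‘1nd’, ‘1rd’, or ‘1th’.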
Very Large and Small Quantities
| Phrase | n |
|---|---|
| minus infinity | 0 |
| negative infinity | 2 |
| hundred | 17,760 |
| hundreds | 9215 |
| thousand | 14,917 |
| thousands | 6401 |
| hundred thousand | 146 |
| million | 75,013 |
| millions | 1179 |
| billion | 46,081 |
| billions | 381 |
| trillion | 51 |
| trillions | 27 |
| quadrillion | 2 |
| quadrillions | 1 |
| octillion | 3 |
| nonillion | 2 |
| undecillion | 1 |
| googolplex | 0 |
| googol | 0 |
| infinity | 6325 |
Imprecise and Informal Expressions of Quantity
| Phrase | n |
|---|---|
| couple of | 1,673,735 |
| lots of | 328,506 |
| not much | 113,336 |
| few of | 35,803 |
| small number of | 12,358 |
| hundreds of | 7371 |
| all kinds of | 6940 |
| thousands of | 4611 |
| tons of | 3018 |
| too many to count | 1346 |
| massive amounts of | 1187 |
| very small number of | 1104 |
| far more than | 971 |
| way more than | 820 |
| very large number of | 623 |
| millions of | 561 |
| way too many | 364 |
| huge number of | 260 |
| gobs of | 199 |
| vanishingly small | 179 |
| uncountable | 133 |
| hell of a lot | 69 |
| lion’s share of | 67 |
| vast quantities of | 48 |
| waist deep in | 24 |
| infinitesimally small | 23 |
| tiny number of | 19 |
| infinitely more | 17 |
| miniscule amounts of | 14 |
| gazillion | 12 |
| crap load of | 8 |
| shit load | 7 |
| up the wazoo | 6 |
| infinitely small | 6 |
| bazillion | 5 |
| infinitely less | 3 |
| infinitely large | 3 |
| butt load | 3 |
| boat loads of | 3 |
| buttload | 1 |
Additional Ways in Which Ordering and Ranking are Described
| first | firstly | 1stly | primary | 1ary |
|---|---|---|---|---|
| second | secondly | 2ndly | secondary | 2ndary |
| third | thirdly | 3rdly | tertiary | 3rdary |
| fourth | fourthly | 4thly | quaternary | |
| fifth | fifthly | 5thly | quinary | |
| sixth | sixthly | 6thly | senary | |
| seventh | seventhly | 7thly | septenary | |
| hundredth (40) | ||||
| thousandth | unary | 2ary | ||
| millionth | binary | 3ary | ||
| billionth | ternary | 4ary |
Tuples
| singling | singled | singles | single | singleton |
|---|---|---|---|---|
| doubling | doubled | doubles | double | twins |
| tripling | tripled | triples | triple | triplets |
| quadrupling | quadrupled | quadruples | quadruple | quadruplets |
| quintupling | quintupled | quintuples | quintuple | quintuplets (122) |
| sextupling | sextupled | sextuples | sextuple | sextuplets |
| septupling | septupled | septuples | septuple | septuplets |
| octupling (0) | octupled (0) | octuples (0) | octuple (1) | octuplets |
Results from a Cohort Identification Experimenta
| (a) | (b) | (c) | (d) | (e) | (f) | (g) |
|---|---|---|---|---|---|---|
| Phrase 1 (containing the Arabic numerical variant) | Number of patients with Phrase 1 only | % of patients missed by searching for Phrase 1 only | Number of patients with both Phrase 1 and Phrase 2 | Number of patients with Phrase 2 only | % of patients missed by searching for Phrase 2 only | Phrase 2 (containing the Roman numerical variant) |
| citrullinemia type 1 | 2 | 25.0 | 1 | 1 | 50.0 | citrullinemia type I |
| type 2 diabetes mellitus | 43,777 | 10.5 | 7919 | 6053 | 75.8b | type II diabetes mellitus |
| type 1 neurofibromatosis | 181 | 24.5 | 56 | 77 | 57.6b | type I neurofibromatosis |
| Tanner Stage 3 | 7639 | 57.8b | 1373 | 12,367 | 35.7 | Tanner Stage III |
| grade 3 anaplastic astrocytoma | 42 | 36.7 | 27 | 40 | 38.5 | grade III anaplastic astrocytoma |
| stage 3 chronic kidney disease | 615 | 67.4b | 446 | 2190 | 18.9 | stage III chronic kidney disease |
| factor 9 deficiency | 14 | 68.1b | 51 | 139 | 6.9 | factor IX deficiency |
| class 3 malocclusion | 135 | 81.2b | 115 | 1079 | 10.2 | class III malocclusion |
| phase 1 clinical trial | 320 | 66.5b | 263 | 1158 | 18.4 | phase I clinical trial |
| Mallampati score: 4 | 121 | 27.8 | 1 | 47 | 71.6b | Mallampati score: IV |
a Results from a cohort identification exercise for 10 diagnoses and clinical findings in the clinical notes, including counts of the number of patients identified by searching for phrases containing either the Arabic or Roman numeral variants, or both. The percentage of patients potentially missed by searching for only one of the variants is displayed
b Cells with percentages > 50%
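The percentages in columns (c) and (f) can be reproduced from the three patient counts in each row. Under the reading that column (c) is the share of patients found only by Phrase 2 (and so missed by a Phrase 1 search), and column (f) the share found only by Phrase 1, the calculation is as sketched below; the function name `percent_missed` is an assumption for the example.

```python
def percent_missed(phrase1_only, both, phrase2_only):
    """Return (% missed searching Phrase 1 only, % missed searching Phrase 2 only).

    The denominator is every patient found by either phrase.
    """
    total = phrase1_only + both + phrase2_only
    missed_with_phrase1 = 100 * phrase2_only / total  # column (c)
    missed_with_phrase2 = 100 * phrase1_only / total  # column (f)
    return round(missed_with_phrase1, 1), round(missed_with_phrase2, 1)


# Row "type 2 diabetes mellitus": 43,777 / 7919 / 6053 patients.
print(percent_missed(43777, 7919, 6053))
```

For the diabetes row this gives 10.5 and 75.8, matching the table: searching only for ‘type II diabetes mellitus’ would miss roughly three quarters of the patients found by either phrase.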