Abstract
BACKGROUND: Test collections for information retrieval are scarce, and domain-specific test collections even more so. Before the MedEval test collection was built, no medical test collection existed for the Swedish language. Most research in information retrieval has been performed on English, so most test collections contain English documents. However, English is morphologically poor compared to many other European languages, and a number of interesting and important aspects have therefore not been investigated. Building a medical test collection in Swedish opens new research opportunities.
Year: 2011 PMID: 21992659 PMCID: PMC3194176 DOI: 10.1186/2041-1480-2-S3-S4
Source DB: PubMed Journal: J Biomed Semantics
The genres of the documents in the MedEval document collection
| Type of source | Number of documents | Percent of documents | Number of tokens | Percent of tokens |
|---|---|---|---|---|
| Journals and periodicals | 8,453 | 20.0 | 5.3 million | 34.6 |
| Specialized sites | 14,631 | 34.6 | 2.9 million | 19.1 |
| Pharmaceutical companies | 9,200 | 21.8 | 2.3 million | 14.8 |
| Government, faculties, institutes, and hospitals | 2,955 | 7.0 | 2.0 million | 13.3 |
| Health-care communication companies | 4,036 | 9.6 | 1.7 million | 11.3 |
| Media (TV, daily newspapers) | 2,980 | 7.1 | 1.0 million | 6.9 |
| Total | 42,255 | 100.1 | 15.2 million | 100 |
The genres and sizes of the MedEval document sources. The MedEval document collection is a snapshot of the MedLex corpus in October 2007. (D. Kokkinakis, p.c.)
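The percentage column in the table above sums to 100.1 rather than 100. A minimal sketch, assuming each share was rounded to one decimal independently, reproduces that total from the document counts; the variable names are illustrative only.

```python
# Document counts per source type, copied from the table above.
doc_counts = {
    "Journals and periodicals": 8453,
    "Specialized sites": 14631,
    "Pharmaceutical companies": 9200,
    "Government, faculties, institutes, and hospitals": 2955,
    "Health-care communication companies": 4036,
    "Media (TV, daily newspapers)": 2980,
}

total = sum(doc_counts.values())

# Assumption: each percentage is rounded to one decimal on its own,
# so the rounded shares need not add up to exactly 100.
shares = {name: round(100 * n / total, 1) for name, n in doc_counts.items()}
rounded_sum = round(sum(shares.values()), 1)

for name, share in shares.items():
    print(f"{name}: {share}")
print("Total documents:", total)
print("Sum of rounded shares:", rounded_sum)
```

Under that assumption the surplus of 0.1 is a rounding effect, not a counting error.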
Type and token frequencies of terms
| | Entire collection | Assessed documents | Doctors assessed | Patients assessed | Common files | Doctors relevant | Patients relevant |
|---|---|---|---|---|---|---|---|
| Number of documents | 42,250 | 7,044 | 3,272 | 4,334 | 562 | 1,233 | 1,654 |
| Tokens | 12,991,157 | 5,034,323 | 3,232,772 | 2,431,160 | 629,609 | 1,361,700 | 988,236 |
| Tokens/document | 307 | 715 | 988 | 561 | 1,120 | 1,104 | 596 |
| Average word length | 5.75 | 6.04 | 6.29 | 5.73 | 6.16 | 6.33 | 5.63 |
| Full form types | 334,559 | 181,354 | 154,901 | 92,803 | 50,961 | 87,814 | 43,825 |
| Lemma types | 267,892 | 146,631 | 126,217 | 73,121 | 40,857 | 71,974 | 34,263 |
| Lemma type token ratio | 48.5 | 34.3 | 25.6 | 33.2 | 15.4 | 18.9 | 28.8 |
| Compound tokens | 1,273,874 | 573,625 | 412,475 | 237,267 | 76,117 | 179,580 | 92,420 |
| Full form compound types | 187,904 | 99,614 | 83,846 | 47,387 | 24,083 | 45,257 | 20,157 |
| Lemma compound types | 144,159 | 78,508 | 66,907 | 37,151 | 19,685 | 36,867 | 16,006 |
| Ratio of compounds | 0.098 | 0.114 | 0.128 | 0.098 | 0.120 | 0.132 | 0.094 |
Statistics for different categories of terms in different subsets of documents in the MedEval test collection.
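Several rows of the table above appear to be derived from the raw counts. The sketch below recomputes three of them for two columns. It assumes that the "lemma type token ratio" is tokens divided by lemma types and that the "ratio of compounds" is compound tokens divided by tokens; both readings are inferred from the numbers, not stated in the source.

```python
# Raw counts copied from the table above:
# (documents, tokens, lemma types, compound tokens)
columns = {
    "Entire collection": (42250, 12991157, 267892, 1273874),
    "Assessed documents": (7044, 5034323, 146631, 573625),
}


def derived_rows(documents, tokens, lemma_types, compound_tokens):
    """Recompute the derived rows under the assumptions named above."""
    return {
        "tokens_per_document": round(tokens / documents),
        "lemma_type_token_ratio": round(tokens / lemma_types, 1),
        "ratio_of_compounds": round(compound_tokens / tokens, 3),
    }


stats = {name: derived_rows(*counts) for name, counts in columns.items()}

for name, rows in stats.items():
    print(name, rows)
```

For both columns the recomputed values agree with the table, which supports the assumed definitions without confirming them.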
Figure 1. Sample of information need. An example of an information need, Topic 51, with ID, title, description, and narrative. The information need is first given in Swedish, as in the collection, and then in an English translation.
Frequencies of adjectives
| Term | Equivalent | Doctor documents: non-neuter singular indefinite | Doctor documents: plural and/or definite | Patient documents: non-neuter singular indefinite | Patient documents: plural and/or definite |
|---|---|---|---|---|---|
| sjuk | sick | 165 | 462 | 333 | 371 |
| smittad | infected | 115 | 501 | 332 | 320 |
| fet | fat | 67 | 137 | 219 | 193 |
| tjock | thick/fat | 59 | 15 | 152 | 28 |
| smal | thin | 22 | 21 | 41 | 25 |
| gravid | pregnant | 78 | 471 | 651 | 402 |
| allergisk | allergic | 364 | 210 | 432 | 282 |
| överkänslig | hypersensitive | 15 | 10 | 72 | 15 |
| deprimerad | depressed | 20 | 89 | 79 | 42 |
Adjectives in non-expert documents have a stronger tendency to be in the singular indefinite form than adjectives in the expert documents. This corresponds to the patient documents having a tendency of being interactive in their approach while doctor documents often describe generic cases.
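The tendency described above can be checked against the table by computing, for each adjective, the share of occurrences in the non-neuter singular indefinite form in each document group. This is a minimal sketch; the share measure is an illustrative choice, not necessarily the comparison the author made.

```python
# Counts copied from the table above, per term:
# (doctor singular indefinite, doctor plural/definite,
#  patient singular indefinite, patient plural/definite)
adjective_counts = {
    "sjuk": (165, 462, 333, 371),
    "smittad": (115, 501, 332, 320),
    "fet": (67, 137, 219, 193),
    "tjock": (59, 15, 152, 28),
    "smal": (22, 21, 41, 25),
    "gravid": (78, 471, 651, 402),
    "allergisk": (364, 210, 432, 282),
    "överkänslig": (15, 10, 72, 15),
    "deprimerad": (20, 89, 79, 42),
}


def singular_share(singular_indefinite, plural_or_definite):
    """Fraction of occurrences that are singular indefinite."""
    return singular_indefinite / (singular_indefinite + plural_or_definite)


form_shares = {}
for term, (doc_sg, doc_other, pat_sg, pat_other) in adjective_counts.items():
    form_shares[term] = (
        singular_share(doc_sg, doc_other),
        singular_share(pat_sg, pat_other),
    )

# Terms whose singular indefinite share is higher in patient documents.
higher_in_patient = [t for t, (d, p) in form_shares.items() if p > d]

for term, (doc_share, pat_share) in form_shares.items():
    print(f"{term}: doctors {doc_share:.3f}, patients {pat_share:.3f}")
```

With these counts the patient share is higher for eight of the nine adjectives; allergisk is the one exception.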
Figure 2. Recall bases for three topics and three scenarios. The recall bases of Topics 28, 36, and 92 represented as ideal cumulated gain for the three scenarios: None, Doctors, and Patients. For Topic 28, most of the highly relevant and fairly relevant documents were assessed to have target group Doctors. Topic 36 had its relevant documents spread fairly evenly between the Doctors and Patients target groups. Topic 92 had no documents of any relevance grade marked for target group Doctors. Thus the None and the Patients ideal gain vectors coincide fully, while the cumulated gain for the Doctors scenario is very low and originates entirely from the downgraded patient documents.
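The Topic 92 situation in the caption above can be illustrated with a small sketch of ideal cumulated gain. Everything specific here is an assumption made for illustration: the judgments are invented, and the downgrading rule (a document marked for the other target group loses one relevance grade) is a hypothetical stand-in for whatever rule the collection actually applies.

```python
from itertools import accumulate


def scenario_grades(judgments, scenario):
    """Adjust relevance grades for a user scenario.

    Hypothetical rule for this sketch: a document whose target group
    differs from the scenario loses one grade, never dropping below 0.
    With scenario=None the grades are used unchanged.
    """
    adjusted = []
    for grade, target_group in judgments:
        if scenario is not None and target_group != scenario:
            grade = max(grade - 1, 0)
        adjusted.append(grade)
    return adjusted


def ideal_cumulated_gain(grades):
    """Sort grades from highest to lowest and take the running sum."""
    return list(accumulate(sorted(grades, reverse=True)))


# Invented judgments shaped like Topic 92: every relevant document
# is marked for the Patients target group.
judgments = [(3, "Patients"), (2, "Patients"), (2, "Patients"), (1, "Patients")]

icg_none = ideal_cumulated_gain(scenario_grades(judgments, None))
icg_patients = ideal_cumulated_gain(scenario_grades(judgments, "Patients"))
icg_doctors = ideal_cumulated_gain(scenario_grades(judgments, "Doctors"))

print("None:    ", icg_none)
print("Patients:", icg_patients)
print("Doctors: ", icg_doctors)
```

In this toy case the None and Patients vectors are identical, and the Doctors vector is lower at every rank because all of its gain comes from downgraded documents.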
Runs for Topic 51
| Effectiveness | anemi ‘anemia’ | blodbrist ‘anemia’ | Both | |
|---|---|---|---|---|
| Recall@10 | 50% (4/8) | 0% (0/8) | 0% (0/8) | |
| Recall@20 | 87% (7/8) | 0% (0/8) | 0% (0/8) | |
| Recall@100 | 100% (8/8) | 0% (0/8) | 100% (8/8) | |
| nDCG@100 | 0.77 | 0.25 | 0.48 | |
| Recall@10 | 28% (5/18) | 33% (6/18) | 33% (6/18) | |
| Recall@20 | 39% (7/18) | 39% (7/18) | 50% (9/18) | |
| Recall@100 | 72% (13/18) | 56% (10/18) | 89% (16/18) | |
| nDCG@100 | 0.60 | 0.61 | 0.76 | |
Varför kan en patient med cancer drabbas av anemi?
Why may a patient with cancer contract anemia?
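The Recall@k entries in the run tables have the form hits divided by the size of the recall base, and nDCG normalizes discounted cumulated gain against an ideal ranking. The sketch below shows one common way to compute both. The ranking and grades are invented, and the log2(rank + 1) discount is an assumed formulation that may differ from the one used to produce the table values.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the recall base found among the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def dcg(gains):
    """Discounted cumulated gain with an assumed log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, grades, k):
    """DCG of the run divided by the DCG of the ideal ordering."""
    gains = [grades.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Invented example: eight relevant documents (d1..d8) with graded
# relevance, and a run that mixes them with non-relevant ones (n1..n6).
grades = {"d1": 3, "d2": 3, "d3": 2, "d4": 2, "d5": 1, "d6": 1, "d7": 1, "d8": 1}
run = ["d1", "n1", "d2", "n2", "n3", "d3", "n4", "n5", "d4", "n6", "d5", "d6"]

recall_10 = recall_at_k(run, set(grades), 10)
ndcg_100 = ndcg_at_k(run, grades, 100)

print(f"Recall@10: {recall_10:.0%} ({round(recall_10 * len(grades))}/{len(grades)})")
print(f"nDCG@100:  {ndcg_100:.2f}")
```

Because nDCG rewards placing highly relevant documents early, two runs with the same recall can still differ in nDCG, which is why the tables report both.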
Runs for Topic 66
| Effectiveness | anafylaxi ‘anaphylaxis’ | allergisk chock ‘allergic shock’ | Both | |
|---|---|---|---|---|
| Recall@10 | 43% (3/7) | 0% (0/7) | 29% (2/7) | |
| Recall@20 | 57% (4/7) | 0% (0/7) | 43% (3/7) | |
| Recall@100 | 57% (4/7) | 0% (0/7) | 57% (4/7) | |
| nDCG@100 | 0.66 | 0.03 | 0.53 | |
| Recall@10 | 67% (2/3) | 0% (0/3) | 33% (1/3) | |
| Recall@20 | 67% (2/3) | 33% (1/3) | 100% (3/3) | |
| Recall@100 | 67% (2/3) | 33% (1/3) | 100% (3/3) | |
| nDCG@100 | 0.50 | 0.19 | 0.55 | |
Hur behandlas anafylaxi till följd av allergi?
How is anaphylaxis due to allergy treated?
Runs for Topic 63
| Effectiveness | ventrikel ‘stomach’ | magsäck ‘stomach’ | Both | |
|---|---|---|---|---|
| Recall@10 | 50% (2/4) | 0% (0/4) | 50% (2/4) | |
| Recall@20 | 50% (2/4) | 0% (0/4) | 50% (2/4) | |
| Recall@100 | 50% (2/4) | 50% (2/4) | 50% (2/4) | |
| nDCG@100 | 0.30 | 0.39 | 0.45 | |
| Recall@10 | 0% (0/6) | 50% (3/6) | 0% (0/6) | |
| Recall@20 | 0% (0/6) | 67% (4/6) | 17% (1/6) | |
| Recall@100 | 0% (0/6) | 83% (5/6) | 67% (4/6) | |
| nDCG@100 | 0.12 | 0.55 | 0.35 | |
Vilka tekniker och redskap används vid biopsi av magsäck vid cancermisstanke?
What techniques and equipment are used when performing biopsy of the stomach suspecting cancer?
Runs for Topic 48
| Effectiveness | esofagus ‘esophagus’ | matstrupe ‘esophagus’ | Both | |
|---|---|---|---|---|
| Recall@10 | 12% (2/16) | 0% (0/16) | 12% (2/16) | |
| Recall@20 | 25% (4/16) | 0% (0/16) | 19% (3/16) | |
| Recall@100 | 50% (8/16) | 19% (3/16) | 56% (9/16) | |
| nDCG@100 | 0.30 | 0.13 | 0.46 | |
| Recall@10 | 0% (0/7) | 0% (0/7) | 29% (2/7) | |
| Recall@20 | 29% (2/7) | 14% (1/7) | 29% (2/7) | |
| Recall@100 | 29% (2/7) | 57% (4/7) | 57% (4/7) | |
| nDCG@100 | 0.23 | 0.23 | 0.53 | |
Topic 48. Vad är prognosen vid olika typer av cancer i matstrupen?
What is the prognosis of various types of cancer of the esophagus?
Runs for Topic 7
| Effectiveness | cytostatika ‘chemotherapy’ | cellgift ‘chemo’ | Both | |
|---|---|---|---|---|
| Recall@10 | 19% (5/27) | 15% (4/27) | 7% (2/27) | |
| Recall@20 | 30% (8/27) | 19% (5/27) | 7% (2/27) | |
| Recall@100 | 52% (14/27) | 33% (9/27) | 37% (10/27) | |
| nDCG@100 | 0.54 | 0.28 | 0.28 | |
| Recall@10 | 17% (8/47) | 6% (3/47) | 4% (2/47) | |
| Recall@20 | 23% (11/47) | 11% (5/47) | 13% (6/47) | |
| Recall@100 | 70% (33/47) | 15% (7/47) | 30% (14/47) | |
| nDCG@100 | 0.60 | 0.29 | 0.33 | |
Vilka biverkningar kan man räkna med vid behandling av cancer med cellgift?
Which side effects can one expect when treating cancer with chemotherapy?
Runs for Topic 83
| Effectiveness | synkope ‘syncope’ | svimning ‘fainting’ | Both | |
|---|---|---|---|---|
| Recall@10 | 43% (3/7) | 43% (3/7) | 43% (3/7) | |
| Recall@20 | 43% (3/7) | 43% (3/7) | 57% (4/7) | |
| Recall@100 | 43% (3/7) | 57% (4/7) | 57% (4/7) | |
| nDCG@100 | 0.39 | 0.47 | 0.57 | |
| Recall@10 | 20% (2/10) | 50% (5/10) | 50% (5/10) | |
| Recall@20 | 20% (2/10) | 50% (5/10) | 60% (6/10) | |
| Recall@100 | 20% (2/10) | 60% (6/10) | 60% (6/10) | |
| nDCG@100 | 0.19 | 0.53 | 0.48 | |
Vilka är de bakomliggande orsakerna till synkope och hur behandlar man det?
What are the underlying causes of syncope and how is it treated?
Runs for Topic 68
| Effectiveness | trombos ‘thrombosis’ | blodpropp ‘blood clot’ | Both | |
|---|---|---|---|---|
| Recall@10 | 18% (6/34) | 6% (2/34) | 9% (3/34) | |
| Recall@20 | 21% (7/34) | 12% (4/34) | 15% (5/34) | |
| Recall@100 | 56% (19/34) | 29% (10/34) | 68% (23/34) | |
| nDCG@100 | 0.51 | 0.33 | 0.48 | |
| Recall@10 | 18% (3/17) | 24% (4/17) | 18% (3/17) | |
| Recall@20 | 24% (4/17) | 41% (7/17) | 24% (4/17) | |
| Recall@100 | 35% (6/17) | 65% (11/17) | 82% (14/17) | |
| nDCG@100 | 0.37 | 0.62 | 0.56 | |
Vilka symtom associeras med DVT, djup ventrombos, och hur ser behandlingen ut?
Which are the symptoms associated with DVT, deep venous thrombosis, and what does the treatment look like?