Aaron M Cohen, William R Hersh.
Abstract
BACKGROUND: The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories.
Year: 2006 PMID: 16722582 PMCID: PMC1440303 DOI: 10.1186/1747-5333-1-4
Source DB: PubMed Journal: J Biomed Discov Collab ISSN: 1747-5333
Number of papers total and available in the mouse, mus, or murine subset.
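| Journal | 2002 papers (total, subset) | 2003 papers (total, subset) | Both years (total, subset) |
| JBC | 6566, 4199 | 6593, 4282 | 13159, 8481 |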
| JCB | 530, 256 | 715, 359 | 1245, 615 |
| PNAS | 3041, 1382 | 2888, 1402 | 5929, 2784 |
| Total papers | 10137, 5837 | 10196, 6043 | 20333, 11880 |
Figure 1. Document grouping. Grouping of documents for categorization subtasks.
Data set positive and negative sample counts.
| Data set | Positive examples | Negative examples | Total |
| Training (year 2002) | 375 | 5462 | 5837 |
| Test (year 2003) | 420 | 5623 | 6043 |
Boundary cases for utility measure of triage task for training and test data.
| Situation | Unorm (training) | Unorm (test) |
| Completely perfect prediction | 1.0 | 1.0 |
| Triage everything | 0.27 | 0.33 |
| Triage nothing | 0 | 0 |
| Completely imperfect prediction | -0.73 | -0.67 |
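The boundary values above can be reproduced from the positive and negative sample counts. The following is a minimal sketch, assuming the normalized utility is the raw utility `u_r * TP + u_nr * FP` divided by the best achievable utility `u_r * positives`, with weights `u_r = 20` and `u_nr = -1`. These weights are not stated in this excerpt; they are assumed here because they reproduce the table. The function names are illustrative.

```python
def normalized_utility(tp, fp, n_pos, u_r=20.0, u_nr=-1.0):
    """Raw utility divided by the maximum utility (all positives found, no false positives).

    u_r and u_nr are assumed weights, not values quoted in this excerpt.
    """
    return (u_r * tp + u_nr * fp) / (u_r * n_pos)


# (positives, negatives) taken from the data set counts table
DATASETS = {"training": (375, 5462), "test": (420, 5623)}


def boundary_cases(n_pos, n_neg):
    """Normalized utility for the four boundary situations in the table."""
    return {
        "perfect": normalized_utility(n_pos, 0, n_pos),
        "triage everything": normalized_utility(n_pos, n_neg, n_pos),
        "triage nothing": normalized_utility(0, 0, n_pos),
        "completely imperfect": normalized_utility(0, n_neg, n_pos),
    }


if __name__ == "__main__":
    for name, (n_pos, n_neg) in DATASETS.items():
        for case, value in boundary_cases(n_pos, n_neg).items():
            print(f"{name:8s} {case:22s} {value:6.2f}")
```

Because the negative class is much larger than the positive class, triaging everything still yields a positive utility under these weights, which is why that row serves as a useful baseline.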
Data file contents and counts for annotation hierarchy subtasks.
| File contents | Training data count | Test data count |
| Documents – PMIDs | 504 | 378 |
| Genes – Gene symbol, MGI identifier, and gene name for all used | 1294 | 777 |
| Document gene pairs – PMID-gene pairs | 1418 | 877 |
| Positive examples – PMIDs | 178 | 149 |
| Positive examples – PMID-gene pairs | 346 | 295 |
| Positive examples – PMID-gene-domain tuples | 589 | 495 |
| Positive examples – PMID-gene-domain-evidence tuples | 640 | 522 |
| Positive examples – all PMID-gene-GO-evidence tuples | 872 | 693 |
| Negative examples – PMIDs | 326 | 229 |
| Negative examples – PMID-gene pairs | 1072 | 582 |
Example required submission format for each task.
| Task | Format | Example |
| Triage | <TASK> <PMID> <TAG> | triage 12213961 OHSU_TR |
| Annotation hierarchy | <TASK> <PMID> <GENE> <HIERARCHY> <TAG> | annhi 12213961 Stat4 BP OHSU_AH |
| Annotation hierarchy plus evidence | <TASK> <PMID> <GENE> <HIERARCHY> <EVIDENCE CODE> <TAG> | annhiev 12213961 Stat4 BP IDA OHSU_AHPE |
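The three submission formats differ only in the number of fields, so a single parser keyed on the task name can validate all of them. The sketch below is illustrative only: it assumes fields are separated by whitespace and that gene symbols contain no spaces, neither of which is stated in the table. The names `FIELDS` and `parse_submission_line` are hypothetical.

```python
# Field layout per task, taken from the format table.
# Whitespace-separated fields are an assumption, not part of the source.
FIELDS = {
    "triage": ("task", "pmid", "tag"),
    "annhi": ("task", "pmid", "gene", "hierarchy", "tag"),
    "annhiev": ("task", "pmid", "gene", "hierarchy", "evidence_code", "tag"),
}


def parse_submission_line(line):
    """Split one submission line into a dict of named fields."""
    parts = line.split()
    if not parts:
        raise ValueError("empty submission line")
    task = parts[0]
    if task not in FIELDS:
        raise ValueError(f"unknown task: {task!r}")
    names = FIELDS[task]
    if len(parts) != len(names):
        raise ValueError(
            f"{task} expects {len(names)} fields, got {len(parts)}"
        )
    return dict(zip(names, parts))


if __name__ == "__main__":
    for example in (
        "triage 12213961 OHSU_TR",
        "annhi 12213961 Stat4 BP OHSU_AH",
        "annhiev 12213961 Stat4 BP IDA OHSU_AHPE",
    ):
        print(parse_submission_line(example))
```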
Triage subtask runs, sorted by utility.
| Run | Group (reference) | Precision | Recall | F-score | Unorm |
| dimacsTfl9d | rutgers.dayanik [3] | 0.1579 | 0.8881 | 0.2681 | 0.6512 |
| dimacsTl9mhg | rutgers.dayanik [3] | 0.1514 | 0.8952 | 0.259 | 0.6443 |
| dimacsTfl9w | rutgers.dayanik [3] | 0.1553 | 0.8833 | 0.2642 | 0.6431 |
| dimacsTl9md | rutgers.dayanik [3] | 0.173 | 0.7952 | 0.2841 | 0.6051 |
| pllsgen4t3 | patolis.fujita [4] | 0.149 | 0.769 | 0.2496 | 0.5494 |
| pllsgen4t4 | patolis.fujita [4] | 0.1259 | 0.831 | 0.2186 | 0.5424 |
| pllsgen4t2 | patolis.fujita [4] | 0.1618 | 0.7238 | 0.2645 | 0.5363 |
| pllsgen4t5 | patolis.fujita [4] | 0.174 | 0.6976 | 0.2785 | 0.532 |
| pllsgen4t1 | patolis.fujita [4] | 0.1694 | 0.7024 | 0.273 | 0.5302 |
| GUCwdply2000 | german.u.cairo [11] | 0.151 | 0.719 | 0.2496 | 0.5169 |
| KoikeyaTri1 | u.tokyo (none) | 0.0938 | 0.9643 | 0.171 | 0.4986 |
| OHSUVP | ohsu.hersh [5] | 0.1714 | 0.6571 | 0.2719 | 0.4983 |
| KoikeyaTri3 | u.tokyo (none) | 0.0955 | 0.9452 | 0.1734 | 0.4974 |
| KoikeyaTri2 | u.tokyo (none) | 0.0913 | 0.9738 | 0.167 | 0.4893 |
| NLMT2SVM | nlm.umd.ul [12] | 0.1286 | 0.7333 | 0.2188 | 0.4849 |
| dimacsTl9w | rutgers.dayanik [3] | 0.1456 | 0.6643 | 0.2389 | 0.4694 |
| nusbird2004c | mlg.nus [13] | 0.1731 | 0.5833 | 0.267 | 0.444 |
| lgct1 | indiana.u.seki [7] | 0.1118 | 0.7214 | 0.1935 | 0.4348 |
| OHSUNBAYES | ohsu.hersh [5] | 0.129 | 0.6548 | 0.2155 | 0.4337 |
| NLMT2BAYES | nlm.umd.ul [12] | 0.0902 | 0.869 | 0.1635 | 0.4308 |
| THIRcat04 | tsinghua.ma [14] | 0.0908 | 0.7881 | 0.1628 | 0.3935 |
| GUClin1700 | german.u.cairo [11] | 0.1382 | 0.5595 | 0.2217 | 0.3851 |
| NLMT22 | nlm.umd.ul [12] | 0.1986 | 0.481 | 0.2811 | 0.3839 |
| NTU2v3N1 | ntu.chen [15] | 0.1003 | 0.6905 | 0.1752 | 0.381 |
| NLMT21 | nlm.umd.ul [12] | 0.195 | 0.4643 | 0.2746 | 0.3685 |
| GUCply1700 | german.u.cairo [11] | 0.1324 | 0.5357 | 0.2123 | 0.3601 |
| NTU3v3N1 | ntu.chen [15] | 0.0953 | 0.6857 | 0.1673 | 0.3601 |
| NLMT2ADA | nlm.umd.ul [12] | 0.0713 | 0.9881 | 0.133 | 0.3448 |
| lgct2 | indiana.u.seki [7] | 0.1086 | 0.581 | 0.183 | 0.3426 |
| GUClin1260 | german.u.cairo [11] | 0.1563 | 0.469 | 0.2345 | 0.3425 |
| THIRcat01 | tsinghua.ma [14] | 0.1021 | 0.6024 | 0.1746 | 0.3375 |
| NTU4v3N1416 | ntu.chen [15] | 0.0948 | 0.6357 | 0.165 | 0.3323 |
| THIRcat02 | tsinghua.ma [14] | 0.1033 | 0.5571 | 0.1743 | 0.3154 |
| biotext1trge | u.cberkeley.hearst [16] | 0.0831 | 0.7 | 0.1486 | 0.3139 |
| GUCply1260 | german.u.cairo [11] | 0.1444 | 0.4333 | 0.2167 | 0.305 |
| OHSUSVMJ20 | ohsu.hersh [5] | 0.2309 | 0.3524 | 0.279 | 0.2937 |
| biotext2trge | u.cberkeley.hearst [16] | 0.095 | 0.5548 | 0.1622 | 0.2905 |
| THIRcat03 | tsinghua.ma [14] | 0.0914 | 0.55 | 0.1567 | 0.2765 |
| THIRcat05 | tsinghua.ma [14] | 0.1082 | 0.4167 | 0.1718 | 0.245 |
| biotext3trge | u.cberkeley.hearst [16] | 0.1096 | 0.4024 | 0.1723 | 0.2389 |
| nusbird2004a | mlg.nus [13] | 0.1373 | 0.3357 | 0.1949 | 0.2302 |
| nusbird2004d | mlg.nus [13] | 0.1349 | 0.2881 | 0.1838 | 0.1957 |
| nusbird2004b | mlg.nus [13] | 0.1163 | 0.3 | 0.1677 | 0.1861 |
| eres2 | u.edinburgh.sinclair [17] | 0.1647 | 0.231 | 0.1923 | 0.1724 |
| biotext4trge | u.cberkeley.hearst [16] | 0.1271 | 0.2571 | 0.1701 | 0.1688 |
| emet2 | u.edinburgh.sinclair [17] | 0.1847 | 0.2071 | 0.1953 | 0.1614 |
| epub2 | u.edinburgh.sinclair [17] | 0.1729 | 0.2095 | 0.1895 | 0.1594 |
| nusbird2004e | mlg.nus [13] | 0.136 | 0.231 | 0.1712 | 0.1576 |
| geneteam3 | u.hospital.geneva [18] | 0.1829 | 0.1833 | 0.1831 | 0.1424 |
| edis2 | u.edinburgh.sinclair [17] | 0.1602 | 0.1857 | 0.172 | 0.137 |
| wdtriage1 | indiana.u.yang [19] | 0.202 | 0.1476 | 0.1706 | 0.1185 |
| eint2 | u.edinburgh.sinclair [17] | 0.1538 | 0.1619 | 0.1578 | 0.1174 |
| NTU3v3N1c2 | ntu.chen [15] | 0.1553 | 0.1357 | 0.1449 | 0.0988 |
| geneteam1 | u.hospital.geneva [18] | 0.1333 | 0.1333 | 0.1333 | 0.09 |
| geneteam2 | u.hospital.geneva [18] | 0.1333 | 0.1333 | 0.1333 | 0.09 |
| biotext5trge | u.cberkeley.hearst [16] | 0.1192 | 0.1214 | 0.1203 | 0.0765 |
| TRICSUSM | u.sanmarcos [20] | 0.0792 | 0.1762 | 0.1093 | 0.0738 |
| IBMIRLver1 | ibm.india (none) | 0.2053 | 0.0738 | 0.1086 | 0.0595 |
| EMCTNOT1 | tno.kraaij [21] | 0.2 | 0.0143 | 0.0267 | 0.0114 |
| Mean | | 0.1381 | 0.5194 | 0.1946 | 0.3303 |
| MeSH | rutgers.dayanik [3] | 0.1502 | 0.8929 | 0.2572 | 0.6404 |
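The last two columns of each row can be derived from the precision and recall columns. The sketch below assumes the F-score is the balanced harmonic mean of precision and recall, and that the normalized utility uses weights `u_r = 20` and `u_nr = -1`; the weights are not given in this excerpt and are an assumption that matches the tabulated values to within rounding. With TP = R × positives and FP = TP × (1 − P) / P, the number of positives cancels, so the utility depends only on P and R. The function names are illustrative.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def utility_from_pr(precision, recall, u_r=20.0, u_nr=-1.0):
    """Normalized utility expressed through precision and recall.

    Derived from (u_r*TP + u_nr*FP) / (u_r*positives) with
    TP = recall * positives and FP = TP * (1 - precision) / precision.
    u_r and u_nr are assumed weights, not values quoted in this excerpt.
    """
    if precision <= 0:
        raise ValueError("precision must be positive")
    return recall * (1 + u_nr * (1 - precision) / (u_r * precision))


if __name__ == "__main__":
    # Precision and recall for two runs, copied from the table.
    for run, p, r in (("dimacsTfl9d", 0.1579, 0.8881), ("OHSUSVMJ20", 0.2309, 0.3524)):
        print(f"{run}: F={f_score(p, r):.4f} Unorm={utility_from_pr(p, r):.4f}")
```

Under these weights recall dominates the utility: OHSUSVMJ20 has the highest precision in the table but a much lower utility than the high-recall runs at the top.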
Figure 2. Triage subtask. Triage subtask runs sorted by Unorm score. The Unorm for the MeSH term Mice as well as for selecting all articles as positive is shown.
Figure 3. Number of GO codes by document frequency. This graph shows the number of GO codes at increasing levels of frequency that appear in the combined (test + training) corpus.
Figure 4. Number of documents with frequency of most common GO code. This graph shows the number of combined corpus documents having a most common GO code whose frequency is given on the x-axis.
Annotation hierarchy subtask, sorted by F-score.
| Run | Group (reference) | Precision | Recall | F-score |
| lgcad1 | indiana.u.seki [7] | 0.4415 | 0.7697 | 0.5611 |
| lgcad2 | indiana.u.seki [7] | 0.4275 | 0.7859 | 0.5537 |
| wiscWRT | u.wisconsin [8] | 0.4386 | 0.6202 | 0.5138 |
| wiscWT | u.wisconsin [8] | 0.4218 | 0.6263 | 0.5041 |
| dimacsAg3mh | rutgers.dayanik [3] | 0.5344 | 0.4545 | 0.4913 |
| NLMA1 | nlm.umd.ul [12] | 0.4306 | 0.5515 | 0.4836 |
| wiscWR | u.wisconsin [8] | 0.4255 | 0.5596 | 0.4834 |
| NLMA2 | nlm.umd.ul [12] | 0.427 | 0.5374 | 0.4758 |
| wiscW | u.wisconsin [8] | 0.3935 | 0.5596 | 0.4621 |
| KoikeyaHi1 | u.tokyo (none) | 0.3178 | 0.7293 | 0.4427 |
| iowarun3 | u.iowa [22] | 0.3207 | 0.6 | 0.418 |
| iowarun1 | u.iowa [22] | 0.3371 | 0.5434 | 0.4161 |
| iowarun2 | u.iowa [22] | 0.3812 | 0.4505 | 0.413 |
| BIOTEXT22 | u.cberkeley.hearst [16] | 0.2708 | 0.796 | 0.4041 |
| BIOTEXT21 | u.cberkeley.hearst [16] | 0.2658 | 0.8141 | 0.4008 |
| dimacsAl3w | rutgers.dayanik [3] | 0.5015 | 0.3273 | 0.3961 |
| GUCsvm0 | german.u.cairo [11] | 0.2372 | 0.7414 | 0.3595 |
| GUCir50 | german.u.cairo [11] | 0.2303 | 0.8081 | 0.3584 |
| geneteamA5 | u.hospital.geneva [18] | 0.2274 | 0.7859 | 0.3527 |
| GUCir30 | german.u.cairo [11] | 0.2212 | 0.8404 | 0.3502 |
| geneteamA4 | u.hospital.geneva [18] | 0.209 | 0.9354 | 0.3417 |
| BIOTEXT24 | u.cberkeley.hearst [16] | 0.4452 | 0.2707 | 0.3367 |
| GUCsvm5 | german.u.cairo [11] | 0.2052 | 0.9354 | 0.3366 |
| cuhkrun3 | chinese.u.hongkong (none) | 0.4174 | 0.2808 | 0.3357 |
| geneteamA2 | u.hospital.geneva [18] | 0.2025 | 0.9535 | 0.334 |
| dimacsAabsw1 | rutgers.dayanik [3] | 0.5979 | 0.2283 | 0.3304 |
| BIOTEXT23 | u.cberkeley.hearst [16] | 0.4437 | 0.2626 | 0.3299 |
| geneteamA1 | u.hospital.geneva [18] | 0.1948 | 0.9778 | 0.3248 |
| geneteamA3 | u.hospital.geneva [18] | 0.1938 | 0.9798 | 0.3235 |
| GUCbase | german.u.cairo [11] | 0.1881 | 1 | 0.3167 |
| BIOTEXT25 | u.cberkeley.hearst [16] | 0.4181 | 0.2525 | 0.3149 |
| cuhkrun2 | chinese.u.hongkong (none) | 0.4385 | 0.2303 | 0.302 |
| cuhkrun1 | chinese.u.hongkong (none) | 0.4431 | 0.2283 | 0.3013 |
| dimacsAp5w5 | rutgers.dayanik [3] | 0.5424 | 0.1939 | 0.2857 |
| dimacsAw20w5 | rutgers.dayanik [3] | 0.6014 | 0.1677 | 0.2622 |
| iowarun4 | u.iowa [22] | 0.1692 | 0.1333 | 0.1492 |
| Mean | | 0.3600 | 0.5814 | 0.3824 |
Annotation hierarchy plus evidence code subtask, sorted by F-score.
| Run | Group (reference) | Precision | Recall | F-score |
| lgcab2 | indiana.u.seki [7] | 0.3238 | 0.6073 | 0.4224 |
| lgcab1 | indiana.u.seki [7] | 0.3413 | 0.4923 | 0.4031 |
| KoikeyaHiev1 | u.tokyo (none) | 0.2025 | 0.4406 | 0.2774 |
| Mean | | 0.2892 | 0.5134 | 0.3676 |
Figure 5. Annotation hierarchy subtask. Annotation hierarchy subtask results sorted by F-score.