Artur Cieslewicz, Jakub Dutkiewicz, Czeslaw Jedrzejek.
Abstract
Database URL: https://biocaddie.org/benchmark-data
Year: 2018 PMID: 29688372 PMCID: PMC5846287 DOI: 10.1093/database/bax103
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Characteristics of the collection
| Repository | Description of repository | Number of documents | Number of different JSON key patterns within <METADATA> tag | Documents with valid title | Documents with valid keywords | Documents with valid description |
|---|---|---|---|---|---|---|
| arrayexpress | Data from high-throughput functional genomics experiments | 60 881 | 17 | 60 817 | 0 | 60 804 |
| bioproject | Collection of genomics, functional genomics and genetics studies and links to their resulting datasets | 155 850 | 41 | 155 631 | 117 577 | 149 399 |
| cia | Archive of cancer imaging data | 63 | 1 | 44 | 0 | 63 |
| clinicaltrials | Collection of data concerning publicly- and privately supported clinical studies of human participants conducted around the world | 192 500 | 5518 | 192 486 | 138 983 | 191 934 |
| ctn | Repository of data from National Drug Abuse Treatment Clinical Trials Network | 46 | 1 | 46 | 44 | 46 |
| cvrg | CardioVascular Research Grid | 29 | 5 | 29 | 0 | 28 |
| dataverse | Open-source research data repository software | 60 303 | 7 | 60 037 | 0 | 60 303 |
| dryad | General-purpose database for wide diversity of databases | 67 455 | 98 | 62 795 | 60 957 | 58 421 |
| gemma | Database for genomics data (especially gene expression profiles) | 2285 | 1 | 2272 | 0 | 2285 |
| geo | Datasets focused on gene expression | 105 033 | 4 | 96 264 | 0 | 105 033 |
| mpd | Collection of measured data on laboratory mouse strains and populations | 235 | 1 | 235 | 0 | 235 |
| neuromorpho | Collection of digitally reconstructed neurons associated with peer-reviewed publications | 34 082 | 1 | 30 016 | 0 | 34 082 |
| nursadatasets | Repository of data on the role of nuclear receptors (NRs) in human diseases and conditions in which NRs play an integral role | 389 | 2 | 389 | 387 | 389 |
| openfmri | Collection of magnetic resonance imaging data | 36 | 1 | 35 | 0 | 36 |
| pdb | Database with protein aminoacid sequences | 113 493 | 1410 | 113 424 | 113 492 | 113 331 |
| peptideatlas | Public compendium of peptides identified in mass spectrometry proteomics experiments | 76 | 1 | 55 | 0 | 76 |
| phenodisco | Repository of data from studies investigating the interaction of genotype and phenotype in Humans | 429 | 1 | 429 | 0 | 429 |
| physiobank | The archive containing digital recordings of physiologic signals and related data | 70 | 1 | 70 | 0 | 70 |
| Proteomexchange | Mass spectrometry proteomics data | 1716 | 1 | 1706 | 1716 | 1716 |
| Yped | Open-source proteomics database for high throughput proteomic and small molecule data | 21 | 1 | 21 | 0 | 21 |
| Total | | 794 992 | 7113 | 776 801 | 433 156 | 778 701 |
In many documents, certain data are missing or were removed as uninformative. Of all documents, 97.71% had a valid title, 54.49% had valid keywords and 97.95% had a valid description. In total, 99.98% had at least one of a valid title, keywords or description.
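The per-field percentages follow directly from the "Total" row of the table. A minimal Python sketch of that arithmetic, using only the totals reported above (the function and variable names are illustrative, not from the paper):

```python
# Totals copied from the "Total" row of the collection table.
total_docs = 794_992
valid = {"title": 776_801, "keywords": 433_156, "description": 778_701}


def coverage(count, total):
    """Percentage of documents with a valid field, rounded to two decimals."""
    return round(100 * count / total, 2)


percentages = {field: coverage(n, total_docs) for field, n in valid.items()}
print(percentages)
```

The rounded values reproduce the 97.71%, 54.49% and 97.95% quoted in the text.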
Preparation of text data for title, keywords and description categories
| Repository | Title | Keywords | Description |
|---|---|---|---|
| Arrayexpress | title | | description |
| Bioproject | title | dataItemkeywords | organismtargetspecies, dataItemdescription |
| Cia | title | | anatomicalPartname, diseasename, organismname, organismscientificname |
| Clinicaltrials | title | keyword | criteria, StudyGroupdescription, Diseasename, Treatmentdescription, Treatmentagent, Datasetdescription |
| Ctn | title | datasetkeywords | datasetdescription, organismscientificName, organismname |
| Cvrg | title | | datasetdescription |
| Dataverse | title | | publicationdescription, datasetdescription |
| Dryad | title | datasetkeywords | datasetdescription |
| Gemma | title | | dataItemdescription, organismcommonName |
| Geo | title | | dataItemsource_name, dataItemorganism, dataItemdescription, text data downloaded from geo database on the basis of the geo_accesion code |
| Mpd | title | | datasetdescription, organismscientificName, organismname |
| Neuromorpho | title | | anatomicalPartname, cellname, organismscientificName, organismname |
| Nursadatasets | title | datasetkeywords | datasetdescription, organismname |
| Openfmri | title | | datasetdescription |
| Pdb | title | dataItemkeywords | dataItemdescription, organismsourcescientificName, organismhostscientificName, genename |
| Peptideatlas | title | | datasetdescription, treatmentdescription |
| Phenodisco | title | | inexclude, desc, disease, history |
| Physiobank | title | | datasetdescription |
| Proteomexchange | title | keywords | organismname |
| Yped | title | | datasetdescription, organismname |
Items in the table represent column names from the SQL database (prepared on the basis of documents’ JSON keys). In most cases, more than one column was used to prepare text categorized as Description. Thirteen repositories did not provide any keywords.
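As an illustration of how several columns might be merged into one category, here is a minimal Python sketch. The column names for `ctn` and `cvrg` are taken from the table; the function name, the row layout and the choice to join values with a space are assumptions, not the authors' implementation.

```python
# Column lists per category, copied from the table for two repositories.
CATEGORY_COLUMNS = {
    "ctn": {
        "title": ["title"],
        "keywords": ["datasetkeywords"],
        "description": ["datasetdescription", "organismscientificName", "organismname"],
    },
    "cvrg": {
        "title": ["title"],
        "keywords": [],  # this repository provides no keywords
        "description": ["datasetdescription"],
    },
}


def build_categories(repository, row):
    """Join the non-empty columns of one row into Title/Keywords/Description text.

    A category with no usable value is returned as None (NULL in the tables).
    """
    result = {}
    for category, columns in CATEGORY_COLUMNS[repository].items():
        values = [str(row[c]).strip() for c in columns if row.get(c)]
        result[category] = " ".join(values) if values else None
    return result


# Hypothetical row, for illustration only.
example = build_categories(
    "cvrg", {"title": "Study A", "datasetdescription": "Heart signals"}
)
print(example)
```

Under these assumptions a repository without keyword columns always yields `None` for Keywords, which matches the thirteen repositories noted above.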
Examples of datasets having very little useful information
| No. | Document number | Repository | Title | Keywords | Description |
|---|---|---|---|---|---|
| 1 | 104242 | dryad | chr19 | NULL | NULL |
| 2 | 108196 | dryad | Chr8 | NULL | NULL |
| 3 | 124757 | bioproject | Sobemovirus | NULL | NULL |
| 4 | 151909 | bioproject | Alphaflexiviridae | NULL | NULL |
| 5 | 500000 | geo | A375R_RPL10a_vivo__ Ronly_vem10d_rep2 | NULL | melanoma |
| 6 | 500002 | geo | A375_vitro_vehicle_rep3 | NULL | melanoma |
NULL means that in the source file there was no information that could be categorized as ‘Title’, ‘Keywords’ or ‘Description’.
Original Poznan consortium results as submitted for the challenge vs. the best participant results for a given evaluation measure (in bold font)
| Group | Submission | infAP | infNDCG | NDCG@10 | P@10 (+partial) | P@10 (−partial) |
|---|---|---|---|---|---|---|
| IAII_PUT | Biocaddie dphresults.txt | 0.0876 | 0.3580 | 0.4265 | 0.5333 | 0.1600 |
| UCSD | armyofucsdgrads-3.txt | 0.1468 | 0.5303 | 0.7133 | 0.2400 | |
| SIBTex | sibtex-5_0.txt | 0.4188 | 0.6271 | 0.7533 | 0.3467 | |
| Elsevier | elsevier4.txt | 0.3049 | 0.4368 | |||
| UIUC GSIS | sdm-0.75-0.1-0.15.krovetz.txt | 0.3228 | 0.4502 | 0.5569 | 0.7133 | 0.2867 |
| BioMelb | Post-challenge | 0.3575 | 0.4219 | 0.7733 | ||
The results of the current Poznan consortium work are shown in italics.
Baseline information retrieval results
| Algorithm | infAP | infNDCG | P@10 (+partial) | P@10 (−partial) |
|---|---|---|---|---|
| BB2 | 0.3550 | 0.4184 | 0.7133 | 0.3533 |
| BM25 | 0.3547 | 0.4055 | 0.7067 | 0.3400 |
| DFR_BM25 | 0.3723 | 0.4085 | 0.7067 | 0.3533 |
| Dfree | 0.3664 | 0.4248 | ||
| DLH | 0.3617 | 0.4120 | 0.7200 | 0.3333 |
| DLH13 | 0.3640 | 0.4207 | 0.7533 | 0.3733 |
| DPH | 0.3442 | 0.4125 | 0.7200 | 0.3400 |
| IFB2 | 0.3494 | 0.3948 | 0.6853 | 0.3400 |
| In_ExpB2 | 0.3534 | 0.4079 | 0.7222 | 0.3667 |
| In_ExpC2 | 0.3379 | 0.4015 | 0.7367 | 0.3333 |
| InL2 | 0.4181 | 0.7367 | 0.3600 | |
| LGD | 0.3773 | 0.7333 | 0.3933 | |
| PL2 | 0.3474 | 0.4009 | 0.7222 | 0.3067 |
| TFIDF | 0.3530 | 0.4120 | 0.7067 | 0.3400 |
Bold font indicates the highest values for a given measure.
Baseline information retrieval results with the best word2vec query expansion and PRF
| Algorithm | Run parameters | infAP | infNDCG | P@10 (+partial) | P@10 (−partial) |
|---|---|---|---|---|---|
| BB2 | terrier Rocchio | 0.3911 | 0.4325 | 0.3200 | |
| BB2 | word2vec and terrier Rocchio | 0.4533 | 0.3200 | ||
| BM25 | terrier Rocchio | 0.3719 | 0.4158 | 0.7067 | 0.3200 |
| BM25 | word2vec and terrier Rocchio | 0.3601 | 0.4286 | 0.6933 | 0.3200 |
| DFR_BM25 | terrier Rocchio | 0.3883 | 0.4066 | 0.7214 | 0.3133 |
| DFR_BM25 | word2vec and terrier Rocchio | 0.3801 | 0.4311 | 0.7267 | 0.3133 |
| Dfree | terrier Rocchio | 0.3910 | 0.4371 | 0.7500 | 0.3667 |
| Dfree | word2vec and terrier Rocchio | 0.3888 | 0.4454 | 0.7567 | 0.3733 |
| DLH | terrier Rocchio | 0.3683 | 0.4181 | 0.7400 | 0.3000 |
| DLH | word2vec and terrier Rocchio | 0.3604 | 0.4292 | 0.7400 | 0.3000 |
| DLH13 | terrier Rocchio | 0.3759 | 0.4324 | 0.7733 | 0.3467 |
| DLH13 | word2vec and terrier Rocchio | 0.3692 | 0.4422 | 0.7733 | 0.3467 |
| DPH | terrier Rocchio | 0.3779 | 0.4194 | 0.7500 | 0.3133 |
| DPH | word2vec and terrier Rocchio | 0.3751 | 0.4276 | 0.7567 | 0.3200 |
| IFB2 | terrier Rocchio | 0.3669 | 0.4005 | 0.7233 | 0.3133 |
| IFB2 | word2vec and terrier Rocchio | 0.3813 | 0.4284 | 0.7367 | 0.3067 |
| In_ExpB2 | terrier Rocchio | 0.3720 | 0.4108 | 0.7433 | 0.3133 |
| In_ExpB2 | word2vec and terrier Rocchio | 0.3816 | 0.4330 | 0.7433 | 0.3133 |
| In_ExpC2 | terrier Rocchio | 0.3720 | 0.3999 | 0.7367 | 0.3133 |
| In_ExpC2 | word2vec and terrier Rocchio | 0.3672 | 0.4157 | 0.7367 | 0.3133 |
| InL2 | terrier Rocchio | 0.4259 | 0.7533 | 0.3133 | |
| InL2 | word2vec and terrier Rocchio | 0.3902 | 0.4360 | 0.7467 | 0.3200 |
| LGD | terrier Rocchio | 0.3990 | 0.4456 | 0.7633 | 0.3867 |
| LGD | word2vec and terrier Rocchio | 0.3978 | 0.4539 | 0.7700 | 0.3933 |
| PL2 | terrier Rocchio | 0.3648 | 0.4082 | 0.7467 | 0.2800 |
| PL2 | word2vec and terrier Rocchio | 0.3542 | 0.4213 | 0.7467 | 0.2800 |
| TFIDF | terrier Rocchio | 0.3641 | 0.4023 | 0.7317 | 0.3133 |
| TFIDF | word2vec and terrier Rocchio | 0.3523 | 0.4154 | 0.7250 | 0.3133 |
Bold font indicates the highest values for a given measure.
Variation of measures for each bioCADDIE question
| Query number | infAP | infNDCG | P@10 (+partial) |
|---|---|---|---|
| 1 | 0.4217 | 0.6504 | 0.9000 |
| 2 | 0.3933 | 0.3338 | 0.8000 |
| 3 | 0.5832 | 0.6898 | 0.9000 |
| 4 | 0.6999 | 0.5177 | 1.0000 |
| 5 | 0.1620 | 0.2897 | 0.4000 |
| 6 | 0.3256 | 0.4938 | 1.0000 |
| 7 | 0.1931 | 0.6197 | 0.2500 |
| 8 | 0.0856 | 0.4547 | 0.3000 |
| 9 | 0.2207 | 0.2607 | 0.8000 |
| 10 | 0.1186 | 0.1961 | 0.5000 |
| 11 | 0.6373 | 0.3402 | 1.0000 |
| 12 | 0.5860 | 0.4011 | 0.9000 |
| 13 | 0.3171 | 0.2919 | 0.9000 |
| 14 | 0.7005 | 0.3300 | 0.9000 |
| 15 | 0.5228 | 0.9384 | 1.0000 |
| Average | 0.3978 | 0.4539 | 0.7700 |
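The "Average" row is the arithmetic mean over the 15 queries. A short Python check of that row, using the per-query values from the table (variable names are illustrative):

```python
from statistics import mean

# Per-query values copied from the table, queries 1 to 15 in order.
inf_ap = [0.4217, 0.3933, 0.5832, 0.6999, 0.1620, 0.3256, 0.1931, 0.0856,
          0.2207, 0.1186, 0.6373, 0.5860, 0.3171, 0.7005, 0.5228]
inf_ndcg = [0.6504, 0.3338, 0.6898, 0.5177, 0.2897, 0.4938, 0.6197, 0.4547,
            0.2607, 0.1961, 0.3402, 0.4011, 0.2919, 0.3300, 0.9384]
p_at_10 = [0.9, 0.8, 0.9, 1.0, 0.4, 1.0, 0.25, 0.3,
           0.8, 0.5, 1.0, 0.9, 0.9, 0.9, 1.0]

averages = {
    "infAP": round(mean(inf_ap), 4),
    "infNDCG": round(mean(inf_ndcg), 4),
    "P@10": round(mean(p_at_10), 4),
}
print(averages)
```

The spread is wide: infAP ranges from 0.0856 (query 8) to 0.7005 (query 14), so the averages hide large per-query differences.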
Evaluation of search results obtained with the LGD algorithm using the same or different weights for original and expanded terms
| Run | infAP | infNDCG | NDCG@10 | P@10 (+partial) | P@10 (−partial) |
|---|---|---|---|---|---|
| Separate words; terms added manually; same weight of all terms | 0.2896 | 0.3329 | 0.6656 | ||
| Separate words; terms added manually; original query words weight = 100 | 0.3922 | 0.4525 | 0.7633 | ||
| Terms from query as separate words without query expansion | 0.3773 | 0.4355 | 0.6375 | 0.7333 | 0.4000 |
| Terms from query as separate words; Terrier query expansion (PRF) | 0.3990 | 0.4456 | 0.6425 | 0.7633 | 0.3867 |
| Terms from query (weight 100) + word2vec (weight 20 or 1, depending on the corpus − PubMed or bioCADDIE) + Terrier query expansion (PRF) | 0.3978 | 0.4539 | 0.6425 | 0.7700 | 0.3933 |
Manually added terms were chosen by a biology specialist.
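The weighting scheme in the table (original query terms at weight 100, expansion terms at a lower weight such as 20 or 1) can be sketched as a weighted query string. The sketch below is in Python and assumes the `term^weight` boost notation of the Terrier query language; the function name and the example terms are hypothetical, and this is not the authors' actual pipeline.

```python
def weighted_query(original_terms, expanded_terms,
                   original_weight=100, expanded_weight=20):
    """Build a query string in which each term carries a `term^weight` boost.

    Expansion terms that repeat an original term are skipped, so the
    original term keeps its higher weight.
    """
    parts = [f"{term}^{original_weight}" for term in original_terms]
    seen = set(original_terms)
    for term in expanded_terms:
        if term not in seen:
            parts.append(f"{term}^{expanded_weight}")
            seen.add(term)
    return " ".join(parts)


# Hypothetical terms, for illustration only.
query = weighted_query(["melanoma", "braf"], ["tumor", "melanoma", "vemurafenib"])
print(query)
```

Keeping the original terms at a much higher weight limits the drift that expansion terms can cause, which is consistent with the drop in all measures in the "same weight of all terms" run.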
Evaluation of search results obtained with various algorithms without use of partially relevant documents
| Baseline method | infAP (NoEXP) | infAP (Terrier) | infAP (Emb) | infNDCG (NoEXP) | infNDCG (Terrier) | infNDCG (Emb) |
|---|---|---|---|---|---|---|
| InL2c | 0.1940 | 0.2085 | 0.2524 | |||
| BB2 | 0.1853 | 0.2023 | 0.2079 | 0.2469 | 0.2624 | 0.2642 |
| BM25 | 0.1813 | 0.1950 | 0.1980 | 0.2437 | 0.2591 | 0.2610 |
| DFR_BM25 | 0.1893 | 0.1996 | 0.2040 | 0.2469 | 0.2590 | 0.2601 |
| In_expB2 | 0.1841 | 0.1954 | 0.1995 | 0.2439 | 0.2578 | 0.2587 |
| DLH13 | 0.1780 | 0.1815 | 0.1845 | 0.2495 | 0.2529 | 0.2585 |
| LGD | 0.2013 | 0.2569 | 0.2579 | |||
| DLH | 0.1633 | 0.1664 | 0.1688 | 0.2392 | 0.2560 | 0.2579 |
| DFRee | 0.1779 | 0.1905 | 0.1981 | 0.2489 | 0.2547 | 0.2560 |
| IFB | 0.1684 | 0.1762 | 0.1824 | 0.2360 | 0.2467 | 0.2485 |
| In_expC2 | 0.1754 | 0.1786 | 0.1829 | 0.2350 | 0.2408 | 0.2420 |
| PL2 | 0.1660 | 0.1708 | 0.1735 | 0.2367 | 0.2371 | 0.2383 |
| DPH | 0.1584 | 0.1691 | 0.1777 | 0.2422 | 0.2343 | 0.2360 |
| TF_IDF | 0.1827 | 0.1880 | 0.1904 | 0.2456 | 0.2327 | 0.2345 |
Bold font indicates the highest values for a given measure.