Karin Verspoor, K Bretonnel Cohen, Lawrence Hunter.
Abstract
BACKGROUND: Recent years have seen an increased amount of natural language processing (NLP) work on full text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption.
Year: 2009 PMID: 19527520 PMCID: PMC2714574 DOI: 10.1186/1471-2105-10-183
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Incidence of syntactic/semantic phenomena
| | CRAFT | TraJour | Reference | BioReference |
| Document count | 97 | 99 | 2,500 | 163 |
| Sentence count | 43,694 | 35,997 | 53,107 | 32,895 |
| Avg. Sentence count per document | 450 | 364 | 21 | 202 |
| Token count | 717,166 | 598,331 | 1,096,976 | 654,493 |
| Type count | 41,574 | 49,394 | 40,139 | 38,801 |
| Stopword count | 238,542 | 193,905 | 453,264 | 238,077 |
| Stopword % | 33.3% | 32.4% | 41.3% | 36.4% |
| Avg. Document length | 7,393 | 6,044 | 439 | 4,015 |
| Avg. Sentence length | 22.5 | 24.7 | 26.4 | 27.8 |
| Types/Tokens | 5.8% | 8.3% | 3.7% | 5.9% |
| Tokens/Types | 17.3 | 12.1 | 27.3 | 16.9 |
| Negatives | 3,273 | 2,587 | 7,605 | 2,961 |
| Negatives % | 0.46% | 0.43% | 0.69% | 0.45% |
| Coordination | 25,237 | 23,706 | 26,019 | 25,059 |
| Coordination % | 3.52% | 3.96% | 2.37% | 3.83% |
| Pronouns | 18,874 | 15,603 | 57,406 | 20,699 |
| Pronouns % | 2.63% | 2.61% | 5.23% | 3.16% |
| Passives | 2,783 | 2,587 | 2,661 | 3,172 |
| Passives % | 0.39% | 0.43% | 0.24% | 0.48% |
This table represents the counts of linguistic phenomena determined from our four document sets, CRAFT (open access), TraJour (traditional journals), Reference (Wall Street Journal), and BioReference (full text biomedical publications).
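The derived rows of the table (stopword percentage, average document length, type/token ratios) follow from the raw counts. A minimal sketch of how such figures could be computed is given below; the function name `corpus_stats`, the nested-list input format, lowercasing of tokens, and the stopword set are assumptions for illustration, not the tokenization or stopword list used in the study.

```python
from collections import Counter


def corpus_stats(documents, stopwords):
    """Compute basic corpus statistics.

    documents: non-empty list of documents; each document is a list of
    sentences; each sentence is a list of token strings (assumed format).
    stopwords: set of lowercase stopword strings (assumed list).
    """
    # Flatten to a single lowercase token stream.
    tokens = [tok.lower() for doc in documents for sent in doc for tok in sent]
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_sentences = sum(len(doc) for doc in documents)
    n_stop = sum(c for term, c in counts.items() if term in stopwords)
    return {
        "documents": len(documents),
        "sentences": n_sentences,
        "tokens": n_tokens,
        "types": n_types,
        "stopword_pct": 100.0 * n_stop / n_tokens,
        "avg_doc_length": n_tokens / len(documents),
        "types_per_token_pct": 100.0 * n_types / n_tokens,
        "tokens_per_type": n_tokens / n_types,
    }
```

As a check against the table, the CRAFT counts of 717,166 tokens and 41,574 types give about 17.3 tokens per type and a type/token ratio of about 5.8%, matching the reported values.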
Figure 1. Sentence length distribution. Sentence length distributions for the four document sets, measured as the relative proportion of the sentences in the corpus of a particular length. The data here is binned – "10" means a sentence length of 1–10 tokens, "20" 11–20 tokens, etc.
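The binning described in this caption can be sketched as follows. This is an illustrative sketch only: the function name `binned_length_distribution` and its parameters are hypothetical, and sentence lengths are assumed to be token counts of at least 1.

```python
import math
from collections import Counter


def binned_length_distribution(sentence_lengths, bin_width=10):
    """Return the relative proportion of sentences per length bin.

    Each length is mapped to the upper edge of its bin, so with
    bin_width=10 lengths 1-10 fall in bin 10, 11-20 in bin 20, etc.
    """
    if not sentence_lengths:
        return {}
    bins = Counter(
        math.ceil(length / bin_width) * bin_width for length in sentence_lengths
    )
    total = len(sentence_lengths)
    return {edge: count / total for edge, count in sorted(bins.items())}
```

Reporting proportions instead of raw counts keeps the four corpora comparable despite their different sizes.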
Figure 2. Kullback-Leibler divergences. KL divergences at the top n terms for CRAFT (open access) versus TraJour (traditional journal) and for each target corpus against the Wall Street Journal reference corpus and the BioReference corpus.
KL divergence of term probability distributions, CRAFT versus TraJour
| Top n terms | CRAFT v. TraJour | CRAFT v. BioRef. | TraJour v. BioRef. | CRAFT v. Ref. | TraJour v. Ref. |
| 100 | -0.006925696 | 0.043712192 | 0.020944793 | 0.161024124 | 0.16794174 |
| 200 | -0.007124725 | 0.053331913 | 0.03214335 | 0.236587232 | 0.257485466 |
| 300 | -0.005614059 | 0.050666423 | 0.037185528 | 0.319120526 | 0.341360939 |
| 400 | -0.001556702 | 0.05700178 | 0.047472912 | 0.36994002 | 0.386699809 |
| 500 | 0.007515454 | 0.064545725 | 0.04958329 | 0.411526361 | 0.421816134 |
| 1000 | 0.041726207 | 0.096664283 | 0.089761915 | 0.513431974 | 0.548467754 |
| 1500 | 0.06325848 | 0.134310701 | 0.11321715 | 0.577868266 | 0.641503517 |
| 2000 | 0.078438422 | 0.158005507 | 0.138857184 | 0.642507317 | 0.69333303 |
| 2500 | 0.098753882 | 0.180169586 | 0.157642056 | 0.697711222 | 0.746388986 |
| 3000 | 0.108449436 | 0.19872906 | 0.179409293 | 0.746911394 | 0.817412333 |
| 3500 | 0.118474793 | 0.215904498 | 0.193018939 | 0.794260113 | 0.87476207 |
| 4000 | 0.132179627 | 0.228193197 | 0.207559096 | 0.830437495 | 0.904734502 |
| 4500 | 0.145510397 | 0.244716631 | 0.21989223 | 0.872842604 | 0.942379721 |
| 5000 | 0.152931092 | 0.258427849 | 0.230542781 | 0.89245553 | 0.969431637 |
This table shows the KL divergence of the probability distributions of words in the corpora. Each row in the table corresponds to the figure for the top n most frequent terms in the corpora.
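For reference, the KL divergence of a distribution P from a distribution Q is the sum over terms t of P(t) log(P(t)/Q(t)). The sketch below restricts that sum to the top n terms. The record does not state how the top n terms were chosen, whether probabilities were renormalized over that subset, how unseen terms were handled, or which log base was used, so every such choice here is an assumption: terms are ranked by combined frequency, probabilities are whole-corpus relative frequencies, unseen terms get a small floor `epsilon`, and the natural log is used. The name `kl_top_n` is hypothetical.

```python
import math
from collections import Counter


def kl_top_n(tokens_p, tokens_q, n, epsilon=1e-9):
    """Partial KL divergence D(P || Q) summed over the top n terms.

    tokens_p, tokens_q: token lists for the two corpora.
    Top terms are ranked by combined frequency (an assumption).
    """
    counts_p, counts_q = Counter(tokens_p), Counter(tokens_q)
    total_p, total_q = sum(counts_p.values()), sum(counts_q.values())
    combined = counts_p + counts_q
    top_terms = [term for term, _ in combined.most_common(n)]

    divergence = 0.0
    for term in top_terms:
        p = counts_p[term] / total_p
        q = counts_q[term] / total_q
        if p == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        # Floor q so that terms unseen in Q do not divide by zero.
        divergence += p * math.log(p / max(q, epsilon))
    return divergence
```

Because the probabilities in this sketch are not renormalized over the top n terms, the partial sum is not guaranteed to be non-negative. That is one possible reading of the small negative values in the first rows of the table, though the record does not confirm how they arose.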
Log Likelihood analysis of terms in CRAFT vs. TraJour
| Term | LL | |
| figure | 2318.9 | |
| doi | 1099.3 | |
| window | 854.6 | |
| fig | 756.7 | |
| text | 743.6 | |
| abstract | 721.9 | |
| mice | 678.2 | |
| pp | 608.5 | |
| hair | 601.8 | |
| | 588.1 | |
| x1 | 570.6 | |
| full | 550.8 | |
| pgc | 516.9 | |
| µm | 502.3 | |
| e2 | 465.6 | |
| chm | 460.5 | |
| gp | 435.8 | |
| ephrin | 418.1 | |
| qtl | 381.5 | |
| view | 363.9 | |
| °c | 338.5 | |
| sam68 | 328.8 | |
| atrx | 322.0 | |
| bhlh | 320.2 | |
| ptds | 311.8 | |
| version | 305.1 | |
| olfactory | 301.0 | |
| ca | 294.9 | |
| mena | 294.3 | |
| ap | 292.2 | |
| rb | 292.0 | |
| sox1 | 288.2 | |
| null | 287.4 | |
| file | 278.4 | |
| p300 | 270.1 | |
| -catenin | 264.0 | |
| -1? | 262.1 | |
| kinase | 256.8 | |
| binding | 256.2 | |
| nk | 256.0 | |
| snail | 256.0 | |
| -1?? | 253.6 | |
| ited | 251.2 | |
| larger | 247.3 | |
| states | 244.0 | |
| 5? | 243.9 | |
| nxt1 | 241.7 | |
| strains | 240.3 | |
| articles | 239.6 | |
| wk | 239.4 | |
These are the results of log likelihood analysis of all terms in the CRAFT (open access publications) and TraJour (traditional journals) corpora, ranked by the largest difference.
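The record does not spell out the log likelihood formula. A minimal sketch is given below, assuming the widely used two-corpus log-likelihood (G2) statistic, which compares each term's observed frequency in the two corpora with the frequency expected if the term were equally likely in both. The function names `log_likelihood` and `rank_terms` and the `"corpus_1"`/`"corpus_2"` labels are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 statistic for one term.

    a, b: frequency of the term in corpus 1 and corpus 2.
    c, d: total number of tokens in corpus 1 and corpus 2.
    """
    expected_1 = c * (a + b) / (c + d)
    expected_2 = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_1)
    if b > 0:
        total += b * math.log(b / expected_2)
    return 2.0 * total


def rank_terms(tokens_1, tokens_2, top_k=50):
    """Rank all terms by G2, noting which corpus over-represents each."""
    counts_1, counts_2 = Counter(tokens_1), Counter(tokens_2)
    c, d = sum(counts_1.values()), sum(counts_2.values())
    scored = []
    for term in counts_1.keys() | counts_2.keys():
        a, b = counts_1[term], counts_2[term]
        over = "corpus_1" if a / c >= b / d else "corpus_2"
        scored.append((term, log_likelihood(a, b, c, d), over))
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:top_k]
```

A term with the same relative frequency in both corpora scores 0, and the score grows with both the size of the frequency difference and the amount of evidence, which is why very frequent terms dominate the top of such rankings.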
Log Likelihood analysis of terms in CRAFT vs. BioReference
| Term | LL | |
| mice | 3755.8 | |
| abstract | 1830.7 | |
| doi | 1650.5 | |
| mouse | 1489.8 | |
| window | 1229.4 | |
| free | 1183.7 | |
| embryos | 1151.8 | |
| figure | 1017.2 | |
| null | 922.5 | |
| embryonic | 657.2 | |
| hair | 611.3 | |
| pgc | 539.9 | |
| µm | 536.9 | |
| e2 | 532.4 | |
| olfactory | 512.4 | |
| ephrin | 503.2 | |
| development | 492.1 | |
| mutant | 480.3 | |
| view | 471.3 | |
| wild | 465.8 | |
| allele | 455.9 | |
| expression | 451.1 | |
| qtl | 430.6 | |
| version | 424.9 | |
| gene | 416.4 | |
| type | 411.4 | |
| homozygous | 407.4 | |
| larger | 405.2 | |
| knockout | 394.6 | |
| shh | 387.8 | |
| heterozygous | 384.8 | |
| differentiation | 376.9 | |
| fig | 370.8 | |
| °c | 361.8 | |
| atrx | 356.7 | |
| sam68 | 351.5 | |
| sections | 341.2 | |
| new | 336.1 | |
| ptds | 333.3 | |
| ap | 332.3 | |
| es | 324.5 | |
| women | 322.5 | |
| sox1 | 320.3 | |
| targeted | 317.0 | |
| annexin | 316.8 | |
| defects | 312.0 | |
| limb | 311.4 | |
| targeting | 310.0 | |
| cleavage | 306.5 | |
| a7 | 298.7 | |
These are the results of log likelihood analysis of all terms in the CRAFT and BioReference corpora, ranked by the largest difference.
Log Likelihood analysis of terms in TraJour vs. BioReference
| Term | LL | |
| mouse | 1260.6 | |
| mice | 1219.8 | |
| free | 911.8 | |
| embryos | 704.9 | |
| | 690.2 | |
| text | 688.1 | |
| pp | 569.5 | |
| full | 543.5 | |
| expression | 514.6 | |
| medline | 512.5 | |
| crossref | 497.6 | |
| embryonic | 479.8 | |
| development | 447.7 | |
| chm | 443.4 | |
| patients | 425.4 | |
| x1 | 422.4 | |
| risk | 372.3 | |
| bhlh | 357.7 | |
| gp | 356.8 | |
| slap-2 | 354.7 | |
| figure | 354.6 | |
| dpc | 331.1 | |
| jmj | 331.1 | |
| women | 329.3 | |
| tap | 326.3 | |
| pb | 310.7 | |
| nxt1 | 304.5 | |
| isi | 297.6 | |
| p300 | 295.8 | |
| mena | 286.7 | |
| endoderm | 285.6 | |
| hybridization | 275.7 | |
| exercise | 273.5 | |
| cited4 | 273.4 | |
| tbx2 | 270.5 | |
| zfp-57 | 266.0 | |
| otx2 | 264.6 | |
| neural | 263.8 | |
| orderarticleviainfotrieve | 261.3 | |
| sti | 258.9 | |
| abstract | 258.6 | |
| ko | 258.4 | |
| mznf8 | 257.2 | |
| heterozygous | 255.9 | |
| embryo | 252.2 | |
| gl | 249.8 | |
| domain | 249.5 | |
| -catenin | 246.6 | |
| mutants | 245.3 | |
| chl1 | 243.9 | |
These are the results of log likelihood analysis of all terms in the TraJour and BioReference corpora, ranked by the largest difference.
Log Likelihood analysis of terms in CRAFT vs. Reference
| Term | LL | |
| mice | 9705.6 | |
| 's | 9351.0 | |
| said | 7898.0 | |
| cells | 6565.1 | |
| million | 5684.6 | |
| expression | 5272.3 | |
| figure | 4528.4 | |
| 't | 4392.8 | |
| he | 4284.3 | |
| cell | 4224.9 | |
| mouse | 3914.6 | |
| mr | 3850.0 | |
| year | 3788.4 | |
| gene | 3766.7 | |
| company | 3362.6 | |
| protein | 3221.6 | |
| it | 3199.2 | |
| to | 2986.3 | |
| will | 2948.4 | |
| type | 2833.5 | |
| were | 2803.0 | |
| embryos | 2619.0 | |
| its | 2564.8 | |
| stock | 2475.9 | |
| genes | 2442.7 | |
| doi | 2431.4 | |
| mutant | 2383.2 | |
| wild | 2317.2 | |
| about | 2192.3 | |
| new | 2158.6 | |
| analysis | 2140.2 | |
| his | 2107.7 | |
| and | 1972.7 | |
| who | 1843.6 | |
| corp | 1769.0 | |
| they | 1696.4 | |
| null | 1689.1 | |
| dna | 1596.9 | |
| in | 1585.8 | |
| al | 1557.6 | |
| et | 1484.3 | |
| shares | 1477.7 | |
| inc | 1475.9 | |
| would | 1468.6 | |
| receptor | 1458.6 | |
| shown | 1396.3 | |
| differentiation | 1366.1 | |
| using | 1333.2 | |
| has | 1332.9 | |
| fig | 1326.0 | |
These are the results of log likelihood analysis of all terms in the CRAFT and the general Reference corpora, ranked by the largest difference.
Log Likelihood analysis of terms in TraJour vs. Reference
| Term | LL | |
| 's | 8358.8 | |
| cells | 7680.7 | |
| said | 6854.5 | |
| expression | 5262.0 | |
| million | 4975.7 | |
| mice | 4833.1 | |
| mr | 4607.9 | |
| cell | 4285.6 | |
| protein | 4074.2 | |
| fig | 3881.0 | |
| 't | 3808.8 | |
| he | 3660.9 | |
| to | 3464.4 | |
| mouse | 3425.1 | |
| year | 3165.5 | |
| and | 3144.6 | |
| it | 2864.9 | |
| will | 2858.5 | |
| company | 2820.1 | |
| et | 2690.1 | |
| were | 2680.4 | |
| al | 2555.7 | |
| gene | 2389.7 | |
| stock | 2196.0 | |
| biol | 2180.8 | |
| proteins | 2163.8 | |
| binding | 2009.0 | |
| type | 1990.1 | |
| its | 1900.2 | |
| domain | 1897.9 | |
| shown | 1873.2 | |
| about | 1859.1 | |
| embryos | 1822.6 | |
| his | 1701.0 | |
| they | 1700.4 | |
| who | 1694.5 | |
| would | 1623.4 | |
| mutant | 1598.6 | |
| analysis | 1592.6 | |
| wild | 1591.2 | |
| abstract | 1560.2 | |
| corp | 1546.0 | |
| receptor | 1537.5 | |
| up | 1483.3 | |
| activity | 1404.0 | |
| in | 1385.2 | |
| expressed | 1375.6 | |
| genes | 1337.5 | |
| pp | 1316.4 | |
| on | 1304.3 | |
These are the results of log likelihood analysis of all terms in the TraJour and the general Reference corpora, ranked by the largest difference.
TF*IDF-ranked terms in the corpora
| CRAFT term | CRAFT TF*IDF | TraJour term | TraJour TF*IDF | Reference term | Reference TF*IDF | BioReference term | BioReference TF*IDF |
| | 0.435821989 | | 0.336961638 | mr | 0.121256579 | | 0.320612568 |
| | 0.270086285 | | 0.23486562 | says | 0.118389148 | | 0.205308437 |
| | 0.216037704 | | 0.23159711 | that | 0.118092658 | cell | 0.20214811 |
| | 0.178144406 | | 0.220320662 | he | 0.102566064 | abstract | 0.190193446 |
| | 0.172290914 | | 0.195781243 | market | 0.091669921 | medline | 0.188483175 |
| | 0.163204203 | | 0.187501368 | 's | 0.088505453 | | 0.177454065 |
| | 0.151251462 | | 0.167860719 | million | 0.08479812 | fulltext | 0.140623811 |
| | 0.14510293 | | 0.135330482 | is | 0.083961293 | | 0.138198756 |
| figure | 0.12789903 | | 0.120455574 | as | 0.081560405 | orderarticle... | 0.119943839 |
| doi | 0.122095859 | | 0.117032939 | his | 0.081293607 | | 0.109091523 |
| | 0.120878869 | | 0.1166477 | on | 0.079143554 | | 0.106029999 |
| | 0.119701823 | | 0.113119754 | stock | 0.078988492 | | 0.098232094 |
| null | 0.097585044 | | 0.110513285 | they | 0.078407133 | | 0.096137225 |
| | 0.093527187 | domain | 0.093587827 | at | 0.075765418 | binding | 0.094229295 |
| | 0.085050648 | binding | 0.093519941 | but | 0.075548004 | window | 0.085695981 |
| | 0.078946407 | | 0.086691778 | billion | 0.073818895 | induced | 0.085566187 |
| | 0.076546987 | | 0.085026763 | have | 0.073662149 | | 0.085231028 |
| | 0.075227992 | pp | 0.081740644 | are | 0.072364352 | ml | 0.083458459 |
| | 0.073316012 | | 0.077608416 | be | 0.071025302 | min | 0.083015317 |
| | 0.073162003 | abstract | 0.077359299 | with | 0.068584195 | | 0.078391977 |
| | 0.072197936 | | 0.076735615 | it | 0.067830211 | | 0.077346227 |
| | 0.070585942 | cdna | 0.076317095 | was | 0.067707989 | | 0.076920249 |
| | 0.070224935 | | 0.07615871 | 't | 0.066475036 | mm | 0.073266187 |
| allele | 0.069948445 | membrane | 0.075929698 | in | 0.065890951 | | 0.072205176 |
| | 0.067582502 | | 0.073863584 | trading | 0.065657748 | | 0.070980579 |
| | 0.066734887 | | 0.073860554 | would | 0.06509097 | data | 0.06877046 |
| | 0.0644353 | | 0.072302222 | said | 0.064915624 | ph | 0.067550112 |
| | 0.06344436 | | 0.070485632 | to | 0.064419151 | activation | 0.06720991 |
| staining | 0.061271839 | kinase | 0.070118752 | has | 0.064175458 | | 0.066788269 |
| neurons | 0.059343704 | | 0.070118752 | by | 0.063766297 | | 0.066026426 |
| | 0.058555579 | | 0.069422034 | shares | 0.063615252 | | 0.065025557 |
| mm | 0.057094213 | X1 | 0.068956563 | company | 0.063043995 | human | 0.064973329 |
| olfactory | 0.056987095 | activation | 0.065599127 | their | 0.062731731 | using | 0.064071093 |
| | 0.056130146 | | 0.065140388 | for | 0.062641744 | | 0.063146568 |
| | 0.055582376 | | 0.06314115 | bonds | 0.061745073 | crossref | 0.062926201 |
| phenotype | 0.052916588 | wt | 0.062499956 | will | 0.061422042 | activity | 0.058794757 |
| observed | 0.05206838 | | 0.060635915 | year | 0.061329696 | rna | 0.058294133 |
| e2 | 0.051952521 | chem | 0.060606544 | new | 0.060716109 | observed | 0.05785548 |
| | 0.050532729 | | 0.060433841 | | 0.06062604 | with | 0.057545637 |
| homozygous | 0.050131504 | mrna | 0.060175577 | or | 0.060257745 | these | 0.057379774 |
| function | 0.049871842 | rna | 0.059552487 | an | 0.060255469 | study | 0.056368432 |
| muscle | 0.049628485 | ca | 0.055251655 | from | 0.059225401 | free | 0.056039813 |
| data | 0.049494253 | | 0.055139424 | we | 0.059174038 | mediated | 0.055983639 |
| | 0.048131217 | insulin | 0.05440163 | index | 0.059103846 | serum | 0.05494964 |
| chromosome | 0.048033587 | activity | 0.053267608 | some | 0.058883875 | actin | 0.054506498 |
| we | 0.047444291 | expressed | 0.053108054 | one | 0.058690763 | kinase | 0.053029357 |
| | 0.047236181 | | 0.052726312 | more | 0.058586253 | °c | 0.052586215 |
| transgenic | 0.046764407 | | 0.052369358 | stocks | 0.058457121 | we | 0.051671671 |
| using | 0.046658651 | molecular | 0.052281469 | sales | 0.058224908 | figure | 0.051378087 |
| pgc | 0.045739642 | amino | 0.052137849 | this | 0.05791668 | amino | 0.050216424 |
These are the top 50 terms in each corpus, by TF*IDF (Term Frequency * Inverse Document Frequency). Terms highlighted in bold in the CRAFT and TraJour columns are shared between these two corpora within the top 50 terms of each corpus; terms highlighted in bold in the BioReference column are shared among all three corpora in the top 50 terms. There is clearly significant overlap between CRAFT and TraJour in their contentful terms.
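The record does not specify which TF*IDF weighting produced these scores. The sketch below uses a textbook variant as an assumption: per-document term frequency times log(N/df), averaged over the documents of a corpus. Under this variant a term that occurs in every document scores 0, so it would not reproduce the table, where function words rank highly in the Reference column. The function name `top_tfidf_terms` is hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, k=50):
    """Return the k highest-scoring (term, score) pairs for a corpus.

    documents: list of token lists, one per document.
    Score: mean over documents of tf * idf, with
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    scores = Counter()
    for doc in documents:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf / n_docs
    return scores.most_common(k)
```

Comparing the top-k lists of two corpora, for example by set intersection, gives a simple measure of the kind of overlap the caption describes.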