Chris J Stubben, Jean F Challacombe.
Abstract
BACKGROUND: The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations are rarely incorporated into public sequence databases. We propose using the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package called pmcXML to automatically mine and extract locus tags from full text, tables and supplements.
Year: 2014 PMID: 24499370 PMCID: PMC3937057 DOI: 10.1186/1471-2105-15-43
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flowchart for mining locus tags using the pmcXML package. R functions are indicated by solid lines, inputs by dashed lines, and NCBI databases and R objects by boxes. For each species, NCBI Genomes is used to find the reference strain and download the GFF3 file. The locus tag prefixes in the GFF3 files are used to format a search query in PubMed Central and find matching references. For each reference, the PMC id is used to download the XML document, which is then parsed into full text and tables. The XML file includes links to supplements, which are downloaded separately but typically require additional code to reformat (therefore, only locus tags within supplements from B. pseudomallei were extracted). Finally, the locus tags are used to create a pattern string that extracts tags and also expands locus tag pairs marking the start and end of a region. The R functions developed specifically for this effort are described in Additional file 2.
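The last step of the flowchart (building a pattern string from the locus tag prefixes, then expanding start/end tag pairs into regions) can be illustrated with a minimal Python sketch. This is not the pmcXML implementation, which is written in R and described in Additional file 2. The function names `build_tag_pattern`, `expand_range` and `extract_tags` are hypothetical, the tag format (prefix, digits, optional lowercase suffix) is an assumption, and the annotation order in the example is toy data standing in for the tag order read from a GFF3 file.

```python
import re


def build_tag_pattern(prefixes):
    """Compile a regex matching any prefix followed by digits (assumed tag format)."""
    # Try longer prefixes first so that 'VCA' is preferred over 'VC'.
    alternatives = "|".join(
        re.escape(p) for p in sorted(prefixes, key=len, reverse=True)
    )
    # Optional lowercase suffix covers tags such as Rv0001c (an assumption).
    return re.compile(rf"\b(?:{alternatives})\d+[a-z]?\b")


def expand_range(start, end, ordered_tags):
    """Return the tags strictly between start and end, in annotation order."""
    try:
        i, j = ordered_tags.index(start), ordered_tags.index(end)
    except ValueError:
        return []  # an endpoint is missing from the annotation: do not guess
    if i > j:
        i, j = j, i
    return ordered_tags[i + 1:j]


def extract_tags(text, prefixes, ordered_tags):
    """Return (direct, indirect) mentions found in text.

    direct   : tags written out in the text, including range endpoints
    indirect : tags implied by a 'start-end' or 'start to end' pair
    """
    tag_re = build_tag_pattern(prefixes)
    tag = tag_re.pattern
    # Hyphen, en dash (\u2013) or the word 'to' between two tags marks a region.
    range_re = re.compile(rf"({tag})\s*(?:-|\u2013|to)\s*({tag})")

    direct = tag_re.findall(text)
    indirect = []
    for match in range_re.finditer(text):
        indirect.extend(expand_range(match.group(1), match.group(2), ordered_tags))
    return direct, indirect


# Toy annotation order; in practice this would come from the GFF3 file.
annotation_order = ["BPSS1491", "BPSS1492", "BPSS1493",
                    "BPSS1494", "BPSS1495", "BPSS1496"]
sentence = "The cluster BPSS1492-BPSS1495 and BPSL2697 were induced."
direct, indirect = extract_tags(sentence, ["BPSL", "BPSS"], annotation_order)
print(direct)    # ['BPSS1492', 'BPSS1495', 'BPSL2697']
print(indirect)  # ['BPSS1493', 'BPSS1494']
```

Expanding against the annotated tag order, rather than counting integers between the two endpoints, avoids inventing tags that do not exist in the genome when numbering has gaps.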
Reference genome codes, strains and locus tag prefixes used for searching PubMed Central
| Code | Reference strain | Locus tag prefixes | RefSeq accessions |
|---|---|---|---|
| BPS | Burkholderia pseudomallei K96243 | BPSL, BPSS | NC_006350, NC_006351 |
| Cj | Campylobacter jejuni subsp. jejuni NCTC 11168 | Cj | NC_002163 |
| CT | Chlamydia trachomatis D/UW-3/CX | CT | NC_000117 |
| FTT | Francisella tularensis subsp. tularensis SCHU S4 | FTT | NC_006570 |
| HP | Helicobacter pylori 26695 | HP | NC_000915 |
| lmo | Listeria monocytogenes EGD-e | lmo | NC_003210 |
| Rv | Mycobacterium tuberculosis H37Rv | Rv | NC_000962 |
| PA | Pseudomonas aeruginosa PAO1 | PA | NC_002516 |
| VC | Vibrio cholerae O1 biovar El Tor str. N16961 | VC, VCA | NC_002505, NC_002506 |
| YPO | Yersinia pestis CO92 | YPO | NC_003143 |
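As a rough illustration of how the prefixes in this table could be turned into a PubMed Central search string, here is a minimal Python sketch. The exact query format used by pmcXML is not given in this text, so everything about the query shape is an assumption: wildcard terms joined by `OR`, an optional quoted species name, and the `open access[filter]` restriction. The function name `build_pmc_query` and the dictionary `LOCUS_TAG_PREFIXES` are hypothetical.

```python
# Prefixes copied from two rows of the reference genome table above.
LOCUS_TAG_PREFIXES = {
    "BPS": ["BPSL", "BPSS"],
    "VC": ["VC", "VCA"],
}


def build_pmc_query(prefixes, species=None, open_access_only=True):
    """Build a search string from locus tag prefixes (assumed query shape)."""
    if not prefixes:
        raise ValueError("at least one locus tag prefix is required")
    # Trailing '*' asks the search engine for any term starting with the prefix.
    parts = ["(" + " OR ".join(f"{p}*" for p in prefixes) + ")"]
    if species:
        # Optional species phrase to keep short prefixes such as 'VC' specific.
        parts.append(f'"{species}"')
    if open_access_only:
        # Assumed filter limiting results to the Open Access subset.
        parts.append("open access[filter]")
    return " AND ".join(parts)


print(build_pmc_query(LOCUS_TAG_PREFIXES["BPS"],
                      species="Burkholderia pseudomallei"))
# (BPSL* OR BPSS*) AND "Burkholderia pseudomallei" AND open access[filter]
```

Short prefixes such as CT or PA would probably match many unrelated terms if used alone, which is why the sketch leaves room for a species phrase; whether pmcXML does anything similar is not stated here.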
Figure 2. Total number of articles published each year in the rapidly growing Open Access subset that are available for text mining, compared to other PMC articles that are only free to read.
Total number of open access articles with locus tag mentions by source
| Code | | | | | | |
|---|---|---|---|---|---|---|
| BPS | 53 | 3675 | 832 | 682 | 1379 | 782 |
| Cj | 85 | 4193 | 2094 | 1116 | 654 | 329 |
| CT | 66 | 3268 | 1841 | 950 | 476 | 1 |
| FTT | 54 | 1709 | 603 | 1027 | 79 | 0 |
| HP | 136 | 5035 | 2701 | 1954 | 271 | 109 |
| lmo | 65 | 4227 | 2470 | 1144 | 517 | 96 |
| Rv | 626 | 26329 | 13352 | 7903 | 4079 | 995 |
| PA | 225 | 11454 | 5372 | 3754 | 1855 | 473 |
| VC | 102 | 2633 | 1450 | 609 | 511 | 63 |
| YPO | 34 | 1900 | 176 | 1350 | 183 | 191 |
Total number of unique locus tags in RefSeq, PMC and in both databases and the source of the unique locus tag
| | ||||||
|---|---|---|---|---|---|---|
| BPS | 5935 | 1575 | 1588 | 466 | 194 | 928 |
| Cj | 1699 | 863 | 1009 | 683 | 196 | 130 |
| CT | 940 | 620 | 626 | 243 | 183 | 200 |
| FTT | 1852 | 687 | 792 | 740 | 49 | 3 |
| HP | 1627 | 928 | 977 | 770 | 149 | 58 |
| lmo | 2940 | 1092 | 1094 | 845 | 160 | 89 |
| Rv | 4111 | 3354 | 3686 | 2030 | 1293 | 363 |
| PA | 5571 | 2488 | 2507 | 1853 | 470 | 184 |
| VC | 4007 | 766 | 803 | 535 | 95 | 173 |
| YPO | 4087 | 1269 | 1291 | 994 | 134 | 163 |
Total number of articles citing a locus tag and the number of times each tag was mentioned directly within the text or indirectly within a range
| Locus tag | Articles | Direct mentions | Range mentions | Product |
|---|---|---|---|---|
| BPSS1492 | 22 | 36 | 1 | Hypothetical protein |
| BPSL1549 | 21 | 94 | 2 | Hypothetical protein |
| BPSL2697 | 21 | 41 | 1 | Molecular chaperone GroEL |
| BPSL1705 | 19 | 47 | 7 | Hypothetical protein |
| BPSS0796 | 18 | 42 | 0 | Surface-exposed protein |
| BPSS1434 | 18 | 53 | 2 | Membrane-anchored cell surface protein |
| BPSS1529 | 18 | 19 | 6 | Membrane antigen |
| BPSS1532 | 18 | 23 | 9 | Cell invasion protein |
| BPSS2288 | 18 | 32 | 0 | Heat shock protein 20 |
| BPSS1385 | 17 | 17 | 5 | ATP/GTP binding protein |
| BPSS1545 | 17 | 15 | 7 | Type III secretion system protein |
| BPSL2522 | 16 | 26 | 0 | Outer membrane protein a |
| BPSS0421 | 16 | 24 | 7 | Lipopolysaccharide biosynthesis protein |
| BPSS1498 | 16 | 23 | 13 | Hypothetical protein |
| BPSS1531 | 16 | 14 | 8 | Cell invasion protein |
| BPSS1546 | 16 | 15 | 7 | AraC family transcriptional regulator |
| BPSL3319 | 15 | 33 | 2 | Flagellin |
| BPSS1509 | 15 | 29 | 14 | Hypothetical protein |
| BPSS1511 | 15 | 22 | 5 | Hypothetical protein |
| BPSS1539 | 15 | 27 | 5 | Hypothetical protein |
| BPSS1542 | 15 | 12 | 5 | Surface presentation of antigens protein |
| BPSS1544 | 15 | 11 | 7 | Type III secretion system protein |