Stein Aerts, Maximilian Haeussler, Steven van Vooren, Obi L Griffith, Paco Hulpiau, Steven J M Jones, Stephen B Montgomery, Casey M Bergman.
Abstract
BACKGROUND: Decoding transcriptional regulatory networks and the genomic cis-regulatory logic implemented in their control nodes is a fundamental challenge in genome biology. High-throughput computational and experimental analyses of regulatory networks and sequences rely heavily on positive control data from prior small-scale experiments, but the vast majority of previously discovered regulatory data remains locked in the biomedical literature.
Year: 2008 PMID: 18271954 PMCID: PMC2374703 DOI: 10.1186/gb-2008-9-2-r31
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Distribution of cosine similarity scores between the query vector and each of the Medline abstract vectors, indicating the 10,000th (blue diamond), 50,000th (red diamond), and 100,000th (green diamond) ranked abstract.
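The ranking behind Figure 1 scores each abstract by its cosine similarity to a query vector. The following is a minimal sketch of that idea, not the authors' implementation: the raw term-count weighting, the toy abstracts, and the names `cosine_similarity` and `rank_abstracts` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_abstracts(query_vec, abstract_vecs):
    """Return (pmid, score) pairs sorted from most to least similar."""
    scored = [(pmid, cosine_similarity(query_vec, vec))
              for pmid, vec in abstract_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: raw term counts stand in for whatever
# term weighting the real vector space model uses.
query = Counter("enhancer binding site transcription factor".split())
abstracts = {
    "pmid_A": Counter("transcription factor binding site in the enhancer".split()),
    "pmid_B": Counter("protein structure of a kinase".split()),
    "pmid_C": Counter("enhancer deletion alters transcription".split()),
}
ranking = rank_abstracts(query, abstracts)
```

Cutting such a ranking at fixed positions gives the top10k, top50k, and top100k sets marked in the figure.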
Coverage of validation sets (excluding PMIDs in the training set) within the top10k, top50k, and top100k ranked abstracts for the vector space model relevancy ranking
| | TRANSFAC | FlyReg | ORegAnno Queue | ORegAnno prior to RegCreative | RegCreative success | RegCreative failure |
|---|---|---|---|---|---|---|
| Number of PMIDs | 5,719 | 200 | 4,145 | 376 | 260 | 218 |
| Number of PMIDs (no training data) | 5,183 | 186 | 3,687 | 340 | 228 | 212 |
| Number in top10k | 1,390 | 38 | 1,035 | 89 | 59 | 18 |
| Percent in top10k | 26.8% | 20.4% | 28.1% | 26.2% | 25.9% | 8.5% |
| Number in top50k | 3,908 | 146 | 2,753 | 260 | 165 | 79 |
| Percent in top50k | 75.4% | 78.5% | 74.7% | 76.5% | 72.4% | 37.3% |
| Number in top100k | 4,572 | 166 | 3,208 | 301 | 199 | 110 |
| Percent in top100k | 88.2% | 89.2% | 87.0% | 88.5% | 87.3% | 51.9% |
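The coverage percentages in this table are computed against the validation PMIDs that remain after removing training PMIDs. A minimal sketch of that calculation follows; the function name `coverage_at_k` and the toy PMID lists are hypothetical, while the final check uses the TRANSFAC top10k figures from the table.

```python
def coverage_at_k(ranked_pmids, validation_pmids, training_pmids, k):
    """Count held-out validation PMIDs that fall within the top k ranks."""
    held_out = set(validation_pmids) - set(training_pmids)
    top_k = set(ranked_pmids[:k])
    found = len(held_out & top_k)
    percent = 100.0 * found / len(held_out) if held_out else 0.0
    return found, len(held_out), percent


# Hypothetical toy inputs, for illustration only.
ranked = ["p1", "p2", "p3", "p4", "p5", "p6"]
validation = ["p2", "p4", "p6", "p9"]
training = ["p4"]
result = coverage_at_k(ranked, validation, training, k=3)

# Check against the table: 1,390 of 5,183 held-out TRANSFAC PMIDs in the top10k.
transfac_top10k_percent = round(100 * 1390 / 5183, 1)
```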
Figure 2. Positive predictive value (PPV) calculated for each threshold in the top100k of the final relevancy ranking, using the pseudo-curation results of 200 evenly distributed samples. The length of the final 'text-mining entry' component of the ORegAnno Publication Queue was set at 58,000, which yields a PPV of 50%.
Figure 3. Results of the pseudo-curation procedure on 200 evenly distributed samples across the top100k.
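The captions do not spell out how PPV is derived from the pseudo-curated samples. One plausible reading, sketched here under that assumption, is that the PPV at a rank threshold is the fraction of sampled abstracts at or above that rank that were judged relevant. The function name `ppv_at_threshold` and the sample judgments are hypothetical.

```python
def ppv_at_threshold(samples, threshold):
    """Fraction of judged samples with rank <= threshold that are relevant.

    samples: list of (rank, is_relevant) pairs from manual judgment.
    Returns None when no sample falls within the threshold.
    """
    judged = [is_relevant for rank, is_relevant in samples if rank <= threshold]
    if not judged:
        return None
    return sum(judged) / len(judged)


# Hypothetical evenly spaced judgments, not the paper's data.
samples = [(500, True), (1000, True), (1500, True), (2000, False),
           (2500, True), (3000, False), (3500, False), (4000, False)]
ppv_curve = {t: ppv_at_threshold(samples, t) for t in (1000, 2000, 4000)}
```

Under this reading, a queue length is chosen by scanning the thresholds for the deepest rank whose PPV still meets the target value.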
Figure 4. Transcriptional regulatory sub-network around the Drosophila transcription factor even-skipped (eve). All nodes and edges were retrieved from eve-related publications in the top100k abstract list. Black edges are success papers (that is, fully curatable publications); grey edges are failure papers that report regulatory data (for example, consensus sites) but are not the primary reference; grey dashed edges are failure papers that contain regulatory data that are not complete enough to allow full curation; blue edges are failures that report protein-protein interactions.
Efficiency of document recovery, sequence extraction and genome mapping for the source lists of PMIDs with high cis-regulatory content
| | TRANSFAC | FlyReg | ORegAnno | Queue | top4,501 | All |
|---|---|---|---|---|---|---|
| Number of PMIDs | 5,719 | 202 | 914 | 4,145 | 4,491 | 11,437 |
| Number of PMIDs with PDF | 5,302 | 187 | 835 | 3,710 | 3,677 | 9,940 |
| Percent PMIDs with PDF | 92.7% | 92.6% | 91.4% | 89.5% | 81.9% | 86.9% |
| Number of PMIDs with text >2 Kbytes | 5,051 | 175 | 793 | 3,517 | 3,498 | 9,440 |
| Percent PMIDs with text >2 Kbytes | 88.3% | 86.6% | 86.8% | 84.8% | 77.9% | 82.5% |
| Efficiency of text conversion | 95.3% | 93.6% | 95.0% | 94.8% | 95.1% | 95.0% |
| Number of PMIDs with fasta sequence | 4,357 | 155 | 660 | 3,044 | 3,080 | 8,066 |
| Percent PMIDs with fasta sequence | 76.2% | 76.7% | 72.2% | 73.4% | 68.6% | 70.5% |
| Efficiency of sequence extraction | 86.3% | 88.6% | 83.2% | 86.6% | 88.1% | 85.4% |
| Number of PMIDs with fasta sequence mapped to genome | 1,518 | 75 | 303 | 1,279 | 1,260 | 2,975 |
| Percent PMIDs with fasta sequence mapped to genome | 26.5% | 37.1% | 33.2% | 30.9% | 28.1% | 26.0% |
| Efficiency of genome mapping | 34.8% | 48.4% | 45.9% | 42.0% | 40.9% | 36.9% |
Note that totals are less than the sum of the sets since many PMIDs are found in more than one source list.
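The table reports two kinds of figures for each pipeline stage: a "Percent" row measured against all PMIDs in the list, and an "Efficiency" row measured against the preceding stage. The sketch below reproduces both for the TRANSFAC column; the counts come from the table, while the function name `stage_report` and the short stage labels are assumptions for illustration.

```python
def stage_report(counts):
    """Return (stage, percent_of_total, percent_of_previous_stage) per stage.

    counts: ordered list of (stage_name, n_pmids); the first entry is the
    starting total and is not itself reported.
    """
    total = counts[0][1]
    report = []
    for (_, previous_n), (name, n) in zip(counts, counts[1:]):
        percent_of_total = round(100 * n / total, 1)
        step_efficiency = round(100 * n / previous_n, 1)
        report.append((name, percent_of_total, step_efficiency))
    return report


# TRANSFAC column of the table above.
transfac = [
    ("all PMIDs", 5719),
    ("with PDF", 5302),
    ("text >2 Kbytes", 5051),
    ("fasta sequence", 4357),
    ("mapped to genome", 1518),
]
report = stage_report(transfac)
```

For the first stage the two figures coincide (92.7%), because the preceding stage is the full list; the table therefore lists no separate efficiency row for PDF recovery.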
Performance of text-based sequence extraction for cis-regulatory annotation
| | dm2 | hg18 | mm8 | ce2 | rn4 | All |
|---|---|---|---|---|---|---|
| Number of ORegAnno annotations | 2,079 | 589 | 255 | 178 | 107 | 3,208 |
| Number of PMIDs with ORegAnno annotation | 389 | 283 | 113 | 30 | 48 | 850 |
| Number of PMIDs with Ensembl target gene name(s) | 388 | 253 | 107 | 29 | 42 | 819 |
| Number of text hits from PMIDs with ORegAnno annotation | 188 | 128 | 51 | 16 | 32 | 415 |
| Number of text hits that overlap ORegAnno annotation | 149 | 54 | 36 | 13 | 17 | 269 |
| Percent text hits that overlap ORegAnno annotation (PPV) | 79.3% | 42.2% | 70.6% | 81.3% | 53.1% | 64.8% |
| Number of ORegAnno annotations overlapped by a text hit | 681 | 133 | 149 | 22 | 64 | 1,049 |
| Percent ORegAnno annotations overlapped by a text hit (SN) | 32.8% | 22.6% | 58.4% | 12.4% | 59.8% | 32.7% |
| Number of PMIDs with text hits | 124 | 91 | 44 | 12 | 24 | 295 |
| Percent PMIDs with text hits (coverage) | 31.9% | 32.2% | 38.9% | 40.0% | 50.0% | 32.2% |
| Number of PMIDs with text hits to correct species | 123 | 84 | 37 | 12 | 18 | 274 |
| Percent PMIDs with text hits to correct species (PPV) | 99.2% | 92.3% | 84.1% | 100.0% | 75.0% | 92.9% |
| Number of PMIDs with text hits and Ensembl target gene name(s) | 122 | 77 | 33 | 11 | 16 | 259 |
| Number of PMIDs with text hits and perfect match to correct target gene name(s) | 67 | 57 | 24 | 4 | 10 | 162 |
| Number of PMIDs with text hits and partial match to correct target gene name(s) | 16 | 12 | 5 | 3 | 4 | 40 |
| Percent PMIDs with text hits and match to correct target gene name (PPV) | 68.0% | 89.6% | 87.9% | 63.6% | 87.5% | 78.0% |
| Number of PMIDs without ORegAnno annotation with text hits | 76 | 1,291 | 841 | 13 | 459 | 2,680 |
| Number of text hits from PMIDs without ORegAnno annotation | 126 | 2,602 | 2,131 | 14 | 1,002 | 5,875 |
| Number of text hits from PMIDs without ORegAnno annotation that overlap ORegAnno annotation | 59 | 202 | 58 | 1 | 18 | 338 |
| Number of ORegAnno annotations overlapped by text hits from PMIDs without ORegAnno annotation | 200 | 347 | 139 | 3 | 33 | 722 |
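The PPV and SN rows in this table are both based on overlaps but use different denominators: PPV counts text hits that overlap an annotation, while SN counts annotations overlapped by a text hit. A minimal sketch of that distinction follows, assuming half-open `(chrom, start, end)` intervals and hypothetical coordinates; the names `overlaps` and `ppv_and_sn` are illustrative, and the source does not state the exact overlap criterion used.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def ppv_and_sn(text_hits, annotations):
    """PPV: share of hits on an annotation. SN: share of annotations hit."""
    hits_on_target = sum(
        any(overlaps(hit, annot) for annot in annotations) for hit in text_hits
    )
    annots_recovered = sum(
        any(overlaps(annot, hit) for hit in text_hits) for annot in annotations
    )
    ppv = hits_on_target / len(text_hits) if text_hits else 0.0
    sn = annots_recovered / len(annotations) if annotations else 0.0
    return ppv, sn


# Hypothetical coordinates, for illustration only.
text_hits = [("2L", 100, 200), ("2L", 500, 600), ("3R", 100, 200)]
annotations = [("2L", 150, 160), ("2L", 170, 180),
               ("2L", 900, 950), ("3R", 300, 400)]
ppv, sn = ppv_and_sn(text_hits, annotations)
```

In the toy data a single text hit covers two annotations, which is how the number of annotations overlapped (681 for dm2) can exceed the number of overlapping text hits (149 for dm2).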
Figure 5. Comparison of automatically extracted text-based annotation and manual annotation of the D. melanogaster Hsp70 gene regions. (a) The Hsp70Aa-Ab region. (b) The Hsp70Ba-Bc region. The 'evaluation' track refers to text-based hits extracted from papers with curated regulatory data in ORegAnno; the 'prediction' track refers to text hits extracted from papers not currently curated in ORegAnno, but with high predicted cis-regulatory content. Annotations in both text-based tracks are labeled with their corresponding PMIDs. Also shown are the original manual annotation in the FlyReg database, the automated mapping of these curated data in ORegAnno, and FlyBase genes, including the α-γ-element noncoding RNA gene that is expressed in response to heat shock. Differences in the FlyReg and ORegAnno mappings in (a) arise because the sequences for these regions are duplicated in the genome and alternative unique mappings are chosen in the two databases.