Carlos-Francisco Méndez-Cruz, Socorro Gama-Castro, Citlalli Mejía-Almonte, Marco-Polo Castillo-Villalba, Luis-José Muñiz-Rascado, Julio Collado-Vides.
Abstract
Database URL: RegulonDB, http://regulondb.ccg.unam.mx
Year: 2017 PMID: 29220462 PMCID: PMC5737074 DOI: 10.1093/database/bax070
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Summary of CytR from RegulonDB (http://regulondb.ccg.unam.mx/regulon?term=ECK120012407&organism=ECK12&format=jsp&type=regulon).
Figure 2. References per manual summary.
Figure 3. Model selection and model assessment.
Features and sentence representations
| Feature | Sentence representation |
|---|---|
| 1 Words | ArgP, which belongs to the LysR-family, has a helix-turn-helix motif located close to the N-terminus |
| 2 Lemmas | ArgP, which belong to the LysR-family, have a helix-turn-helix motif located close to the N-terminus |
| 3 POS tags + Term tags | TF, r-crq VBZ p-acp dt DFAM, vdz dt DMOT NN JJ av-j p-acp dt DPOS |
| 4 POS tags + Term tags + Frequent-word tags | TF, r-crq FWDOM p-acp dt DFAM, vdz dt DMOT FWDOM FWDOM av-j p-acp dt DPOS |
| 5 Words + POS tags + Term tags | ArgP, which belongs to the LysR-family, has a helix-turn-helix motif located close to the N-terminus. TF, r-crq VBZ p-acp dt DFAM, vdz dt DMOT NN JJ av-j p-acp dt DPOS |
| 6 Lemmas + POS tags + Term tags | ArgP, which belong to the LysR-family, have a helix-turn-helix motif located close to the N-terminus. TF, r-crq VBZ p-acp dt DFAM, vdz dt DMOT NN JJ av-j p-acp dt DPOS |
| 7 Words + POS tags + Term tags + Frequent-word tags | ArgP, which belongs to the LysR-family, has a helix-turn-helix motif located close to the N-terminus. TF, r-crq FWDOM p-acp dt DFAM, vdz dt DMOT FWDOM FWDOM av-j p-acp dt DPOS |
| 8 Lemmas + POS tags + Term tags + Frequent-word tags | ArgP, which belong to the LysR-family, have a helix-turn-helix motif located close to the N-terminus. TF, r-crq FWDOM p-acp dt DFAM, vdz dt DMOT FWDOM FWDOM av-j p-acp dt DPOS |
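
The combined representations in rows 3 to 8 can be read as concatenations of a surface layer (words or lemmas) with a tag layer in which frequent words are optionally replaced by a frequent-word tag. The following is a minimal sketch of that reading, not the authors' implementation. The token annotations are copied by hand from the table rows above; the names `TOKENS`, `tag_sequence` and `represent` are hypothetical.

```python
# Hand-aligned annotations for the example sentence in the table above.
# Each tuple: (word, lemma, POS-or-term tag, frequent-word tag or None).
TOKENS = [
    ("ArgP", "ArgP", "TF", None),
    (",", ",", ",", None),
    ("which", "which", "r-crq", None),
    ("belongs", "belong", "VBZ", "FWDOM"),
    ("to", "to", "p-acp", None),
    ("the", "the", "dt", None),
    ("LysR-family", "LysR-family", "DFAM", None),
    (",", ",", ",", None),
    ("has", "have", "vdz", None),
    ("a", "a", "dt", None),
    ("helix-turn-helix", "helix-turn-helix", "DMOT", None),
    ("motif", "motif", "NN", "FWDOM"),
    ("located", "located", "JJ", "FWDOM"),
    ("close", "close", "av-j", None),
    ("to", "to", "p-acp", None),
    ("the", "the", "dt", None),
    ("N-terminus", "N-terminus", "DPOS", None),
]


def tag_sequence(tokens, frequent_word_tags=False):
    """Return the tag layer; optionally substitute frequent-word tags."""
    return [
        fw if (frequent_word_tags and fw) else tag
        for _word, _lemma, tag, fw in tokens
    ]


def represent(tokens, surface=None, frequent_word_tags=False):
    """Build a token list for rows 3 to 8 of the table.

    surface: None (tags only), "words" or "lemmas".
    """
    column = {"words": 0, "lemmas": 1}
    parts = []
    if surface is not None:
        parts.extend(tok[column[surface]] for tok in tokens)
    parts.extend(tag_sequence(tokens, frequent_word_tags))
    return parts


if __name__ == "__main__":
    # Row 4: POS tags + term tags + frequent-word tags.
    print(" ".join(represent(TOKENS, frequent_word_tags=True)))
```

Calling `represent(TOKENS, "lemmas", True)` gives the token sequence of row 8 (lemmas followed by the tag layer), with punctuation kept as separate tokens rather than attached to the preceding word.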
Two examples of tagged manual summaries
| CytR, Cytidine Regulator, is a TF required for |
Examples of training fragments and training sentences
| Training fragment | Training sentence |
|---|---|
| Transport and utilization of ribonucleosides and deoxyribonucleosides | CytR, Cytidine Regulator, is a TF required for transport and utilization of ribonucleosides and deoxyribonucleosides |
| Biosynthesis and transport of arginine, transport of histidine, and its own synthesis and activates genes for arginine catabolism | ArgR complexed with L-arginine represses the transcription of several genes involved in biosynthesis and transport of arginine, transport of histidine, and its own synthesis and activates genes for arginine catabolism |
| ArgR is also essential for a site-specific recombination reaction that resolves plasmid ColE1 multimers to monomers and is necessary for plasmid stability | ArgR is also essential for a site-specific recombination reaction that resolves plasmid ColE1 multimers to monomers and is necessary for plasmid stability |
Description of training and validation datasets
| Dataset | DOM | RP | OTHER | Total |
|---|---|---|---|---|
| Training | 223 | 190 | 1,153 | 1,566 |
| Validation | 105 | 70 | 496 | 671 |
| Total | 328 | 260 | 1,649 | 2,237 |
Description of the test dataset
| TF | No. of articles | PMIDs | No. of sentences (total) | No. of sentences (DOM) | No. of sentences (RP) |
|---|---|---|---|---|---|
| ArgR | 6 | 11305941, 1640456, 1640457, 17074904, 17850814, 8594204 | 216 | 25 | 9 |
| CytR | 6 | 10766824, 1715855, 8022285, 8596434, 8764393, 9086266 | 431 | 4 | 2 |
| FhlA | 5 | 2118503, 2280686, 8034727, 8034728, 8412675 | 29 | 4 | 0 |
| GntR | 7 | 12618441, 9045817, 9135111, 9358057, 9537375, 9658018, 9871335 | 194 | 8 | 3 |
| MarA | 6 | 10802742, 11844771, 8955629, 9097440, 9324261, 9724717 | 149 | 38 | 3 |
| Total | | | 1,019 | 79 | 17 |
Figure 4. NLP preprocessing pipeline.
Experimental grid
| Aspect | Values |
|---|---|
| Classifiers | SVM, Multinomial NB, Bernoulli NB, Gaussian NB |
| Features | Words, Lemmas, POS tags, Term tags, Frequent-word tags |
| Eliminate stop words | Yes, No |
| n-grams | 1, 2, 3, 1 + 2, 1 + 2 + 3, 2 + 3 |
| Dimensionality reduction (SVD) | 100, 200, 300 components |
| Vector values | Frequency, binary, TF-IDF |
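
To make the "Vector values" options concrete, here is a minimal standard-library sketch of how one sentence representation could be turned into frequency, binary, or TF-IDF vectors over combined n-gram sizes. It assumes that a value such as "1 + 2" in the grid means unigrams plus bigrams, and it uses the plain weighting tf × log(N / df), which is only one of several TF-IDF variants; the source does not state which one was used. The names `ngrams` and `vectorize` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, sizes=(1, 2)):
    """Return all n-grams of the requested sizes as space-joined strings."""
    out = []
    for n in sizes:
        out.extend(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return out


def vectorize(docs, sizes=(1, 2), values="frequency"):
    """Vectorize tokenized documents.

    values: "frequency", "binary" or "tfidf" (assumed form: tf * log(N / df)).
    Returns (vocabulary, rows), where each row is aligned with the vocabulary.
    """
    counts = [Counter(ngrams(doc, sizes)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    rows = []
    for c in counts:
        if values == "frequency":
            row = [float(c[t]) for t in vocab]
        elif values == "binary":
            row = [1.0 if c[t] else 0.0 for t in vocab]
        elif values == "tfidf":
            row = [c[t] * math.log(n_docs / df[t]) for t in vocab]
        else:
            raise ValueError(f"unknown vector values: {values}")
        rows.append(row)
    return vocab, rows


if __name__ == "__main__":
    docs = [["TF", "dt", "DFAM"], ["TF", "dt", "DMOT"]]
    vocab, rows = vectorize(docs, sizes=(1, 2), values="binary")
    print(vocab)
    print(rows)
```

With this weighting, terms that occur in every document (here `TF` and `dt`) receive a TF-IDF weight of zero, which is why smoothed variants are often preferred in practice. The dimensionality-reduction step (SVD) listed in the grid would be applied to the resulting matrix and is not shown here.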
Figure 5. Automatic summarization pipeline.
Descriptions of the two selected classifiers
| Strategy | Classifier | Features | n-grams | Remove stop words? | SVD | SVD components | Values | Score |
|---|---|---|---|---|---|---|---|---|
| Validation dataset | SVM | Lemmas, term tags, and frequent-word tags | 1 + 2 | No | Yes | 200 | TF-IDF | 0.888 |
| Only cross-validation | SVM | Lemmas, term tags, and frequent-word tags | 1 + 2 | No | No | | Binary | 0.867 |
Performance of the best classifier for TF and class
| TF | DOM precision | DOM recall | DOM F-score | OTHER precision | OTHER recall | OTHER F-score | RP precision | RP recall | RP F-score |
|---|---|---|---|---|---|---|---|---|---|
| ArgR | 1 | 0.12 | 0.21 | 0.86 | 1 | 0.92 | 0 | 0 | 0 |
| CytR | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| FhlA | 1 | 0.25 | 0.4 | 0.89 | 1 | 0.94 | — | — | — |
| GntR | 0.67 | 0.25 | 0.36 | 0.95 | 0.98 | 0.97 | 0 | 0 | 0 |
| MarA | 0.77 | 0.26 | 0.39 | 0.78 | 0.97 | 0.87 | 1 | 0.67 | 0.8 |
| Average | | 0.38 | 0.47 | 0.9 | | 0.94 | | 0.42 | 0.45 |
The highest score between averaged precision and recall is shown in boldface.
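
The third value reported for each class in the table above is consistent with the balanced F-score, the harmonic mean of precision and recall. The sketch below checks this for a few rows; the function name `f_score` is an assumption, and because the table gives precision and recall rounded to two decimals, a recomputed value can differ from the published one in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # Convention used here when both values are zero.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (TF, class, precision, recall) taken from the table above.
    rows = [
        ("ArgR", "DOM", 1.0, 0.12),
        ("ArgR", "OTHER", 0.86, 1.0),
        ("MarA", "DOM", 0.77, 0.26),
        ("MarA", "RP", 1.0, 0.67),
    ]
    for tf, label, p, r in rows:
        print(tf, label, round(f_score(p, r), 2))
```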
Evaluation of automatic summaries with ROUGE-1 with and without stop words
| TF | Recall (with stop words) | Precision (with stop words) | F-score (with stop words) | Recall (without stop words) | Precision (without stop words) | F-score (without stop words) | Summary (words) |
|---|---|---|---|---|---|---|---|
| ArgR | 0.553 | | | 0.463 | | | 154 |
| CytR | | 0.087 | 0.158 | 0.759 | 0.066 | 0.121 | 1,124 |
| FhlA | 0.753 | 0.109 | 0.19 | 0.63 | 0.091 | 0.159 | 676 |
| GntR | 0.418 | 0.241 | 0.306 | 0.326 | 0.182 | 0.233 | 137 |
| MarA | 0.821 | 0.063 | 0.117 | | 0.05 | 0.094 | 1,103 |
| Average | 0.681 | 0.188 | 0.252 | 0.587 | 0.147 | 0.201 | |
The best scores are shown in boldface.
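
For readers unfamiliar with the metric, ROUGE-1 compares the unigrams of an automatic summary with those of a reference (here, the manual summary). The following is a simplified sketch, not the official ROUGE toolkit: it assumes pre-tokenized, lower-cased input, a single reference, clipped unigram counts, and a caller-supplied stop-word set. The function name `rouge_1` and the example sentences are illustrative only.

```python
from collections import Counter


def rouge_1(candidate, reference, stop_words=frozenset()):
    """Return (recall, precision, F-score) based on unigram overlap.

    candidate, reference: lists of tokens.
    stop_words: tokens to drop from both sides before counting.
    """
    cand = Counter(t for t in candidate if t not in stop_words)
    ref = Counter(t for t in reference if t not in stop_words)
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    reference = "cytr is a tf required for transport".split()
    candidate = ("cytr is a repressor required for nucleoside "
                 "transport and utilization").split()
    print(rouge_1(candidate, reference))
    print(rouge_1(candidate, reference, stop_words={"is", "a", "for"}))
```

Because precision divides the overlap by the length of the automatic summary, long summaries such as those with more than 1,000 words in the table tend to combine high recall with low precision.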
Evaluation of automatic summaries with ROUGE-SU4 with and without stop words
| TF | Recall (with stop words) | F-score (with stop words) | Recall (without stop words) | F-score (without stop words) |
|---|---|---|---|---|
| ArgR | 0.277 | | 0.223 | |
| CytR | 0.428 | 0.077 | 0.361 | 0.056 |
| FhlA | 0.392 | 0.097 | 0.269 | 0.066 |
| GntR | 0.146 | 0.106 | 0.095 | 0.067 |
| MarA | | 0.062 | | 0.042 |
The best scores are shown in boldface.
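
ROUGE-SU4 extends unigram matching with skip-bigrams: ordered word pairs that may be separated by a limited number of other words. The sketch below is a simplified illustration, not the official ROUGE toolkit. It assumes that "4" means at most four intervening words between the two members of a pair, that unigrams are counted alongside skip-bigrams, and that input is a single pre-tokenized reference and candidate. The names `su_units` and `rouge_su` are hypothetical.

```python
from collections import Counter


def su_units(tokens, max_gap=4):
    """Count unigrams plus skip-bigrams with at most max_gap words between."""
    units = Counter((t,) for t in tokens)
    for i, left in enumerate(tokens):
        # j ranges over positions with at most max_gap tokens between i and j.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens) + 1)):
            if j < len(tokens):
                units[(left, tokens[j])] += 1
    return units


def rouge_su(candidate, reference, max_gap=4):
    """Return (recall, precision, F-score) over unigram + skip-bigram units."""
    cand = su_units(candidate, max_gap)
    ref = su_units(reference, max_gap)
    overlap = sum((cand & ref).values())  # clipped matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


if __name__ == "__main__":
    print(su_units(["a", "b", "c"]))
    print(rouge_su(["a", "b", "c"], ["a", "c"]))
```

Because matching pairs requires both words to appear in the same order within the gap limit, these scores are expected to be lower than the corresponding ROUGE-1 scores for the same summaries.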
Figure 6. MDS of sentences of the class DOM for ArgR.
Figure 7. MDS of sentences of the class DOM for CytR.
Figure 8. MDS of sentences of the class RP for ArgR.
Figure 9. MDS of sentences of the class RP for CytR.