Ruoyao Ding, Cecilia N Arighi, Jung-Youn Lee, Cathy H Wu, K Vijay-Shanker.
Abstract
BACKGROUND: Automatically detecting gene/protein names in the literature and linking them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines.
Year: 2015 PMID: 26258475 PMCID: PMC4530884 DOI: 10.1371/journal.pone.0135305
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. pGenN system architecture.
Fig 2. Components of dictionary-based gene mention detection.
Regular expression for identifying gene f-terms.
| /(gene|protein|factor|kinase|[^abehiou]ase|oncogene?|binder|globulin|tubulin|inter-?feron|lectin|galectin|globin|tinin|matin|ietin|tropin|zyme|kine|leukin|nogen|receptor|enzyme|hormone|protease|permease|nuclease|oncogene)$/ |
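The f-term pattern above can be tried out directly. The following is a minimal sketch, not the authors' implementation: the pattern is copied as published (including the repeated `oncogene` alternative), while the helper name `ends_with_f_term` and the choice to lower-case the input before matching are assumptions made for illustration.

```python
import re

# Pattern copied from the table above; the duplicated "oncogene"
# alternative is kept exactly as published.
F_TERM_RE = re.compile(
    r"(gene|protein|factor|kinase|[^abehiou]ase|oncogene?|binder|globulin|tubulin|"
    r"inter-?feron|lectin|galectin|globin|tinin|matin|ietin|tropin|zyme|kine|leukin|"
    r"nogen|receptor|enzyme|hormone|protease|permease|nuclease|oncogene)$"
)


def ends_with_f_term(name: str) -> bool:
    """Return True if the (hypothetical) candidate name ends with an f-term.

    Assumption: matching is done on stripped, lower-cased text; the source
    shows the pattern without flags, so case handling is a guess.
    """
    return F_TERM_RE.search(name.strip().lower()) is not None


if __name__ == "__main__":
    for candidate in ["MAP kinase", "auxin receptor", "database", "growth phase"]:
        print(candidate, ends_with_f_term(candidate))
```

Note that the character class in `[^abehiou]ase` keeps words such as "database" and "phase" from matching, at the cost of also skipping some enzyme names that end in "-hase".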
Plant species prefix conventions.
| 2 letter prefix | First letter is upper case and second is lower case. e.g., “At” for “Arabidopsis thaliana”, “Os” for “Oryza sativa”. |
| 3 letter prefix | Only for Brassica species. First letter must be upper case “B”, which is short for “Brassica”. Second and third letters are lower case. e.g., “Bra” for “Brassica rapa”, “Bni” for “Brassica nigra”. |
| 4 letter prefix | For Latin binomial. The symbol for a binomial consists of the first two letters of the genus, plus the first two letters of the specific epithet. e.g., “PASM” for “Pascopyrum smithii”. |
| 5 letter prefix | All the letters must be upper case, and the first three letters must be “VIT”. e.g., “VITVI” for “Vitis vinifera”. |
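The prefix conventions in this table translate naturally into a handful of patterns. The sketch below is illustrative only: the function name `prefix_convention` and the labels are hypothetical, and the 4-letter pattern assumes all upper-case letters because the only example given ("PASM") is upper case.

```python
import re

# One pattern per convention in the table above.
# Assumption: 4-letter binomial symbols are all upper case, as in "PASM".
PREFIX_PATTERNS = [
    ("2-letter", re.compile(r"[A-Z][a-z]")),
    ("3-letter (Brassica)", re.compile(r"B[a-z]{2}")),
    ("4-letter (binomial)", re.compile(r"[A-Z]{4}")),
    ("5-letter", re.compile(r"VIT[A-Z]{2}")),
]


def prefix_convention(prefix: str):
    """Return the label of the first convention the prefix satisfies, else None."""
    for label, pattern in PREFIX_PATTERNS:
        if pattern.fullmatch(prefix):
            return label
    return None


if __name__ == "__main__":
    for prefix in ["At", "Bra", "PASM", "VITVI", "xyz"]:
        print(prefix, prefix_convention(prefix))
```

A check like this only validates the shape of a prefix; deciding whether a given prefix actually denotes a species would still need a lookup table.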
Fig 3. Pivot-based plant gene dictionary structure.
Rules for generating multiple candidates from a sequence of tokens.
Rules for filtering out family and complex names.
| If NAME appears at the end of a noun phrase, and the noun phrase starts with “a” or “another”, then NAME will be considered a family name and filtered out. |
| If NAME is followed by the word “family” or by an f-term in plural form, then NAME will be considered a family name and filtered out. |
| If NAME appears at the end of a noun phrase, and NAME is preceded by “subunits of”, then NAME will be considered a complex name and filtered out. |
| If NAME is followed by the word “complex”, then NAME will be considered a complex name and filtered out. |
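The four filtering rules can be sketched as a single function. This is a simplified illustration under stated assumptions, not the pGenN code: noun-phrase boundaries are passed in by the caller rather than computed by a parser, a short hard-coded list of plural f-terms stands in for the full f-term pattern, and the function name `filter_reason` and its parameters are hypothetical.

```python
import re

# Assumption: a shortened list of plural f-terms stands in for the full pattern.
PLURAL_F_TERM_RE = re.compile(r"(genes|proteins|factors|kinases|receptors|enzymes)$")


def filter_reason(name, noun_phrase, left_context="", next_word=""):
    """Return 'family', 'complex', or None for a candidate NAME.

    noun_phrase:  the noun phrase containing NAME (supplied by the caller)
    left_context: text immediately before NAME
    next_word:    the word immediately after NAME
    """
    tokens = noun_phrase.split()
    at_np_end = noun_phrase.endswith(name)
    nxt = next_word.lower()

    # Rule 1: NAME ends a noun phrase that starts with "a" or "another".
    if at_np_end and tokens and tokens[0].lower() in {"a", "another"}:
        return "family"
    # Rule 2: NAME is followed by "family" or by a plural f-term.
    if nxt == "family" or PLURAL_F_TERM_RE.search(nxt):
        return "family"
    # Rule 3: NAME ends a noun phrase and is preceded by "subunits of".
    if at_np_end and left_context.lower().rstrip().endswith("subunits of"):
        return "complex"
    # Rule 4: NAME is followed by "complex".
    if nxt == "complex":
        return "complex"
    return None


if __name__ == "__main__":
    print(filter_reason("MYB", "another MYB"))
    print(filter_reason("SCF", "SCF", next_word="complex"))
```

The names used in the usage lines are illustrative inputs only, not examples taken from the paper.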
Rules for disambiguating a name as gene or non-gene.
| Acronym rule: If an acronym pair is detected, and the full name matches the gene dictionary or ends with an f-term, then the short name will be assigned a gene sense. |
| Appositive rule: If NAME has an appositive, and the appositive ends with an f-term, then NAME will be assigned a gene sense. |
| Dictionary-matched name in relation rule: If two or more names are matched with the dictionary and they appear together in a conjunction with other candidate mentions, then all the names will be assigned a gene sense. |
| Synonym rule: If NAME1 and NAME2 are synonyms (matched with the same dictionary entry), and they appear in the same article, then both NAME1 and NAME2 will be assigned a gene sense. |
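As one concrete case, the synonym rule amounts to grouping the names found in an article by the dictionary entry they match. The sketch below assumes a plain mapping from name to a set of entry identifiers; the function name `apply_synonym_rule`, the mapping layout, and the entry identifiers are all hypothetical and are not taken from the pGenN dictionary.

```python
from collections import defaultdict


def apply_synonym_rule(article_names, name_to_entries):
    """Return the names that receive a gene sense under the synonym rule.

    article_names:   candidate names found in one article
    name_to_entries: hypothetical mapping, name -> set of dictionary entry ids
    """
    entry_to_names = defaultdict(set)
    for name in article_names:
        for entry in name_to_entries.get(name, ()):
            entry_to_names[entry].add(name)

    gene_sense = set()
    for names in entry_to_names.values():
        # Two or more distinct names matching the same entry are synonyms.
        if len(names) >= 2:
            gene_sense |= names
    return gene_sense


if __name__ == "__main__":
    # Entry ids below are placeholders, not real database accessions.
    dictionary = {
        "FT": {"entry1"},
        "FLOWERING LOCUS T": {"entry1"},
        "CO": {"entry2"},
    }
    print(apply_synonym_rule(["FT", "FLOWERING LOCUS T", "CO"], dictionary))
```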
Fig 4. Example for ‘uni-pivot gene sense assumption’.
Fig 5. Screenshot of pGenN Interface.
Fig 6. Screenshot of pGenN result table.
Fig 7. Screenshot of pGenN text evidence page.
Performance of pGenN & GenNorm on in-house plant corpus.
| | Precision | Recall | F-value |
|---|---|---|---|
| pGenN | 90.9% | 87.2% | 88.9% |
| GenNorm | 57.6% | 39.0% | 46.5% |
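For readers who want to relate the three columns, the sketch below assumes that "F-value" denotes the balanced F-measure, the harmonic mean of precision and recall; the source does not state the formula, and the function name `f_value` is hypothetical. Recomputing from the rounded percentages in a table can differ from a reported value in the last digit, since the published figures were presumably computed from unrounded counts.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Inputs and output use the same scale (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # GenNorm row of the in-house plant corpus table.
    print(round(f_value(57.6, 39.0), 1))
```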
Statistics of pGenN large-scale processing of plant Medline abstracts.
| # of abstracts processed | 444,211 |
| # of abstracts which are pGenN positive | 58,301 |
| # of gene mentions normalized | 313,334 |
| # of unique UniProt ACs obtained | 27,496 |
| # of PMID–UniProt AC pairs obtained | 112,053 |
Statistics of pGenN processing of gene/protein-related subset of plant Medline abstracts.
| # of abstracts processed | 97,611 |
| # of abstracts which are pGenN positive | 36,261 |
| # of gene mentions normalized | 224,273 |
| # of unique UniProt ACs obtained | 20,986 |
| # of PMID–UniProt AC pairs obtained | 74,069 |
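The two statistics tables are easier to compare as ratios. The following sketch simply does the arithmetic on the counts listed above; the derived quantities (share of pGenN-positive abstracts, normalized mentions per positive abstract) and the helper name `summarize` are introduced here for illustration and are not reported in the source.

```python
# Counts copied from the two statistics tables above.
all_plant_abstracts = {"processed": 444_211, "positive": 58_301, "mentions": 313_334}
gene_protein_subset = {"processed": 97_611, "positive": 36_261, "mentions": 224_273}


def summarize(stats):
    """Derive two simple ratios from one table of counts."""
    return {
        # Share of processed abstracts that are pGenN positive, in percent.
        "positive_rate_pct": round(100.0 * stats["positive"] / stats["processed"], 1),
        # Average number of normalized gene mentions per positive abstract.
        "mentions_per_positive": round(stats["mentions"] / stats["positive"], 2),
    }


if __name__ == "__main__":
    print("all plant abstracts:", summarize(all_plant_abstracts))
    print("gene/protein subset:", summarize(gene_protein_subset))
```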
Performance of pGenN & GenNorm on use case data set.
| | Precision | Recall | F-value |
|---|---|---|---|
| pGenN | 97.9% | 93.5% | 95.6% |
| GenNorm | 93.5% | 66.0% | 77.4% |