Abstract
BACKGROUND: Pharmacogenomics studies the relationship between genetic variation and variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but it is dispersed across many journals. It is therefore challenging to identify important associations between drugs and molecular entities, particularly genes and gene variants, and these critical connections are often lost. Text mining techniques can convert free text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified and important links between these concepts are recorded. Availability of full-text articles as input to text mining engines is key, because abstracts often do not contain sufficient information to identify these pharmacogenomic associations.
Year: 2009 PMID: 19208194 PMCID: PMC2646239 DOI: 10.1186/1471-2105-10-S2-S6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Pharmspresso ontology examples.
| Cell or cell group | germ line cells, intestines, sensory neurons |
| Cellular component* | axons, integrins, mitochondrial membrane |
| **Disease** | stroke, chronic leukemia, tuberculoma |
| **Drug** | acebutolol, mechlorethamine, tartaric acid |
| **Gene** | ABCB1, CYP2C9, coagulation factor V |
| Organism | mice, rat, xenopus laevis |
| **Polymorphism** | T168N, 1039G-A, 236Arg->Lys |
| Action | assists, accomplishes, recognizes |
| Association | associates, binds, interacts |
| Biological Process* | acetylated, matures, reactivations |
| Characterization | has, contains, displays, includes, lacks |
| Comparison | correlates, differs, equally, matches |
| Effect | accumulates, aggregates, causes |
Examples of Textpresso biological entities and relationships, along with additions for Pharmspresso (in bold). The ontology includes 35 categories of two types: (1) biological entities and (2) relationships between entities. Category names and examples of these categories are shown. * marks categories imported from Gene Ontology (GO).
Overview of the Pharmspresso database.
| # articles | 1,025 |
| # journals | 343 |
| # gene terms recognized* | 102,334 |
| # drug terms recognized | 3,756 |
| # disease terms recognized** | 36,843 |
* Includes names, symbols, aliases
** Includes redundancies in MeSH thesaurus
We used the MeSH thesaurus disease terms, which include many synonyms and phrase permutations that create redundancy in disease matches. However, these variants are required to capture the different ways in which disease names appear in natural language.
Figure 1. Pharmspresso pipeline for data processing. Full-text PDFs of articles are downloaded, converted to text, and tokenized into individual words and sentences. Next, the text is parsed to identify words or phrases that are members of specific categories within the ontology. These are marked as such and indexed for future search accessibility.
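The tokenize, mark-up, and index steps of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the Pharmspresso implementation: the `LEXICON` terms, the `POLYMORPHISM` pattern, and all function names are assumptions made for the example, and PDF download and conversion are omitted.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; the real ontology has 35 categories
# and far larger term lists.
LEXICON = {
    "gene": {"abcb1", "cyp2c9"},
    "drug": {"tacrolimus", "warfarin"},
}
# Illustrative pattern for variants written like T168N or C3435T
# (an assumption, not the authors' recognition rule).
POLYMORPHISM = re.compile(r"^[A-Z]\d+[A-Z]$")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word tokens, keeping internal hyphens."""
    return re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", sentence)


def categories_of(token):
    """Return the set of ontology categories a token belongs to."""
    cats = {name for name, terms in LEXICON.items() if token.lower() in terms}
    if POLYMORPHISM.match(token):
        cats.add("polymorphism")
    return cats


def build_index(articles):
    """Map each keyword and {category} to (article_id, sentence_no) pairs."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for number, sentence in enumerate(split_sentences(text), start=1):
            for token in tokenize(sentence):
                index[token.lower()].add((article_id, number))
                for cat in categories_of(token):
                    index["{" + cat + "}"].add((article_id, number))
    return index
```

Storing sentence numbers alongside article identifiers is what would let a search report the sentence position within the text, as the results pages do.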
Figure 2. Pharmspresso search page. Snapshot of the Pharmspresso search page. The user is searching for text that includes the keyword 'ABCB1' as well as a member of the {drug} category and a member of the {polymorphism} category, within the abstract or full text.
Figure 3. Pharmspresso results page. Results page for the search shown in Figure 2. Eight publications (from the corpus of 1,025 in Pharmspresso) include a total of 20 sentences fulfilling the query conditions. Users may view the sentences in each of these articles that match the query. The number of matches indicates the number of sentences containing the query keywords and categories.
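The sentence-level matching described in these two captions can be sketched as below. This is an illustrative sketch under stated assumptions: the `SENTENCES` records are invented toy data, and `search` is a hypothetical function, not part of Pharmspresso.

```python
from collections import Counter

# Hypothetical pre-annotated corpus:
# (article_id, sentence_no, lowercase keywords, ontology categories).
SENTENCES = [
    ("a1", 12, {"abcb1", "tacrolimus", "g2677t"}, {"gene", "drug", "polymorphism"}),
    ("a1", 40, {"abcb1"}, {"gene"}),
    ("a2", 7, {"abcb1", "c3435t", "tacrolimus"}, {"gene", "drug", "polymorphism"}),
    ("a2", 9, {"warfarin", "cyp2c9"}, {"drug", "gene"}),
]


def search(sentences, keywords=(), categories=()):
    """Count matching sentences per article.

    A sentence matches only if it contains every keyword and at least
    one member of every requested category.
    """
    wanted_words = {k.lower() for k in keywords}
    wanted_cats = set(categories)
    hits = Counter()
    for article_id, _number, words, cats in sentences:
        if wanted_words <= words and wanted_cats <= cats:
            hits[article_id] += 1
    return hits


hits = search(SENTENCES, keywords=["ABCB1"], categories=["drug", "polymorphism"])
```

Here `len(hits)` plays the role of the number of publications and `sum(hits.values())` the total number of matching sentences reported on the results page.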
Figure 4. Marked-up sentences found in the corpus that match the user query. Sentences matching the query are color-coded, with keywords and categories highlighted. In this example, 'tacrolimus' is a member of the {drug} category, and 'G2677T' and 'C3435T' are members of the {polymorphism} category. Pharmspresso displays the title and sentence number within the text.
Figure 5. Pharmspresso retrieves sentences from the full text that are not found when scanning the abstract only. The user queried for the keyword 'warfarin' plus a member of the {polymorphism} category. Results show that the article titled 'Relative impact of covariates in prescribing warfarin according to CYP2C9 genotype' contains such a sentence, but this sentence would not be found by reading the abstract alone: it is sentence number 132 in the article and appears in the 'Discussion' section. Although the 'star notation' (*2, *3) is used earlier in the article to describe gene variants, explicit genomic location information that can be used to map this polymorphism is first given in sentence 132.
Figure 6. Pharmspresso retrieves a fact from a referenced article. The user queried for both keywords 'CYP2D6' and 'codeine' and a member of the {polymorphism} category. Although the article ('Functional Analysis of Six Different Polymorphic CYP1B1 Enzyme Variants Found in an Ethiopian Population') discusses the gene 'cytochrome P450 1B1' and not 2D6, it cites knowledge from a referenced article regarding a polymorphism in CYP2D6 (not in CYP1B1) and its effect on affinity for codeine. Thus, this article is retrieved in response to the query.
Pharmspresso performance in evaluation.
| Gene | 78.1 |
| Polymorphism | 48.6 |
| Polymorphism (non-table gold standard) | 60.8 |
| Drug | 74.4 |
Summary of Pharmspresso performance in evaluation. The rows report the percentage of gene, gene-variant (polymorphism), and drug mentions in the gold standard that were recovered by Pharmspresso.
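The percentages in this table are of the form "gold-standard mentions recovered, divided by all gold-standard mentions". A minimal sketch of that calculation follows; the function name and the toy mention sets are hypothetical and are not the evaluation data behind the table.

```python
def percent_recovered(gold, found):
    """Percent of gold-standard mentions that also appear in the system output."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold standard is empty")
    return 100.0 * len(gold & set(found)) / len(gold)


# Toy (article_id, gene) mentions, invented for illustration only.
gold_genes = {("a1", "ABCB1"), ("a1", "CYP2C9"), ("a2", "CYP2D6"), ("a2", "CYP1B1")}
found_genes = {("a1", "ABCB1"), ("a2", "CYP2D6"), ("a2", "CYP1B1"), ("a2", "TPMT")}

score = percent_recovered(gold_genes, found_genes)  # 3 of 4 recovered -> 75.0
```

Extra system output that is absent from the gold standard (the `TPMT` mention here) does not affect this measure, which counts only how much of the gold standard was recovered.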