Yuan Luo, Gregory Riedlinger, Peter Szolovits.
Abstract
Prioritization of cancer-implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.
Keywords: cancer omics; gene prioritization; machine learning; pathway prioritization; text mining
Year: 2014 PMID: 25392685 PMCID: PMC4216063 DOI: 10.4137/CIN.S13874
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. The Omic hierarchy on the left, biological networks on the right, and their interactions. TF stands for transcription factor. The figure shows some typical network interaction scenarios such as: a signaling network activates transcription factors for a regulatory network; transcription factor complexes that control a regulatory network may be formed through protein interactions (eg, binding); a metabolic network may produce energy (through catabolism) and amino acids (through anabolism) that are necessary for other functional networks; and enzymes that catalyze many metabolic networks are in fact proteins and are produced and regulated by other biological networks. Note that regulatory networks often have participants from multiple levels of the Omic hierarchy.
Data sources for gene and pathway prioritization according to their primary utility.
| UTILITY CATEGORY | DATA SOURCES |
|---|---|
| Literature | |
| Terminology & Ontology | |
| Pathway | KEGG |
| Protein sequence & Domain | |
| Regulation | |
| Gene expression | Ensembl |
| Gene-Protein and Disease | |
| Gene & Protein variation | Ensembl, |
| Gene function annotation | |
| Gene, Protein & Chemical interaction | |
| Gene sequence & Locus | BLAST |
| Homology analysis | |
Notes: Bold font indicates the source has narrative text and is suitable for text mining. This does not include data sources that only point to literature data sources such as PubMed. We also exclude data sources that are built solely by automatic mining of other data sources, eg, GeneCards.7,8
Abbreviations: OMIM, Online Mendelian Inheritance in Man; GO, Gene Ontology; UMLS, Unified Medical Language System; DO, Disease Ontology; MeSH, Medical Subject Headings; HPO, Human Phenotype Ontology; MPO, Mammalian Phenotype Ontology; GEO, Gene Expression Omnibus; CTD, Comparative Toxicogenomics Database; GXD, Gene Expression Database; MGI, Mouse Genome Informatics; HPRD, Human Protein Reference Database; HGMD, Human Gene Mutation Database; MBA, Mouse Brain Atlas; HBA, Human Brain Atlas; CDD, Conserved Domain Database; GHR, Genetics Home Reference; GAD, Genetic Association Database; OMA, Orthologous Matrix.
Summarization of text mining components in gene prioritization methods.
| CATEGORIZATION | PRIORITIZER | TEXT MINING USAGE | PROS AND CONS OF TEXT MINING METHODS |
|---|---|---|---|
| No text mining | POCUS | NA | Lack of textual evidence |
| Keyword search | GeneSeeker | Extract phenotype data | Requires prior knowledge in selecting and hand tuning keyword sets; subject to selection bias when picking keywords |
| | Prioritizer | Extract known genes | |
| | CANDID | Match protein domain description | |
| | PGMapper | Extract phenotype data | |
| | GeneProspector | PubMed screening before reviews by curators | |
| | MaxLink | Extract known genes | |
| Vector space model | G2D | Associate MeSH phenotype terms, MeSH chemistry terms, and GO terms | Possible to calculate semantic similarities automatically; on the other hand, the accuracy of the semantic similarity will be restricted by the co-occurrence counts of words or citations, which are only approximations for real semantics |
| | SNPs3D | Score candidate genes by profiling noun and adjective counts in the MEDLINE abstracts | |
| | MimMiner | Extract correlations of phenotype similarity, protein interaction, and gene functions in pathways | |
| | Endeavour | Rank genes separately on literature evidence, before pooling an overall rank | |
| | CAESAR | Match database description of genes to ontology descriptions of phenotype, anatomy and genes | |
| | ToppGene | Use co-citation counts in PubMed as indication of gene relationship | |
| | CIPHER | Use the text mining component of MimMiner | |
| | GeneDistiller | Use literature co-occurrence statistics to filter the candidate genes | |
| | PRINCE | Use the text mining component of MimMiner | |
| | PolySearch | Rank gene terms based on a discretized sentence relevancy to disease queries | |
| | GeneWanderer | Use text evidence to augment PPI networks | |
| | GPsy | Extract phenotype annotation based on co-occurrence statistics in the biomedical literature | |
| With ontology structure | Tiffin et al. | Use the eVOC anatomical ontology to connect the PubMed literature and RefSeq genes | Richer and hierarchical semantics from ontology, but accuracy depends on resolution, noise and irregularities from ontologies |
| | SUSPECTS | Calculate semantic similarity between GO terms by exploring GO structure and how many times these GO terms occur in the Swiss-Prot database | |
| | MedSim | Calculate semantic similarity between GO terms by exploring GO structure | |
| Statistical text mining | GRAIL | Represent genes using a tf-idf weighted feature vector to analyze PubMed abstracts to calculate the gene-gene and gene-SNP correlation | More detailed and advanced modeling of text distributions opens the avenue to even richer semantic analysis. Requires large training corpus, time consuming |
| | Genie | Train a Bayesian linear classifier that picks discriminative keywords to be used in orthologous gene abstract search | |
| | MetaRanker | Large-scale text mining on MEDLINE abstracts using customized statistical models of genes and MeSH terms adjusted for publication bias | |
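
The tf-idf weighted gene representation listed in the table can be illustrated with a minimal sketch. This is not the implementation of GRAIL or of any other tool above. It assumes each gene is described by a bag of tokens drawn from its abstracts, and it uses cosine similarity as a stand-in for the gene-gene relatedness score. The gene labels, token lists, and function names (`tfidf_vectors`, `cosine`) are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build tf-idf vectors from a dict of gene label -> list of tokens."""
    n_docs = len(docs)
    # Document frequency: number of genes whose text contains each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for gene, tokens in docs.items():
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        vectors[gene] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up token lists standing in for abstracts (assumption, not real data).
toy_docs = {
    "GENE_A": ["kinase", "signaling", "melanoma", "mutation"],
    "GENE_B": ["kinase", "signaling", "breast", "amplification"],
    "GENE_C": ["transport", "membrane", "lipid", "metabolism"],
}

vecs = tfidf_vectors(toy_docs)
sim_ab = cosine(vecs["GENE_A"], vecs["GENE_B"])  # genes sharing two terms
sim_ac = cosine(vecs["GENE_A"], vecs["GENE_C"])  # genes sharing no terms
```

In this toy setting, genes whose texts share terms score higher than genes with disjoint vocabularies. The sketch also shows the limitation noted in the table: the score reflects word overlap only, which approximates rather than captures real semantics.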
Figure 2. (A) The network representation for the example sentence: “More recent data have suggested that targeting mutations in BRAF, AKT1, ERBB2 and PIK3CA and fusions that involve ROS1 and RET may also be successful”. (B) and (C) are two sub-networks of (A).