OBJECTIVE: Scientific findings regarding human pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine the literature for information on pathogenesis-related proteins, especially host-pathogen protein-protein interactions (HP-PPIs). METHODS: In this paper, we report our exploration of an automated system for identifying MEDLINE abstracts that refer to HP-PPIs. An annotated corpus of 1360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of three feature selection methods: information gain (IG), the chi-square (χ²) test, and specific mutual information (SI). Performance was measured using normalized discounted cumulative gain (NDCG) and positive predictive value (PPV), and all measures were obtained through 10-fold cross-validation. RESULTS: NDCG for classification systems using all features, or a subset of features selected using IG or the χ² test, ranged from 0.83 to 0.89, while classification systems built on features selected using SI had relatively lower NDCG. The classification system achieved a PPV of 50.7% for the top 10% of ranked documents, compared with a baseline PPV of 10.0%. CONCLUSIONS: Our results indicate that document classification systems can be constructed to efficiently retrieve HP-PPI-related documents. Feature selection was effective in reducing the dimensionality of the feature space, yielding a more compact system.
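The two ranking measures named in the abstract can be illustrated with a minimal sketch. It assumes binary relevance labels (1 for an HP-PPI abstract, 0 otherwise) and the common log2 rank discount for DCG; the abstract does not state which DCG formulation was used, so this may differ from the authors' computation. The function names and the toy scores and labels are hypothetical.

```python
import math


def ppv_at_top(scores, labels, fraction=0.10):
    """PPV among the top `fraction` of documents when ranked by classifier score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(round(len(ranked) * fraction)))  # keep at least one document
    return sum(label for _, label in ranked[:k]) / k


def dcg(ranked_labels):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed form)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ranked_labels, start=1))


def ndcg(scores, labels):
    """DCG of the score-induced ranking divided by the DCG of the ideal ranking."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    ideal_dcg = dcg(sorted(labels, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy ranking: 10 documents, 2 of them relevant.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

baseline_ppv = sum(toy_labels) / len(toy_labels)  # prevalence of relevant documents
top_ppv = ppv_at_top(toy_scores, toy_labels, fraction=0.2)
toy_ndcg = ndcg(toy_scores, toy_labels)
```

In this sketch the baseline PPV is simply the share of relevant documents in the whole collection, which is how a top-10% PPV of 50.7% can be read against a 10.0% baseline: the top of the ranked list is roughly five times richer in relevant abstracts than a random sample.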
Authors: Manabu Torii; Lanlan Yin; Thang Nguyen; Chand T Mazumdar; Hongfang Liu; David M Hartley; Noele P Nelson Journal: Int J Med Inform Date: 2010-12-04 Impact factor: 4.046
Authors: Jean I Garcia-Gathright; Andrea Oh; Phillip A Abarca; Mary Han; William Sago; Marshall L Spiegel; Brian Wolf; Edward B Garon; Alex A T Bui; Denise R Aberle Journal: Comput Biol Med Date: 2015-01-13 Impact factor: 4.589
Authors: Yuqing Mao; Kimberly Van Auken; Donghui Li; Cecilia N Arighi; Peter McQuilton; G Thomas Hayman; Susan Tweedie; Mary L Schaeffer; Stanley J F Laulederkind; Shur-Jen Wang; Julien Gobeill; Patrick Ruch; Anh Tuan Luu; Jung-Jae Kim; Jung-Hsien Chiang; Yu-De Chen; Chia-Jung Yang; Hongfang Liu; Dongqing Zhu; Yanpeng Li; Hong Yu; Ehsan Emadzadeh; Graciela Gonzalez; Jian-Ming Chen; Hong-Jie Dai; Zhiyong Lu Journal: Database (Oxford) Date: 2014-08-25 Impact factor: 3.451