Siddhartha Jonnalagadda, Ryan Peeler, Philip Topham.
Abstract
BACKGROUND: Rapid identification of subject experts for medical topics helps improve the implementation of discoveries by shortening the time to market for drugs and aiding clinical trial recruitment, among other benefits. Identifying such opinion leaders through social network analysis is gaining prominence. In this work, we explore how to combine named entity recognition from unstructured news articles with social network analysis to discover opinion leaders for a given medical topic.
Year: 2012 PMID: 22420330 PMCID: PMC3338075 DOI: 10.1186/2041-1480-3-2
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Overall architecture. We first retrieved the articles related to obesity from the Internet using web-crawlers. The Person, Organization and Location named entities were extracted from the collected articles. Among the person names, only medical experts were retained. The semi-automatic normalization step addressed polysemy as well as synonymy. In the social network analysis step, we analyzed the network presence of the subject experts.
Figure 2. Concept extraction process. The CRF system is trained using the CoNLL-2003 NER shared task corpus and run on the 147,528 obesity-related news articles. The model created during the training phase is used to tag the input sentences with the concepts "person", "organization" and "location".
List of features used in the CRF method.
| Feature name | Type | Description |
|---|---|---|
| Dictionary | Semantic | Person names; Organization names; Location names |
| Distributional | Semantic | Distributional thesaurus |
| Section | Pragmatic | Name of the section in which the sentence appears |
| Part of speech | Syntactic | Part of speech of the token in the sentence |
| Others | Lexical | Lower case token, Lemma, Prefixes, Suffixes, n-grams, Matching patterns such as beginning with a capital, etc. |
Dictionary features: all three dictionaries contain single-token words and are obtained by removing stop words. Each dictionary corresponds to one feature, which indicates whether a token is present in that dictionary. Distributional features: using the Semantic Vectors package [27] trained on the text retrieved from the links obtained for the case study, each word is represented in a 2000-dimensional vector space. The vector representation is used to find, for each word, the 20 most similar words in the text. For each token, we thus have 20 distributional semantic features that represent the entries in the thesaurus. Section features: section names are detected automatically using simple rules (e.g. a sentence ending with a semi-colon). Other features: there are about a hundred more features covering the different part-of-speech tags in Penn Treebank format, the matching patterns used, prefixes, n-grams, etc.
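The feature types described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`token_features`, `nearest_words`), the feature keys, the toy dictionaries and the 3-dimensional toy vectors are all assumptions made for illustration, whereas the paper uses the Semantic Vectors package with 2000-dimensional vectors.

```python
import math
import re

# Toy single-token dictionaries (hypothetical entries, for illustration only).
PERSON_DICT = {"john", "mary"}
ORG_DICT = {"university", "hospital"}
LOC_DICT = {"boston", "london"}

# Toy word vectors standing in for the 2000-dimensional space (assumed values).
VECTORS = {
    "obesity": [1.0, 0.0, 0.2],
    "overweight": [0.9, 0.1, 0.3],
    "diet": [0.5, 1.0, 0.0],
    "boston": [0.0, 0.1, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_words(word, vectors, k=20):
    """Return up to k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def token_features(tokens, i, vectors, k=20):
    """Build a feature dict for tokens[i] (a sketch, not the full feature set)."""
    tok = tokens[i]
    low = tok.lower()
    feats = {
        # Lexical features.
        "lower": low,
        "prefix3": low[:3],
        "suffix3": low[-3:],
        "init_cap": tok[:1].isupper(),
        "has_digit": bool(re.search(r"\d", tok)),
        # One binary feature per dictionary.
        "in_person_dict": low in PERSON_DICT,
        "in_org_dict": low in ORG_DICT,
        "in_loc_dict": low in LOC_DICT,
        # Neighbouring tokens as simple context features.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    # Distributional features: one feature per thesaurus entry (up to k).
    if low in vectors:
        for rank, neighbour in enumerate(nearest_words(low, vectors, k)):
            feats[f"dist_{rank}"] = neighbour
    return feats


sentence = ["Dr.", "Mary", "studies", "obesity", "in", "Boston", "."]
features = [token_features(sentence, i, VECTORS) for i in range(len(sentence))]
```

Feature dictionaries of this shape are the usual input to CRF toolkits, which is why the sketch returns one dict per token rather than a dense vector.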
Keywords used to filter subject expert mentions out of person mentions in news articles
| Dr | MD | PhD | M.D |
|---|---|---|---|
| PhD | Prof | Dr. | M.D. |
| Ph.D. | Prof. | Program Director | Professor |
| Journal | Colleague | Colleagues | Researcher |
| Faculty | Doctor | Doctors | Publish |
| Published | University | Hospital | Hospitals |
| Research | Lab | Laboratory | School |
| Engineering | Sciences | Institute | Institutes |
| Clinic | College | | |
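A keyword filter of this kind might be sketched as follows. The record does not state how the keywords are matched against the text, so the choices here are assumptions: the helper name `looks_like_expert` is hypothetical, matching is case-insensitive and on whole words, and the context is taken to be the sentence containing the person mention.

```python
import re

# Keywords from the table above (duplicates removed).
EXPERT_KEYWORDS = [
    "Dr", "MD", "PhD", "M.D", "Dr.", "M.D.", "Ph.D.", "Prof", "Prof.",
    "Program Director", "Professor", "Journal", "Colleague", "Colleagues",
    "Researcher", "Faculty", "Doctor", "Doctors", "Publish", "Published",
    "University", "Hospital", "Hospitals", "Research", "Lab", "Laboratory",
    "School", "Engineering", "Sciences", "Institute", "Institutes",
    "Clinic", "College",
]

# Longest keywords first so that "Dr." is tried before "Dr".
# Lookarounds are used instead of \b because some keywords end with a period.
_KEYWORD_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(EXPERT_KEYWORDS, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,  # assumption: the source does not specify case handling
)


def looks_like_expert(context):
    """Return True if the text around a person mention contains a keyword."""
    return _KEYWORD_PATTERN.search(context) is not None


print(looks_like_expert("Dr. Jane Doe of Example University led the study."))
print(looks_like_expert("Jane Doe said she enjoys the drive to work."))
```

Whole-word matching keeps short keywords such as "Dr" or "Lab" from firing inside unrelated words like "drive" or "label".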
Performance of the CRF-based concept extraction system, compared to the dictionary-based baseline on 100 news articles
| | True Positives | False Negatives | False Positives | Recall (%) | Precision (%) | F-measure (%) |
|---|---|---|---|---|---|---|
| Machine learning system | | | | | | |
| Exact Match | 54 | 9 | 5 | 85.7 | 91.5 | 88.5 |
| Partial Match | 55 | 8 | 4 | 87.3 | 93.2 | 90.1 |
| Dictionary Baseline | | | | | | |
| Exact Match | 45 | 18 | 254 | 71.4 | 15.0 | 24.9 |
| Partial Match | 58 | 5 | 241 | 92.1 | 19.4 | 32.0 |
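The recall, precision and F-measure columns follow from the raw counts by the standard definitions, which the short sketch below recomputes. The function name `precision_recall_f` and the row labels are illustrative choices. Recomputed values agree with the reported ones to within 0.1 percentage points, with the small differences attributable to rounding.

```python
def precision_recall_f(tp, fn, fp):
    """Return (recall, precision, F-measure) as percentages.

    Assumes tp + fn > 0 and tp + fp > 0, which holds for every table row.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return 100 * recall, 100 * precision, 100 * f_measure


# Counts copied from the table above: (true positives, false negatives, false positives).
COUNTS = {
    "CRF exact": (54, 9, 5),
    "CRF partial": (55, 8, 4),
    "Dictionary exact": (45, 18, 254),
    "Dictionary partial": (58, 5, 241),
}

scores = {name: precision_recall_f(*counts) for name, counts in COUNTS.items()}

for name, (recall, precision, f_measure) in scores.items():
    print(f"{name:20s} R={recall:.1f}  P={precision:.1f}  F={f_measure:.1f}")
```

The counts make the trade-off visible: the dictionary baseline finds a similar number of true positives but produces roughly fifty times as many false positives, which is what drives its precision down.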
Figure 3. Network map of the largest connected component with subject experts marked. Each person appears in at least one news article. The persons appearing at the center have a higher centrality. The links are unweighted and show the co-occurrence (mentioned together in an article) of subject experts.
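An unweighted co-occurrence network like the one in the figure can be built with a few lines of standard-library Python, as sketched below. The caption does not say which centrality measure is used, so degree centrality is an assumption here; the function names and the toy article data are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(articles):
    """Link every pair of experts mentioned together in an article."""
    graph = defaultdict(set)
    for experts in articles:
        unique = sorted(set(experts))
        for name in unique:
            graph[name]  # make sure experts with no co-mentions are still nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def largest_component(graph):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in graph:
        if start in seen:
            continue
        found, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in found:
                    found.add(neighbour)
                    queue.append(neighbour)
        seen |= found
        if len(found) > len(best):
            best = found
    return best


def degree_centrality(graph, nodes):
    """Fraction of the other nodes in `nodes` that each node is linked to."""
    n = len(nodes)
    if n < 2:
        return {node: 0.0 for node in nodes}
    return {node: len(graph[node] & nodes) / (n - 1) for node in nodes}


# Toy data: each inner list holds the experts mentioned in one article.
articles = [["A", "B", "C"], ["C", "D"], ["E", "F"], ["G"]]

graph = build_cooccurrence_graph(articles)
component = largest_component(graph)
centrality = degree_centrality(graph, component)

for name in sorted(component, key=lambda x: -centrality[x]):
    print(name, round(centrality[name], 2))
```

In the toy data, expert C is mentioned alongside all three other members of the largest component and therefore ranks highest, matching the caption's description of central persons.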