Vít Nováček, Gully A P C Burns.
Abstract
Background. Unlike full reading, 'skim-reading' means looking quickly over information to cover more material while still retaining a superficial view of the underlying content. In this work, we emulate this natural human activity by providing a dynamic graph-based view of entities automatically extracted from text. For the extraction, we use shallow parsing, co-occurrence analysis and semantic similarity computation techniques. Our main motivation is to assist biomedical researchers and clinicians in coping with the increasingly large number of potentially relevant articles that are continually being published in the life sciences.

Methods. To construct the high-level network overview of articles, we extract weighted binary statements from the text. We consider two types of these statements, co-occurrence and similarity, both organised in the same distributional representation (i.e., in a vector-space model). For the co-occurrence weights, we use point-wise mutual information, which indicates the degree of non-random association between two co-occurring entities. For computing the similarity statement weights, we use cosine distance based on the relevant co-occurrence vectors. These statements are used to build fuzzy indices of terms, statements and provenance article identifiers, which support fuzzy querying and subsequent result ranking. The indexing and querying processes are then used to construct a graph-based interface for searching and browsing entity networks extracted from articles, as well as articles relevant to the networks being browsed. We also describe a methodology for automated experimental evaluation of the presented approach. The method formally compares the graphs generated by our tool to relevant gold standards based on manually curated PubMed, TREC challenge and MeSH data.

Results. We provide a web-based prototype (called 'SKIMMR') that generates a network of inter-related entities from a set of documents, which a user may explore through our interface. When a particular area of the entity network looks interesting to a user, the tool displays the documents most relevant to the entities of interest currently shown in the network. We present this as a methodology for browsing a collection of research articles. To illustrate the practical applicability of SKIMMR, we present examples of its use in the domains of Spinal Muscular Atrophy and Parkinson's Disease. Finally, we report the results of an experimental evaluation using the two domains and one additional dataset based on the TREC challenge. The results show that the presented method for machine-aided skim reading outperforms tools like PubMed regarding focused browsing and informativeness of the browsing context.
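The two statement weights described in the Methods can be illustrated with a minimal sketch. It is not the SKIMMR implementation: the function names, the choice of the sentence as the co-occurrence unit, the base-2 logarithm and the toy entities are all assumptions made for illustration. The abstract speaks of "cosine distance"; the sketch computes the cosine of the angle between two co-occurrence vectors, and whether the tool uses that value directly or its complement is likewise an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_weights(units):
    """Point-wise mutual information for every entity pair that co-occurs.

    `units` is a list of entity collections, one per co-occurrence unit
    (assumed here to be a sentence). Probabilities are relative
    frequencies over the units.
    """
    n = len(units)
    term_counts = Counter()
    pair_counts = Counter()
    for entities in units:
        unique = sorted(set(entities))  # sorted so each pair has one key
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    weights = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n
        p_a = term_counts[a] / n
        p_b = term_counts[b] / n
        weights[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return weights


def cooccurrence_vectors(weights):
    """Turn pair weights into one sparse vector (a dict) per entity."""
    vectors = {}
    for (a, b), w in weights.items():
        vectors.setdefault(a, {})[b] = w
        vectors.setdefault(b, {})[a] = w
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy input: the entities found in five sentences.
units = [
    {"SMN1", "spinal muscular atrophy"},
    {"SMN1", "spinal muscular atrophy", "motor neuron"},
    {"SMN2", "spinal muscular atrophy"},
    {"SMN2", "motor neuron"},
    {"dopamine", "Parkinson's disease"},
]
weights = pmi_weights(units)
vectors = cooccurrence_vectors(weights)
similarity = cosine_similarity(vectors["SMN1"], vectors["SMN2"])
```

In this toy input, `SMN1` and `SMN2` never share a sentence, so they receive no co-occurrence statement, yet their vectors overlap on the entities they each co-occur with, which is what allows a similarity statement to link them.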
Keywords: Information visualisation; Machine reading; Publication search; Skim reading; Text mining
Year: 2014 PMID: 25097821 PMCID: PMC4121546 DOI: 10.7717/peerj.483
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 2. Architecture of the SKIMMR system.
Figure 3. Exploring SMA etiology.
Figure 4. Exploring Parkinson’s disease.
Basic statistics of the SKIMMR instances.
| Data set ID | | | | | | |
|---|---|---|---|---|---|---|
| SMA | 1,221 | 223,257 | 333,124 | 15,288 | 308,626 | 23,167 |
| PD | 4,727 | 943,444 | 1,096,037 | 43,410 | 965,753 | 57,876 |
| TREC | 2,247 | 439,202 | 757,762 | 39,431 | 745,201 | 65,510 |
Derived statistics of the SKIMMR instances.
| Data set ID | | | | | | |
|---|---|---|---|---|---|---|
| SMA | 182.848 | 272.829 | 0.068 | 0.07 | 271.739 | 21.703 |
| PD | 199.586 | 231.867 | 0.046 | 0.057 | 216.549 | 23.58 |
| TREC | 195.462 | 337.233 | 0.09 | 0.081 | 360.797 | 20.56 |
Statistics of the PubMed graphs for random walks.
| Data set ID | | | | | | | |
|---|---|---|---|---|---|---|---|
| SMA | 5,364 | 78,608 | 14.655 | 5.465⋅10⁻³ | 5.971 | 3.029 | 2 |
| PD | 8,622 | 133,188 | 15.447 | 3.584⋅10⁻³ | 6 | 2.899 | 2 |
| TREC | 10,734 | 161,838 | 15.077 | 2.809⋅10⁻³ | 7.984 | 3.146 | 3 |
Statistics of the SKIMMR graphs for random walks.
| Data set ID | | | | | | | |
|---|---|---|---|---|---|---|---|
| SMA | 15,287 | 305,077 | 19.957 | 2.611⋅10⁻³ | 5 | 2.642 | 1 |
| PD | 43,411 | 952,296 | 21.937 | 1.011⋅10⁻³ | 5 | 2.271 | 2 |
| TREC | 37,184 | 745,078 | 20.038 | 1.078⋅10⁻³ | 5.991 | 2.999 | 12 |
Statistics of the indices of related publications.
| Data set ID | Gold standard | | SKIMMR | |
|---|---|---|---|---|
| SMA | 1,221 | 36.15 | 1,220 | 959.628 |
| PD | 4,727 | 28.61 | 4,724 | 4327.625 |
| TREC | 434 | 18.032 | 2,245 | 1251.424 |
Figure 5. Aggregated semantic coherence (blue: PubMed, green: SKIMMR).
Figure 6. Aggregated information content (blue: PubMed, green: SKIMMR).
Figure 7. Clustering coefficient (blue: PubMed, green: SKIMMR).
Results for the related articles.
| PD | | | SMA | | | TREC | | |
|---|---|---|---|---|---|---|---|---|
| 0.0095 | 0.0240 | 0.5576 | 0.0139 | 0.0777 | 0.5405 | 0.0154 | 0.0487 | 0.5862 |
| … | … |
| 12 | Therefore, we performed 123Ibeta-CIT single-photon emission computed tomography |
| … | … |
| 14 | Five females (4 from two families, and 1 sporadic) were diagnosed as |
| … | … |
| 17 | 123Ibeta-CIT striatal binding was normal in |
| … | … |
| 22 | A normal striatal DAT in a parkinsonian patient is evidence for a nondegenerative cause |
| 23 | Finding a new mutation in one family and failure to demonstrate mutations in the |
| 1 | There are two major syndromes presenting in the early decades of life |
| 2 | |
| … | … |
| 5 | Some have suggested, however, that |
| … | … |
| 0.14 | 0.39 | 1.0 | 0.08 | 0.26 | 0.06 | 0.18 | 0.4 | 0.07 | 0.27 | 0.09 | 0.7 | 0.03 | 0.14 | 0.33 | 0.25 | |
| 0.26 | 0.57 | 1.0 | 0.3 | 0.82 | 0.2 | 0.33 | 0.26 | 0.39 | 0.43 | 0.36 | 0.41 | 0.06 | 0.34 | 1.0 | 1.0 |
| Type | Entity1 | Entity2 | Membership |
|---|---|---|---|
| | | | 1.0 |
| | | | 0.852 |
| | | | 0.852 |
| | | | 0.695 |
| | | | 0.695 |
| | | | 0.167 |
| | | | 0.069 |
| PMID | Title | Authors | Weight |
|---|---|---|---|
| | The diagnosis of neurodegenerative disorders based on clinical and | Watanabe H et al. | 1.0 |
| | MRI measurements predict PSP in unclassifiable parkinsonisms: | Morelli M et al. | 0.132 |
| | Accuracy of magnetic resonance parkinsonism index for | Morelli M et al. | 0.005 |
| | Utility of dopamine transporter imaging (123-I Ioflupane | Garcia Vicente AM et al. | 0.003 |
| | Alzheimer’s disease and idiopathic Parkinson’s disease coexistence | Rajput AH et al. | 0.002 |