Sanmitra Bhattacharya, Viet Ha-Thuc, Padmini Srinivasan.
Abstract
MOTIVATION: Research in the biomedical text-mining domain has historically been limited to the titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process, not only in terms of the resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is a greater risk of false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full-text documents that contain only the important portions. In the long term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full-text documents.
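The idea of using MeSH terms as clues for selecting text segments can be illustrated with a minimal sketch. It is not the authors' system: the naive sentence splitter, the substring-match scoring and the function names (`split_sentences`, `score_sentence`, `summarize`) are all assumptions made for illustration only.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, mesh_terms):
    """Count the MeSH terms that occur in the sentence (case-insensitive substring match)."""
    lowered = sentence.lower()
    return sum(1 for term in mesh_terms if term.lower() in lowered)


def summarize(full_text, mesh_terms, max_sentences=5):
    """Keep the top-scoring sentences and return them in original document order."""
    sentences = split_sentences(full_text)
    scored = [(score_sentence(s, mesh_terms), i) for i, s in enumerate(sentences)]
    # Highest score first; ties are broken in favour of earlier sentences.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    # Sentences that match no MeSH term are never selected.
    keep = sorted(i for score, i in ranked[:max_sentences] if score > 0)
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    text = ("Prions cause disease. The weather was nice. "
            "Ataxia affects the cerebellum. We thank our funders.")
    print(summarize(text, ["Prions", "Ataxia"], max_sentences=2))
```

Exact substring matching of this kind would likely miss acronyms, synonyms and singular/plural variants of a term, so a practical system would need some form of term normalization or expansion.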
Year: 2011 PMID: 21685060 PMCID: PMC3117369 DOI: 10.1093/bioinformatics/btr223
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A simplified diagram of the summarization system.
Distribution of MeSH terms for variable length profile cutoffs
| Percentage of MeSH terms | No. of expansion terms in best profile |
|---|---|
| 19% | 0 |
| 28% | 5 |
| 22% | 10 |
| 14% | 15 |
| 17% | 20 |
ROUGE F-Scores for baselines and MeSH term approach where number of sentences in summary = 5
| | MEAD Random | MEAD True | MeSH Term |
|---|---|---|---|
| ROUGE-1 | 0.3074 | 0.3224 | 0.3385 |
| ROUGE-2 | 0.0774 | 0.0873 | 0.1126 |
| ROUGE-SU4 | 0.0884 | 0.1005 | 0.1094 |
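For readers unfamiliar with the metric, ROUGE-N F-scores such as those in the table above are based on n-gram overlap between a candidate summary and a reference text. The sketch below is a simplified illustration only: it assumes whitespace tokenization and lowercasing, omits the stemming and stop-word options of the official ROUGE toolkit, and does not cover ROUGE-SU4. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f(candidate, reference, n=1):
    """Simplified ROUGE-N F-score: harmonic mean of n-gram precision and recall."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    summary = "the cat sat on mat"
    abstract = "the cat sat"
    print(rouge_n_f(summary, abstract, n=1))
    print(rouge_n_f(summary, abstract, n=2))
```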
ROUGE F-Scores for baselines and MeSH term approach where number of sentences in summary = number of sentences in abstract
| | MEAD Random | MEAD True | MeSH Term |
|---|---|---|---|
| ROUGE-1 | 0.3781 | 0.3815 | 0.4150 |
| ROUGE-2 | 0.1062 | 0.1353 | 0.1435 |
| ROUGE-SU4 | 0.1428 | 0.1411 | 0.1782 |
ROUGE F-Scores for MeSH term and MeSH profile approaches where number of sentences in summary = number of sentences in abstract
| | MeSH Term | MeSH Profiles: 5-Terms | MeSH Profiles: 10-Terms | MeSH Profiles: VL |
|---|---|---|---|---|
| ROUGE-1 | 0.4150 | 0.4201 | 0.4220 | 0.4320 |
| ROUGE-2 | 0.1435 | 0.1375 | 0.1408 | 0.1497 |
| ROUGE-SU4 | 0.1782 | 0.1743 | 0.1753 | 0.1887 |
5-Terms, selecting the top five terms from MeSH Profiles; 10-Terms, selecting the top 10 terms from MeSH Profiles; VL, variable length (the number of terms, N, is set specifically for each MeSH term).
Accuracy scores: human evaluations of abstract and summaries
| Article | Abs | MeSH Term | MeSH Profile | M-True | M-Random |
|---|---|---|---|---|---|
| 1 | 0.74 | 0.68 | 0.70 | 0.54 | 0.46 |
| 2 | 0.76 | 0.69 | 0.71 | 0.60 | 0.38 |
| 3 | 0.70 | 0.63 | 0.67 | 0.47 | 0.45 |
| 4 | 0.80 | 0.64 | 0.71 | 0.51 | 0.44 |
| 5 | 0.85 | 0.73 | 0.75 | 0.65 | 0.45 |
| 6 | 0.90 | 0.80 | 0.80 | 0.65 | 0.70 |
| 7 | 0.84 | 0.76 | 0.78 | 0.67 | 0.58 |
| 8 | 0.97 | 0.71 | 0.83 | 0.66 | 0.76 |
| Mean | 0.82 | 0.71 | 0.74 | 0.59 | 0.53 |
| SD | 0.09 | 0.06 | 0.06 | 0.08 | 0.14 |
Abs, abstract; M-True, MEAD True Summary; M-Random, MEAD Random Summary.
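The Mean and SD rows of the table above can be recomputed from the eight per-article scores. The sketch below copies the values from the table and uses Python's `statistics` module; the use of the sample standard deviation (n - 1 denominator) is an assumption, though it appears consistent with the reported figures (for example, 0.82 and 0.09 for the abstracts).

```python
from statistics import mean, stdev

# Per-article accuracy scores copied from the table above (articles 1 to 8).
scores = {
    "Abs": [0.74, 0.76, 0.70, 0.80, 0.85, 0.90, 0.84, 0.97],
    "MeSH Term": [0.68, 0.69, 0.63, 0.64, 0.73, 0.80, 0.76, 0.71],
    "MeSH Profile": [0.70, 0.71, 0.67, 0.71, 0.75, 0.80, 0.78, 0.83],
    "M-True": [0.54, 0.60, 0.47, 0.51, 0.65, 0.65, 0.67, 0.66],
    "M-Random": [0.46, 0.38, 0.45, 0.44, 0.45, 0.70, 0.58, 0.76],
}

if __name__ == "__main__":
    for name, values in scores.items():
        # stdev() is the sample standard deviation (n - 1 denominator).
        print(f"{name}: mean={mean(values):.4f}, sd={stdev(values):.4f}")
```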
Document-level MAP scores for retrieval experiments
| | Abs | Full | MeSHProf-1 | MeSHProf-2 | Hybrid |
|---|---|---|---|---|---|
| Doc-MAP | 0.1368 | 0.1585 | 0.1307 | 0.1412 | 0.1521 |
Abs, abstract; Full, full text; MeSHProf-1, abstract-length MeSH Profile summary; MeSHProf-2, 10-sentence long MeSH Profile summary; Hybrid, combined summary from Abs and MeSHProf-2; Doc-MAP, document-level MAP.
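Mean average precision (MAP), the measure reported above, averages a per-query average precision over all queries. The following is a minimal sketch of the standard definition, not the evaluation script used in the paper; the function names and the toy document identifiers are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3"], ["d1", "d3"]),  # toy query 1
        (["d2", "d1"], ["d1"]),              # toy query 2
    ]
    print(mean_average_precision(runs))
```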
Types of omission errors made by the summarization strategies
| Error type | No. of errors | Example |
|---|---|---|
| Acronym/synonym | 24 | SCA1, spinocerebellar ataxia type 1; HPE, holoprosencephaly; mAb1C2, 1C2 antibody, etc. |
| Missed non-major MeSH term | 9 | Phosphoinositide phospholipase C, glycosylphosphatidylinositols, hedgehog proteins, etc. |
| Singular/plural | 6 | Ataxia, ataxias, Ca2+-channel, Ca2+-channels; β-subunit, β-subunits, etc. |
| Condensed terms/concepts | 5 | Orexin-A and B, Orexin A and Orexin B; SCA1/SCA2: SCA1 and SCA2; SCA3/MJD, SCA3 or MJD, etc. |
| Missed major MeSH term | 5 | Central nervous system, prions, Machado-Joseph disease, etc. |
| Other | 13 | Creutzfeldt–Jakob syndrome, cerebral cortex, autocatalytic cleavage, cholesterol modification, etc. |
Results shown here are from a set of eight sample documents.
ROUGE F-Scores for baselines and MeSH term approach where number of sentences in summary = 10
| | MEAD Random | MEAD True | MeSH Term |
|---|---|---|---|
| ROUGE-1 | 0.3731 | 0.3606 | 0.4083 |
| ROUGE-2 | 0.1031 | 0.1079 | 0.1401 |
| ROUGE-SU4 | 0.1317 | 0.1188 | 0.1665 |
ROUGE F-Scores for baselines and MeSH term approach where number of sentences in summary = 15
| | MEAD Random | MEAD True | MeSH Term |
|---|---|---|---|
| ROUGE-1 | 0.3662 | 0.3214 | 0.3878 |
| ROUGE-2 | 0.1127 | 0.1257 | 0.1441 |
| ROUGE-SU4 | 0.1217 | 0.0877 | 0.1424 |
ROUGE F-Scores for MeSH term and MeSH profile approaches where number of sentences in summary = 5
| | MeSH Term | MeSH Profiles: 5-Terms | MeSH Profiles: 10-Terms | MeSH Profiles: VL |
|---|---|---|---|---|
| ROUGE-1 | 0.3385 | 0.3538 | 0.3607 | 0.3661 |
| ROUGE-2 | 0.1126 | 0.1154 | 0.1169 | 0.1206 |
| ROUGE-SU4 | 0.1094 | 0.1165 | 0.1201 | 0.1246 |
5-Terms, selecting the top five terms from MeSH Profiles; 10-Terms, selecting the top 10 terms from MeSH Profiles; VL, variable length (the number of terms, N, is set specifically for each MeSH term).
ROUGE F-Scores for MeSH term and MeSH profile approaches where number of sentences in summary = 10
| | MeSH Term | MeSH Profiles: 5-Terms | MeSH Profiles: 10-Terms | MeSH Profiles: VL |
|---|---|---|---|---|
| ROUGE-1 | 0.4083 | 0.4100 | 0.41293 | 0.4245 |
| ROUGE-2 | 0.1401 | 0.1342 | 0.13921 | 0.1485 |
| ROUGE-SU4 | 0.1665 | 0.1595 | 0.16265 | 0.1748 |
5-Terms, selecting the top five terms from MeSH Profiles; 10-Terms, selecting the top 10 terms from MeSH Profiles; VL, variable length (the number of terms, N, is set specifically for each MeSH term).
ROUGE F-Scores for MeSH term and MeSH profile approaches where number of sentences in summary = 15
| | MeSH Term | MeSH Profiles: 5-Terms | MeSH Profiles: 10-Terms | MeSH Profiles: VL |
|---|---|---|---|---|
| ROUGE-1 | 0.3878 | 0.3893 | 0.3878 | 0.3954 |
| ROUGE-2 | 0.1441 | 0.1397 | 0.1431 | 0.1486 |
| ROUGE-SU4 | 0.1424 | 0.1375 | 0.1359 | 0.1428 |
5-Terms, selecting the top five terms from MeSH Profiles; 10-Terms, selecting the top 10 terms from MeSH Profiles; VL, variable length (the number of terms, N, is set specifically for each MeSH term).