Yiming Cui, Ting Liu, Wanxiang Che, Zhigang Chen, Shijin Wang.
Abstract
Achieving human-level performance on some Machine Reading Comprehension (MRC) datasets is no longer challenging with the help of powerful Pre-trained Language Models (PLMs). However, it is necessary to provide both the answer prediction and its explanation to further improve the reliability of MRC systems, especially for real-life applications. In this paper, we propose a new benchmark called ExpMRC for evaluating the textual explainability of MRC systems. ExpMRC contains four subsets, including SQuAD, CMRC 2018, RACE+, and C3, with additional annotations of the answer's evidence. The MRC systems are required to give not only the correct answer but also its explanation. We use state-of-the-art PLMs to build baseline systems and adopt various unsupervised approaches to extract both answer and evidence spans without human-annotated evidence spans. The experimental results show that these models are still far from human performance, suggesting that ExpMRC is challenging. Resources (data and baselines) are available through https://github.com/ymcui/expmrc.
Keywords: Explainable artificial intelligence; Machine reading comprehension; Natural language processing
Year: 2022 | PMID: 35497046 | PMCID: PMC9048090 | DOI: 10.1016/j.heliyon.2022.e09290
Source DB: PubMed | Journal: Heliyon | ISSN: 2405-8440
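Both the answer span and the evidence span are scored with F1 (the 'Ans.'/'Evi.' columns in the baseline table and the evidence-F1 upper bounds below), presumably taking the best match over the multiple annotated references. A minimal sketch of SQuAD-style token-level F1 under that reading, assuming simple word tokenization; the official evaluation script in the GitHub repository is authoritative, and the Chinese subsets would be scored over characters rather than words:

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted span and one gold span."""
    pred_toks = re.findall(r"\w+", prediction.lower())
    ref_toks = re.findall(r"\w+", reference.lower())
    common = Counter(pred_toks) & Counter(ref_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(ref_toks)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, references: list[str]) -> float:
    """Score against every annotated span (up to 3 answers and 4 evidence
    spans per question, per the statistics table below) and keep the best
    match, as in the standard SQuAD evaluation."""
    return max(token_f1(prediction, r) for r in references)
```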
Examples in ExpMRC. Evidence is underlined and the answer is shown in blue.
Statistics of the proposed ExpMRC.
| | SQuAD Dev | SQuAD Test | CMRC 2018 Dev | CMRC 2018 Test | RACE+ Dev | RACE+ Test | C3 Dev | C3 Test |
|---|---|---|---|---|---|---|---|---|
| Language | English | English | Chinese | Chinese | English | English | Chinese | Chinese |
| Answer Type | passage span | passage span | passage span | passage span | multi-choice | multi-choice | multi-choice | multi-choice |
| Domain | Wikipedia | Wikipedia | Wikipedia | Wikipedia | exams | exams | exams | exams |
| Passage Num. | 319 | 313 | 369 | 399 | 167 | 168 | 273 | 244 |
| Question Num. | 501 | 502 | 515 | 500 | 561 | 564 | 505 | 500 |
| Max Answer Num. | 3 | 3 | 3 | 3 | 1 | 1 | 1 | 1 |
| Max Evidence Num. | 2 | 2 | 3 | 3 | 2 | 2 | 4 | 4 |
| Avg/Max Passage Tokens | 146/369 | 157/352 | 467/961 | 468/930 | 311/514 | 324/603 | 426/1096 | 413/1011 |
| Avg/Max Question Tokens | 12/28 | 11/28 | 15/37 | 15/37 | 15/39 | 16/55 | 14/28 | 14/31 |
| Avg/Max Answer Tokens | 3/25 | 3/27 | 6/64 | 5/33 | 6/20 | 6/27 | 7/25 | 7/35 |
| Avg/Max Evidence Tokens | 26/62 | 28/76 | 43/175 | 52/313 | 23/162 | 23/82 | 37/199 | 41/180 |
| Surface Matching | - | - | - | - | 61% | 58% | 63% | 62% |
| Semantic Matching | - | - | - | - | 14% | 16% | 20% | 18% |
| Complex Reasoning | - | - | - | - | 25% | 26% | 17% | 20% |
Figure 1: Distribution of question types.
Figure 2: Neural network architecture of the baselines.
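Figure 2 is not reproduced here; given the task, the baselines plausibly share one PLM encoder with separate span heads for the answer and the evidence. A hedged sketch of that shape for the span-extraction subsets, using Hugging Face transformers; the class and layer names are illustrative, not the paper's code:

```python
import torch.nn as nn
from transformers import AutoModel

class ExpMrcBaseline(nn.Module):
    """PLM encoder with separate start/end heads for answer and evidence."""
    def __init__(self, model_name="bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.answer_head = nn.Linear(hidden, 2)    # answer start/end logits
        self.evidence_head = nn.Linear(hidden, 2)  # evidence start/end logits

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        ans_start, ans_end = self.answer_head(h).split(1, dim=-1)
        evi_start, evi_end = self.evidence_head(h).split(1, dim=-1)
        return (ans_start.squeeze(-1), ans_end.squeeze(-1),
                evi_start.squeeze(-1), evi_end.squeeze(-1))
```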
Baseline results on SQuAD, CMRC 2018, RACE+, and C3. B: base, L: large. ‘Sent.’ for ‘sentence’, ‘Ques.’ for ‘question’. ‘Ans.’, ‘Evi.’, and ‘All’ denote the answer/evidence/overall score, respectively.
| System | SQuAD (dev) Ans. | SQuAD (dev) Evi. | SQuAD (dev) All | SQuAD (test) Ans. | SQuAD (test) Evi. | SQuAD (test) All | CMRC 2018 (dev) Ans. | CMRC 2018 (dev) Evi. | CMRC 2018 (dev) All | CMRC 2018 (test) Ans. | CMRC 2018 (test) Evi. | CMRC 2018 (test) All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Most Similar Sent. (B) | | 81.8 | 74.5 | 87.1 | 85.4 | 76.1 | | 71.9 | 60.1 | 84.4 | 62.2 | 52.9 |
| MSS. w/ Ques. (B) | | 81.0 | 72.9 | 87.1 | 84.8 | 75.6 | | 76.9 | 63.9 | 84.4 | | |
| Predicted Answer Sent. (B) | | | | 87.1 | | | | | | 84.4 | 69.1 | 59.8 |
| Pseudo-data Training (B) | 87.0 | 79.5 | 70.6 | | 78.6 | 69.8 | 81.5 | 73.2 | 60.4 | | 61.3 | 52.4 |
| Most Similar Sent. (L) | | 83.9 | 79.3 | 92.3 | 85.7 | 80.4 | 82.8 | 71.6 | 60.3 | 88.6 | 63.0 | 55.9 |
| MSS. w/ Ques. (L) | | 81.9 | 77.4 | 92.3 | 85.1 | 79.8 | 82.8 | 76.3 | 63.6 | 88.6 | 63.2 | |
| Predicted Answer Sent. (L) | | | | 92.3 | | | 82.8 | | | 88.6 | 70.6 | |
| Pseudo-data Training (L) | 92.9 | 80.7 | 75.6 | | 80.1 | 74.8 | | 73.1 | 62.7 | | 62.9 | 55.3 |
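The 'Most Similar Sent.' rows in the table above implement unsupervised evidence extraction: return the passage sentence most similar to the predicted answer, optionally concatenated with the question ('MSS. w/ Ques.'). A rough sketch under that reading, with plain lexical overlap standing in for whatever similarity function the paper actually uses:

```python
import re
from collections import Counter

def _tokens(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def most_similar_sentence(passage: str, query: str) -> str:
    """Return the passage sentence with the largest lexical overlap with
    the query; that sentence is predicted as the evidence."""
    sentences = re.split(r"(?<=[.!?])\s+", passage)
    q = _tokens(query)
    return max(sentences, key=lambda s: sum((_tokens(s) & q).values()))

# 'Most Similar Sent.':     query = predicted answer (or chosen option)
# 'MSS. w/ Ques.':          query = question + ' ' + predicted answer
# 'Predicted Answer Sent.': skip the search and return the sentence that
#                           contains the predicted answer span instead.
```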
Figure 3: Effect of the lambda term in the evidence loss. X-axis: lambda; Y-axis: average F1.
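The lambda in Figure 3 weights the evidence loss when a model is trained jointly on answer spans and pseudo-labelled evidence spans (the 'Pseudo-data Training' rows above). A PyTorch-style sketch of one such joint objective; the averaged start/end cross-entropy and the default weight are assumptions, not the paper's exact formulation:

```python
import torch.nn.functional as F

def joint_loss(ans_start_logits, ans_end_logits, ans_start, ans_end,
               evi_start_logits, evi_end_logits, evi_start, evi_end,
               lam=0.5):
    """Span-extraction loss on the answer plus a lambda-weighted
    span-extraction loss on the (pseudo-labelled) evidence."""
    ans_loss = (F.cross_entropy(ans_start_logits, ans_start) +
                F.cross_entropy(ans_end_logits, ans_end)) / 2
    evi_loss = (F.cross_entropy(evi_start_logits, evi_start) +
                F.cross_entropy(evi_end_logits, evi_end)) / 2
    # lambda trades answer accuracy against evidence accuracy
    # (the sweep shown in Figure 3).
    return ans_loss + lam * evi_loss
```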
Upper bound performance of evidence F1 on the development sets.
| System | SQuAD | CMRC 2018 | RACE+ | C3 |
|---|---|---|---|---|
| Most Similar Sent. w/ Ques. | 81.9 | 76.3 | 48.0 | 63.2 |
| Predicted Answer Sent. | 85.4 | 77.7 | - | - |
| Ground Truth Answer Sent. | 88.2 | 82.1 | 49.9 | 66.8 |
| Ground Truth Evidence Sent. | 91.6 | 85.2 | 86.9 | 89.1 |