Tirthankar Ghosal, Sandeep Kumar, Prabhat Kumar Bharti, Asif Ekbal.
Abstract
Peer review is at the heart of scholarly communication and the cornerstone of scientific publishing. However, academia often criticizes the peer-review system as non-transparent, biased, and arbitrary, a flawed process at the heart of science, leading researchers to question its reliability and quality. These problems may also persist because studies of peer-review texts are scarce, owing to various proprietary and confidentiality clauses. Peer-review texts could serve as a rich source for Natural Language Processing (NLP) research on understanding the scholarly communication landscape and for building systems that mitigate these pertinent problems. In this work, we present a first-of-its-kind multi-layered dataset of 1199 open peer-review texts manually annotated at the sentence level (∼17k sentences) across four layers, viz. Paper Section Correspondence, Paper Aspect Category, Review Functionality, and Review Significance. Given a text written by the reviewer, we annotate: which sections of the paper (e.g., Methodology, Experiments) and which aspects (e.g., Originality/Novelty, Empirical/Theoretical Soundness) the review text corresponds to, the role played by the review text (e.g., appreciation, criticism, summary), and the importance of the review statement (major, minor, general) within the review. We also annotate the reviewer's sentiment (positive, negative, neutral) for the first two layers to judge the reviewer's perspective on the different sections and aspects of the paper. We further introduce four novel tasks with this dataset, which could serve as indicators of the exhaustiveness of a peer review and as a step towards the automatic judgment of review quality. We also present baseline experiments and results for the different tasks to support further investigation. We believe our dataset would provide a benchmark experimental testbed for automated systems that leverage state-of-the-art NLP techniques to address different issues with peer-review quality, thereby ushering in increased transparency and trust in the holy grail of scientific research validation. Our dataset and associated codes are available at https://www.iitp.ac.in/~ai-nlp-ml/resources.html#Peer-Review-Analyze.
Year: 2022 PMID: 35085252 PMCID: PMC8794172 DOI: 10.1371/journal.pone.0259238
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
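The four-layer, sentence-level annotation scheme described in the abstract maps naturally onto one record per review sentence. A minimal sketch in Python, with hypothetical field names (the released dataset's actual schema may differ):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sentence-level record mirroring the four annotation layers;
# field names are illustrative, not the released dataset's schema.
@dataclass
class ReviewSentence:
    review_id: str
    text: str
    sections: List[str]      # Layer 1: paper-section labels, e.g. ["MET", "EXP"]
    aspects: List[str]       # Layer 2: aspect labels, e.g. ["NOV", "EMP"]
    functions: List[str]     # Layer 3: e.g. ["CRT"] for criticism
    significance: str        # Layer 4: "MAJ", "MIN", or "GEN"
    section_sentiment: Optional[str] = None  # sentiment annotated for Layer 1
    aspect_sentiment: Optional[str] = None   # sentiment annotated for Layer 2

# Example: a reviewer sentence criticising the experimental setup.
s = ReviewSentence(
    review_id="r001", text="The ablation study is missing key baselines.",
    sections=["EXP"], aspects=["EMP"], functions=["CRT", "DFT"],
    significance="MAJ", section_sentiment="negative", aspect_sentiment="negative",
)
```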
Fig 3. Heatmaps showing the label co-occurrence between two layers, highlighting the inter-dependency between the layers.
(a) Layer 1 vs Layer 2. (b) Layer 1 vs Layer 3.
Peer Review Analyze data statistics. L → review length in # of sentences; Std → standard deviation.
| Category | # Papers | # Reviews | Min L | Max L | Avg L | Std | # Sentences |
|---|---|---|---|---|---|---|---|
| ACC | 184 | 555 | 2 | 82 | ∼14 | 8.52 | 7736 |
| REJ | 192 | 578 | 1 | 60 | ∼14 | 8.58 | 8190 |
| — | 22 | 66 | 2 | 44 | ∼16 | 9.88 | 1050 |
| Total | 398 | 1199 | - | - | - | - | 16976 |
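As a quick arithmetic check, the per-category rows above sum exactly to the totals; a minimal Python sketch (the `other` key stands in for the third category, whose label is not given above):

```python
# Sanity-check the dataset statistics: per-category counts must sum to the totals.
rows = {  # category: (papers, reviews, sentences)
    "ACC": (184, 555, 7736),
    "REJ": (192, 578, 8190),
    "other": (22, 66, 1050),  # third category in the table (label not given above)
}
papers, reviews, sentences = (sum(v[i] for v in rows.values()) for i in range(3))
assert (papers, reviews, sentences) == (398, 1199, 16976)
print(f"avg review length ≈ {sentences / reviews:.1f} sentences")  # ≈ 14.2, matching Avg L
```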
Label descriptions of review texts for the review–paper section correspondence layer (Layer 1) and the review–paper aspect category layer (Layer 2).
| Layer 1 label | Description |
|---|---|
| ABS (Abstract) | The reviewer is explicitly commenting on the Abstract of the paper. |
| INT (Introduction) | The reviewer is explicitly commenting on the Introduction of the paper or provides a general summary at the beginning of the review. |
| RWK (Related Work) | The reviewer is talking explicitly about the literature section or comments on some related research. |
| PDI (Problem Definition/Idea) | A review statement that comments on the problem being investigated or the main scientific idea in the paper. |
| DAT (Data) | Any statement on the data/datasets/corpus used in the concerned work. |
| MET (Methodology) | Review comments on the methods, the approach described in the paper, or details of how the problem has been addressed. |
| EXP (Experiments) | Review comments on the experimental section, parameter/hyperparameter details, training/testing configuration, what has been done, etc. |
| RES (Results) | Comments on the results, the outcome of the experiments. |
| TNF (Tables and Figures) | Comments explicitly referring to the tables and figures within the paper. |
| ANA (Analysis) | Comments on the analysis of the results and studies of the outcome. |
| FWK (Future Work) | Comments on the future of the work, its impact on the community, etc. |
| OVA (Overall) | We keep the Overall label for review comments that are not confined to a certain section of the paper and comment on the overall work in general; it sometimes overlaps with the Introduction label. |
| BIB (Bibliography) | Any straightforward comments on the references or the bibliography section of the paper. |
| EXT (External Knowledge) | To justify their point, reviewers sometimes bring external knowledge from their expertise into the review; such statements cannot be classified under the other section labels, so we mark them EXT. |

| Layer 2 label | Description |
|---|---|
| APR (Appropriateness) | The reviewer comments on the scope of the article for the conference or the standard/suitability of the article for the venue. |
| NOV (Originality/Novelty) | Review comments on the novelty or originality of the submission. |
| IMP (Impact) | The reviewer comments on the significance of the work described (e.g., whether it inspires new ideas or insights that could be impactful to the community). |
| CMP (Meaningful Comparison) | The reviewer comments on whether the work is compared against earlier approaches, where the work stands within the existing literature, and whether the references are adequate. |
| PNF (Presentation and Formatting) | Review comments on the presentation and formatting aspects of the paper. |
| REC (Recommendation) | The overall recommendation of the reviewer on the article's inclusion in or exclusion from the proceedings. |
| EMP (Empirical/Theoretical Soundness) | The reviewer comments on the soundness of the approach and whether it is well-chosen (e.g., whether the arguments in the paper are cogent and well-supported). |
| SUB (Substance) | The reviewer comments on the volume of work done, whether there is enough substance to warrant publication, or whether the paper would benefit from more ideas and results. |
| CLA (Clarity) | The reviewer comments about the writing, whether the paper is well-structured, and whether the contributions come out clearly. |
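For programmatic work with the annotations, the Layer 1 and Layer 2 inventories above can be kept as plain code→name mappings; a sketch, where the expansions are inferred from the label descriptions in the tables above:

```python
# Layer 1: review–paper section correspondence labels (code -> section name).
LAYER1_SECTIONS = {
    "ABS": "Abstract", "INT": "Introduction", "RWK": "Related Work",
    "PDI": "Problem Definition/Idea", "DAT": "Data", "MET": "Methodology",
    "EXP": "Experiments", "RES": "Results", "TNF": "Tables and Figures",
    "ANA": "Analysis", "FWK": "Future Work", "OVA": "Overall",
    "BIB": "Bibliography", "EXT": "External Knowledge",
}

# Layer 2: paper aspect category labels (code -> aspect name).
LAYER2_ASPECTS = {
    "APR": "Appropriateness", "NOV": "Originality/Novelty", "IMP": "Impact",
    "CMP": "Meaningful Comparison", "PNF": "Presentation and Formatting",
    "REC": "Recommendation", "EMP": "Empirical/Theoretical Soundness",
    "SUB": "Substance", "CLA": "Clarity",
}
```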
Label descriptions of review texts in the review-statement purpose layer (Layer 3) and the review-statement significance layer (Layer 4).
| Layer 3 label | Reviewer intent / label description |
|---|---|
| SMY (Summary) | Provides a summary of the work reflecting the reviewer's understanding of it, usually at the beginning of the review. |
| SUG (Suggestion) | Provides suggestions to the author to improve the work or to include additional details for clarity, such as evidence, artifacts, etc. |
| DFT (Deficit) | Highlights the major/minor flaws or shortcomings in the paper, complementing the work; usually, the reviewer appears confident in their claim. |
| APC (Appreciation) | Applauds the authors for their work, highlighting positive aspects or specific sections of the paper. |
| DIS (Discussion) | Statements where the reviewer engages in simple explanations, provides additional insights, etc.; usually neutral in polarity. |
| QSN (Question) | The reviewer explicitly poses a question to the author (ending with a question mark); such statements sometimes ask for further explanation and can sometimes highlight a deficit and bear negative polarity. |
| CRT (Criticism) | The reviewer is critical of the work, usually highlights a deficit, and bears explicit negative sentiment. |
| FBK (Feedback) | The reviewer clearly brings out their view of the work, usually leading to acceptance/rejection statements. |

| Layer 4 label | Description |
|---|---|
| MAJ (Major) | A strong statement in which the reviewer highlights an opinionated view on a major aspect/section or on the entire paper (usually a strength or weakness). It could point to a critical flaw (empirical/theoretical soundness) that could not be rectified easily by the author, or it could be an appreciation of the work's novelty. The editor/chair should consider a major comment in the final decision-making or while writing the meta-review. |
| MIN (Minor) | Comments that would not play a decisive role; usually on presentation and formatting aspects, missing references, etc., which could be quickly addressed by the author with little effort. |
| GEN (General) | Regular comments on the paper that could not be classified into the above two categories; usually discussions, and non-opinionated. |
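The Layer 3 and Layer 4 inventories complete the picture; a sketch extending the mappings above with a small validation helper (the `valid_codes` function is illustrative, not part of the released code, and the expansions are inferred from the descriptions above):

```python
# Layer 3: review-statement purpose/functionality labels (code -> function).
LAYER3_FUNCTIONS = {
    "SMY": "Summary", "SUG": "Suggestion", "DFT": "Deficit",
    "APC": "Appreciation", "DIS": "Discussion", "QSN": "Question",
    "CRT": "Criticism", "FBK": "Feedback",
}

# Layer 4: review-statement significance labels.
LAYER4_SIGNIFICANCE = {"MAJ": "Major", "MIN": "Minor", "GEN": "General"}

def valid_codes(assigned, inventory):
    """True iff every assigned label code belongs to the given layer's inventory."""
    return all(code in inventory for code in assigned)

# Usage, e.g. on one sentence's Layer 3 and Layer 4 annotations:
assert valid_codes(["CRT", "DFT"], LAYER3_FUNCTIONS)
assert valid_codes(["MAJ"], LAYER4_SIGNIFICANCE)
```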
Fig 1. Label distributions for the four layers across Peer Review Analyze.
(a) Layer 1-ACC. (b) Layer 1-REJ. (c) Layer 2-ACC. (d) Layer 2-REJ. (e) Layer 3-ACC. (f) Layer 3-REJ. (g) Layer 4-ACC. (h) Layer 4-REJ.
Maximum and average occurrence of review statements per review for the four different layers in the dataset (the minimum occurrence is zero for every label).
| Layer 1 | | | Layer 2 | | | Layer 3 | | | Layer 4 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Labels | Max | Avg | Labels | Max | Avg | Labels | Max | Avg | Labels | Max | Avg |
| ABS | 3 | 0.05 | CLA | 11 | 0.95 | SMY | 26 | 3.34 | MAJ | 31 | 5.87 |
| INT | 9 | 1.00 | APR | 4 | 0.12 | SUG | 13 | 1.46 | MIN | 40 | 5.84 |
| RWK | 42 | 2.93 | NOV | 6 | 0.56 | DFT | 13 | 1.49 | GEN | 34 | 4.27 |
| PDI | 20 | 1.76 | SUB | 14 | 1.34 | APC | 15 | 2.53 | | | |
| DAT | 12 | 1.21 | IMP | 9 | 0.75 | DIS | 18 | 2.69 | | | |
| MET | 32 | 7.51 | CMP | 11 | 1.43 | QSN | 18 | 1.66 | | | |
| EXP | 17 | 2.61 | PNF | 14 | 0.75 | CRT | 24 | 3.38 | | | |
| RES | 15 | 1.75 | EMP | 31 | 6.93 | FBK | 4 | 0.28 | | | |
| TNF | 10 | 0.61 | REC | 3 | 0.23 | | | | | | |
| ANA | 17 | 0.71 | | | | | | | | | |
| FWK | 5 | 0.15 | | | | | | | | | |
| OVA | 10 | 1.12 | | | | | | | | | |
| BIB | 8 | 0.41 | | | | | | | | | |
| EXT | 6 | 0.27 | | | | | | | | | |
Fig 2. Label distribution pie charts showing the relative importance of each category of statements in a single review across the four layers.
The average label occurrence for each layer in Table 4 is translated to a percentage distribution in these pie charts. (a) Layer 1. (b) Layer 2. (c) Layer 3. (d) Layer 4.
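The translation from average occurrences to pie-chart percentages is simple normalization; a minimal sketch using the Layer 4 averages from the occurrence table above:

```python
# Convert per-layer average label occurrences into the percentage shares
# shown in the pie charts (Layer 4 averages taken from the table above).
layer4_avg = {"MAJ": 5.87, "MIN": 5.84, "GEN": 4.27}
total = sum(layer4_avg.values())
shares = {label: round(100 * avg / total, 1) for label, avg in layer4_avg.items()}
print(shares)  # {'MAJ': 36.7, 'MIN': 36.5, 'GEN': 26.7}
```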
Fig 4. Heatmaps showing the label co-occurrence between two layers, highlighting the inter-dependency between the layers.
(a) Layer 1 vs Layer 4. (b) Layer 2 vs Layer 3. (c) Layer 2 vs Layer 4. (d) Layer 3 vs. Layer 4.
Fig 5. Sentiment distribution for labels across Layer 1 and Layer 2 for ACC and REJ papers.
(a) Sentiment for Layer 1-ACC. (b) Sentiment for Layer 2-ACC. (c) Sentiment for Layer 1-REJ. (d) Sentiment for Layer 2-REJ.
Fig 6. Label-wise F1 scores (micro-averaged) for Task 1. M(1-6) → methods for evaluation; refer to Table 6.
| M | ABS | INT | RWK | PDI | DAT | MET | EXP | RES | ANA | TNF | FWK | BIB | EXT | OAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 0.48 | 0.41 | 0.15 | 0.27 | 0.43 | 0.54 | 0.34 | 0.45 | 0.31 | 0.51 | 0.37 | 0.28 | 0.17 | 0.50 |
| M2 | 0.24 | 0.53 | 0.21 | 0.33 | 0.49 | 0.59 | 0.38 | 0.51 | 0.34 | 0.56 | 0.31 | 0.31 | 0.11 | 0.50 |
| M3 | 0.00 | 0.30 | 0.01 | 0.03 | 0.34 | 0.53 | 0.15 | 0.25 | 0.03 | 0.24 | 0.00 | 0.02 | 0.00 | 0.50 |
| M4 | 0.13 | 0.34 | 0.13 | 0.16 | 0.27 | 0.49 | 0.12 | 0.06 | 0.19 | 0.18 | 0.22 | 0.27 | 0.01 | 0.38 |
| M5 | 0.66 | 0.73 | 0.53 | 0.29 | 0.63 | 0.56 | 0.40 | 0.54 | 0.49 | 0.59 | 0.35 | 0.66 | 0.10 | 0.61 |
Label-wise F1 scores (micro-averaged) for Task 2.
| Methods | CLA | APR | NOV | SUB | IMP | CMP | PNF | REC | EMP |
|---|---|---|---|---|---|---|---|---|---|
| M1 | 0.45 | 0.33 | 0.47 | 0.14 | 0.09 | 0.33 | 0.29 | 0.49 | 0.57 |
| M2 | 0.62 | 0.34 | 0.59 | 0.06 | 0.10 | 0.39 | 0.32 | 0.46 | 0.64 |
| M3 | 0.45 | 0.05 | 0.10 | 0.01 | 0.01 | 0.19 | 0.05 | 0.26 | 0.61 |
| M4 | 0.32 | 0.03 | 0.12 | 0.05 | 0.07 | 0.20 | 0.08 | 0.12 | 0.57 |
| M5 | 0.61 | 0.32 | 0.26 | 0.43 | 0.68 | 0.69 | 0.38 | 0.40 | 0.62 |
Label-wise F1 scores (micro-averaged) for Task 3 and Task 4.
| Methods | Task 3 | | | | | | | | Task 4 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | APC | CRT | DFT | DIS | FBK | QSN | SMY | SUG | GEN | MAJ | MIN |
| M1 | 0.56 | 0.42 | 0.19 | 0.28 | 0.38 | 0.36 | 0.55 | 0.42 | 0.57 | 0.58 | 0.56 |
| M2 | 0.54 | 0.41 | 0.19 | 0.25 | 0.42 | 0.38 | 0.58 | 0.37 | 0.59 | 0.59 | 0.61 |
| M3 | 0.52 | 0.37 | 0.20 | 0.24 | 0.51 | 0.41 | 0.55 | 0.35 | 0.56 | 0.57 | 0.55 |
| M4 | 0.36 | 0.33 | 0.07 | 0.18 | 0.08 | 0.49 | 0.36 | 0.49 | 0.48 | 0.42 | 0.61 |
| M5 | 0.58 | 0.79 | 0.74 | 0.34 | 0.46 | 0.83 | 0.71 | 0.57 | 0.62 | 0.66 | 0.64 |
Overall accuracy figures on the initial four tasks for the different baseline methods.
| Methods | Task 1 | Task 2 | Task 3 | Task 4 |
|---|---|---|---|---|
| M1 | 42.97% | 44.48% | 42.10% | 57.12% |
| M2 | 48.38% | 52.3% | 41.11% | 60.12% |
| M3 | 38.88% | 46.39% | 39.91% | 56.08% |
| M4 | 32.53% | 39.65% | 35.71% | 43.32% |
| M5 | 58.42% | 57.61% | 61.81% | 62.50% |
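Metrics of this kind for the multi-label classification tasks can be reproduced with scikit-learn; a minimal sketch, assuming gold and predicted labels as binary indicator matrices (the toy arrays below are illustrative, not dataset values):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy gold/predicted indicator matrices for a 3-label task (rows = sentences).
y_true = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]])

# Per-label F1, one score per label column, as in the label-wise tables ...
per_label_f1 = [f1_score(y_true[:, j], y_pred[:, j]) for j in range(y_true.shape[1])]
# ... plus overall figures across all labels.
micro_f1 = f1_score(y_true, y_pred, average="micro")
subset_acc = accuracy_score(y_true, y_pred)  # exact-match (subset) accuracy
print(per_label_f1, micro_f1, subset_acc)
```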
P → precision; R → recall; R1 → ROUGE-1 (unigram overlap); R2 → ROUGE-2 (bigram overlap); R-L → ROUGE-L (longest common subsequence).
| Model | R1 | | | R2 | | | R-L | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F1 | P | R | F1 | P | R | F1 |
| — | 0.538 | 0.221 | 0.267 | 0.358 | 0.138 | 0.170 | 0.509 | 0.230 | 0.281 |
| — | 0.559 | 0.230 | 0.277 | 0.372 | 0.143 | 0.176 | 0.529 | 0.239 | 0.292 |
| — | 0.299 | 0.165 | 0.190 | 0.047 | 0.025 | 0.029 | 0.256 | 0.145 | 0.169 |
| — | 0.227 | 0.201 | 0.189 | 0.024 | 0.021 | 0.019 | 0.163 | 0.147 | 0.139 |
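Scores in this format can be produced with Google's `rouge-score` package; a minimal sketch (the example reference and generated strings are illustrative):

```python
from rouge_score import rouge_scorer

# Score a generated text against a reference with ROUGE-1/2/L, reporting
# precision, recall, and F-measure for each, as in the table above.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "the paper proposes a novel dataset for peer review analysis"
generated = "the paper introduces a new peer review dataset"
for name, score in scorer.score(reference, generated).items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```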