Behzad Mirzababaei, Viktoria Pammer-Schindler.
Abstract
This article discusses the usefulness of Toulmin's model of arguments for structuring an assessment of different types of wrongness in an argument. We discuss the usability of the model within a conversational agent that aims to support users in developing a good argument. We present a study and the development of classifiers that identify the existence of the structural components of a good argument, namely a claim, a warrant (underlying understanding), and evidence. Based on a dataset (three sub-datasets with 100, 1,026, and 211 responses, respectively) in which users argue about the intelligence or non-intelligence of entities, we have developed classifiers for these components. The existence and direction (positive/negative) of claims can be detected with a weighted average F1 score over all classes (positive/negative/unknown) of 0.91. The existence of a warrant (with warrant/without warrant) can be detected with a weighted F1 score over all classes of 0.88. The existence of evidence (with evidence/without evidence) can be detected with a weighted average F1 score of 0.80. We argue that these scores are high enough to be of use within a conditional dialogue structure based on Bloom's taxonomy of learning, and we show by argument an example conditional dialogue structure that allows us to conduct coherent learning conversations. While our experiments show how Toulmin's model of arguments can be used to identify structural problems with argumentation, we also discuss how the model could be used in conjunction with a content-wise assessment of correctness, especially of the evidence component, to identify more complex types of wrongness in arguments, where argument components are not well aligned. Given the progress in argument mining and conversational agents, the next challenge could be developing agents that support learning argumentation. These agents could identify more complex types of wrongness in arguments that result from wrong connections between argumentation components.
Keywords: Toulmin’s model of argument; argument mining; argument quality detection; educational conversational agent; educational technology
Year: 2021 PMID: 34927062 PMCID: PMC8680349 DOI: 10.3389/frai.2021.645516
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
FIGURE 1. The components of arguments based on Toulmin’s scheme (Toulmin, 2003).
MTurk experiments for collecting data.
| Datasets | Number of collected responses | Qualification requirement |
|---|---|---|
| Dataset 1 | 100 | • HIT Approval Rate (%) ≥ 95 |
| Dataset 2 | 1,026 | • HIT Approval Rate (%) ≥ 95 • At least US Bachelor’s Degree |
| Dataset 3 | 211 | • HIT Approval Rate (%) ≥ 95 • At least US Bachelor’s Degree |
The descriptive statistics of different categories of entities in the datasets.
| Category | Datasets 1 and 2: # of responses | Datasets 1 and 2: average number of tokens | Dataset 3 (test data): # of responses | Dataset 3 (test data): average number of tokens |
|---|---|---|---|---|
| Animals | 296 | 36.73 | 53 | 29.77 |
| Plants | 277 | 34.06 | 55 | 34.85 |
| Inanimate objects | 277 | 31.54 | 52 | 30.23 |
| AI-enabled technologies | 276 | 39.13 | 51 | 32.31 |
The number of different labels for each component in training and test data.
| Data | Claim: positive | Claim: negative | Claim: unknown | Warrant: with warrant | Warrant: without warrant | Evidence: with evidence | Evidence: without evidence |
|---|---|---|---|---|---|---|---|
| Training data (datasets 1 and 2) | 477 | 594 | 55 | 691 | 435 | 835 | 291 |
| Test data (dataset 3) | 102 | 99 | 10 | 111 | 100 | 159 | 52 |
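The label counts above show how unevenly the classes are distributed, which matters when reading the accuracy values reported later. As a minimal sketch (the helper name `majority_baseline` is an assumption for illustration, not part of the study), the accuracy of always predicting the most frequent class on the test data can be computed from the counts in the table:

```python
# Test-data label counts copied from the table (dataset 3, 211 responses).
test_counts = {
    "claim": {"positive": 102, "negative": 99, "unknown": 10},
    "warrant": {"with warrant": 111, "without warrant": 100},
    "evidence": {"with evidence": 159, "without evidence": 52},
}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())


baselines = {name: majority_baseline(c) for name, c in test_counts.items()}
for name, value in baselines.items():
    print(f"{name}: {value:.2f}")
```

For the evidence component the baseline is already about 0.75, so accuracy alone says little there; chance-corrected measures such as Cohen's κ are more informative for that component.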
The 30 terms that correlate most with the class “with evidence” in dataset 2.
| Term |
|---|
| Hunt |
| handmade |
| Survive |
| alive |
| feed |
| made by |
| by human |
The features used for training classifiers of claim, warrant, and evidence components.
| Component | General feature | Component-specific feature |
|---|---|---|
| Claim | • TFIDF of bigrams and trigrams (The length of vector = 500) | • Regular expressions to indicate phrases such as “it is (not) intelligent” |
| Warrant | • TFIDF of bigrams and trigrams (The length of vector = 500) | • Regular expressions to indicate the proposed definitions of intelligence |
| Evidence | • TFIDF of unigrams and bigrams (The length of vector = 3,000) | • The entity-specific keywords • The length of responses based on the number of words |
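Two of the component-specific features in the table can be sketched with the standard library alone: a regular expression that flags phrases such as “it is (not) intelligent”, and the response length in words. The patterns and the function name `phrase_and_length_features` below are hypothetical; the source does not give the actual expressions used.

```python
import re

# Hypothetical patterns. The source only states that regular expressions
# indicate phrases such as "it is (not) intelligent".
NEGATIVE_PHRASE = re.compile(
    r"\b(?:is|are)(?:\s+not|n['’]t)\s+intelligent\b|\bunintelligent\b",
    re.IGNORECASE,
)
POSITIVE_PHRASE = re.compile(
    r"\b(?:is|are)\s+(?:very\s+)?intelligent\b", re.IGNORECASE
)


def phrase_and_length_features(response):
    """Return simple hand-crafted features for one user response."""
    negative = bool(NEGATIVE_PHRASE.search(response))
    # Only count a positive phrase when no negated phrase was found.
    positive = bool(POSITIVE_PHRASE.search(response)) and not negative
    return {
        "positive_phrase": positive,
        "negative_phrase": negative,
        "n_words": len(response.split()),
    }


features = phrase_and_length_features(
    "I think that a self-driving car is intelligent."
)
print(features)
```

Such flags are inputs to a classifier rather than a decision on their own: a response like “I don’t believe a self-driving car is intelligent” still contains the positive phrase, which is why they are combined with the TF-IDF features.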
Real samples for the different values of the claim component. These are users’ responses without any modification.
| User’s response | Claim |
|---|---|
| “Monkeys and humans are evolutionary speaking very close. Whilst it can’t be said to think or act “humanly” (by definition only humans can do that), it can certainly think and act both intelligently and rationally, and most certainly learns from experiences. Therefore it is intelligent.” | Positive |
| “I think that a self-driving car is intelligent. It learns from experiences and adapts and makes decisions based on what it has learned.” | Positive |
| “I think a venus flytrap just wants to feed itself. That would be the goal it wants to reach.” | Unknown |
| “the New York Statue of Liberty is made of copper and it exhibits positivity to the people around it and also the toes of this statue denotes the stableness to the world.” | Unknown |
| “no I don’t believe a self-driving car is intelligent I believe the people who wrote the code that make the car self-driving are intelligent. The car can only do what is it is programed to do.” | Negative |
| “no” | Negative |
The result of 10-fold cross-validation on dataset 2 in detecting claims and evaluation of performance on the held-out dataset.
| Classifier | 10-fold cross-validation on dataset 2: average of macro F1-scores | 10-fold cross-validation on dataset 2: standard deviation of macro F1-scores | Held-out dataset 1: macro F1-score | Held-out dataset 1: accuracy |
|---|---|---|---|---|
| K-Nearest Neighbors | 0.61 | 0.02 | 0.56 | 0.70 |
| SVM | 0.76 | 0.07 | 0.63 | 0.93 |
| Decision Tree | 0.75 | 0.03 | 0.71 | 0.88 |
| Random Forest | | 0.07 | | |
| Ada Boost | 0.68 | 0.07 | 0.62 | 0.92 |
The highest F1‐scores and Accuracy values.
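The table reports the average and standard deviation of the score over 10 cross-validation folds on dataset 2 (1,026 responses). A minimal sketch of that procedure follows; `score_fold` is a hypothetical callback that trains on the training indices and returns the score on the test indices, and the use of the sample standard deviation is an assumption, since the source does not say which variant was used.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, score_fold, k=10, seed=0):
    """Return (mean, standard deviation) of score_fold over the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fold(train_idx, test_idx))
    return statistics.mean(scores), statistics.stdev(scores)


folds = k_fold_indices(1026)
print([len(fold) for fold in folds])
```

Each response appears in exactly one test fold, so every fold score is computed on data the model did not see during training for that fold.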
The performance of detecting claims on the test data (dataset 3) based on each class.
| Class | Precision | Recall | F1-score | # of instances |
|---|---|---|---|---|
| Positive | | 0.91 | | 102 |
| Negative | 0.95 | | | 99 |
| Unknown | 0.33 | 0.60 | 0.43 | 10 |
The highest Precision, Recall and F1-score values.
The overall performance of detecting claims on the test data (dataset 3).
| Random forest classifier | Precision | Recall | F1-score |
|---|---|---|---|
| Macro average | 0.75 | 0.81 | 0.77 |
| Weighted average | 0.93 | 0.91 | 0.91 |
| Accuracy | 0.91 | ||
| Cohen’s κ | 0.83 | ||
Real samples regarding the different values of the warrant component.
| User’s response | Warrant |
|---|---|
| “Yes, I think that any action that involves the act of thinking and acting, involves a certain level of intelligence, in my opinion they are very intelligent, because they are born doing things that we humans are not born doing, they learn new things, things which is outside the animal world, things that only we humans learn, but of course there is a limitation in that.” | With |
| “I think a monkey is very intelligent because it can learn just like a human.” | With |
| “Snakes have the ability to adjust their behavior as determined by their surroundings and, as such, are able to learn from their experiences, so, yes, they are intelligent.” | With |
| “A self-driving car is intelligent as long as it has the correct information for it to function. It needs to have “brains” in order to work properly.” | Without |
| “No, I think that the actins of reptiles which include apparent stealth and self-direction, do not correspond to selecting from a set of alternative actions. The action is the only option and it is conjured by the needs of instinct” | Without |
| “It was intelligent it shows the friendship between two countries namely France and United States and mostly it representing liberty the enlightening the world. The torch really shows the path to freedom.” | Without |
The result of 10-fold cross-validation on dataset 2 in detecting warrants and evaluation of performance on the held-out dataset.
| Classifier | 10-fold cross-validation on dataset 2: average of F1-scores | 10-fold cross-validation on dataset 2: standard deviation of macro F1-scores | Held-out dataset 1: macro F1-score | Held-out dataset 1: accuracy |
|---|---|---|---|---|
| K-Nearest Neighbors | 0.76 | 0.03 | 0.55 | 0.58 |
| SVM | 0.85 | 0.03 | 0.61 | 0.61 |
| Decision Tree | 0.81 | 0.04 | 0.64 | 0.64 |
| Random Forest | | 0.02 | | |
| Ada Boost | 0.85 | 0.03 | 0.65 | 0.65 |
The highest F1‐scores and Accuracy values.
The overall performance of detecting warrants on the test data (dataset 3).
| Random forest classifier | Precision | Recall | F1-score | # of instances |
|---|---|---|---|---|
| With warrant | 0.95 | 0.83 | 0.88 | 111 |
| Without warrant | 0.83 | 0.95 | 0.88 | 100 |
| Accuracy | 0.89 | |||
| Cohen’s κ | 0.77 | |||
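Cohen's κ corrects the observed agreement for the agreement expected by chance. The sketch below computes it from a confusion matrix. The counts in `warrant_matrix` are an assumption: they are inferred from the reported recall values and class sizes (111 and 100) and are not given in the source.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of items with true class i that were
    predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of the marginal proportions per class.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2
    return (observed - expected) / (1 - expected)


# Assumed counts (not reported in the source).
# Rows: true "with warrant", true "without warrant".
# Columns: predicted "with warrant", predicted "without warrant".
warrant_matrix = [[92, 19], [5, 95]]

kappa = cohens_kappa(warrant_matrix)
accuracy = (92 + 95) / 211
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")
```

With these assumed counts the sketch gives an accuracy of 0.89 and a κ of 0.77, in line with the values in the table.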
Real samples regarding the different values of the evidence component.
| User’s response | Evidence |
|---|---|
| “In my opinion, a monkey is an intelligent being, as he presents aspects similar to those in humans, such as concern for the group, being able to perceive what is best for his community with its due limitations, motor intelligence, intelligence to solve situations that demand creativity.” | With |
| “Actually, yes, I do. It doesn’t “think humanely, or act humanely.” I’m not sure if it thinks rationally or not, but it acts rationally: seeking out light in order to maximize its nutritional opportunities. It also, as all plants, learns from experience, in that it grows to match environmental conditions.” | With |
| “I don’t believe Google search engine meets the definition of intelligent because humans are behind the code of Google so Google itself is not doing the thinking. It is also only acting on what humans tell it to do. The only learning it might do is remembering what you’ve searched for previously and remembering cookies.” | With |
| “Based on the definition provided the venus fly trap is not intelligent. I believe it meets some of the criteria (Thinks and acts rationally, learns from experience) but not all. It does not think or act humanly” | Without |
| “yes because it behaves humanly and can be able to adapt to changes to its environment” | Without |
| “A Table is unintelligent, because it cannot think like a human, move on its own or adapt behavior to a changing environment.” | Without |
The result of 10-fold cross-validation on dataset 2 in detecting evidence and evaluation of performance on the held-out dataset.
| Classifier | 10-fold cross-validation on dataset 2: average of F1-scores | 10-fold cross-validation on dataset 2: standard deviation of F1-scores | Held-out dataset 1: macro F1-score | Held-out dataset 1: accuracy |
|---|---|---|---|---|
| K-Nearest Neighbors | 0.87 | 0.02 | 0.70 | 0.77 |
| SVM | 0.86 | 0.01 | 0.44 | 0.70 |
| Decision Tree | 0.86 | 0.01 | | 0.77 |
| Random Forest | | 0.01 | | |
| Ada Boost | 0.88 | 0.02 | 0.63 | 0.74 |
The highest F1‐scores and Accuracy values.
The overall performance of detecting evidence on the test data (dataset 3).
| Random forest classifier | Precision | Recall | F1-score | # of instances |
|---|---|---|---|---|
| With evidence | 0.83 | 0.96 | 0.89 | 159 |
| Without evidence | 0.79 | 0.42 | 0.54 | 52 |
| Accuracy | 0.83 | |||
| Cohen’s κ | 0.45 | |||
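The per-class F1-scores in this table can be combined into the weighted average quoted in the abstract. The following sketch uses only the F1-scores and instance counts from the tables; the function names are illustrative.

```python
# (F1-score, number of test instances) per class, copied from the tables.
evidence = {"with evidence": (0.89, 159), "without evidence": (0.54, 52)}
warrant = {"with warrant": (0.88, 111), "without warrant": (0.88, 100)}


def macro_f1(per_class):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1 for f1, _ in per_class.values()) / len(per_class)


def weighted_f1(per_class):
    """Mean of the per-class F1-scores weighted by class size."""
    total = sum(n for _, n in per_class.values())
    return sum(f1 * n for f1, n in per_class.values()) / total


print(f"evidence: macro={macro_f1(evidence):.3f}, "
      f"weighted={weighted_f1(evidence):.2f}")
print(f"warrant: weighted={weighted_f1(warrant):.2f}")
```

For the evidence component the weighted average is 0.80 while the macro average is about 0.72; the gap reflects the weaker performance on the smaller “without evidence” class.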
FIGURE 2. The different states that the agent reaches based on the user’s responses regarding the main question of the conversation, “Is < an entity > intelligent or not? Why?”
FIGURE 3. A coherent conversation when all the core components were mentioned by the user.
FIGURE 4. A coherent conversation when some of the core components were not mentioned by the user.
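The conditional dialogue structure referred to in the figures can be illustrated as a small decision function over the three classifier outputs. This is a sketch under stated assumptions: the order of the checks, the wording of the prompts, and the names `Assessment` and `next_prompt` are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Classifier outputs for one user response."""
    claim: str          # "positive", "negative", or "unknown"
    has_warrant: bool
    has_evidence: bool


def next_prompt(assessment, entity):
    """Pick a follow-up prompt for the first missing argument component."""
    if assessment.claim == "unknown":
        return f"Do you think {entity} is intelligent or not?"
    if not assessment.has_warrant:
        return "What does it mean for something to be intelligent?"
    if not assessment.has_evidence:
        return f"What does {entity} do that supports your view?"
    return "Your argument contains a claim, a warrant, and evidence."


reply = next_prompt(Assessment("positive", True, False), "a monkey")
print(reply)
```

Asking for one missing component at a time keeps each agent turn tied to what the user has already said, which is one way to keep the conversation coherent.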