Konstantinos Voudouris1,2, Matthew Crosby1,3, Benjamin Beyret1,3, José Hernández-Orallo1,4, Murray Shanahan1,3, Marta Halina1,2,5, Lucy G Cheke1,2.
Abstract
Artificial Intelligence is making rapid and remarkable progress in the development of more sophisticated and powerful systems. However, the acknowledgement of several problems with modern machine learning approaches has prompted a shift in AI benchmarking away from task-oriented testing (such as Chess and Go) towards ability-oriented testing, in which AI systems are tested on their capacity to solve certain kinds of novel problems. The Animal-AI Environment is one such benchmark which aims to apply the ability-oriented testing used in comparative psychology to AI systems. Here, we present the first direct human-AI comparison in the Animal-AI Environment, using children aged 6-10 (n = 52). We found that children of all ages were significantly better than a sample of 30 AIs across most of the tests we examined, as well as performing significantly better than the two top-scoring AIs, "ironbar" and "Trrrrr," from the Animal-AI Olympics Competition 2019. While children and AIs performed similarly on basic navigational tasks, AIs performed significantly worse in more complex cognitive tests, including detour tasks, spatial elimination tasks, and object permanence tasks, indicating that AIs lack several cognitive abilities that children aged 6-10 possess. Both children and AIs performed poorly on tool-use tasks, suggesting that these tests are challenging for both biological and non-biological machines.
Keywords: AI benchmarks; Animal-AI Olympics; artificial intelligence; cognitive AI; comparative cognition; human-AI comparison; out-of-distribution testing
Year: 2022 PMID: 35686061 PMCID: PMC9172850 DOI: 10.3389/fpsyg.2022.711821
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Tasks grouped into 10 levels of increasing difficulty.
| Level Name | Level Description | What is required of the agent? | Task Examples |
|---|---|---|---|
| L1 - Food Retrieval | Rewarding and aversive stimuli in an open arena containing no obstacles. | Basic navigation towards rewarding stimuli and away from aversive stimuli. Tests whether the agent can navigate the arena and achieve the simple goals of obtaining rewards. | This is not a tested skill within comparative cognition, as it is assumed that any creature is able to feed itself to survive. |
| L2 - Preferences | Rewarding and aversive stimuli arranged in forced-choice or free-choice arrangements. All stimuli can be viewed from the same position (the agent need not reorient itself to view the stimuli) | Selection of most rewarding stimuli when presented with multiple visible options. Tests whether the agent has a notion of which stimuli are the most rewarding. | Y-mazes ( |
| L3 - Static Obstacles | Rewarding stimuli are fully or partially occluded by opaque or transparent static obstacles such as walls, ramps, tunnels, or boxes. | Navigation around variable static objects to obtain rewards that may be initially out of view. Tests whether the agent can explore the arena in the search for occluded rewarding stimuli. | Detour tasks and cylinder tasks ( |
| L4 - Avoidance | Rewarding and aversive stimuli are arranged around aversive zones. | Navigation in an arena containing aversive zones. Tests whether the agent avoids aversive stimuli. | Y-maze variants (see L2) |
| L5 - Spatial Reasoning and Support | Rewarding stimuli are occluded, or not simultaneously visible from one position. They may also be supported out of reach by other static objects. | Inferences about the locations of rewarding stimuli from their absence elsewhere. Tests whether the agent can reason about space and how external objects can support each other. | T-mazes ( |
| L6 - Generalisation | A selection of tasks from previous levels, except that the colour of the walls and flooring (except orange and red zones) is varied. | The agent is required to ignore irrelevant cues about colour. Tests whether the agent is using the colour of background objects as a cue to behaviour in the arena. | This is often a feature of controls within animal cognition tasks rather than a feature of test variables, e.g., counterbalancing colour, or stimulus location. |
| L7 - Internal Modelling | A selection of tasks from previous levels except that visual information is blocked at periodic intervals. | The agent is required to continue navigating towards rewarding objects despite lack of visual input. Tests whether the agent behaves through step-by-step responses to pixel output or whether broader action plans are carried out. | ‘Lights out’ radial arm mazes ( |
| L8 - Object Permanence and Working Memory | Rewarding stimuli pass out of view behind occluding objects. | The agent is required to navigate to rewarding stimuli by inferring where they are from their initial trajectories before they were occluded. Tests whether the agent acknowledges that objects persist even when they move behind a barrier. | Primate Cognition Test Battery (PCTB; |
| L9 - Numerosity and Advanced Preferences | Preference tasks (L2) where the number of rewarding stimuli in each choice is high (3+). These are augmented with object permanence tasks (L8). | The agent is required to count the number of rewarding stimuli available and judge which option is optimal for the goal of reaching maximum points. These stimuli may pass behind occluding objects. Tests whether the agent acknowledges the number of rewarding stimuli in making preference decisions. | Numerical discrimination tasks (e.g., |
| L10 - Causal Reasoning | Rewarding stimuli are only accessible through interactions with one or more non-rewarding stimuli such as boxes and pushable blocks. | The agent is required to manipulate non-rewarding objects in the arena in order to obtain rewarding stimuli. Tests whether the agent can reason about how objects interact causally to carry out multi-stage actions. | Trap-tube tasks (e.g., |
“Rewarding stimuli” refer to green/yellow “fruit.” “Aversive stimuli” refer to red “fruit.” “Aversive zones” refer to red and orange areas of the arena. In the final column, the experimental paradigms that inspired each level are referenced.
Figure 1. A visual description of the Animal-AI Environment and Testbed. Full details are presented in the Supplementary Material. Images of the Animal-AI Environment and Testbed are licensed under Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
Rankings of 30 AI agents involved in this study compared to ranking in AAI Olympics 2019 Competition.
| AI/Team Name | Total Average Accuracy (4 decimal places, d.p.) | Ranking | AAI Olympics Rank |
|---|---|---|---|
| Ironbar | 0.4896 | 1 | 2 |
| Trrrrr | 0.4881 | 2 | 1 |
| Sirius | 0.4308 | 3 | 3 |
| ARF-RL | 0.4278 | 4 | 8 |
| Sungbinchoi | 0.4198 | 5 | 6 |
| Melflo (oltau.ai) | 0.4196 | 6 | 5 |
| DeepFox | 0.4095 | 7 | 7 |
| Juramaia | 0.3910 | 8 | 10 |
| BronzeBlood | 0.3906 | 9 | 4 |
| mmIA | 0.3900 | 10 | 12 |
Figure 2. Histograms of accuracy averaged across 40 tasks. AIs (left, purple) and children (right, red). The average pass mark across the 40 tasks is shown by the green line. Red/purple solid lines show the probability densities. Red/purple dotted lines show average accuracy.
Figure 3. Boxplots by level and by agent. Levels are in ascending order on the x-axis, with AIs in purple (left-hand boxplot of each pair) and children in red (right-hand boxplot of each pair). Average pass marks for each level are shown in green.
Mann–Whitney U-test statistics and Vargha-Delaney’s A comparing AIs and children on each level.
| Level Num. | Level Name | W-statistic | Vargha-Delaney’s A |
|---|---|---|---|
| L1 | Food Retrieval | 560 | 0.349 |
| L2 | Preferences | 670 | 0.429 |
| L3 | Static Obstacles | 52*** | 0.033 |
| L4 | Avoidance | 267*** | 0.171 |
| L5 | Spatial Reasoning and Support | 201*** | 0.129 |
| L6 | Generalisation | 153*** | 0.098 |
| L7 | Internal Modelling | 303*** | 0.194 |
| L8 | Object Permanence and Working Memory | 73*** | 0.047 |
| L9 | Numerosity and Advanced Preferences | 219*** | 0.140 |
| L10 | Causal Reasoning | 395** | 0.253 |
N_AI = 30, N_children = 52. Bonferroni correction applied to significance levels. *p < 0.005, **p < 0.001, ***p < 0.0001.
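The two statistic columns in the table above are linked: Vargha-Delaney's A is the U statistic divided by the number of cross-group pairs. The sketch below is a minimal illustration of that relationship, assuming the reported W-statistic is the U statistic for the AI group (the value R's `wilcox.test` reports when the AI sample is passed first). The function names and the toy scores are hypothetical and are not taken from the study.

```python
from itertools import product


def mann_whitney_u(x, y):
    """U statistic for sample x: count of (x_i, y_j) pairs with x_i > y_j.

    Tied pairs contribute 0.5 each.
    """
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(x, y))


def vargha_delaney_a(x, y):
    """Estimated probability that a random draw from x exceeds one from y."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


def a_from_u(u, n_x, n_y):
    """Recover A from an already computed U statistic and the group sizes."""
    return u / (n_x * n_y)


# Toy accuracy scores (hypothetical, for illustration only).
ai_scores = [0.2, 0.3, 0.3, 0.5]
child_scores = [0.3, 0.6, 0.7]

print(mann_whitney_u(ai_scores, child_scores))               # 2.0
print(round(vargha_delaney_a(ai_scores, child_scores), 4))   # 0.1667

# Using the group sizes from the table footnote (30 AIs, 52 children),
# the L2 row (W = 670) gives A = 670 / 1560.
print(round(a_from_u(670, 30, 52), 3))                       # 0.429
```

Under this reading, A = 0.5 would mean no stochastic difference between the groups, and values near 0 mean that AIs almost always scored below children on that level.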
Measures of central tendency and deviation by age group/agent type.
| Age Group/Agent | Mean | Median | Standard Deviation |
|---|---|---|---|
| 6 (N = 7) | 0.5823 | 0.5993 | 0.1355 |
| 7 ( | 0.6081 | 0.6295 | 0.1330 |
| 8 ( | 0.5823 | 0.5993 | 0.1355 |
| 9 ( | 0.6539 | 0.6843 | 0.1548 |
| 10 ( | 0.6081 | 0.6294 | 0.1330 |
| AI ( | 0.3412 | 0.3529 | 0.0809 |
Figure 4. Density plot of average score across 40 tasks, by age/agent type. The green line shows the average pass mark across the 40 tasks.
Kendall’s Tau by age group/agent type, with Bonferroni correction.
| Age Group/Agent Type | ||
|---|---|---|
| 6 | −0.7778* | −0.4777*** |
| 7 | −0.6889( | −0.4030** |
| 8 | −0.7778* | −0.4377** |
| 9 | −0.6000 ( | −0.3950** |
| 10 | −0.6889 ( | −0.3710* |
| AI | −0.3778 ( | −0.2684 ( |
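As a rough illustration of the statistic reported in the table above, the sketch below computes Kendall's tau for a small paired sample. The column headers of the table did not survive, so the pairing used here (level number against per-level accuracy) is an assumption, as are the function name and the toy data. The variant shown is tau-a, which does not correct for ties; the study may have used a different variant.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Tied pairs (sign == 0) count as neither concordant nor discordant.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-level accuracies that mostly fall as level number rises.
levels = list(range(1, 11))
accuracy = [0.90, 0.80, 0.85, 0.70, 0.60, 0.65, 0.50, 0.40, 0.45, 0.30]

# 3 concordant and 42 discordant pairs out of 45.
print(round(kendall_tau_a(levels, accuracy), 4))  # -0.8667
```

With 10 levels there are 45 pairs, so tie-free tau values are multiples of 1/45; a value such as −0.7778 is consistent with 5 concordant and 40 discordant pairs.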
Figure 5. Boxplots of average accuracy on each level, by age/agent type. For each level, the five left-hand boxplots are the age groups 6–10 in ascending order, and the rightmost boxplot is the AI group. The green bars show the average pass mark for each level.
t-ratios for pairwise comparisons (contrast effects) between age groups/AIs on Aligned Rank data. All comparisons have 76 degrees of freedom.
| Age/AI | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|
| 7 | −0.730 (−0.399) | ||||
| 8 | −0.349 (−0.125) | 0.613 (0.247) | |||
| 9 | −1.446 (−0.809) | −0.804 (−0.410) | −1.479 (−0.684) | ||
| 10 | −1.053 (−0.576) | −0.356 (−0.177) | −1.008 (−0.451) | 0.457 (0.233) | |
| AI | −4.401*** (2.050) | 6.045*** (2.449) | 6.332*** (2.175) | 6.779*** (2.859) | 6.481*** (2.626) |
Values of p adjusted by Tukey Method. Effect sizes (Cohen’s d) shown in brackets.
Figure 6. UMAP projection onto two dimensions using default values of N = 15 and min-dist = 0.1. The labels for AIs correspond to the algorithm name. Age labels are included for children. See the RShinyDash app provided in the Supplementary Material for different parameter settings.
Percentile for “ironbar” and “Trrrrr” with respect to children’s performances.
| Level | Ironbar (percentile) | Trrrrr (percentile) |
|---|---|---|
| Overall | 22nd | 22nd |
| 1 | 97th | 46th |
| 2 | 38th | 72nd |
| 3 | 12th | 4th |
| 4 | 79th | 75th |
| 5 | 33rd | 48th |
| 6 | 1st | 18th |
| 7 | 62nd | 74th |
| 8 | 25th | 0th |
| 9 | 11th | 11th |
| 10 | 32nd | 21st |
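The percentiles in the table above place a single agent's score within the distribution of children's scores. The record does not state the exact percentile definition, so the sketch below assumes one common choice: the percentage of children scoring strictly below the agent, which allows a 0th percentile as seen in the table. The function name and all scores are hypothetical.

```python
def percentile_rank(score, reference):
    """Percentage of reference scores strictly below `score`."""
    below = sum(1 for r in reference if r < score)
    return 100.0 * below / len(reference)


# Hypothetical children's accuracies on one level.
children = [0.40, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]

print(percentile_rank(0.49, children))  # 12.5 (one of eight children scored lower)
print(percentile_rank(0.30, children))  # 0.0 (no child scored lower)
```

Other definitions, such as counting ties as half or including equal scores, would shift the result slightly when the agent's score coincides with a child's.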
Figure 7. Bonferroni confidence intervals for children’s data at alpha = 0.05 with ‘ironbar’ and ‘Trrrrr’ results and pass marks overlaid.
Figure 8. Different static obstacles in the AAI Testbed. Cuboidal blocks in L1-L5 (top). Fence-like structures in L6 (bottom). Images of the Animal-AI Environment and Testbed are licensed under Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
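A Bonferroni confidence interval widens each interval so that the family of intervals jointly holds at the stated alpha. The sketch below is a minimal illustration under two assumptions that the record does not confirm: that the correction is over 10 comparisons (one per level), and that a normal approximation is acceptable in place of the t distribution. The function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bonferroni_ci(sample, alpha=0.05, n_comparisons=10):
    """Normal-approximation CI for the mean at alpha / n_comparisons."""
    adjusted_alpha = alpha / n_comparisons
    # Two-sided critical value for the adjusted significance level.
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    centre = mean(sample)
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width


# Hypothetical children's accuracies on one level.
scores = [0.55, 0.60, 0.62, 0.70, 0.48, 0.66, 0.75, 0.58, 0.64, 0.69]

lower, upper = bonferroni_ci(scores)
unadj_lower, unadj_upper = bonferroni_ci(scores, n_comparisons=1)
print(f"Bonferroni-adjusted: [{lower:.3f}, {upper:.3f}]")
print(f"Unadjusted:          [{unadj_lower:.3f}, {unadj_upper:.3f}]")
```

An agent's score that falls below the lower bound for a level would then lie outside the adjusted interval for the children's mean on that level.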