Relationship of taxonomic error to frequency of observation.

Abstract

Biological nomenclature is the entry point to a wealth of information related to or associated with living entities. When it is applied accurately and consistently, communication between and among researchers and investigators is enhanced, leading to advancements in understanding and progress in research programs. Based on freshwater benthic macroinvertebrate taxonomic identifications, inter-laboratory comparisons of >900 samples taken from rivers, streams, and lakes across the U.S., including the Great Lakes, provided data on taxon-specific error rates. Using the error rates in combination with frequency of observation (FREQ; as a surrogate for rarity), six uncertainty/frequency classes (UFC) are proposed for approximately 1,000 taxa. The UFC, error rates, and FREQ are each potentially useful for additional analyses related to interpreting biological assessment results and/or stressor response relationships, as weighting factors for various aspects of ecological condition or biodiversity analyses, and for helping set direction for taxonomic research and refining identification tools.

Year: 2020 PMID: 33180842 PMCID: PMC7660486 DOI: 10.1371/journal.pone.0241933

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

[1] discuss biodiversity in terms of not only richness of genotypes, species, and ecosystems, but also evenness of spatial and temporal distribution, functional characteristics, and their interactions. The sheer magnitude of biological species richness is largely unknown, with estimates ranging from 3–100 million [2-7]. For almost 300 years, efforts to organize and understand that diversity have used nomenclature and classification to provide a direct pathway to actual and conceptual catalogues of information about the biota; it is a system that can conceptually and functionally be thought of as a card catalogue in a library. With growing acceptance of the reality of global change and degradation in climate and both small- and large-scale habitat, along with diminishing taxonomic expertise, the task to census and record biota seems ever more daunting and urgent. Increases in computing power, information technology, and molecular techniques are encouraging some optimism in biodiversity research [8-13]. Even with some of these advances, progress in understanding biological diversity is uneven across taxonomic groups representing different segments of the tree of life, the bias mostly reflecting differential research attention and uneven sampling for some taxa in selected geographic areas [5, 14, 15]. Routine biological monitoring and assessment is about gathering representative sample data from defined habitat and using them for quantitative inference of environmental conditions [16, 17]. Though such monitoring is not about documenting biodiversity or even absolute richness, the two fields rely on identical basic data as input for indicator calculations, model building, and decision-making, that is, taxonomic identifications. The name of an entity or object, whether individually or as a group or class, associates it with information on observable characteristics, provides answers to questions, and potentially allows new lines of enquiry to be framed and pursued. 
It is as much a truism of biological taxonomy as it is of basic human language that inconsistency in terminology impedes understanding and progress. Historical development of biological nomenclature and classification has been considered by anthropologists as a fundamental component of language. Efforts to understand folk taxonomies have proceeded by debating the relative merits of intellectualism vs. utilitarianism [18-20], approximating the difference between, respectively, basic curiosity and material need. The more frequently an object is observed, the more reliably and consistently it is recognized, potentially leading to greater refinement of naming conventions/nomenclatural structure. In this context, it is important to define what is intended by labelling an object (or a taxon) as rare. From a theoretical perspective, rarity has been defined using niche- or phylogenetic-based concepts of abundance, distribution, rarity, or conservation priority-setting [21-24]. As an operational descriptor, rarity or relative commonness is frequency of encounter or observation. The first principle and purpose of taxonomic identification and nomenclature is communication, and logically, objects that are more frequently observed will be recognized with increasing speed, reliability, and consistency. Biologist and ecologist perceptions of the relative rarity or commonness of taxa are a combination of life history and encounter frequency. As an example, reliability of botanical nomenclature used by the lay community in Chiapas, Mexico, was evaluated and use of plant names was found to be strongly related to cultural significance [25]. Techniques for communicating about plants with low cultural significance, which receive little human attention, were imprecise, that is, under-differentiating. 
Those with moderate cultural significance had a folk taxonomy which came closer to biological taxon definitions; and, at the other extreme, plants with a high cultural significance tended to be over-differentiated. There is a conceptual relationship between cultural significance and familiarity, the latter of which would be enhanced by a high frequency of encounters/observation. [26] developed a system for distribution classes of benthic macroinvertebrates, based on frequency of occurrence in the Netherlands. Using a combination of species rarity or commonness in their national dataset and direct input from a group of selected taxonomists, they developed a system comprising six different classes (Table 1). One of the driving factors behind their analysis was to have a classification system that would contribute to decision-making relative to conservation of aquatic resources.

Table 1

Distribution classes describing relative rarity and commonness of benthic macroinvertebrates in the Netherlands [26].

Distribution class¹	Percentage of sites
Very rare	0–0.15
Rare	0.16–0.5
Uncommon	0.6–1.5
Common	1.6–4.0
Very common	4.1–12
Abundant	>12

¹Class definitions are based on frequency of occurrence, calculated as the percentage of sites.
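
The class boundaries in Table 1 can be read as a simple lookup. The following is a minimal Python sketch of that lookup, not code from [26] or from this paper. The function name and the handling of values that fall between the printed ranges (for example 0.55) are assumptions: such values are assigned to the lower class, which matches the "<0.16" bound used for the very rare class in Table 7.

```python
# Lower bounds (percent of sites) at which the NEXT class begins, per Table 1.
# Assumption: a value in a gap between printed ranges (e.g. 0.55, between
# 0.5 and 0.6) stays in the lower class.
NEXT_CLASS_STARTS = [
    (0.16, "Very rare"),
    (0.6, "Rare"),
    (1.6, "Uncommon"),
    (4.1, "Common"),
]


def distribution_class(pct_sites: float) -> str:
    """Return the Table 1 distribution class for a percentage of sites."""
    if pct_sites < 0:
        raise ValueError("percentage of sites cannot be negative")
    for next_start, label in NEXT_CLASS_STARTS:
        if pct_sites < next_start:
            return label
    # Table 1 gives 4.1-12 as very common and >12 as abundant.
    return "Very common" if pct_sites <= 12 else "Abundant"


if __name__ == "__main__":
    for pct in (0.1, 0.3, 1.0, 4.0, 12.0, 22.6):
        print(pct, distribution_class(pct))
```

Keeping the thresholds in a single list makes it easy to substitute a different class scheme without touching the lookup logic.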

Routine taxonomic quality control (QC) analyses used by the USEPA National Aquatic Resources Surveys (NARS) and several state, regional, and local monitoring programs for benthic macroinvertebrate samples are based on direct inter-laboratory comparisons. Randomly selected samples are identified by independent taxonomists, resulting in quantitative descriptors of data quality, error rates and potential causes, and information used for formulating corrective actions. A secondary use/added benefit of these analyses is that taxon-specific error rates are produced that can be used as direct indicators of taxon uncertainty, as weighting factors during calculation of quantitative indicators, and to help guide development of tools for biological monitoring, in general, and taxonomic identification, in particular. The purpose of this paper is to present the process used for deriving the uncertainty values using morphology-based taxonomic identifications, discuss and summarize the results, and provide recommendations for their application and next step analyses.

Methods

Data used in this analysis are from freshwater benthic macroinvertebrate samples, collected from rivers, streams, and lakes across the U.S., including the Great Lakes. All taxonomic identifications were executed in laboratories using necessary sample/specimen preparation techniques, optical equipment, and appropriate technical literature. The level of effort expended by taxonomists for identifications is standardized for individual programs or projects, and is typically genus level, with occasionally more coarse targets for selected taxa. The taxonomic comparison process used for routine QC analysis is described in detail elsewhere [27-29] and involves blind sample reidentification by independent taxonomists in separate laboratories of a randomly selected 10% of each sample lot. We compiled interlaboratory comparison data for 914 samples from 10 large programs or projects (Table 2) which are conducted at selected local, regional, State, and National scales. Samples used by each of the programs for QC analyses [27, 30] were randomly selected from the full sample load of the program, typically at a rate of approximately 10%. Thus, results reported here can be considered as representative of more than 9,000 samples. There is a total of 1,003 taxa, primarily at genus level (Fig 1), but also including more coarse levels because the level of effort was limited by defined standard procedures and/or poor specimen condition. Following Genus at 79.9 percent, the most frequently used categories were Family (14.6 percent), and Order and Subfamily (1.9 and 1.6 percent, respectively); other levels represent <1 percent of the dataset. There are occasionally “slash taxa”, such as Cricotopus/Orthocladius (Diptera: Chironomidae), and one genus-group taxon, Thienemannimyia genus group which includes the chironomid genera Conchapelopia, Rheopelopia, Helopelopia, Telopelopia, Meropelopia, Hayesomyia, and Thienemannimyia. 
Truncatelloidea (Mollusca: Gastropoda) is used as a grouping for all Hydrobiidae. Two informal/undefined groupings were used: “Tubificoid Naididae” for those taxa formerly identified as Tubificidae (Oligochaeta: Haplotaxida); and Hydracarina for water mites that could not be taken to genus level.

Table 2

Datasets compiled and used in this analysis.

Entity	Project/Program Name	Sample years	No. samples¹
Maryland Department of Natural Resources	Maryland Biological Stream Survey (MBSS)	1995–2014	135
Mississippi Department of Environmental Quality	Mississippi Phased Biological Monitoring	2002–2018	133
Prince George's County (MD) Department of the Environment	Watershed-Scale Biological Monitoring and Assessment Program	2004–2017	54
US Environmental Protection Agency/Office of Water (USEPA/OW) (National Survey)	Wadeable Streams Assessment (WSA_2004)	2004	71
US Army Corps of Engineers-Mobile District	Lake Allatoona/Upper Etowah River Watershed (LAUE) (GA, US)	2007–2009	19
USEPA/OW (National Survey)	National Lakes Assessment (NLA_2007)	2007	96
USEPA/OW (National Survey)	National Rivers and Streams Assessment (NRSA_2008)	2008	134
USEPA/OW (National Survey)	National Coastal Condition Assessment (NCCA_2015) (Great Lakes only)	2015	49
USEPA/OW (National Survey)	National Lakes Assessment (NLA_2017)	2017	120
USEPA/OW (National Survey)	National Rivers and Streams Assessment (NRSA_2018)	2018	103
TOTAL			914

¹The number of samples generally represents approximately 10% of the entire sample load for each program during the indicated time period.

Fig 1

Frequency distribution of taxa among hierarchical levels in this dataset.

Two taxon-specific characteristics are quantified: frequency of observation (FREQ), or relative rarity, and relative percent difference (RPD). The total number of individuals (count) for a given taxon is the sum across all primary taxonomists (T1), from all samples in all projects. That count is derived in the same manner for the QC taxonomists (T2). Frequency of observation (FREQ; relative rarity or commonness) for a taxon is the percentage of samples in which the taxon was recorded, calculated as the number of samples in which the taxon was found relative to the total number of samples (n = 914). The number of samples for each taxon is the average between T1 and T2. We plotted numbers of taxa versus numbers of samples using logarithmic scales to illustrate the dominance of taxa observed in a single sample. The proportional difference between two taxon-specific values is calculated using RPD [31] as an indication of the confidence with which a data user can rely on an identification result. It is calculated as follows: RPD = (|A - B| / ((A + B)/2)) * 100, where A and B are the numbers of individuals counted for a taxon by T1 and T2, respectively, and pooled across all samples and projects. Values range from 0, indicating perfect agreement, to 200, or perfect disagreement. A general characteristic of RPD is that low values indicate better consistency of identifications between/among taxonomists, thus conveying greater certainty than high values. Caution is warranted in using RPD when taxon-specific counts are low. If either T1 or T2 recorded ≥1 specimen of a taxon, and the other found none (0), RPD would be 200%. Although the number itself (200) would not be informative, it would indicate that one of the taxonomists recognized individuals of a taxon where the other did not. 
This would be a clue that some morphological key character (and, thus, the taxon) is not being recognized, or incorrect nomenclature is being applied. Other than these cautions, low values of RPD are reliable indicators of consistency. Thus, each taxon is represented by two data values, x = RPD and y = frequency of observation (FREQ) (S1 Appendix), as input for an x:y scatterplot. We used an R script to fit a nonlinear regression model relating RPD to FREQ.
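
The two taxon-level quantities described above can be illustrated with a short Python sketch. It assumes the standard RPD definition (absolute difference divided by the mean of the two counts, times 100), which reproduces the RPD values listed in Table 5. The function names are illustrative and are not taken from the authors' R script.

```python
def rpd(a: float, b: float) -> float:
    """Relative percent difference between pooled counts from T1 (a) and T2 (b)."""
    if a < 0 or b < 0:
        raise ValueError("counts cannot be negative")
    if a == 0 and b == 0:
        # A taxon recorded by neither taxonomist has no RPD.
        raise ValueError("RPD is undefined when both counts are zero")
    return abs(a - b) / ((a + b) / 2) * 100


def freq(samples_with_taxon: float, total_samples: int = 914) -> float:
    """Frequency of observation: percent of samples in which the taxon was recorded."""
    return samples_with_taxon / total_samples * 100


if __name__ == "__main__":
    # Pisidium in Table 5: T1 = 1772, T2 = 1888, recorded in 207 samples.
    print(round(rpd(1772, 1888), 1), round(freq(207), 1))
    # One taxonomist records a taxon and the other records none: RPD is 200.
    print(rpd(5, 0))
```

The second call shows why low counts need caution: a single specimen recorded by only one taxonomist yields the maximum value of 200, the same result as a large, systematic disagreement.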

Results

The first data visualization was a logarithmic plot of numbers of taxa versus numbers of samples (Fig 2). There are 304 taxa observed in only 1–2 samples, whereas the 33 most common taxa are found in 200–674 samples. Seventy-five percent (75%) of the taxa were documented in ≤20 samples. The overall distribution ranged from 200 taxa each found once (in a single sample) to one taxon, Polypedilum (Diptera: Chironomidae: Chironominae: Chironomini), occurring in 674 samples.

Fig 2

Logarithmic scatterplot illustrating that most taxa in this dataset are infrequently observed.

Taxon-specific RPD plotted against FREQ (Fig 3) illustrates that most taxa have low taxonomic uncertainty (mostly identified consistently) and are relatively infrequently encountered. The best fit nonlinear regression model is given by the exponential decay equation: RPD = 22.673 + (200.498)*e^(-0.192*FREQ), and all model terms were significant at p<0.001 (S1 Table). We delineated six uncertainty/frequency classes (UFC) based on graphic patterns (Figs 4 and 5), resulting in approximately 60% of taxa as being considered rare and identified with a high degree of certainty, that is, low RPD. All taxa are listed with associated numbers of individuals by primary and QC taxonomists, RPD, the number and percentages of samples, and UFC (S1 Appendix). Most taxa fall within UFC3 and 5 (Table 3; Figs 5 and 6), with roughly similar proportions within major taxa (Fig 7). UFC6 should be considered anomalous due to its representation by a small number of taxa (n = 6); otherwise, the mean and median values of RPD and FREQ, respectively, generally decrease and increase from UFC1-5 (Table 4, Fig 8).
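
To make the fitted curve concrete, the sketch below evaluates the reported exponential decay equation in Python. Only the three coefficients come from the text; the function name is an assumption, and the fit itself was done by the authors in R and is not reproduced here.

```python
import math

# Coefficients of the reported model: RPD = 22.673 + 200.498 * e^(-0.192 * FREQ)
FLOOR = 22.673
AMPLITUDE = 200.498
DECAY = 0.192


def predicted_rpd(freq_pct: float) -> float:
    """Model-predicted RPD for a taxon observed in freq_pct percent of samples."""
    if freq_pct < 0:
        raise ValueError("FREQ cannot be negative")
    return FLOOR + AMPLITUDE * math.exp(-DECAY * freq_pct)


if __name__ == "__main__":
    for f in (0.1, 1, 5, 10, 20, 50):
        print(f, round(predicted_rpd(f), 1))
```

The predicted value falls steeply over the first few percent of FREQ and then flattens toward the constant term, 22.673, for the most common taxa.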

Fig 3

Taxon-specific relative percent difference (RPD) plotted against frequency of observation (FREQ), or percent of total number of samples.

Fig 4

Uncertainty/frequency model categories delineated relative to the graphic pattern shown in Fig 3.

Fig 5

Distribution of taxa within uncertainty/frequency categories (UFC1-6).

Uncertainty is expressed as relative percent difference (RPD) and relative rarity or commonness as frequency of observation (FREQ). Each point represents a taxon.

Table 3

Identification uncertainty/frequency Classes (UFC).

UFC	No. taxa (n)	Percent¹	Taxonomic certainty	Commonness
1	30	3.0	High	Common
2	40	4.0	High	Moderate
3	606	60.4	High	Rare
4	79	7.9	Moderate	Rare
5	242	24.1	Low	Rare
6	6	0.6	Mixed	Mixed

The confidence (certainty) placed in taxonomic identifications is related to both frequency of observation (commonness) and the consistency of identification.

¹Percent is the percentage of taxa relative to the overall dataset.

Fig 6

Proportion of taxa falling within six uncertainty/frequency classes (UFC).

Approximately 67% of taxa are reliably identified with a high level of certainty (UFC 1–3), and 24.1% (UFC 5) are identified with a low level of certainty. Taxa within UFC 3 and 5 are also considered as rare or having a low frequency of observation.

Fig 7

Percentages of taxa in “major” benthic macroinvertebrate groups in six uncertainty/frequency classes.

1, high confidence, common; 2, high confidence, moderately common; 3, high confidence, rare; 4, moderate confidence, rare; 5, low confidence, rare; 6, outliers, mixed.

Table 4

Descriptive statistics for relative percent difference (RPD) and frequency of occurrence (FREQ).

Variable	UFC	Median	Mean	SD¹	Min¹	Max¹
RPD	1	1.3	2.7	4.0	0	20.3
RPD	2	2.3	4.6	5.0	0.3	17.6
RPD	3	7.2	13.0	14.9	0	54.1
RPD	4	66.7	67.2	6.6	54.5	82.4
RPD	5	200	176.4	38.5	85.7	200
RPD	6	90.7	92.4	43.7	45.7	162.2
FREQ	1	31.9	35.4	11.6	22.6	73.7
FREQ	2	17.2	17.6	2.6	13.9	22.2
FREQ	3	1.2	2.9	3.5	0.1	13.7
FREQ	4	0.3	0.9	1.2	0.1	5.7
FREQ	5	0.1	0.4	0.6	0.1	4.3
FREQ	6	13.3	14.6	8.5	5.9	27.5

¹SD is standard deviation, Min and Max are minimum and maximum.

Numbers of taxa (n) representing each class are given in Table 3.

Fig 8

Percentile distributions (boxplots) for frequency of taxon occurrence (FREQ) and relative percent difference (RPD) among the uncertainty-frequency classes.

FREQ is the percentage of samples for which a taxon was observed; RPD is a measure of uncertainty associated with taxonomic identifications, thus lower values equate to increased confidence. 1, high confidence, common; 2, high confidence, moderately common; 3, high confidence, rare; 4, moderate confidence, rare; 5, low confidence, rare; 6, outliers, mixed.

We selected several taxa from each UFC (Table 5) to illustrate representative, quantitative outcomes and characteristics. UFC1 is high confidence and common, with representative taxa such as Pisidium, Stenelmis, Caenis, and Hyalella; overall, taxa in this class are observed in 23–74 percent of samples. Other than Nais with an RPD of 20.3, all other taxa in this class have RPD<10. UFC2 is high confidence and moderately common; overall, ranging in frequency of observation from 14–22 percent of samples, these taxa are also identified with low uncertainty (RPD, 0.3–17.6). Example taxa of this class include Stempellinella, Baetis, Arrenurus, and Hemerodromia. UFC3 groups taxa that are identified with confidence, simultaneous with being relatively rare (low frequency of occurrence) (high confidence, rare). Taxa range from being observed in only a single sample (0.1 percent of total n), such as Anchycteis, Susperatus, Marilia, and Armiger, to just under 14 percent, 120–125 samples (Stenonema, Chimarra, Limnesia, Stictochironomus). UFC4 groups taxa that are identified with increased uncertainty and are uncommon (Fig 4) (moderate confidence, rare). RPD ranges from 55–82, and taxa represent 0.1 percent of the samples (n = 1) to 5.7 percent (n = 52). Examples of UFC4 taxa range from Halesochila, Vacupernius, and Macrelmis, each from only a single sample, to Cernotina, Teloganopsis, and Micromenetus (n = 11, 15, and 52 samples, respectively). UFC5 groups taxa that are simultaneously rare and identified with a high degree of uncertainty (low confidence, rare), with taxa being observed in 0.1–4.3 percent of samples, and identification uncertainty ranging from 85.7–200 (S1 Appendix). Example UFC5 taxa range from those of lowest observation frequency, Amphicosmoecus and Kogotus (n = 1 sample), to Placobdella and Sphaerium in 16 (1.8 percent) and 39 (4.3 percent) samples, respectively. UFC6 taxa are outliers, not clearly falling in the other classes; there are six in this dataset, three of which are genus level (Conchapelopia, Thienemannimyia, and Dero), and three, family (Polycentropodidae, Libellulidae, and Naididae).

Table 5

Selected taxa as representative examples of uncertainty/frequency classes (UFC).

UFC	Class	Order	Family	Genus	T1	T2	RPD	n	Pct.
1	Bivalvia	Veneroida	Pisidiidae	Pisidium	1772	1888	6.3	207	22.6
1	Insecta	Coleoptera	Elmidae	Stenelmis	2507	2522	0.6	248	27.1
1	Insecta	Ephemeroptera	Caenidae	Caenis	16630	16562	0.4	437	47.8
1	Malacostraca	Amphipoda	Hyalellidae	Hyalella	15844	15875	0.2	301	32.9
2	Arachnida	Trombidiformes	Arrenuridae	Arrenurus	1869	1832	2.0	169	18.5
2	Insecta	Diptera	Chironomidae	Micropsectra	2172	2211	1.8	149	16.3
2	Insecta	Ephemeroptera	Baetidae	Baetis	4221	3999	5.4	186	20.4
2	Insecta	Odonata	Coenagrionidae	Enallagma	362	431	17.4	127	13.9
3	Gastropoda	Basommatophora	Planorbidae	Gyraulus	1992	1805	9.8	125	13.7
3	Insecta	Diptera	Chironomidae	Stictochironomus	1438	1406	2.3	125	13.7
3	Insecta	Diptera	Simuliidae	Prosimulium	1363	1363	0.0	95	10.4
3	Insecta	Ephemeroptera	Leptohyphidae	Tricorythodes	3961	4001	1.0	112	12.3
4	Clitellata	Haplotaxida	Naididae	Ripistes	11	6	58.8	6	0.7
4	Crustacea	Isopoda	Asellidae	Asellus	37	19	64.3	4	0.4
4	Gastropoda	Basommatophora	Planorbidae	Micromenetus	592	284	70.3	52	5.7
4	Insecta	Plecoptera	Taeniopterygidae	Oemopteryx	24	11	74.3	6	0.7
5	Annelida	Lumbriculida	Lumbriculidae	Stylodrilus	14	79	139.8	6	0.7
5	Bivalvia	Veneroida	Sphaeriidae	Sphaerium	460	114	120.6	39	4.3
5	Insecta	Diptera	Ceratopogonidae	Mallochohelea	14	93	147.7	18	2.0
5	Insecta	Ephemeroptera	Siphlonuridae	Siphlonurus	9	26	97.1	8	0.9
6	Clitellata	Haplotaxida	Naididae	Dero	2132	3565	50.3	187	20.5
6	Insecta	Diptera	Chironomidae	Conchapelopia	120	1149	162.2	89	9.7
6	Insecta	Odonata	Libellulidae		237	64	115.0	63	6.9
6	Insecta	Trichoptera	Polycentropodidae		179	76	80.8	54	5.9

The full list of 1,003 taxa is presented in S1 Appendix. T1 and T2 are the summed counts across n samples. RPD is relative percent difference, and Pct. is the percentage of total samples (n = 914) used in this analysis.

Major taxa are most heavily represented in UFC3 and 5 (Table 6, Fig 6). Chironomidae (n = 104), Trichoptera (n = 72), Coleoptera (n = 68), Ephemeroptera (n = 59), and Plecoptera (n = 47), in descending order, are the top five major taxa in UFC3, while Coleoptera (n = 23), Chironomidae (n = 21), Annelida (n = 20), Arachnida (n = 19), and Ephemeroptera and Plecoptera (tied, each n = 16) are those for UFC5.

Table 6

Numbers of taxa by uncertainty/frequency class (UFC).

"Major" taxon	UFC1	UFC2	UFC3	UFC4	UFC5	UFC6	TOTAL
Arachnida	0	3	38	4	19	0	64
Annelida	3	2	41	6	20	2	74
Bivalvia	2	0	9	1	8	0	20
Chironomidae	15	15	104	8	21	2	165
Coleoptera	2	1	68	9	23	0	103
Crustacea	1	2	16	4	11	0	34
Ephemeroptera	1	7	59	9	16	0	92
Gastropoda	1	2	30	5	10	0	48
Plecoptera	0	0	47	6	16	0	69
Trichoptera	3	1	72	7	14	1	98
Total no. taxa	28	33	484	59	158	5	767
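
The counts in Table 6 can be turned into the share of each major group that falls in the moderate- and low-confidence classes (UFC4 and UFC5). The Python sketch below does this arithmetic. The counts are copied from Table 6, while the variable and function names and the choice to rank by the combined UFC4-5 share are illustrative assumptions rather than part of the published analysis.

```python
# Taxa per class (UFC1..UFC6) for each "major" taxon, copied from Table 6.
TABLE6 = {
    "Arachnida": (0, 3, 38, 4, 19, 0),
    "Annelida": (3, 2, 41, 6, 20, 2),
    "Bivalvia": (2, 0, 9, 1, 8, 0),
    "Chironomidae": (15, 15, 104, 8, 21, 2),
    "Coleoptera": (2, 1, 68, 9, 23, 0),
    "Crustacea": (1, 2, 16, 4, 11, 0),
    "Ephemeroptera": (1, 7, 59, 9, 16, 0),
    "Gastropoda": (1, 2, 30, 5, 10, 0),
    "Plecoptera": (0, 0, 47, 6, 16, 0),
    "Trichoptera": (3, 1, 72, 7, 14, 1),
}


def share_ufc4_5(counts: tuple) -> float:
    """Percent of a group's taxa that fall in UFC4 or UFC5."""
    return (counts[3] + counts[4]) / sum(counts) * 100


shares = {group: round(share_ufc4_5(c), 1) for group, c in TABLE6.items()}
# Highest share of uncertain identifications first.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for group, pct in ranked:
        print(f"{group:15s} {pct:5.1f}")
```

Ranked this way, Bivalvia and Crustacea have the largest proportions of taxa in UFC4-5 and Chironomidae the smallest.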

Discussion

Taxa with the highest RPD values, that is, with greater uncertainty, are documented in smaller numbers of sites (Table 7), corresponding with the very rare and rare distribution classes of [26], and clearly illustrated by UFC1-2 versus UFC4-5 (Fig 7). In general, the rarer a taxon is, the greater the uncertainty associated with its identity; conversely, increasingly common taxa are better known and identified with elevated confidence. This observation is demonstrated by the near mirror images of error rate (RPD) and rarity (FREQ) for UFC1-5 (Fig 8) and reflects the outcome predicted by [25], i.e., familiarity is born of repeated encounters. This also speaks, in part, to the collective sense of our limited understanding of biological diversity, and of the most appropriate and effective ways of communicating about that diversity.

Table 7

Relating relative percent difference (RPD) to distribution classes.

Nijboer and Verdonschot (2004)		RPD (this study)
Distribution class	Pct. of sites	n	Median	Mean	SD	Min	Max
Very rare	<0.16	200	200.0	139.2	88.0	0	200
Rare	0.16–0.5	235	40.0	61.9	63.8	0	200
Uncommon	0.6–1.5	181	21.6	44.8	53.8	0	200
Common	1.6–4.0	151	8.5	24.1	34.9	0	180.3
Very common	4.1–12	145	4.9	13.2	23.3	0	162.2
Abundant	>12	91	2.3	7.6	13.8	0	100.7

“n” is the number of taxa that would be categorized as belonging to the [26] distribution classes based on frequency of occurrence in this study.

Higher level macroinvertebrate taxa in this analysis shown to have greater identification confidence and consistency are midges (Insecta: Diptera: Chironomidae), caddisflies (Insecta: Trichoptera), beetles (Insecta: Coleoptera), snails (Mollusca: Gastropoda), and stoneflies (Insecta: Plecoptera) (Fig 7), as they are mostly made up of finer level taxa within UFC1-3. Conversely, higher level taxa for which identification data seem to be more problematic (i.e., greater uncertainty) are bivalves (Mollusca: Bivalvia) and Crustacea (Arthropoda); these groups have a higher percentage of taxa in UFC4-5. Several potential uses of UFC designations are relevant to informing data analysts and data users on the extent to which confidence can be placed in results. They include being used as taxon-specific weighting factors for calculating biological indicator values, such as indexes of biological integrity (IBI), River Invertebrate Prediction and Classification System (RIVPACS) models, various diversity calculations, species protection, or habitat prioritization. Testing is necessary to determine the effect on indicator values, but a weighted-average index could be formulated to elevate or restrict the importance of a taxon due to the relative potential of identification error. Similar to use of stressor tolerance values in the Hilsenhoff Biotic Index (HBI), UFC numbers could be used as taxon count modifiers. This approach would retain the inherent value and information content of organism identity, and simultaneously help objectively moderate the influence of those taxa on quantitative indicator outcomes. Taxa demonstrated as having elevated identification uncertainty could be targeted for basic focused research, including morphological re-description, dichotomous identification keys, genetic fingerprinting, or other tools. 
Commonness values (FREQ) for individual taxa would allow users of comprehensive identification manuals (such as, for example, [32]) to evaluate the relative rarity. The need for independent verification of an identification result would be emphasized for those with known elevated error rates (high RPD). Another potential use of these results would be in helping target individual taxa for determining causes, beyond lack of familiarity, of higher error rates. A common cause is known to be specimens in poor condition and/or small body size (early life stages, or instars). An outcome of such an investigation might be to specify standard procedures for some taxa, including for sampling, handling, preservation, and identification. An example of this would be a requirement that all larval Chironomidae be slide-mounted for examination under a compound microscope. We do not necessarily advocate this, as slide-mounting is not consistently needed by all laboratories or taxonomists. Rather, we stress that the taxonomist use whatever method is needed to attain the target taxonomic level as defined by program or study goals. The goal in this case is not to require that all taxonomists (or taxonomic technicians) slide-mount all chironomid midges; rather, the goal is to acquire genus level data for the taxon. In some cases, slide-mounting might be needed, in others, it would not. Thus, the need for such actions would be determined on a case-by-case, taxon-by-taxon, or even taxonomist-by-taxonomist basis, but the goal of genus level data remains the same. Our interest is in seeing UFC values used as one tool to enhance biological assessments, whether as direct input to indicator calculations, as information to help formulate additional analytical questions, or to help set or justify interpretive procedures. 
This analysis was made possible by access to available output of inter-taxonomist comparisons and demonstrates added benefits of routine QC and operational data management routines.
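
The suggestion that UFC numbers could act as taxon count modifiers, in the manner of tolerance values in the HBI, can be sketched as follows. This is a hypothetical Python illustration only. The paper proposes no specific weights, so the values in UFC_WEIGHT, the function name, and the example tolerance values are all assumptions made for demonstration. With every weight set to 1, the function reduces to the usual abundance-weighted mean tolerance, sum(n_i * t_i) / N.

```python
# Hypothetical down-weighting factors by UFC. These are NOT from the paper;
# they only illustrate how uncertain identifications could count for less.
UFC_WEIGHT = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.5, 5: 0.25, 6: 0.5}


def weighted_biotic_index(records, weights=UFC_WEIGHT):
    """Abundance-weighted mean tolerance, with counts scaled by a UFC weight.

    records: iterable of (count, tolerance_value, ufc) tuples, one per taxon.
    """
    numerator = 0.0
    denominator = 0.0
    for count, tolerance, ufc in records:
        effective_count = weights[ufc] * count
        numerator += effective_count * tolerance
        denominator += effective_count
    if denominator == 0:
        raise ValueError("no individuals with non-zero weight")
    return numerator / denominator


if __name__ == "__main__":
    # Made-up sample: one UFC1 taxon and one UFC5 taxon, 10 individuals each.
    sample = [(10, 4.0, 1), (10, 8.0, 5)]
    unweighted = weighted_biotic_index(sample, weights={u: 1.0 for u in range(1, 7)})
    print(unweighted, weighted_biotic_index(sample))
```

In the made-up sample, down-weighting the UFC5 taxon pulls the index toward the tolerance value of the confidently identified taxon. Whether such a shift is desirable is the kind of effect the authors note would need testing.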

Uncertainty/frequency dataset, with benthic macroinvertebrate phylogenetic/classification hierarchy.

Primary and quality control counts (T1 and T2, respectively) are cumulative across n samples, relative percent difference (RPD), percent of samples, uncertainty/frequency class (UFC), and taxonomic rank. (XLSX)

Nonlinear regression of FREQ against RPD.

(XLSX)

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present. 8 Oct 2020 PONE-D-20-27394 Relationship of taxonomic error to frequency of observation PLOS ONE Dear Dr. Stribling, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Editor's comments: The reviewer was very enthusiastic about the manuscript but did have a number of minor suggestions about improvements that could be made prior to publication. Please make as many of these as you can. Please submit your revised manuscript by Nov 22 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. 
Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,
Judi Hewitt
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Financial Disclosure section:

"Approximately 10% of necessary level of effort in initiating this project was contracted to Tetra Tech, Inc. (JBS) (EP-C-14-016, Work Assignment 4-13) by the US Environmental Protection Agency/Office of Water/Office of Wetlands, Oceans, and Watersheds/Assessment and Watershed Protection Division. The work was in support of the Agency's National Aquatic Resources Surveys: https://www.epa.gov/national-aquatic-resource-surveys The sponsors played no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript."

We note that you received funding from a commercial source: Tetra Tech, Inc

* Please provide an amended Competing Interests Statement that explicitly states this commercial funder, along with any other relevant declarations relating to employment, consultancy, patents, products in development, marketed products, etc.
* Within this Competing Interests Statement, please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials." (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.
* Please include your amended Competing Interests Statement within your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

3. Please amend the manuscript submission data (via Edit Submission) to include author: Erik W. Leppo

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions.
Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Manuscript Number PONE-D-20-27394
Relationship of taxonomic error to frequency of observation

Ecological monitoring and taxonomic assessment is fundamental for effective management of freshwater ecosystems.
Error is rarely measured in biomonitoring surveys because it is often assumed to be small and constant, but is considerably more widespread than assumed. This short manuscript provides an interesting method of quantifying taxon-specific error rates for macroinvertebrate identification. Macroinvertebrate identification requires a high level of skill and training, and misidentification of taxa, sorting efficiency and taxon enumeration can all impact data quality.

Interpreting rare taxa data can be problematic for freshwater ecologists and, in my experience, rare and often misidentified taxa are not handled consistently between regional water authorities and agencies. With tools to better measure and quantify biodiversity values from national datasets, researchers can more accurately assess the distinctiveness of rare taxon groups. This manuscript provides a means to better understand freshwater macroinvertebrate assemblages and, more importantly, highlight focus taxa groups that may require further investigation, better taxonomic keys, management or protection.

Furthermore, this manuscript indirectly demonstrates the importance of implementing regular interlaboratory quality control procedures to identify causes and sources of error for macroinvertebrate samples. This will continue to ensure that sorting laboratories are able to produce consistent, high-quality data.

I have some minor points to note, mostly with respect to inconsistencies in the formatting and presentation of the figures. However, some of these format issues may be a reflection of the journal requirements.

1. Lines 221, 226, 229, 230, 233-4, 239, 242: can you provide context on the taxonomic groups (e.g., order) to which the mentioned taxa belong, as done for Nais (line 222)? For a non-taxonomist or ecologist without knowledge of North American macroinvertebrates this would help immensely, rather than having to cross-reference with supplementary files.

2. Although there is no mention of macroinvertebrate size (in relation to instar size), can you comment on whether instar size correlates with an increase in RPD error?

3. Lines 222, 224, 232. Consistency required for expressing percentages.

4. Figure 1. Can this axis extend to 100% as seen in Figure 5? Do this also for Fig. 3. The titles of X and Y axes are in upper case and not consistent with the other figures.

5. Figures 3, 4 and 5. There is considerable overlap in these three figures. Figure 3 is not necessary as this information is repeated in Figure 5. Although there is a subtle difference in how these data points are plotted, they are essentially the same graph. Instead, can figures 4 and 5 be combined? For example, can each of the data points that correspond to a UFC category be colour coded or superimposed on Figure 4?

6. Figure 6. A 2D rather than a 3D plot would be preferred. A legend is required (1 = High confidence, common; 2 = High confidence, moderately common etc) to avoid the reader having to reference back to Figure 4.

7. Figure 7. This figure needs to be produced in a similar format to Figure 1, i.e., consistent formatting of Y and X axis titles. As noted above, it is suggested a legend is added to this figure.

8. Figure 8. Y axis uses “FREQ(%)” but “FREQ” is used elsewhere. Note the use of “/” in manuscript text but hyphen in X axis and figure caption (line 207). See also lines 204, 207 and replace hyphen with “/”. This figure also requires a legend for the frequency classes.

9. In the discussion are you able to provide an overview or summary of which groups you found to be more problematic than others and, for example, are likely in need of more comprehensive or updated identification guides?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

13 Oct 2020

Stribling and Leppo, 10/09/20
“Relationship of taxonomic error to frequency of observation”
PONE-D-20-27394

(RESPONSES TO COMMENTS ARE IN ALL CAPS OR QUOTATION MARKS)

1. Lines 221, 226, 229, 230, 233-4, 239, 242: can you provide context on the taxonomic groups (e.g., order) to which the mentioned taxa belong, as done for Nais (line 222)? For a non-taxonomist or ecologist without knowledge of North American macroinvertebrates this would help immensely, rather than having to cross-reference with supplementary files.

I APPRECIATE THIS COMMENT. HOWEVER, IS IT NOT SUFFICIENT THAT THE TAXA MENTIONED IN THIS SECTION, AS REPRESENTATIVE AND SELECTED EXAMPLES, ARE GIVEN WITH THEIR CLASSIFICATION HIERARCHY IN TABLE 4? I WOULD CONTEND THAT IT IS SUFFICIENT, AND ACTUALLY MAKES THE TEXT AN EASIER READ WITHOUT THE INLINE PARENTHETICALS. FOR CONSISTENCY, I DELETED THE HIERARCHY PARENTHETICAL FOR NAIS.

2. Although there is no mention of macroinvertebrate size (in relation to instar size), can you comment on whether instar size correlates with an increase in RPD error?

THE STRIBLING ET AL. 2008 PAPER CITED GOES INTO SOME OF THE CAUSES AND MAGNITUDES OF ERROR. BUT YES, BODY SIZE, LIFE STAGE, AND SPECIMEN CONDITION ALL PLAY AN OUTSIZED ROLE IN IDENTIFICATION ERROR. THE INTERESTING PART IS THAT TAXONOMISTS WHO ARE MORE EXPERIENCED AND HAVE MORE IN-DEPTH TRAINING ARE ABLE TO OBTAIN SOLID RESULTS (=ACCURATE, CONSISTENT) MORE EASILY THAN THOSE WHO ARE LESS EXPERIENCED. I ADDED THE FOLLOWING TEXT IN THE DISCUSSION SECTION:

“Another potential use of these results would be in helping target individual taxa for determining causes, beyond lack of familiarity, of higher error rates. A common cause is known to be specimens in poor condition and/or small body size (early life stages, or instars). An outcome of such an investigation might be to specify standard procedures for some taxa, including for sampling, handling, preservation, and identification. An example of this might be a requirement that all larval Chironomidae be slide-mounted for examination under a compound microscope. We do not necessarily advocate this, as slide-mounting is not consistently needed by all laboratories or taxonomists. Rather, we stress that the taxonomist use whatever method is needed to attain the target taxonomic level as defined by program or study goals. The goal in this case is not to require that all taxonomists (or taxonomic technicians) slide-mount all chironomid midges; rather, the goal is to acquire genus level data for the taxon. In some cases, slide-mounting might be needed, in others, it would not. Thus, the need for such actions would be determined on a case-by-case, taxon-by-taxon, or even taxonomist-by-taxonomist basis, but the goal of genus level data remains the same.”

3. Lines 222, 224, 232. Consistency required for expressing percentages.

I AM UNSURE OF THE ISSUE HERE, AND WHAT CHANGES ARE BEING REQUESTED.

4. Figure 1. Can this axis extend to 100% as seen in Figure 5?

DONE

Do this also for Fig. 3.

DONE

The titles of X and Y axes are in upper case and not consistent with the other figures.

CORRECTED

5. Figures 3, 4 and 5. There is considerable overlap in these three figures. Figure 3 is not necessary as this information is repeated in Figure 5. Although there is a subtle difference in how these data points are plotted, they are essentially the same graph. Instead, can figures 4 and 5 be combined? For example, can each of the data points that correspond to a UFC category be colour coded or superimposed on Figure 4?

I BELIEVE ALL THREE FIGURES ARE NECESSARY FOR THE PAPER. FIGURE 3 PROVIDES THE READER WITH AN UNCOMPLICATED VISUALIZATION OF THE DATA DISTRIBUTION, FIGURE 4 IS A DEMONSTRATION OF THE CATEGORIES, AND FIGURE 5 ACCOMPLISHES THE SUGGESTED SUPERIMPOSITION. IT JUST SEEMS LIKE THERE IS BETTER CLARITY IN HOW THE UFC STRUCTURE IS PRESENTED IF ALL THREE ARE RETAINED. AS THAT IS THE CORE FOUNDATION OF THIS ANALYSIS, I BELIEVE IT IS BEST PRESENTED AS IT IS.

6. Figure 6. A 2D rather than a 3D plot would be preferred. A legend is required (1 = High confidence, common; 2 = High confidence, moderately common etc) to avoid the reader having to reference back to Figure 4.

CORRECTION MADE; LEGEND ADDED TO CAPTION.

7. Figure 7. This figure needs to be produced in a similar format to Figure 1, i.e., consistent formatting of Y and X axis titles. As noted above, it is suggested a legend is added to this figure.

I WILL ADD THE LEGEND TO THE CAPTION. HOWEVER, I THINK I DISAGREE WITH THIS SUGGESTION ABOUT THE BARCHART STYLE.
THE TWO CHARTS ARE SHOWING DIFFERENT KINDS OF DATA, AND BY USING STACKED BAR CHARTS/PERCENTAGES, MUCH SPACE IS SAVED. THAT IS, I FEEL CONFIDENT THIS IS THE MOST EFFICIENT TECHNIQUE FOR SHOWING HOW EACH OF THE MAJOR TAXA SORT AMONG THE CATEGORIES.

8. Figure 8. Y axis uses “FREQ(%)” but “FREQ” is used elsewhere.

THESE FIGURE PANELS ARE UNIQUE, AND I BELIEVE THEY WARRANT SOMEWHAT UNIQUE AXIS TITLES. I WOULD LIKE TO OPT TOWARD SPELLING THEM OUT, SUCH AS, FROM TOP TO BOTTOM: “Frequency of observation (FREQ)” and “Relative percent difference (RPD)”.

Note the use of “/” in manuscript text but hyphen in X axis and figure caption (line 207). See also lines 204, 207 and replace hyphen with “/”.

DONE.

This figure also requires a legend for the frequency classes.

I ADDED A LEGEND TO THE CAPTION FOR THIS, AS WITH THE PRIOR FIGURES.

9. In the discussion are you able to provide an overview or summary of which groups you found to be more problematic than others and, for example, are likely in need of more comprehensive or updated identification guides?

THE FOLLOWING TEXT WAS ADDED TO THE DISCUSSION SECTION:

“Higher level macroinvertebrate taxa in this analysis shown to have greater identification confidence and consistency are midges (Insecta: Diptera: Chironomidae), caddisflies (Insecta: Trichoptera), beetles (Insecta: Coleoptera), snails (Mollusca: Gastropoda), and stoneflies (Insecta: Plecoptera) (Fig 7), as they are mostly made up of finer level taxa with UFC1-3. Conversely, higher level taxa for which identification data seem to be more problematic (i.e., greater uncertainty) are bivalves (Mollusca: Pelecypoda) and Crustacea (Arthropoda). Each of these groups have a higher percentage of taxa in UFC4-5.”

Submitted filename: Response to reviewers.docx

23 Oct 2020

Relationship of taxonomic error to frequency of observation
PONE-D-20-27394R1

Dear Dr. Stribling,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Judi Hewitt
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

29 Oct 2020

PONE-D-20-27394R1
Relationship of taxonomic error to frequency of observation

Dear Dr. Stribling:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Judi Hewitt
Academic Editor
PLOS ONE

14 in total

Review 1. Evolutionary informatics: unifying knowledge about the diversity of life.

Authors: Cynthia S Parr; Robert Guralnick; Nico Cellinese; Roderic D M Page
Journal: Trends Ecol Evol Date: 2011-12-10 Impact factor: 17.712

2. A global assessment of endemism and species richness across island and mainland regions.

Authors: Gerold Kier; Holger Kreft; Tien Ming Lee; Walter Jetz; Pierre L Ibisch; Christoph Nowicki; Jens Mutke; Wilhelm Barthlott
Journal: Proc Natl Acad Sci U S A Date: 2009-05-21 Impact factor: 11.205

3. Folk taxonomies and biological classification.

Authors: B Berlin; D E Breedlove; P H Raven
Journal: Science Date: 1966-10-14 Impact factor: 47.728

4. Rarity is a more reliable indicator of land-use impacts on soil invertebrate communities than other diversity metrics.

Authors: Andrew Dopheide; Andreas Makiola; Kate H Orwin; Robert J Holdaway; Jamie R Wood; Ian A Dickie
Journal: Elife Date: 2020-05-19 Impact factor: 8.140

5. Challenges with using names to link digital biodiversity information.

Authors: David Patterson; Dmitry Mozzherin; David Peter Shorthouse; Anne Thessen
Journal: Biodivers Data J Date: 2016-05-25

6. Phylogenetically informed spatial planning is required to conserve the mammalian tree of life.

Authors: Dan F Rosauer; Laura J Pollock; Simon Linke; Walter Jetz
Journal: Proc Biol Sci Date: 2017-10-25 Impact factor: 5.349

7. The geography of biodiversity change in marine and terrestrial assemblages.

Authors: Shane A Blowes; Sarah R Supp; Laura H Antão; Amanda Bates; Helge Bruelheide; Jonathan M Chase; Faye Moyes; Anne Magurran; Brian McGill; Isla H Myers-Smith; Marten Winter; Anne D Bjorkman; Diana E Bowler; Jarrett E K Byrnes; Andrew Gonzalez; Jes Hines; Forest Isbell; Holly P Jones; Laetitia M Navarro; Patrick L Thompson; Mark Vellend; Conor Waldock; Maria Dornelas
Journal: Science Date: 2019-10-18 Impact factor: 47.728

8. Is DNA barcoding actually cheaper and faster than traditional morphological methods: results from a survey of freshwater bioassessment efforts in the United States?

Authors: Eric D Stein; Maria C Martinez; Sara Stiles; Peter E Miller; Evgeny V Zakharov
Journal: PLoS One Date: 2014-04-22 Impact factor: 3.240

9. bold: The Barcode of Life Data System (http://www.barcodinglife.org).

Authors: Sujeevan Ratnasingham; Paul D N Hebert
Journal: Mol Ecol Notes Date: 2007-05-01

10. Scientific research on animal biodiversity is systematically biased towards vertebrates and temperate regions.

Authors: Mark A Titley; Jake L Snaddon; Edgar C Turner
Journal: PLoS One Date: 2017-12-14 Impact factor: 3.240