Jeremy A Miller, Donat Agosti, Lyubomir Penev, Guido Sautter, Teodor Georgiev, Terry Catapano, David Patterson, David King, Serrano Pereira, Rutger Aldo Vos, Soraya Sierra.
Abstract
Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal, where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data and can extract structured primary biodiversity data that can be aggregated and jointly queried with data from other Darwin Core-compatible sources, and we show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
Keywords: Araneae; Biodiversity informatics; Data mining; Open access; Spiders; Taxonomy; XML markup
Year: 2015 PMID: 26023286 PMCID: PMC4442254 DOI: 10.3897/BDJ.3.e5063
Source DB: PubMed Journal: Biodivers Data J ISSN: 1314-2828
Figure 2a. 12,377 publications listed in the 2014 World Spider Catalog. Zootaxa is the top-ranking venue with 509 titles, representing just over 4% of spider taxonomy.
Open access articles in Zootaxa containing treatments on spiders as of August 2014. For each article, the page count, number of treatments (species rank, higher rank, and total), number of specimens, DOI, and ZooBank LSID (where available) are specified.
| Source | Page count | Species-rank treatments | Higher-rank treatments | Total treatments | Specimens | DOI | ZooBank LSID |
|---|---|---|---|---|---|---|---|
| | 14 | 2 | 2 | 4 | 36 | | |
| | 20 | 6 | 2 | 8 | 23 | | |
| | 4 | 1 | 1 | 2 | 1 | | |
| | 23 | 19 | 7 | 26 | 172 | | |
| | 6 | 4 | 1 | 5 | 13 | | |
| | 6 | 1 | 1 | 2 | 4 | | |
| | 19 | 2 | 1 | 3 | 274 | | |
| | 21 | 13 | 0 | 13 | 65 | | |
| | 24 | 4 | 1 | 5 | 304 | | |
| | 25 | 6 | 2 | 8 | 118 | | |
| | 8 | 3 | 0 | 3 | 3 | | |
| | 34 | 7 | 1 | 8 | 1978 | | |
| | 14 | 1 | 0 | 1 | 46 | | |
| | 4 | 1 | 0 | 1 | 4 | | |
| | 18 | 2 | 1 | 3 | 15 | | |
| | 10 | 2 | 2 | 4 | 80 | | |
| | 21 | 13 | 3 | 16 | 41 | | |
| | 24 | 8 | 1 | 9 | 160 | | |
| | 127 | 60 | 0 | 60 | 730 | | |
| | 4 | 1 | 1 | 2 | 3 | | |
| | 36 | 9 | 2 | 11 | 133 | | |
| | 17 | 7 | 4 | 11 | 6 | | |
| | 11 | 1 | 0 | 1 | 2 | | |
| | 12 | 7 | 1 | 8 | 54 | | |
| | 4 | 1 | 0 | 1 | 45 | | |
| | 23 | 3 | 0 | 3 | 48 | | |
| | 14 | 3 | 1 | 4 | 38 | | |
| | 24 | 5 | 1 | 6 | 54 | | |
| | 10 | 2 | 1 | 3 | 99 | | |
| | 12 | 3 | 1 | 4 | 39 | | |
| | 11 | 3 | 0 | 3 | 67 | | |
| | 14 | 1 | 2 | 3 | 33 | | |
| | 19 | 5 | 0 | 5 | 80 | | |
| | 19 | 2 | 2 | 4 | 51 | | |
| | 14 | 1 | 0 | 1 | 42 | | |
| | 8 | 1 | 0 | 1 | 1 | | |
| | 8 | 1 | 0 | 1 | 10 | | |
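In the table above, the total-treatments column should equal the sum of species-rank and higher-rank treatments for every article. The following is a minimal Python sketch of that consistency check, using only the first four rows of the table; the tuple layout and the function name `inconsistent_rows` are assumptions made for illustration and are not part of the original workflow.

```python
# Each tuple: (page_count, species_rank, higher_rank, total, specimens).
# Values are the first four rows of the table; the layout is an assumption.
rows = [
    (14, 2, 2, 4, 36),
    (20, 6, 2, 8, 23),
    (4, 1, 1, 2, 1),
    (23, 19, 7, 26, 172),
]


def inconsistent_rows(rows):
    """Return rows where species-rank + higher-rank treatments != total."""
    return [r for r in rows if r[1] + r[2] != r[3]]


# An empty list means every checked row is internally consistent.
bad = inconsistent_rows(rows)
print("inconsistent rows:", bad)
```

The same check could be run over all 37 rows once they are loaded from a structured source.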
Figure 6. Dashboard charts summarizing content from species-rank treatments published in open access articles in Zootaxa and Biodiversity Data Journal containing treatments on spiders, filtered to show only specimens from the collection of the California Academy of Sciences (Suppl. material 10). CAS was the institution associated with the largest number of specimens in this body of literature. All content shown here is from treatments published in Zootaxa; no CAS specimens were cited in Biodiversity Data Journal treatments on spiders.
Figure 7. Dashboard charts summarizing content from species-rank treatments published in open access articles in Zootaxa and Biodiversity Data Journal containing treatments on spiders, filtered to show only specimens collected in Russia (Suppl. material 11). Russia was the country associated with the largest number of specimens in this body of literature. All content shown here is from treatments published in Zootaxa; no specimens collected in Russia were cited in Biodiversity Data Journal treatments on spiders.
Figure 8. Dashboard charts summarizing content from species-rank treatments published in open access articles in Zootaxa and Biodiversity Data Journal containing treatments on spiders, filtered to show only specimens collected by Y. M. Marusik (Suppl. material 12). Marusik was the collector associated with the largest number of specimens in this body of literature. Note that this count excludes specimens that Marusik collected collaboratively with others (see Discussion: Tracking Individuals). All content shown here is from articles published in Zootaxa; no specimens collected by Marusik were cited in Biodiversity Data Journal treatments on spiders.
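The filters behind Figures 6 to 8 (institutional collection, collecting country, collector) can be illustrated with a small Python sketch. It assumes specimen records are available as dictionaries keyed by Darwin Core term names (`institutionCode`, `country`, `recordedBy`, `individualCount`); the records themselves and the helper names are hypothetical and are not taken from the paper's dataset or dashboard code. It also shows why an exact match on the collector field leaves out collaboratively collected specimens, as noted for Figure 8.

```python
# Hypothetical specimen records. Keys follow Darwin Core term names;
# all values are invented for illustration only.
records = [
    {"institutionCode": "CAS", "country": "Russia",
     "recordedBy": "Y. M. Marusik", "individualCount": 3},
    {"institutionCode": "EXAMPLE", "country": "Russia",
     "recordedBy": "Y. M. Marusik; A. Example", "individualCount": 5},
    {"institutionCode": "CAS", "country": "Madagascar",
     "recordedBy": "B. Example", "individualCount": 2},
]


def filter_records(records, **criteria):
    """Keep records whose fields exactly equal every given criterion."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]


def specimen_total(records):
    """Sum the specimen counts over a list of records."""
    return sum(r["individualCount"] for r in records)


by_institution = filter_records(records, institutionCode="CAS")
by_country = filter_records(records, country="Russia")
# Exact matching skips the record where the collector worked with others.
by_collector = filter_records(records, recordedBy="Y. M. Marusik")

print(specimen_total(by_institution),
      specimen_total(by_country),
      specimen_total(by_collector))
```

Exact matching is the simplest design choice; attributing collaborative records to each individual would require splitting the collector field into separate names first.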
Figure 9a. Data from all treatments in Kronestedt and Marusik (2011).
Figure 9b. Data from one treatment in Kronestedt and Marusik (2011).
Figure 10. Dashboard charts summarizing data published in open access articles in Zootaxa and Biodiversity Data Journal containing treatments on spiders, filtered to show only one species (Suppl. material 15). Data on this species were included in three treatments, all published in Biodiversity Data Journal.
Figure 11. Dashboard charts summarizing data published in open access articles in Zootaxa and Biodiversity Data Journal containing treatments on spiders, filtered to show only one lead author: Jeremy A. Miller (Suppl. material 16). Miller was lead author on two publications in Biodiversity Data Journal and one in Zootaxa, and was the only lead author on open access publications featuring spider treatments in both journals. Note that in addition to the three articles on which he is lead author, he is also a contributing author to a fourth article (Wang et al. 2010), but content from this article is not included here (see Discussion: Tracking individuals).
Figure 12. Excerpt from a taxonomic treatment with ambiguously structured materialsCitations data in the source document. The top frame shows the published PDF; the lower frame shows the same content in GoldenGATE with the treatment and materialsCitation tags revealed. The source document is ambiguous about how many paratype specimens are deposited in which natural history collection. This is represented in XML by associating the collection event data (place, time, collector) with each of the listed institutional collections, but no quantity of specimens is assigned to any collection. The 110 male and 44 female specimens are also associated with the collection event data, but with no institutional collection.
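The representation described for Figure 12 can be sketched in Python as follows. Only the `treatment` and `materialsCitation` element names and the counts of 110 males and 44 females come from the caption; the attribute names, collection codes, place, date, and collector values are hypothetical placeholders and do not reproduce the actual GoldenGATE schema or the source document.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. Attribute names and values are placeholders;
# only the element names and the 110/44 counts come from the caption.
XML = """
<treatment>
  <materialsCitation country="Exampleland" collector="A. Collector"
                     date="2000-01-01" collectionCode="AAA"/>
  <materialsCitation country="Exampleland" collector="A. Collector"
                     date="2000-01-01" collectionCode="BBB"/>
  <materialsCitation country="Exampleland" collector="A. Collector"
                     date="2000-01-01"
                     specimenCountMale="110" specimenCountFemale="44"/>
</treatment>
""".strip()


def summarize(xml_text):
    """Return per-collection specimen counts and the unassigned count."""
    root = ET.fromstring(xml_text)
    per_collection = []
    unassigned = 0
    for citation in root.iter("materialsCitation"):
        count = (int(citation.get("specimenCountMale", 0))
                 + int(citation.get("specimenCountFemale", 0)))
        code = citation.get("collectionCode")
        if code is None:
            # Specimens tied to the collection event but to no institution.
            unassigned += count
        else:
            # Institution tied to the event, with no quantity assigned.
            per_collection.append((code, count))
    return per_collection, unassigned


per_collection, unassigned = summarize(XML)
print(per_collection, unassigned)
```

Under these assumptions, each listed collection carries the shared event data with a count of zero, while the 154 specimens remain attached to the event without an institution, so the ambiguity in the source is preserved rather than resolved by guesswork.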