Damon P. Little
Abstract
PREMISE: The automated recognition of Latin scientific names within vernacular text has many applications, including text mining, search indexing, and automated specimen-label processing. Most published solutions are computationally inefficient, incapable of running within a web browser, and focused on texts in English, thus omitting a substantial portion of the biodiversity literature.
Keywords: convolutional neural networks (CNN); named entity recognition (NER); taxonomic name recognition (TNR)
Year: 2020 PMID: 32765977 PMCID: PMC7394707 DOI: 10.1002/aps3.11378
Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936
Summary of the primary language of 682,396 titles by publication category (languages represented by fewer than 100 titles per category are omitted). The data were extracted from library catalog records held by the New York Botanical Garden (NY), the Biodiversity Heritage Library (BHL), and the American Museum of Natural History (AMNH).
| Language | NY monographs | NY periodicals | BHL monographs and periodicals | AMNH monographs | AMNH periodicals | Training/testing words |
|---|---|---|---|---|---|---|
| Chinese | 0.58% | 0.84% | 0.21% | 0.35% | 0.84% | 0 |
| Czech | 0.25% | — | 0.03% | 0.15% | 1.31% | 2,777,209 |
| Danish | 0.28% | — | 0.11% | 0.12% | — | 362,003 |
| Dutch | 0.70% | 0.69% | 0.66% | 0.35% | 1.27% | 309,452 |
| English | 60.77% | 64.24% | 77.90% | 81.23% | 59.05% | 619,032 |
| French | 7.52% | 4.50% | 5.43% | 4.34% | 7.64% | 361,137 |
| German | 10.29% | 4.21% | 9.89% | 6.32% | 10.82% | 2,166,423 |
| Italian | 1.43% | 3.06% | 0.80% | 0.90% | 2.70% | 430,971 |
| Japanese | 0.48% | 1.36% | — | 0.41% | 1.54% | 0 |
| Latin | 3.56% | — | 0.68% | 0.23% | — | 181,241 |
| Norwegian | 0.19% | 0.90% | 0.08% | 0.13% | — | 1,342,803 |
| Polish | 0.73% | 0.94% | — | 0.15% | 1.08% | 3,250,740 |
| Portuguese | 1.66% | 6.33% | 0.42% | 0.60% | 2.43% | 629,128 |
| Russian | 5.06% | 2.39% | 0.11% | 2.72% | 4.80% | 0 |
| Spanish | 4.03% | 7.22% | 2.48% | 1.76% | 5.42% | 790,153 |
| Swedish | 0.95% | — | 0.27% | 0.23% | — | 500,226 |
| Total coverage | 98.48% | 96.68% | 99.07% | 99.99% | 98.90% | — |
The number of words available for neural network training and testing is reported (words written in non‐Latin script were not included).
Neural network structures. Layers are numbered from input to output. Softmax activation was used for the dense output layers; rectified linear unit activation was used for the input and hidden layers. The longest word analyzed was 58 characters, so the LCNN used a padded input length of 58. Irrespective of word length, the ECNN, PDFFNN, uEDFFNN, and bEDFFNN used fixed input lengths of 16, 64, 5, and 5, respectively. (A code sketch of the LCNN column follows the table.)
| Layer | ECNN | LCNN | PDFFNN | uEDFFNN | bEDFFNN |
|---|---|---|---|---|---|
| Layer 1 (input) | reshape (7 × 16) | trainable embedding (27 × 64) | dense (kernel = 64 × 64) + batch normalization | dense (kernel = 5 × 5) + batch normalization | dense (kernel = 5 × 5) + batch normalization |
| Layer 2 | convolutional 1D (kernel = 4 × 16 × 32; bias = 32) | random dropout (rate = 0.025) | dense (kernel = 64 × 240) + batch normalization | dense (kernel = 5 × 64) + batch normalization | dense (kernel = 5 × 64) + batch normalization |
| Layer 3 | convolutional 1D (kernel = 4 × 32 × 48; bias = 48) | convolutional 1D (kernel = 4 × 64 × 32; bias = 32) | dense (kernel = 240 × 120) + batch normalization | dense (kernel = 64 × 32) + batch normalization | dense (kernel = 64 × 32) + batch normalization |
| Layer 4 | convolutional 1D (kernel = 4 × 48 × 64; bias = 64) | maximum pooling 1D | random dropout (rate = 0.2) | dense (kernel = 32 × 21) + batch normalization | dense (kernel = 32 × 21) + batch normalization |
| Layer 5 | convolutional 1D (kernel = 4 × 64 × 80; bias = 80) | convolutional 1D (kernel = 4 × 32 × 64; bias = 64) | dense (kernel = 120 × 2; bias = 2) | dense (kernel = 21 × 16) + batch normalization | dense (kernel = 21 × 16) + batch normalization |
| Layer 6 | convolutional 1D (kernel = 4 × 80 × 96; bias = 96) | maximum pooling 1D | — | dense (kernel = 16 × 12) + batch normalization | dense (kernel = 16 × 12) + batch normalization |
| Layer 7 | global average pooling 1D | convolutional 1D (kernel = 4 × 64 × 128; bias = 128) | — | dense (kernel = 12 × 10) + batch normalization | dense (kernel = 12 × 10) + batch normalization |
| Layer 8 | random dropout (rate = 0.025) | global average pooling 1D | — | dense (kernel = 10 × 9) + batch normalization | dense (kernel = 10 × 9) + batch normalization |
| Layer 9 | dense (kernel = 96 × 2; bias = 2) | random dropout (rate = 0.025) | — | dense (kernel = 9 × 8) + batch normalization | dense (kernel = 9 × 8) + batch normalization |
| Layer 10 | — | dense (kernel = 128 × 32; bias = 32) | — | random dropout (rate = 0.2) | dense (kernel = 8 × 7) + batch normalization |
| Layer 11 | — | random dropout (rate = 0.025) | — | dense (kernel = 8 × 2) | dense (kernel = 7 × 6) + batch normalization |
| Layer 12 | — | dense (kernel = 32 × 2; bias = 2) | — | — | dense (kernel = 6 × 5) + batch normalization |
| Layer 13 | — | — | — | — | dense (kernel = 5 × 5) + batch normalization |
| Layer 14 | — | — | — | — | dense (kernel = 5 × 4) + batch normalization |
| Layer 15 | — | — | — | — | dense (kernel = 4 × 4) + batch normalization |
| Layer 16 | — | — | — | — | random dropout (rate = 0.1) |
| Layer 17 | — | — | — | — | dense (kernel = 4 × 2) |
bEDFFNN = binomial ensemble deep feed‐forward neural network; ECNN = Eudex convolutional neural network; LCNN = letter convolutional neural network; PDFFNN = pseudosyllable deep feed‐forward neural network; uEDFFNN = uninomial ensemble deep feed‐forward neural network.
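To make the LCNN column concrete, here is a minimal Keras sketch assembled from the layer shapes listed above. It is a reconstruction for illustration only, not the published implementation (the published tool targets the web browser); the pool size, optimizer, and loss are assumptions not stated in the table.

```python
# Minimal Keras sketch of the LCNN column in the table above.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB = 27     # embedding rows from layer 1 (27 x 64)
MAX_LEN = 58   # padded input length (longest word analyzed)

lcnn = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB, 64),              # layer 1: trainable embedding (27 x 64)
    layers.Dropout(0.025),                    # layer 2: random dropout
    layers.Conv1D(32, 4, activation="relu"),  # layer 3: kernel 4 x 64 x 32; bias 32
    layers.MaxPooling1D(),                    # layer 4: maximum pooling 1D (pool size assumed)
    layers.Conv1D(64, 4, activation="relu"),  # layer 5: kernel 4 x 32 x 64; bias 64
    layers.MaxPooling1D(),                    # layer 6: maximum pooling 1D
    layers.Conv1D(128, 4, activation="relu"), # layer 7: kernel 4 x 64 x 128; bias 128
    layers.GlobalAveragePooling1D(),          # layer 8: global average pooling 1D
    layers.Dropout(0.025),                    # layer 9: random dropout
    layers.Dense(32, activation="relu"),      # layer 10: kernel 128 x 32; bias 32
    layers.Dropout(0.025),                    # layer 11: random dropout
    layers.Dense(2, activation="softmax"),    # layer 12: kernel 32 x 2; bias 2
])
lcnn.compile(optimizer="adam", loss="categorical_crossentropy")
lcnn.summary()
```

Each `Conv1D` weight matches the kernel dimensions listed in the table (kernel length × input channels × output channels), and the two-unit softmax output corresponds to the binary name/non-name decision.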
Figure 1. Precision-recall curves for all possible cutoff values of the Bloom filter (BF), Eudex convolutional neural network (ECNN), letter convolutional neural network (LCNN), pseudosyllable deep feed-forward neural network (PDFFNN), uninomial ensemble deep feed-forward neural network (uEDFFNN), and binomial ensemble deep feed-forward neural network (bEDFFNN), calculated from the validation data (5% of the total data set; not used for neural network training or testing). A 5% random error was added to the inherent BF error rate to mimic the effect of missing entries, thereby depressing the BF, uEDFFNN, and bEDFFNN curves. The bEDFFNN and uEDFFNN ensemble classifiers perform better than any of their input classifiers, demonstrating complementarity.
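The curves in Figure 1 are built by sweeping every possible cutoff value of each classifier's output score. A minimal sketch of that computation follows; the labels and scores are synthetic stand-ins for the validation data, and scikit-learn is an assumed tooling choice, not the paper's.

```python
# Sketch: sweeping all cutoff values to build a precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)  # 1 = Latin scientific name (synthetic)
scores = np.clip(0.35 * labels + rng.normal(0.4, 0.2, 1000), 0.0, 1.0)

precision, recall, cutoffs = precision_recall_curve(labels, scores)
for p, r, c in zip(precision[:5], recall[:5], cutoffs[:5]):
    print(f"cutoff={c:.2f}  precision={p:.2f}  recall={r:.2f}")
```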
Figure 2. Precision versus recall for the (A) A100, (B) S800, and (C) COPIOUS data sets using LINNAEUS (L), NetiNeti (N), Quaesitor (Q), SPECIES (S), and TaxonFinder (T). Error bars indicate 99% confidence intervals. Confidence area opacity indicates relative processing time on a log scale, with darker colors indicating slower programs.
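The caption states 99% confidence intervals but not how they were constructed. One standard choice for a proportion such as precision or recall is the Wilson score interval, sketched below with invented counts; the interval construction is an assumption, not the paper's documented method.

```python
# Sketch: a 99% Wilson score interval for a precision or recall estimate.
import math

def wilson_ci(successes: int, trials: int, z: float = 2.576) -> tuple[float, float]:
    """Wilson score interval; z = 2.576 corresponds to 99% confidence."""
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return center - half, center + half

# Example: 450 correctly identified names out of 500 predictions (precision).
low, high = wilson_ci(450, 500)
print(f"precision = 0.900, 99% CI = ({low:.3f}, {high:.3f})")
```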