Ai-bing Zhang, Jie Feng, Robert D Ward, Ping Wan, Qiang Gao, Jun Wu, Wei-zhong Zhao.
Abstract
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) that take advantage of the power of machine learning and bioinformatics to address this issue for species assignment by both coding and non-coding sequences. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing neighbor-joining (NJ) and maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also outperformed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although in that case the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75-100%. The new methods also obtained a 96.29% success rate (95% CI: 91.62-98.40%) for 484 rust fungi queries and a 98.50% success rate (95% CI: 96.60-99.37%) for 1,094 brown algae queries, both using ITS barcodes.
Year: 2012 PMID: 22363527 PMCID: PMC3282726 DOI: 10.1371/journal.pone.0030986
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
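The abstract reports each success rate together with a 95% confidence interval, but this record does not say which interval method was used. As a minimal sketch, assuming a Wilson score interval for a binomial proportion, the calculation could look like the following. The function name `wilson_interval` and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, total: int, level: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    Assumption: the Wilson method is used here for illustration only;
    the source does not name the interval method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 96 correct assignments out of 100 queries.
low, high = wilson_interval(96, 100)
print(f"success rate {96 / 100:.2%}, 95% CI {low:.2%} to {high:.2%}")
```

One reason to prefer a score interval in this setting is that it stays informative when every query is assigned correctly: the upper bound is 100% and the lower bound depends only on the number of queries.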
Species assignments for Neotropical bats [81] based on COI sequences for all 4122 random queries using DV-RBF and FJ-RBF methods.
| No. | Category of random tests | Query | DV-RBF status | FJ-RBF status |
|---|---|---|---|---|
| 1 | random splits | BCBNT34706-Rhynchonycteris naso | ✓ | ✓ |
| 2 | random splits | BCBNT35706-Rhynchonycteris naso | ✓ | ✓ |
| 3 | random splits | BCBNT13006-Diclidurus isabellus | ✓ | ✓ |
| 4 | random splits | BCBNT37906-Diclidurus isabellus | ✓ | ✓ |
| 5 | random splits | BCBNT14306-Diclidurus isabellus | ✓ | ✓ |
| 6 | random splits | BCBNT92206-Chrotopterus auritus | ✓ | ✓ |
| 7 | random splits | BCBNT59706-Chrotopterus auritus | ✓ | ✓ |
| 8 | random splits | BCBNT04006-Cormura brevirostris | ✓ | ✓ |
| 9 | random splits | BCBNT05606-Cormura brevirostris | ✓ | ✓ |
| 10 | random splits | BCBNT39906-Pteronotus personatus | ✓ | ✓ |
| 11 | random splits | BCBNT09706-Pteronotus personatus | ✓ | ✓ |
| 12 | random splits | BCBNT36906-Noctilio albiventris | ✓ | ✓ |
| … | random splits | … | … | … |
| 1295 | random splits | BCBNT55406-Lophostoma silvicolum | ✓ | ✓ |
| 1 | n-fold cross-validation | BCBNT29806-Trachops cirrhosus | ✓ | ✓ |
| 2 | n-fold cross-validation | BCBNT63906-Platyrrhinus helleri | ✓ | ✓ |
| 3 | n-fold cross-validation | BCBNC12906-Rhinophylla pumilio | ✓ | ✓ |
| 4 | n-fold cross-validation | BCBNT94306-Molossus molossus | ✓ | ✓ |
| 5 | n-fold cross-validation | BCBNC01906-Rhinophylla pumilio | ✓ | ✓ |
| 6 | n-fold cross-validation | BCBNT70306-Phyllostomus discolor | ✓ | ✓ |
| 7 | n-fold cross-validation | BCBNT99106-Platyrrhinus aurarius | ✓ | ✗ |
| 8 | n-fold cross-validation | BCBNC16806-Platyrrhinus aurarius | ✓ | ✓ |
| 9 | n-fold cross-validation | BCBN31305-Rhinophylla pumilio | ✓ | ✓ |
| 10 | n-fold cross-validation | BCBNC06506-Lionycteris spurrelli | ✓ | ✓ |
| 11 | n-fold cross-validation | BCBN55205-Trachops cirrhosus | ✓ | ✓ |
| 12 | n-fold cross-validation | BCBNC16106-Platyrrhinus aurarius | ✓ | ✓ |
| … | n-fold cross-validation | … | … | … |
| 766 | n-fold cross-validation | BCBNT94606-Glyphonycteris daviesi | ✗ | ✗ |
Category of random tests: two categories of randomization were performed in this study. Random splits were conducted at the species level (5 times), and n-fold cross-validation was performed on the whole dataset ( was used). The 4122 random queries were generated from the original 766 bat COI sequences; see text and Online Appendix I for details.
Query: the names of query sequences consist of BOLD sequence accession numbers (with the dash before the last two digits removed) and their true species names. Only part of the results is presented here; see Online Appendix I for all 4122 queries and corresponding assignments (singletons were excluded since they can only be assigned to the wrong species).
DV-RBF: DV denotes DV-Curve and RBF denotes RBF neural network; see text for details.
FJ-RBF: FJ denotes FJ-Curve.
Status: ticks and crosses indicate correct and wrong assignments, respectively.
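The query naming rule in the table notes (accession number with the dash before the last two digits removed, followed by the true species name) can be reversed mechanically. The sketch below is one way to do that, assuming every label has the shape letters, digits, dash, species name, as in the rows shown here. The names `QueryLabel` and `parse_query_label` are hypothetical helpers, not part of the study's software.

```python
import re
from typing import NamedTuple


class QueryLabel(NamedTuple):
    accession: str  # accession number with the dash restored
    species: str    # true species name as printed in the label


# Assumed label shape: upper-case letters, digits, a dash, then the species name.
_LABEL = re.compile(r"^([A-Z]+)(\d+)-(.+)$")


def parse_query_label(label: str) -> QueryLabel:
    """Split a query label into its accession number and species name."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised query label: {label!r}")
    prefix, digits, species = match.groups()
    if len(digits) < 3:
        raise ValueError(f"accession number too short in: {label!r}")
    # Restore the dash that was removed before the last two digits.
    accession = f"{prefix}{digits[:-2]}-{digits[-2:]}"
    return QueryLabel(accession, species.strip())


print(parse_query_label("BCBNT34706-Rhynchonycteris naso"))
```

Keeping the species name alongside the accession makes it straightforward to compare an assigned species against the true one when scoring a query.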
Figure 1. Success rate of species identification and 95% confidence intervals with the new methods (DV-RBF or FJ-RBF) proposed in this study based on COI barcodes and ITS barcodes for four empirical datasets.
Species assignments for Pacific Canadian marine fish [82] based on COI sequences for all 5134 random queries using DV-RBF and FJ-RBF methods.
| No. | Category of random tests | Query | DV-RBF status | FJ-RBF status |
|---|---|---|---|---|
| 1 | random splits | TZFPA15007-Eptatretus stoutii | ✓ | ✓ |
| 2 | random splits | TZFPB55006-Eptatretus stoutii | ✓ | ✓ |
| 3 | random splits | TZFPB57806-Eptatretus stoutii | ✓ | ✓ |
| 4 | random splits | TZFPB21505-Eptatretus deani | ✓ | ✓ |
| 5 | random splits | TZFPB32505-Eptatretus deani | ✓ | ✓ |
| 6 | random splits | TZFPB04605-Porichthys notatus | ✓ | ✓ |
| 7 | random splits | TZFPB46906-Porichthys notatus | ✓ | ✓ |
| 8 | random splits | TZFPB04305-Porichthys notatus | ✓ | ✓ |
| 9 | random splits | TZFPB53606-Squalus acanthias | ✓ | ✓ |
| 10 | random splits | TZFPB56706-Squalus acanthias | ✓ | ✓ |
| 11 | random splits | TZFPB55906-Squalus acanthias | ✓ | ✓ |
| 12 | random splits | TZFPB42505-Cyclothone atraria | ✓ | ✓ |
| … | random splits | … | … | … |
| 1585 | random splits | TZFPA19707-Malacocottus | ✓ | ✓ |
| 1 | n-fold cross-validation | TZFPB55306-Lycodes diapterus | ✓ | ✓ |
| 2 | n-fold cross-validation | TZFPB69106-Sebastes pinniger | ✓ | ✓ |
| 3 | n-fold cross-validation | TZFPA14506-Talismania bifurcata | ✓ | ✓ |
| 4 | n-fold cross-validation | TZFPB71206-Ronquilus jordani | ✓ | ✓ |
| 5 | n-fold cross-validation | TZFPB56606-Sebastes aleutianus | ✓ | ✓ |
| 6 | n-fold cross-validation | TZFPA19407-Nectoliparis pelagicus | ✗ | ✗ |
| 7 | n-fold cross-validation | TZFPB82006-Sebastes reedi | ✓ | ✓ |
| 8 | n-fold cross-validation | TZFPB46706-Alosa sapidissima | ✓ | ✓ |
| 9 | n-fold cross-validation | TZFPB87508-Oligocottus maculosus | ✓ | ✓ |
| 10 | n-fold cross-validation | TZFPB32805-Alepocephalus tenebrosus | ✓ | ✓ |
| 11 | n-fold cross-validation | TZFPB58306-Theragra chalcogramma | ✓ | ✓ |
| 12 | n-fold cross-validation | TZFPB86908-Cyclothone atraria | ✗ | ✗ |
| … | n-fold cross-validation | … | … | … |
| 982 | n-fold cross-validation | TZFPB16505-Sebastes flavidus | ✓ | ✓ |
Category of random tests: two categories of randomization were performed in this study. Random splits were conducted at the species level (5 times), and n-fold cross-validation was performed on the whole dataset ( was used). The 5134 random queries were generated from the original 982 fish COI sequences; see text and Online Appendix II for details.
Query: the names of query sequences consist of BOLD sequence accession numbers (with the dash before the last two digits removed) and their true species names. Only part of the results is presented here; see Online Appendix II for all 5134 queries and corresponding assignments (singletons were excluded since they can only be assigned to the wrong species).
DV-RBF: DV denotes DV-Curve and RBF denotes RBF neural network; see text for details.
FJ-RBF: FJ denotes FJ-Curve.
Status: ticks and crosses indicate correct and wrong assignments, respectively.
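The table notes describe random splits made at the species level, with singletons excluded from the queries because a species with a single sequence cannot be matched in the reference set. A minimal sketch of one such split follows. The split proportion (`query_fraction`), the choice to keep singletons in the reference set, and the function name `species_level_split` are all assumptions made for illustration; the record does not give these details.

```python
import random
from collections import defaultdict


def species_level_split(records, query_fraction=0.5, seed=None):
    """Split (sequence_id, species) pairs into (reference, queries).

    Assumptions: each species with at least two sequences contributes
    roughly `query_fraction` of them as queries and always keeps at least
    one in the reference set; singletons stay in the reference set only.
    """
    if not 0 < query_fraction < 1:
        raise ValueError("query_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for seq_id, species in records:
        by_species[species].append(seq_id)

    reference, queries = [], []
    for species, ids in by_species.items():
        if len(ids) < 2:
            # A singleton query could only be assigned to the wrong species.
            reference.extend((i, species) for i in ids)
            continue
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # At least one query, and at least one sequence left as reference.
        n_query = min(len(shuffled) - 1, max(1, round(len(shuffled) * query_fraction)))
        queries.extend((i, species) for i in shuffled[:n_query])
        reference.extend((i, species) for i in shuffled[n_query:])
    return reference, queries


# Hypothetical records; repeating the split with five seeds mirrors "5 times".
toy = [("s1", "A"), ("s2", "A"), ("s3", "A"), ("s4", "B"), ("s5", "C"), ("s6", "C")]
for seed in range(5):
    ref, qry = species_level_split(toy, seed=seed)
    print(seed, len(ref), len(qry))
```

Splitting within each species guarantees that every query species is still represented in the reference set, which corresponds to the complete species coverage scenario.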
Species assignments for rust fungi (BOLD project CHITS) based on ITS sequences for 484 random queries using DV-RBF and FJ-RBF methods.
| No. | Category of random tests | Query | DV-RBF status | FJ-RBF status |
|---|---|---|---|---|
| 1 | random splits | CHITS08008-Chrysomyxa wereii | ✓ | ✓ |
| 2 | random splits | CHITS07708-Chrysomyxa wereii | ✓ | ✓ |
| 3 | random splits | CHITS11109-Chrysomyxa pirolata | ✓ | ✓ |
| 4 | random splits | CHITS01308-Chrysomyxa pirolata | ✓ | ✓ |
| 5 | random splits | CHITS11009-Chrysomyxa pirolata | ✓ | ✓ |
| 6 | random splits | CHITS09509-Chrysomyxa arctostaphyli | ✓ | ✓ |
| 7 | random splits | CHITS04108-Chrysomyxa arctostaphyli | ✓ | ✓ |
| 8 | random splits | CHITS03208-Chrysomyxa empetri | ✓ | ✓ |
| 9 | random splits | CHITS03308-Chrysomyxa empetri | ✓ | ✓ |
| 10 | random splits | CHITS03108-Chrysomyxa chiogenis | ✓ | ✓ |
| 11 | random splits | CHITS02408-Chrysomyxa chiogenis | ✓ | ✓ |
| 12 | random splits | CHITS06208-Chrysomyxa ledicola | ✓ | ✓ |
| … | random splits | … | … | … |
| 135 | random splits | CHITS06508-Chrysomyxa nagodhii | ✓ | ✓ |
| 1 | n-fold cross-validation | CHITS05608-Chrysomyxa ledi | ✗ | ✗ |
| 2 | n-fold cross-validation | CHITS01208-Chrysomyxa cassandrae | ✓ | ✓ |
| 3 | n-fold cross-validation | CHITS04008-Chrysomyxa arctostaphyli | ✓ | ✓ |
| 4 | n-fold cross-validation | CHITS02308-Chrysomyxa chiogenis | ✓ | ✓ |
| 5 | n-fold cross-validation | CHITS02108-Chrysomyxa nagodhii | ✗ | ✗ |
| 6 | n-fold cross-validation | CHITS05308-Chrysomyxa arctostaphyli | ✗ | ✗ |
| 7 | n-fold cross-validation | CHITS06208-Chrysomyxa ledicola | ✓ | ✓ |
| 8 | n-fold cross-validation | CHITS06008-Chrysomyxa ledicola | ✓ | ✓ |
| 9 | n-fold cross-validation | CHITS09509-Chrysomyxa arctostaphyli | ✗ | ✗ |
| 10 | n-fold cross-validation | FUCUI00608-Fucus distichus | ✓ | ✓ |
| 11 | n-fold cross-validation | CHITS11009-Chrysomyxa pirolata | ✓ | ✓ |
| 12 | n-fold cross-validation | CHITS05708-Chrysomyxa ledi | ✗ | ✗ |
| … | n-fold cross-validation | … | … | … |
| 107 | n-fold cross-validation | CHITS08909-Chrysomyxa ledicola | ✓ | ✓ |
Category of random tests: two categories of randomization were performed in this study. Random splits were conducted at the species level (5 times), and n-fold cross-validation was performed on the whole dataset ( was used). The 484 random queries were generated from the original 107 rust fungi ITS sequences; see text and Online Appendix III for details.
Query: the names of query sequences consist of BOLD sequence accession numbers (with the dash before the last two digits removed) and their true species names. Only part of the results is presented here; see Online Appendix III for all 484 queries and corresponding assignments (singletons were excluded since they can only be assigned to the wrong species).
DV-RBF: DV denotes DV-Curve and RBF denotes RBF neural network; see text for details.
FJ-RBF: FJ denotes FJ-Curve.
Status: ticks and crosses indicate correct and wrong assignments, respectively.
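The second randomization scheme in the table notes is n-fold cross-validation on the whole dataset. The sketch below shows the generic procedure: shuffle the sequences, cut them into n folds, and let each fold serve once as the query set while the remaining folds form the reference set. The function name `n_fold_splits` is hypothetical, and the number of folds is left as a parameter because the value used in the study is not legible in this record.

```python
import random


def n_fold_splits(items, n_folds, seed=None):
    """Yield (reference, queries) pairs for n-fold cross-validation.

    Every item appears in exactly one query fold, so the total number of
    queries equals the number of items.
    """
    items = list(items)
    if not 2 <= n_folds <= len(items):
        raise ValueError("n_folds must be between 2 and the number of items")
    random.Random(seed).shuffle(items)
    # Deal the shuffled items round-robin so fold sizes differ by at most one.
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k, queries in enumerate(folds):
        reference = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield reference, queries


# Hypothetical sequence identifiers and fold count.
sequence_ids = [f"seq{i}" for i in range(10)]
for reference, queries in n_fold_splits(sequence_ids, n_folds=5, seed=0):
    print(len(reference), len(queries))
```

Because the folds are drawn without regard to species, a query's species may be absent from the reference set in a given round, which is one way incomplete species coverage can arise.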
Species assignments for the brown algae (BOLD project PHAEP) based on ITS sequences for 1094 random queries using DV-RBF and FJ-RBF methods.
| No. | Category of random tests | Query | DV-RBF status | FJ-RBF status |
|---|---|---|---|---|
| 1 | random splits | FUCUI04008-Fucus distichus | ✓ | ✓ |
| 2 | random splits | FUCUI03708-Fucus distichus | ✓ | ✓ |
| 3 | random splits | FUCUI04408-Fucus distichus | ✓ | ✓ |
| 4 | random splits | FUCUI03408-Fucus distichus | ✓ | ✓ |
| 5 | random splits | FUCUI00308-Fucus distichus | ✓ | ✓ |
| 6 | random splits | FUCUI05508-Fucus distichus | ✓ | ✓ |
| 7 | random splits | FUCUI02608-Fucus distichus | ✓ | ✓ |
| 8 | random splits | FUCUI04508-Fucus distichus | ✓ | ✓ |
| 9 | random splits | FUCUI05708-Fucus distichus | ✓ | ✓ |
| 10 | random splits | FUCUI04608-Fucus distichus | ✓ | ✓ |
| 11 | random splits | FUCUI02708-Fucus distichus | ✓ | ✓ |
| 12 | random splits | FUCUI00108-Fucus distichus | ✓ | ✓ |
| … | random splits | … | … | … |
| 340 | random splits | MACRO97608-Scytosiphon cylindricus | ✓ | ✓ |
| 1 | n-fold cross-validation | MACRO69407-Saccharina latissima | ✓ | ✓ |
| 2 | n-fold cross-validation | MACRO12106-Saccharina latissima | ✓ | ✓ |
| 3 | n-fold cross-validation | MACRO77607-Scytosiphon sp | ✓ | ✓ |
| 4 | n-fold cross-validation | MACRO11406-Scytosiphon cylindricus | ✓ | ✓ |
| 5 | n-fold cross-validation | MACRO12806-Scytosiphon cylindricus | ✓ | ✓ |
| 6 | n-fold cross-validation | MACRO49807-Saccharina latissima | ✓ | ✓ |
| 7 | n-fold cross-validation | MACRO17406-Petalonia sp | ✓ | ✓ |
| 8 | n-fold cross-validation | MACRO94108-Petalonia sp | ✓ | ✓ |
| 9 | n-fold cross-validation | FUCUI05308-Fucus spiralis | ✓ | ✓ |
| 10 | n-fold cross-validation | FUCUI00608-Fucus distichus | ✓ | ✓ |
| 11 | n-fold cross-validation | MACRO104108-Saccharina latissima | ✓ | ✓ |
| 12 | n-fold cross-validation | MACRO73607-Scytosiphon cylindricus | ✓ | ✓ |
| … | n-fold cross-validation | … | … | … |
| 207 | n-fold cross-validation | FUCUI00708-Fucus distichus | ✓ | ✓ |
Category of random tests: two categories of randomization were performed in this study. Random splits were conducted at the species level (5 times), and n-fold cross-validation was performed on the whole dataset ( was used). The 1094 random queries were generated from the original 207 brown algae ITS sequences; see text and Online Appendix IV for details.
Query: the names of query sequences consist of BOLD sequence accession numbers (with the dash before the last two digits removed) and their true species names. Only part of the results is presented here; see Online Appendix IV for all 1094 queries and corresponding assignments (singletons were excluded since they can only be assigned to the wrong species).
DV-RBF: DV denotes DV-Curve and RBF denotes RBF neural network; see text for details.
FJ-RBF: FJ denotes FJ-Curve.
Status: ticks and crosses indicate correct and wrong assignments, respectively.
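Since each row records a tick or a cross per method, a success rate per test category and method is simply the share of ticks. The following sketch tallies those shares from rows shaped like the tables above. The function name `success_rates` and the example rows are hypothetical; the tables here are excerpts, so rates computed from them would not reproduce the reported figures.

```python
from collections import Counter

TICK = "✓"


def success_rates(rows):
    """Return {(category, method): share of ticks}.

    rows: iterable of (category, dv_status, fj_status), where each status
    is a tick for a correct assignment or a cross for a wrong one.
    """
    totals = Counter()
    correct = Counter()
    for category, dv_status, fj_status in rows:
        for method, status in (("DV-RBF", dv_status), ("FJ-RBF", fj_status)):
            key = (category, method)
            totals[key] += 1
            correct[key] += status == TICK  # True counts as 1
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical rows, for illustration only.
example = [
    ("random splits", "✓", "✓"),
    ("n-fold cross-validation", "✓", "✗"),
    ("n-fold cross-validation", "✗", "✗"),
    ("n-fold cross-validation", "✓", "✓"),
]
for key, rate in success_rates(example).items():
    print(key, f"{rate:.2%}")
```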
Figure 2. The DV-Curve of the 10 bp sequence ‘AGACTGCATC’.
Figure 3. The FJ-Curve of the 20 bp sequence ‘GCCTCCGCCCAGACTTCTTC’.
Figure 4. The work flow of the RBF network approach proposed in this study and a comparison with the BP network.
Figure 5. Topology of the RBF network and a processing unit of hidden units.
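To make the topology named in the Figure 5 caption concrete, here is a minimal sketch of the forward pass of a generic RBF network: each hidden unit responds to the distance between the input and its centre, and each output unit is a weighted sum of the hidden activations. The Gaussian basis function, the per-unit widths, and all names (`gaussian_unit`, `rbf_forward`, `predict`) are assumptions for illustration; this record does not specify the kernel or training procedure used by DV-RBF and FJ-RBF, and the input vectors here are toy values, not DV-Curve or FJ-Curve features.

```python
from math import exp


def gaussian_unit(x, centre, width):
    """Activation of one hidden unit: exp(-||x - c||^2 / (2 * width^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return exp(-sq_dist / (2.0 * width ** 2))


def rbf_forward(x, centres, widths, weights, biases):
    """Return one score per output unit.

    weights[k][j] links hidden unit j to output unit k.
    """
    hidden = [gaussian_unit(x, c, w) for c, w in zip(centres, widths)]
    return [
        bias + sum(w_kj * h for w_kj, h in zip(row, hidden))
        for row, bias in zip(weights, biases)
    ]


def predict(x, centres, widths, weights, biases, class_names):
    """Pick the class whose output unit has the highest score."""
    scores = rbf_forward(x, centres, widths, weights, biases)
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best]


# Toy network: two hidden units, two output units, hand-set parameters.
centres = [[0.0, 0.0], [1.0, 1.0]]
widths = [0.5, 0.5]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
names = ["species_A", "species_B"]
print(predict([0.1, 0.0], centres, widths, weights, biases, names))
```

In a trained network the centres, widths, and output weights would be fitted to the reference sequences rather than set by hand as in this toy example.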