Prabina Kumar Meher, Tanmaya Kumar Sahu, Shachi Gahoi, Ruchi Tomar, Atmakuri Ramakrishna Rao.
Abstract
BACKGROUND: Identification of unknown fungal species aids the conservation of fungal diversity. Because many fungal species cannot be cultured, morphological identification of those species is almost impossible. DNA barcoding, however, can be employed to identify such species. For fungal taxonomy prediction, the ITS (internal transcribed spacer) region of rDNA (ribosomal DNA) is used as the barcode. Although the large volume of barcode sequences available in the public domain has made computational prediction of fungal species feasible, prediction remains challenging because of the high degree of variability among ITS regions within species.
Keywords: BOLD systems; CBOL; DNA barcode; Fungal taxonomy; ITS
Year: 2019 PMID: 30616524 PMCID: PMC6323839 DOI: 10.1186/s12863-018-0710-z
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Fig. 1 (a) Diagrammatic representation of the ITS region of rDNA, which includes ITS1 and ITS2 separated by the 5.8S gene. (b) Venn diagram showing the number of sequences of this work present in other databases. (c) Diagrammatic representation of the computation of gapped base pair features for the di-nucleotide AA. (d) Flow diagram showing the steps of training and testing involved in prediction using the RF classifier. During training, tree-based classifiers are constructed on bootstrap samples of the training dataset, whereas in testing the test instance is passed down every constructed classifier and its label is predicted by majority voting.
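The gapped base pair features of panel (c) can be illustrated with a minimal Python sketch. The exact convention used by the authors is not given in this record, so the sketch assumes that a gap g means g positions lie between the two bases and that counts are normalized by the number of valid pairs; the function name `gapped_pair_features` is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def gapped_pair_features(seq, gaps=(1, 2, 3)):
    """Return {(g, pair): relative frequency} for every gap g and all 16 base pairs.

    Assumed convention (not confirmed by the source): gap g means that
    g positions lie between the first and the second base of the pair.
    """
    seq = seq.upper()
    feats = {}
    for g in gaps:
        counts = {a + b: 0 for a, b in product(BASES, repeat=2)}
        total = 0
        for i in range(len(seq) - g - 1):
            pair = seq[i] + seq[i + g + 1]
            if pair in counts:  # skip pairs containing ambiguous symbols such as N
                counts[pair] += 1
                total += 1
        for pair, c in counts.items():
            feats[(g, pair)] = c / total if total else 0.0
    return feats

# Usage: the 1-gapped pairs of "AACAA" are A.C, A.A and C.A
features = gapped_pair_features("AACAA", gaps=(1,))
```

Each gap contributes 16 features regardless of sequence length, which is what lets sequences of different lengths be compared as fixed-length numeric vectors.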
Distribution of collected fungal barcode sequences over different genomic regions. It can be seen that more than 56000 of the 59847 sequences are from the ITS region (including ITS1 and ITS2). These 59847 barcode sequences belong to 3770 species, with at least 3 sequences present for each species.
| Genomic region | 18S | 28S | 5.8S | AOX-fmt | atp6 | COI-5P | COII | COXIII | ITS | ITS1 | ITS2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| # Sequences | 5 | 6 | 2418 | 79 | 3 | 595 | 3 | 3 | 51886 | 2428 | 2421 |
Number of sequences, species, and sequences per species for the seven dataset categories considered. For instance, the first category contains 3770 species with 11310 sequences, where each species has 3 sequences. In the category with k sequences per species, k-fold cross validation was adopted: k-1 sequences per species were used to train the model and the remaining sequence was used to assess model accuracy.
| #Sequence/Species | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| #Species | 3770 | 3461 | 2777 | 2328 | 1998 | 1773 | 1498 |
| #Sequence | 11310 | 13844 | 13885 | 13968 | 13986 | 14184 | 13482 |
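The cross-validation scheme described in the caption above can be sketched as follows. This is only an illustration under the assumption that, in fold i, the i-th sequence of every species is held out; the function name `per_species_folds` and the toy data are hypothetical.

```python
def per_species_folds(seqs_by_species, k):
    """Yield (train, test) lists of (species, sequence) pairs for k folds.

    In fold i, the i-th sequence of every species is held out for testing
    and the remaining k-1 sequences of that species are used for training.
    """
    for species, seqs in seqs_by_species.items():
        if len(seqs) != k:
            raise ValueError(f"{species} has {len(seqs)} sequences, expected {k}")
    for i in range(k):
        train, test = [], []
        for species, seqs in seqs_by_species.items():
            for j, s in enumerate(seqs):
                (test if j == i else train).append((species, s))
        yield train, test

# Usage with toy data: two species, three sequences each (k = 3)
toy = {"sp1": ["a1", "a2", "a3"], "sp2": ["b1", "b2", "b3"]}
folds = list(per_species_folds(toy, 3))
```

Holding out exactly one sequence per species keeps every species represented in both the training and the test part of each fold.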
Summary of the training and test datasets for five different taxonomical entities.
| Dataset | Drosophila | Inga | Fish | Bat | Cypraeidae |
|---|---|---|---|---|---|
| #Train (reference) | 419 | 791 | 515 | 682 | 1656 |
| #Test (query) | 116 | 122 | 111 | 144 | 352 |
#Train: Number of sequences in the training set
#Test: Number of sequences in the test set
Fig. 2 (a) Line graphs showing the trend of OOB-error rates with respect to the number of classification trees (ntree) in RF. (b) The OOB-error rates for different model representations with default values of mtry at ntree=500. (c) Heat maps of the OOB-error rates at ntree=500 with different values of mtry for different model representations. (d) Heat map of the OOB-error rate for the dataset with 9 sequences per species for different mtry values and model representations. It can be seen that the OOB error stabilized after reaching 400 classification trees, whereas mtry=9 was observed to be optimum because it gave lower OOB-error rates than the other values of mtry.
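The out-of-bag (OOB) error reported in this figure can be illustrated with a small, library-free Python sketch. It is not the authors' implementation: the base learner here is a hypothetical 1-nearest-neighbour stand-in rather than a classification tree, and the names `oob_error`, `fit_1nn` and `n_estimators` are assumptions made for illustration. The idea shown is that each training instance is scored only by the ensemble members whose bootstrap sample did not contain it.

```python
import random
from collections import Counter

def oob_error(X, y, fit, n_estimators=25, seed=0):
    """Estimate the out-of-bag error of a bagged ensemble.

    fit(X_boot, y_boot) must return a callable predict(x) -> label.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]  # OOB votes per training instance
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample with replacement
        in_bag = set(idx)
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # instance i was left out of this bootstrap sample
                votes[i][model(X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored) if scored else float("nan")

def fit_1nn(Xb, yb):
    """Hypothetical stand-in base learner: 1-nearest neighbour on a single feature."""
    def predict(x):
        j = min(range(len(Xb)), key=lambda k: abs(Xb[k] - x))
        return yb[j]
    return predict

# Usage on a toy one-dimensional dataset with two well-separated classes
X = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
y = ["a"] * 4 + ["b"] * 4
err = oob_error(X, y, fit_1nn)
```

Because the OOB estimate needs no separate validation set, it is a convenient quantity to track while varying the number of ensemble members, which is the role it plays in panel (a).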
Fig. 3 (a) The species identification success rates (SISR) for different combinations of g-spaced base pair features. (b) The SISR for different numbers of sequences per species. (c) The SISR of the proposed model for taxonomy prediction in Drosophila, Inga, Fish, Bat and Cypraeidae. (d) Box plots of the proportion of correctly predicted sequences in 100 sets of each simulated dataset. (e) Heat map of the proportion of correctly predicted sequences in 100 sets of each simulated dataset
Species identification success rates for different combinations of k-mer and g-spaced feature sets, where 4 and 5 sequences per species were used to train the prediction model. It can be seen that, although the species identification success rates of the two feature sets are on par, the k-mer combinations with comparable accuracy use more features than the g-spaced feature set.
| Feature type | Feature combination | #Features | 5 sequences/species | 6 sequences/species |
|---|---|---|---|---|
| k-mer | k=1+2 | 20 | 76.37±4.91 | 79.61±3.33 |
| k-mer | k=1+2+3 | 84 | 79.21±4.71 | 82.72±2.81 |
| k-mer | k=1+2+3+4 | 340 | 80.61±4.03 | 83.68±2.85 |
| g-spaced | g=1+2+3+4+5 | 96 | 81.74±2.72 | 83.49±2.36 |
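The #Features values 20, 84 and 340 in the table above are consistent with counting every possible k-mer over the four-letter DNA alphabet for each k in the combination. A minimal Python check, with the hypothetical helper name `kmer_feature_count`:

```python
def kmer_feature_count(ks, alphabet_size=4):
    """Number of features when all possible k-mers are counted for each k in ks."""
    return sum(alphabet_size ** k for k in ks)

# Feature counts for the three k-mer combinations listed in the table
counts = [kmer_feature_count(range(1, top + 1)) for top in (2, 3, 4)]
```

The count grows by a factor of four with each additional k, which is why combinations with larger k quickly need far more features.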
Fig. 4 (a) The SISRs of the proposed model and of similarity-, tree- and diagnostic-based methods for taxonomy prediction of Drosophila, Inga and Cypraeidae. (b) Accuracy of different taxonomy prediction methods for prediction of fungal species using DNA barcodes. (c) Number of correctly predicted fungal species that are common to different taxonomy prediction methods
Fig. 5 (a) Snapshot of the server page of funbarRF and (b) result page after execution of an example dataset