
Automated Bacterial Classifications Using Machine Learning Based Computational Techniques: Architectures, Challenges and Open Research Issues.

Shallu Kotwal¹, Priya Rani², Tasleem Arif¹, Jatinder Manhas³, Sparsh Sharma⁴.

Abstract

Bacteria are important in a variety of practical domains, including industry, agriculture and medicine. Only a few species of bacteria are beneficial to humans, whereas the majority are extremely dangerous and cause a variety of life-threatening illnesses in different living organisms. Traditionally, this class of microbes is detected and classified using approaches such as Gram staining, biochemical testing and motility testing. However, with the availability of large amounts of data and technical advances in medical and computer science, machine learning methods have been widely used and have shown strong performance in the automatic detection of bacteria. The latest technology, employing different Artificial Intelligence techniques, is greatly assisting microbiologists in solving extremely complex problems in this domain. This paper presents a review of the literature on various machine learning approaches that have been used to classify bacteria, for the period 1998-2020. The resources include research papers and book chapters from publishers of national and international repute such as Elsevier, Springer, IEEE and PLOS. The study carries out a detailed and critical analysis of the adoption of different machine learning methodologies in the field of bacterial classification, along with their limitations and future scope. In addition, different opportunities and challenges in implementing these techniques in the field are presented to provide deep insight to researchers working in this area. © CIMNE, Barcelona, Spain 2021.


Year: 2021 PMID: 34658617 PMCID: PMC8505783 DOI: 10.1007/s11831-021-09660-0

Source DB: PubMed Journal: Arch Comput Methods Eng ISSN: 1134-3060 Impact factor: 7.302

Introduction

Microorganisms, or microbes, are organisms that are too small to be seen with the naked eye and can be observed only under a microscope. They exist in both single-celled and multicellular forms. The earth is full of microorganisms, e.g., bacteria, fungi, algae, viruses and protozoa, and they play a very important role in the environment as well as in human life. A small fraction of microbes are beneficial to humans and are used extensively in activities such as food processing, agriculture, industry and medicine. Some microbes are used in the fermentation of foods, sewage treatment, fuel production, maintaining soil fertility and preparing medicines; however, many microbes are very harmful to humans and the ecosystem [1]. Microorganisms are responsible for causing various human diseases such as typhoid, food poisoning, AIDS, polio, cancer and milder illnesses such as the common cold. In general, microbes are very important living creatures on earth, and humans must be aware that they can pose a great threat to other organisms living on this planet [2]. Microorganisms play a vital role in human life and contribute widely to the earth's ecosystem. The importance of microbes in everyday life is readily seen in the worldwide situation created by COVID-19. Similarly, bacteria are considered to be among the most important microorganisms and play an immense role in human life and the earth's ecosystem. Identification and classification of bacteria is a tedious task because bacteria are small and not visible to the naked eye. The traditional methodology is cumbersome and time consuming, and thus costs the microbiologist a great deal of time. Research on a bacterial species can proceed only once the species has been properly classified and placed in an appropriate category.
Inter-family classification is easy compared with intra-family classification, because members of the same family differ only minutely in shape and size. Traditional methodologies classify bacteria on the basis of morphological features, and these features are few in number compared with those available to automated technologies in a modern computing environment. The advent of AI in image classification, using different ML techniques, has enabled researchers to automate the classification of bacteria based on image analysis. ML techniques assist the microbiologist to a great extent by reducing the time needed for bacterial classification. ML-based models also improve the accuracy of bacterial image classification and have shown strong results. ML-based classification uses different feature extraction and feature selection techniques to overcome the limitation of traditional methodologies, which rely on a limited number of morphological features. The decision taken by an ML model involves a very large number of parameters, which contributes to its accuracy. Figure 2 diagrammatically depicts the major sections of the article, covering the systematic review of bacterial image classification using different ML techniques. This paper presents a review of the literature on various ML approaches applied in the field of bacterial classification. The major contributions of the study are listed under “Contribution of this Study” below.

Fig. 2

Article overview

Bacteria: Bacteria are prokaryotic microorganisms that lack a true nucleus; their genetic material (DNA) lies free in the cytoplasm. Based on their morphology, they are classified as spherical (cocci), spiral (spirilla), rod-shaped (bacilli), comma-shaped (vibrios) or corkscrew-shaped (spirochaetes) [3]. They are found almost everywhere; however, human beings act as an ideal host for them. A broad category of bacteria are harmless and even helpful in many ways, whereas some are pathogenic and cause many diseases in humans and other animals, e.g. tetanus, typhoid fever, foodborne illness, cholera and tuberculosis [4]; many of these can prove fatal.
Machine Learning: Machine learning (ML) is a sub-field of Artificial Intelligence (AI) and can be defined as “the field of computer science that gives computers the ability to learn without being explicitly programmed”. ML is extensively used in prediction systems, network packet classification [5], sentiment analysis [6], speech recognition [7], medical diagnosis [8], the financial industry, signal processing [9], fretting fatigue analysis [10], agriculture [11], etc. Generally, building an ML model is divided into two parts: training and testing. In the training part, samples are taken as input and a learning algorithm learns features from them to build the model; in the testing part, the learned model is used to make predictions for previously unseen data. ML is classified into three categories: supervised learning, unsupervised learning and reinforcement learning. Several ML algorithms are already available, e.g., Linear Regression, Support Vector Machine (SVM), Decision Tree, Naïve Bayes, Artificial Neural Network (ANN), Random Forest (RF), Deep Learning (DL) and K-Nearest Neighbour (KNN) [12]; these can either be used directly on the given data or be combined into a hybrid model with techniques such as genetic algorithms [13] and neuro-fuzzy systems [14].
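The two parts of an ML model described above, training and testing, can be illustrated with a very small classifier. The sketch below is only an illustration under stated assumptions: it uses a nearest-centroid rule (not a method taken from the reviewed papers), and the feature values, class names and the function names `train` and `predict` are hypothetical.

```python
from collections import defaultdict

def train(samples, labels):
    """Training part: learn one mean feature vector (centroid) per class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Testing part: assign the class whose centroid is closest to the sample."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical (length, width) measurements; not data from any cited study.
train_x = [[1.0, 1.1], [0.9, 1.0], [3.0, 0.8], [3.4, 0.9]]
train_y = ["cocci", "cocci", "bacilli", "bacilli"]

model = train(train_x, train_y)
prediction = predict(model, [2.9, 1.0])   # a previously unseen sample
print(prediction)
```

The same split into a fitting step and a prediction step applies to the other algorithms listed above; only the internals of `train` and `predict` change.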
Linear Regression: Linear regression is a statistical method used for predictive analysis. It models a linear relationship between a dependent variable and one or more independent variables [12].
Decision Tree: This algorithm is used for classification problems. It works on both categorical and continuous dependent variables by splitting the data into two or more homogeneous sets, starting from a root node [12].
SVM: SVM is extensively used for solving classification problems such as handwriting analysis [15], face recognition [13] and text recognition [16, 17].
Naïve Bayes: This is a classification technique based on Bayes' theorem with an assumption of independence between predictors. The model assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The algorithm is easy to implement and is particularly useful for very large datasets [18].
ANN: ANN is an ML algorithm that tries to mimic the functioning of the human nervous system. It comprises a large number of connected processing units that work together to process information and generate meaningful results from it. DL is an enhancement over the traditional ANN that is concerned with building much larger and more complex neural networks (NN). It automatically extracts features from the given data and does not require the user to do so. It can be used for solving both classification and regression problems. The convolutional neural network (CNN) is a type of DL architecture that is becoming dominant in the area of image analysis. AlexNet was among the first CNN architectures to show that DL is effective in computer vision tasks. CNNs have achieved the best performance in many areas such as plant disease identification [19], self-driving cars [20], flex electricity effect representation [21] and remote sensing [22].
RF: RF is an ensemble learning method for solving classification and regression problems. In a random forest, the prediction is obtained by combining the outputs of a number of decision trees [23, 24].
KNN: KNN is a distance-based ML algorithm and one of the most widely used data mining techniques in pattern recognition and classification problems [12].
Traditional Methods in Bacteria Classification: Because bacteria can be both beneficial and harmful, they are classified into various categories in view of their importance in medical diagnosis, biotechnology, food processing, genetic engineering, etc. Traditionally, bacteria were classified manually by microbiologists. The most commonly used techniques involve observing phenotypic characteristics of bacterial species, such as size, shape and color, under the microscope. Some bacterial species are so similar in shape and size that microbiologists are forced to resort to further tests, such as motility testing, molecular testing and biochemical testing, in order to classify them successfully. The whole procedure depends entirely on human expertise, which requires long training of skilled operators. With this methodology, the results may also be affected by the expert's mood on a given day. Because of these drawbacks, suitable and reliable automatic methods are needed for bacterial classification [24, 25].
ML based Bacteria Classification Methods: A thorough literature survey shows that the traditional methods of classification are time consuming and frequently prone to errors. These shortcomings of the traditional methods create wide scope for scientists to adopt ML approaches in the field of bacteria classification. ML techniques have already shown promising results in the classification of microscopic images [26], including diabetic image classification, cervical cancer cell classification, oral cancer classification and microorganism classification.
This provides a basis and encouragement for carrying out similar work in microbiology to classify different bacteria. Compared with the traditional methodology, ML techniques have proven to be faster, more accurate, more convenient and cheaper [27]. Many researchers have worked on bacteria classification using different techniques, and it has been shown experimentally that ML algorithms can efficiently classify images on the basis of feature extraction techniques. ML techniques have been employed by many researchers to develop automated tools for segmentation, feature extraction and classification of microscopic bacterial images. Fig. 1 gives a flowchart of an automatic microscopic bacterial image classification system. As shown in Fig. 1, the ML model works in five phases: the first phase acquires the bacterial images; the second phase pre-processes the data, removing unwanted noise, blur, etc. through different operations to make the image data clearer and technically suitable for processing; the third phase segments the data; the fourth phase carries out feature extraction and feature selection on the digital images; and the last phase classifies the images into their respective classes.
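The five phases described above can be sketched end to end on a toy example. This is a minimal illustration under stated assumptions, not the architecture of any reviewed system: the image is synthetic, the median filter, global threshold, bounding-box features and 1-nearest-neighbour rule are stand-ins chosen for brevity, and all function names and reference values are hypothetical.

```python
from statistics import median

def acquire():
    """Phase 1: stand-in for image acquisition (a tiny synthetic grayscale image)."""
    image = [[10] * 8 for _ in range(6)]
    for col in range(1, 7):          # a bright, rod-like object on a dark background
        image[2][col] = 200
        image[3][col] = 200
    image[0][7] = 255                # one isolated noisy pixel
    return image

def preprocess(image):
    """Phase 2: 3x3 median filter to suppress isolated noise."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            window = [image[i][j]
                      for i in range(max(0, r - 1), min(rows, r + 2))
                      for j in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = median(window)
    return out

def segment(image, threshold=128):
    """Phase 3: global threshold separating foreground (1) from background (0)."""
    return [[1 if value > threshold else 0 for value in row] for row in image]

def extract_features(mask):
    """Phase 4: two simple shape features, area and bounding-box elongation."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return [len(pixels), max(height, width) / min(height, width)]

def classify(features, references):
    """Phase 5: 1-nearest-neighbour against labelled reference feature vectors."""
    def distance(ref):
        return sum((a - b) ** 2 for a, b in zip(features, ref[0]))
    return min(references, key=distance)[1]

# Hypothetical labelled references: ([area, elongation], label).
references = [([9, 1.0], "cocci"), ([8, 2.5], "bacilli")]

features = extract_features(segment(preprocess(acquire())))
label = classify(features, references)
print(features, label)
```

In a real system each phase would be considerably richer (for example, feature selection after extraction, and a trained classifier instead of fixed references), but the data flow between the phases is the same.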

Fig. 1

Flow chart with basic architecture of automatic bacterial image classification system

Contribution of this Study:
Important literature from different authors working on microscopic bacterial image classification during the period 1998–2020 has been explored and collected to present consolidated information on a single platform for new researchers interested in this domain.
A detailed summary of ML-based approaches has been presented, covering the type of bacteria studied, feature extraction, dataset details, the ML techniques implemented and the performance metrics used for evaluation.
The highest performance achieved by different ML techniques such as KNN, SVM, RF and DL, and the feature set on which it was achieved, has been discussed.
The selected research works have been carefully analyzed in depth to identify the limitations and future scope of each work.
Different opportunities and challenges in implementing ML techniques for bacterial image classification have also been presented to provide insight to researchers working in the field.
This review is structured as follows: Section II covers the sources of information from which the articles were collected during the literature survey. Section III presents the use of different ML techniques in image analysis through a literature survey covering 1998 to 2020. Section IV discusses the classification techniques and their performance metrics in each case. Section V presents the conclusion and the future scope of each article included in the study. The future scope from all the articles is presented to give fruitful direction to researchers interested in carrying out further improvements in this domain.

Article Search & Sources of Information

The aim of this review is to identify studies related to bacteria classification and identification using ML techniques. The review covers the period from 1998 to 2020. The articles were searched for and collected from reputed publishers and repositories such as Springer link, IEEE Xplore, Elsevier, the ACM Digital Library, ResearchGate and PLOS. The keywords used while searching for relevant articles in the respective journal databases are given in Table 1. Initially, a total of 81 research articles were obtained in the concerned field, but after careful investigation, 21 irrelevant and duplicate articles were excluded. Then, after screening the abstracts and full texts of the remaining articles, the number of articles was reduced to 48, considering only work written in English (Table 2). Of these, 40 relevant articles based solely on ML techniques were finally selected for this review. The selected articles include research papers and book chapters published in reputed journals and in the proceedings of national and international conferences.

Table 1

Keyword-based journal selection

Resources	Keywords used	Journals referred
Springer link	Bacteria, Mycobacterium tuberculosis, Food borne pathogens, Machine learning OR Deep learning, Classification OR Classify OR Classifier	Signal, Image and Video Processing; Journal of Food Measurement and Characterization; Image Analysis and Processing-ICIAP; Applied Microbiology and Biotechnology; Intelligence in Big Data Technologies-Beyond the Hype; Information System Frontiers
Elsevier (Science Direct)	Bacteria, Mycobacterium tuberculosis, Food borne pathogens, Machine learning OR Deep learning, Classification OR Classify OR Classifier	Ecological Informatics; Biocybernetics and Biomedical Engineering; Real Time Imaging; Infrared Physics and Technology; Sensors and Actuators B: Chemical
IEEE	Bacteria, Mycobacterium tuberculosis, Foodborne pathogens, Machine learning OR Deep learning, Classification OR Classify OR Classifier	IEEE Access
PLOS	Bacteria, Mycobacterium tuberculosis, Foodborne pathogens, Machine learning OR Deep learning, Classification OR Classify OR Classifier	Computational Biology; PLOS One

Table 2

List of IEEE Conferences

International Conference on Signal and Image Processing (ICSIP)

International Conference on Electrical, Computer and Communication Engineering (ECCE)

International Seminar on Research of Information Technology and Intelligent System (ISRITI)

10th International Conference on Electrical and Computer Engineering

IEEE International Conference on Engineering and Technology (ICETECH)

IEEE Journal of Biomedical and Health Informatics

International Conference on Electrical Engineering and Informatics

IEEE Transactions on Information Technology in Biomedicine

International Conference on Computer Science and Software Engineering

9th Cairo International Biomedical Engineering Conference (CIBEC)

IEEE International Conference on Imaging System and Techniques (IST)

10th International Conference on Intelligent System Design and Applications

Research articles from 15 journals and 12 IEEE conferences were included in this review paper. Of the 15 journals referred to, 7 are published by Springer link, 5 by Elsevier (Science Direct), 1 by IEEE and 2 by PLOS. Of the 40 articles selected, 13 are from conference proceedings. The year-wise number of publications on bacteria classification included in this review is given in Fig. 3.

Fig. 3

Yearly distribution of selected articles

ML in Bacterial Image Analysis

ML is a branch of AI that has shown great success in computer science. ML algorithms are trained on given data and can thereafter be used to predict outcomes on new data. In microbiology, researchers use ML techniques to analyze digital microscopic images of bacteria. The study of bacteria is known as bacteriology, in which researchers analyze bacteria on the basis of their morphological features and genetics. Beneficial bacteria are economically important in many areas, such as food processing, biotechnology, fiber retting, pest control and genetic engineering. Escherichia coli is used to produce vitamin K and riboflavin, Lactobacillus is used to make curd and cheese, and Ruminococcus spp. help digest cellulose by secreting the enzyme cellulase. Pathogenic bacteria are harmful to humans and cause many diseases: for example, Mycobacterium tuberculosis causes tuberculosis, and saprotrophic bacteria attack and decompose organic matter and cause food poisoning [28]. Thus it becomes very important to classify these species of bacteria using ML techniques in order to understand their behavior. ML techniques have been employed extensively by various researchers to study different types of bacteria, and the following section discusses some of the important studies from 1998 to 2020.
Studies carried out w.e.f. 1998 to 2010: In 1998, Holmberg et al. [29] presented an ANN-based technique for the classification of urinary bacteria. They used an ANN with an electronic nose to extract features of the bacteria from sensor data. The data were collected from the Department of Microbiology, University Hospital, Linköping. A total of 100 Petri dishes containing bacteria (E. coli, Enterococcus, Proteus mirabilis, Pseudomonas aeruginosa and Staphylococcus saprophyticus) were monitored, and finally 48 features were extracted from the dataset.
These features were then fed to the proposed system, which achieved a final classification accuracy of 76%. In the same year, a neural-network-based technique was proposed by Veropoulos et al. [30] for the classification of tuberculosis bacteria. The architecture consisted of a multi-layered NN trained with the back propagation learning rule and 2 hidden units. The dataset consisted of 1147 image objects, of which 267 were tubercle bacilli and 880 were other objects. Of these, 1000 were used for training and 147 for testing. The dataset was collected from the South African Institute for Medical Research (SAIMR), Cape Town. The reported accuracies were 91.8% with KNN, 97.9% with BP, 96.6% with SCG and 95.9% with KA. In 2001, Liu et al. [31] worked on the classification of bacterial morphotypes using morphological features. The authors used a KNN classifier and proposed an image analysis program, CMEIAS, for the Windows NT environment. They obtained 1937 grayscale digital images of various communities. The CMEIAS shape classifier gave an accuracy of 96.0% on the training set of 1471 cells and 97.0% on the test set of 466 cells, representing all 11 bacterial morphotype classes. In 2004, Forero et al. [32] implemented a technique for the classification of Mycobacterium tuberculosis bacilli using the K-means clustering algorithm. The authors worked with full microscopic sputum samples (between 8 and 100 images per sample) rather than individual images. A set of 397 negative images from 31 subjects and 75 positive images from 4 patients was acquired. The proposed approach achieved a specificity of 93.54% and a sensitivity of 100%. A three-layered back propagation neural network (BPNN) was introduced by Xiaojuan et al. [33] for the classification of bacteria. Six features were extracted from 8 types of bacteria.
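Several studies in this section report sensitivity and specificity alongside accuracy (for example Forero et al. [32] above). A minimal sketch of how these metrics are computed from confusion counts, assuming the usual two-class definitions; the counts below are hypothetical and are not taken from any cited study.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that are detected: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that are rejected: TN / (TN + FP)."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion counts for a bacillus / non-bacillus test set.
tp, fn, tn, fp = 45, 5, 80, 20

print(f"sensitivity = {sensitivity(tp, fn):.2%}")
print(f"specificity = {specificity(tn, fp):.2%}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")
```

Because accuracy pools both classes, it can look high on an imbalanced test set even when sensitivity is low, which is why the two class-wise rates are often reported separately.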
Sixty samples from the NMCR (National Microbe Culture Resource) database were selected for training and 20 samples were used for testing. The proposed system achieved a classification accuracy of 86.3%. Men et al. [34] presented an ML-based technique for the classification of heterotrophic bacterial colonies. The classification was done using SVM. The primary heterotrophic bacteria colony images were first pre-processed and then used to extract 6 features, i.e. area, perimeter, equivalent diameter, shape factor, length and width. These features were then fed to the classification algorithm. A total of 300 instances were obtained after feature extraction, of which 200 were heterotrophic bacteria colonies and 100 were non-heterotrophic bacteria colonies. Of these, 60% were used as the training set and the remaining 40% as the testing set. The proposed method achieved a training accuracy of 98.7% and a testing accuracy of 96.9%. An ML-based approach was implemented by Chen et al. [35] (2009) to count and classify bacterial colonies on Petri dishes. The proposed method was able to recognize chromatic and achromatic images, and to deal with both colored and clear medium images. After recognition, the dish/plate regions containing bacteria were detected. The morphological features of each species were used for classification by an SVM with a radial basis function kernel. Petri dishes were collected with two different types of medium and bacteria stains. The first type of plate, containing blue Mitis-Salivarius agar, was collected from the Department of Pediatric Dentistry at the University of Alabama at Birmingham, and the second type, containing clear LB agar, was obtained from the Division of Nephrology, Department of Medicine, at the same university. The proposed system is reported to be a robust and effective automatic bacterial colony counter.
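The six shape features named for Men et al. [34] above (area, perimeter, equivalent diameter, shape factor, length and width) can be computed from a segmented binary mask. The sketch below is an illustration only: the source does not give the formulas, so the definitions used here are assumptions (perimeter as the count of exposed pixel edges, equivalent diameter as the diameter of a circle of equal area, shape factor as 4πA/P², length and width as bounding-box extents), and the function name `shape_features` and the test mask are hypothetical.

```python
import math

def shape_features(mask):
    """Compute six simple shape descriptors from a binary mask (list of 0/1 rows)."""
    rows, cols = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]
    area = len(pixels)

    # Perimeter: number of pixel edges that border background or the image frame.
    perimeter = 0
    for r, c in pixels:
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or not mask[nr][nc]:
                perimeter += 1

    row_idx = [r for r, _ in pixels]
    col_idx = [c for _, c in pixels]
    extent_rows = max(row_idx) - min(row_idx) + 1
    extent_cols = max(col_idx) - min(col_idx) + 1

    return {
        "area": area,
        "perimeter": perimeter,
        "equivalent_diameter": math.sqrt(4 * area / math.pi),
        "shape_factor": 4 * math.pi * area / perimeter ** 2,
        "length": max(extent_rows, extent_cols),
        "width": min(extent_rows, extent_cols),
    }

# A hypothetical 3 x 5 rectangular colony inside a 5 x 7 image.
mask = [[0] * 7 for _ in range(5)]
for r in range(1, 4):
    for c in range(1, 6):
        mask[r][c] = 1

features = shape_features(mask)
print(features)
```

The resulting dictionary is the kind of per-colony feature vector that would then be passed to a classifier such as an SVM.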
The method achieved an overall accuracy of 96%: the accuracy for the 25 chromatic images was 92% and that for the 75 achromatic images was 97%. In the same year, Xiaojuan et al. [36] proposed a BPNN for the classification of wastewater bacteria. The experiment used a total of 16 samples, of which 10 were used for training and 6 for testing. The proposed method employed a self-adaptive accelerated back propagation algorithm to train the classifier for bacterial microscopic images. After training, the method was tested with the CECC database, and the results showed that the approach was effective in improving the speed and consistency of large-scale surveys or rapid determination of bacterial abundance and morphology, which could make the estimation of bacterial condition more accurate. The proposed approach gave an accuracy of 85.5%. Osman et al. (2009) [37] developed an NN method combined with a genetic algorithm for the detection of Mycobacterium tuberculosis bacteria in tissue. The proposed method comprised four main steps, i.e. image segmentation using K-means clustering, feature extraction, feature selection and classification using a GA-NN approach. The authors extracted the seven Hu moment invariants as features, applied a genetic algorithm for feature selection and used the Levenberg–Marquardt algorithm to train a multilayer perceptron to classify bacteria into two classes, i.e. ‘possible TB’ and ‘true TB’. For the experiment, Ziehl–Neelsen (ZN) stained tissue images of tubercle bacilli were collected from the Pathology Department, Hospital Universiti Sains Malaysia, Kelantan. A total of around 960 object images (360 ‘true TB’ and 600 ‘possible TB’) were obtained from 120 tissue slide images with various staining conditions.
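As a consistency check on the figures reported for Chen et al. [35] above, the overall accuracy can be recovered as the image-count-weighted mean of the two per-group accuracies. The sketch assumes the overall figure is computed this way, which the source does not state explicitly.

```python
# Per-group image counts and accuracies as reported for Chen et al. above.
groups = {
    "chromatic": (25, 0.92),
    "achromatic": (75, 0.97),
}

total_images = sum(count for count, _ in groups.values())
correct = sum(count * acc for count, acc in groups.values())
overall = correct / total_images

print(f"overall accuracy = {overall:.2%}")
```

The weighted mean is about 95.75%, which rounds to the 96% quoted above.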
From these 960 object images, the best 680 were chosen, of which 400 (200 ‘true TB’ and 200 ‘possible TB’) were used for training and 280 (80 ‘true TB’ and 200 ‘possible TB’) were used for testing. The approach achieved an accuracy of 89.64%. An automated classification method was introduced by Khutlang et al. (2010) [38] to identify Mycobacterium tuberculosis in images of ZN-stained sputum smears. The authors used KNN, PNN and SVM classifiers for the classification of objects. The dataset consisted of a total of 11,259 instances, of which 6901 instances from 11 subjects were used for training, comprising 4999 instances of bacilli and 1902 instances of non-bacilli. For testing, a total of 4358 instances obtained from 8 subjects were used, of which 1838 were bacilli and 2520 were non-bacilli. The approach achieved sensitivity and specificity values of 95%. In the same year, Hiremath et al. (2010) [39] developed an ML-based technique for the classification of cocci bacterial cells. The proposed system employed three classifiers, namely 3σ, K-NN and NN, for identifying the arrangement of cocci bacterial cells based on their geometric features. The ANN consisted of 5 input neurons for the 5 shape features, three outputs, a back propagation function and a gradient descent function. The dataset consisted of 500 color digital bacterial cell images of different types, i.e. cocci, diplococci, streptococci, tetrad, sarcinae and staphylococci. The approach achieved an accuracy of 84% to 94% with the 3σ classifier; the K-NN classifier gave an accuracy of 75% to 100% with k = 1 and 96% to 100% with k = 3. The NN classifier achieved an accuracy of 98% to 100%. Among all the classifiers, the NN proved to be the best for the given experiment. Akova et al.
[40] in 2010 proposed an ML-based method to classify bacterial serovars. A total of 28 subclasses from 5 bacterial species (E. coli, Staphylococcus, Salmonella, Vibrio and Listeria) were used in the experiment. The dataset consisted of 2054 random samples of the bacteria, which were split into 80% for training and 20% for testing. Initially, textural features were obtained from the samples and fed to a support vector data description (SVDD) classifier and a Bayesian classifier. Because the Bayesian classifier was not specific to the bacteria classification problem, its results were also validated on the benchmark letter recognition dataset from the UCI repository. The experiment achieved an accuracy of 82%.
Studies carried out w.e.f. 2011 to 2020: In 2011, Rulaningtyas et al. [41] implemented an NN trained with back propagation for the classification of tuberculosis bacteria. The classification was done using an NN with tuned hyper-parameter values, i.e. momentum = 0.9, learning rate = 0.5, mean square error = 0.00036 and number of hidden layers = 20. The dataset consisted of 100 binary images of tuberculosis samples, of which 75 were used for training and 25 for testing. The approach delivered an accuracy of 80%. A Cocci Classifier Tool was developed by Hiremath et al. [42] to detect the different types of cocci, i.e. cocci, diplococci, streptococci, tetrad, sarcinae and staphylococci. The proposed system consisted of three classifiers, namely the 3σ classifier, the KNN classifier and the NN classifier. The NN classifier comprised an input layer with five neurons fed with five shape features, a hidden layer and an output layer with six neurons, one for each class. The authors took 1733 cells of cocci, diplococci and tetrad for classification and achieved an accuracy of 97%.
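The momentum and learning-rate values reported for Rulaningtyas et al. [41] above enter a back propagation weight update roughly as shown below. This is a sketch under stated assumptions, not the authors' network: it applies classical momentum to a single weight on a hypothetical quadratic loss, and the function name `momentum_descent` is illustrative.

```python
def momentum_descent(gradient, w0, learning_rate=0.5, momentum=0.9, steps=300):
    """Gradient descent with classical momentum on a single weight."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity accumulates past gradients, scaled by the momentum term.
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w

# Hypothetical loss L(w) = 0.5 * (w - 2)^2, whose gradient is (w - 2); minimum at w = 2.
w_final = momentum_descent(lambda w: w - 2.0, w0=0.0)
print(round(w_final, 4))
```

In a multilayer network the same update is applied to every weight, with the gradient supplied by back propagation rather than by a closed-form expression.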
The remaining 310 cells, representing the classes sarcinae, streptococci and staphylococci, were classified with 91% accuracy, resulting in an overall classification accuracy of 94% for the entire training and testing set. The accuracy achieved by the 3σ classifier was in the range of 92–99%, while that of KNN was 91–100% and that of the NN classifier was 97–100%. Of all the classifiers, NN gave the best results. In 2013, Ahmed et al. [43] implemented a distributed grid computing approach for the classification of Vibrio bacteria. Fisher's discriminant analysis was used to select features, and an SVM with a linear kernel was then used for classification. The technique involved standardizing the scatter patterns obtained from bacterial colonies using histogram equalization and image centering. Thereafter, features such as Haralick texture features and Zernike and Chebyshev moments were extracted, resulting in a feature vector containing thousands of features. Finally, the most prominent features were selected using Fisher's criterion. The dataset consisted of a total of 1000 images, i.e. 100 images per strain. The proposed method achieved accuracy levels varying from 90% to 99% for different bacterial species. In the same year, Chayadevi et al. [44] employed both statistical and NN approaches for extracting bacterial clusters and counting different bacterial species in digital microscopic images. The proposed technique involved thresholding and binarization, followed by segmentation and feature extraction. Finally, K-means clustering and self-organizing maps (SOM) were used for clustering and counting. The authors used a primary dataset, collected from a hospital, consisting of 320 digital images of bacteria. The results obtained with the proposed system were compared with manual counts made by doctors, and the proposed system proved to be more accurate than human visual counting. In 2014, Ferrari et al.
[45] proposed a multistage classification technique using SVM with radial basis function (RBF) kernels. The experiment used solid agar plates from which a total of six species of bacteria were classified, i.e. Enterococcus faecalis, E. coli, Klebsiella, Proteus mirabilis, Staphylococcus aureus and Streptococcus agalactiae. For each bacterial class, two solid agar plates were inoculated. In total, 22 agar plate images were analyzed, and 74 isolated bacterial colonies were identified from these images. 60% of the samples were used for training, 20% for cross validation and 20% to measure the generalization performance of the classifier. The proposed system gave an accuracy of 93%. A novel RF-based technique was applied by Ayas et al. [46] for the classification of tuberculosis bacilli. The dataset for the experiment was obtained from the Mycobacteriology laboratory at the Faculty of Medicine, Karadeniz Technical University. The dataset consisted of 116 images from five positive-smear slides of five different subjects. During the experiment, 40 images were used for training and 76 images for testing. The method achieved an accuracy of 89.38% with a sensitivity of 93% and a specificity of 96.97%. The results were then compared with SVM, ANN and GPDF, and the approach showed better performance than the other techniques. Govindan et al. [47] introduced a technique for the classification of tuberculosis bacilli using ZN images. The approach used SVM with library tools (LIBSVM) for classification. The experiment was carried out using two microscopic image sets: the first was obtained from the Public Health Image Library (PHIL), an open source of microscopic images, and the second was obtained from Global Hospital and Health City, Chennai. These image sets were then used to extract the ZN-stained images of 34 tuberculosis-positive and 16 negative patients. 
The proposed method achieved an accuracy of 90.89% with a sensitivity of 72.89%. In the same year, a DL-based technique was implemented by Nie et al. [48] for the classification of bacterial images. The classification was done using a CNN model with multiple intermediate layers. The authors cultured bacteria on two different media (ISP2 and oatmeal agar), where each plate contained 2 colonies selected from a set of 17 strains. The plates were incubated at 30 °C for 8 days and were imaged on the 2nd, 4th, 6th and 8th day after spotting. From these, 862 images of growing colonies were obtained. The proposed CNN architecture was then trained on these images, which resulted in accuracy ranging from 52.5 to 78.47%. In 2016, Seo et al. [49] proposed an ML algorithm for the classification of five species of Staphylococcus bacteria. The experiment used hyperspectral microscopic images of the five bacterial species obtained from the Poultry Microbiological Safety and Processing Research Unit of the U.S. National Poultry Research Center, Agricultural Research Service, in Athens, GA. First, spectral data for the five species (aureus, haemolyticus, hyicus, sciuri and simulans) were generated from regions of interest (ROIs) using threshold-based segmentation. In the next step, outliers were removed using the Mahalanobis distance method, followed by key wavelength selection using Pearson’s correlation coefficient. In the last step, classification was done using SVM and partial least squares discriminant analysis (PLS-DA). The proposed method gave an accuracy of 89.8% and 97.8% for SVM and PLS-DA respectively. The SVM method made it possible to identify not only gram-negative bacteria but gram-positive bacteria too. In the same year, a neural network based technique was implemented by Priya et al. [50] for the classification of tuberculosis bacilli bacteria. 
The presented technique involved the following steps, in order: segmentation, feature extraction by Fourier descriptor, and feature selection using fuzzy entropy. The selected features were fed to a hybrid classifier, i.e. SVM coupled with a multilayer perceptron network (SVNN), for classification. The result of the SVNN was compared with that of the BPNN, and the SVNN showed better accuracy. The dataset was collected from the South African National Health Laboratory Services, Groote Schuur Hospital in Cape Town, where it was prepared by smearing the sputum specimen on a clean slide. A total of 1537 objects from 100 images were used for training. The proposed system achieved an accuracy of 92.5% with a sensitivity of 95% and a specificity of 90%. Lopez et al. [51] presented a classification technique for the identification of Mycobacterium tuberculosis (MT) using CNN. The authors created a patch dataset with 9,770 patches of positive and negative smears from 492 images. CNN models were trained using three versions of the patches: R-G, RGB and grayscale. For comparative analysis of the different versions, ROC curves were used, and the best experimental results were obtained with the R-G format. The authors used three convolutional layers to evaluate the image dataset with a robust balanced fusion and to classify a patch as positive or negative for MT. The proposed system achieved an accuracy of 96%. In the same year, a CNN-based technique was also introduced by Turra et al. [52] for the classification of hyperspectral bacterial colonies. The architecture consisted of two convolutional layers, one pooling layer and a softmax layer for classification. The dataset was taken from the American Type Culture Collection (ATCC) and consisted of 16,642 colonies that were streaked from eight classes (E. coli (5539 colonies), E. faecalis (1958 colonies), S. aureus (2355 colonies), P. mirabilis (2315 colonies), P. vulgaris (654 colonies), K. pneumoniae (542 colonies), Ps. 
aeruginosa (1529 colonies) and Str. agalactiae (1750 colonies)); the colonies were grown on Petri dishes to form 106 hyperspectral imaging (HSI) volumes. The CNN model used 50,000 training iterations. The approach was compared with SVM and RF, and showed a better overall performance. The accuracy achieved by this approach was 99.7%, whereas the accuracy of SVM was 99.5% and that of RF was 93.8%. In 2017, Zielinski et al. [53] presented a DL-based hybrid technique for the classification of bacterial species and genera. The authors used a deep convolutional neural network (DCNN) to obtain image descriptors, a pooling encoder to produce feature vectors, and finally SVM or RF to carry out the classification task. The experiment was carried out on the Digital Images of Bacteria Species (DIBaS) dataset, collected from the Chair of Microbiology of the Jagiellonian University in Kraków, Poland, which consisted of 33 bacteria species with 20 images of each bacterium. The dataset was partitioned randomly into 50% training and 50% testing instances. The proposed approach achieved an accuracy of 97.24%. Mohamed et al. [54] implemented an ML technique for bacteria classification. The technique involved feature extraction using a Bag of Words model and classification using a multiclass linear SVM. The authors selected a database of 200 DIBaS images containing 10 species of bacteria (Acinetobacter baumannii, Actinomyces israelii, Enterococcus faecium, Lactobacillus jensenii, Lactobacillus paracasei, Fusobacterium, Lactobacillus delbrueckii, Lactobacillus reuteri, Micrococcus spp. and Candida albicans) with 20 images of each species. The dataset was divided into two parts: 70% was used for training and 30% for testing. This method achieved an accuracy of 97% for the ten-class classification problem, and 100% accuracy was achieved for 8 classes. In 2018, Panicker et al. 
[27] proposed a DL-based approach for the automatic detection of tuberculosis bacilli (TB) from microscopic sputum smear images. The proposed method used image binarization for image de-noising and a CNN for pixel-level classification. The CNN architecture consisted of three convolutional layers, a fully connected layer and a sigmoid layer. The dataset of TB images was collected from the Instituto Nacional de Pesquisas da Amazônia (INPA) lab, Manaus, Brazil. It consisted of 120 images from which 900 positive patches and 900 negative patches were cropped. Of these, 80% of the patches were used for training and 20% for testing. The proposed method achieved a recall of 97.13%, a precision of 78.4% and an F-score of 86.76%. Wahid et al. [55] applied the Inception V1 approach for the classification of microscopic bacterial images. The technique involved manual cropping of the images, conversion from grayscale to RGB, flipping, and finally translation of the resulting image. The images were classified using the ‘Inception V1’ DCNN model. The authors collected 500 images of 5 bacteria species (Clostridium botulinum, Vibrio cholerae, Neisseria gonorrhoeae, Borrelia burgdorferi and Mycobacterium tuberculosis) from several online resources such as HOWMED and PIXNIO. During the experiment, 100 images were used for testing and 400 images for training. This method achieved an accuracy of 95%. In the same year, a CNN-based approach was applied by Traore et al. [56] for image recognition of Vibrio cholerae and Plasmodium falciparum in order to classify epidemic pathogens. The presented approach was implemented with seven hidden layers (six convolution layers and one fully connected layer), and a softmax layer was used for the final classification. TensorFlow was used as the backend framework for implementing DL. The dataset was downloaded from Google and consisted of 200 images of Vibrio cholerae and 200 images of Plasmodium falciparum. 
The data was partitioned into an 80% training set and a 20% testing set. This method achieved an overall accuracy of 94%. Rahmayuna et al. [57] proposed a technique to classify images of four common pathogenic bacteria at the genus level. The authors classified the images in two steps: the first step improved the image quality with the Contrast Limited Adaptive Histogram Equalization (CLAHE) method and the second step involved texture analysis. After this, classification was done using SVM with linear and radial basis function (RBF) kernels. The authors collected 600 optical images of bacteria: 150 images of Escherichia, 150 images of Listeria, 150 images of Salmonella and 150 images of Staphylococcus. Of these, 540 images were used for training and 60 images for testing the proposed framework, and an accuracy of 90.33% was obtained. A DL-based approach was introduced by Hay et al. [58] for the identification of gut bacteria in the larval zebrafish intestine using 3D light sheet fluorescence microscopy images. The proposed CNN architecture was used to classify the images as bacterial or non-bacterial. The authors used Google's open-source TensorFlow framework to implement the CNN architecture. A comparison of the CNN with RF and SVM was also presented in this paper, and it was observed that the proposed CNN architecture gave the best results with an accuracy of 90%. In 2019, Mithra et al. [59] applied deep belief neural networks for the classification of Mycobacterium tuberculosis bacilli from ZN-stained microscopic images. The presented model was also compared with existing models, and it was concluded that the deep belief neural network classifier gave better results. The authors collected the dataset from ZNSM-iDB; it comprised 500 images in the categories autofocus, no bacillus, few bacilli, overlapping and over-stained. Of these, 275 images were used for training and 225 images for testing. 
The proposed system achieved an accuracy of 98.23% with a sensitivity of 97.55% and a specificity of 97.86%. In the same year, Treebupachatsakul et al. [60] implemented a LeNet CNN approach for the recognition of two species of bacteria, i.e. Staphylococcus and Lactobacillus. The proposed method was implemented in Python using the Keras API with the TensorFlow DL framework. The authors created their own dataset, which consisted of more than 400 sample images of two types, i.e. Staphylococcus aureus and Lactobacillus delbrueckii. The dataset was separated into two parts: 80% of the data was used for training and 20% for testing. The proposed system achieved an accuracy of 75%. A hybrid technique based on Inception-V3 and SVM was proposed by Ahmed et al. [61] for the classification of microscopic bacterial images. The proposed method pre-processed the images by manual cropping, conversion from grayscale to RGB, flipping and, finally, translation, followed by feature extraction using the Inception-V3 DCNN model. The images were then classified into their respective classes using SVM. The authors used seven bacteria species: Clostridium botulinum, Borrelia burgdorferi, Rickettsia rickettsii, Mycobacterium tuberculosis, Streptococcus pyogenes, Vibrio cholerae and Neisseria gonorrhoeae. 80% of the samples in the dataset, comprising 800 microscopic images, were used for training the hybrid network, and 20% of the samples, comprising 160 samples, were used for training the DCNN model. For testing, the authors used 200 images of the seven bacteria. The proposed system achieved an accuracy of 96%. Abd-Alhalem et al. [62] presented a DL technique for the classification of bacteria (Actinobacteria, Firmicutes and Proteobacteria) on the basis of DNA nucleotides, i.e. A, T, C and G, using a CNN based on random projection on different data reduction layers. 
The proposed system consisted of four layers: three convolutional layers, each with a pooling layer, and a fourth, fully connected layer that uses the softmax activation function. The authors collected 2000 sequences, which can be grouped into 5 ordered taxonomic ranks named phylum, class, order, family and genus. The proposed system was compared with an SVM classifier and gave better results. The following year, Bonah et al. [63] used a meta-heuristic optimization algorithm for food-borne pathogenic bacteria classification using hyperspectral imaging. The proposed method experimentally showed better results than SVM, Competitive Adaptive Reweighted Sampling (CARS) and particle swarm optimization. The authors obtained 320 images of 8 bacteria species, i.e. E. coli, Escherichia coli O157:H7, ampicillin-resistant E. coli, Listeria monocytogenes, Staphylococcus aureus, methicillin-resistant Staphylococcus aureus, Salmonella enteritidis and Salmonella typhimurium, consisting of 40 samples per species. The accuracy achieved by the proposed system was 99.47%. Kang et al. [64] developed multiple advanced DL frameworks, i.e. a long short-term memory (LSTM) network, a deep residual network (ResNet) and a one-dimensional CNN (1D-CNN), for the classification of food-borne bacteria using hyperspectral microscopic imaging (HMI) technology. The dataset of five common food-borne bacterial cultures (Campylobacter jejuni, generic E. coli, Listeria innocua, Staphylococcus aureus and Salmonella Typhimurium) was collected from the Poultry Microbiological Safety and Processing Research Unit (PMSPRU) of the U.S. Department of Agriculture in Athens, Georgia, USA. The dataset consisted of 5000 images, which were randomly partitioned during the experiment into 72% (3600 cells), 18% (900 cells) and 10% (500 cells) for training, validation and testing respectively. 
The experiment showed that LSTM, ResNet and 1D-CNN achieved accuracies of 92.2%, 93.8% and 96.2% respectively. In the same year, a DL-based classifier was implemented by the same authors [65] for the classification of food-borne bacterial species at the cellular level using HMI technology combined with CNN. The proposed method implemented a 1D-CNN architecture, KNN and SVM for classification. The imaging dataset for five food-borne bacteria, i.e. Campylobacter fetus, Escherichia coliform, Listeria innocua, Salmonella Typhimurium and Staphylococcus aureus, was collected from the Poultry Microbiological Safety and Processing Research Unit (PMSPRU) of the U.S. Department of Agriculture in Athens, Georgia, USA. The proposed system achieved an overall accuracy of 90%. In the same year, Mhathesh et al. [66] applied a DL technique for the classification of 3D light sheet fluorescence microscopy images of larval zebrafish. The authors used a CNN for the classification of bacterial images. In this model, different activation functions, i.e. sigmoid, tanh and ReLU, were used to analyze the accuracy of the model. The results of the approach were then compared with other classifiers such as the support vector classifier, RF and ConvNet. The approach achieved an accuracy of 95%, which outperformed the other selected techniques. In the same year, Sajedi et al. [67] used the Extreme Gradient Boosting (XGBoost) classification method to classify three different Myxobacterial suborders, i.e. Cystobacterineae, Sorangiineae and Nannocystineae. The proposed method used the Gabor transform to extract texture features and XGBoost to classify the three categories of bacteria. The accuracy achieved by the proposed system was 90.28%. In addition, the authors also reviewed the literature on the classification of bacteria using ML methods, including deep neural networks. 
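Most of the studies surveyed above report a random partition of the data into training, validation and testing subsets. As a minimal sketch of such a partition, the following reproduces the 72%/18%/10% proportions reported for the 5000-image dataset of Kang et al. [64]. The function name, the seed and the use of index lists are assumptions made for illustration; the original studies do not describe their splitting code.

```python
import random


def split_indices(n_samples, fractions=(0.72, 0.18, 0.10), seed=0):
    """Randomly partition range(n_samples) into train/validation/test index lists.

    Hypothetical helper: the fractions default to the proportions reported
    in the text, and the seed only makes the shuffle repeatable.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * fractions[0])
    n_val = round(n_samples * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


train, val, test = split_indices(5000)
print(len(train), len(val), len(test))
```

Because the three lists are disjoint slices of one shuffled list, no sample can appear in more than one subset, which is the property that makes the reported test accuracy meaningful.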
Critical Analysis: Tables 3 and 4 summarize the different ML approaches employed for the classification of different bacterial species, along with the limitations and future scope of each work. The literature survey demonstrates that ML techniques have achieved considerable success in automating the traditional procedure of bacteria classification. The majority of the work has focused on food bacteria, different bacterial colonies, wastewater bacteria, cocci bacteria and some pathogenic bacteria species such as tuberculosis and Vibrio cholerae. The types of images used included ZN-stained sputum smear microscopic images, hyperspectral images and digital microscopic images of bacteria species as well as whole agar plates. Over the entire course of the study we found that both the datasets and the number of studies are limited. Most of the datasets employed by researchers are private and cannot be accessed directly. Very few datasets, such as DIBaS (http://misztal.edu.pl/software/databases/dibas/) [68], HOWMED (http://howmed.net/microbiology) and PIXNIO (http://pixnio.com/photos/science/microscopyimages) [69], are available online. They provide a starting point for ML-based bacteria classification, but there is still a need for high-quality benchmark datasets. A lot of work has been done by different researchers using different datasets, but only limited work has been done on each dataset, which makes it difficult for researchers to compare the performance of the different techniques across datasets. This creates scope for rigorous research in this field. It has been observed that a few ML techniques perform better on a particular dataset but suffer performance degradation on other datasets. Several researchers working in this domain have used supervised (SVM, RF, KNN etc.) and unsupervised (K-means clustering) ML algorithms to achieve bacteria classification in a semi-automatic mode. The performance of these techniques has been evaluated using accuracy metrics. 
When the dataset is imbalanced, accuracy cannot be considered the only measure of a system's performance. The majority of the researchers have experimented on imbalanced datasets and have used accuracy as the only performance metric. To examine systems trained on imbalanced datasets more thoroughly, a few researchers have used additional performance metrics such as precision, recall and F-score. From 2015 onwards, fully automatic deep learning systems using CNN have shown excellent results when applied to bacteria classification. On the DIBaS dataset, CNN has achieved an accuracy of 97.3%.
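The point about imbalanced datasets can be made concrete with a short sketch. It computes accuracy, precision, recall and F-score from confusion-matrix counts, and also recomputes the F-score from the precision and recall reported by Panicker et al. [27]. The 950/50 class counts in the first example are hypothetical and chosen only to illustrate the effect; the function names are likewise assumptions.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
    }


# Hypothetical imbalanced test set: 950 negative and 50 positive samples.
# A classifier that labels everything negative still scores 95% accuracy,
# although it detects none of the positive samples.
always_negative = binary_metrics(tp=0, fp=0, fn=50, tn=950)
print(always_negative)

# F-score recomputed from the precision (78.4%) and recall (97.13%)
# reported by Panicker et al. [27].
print(round(100 * f_score(0.784, 0.9713), 2))
```

The recomputed F-score agrees with the reported 86.76% up to rounding, while the first example shows why a high accuracy alone says little about a minority class.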

Table 3

A brief summary of different ML approaches used in classification of bacteria (1998 to 2010)

Author/ Year	Methodology/Technique	Types of Bacteria	Feature Selection	Dataset used	Dataset details	Results	Limitations	Future Scope
Holmberg et al. [29]	ANN	Urinary bacteria	Shape features	Petri dish Images	C = 5 T = 100	Acc = 76%	Details of Dataset are incomplete	In future, model will take limited parameters and large amount of data
Veropoulos et al. [30]	NN, Back propagation	Tuberculosis bacilli	Shape features	ZN stained Sputum smear Images	T = 1147 Tr = 1000 Te = 147	Acc = 97.9% Se = 94.1% Sp = 99.1%	Evaluation detail is less	This model may combine with other diagnostic technique that use automatic microscope to reduce the cost effect
Liu et al. [31]	KNN	Different Bacteria species	Shape features	Digital Images	C = 11 T = 4270	Acc = 97%	Overlapping species Images	New version of CMEIAS may be developed with additional features
Forero et al. [32]	K-means Clustering	Tuberculosis bacteria	Shape features	ZN stained Sputum smear Images	C = 8 T = 100	Sp = 93.54% Se = 100%	Dataset is small	Bayesian decision theory along with Gaussian mixture model may be used and dataset will also be extended
Xiaojuan et al. [33]	Back propagation, NN	Different bacteria species	Shape and Texture features	NMCR database	C = 4 Tr = 60 Te = 20	Acc = 86.3%	Evaluation details are not described	A hybrid model may be proposed for bacteria classification
Men et al. [34]	SVM	Heterotrophic bacteria colonies	Shape features	Heterotrophic bacteria colony Images	C = 2 T = 300	Acc = 98.7%	Types of species in images are less explained	A hybrid model may be proposed to classify bacteria colony
Chen et al. [35]	SVM and Radial basis function	Oral cavity bacteria	Color features	Petri Dish Images	C = 2 T = 100	Acc = 96% Pre = 0.97 ± 0.03 Re = 0.96 ± 0.04	Clustered colonies are not distinguished	Model may be improved to detect and distinguish different species of bacteria and clustered colonies
Xiaojuan et al. [36]	NN, Back propagation	Wastewater bacteria	Shape features	CECC database	C = 4 T = 16 Tr = 10 Te = 6	Acc = 85.5%	Evaluation details are not described	Model may be upgraded so as to get better result and classify other bacteria also
Osman et al. [37]	NN	Tuberculosis bacteria	Texture features	ZN stained Sputum smear Images	T = 680 Tr = 400 Te = 280	Acc = 89.64%	Detail of the given dataset is incomplete	More features may be added to improve the performance and analysis
Khultang et al. [38]	KNN, SVM, PNN	Mycobacterium tuberculosis	Fisher’s mapped features	ZN stained sputum smear Images	T = 11,259 Tr = 6901 Te = 4358	Acc = 95%	Evaluation details are not described	Feature set may be extended to include other bacilli classes also
Hiremath et al. [39]	K-NN, NN	Cocci bacterial cells	Geometric features	Digital Microscopic Images	T = 500	Acc = 96%(KNN) Acc = 98%(ANN)	Detail of the dataset is incomplete	Given method may be modified by using better pre-processing techniques and feature sets
Akova et al. [40]	SVDD	Bacteria Serovars	Texture feature	Scatter Pattern Dataset	C = 5 T = 2054	Acc = 82%	Evaluation details are not described	It may use Bayesian approaches to make robust modeling and improve prediction

Table 4

A brief summary of different ML approaches used in classification of bacteria (2011 to 2020)

Author/year	Methodology/technique	Types of bacteria	Feature selection	Dataset used	Dataset details	Results	Limitations	Future Scope
Rulaningtyas et al. (2011) [41]	NN	Tuberculosis bacteria	Shape features	ZN stained Sputum smear Images	T = 100 Tr = 75 Te = 25	Mean square error = 0.000368	Model was not tested on real images	Automated tool to count and analyze tuberculosis bacteria may be developed
Hiremath et al. (2011) [42]	K-NN, NN	Cocci bacteria cells	Geometric and statistical features	Digital microscopic Images	T = 350	Acc = 99% K-NN = 100% ANN = 98%	Dataset size is small	Better preprocessing methods and feature sets may be used to get improved results
Ahmed et al. [43]	SVM	Food bacteria	Shape and texture features	Scatter pattern Dataset	T = 1000	Acc = 90%	Detail of dataset is not elaborative	Dataset may be enhanced to improve the speed of classification
Chayadevi et al. [44]	K-means, SOM	Bacteria clusters from microscopic images	Shape features	ZN stained Sputum smear Images	T = 320	–	High computation cost	Hybrid approach may be used to reduce computational time and cost
Ferrari et al. [45]	SVM with radial basis function	Bacterial colonies	Shape features	Agar plate Images	C = 6 T = 22	Acc = 93%	Detail of dataset is not elaborated	Large dataset may be used to improve results
Ayas et al. [46]	RF	Mycobacterium tuberculosis bacteria	Shape features	ZN stained sputum smear Images	T = 116 Tr = 40 Te = 76	Sp = 62.89% Se = 89.34%	Dataset size is small	This model may be extended to identify the bacteria on the basis of other features
Govindan et al. [47]	SVM	Tuberculosis bacteria	Shape features	ZN stained Sputum smear Images	T = 50	Se = 72.89%	Detail of dataset and result is incomplete	Automated tool in identification of other types of bacteria such as cocci and vibrio may be developed
Nie et al. [48]	CNN	Bacterial colonies	CNN features	ISP2 and oatmeal Agar plates Images	T = 862	Acc = 62.10% Pre = 83.76% Re = 82.16%	Less efficient in classifying bacterial colony images	Methods for training models that label patches spanning multiple heterogenous colonies may be explored
Seo et al. [49]	SVM and PLS-DA	Staphylococcus species	Spectral features	Hyperspectral Images	C = 5	Acc = 97.8%	Detail of dataset is incomplete	The dataset may be increased so as to validate the results on larger testing dataset
Priya et al. [50]	SVM, BPNN	Tuberculosis bacteria	Shape features	Sputum smear slides Images	T = 100	Acc(BPNN) = 82.5% Acc(SVNN) = 92.5%	Details of dataset are incomplete and fewer features were used	More features may be used to improve the classification results
Lopez et al. [51]	CNN	Mycobacterium tuberculosis bacteria	CNN features	Original and Augmented Dataset	T = 492	Acc = 96%	Evaluation details are not described	Model for classifying other bacteria may be proposed
Turra et al. [52]	CNN, SVM, RF	Hyperspectral bacterial images	Shape features	Hyperspectral Images	C = 8 T = 16,642 colonies	Acc(CNN) = 99.7% Acc(SVM) = 99.5% Acc(RF) = 93.8%	Very complex dataset and cost of Hyperspectral imaging is high	Higher number of UTI-relevant pathogens and clinical laboratory validations may be used
Zielinski et al. [53]	CNN, SVM, RF	Bacteria colonies	CNN features	DIBaS dataset	C = 33 T = 660	Acc = 97.24%	When database size is increased, the accuracy decreases	The DIBaS dataset may be extended along with improved investigation method
Mohamed et al. [54]	SVM	Bacteria species	SURF description	DIBaS dataset	C = 10 T = 200	Acc = 97%	Detail of dataset is incomplete	Experiments may be carried out on increased species of the bacteria
Panicker et al. [27]	CNN	Tuberculosis bacteria	CNN features	ZN stained Sputum smear Images	T = 120	Rec = 97.13% Prec = 78.4% F-Score = 86.76%	Dataset is small	Dataset may be enhanced for getting better results
Wahid et al. [55]	CNN	Pathogenic bacteria species	CNN features	HOWMED, PIXNIO	C = 5 T = 500 Tr = 400 Te = 100	Acc = 95%	Numbers of species are less	Number of species may be increased for classification
Traore et al. [56]	CNN	Vibrio cholera and plasmodium falciparum	CNN features	Digital microscopic images downloaded from Google	C = 2 T = 480 Tr = 400 Te = 80	Acc = 94%	Dataset is small	The model may be optimized to get better results
Rahmayuna et al. [57]	SVM	Pathogenic bacteria	Texture features	Dataset Downloaded from kaggle website	C = 4 T = 600 Tr = 540 Te = 60	Acc = 90.33%	Features were Selected manually	Automatic feature selection may be implemented
Hay et al. [58]	CNN	Larval zebra fish intestinal bacteria	Texture features	Zebrafish dataset	T = 1190	Acc = 89.3%	Dataset is small	Dataset may be enhanced for better performance
Mithra et al. [59]	Deep belief neural network	Tuberculosis bacteria	Pixel intensity based features	ZN stained Sputum smear Images	T = 500 Tr = 275 Te = 225	Acc = 97.55% Se = 97.86% Sp = 98.23%	Detail of dataset is incomplete	Model for classifying other bacteria may be proposed
Treebupachatsakul et al. [60]	CNN	Staphylococcus aureus and lactobacillus delbrueckii	CNN features	ZN stained Sputum smear Images	C = 2 T > 400	Acc = 75%	Evaluation details are not described	The LeNET architecture may be modified to train it on larger number of epochs
Ahmed et al. [61]	SVM	Pathogenic bacteria	CNN features	HOWMED, PIXNIO	C = 7 T = 800	Acc = 96%	Classifying images with multiple bacterial species is very less efficient	RNN model may be combined with CNN model to create more powerful network architecture
Abd-Alhalem et al. [62]	CNN with random projection	Phyla bacteria (16S)	CNN features	Ribosomal Database Project	C = 3 T = 2000	Acc = 97%	Detail of dataset is incomplete	CNN model may be proposed for classification of other bacteria
Bonah et al. [63]	SVM, LDA, genetic algorithm	Food-borne pathogens	Pixel based features	Hyperspectral Images	C = 8 T = 320 Tr = 192 Te = 128	Acc = 99.47%	Very complex dataset and cost of Hyperspectral imaging is high	An enhanced model may be proposed to classify other bacteria
Kang et al. [64]	Deep learning with LSTM, ResNet, 1D-CNN	Food-borne pathogens	Shape features	Hyperspectral Images	C = 5 T = 5000 Tr = 3600 Val = 900 Te = 500	Acc(LSTM) = 92.2% Acc (ResNet) = 93.8% Acc(1D-CNN) = 96.2%	Very complex dataset and cost of Hyperspectral imaging is high	Method may be proposed to practically implement testing of food matrix at food industry and clinical laboratory
Kang et al. [65]	Deep learning with KNN, SVM, 1D-CNN	Food-borne pathogens	Shape features	Hyperspectral Images	C = 5 T = 1500 Tr = 1000 Te = 500	Acc = 90%(1D-CNN) Acc = 81%(KNN) Acc = 81%(SVM)	Very complex dataset and cost of Hyperspectral imaging is high	Method may be proposed to classify other types of bacteria
Mhathesh et al. [66]	CNN	Larval zebra fish intestine bacteria	Texture features	Zebrafish Dataset	C = 1	Acc = 95%	Detail of dataset is incomplete	A CNN based classifier may be proposed to classify different biological images with more dimensions
Sajedi et al. (2020) [67]	XGBoost	Myxobacteria	Texture features	Myxobacterial dataset	–	Acc = 90.28%	Details of dataset are incomplete	More techniques can be applied for better result

Abbreviations used in the Tables 3 and 4—Accuracy (Acc), Classes (C), Total Images (T), Training (Tr), Testing (Te), Sensitivity (Se), Specificity (Sp), Precision (Pre), Recall (Re), Validation (Val)

Training a deep learning model to high accuracy requires comparatively large datasets; small datasets can lead to overfitting. DIBaS and the private datasets on which deep learning has been applied are small, and it is not clear from the literature whether the proposed deep learning models are overfitted. To address the small-dataset issue, researchers have employed methods such as data augmentation and transfer learning. These methods are not sufficient on their own, however, and bring problems of their own: in transfer learning the convolutional filters are not specific to the task, and data augmentation cannot account for unknown patterns and data. Creating large datasets is very difficult because most of the technologies are patented and because microbiologists are not thoroughly involved. The best way to improve the efficiency of deep learning models is to work in collaboration with microbiologists to design and annotate labeled benchmark datasets.

Discussion

In this paper, the authors present a literature review of research on bacteria classification using ML, based on articles published in journals and conference proceedings from sources such as Elsevier, Springer, IEEE Xplore, ACM Digital Library and PLOS. The review covers the years 1998 to 2020. It indicates that ML techniques can be used effectively to classify various types of bacteria. Depending on the data and imaging modalities, researchers used different feature extraction and selection techniques to make ML more effective and efficient for bacterial classification. This section discusses the stages of bacterial image classification: data acquisition, data pre-processing, segmentation, feature extraction and classification, as shown in Fig. 1.

Pre-processing Techniques and segmentation

Pre-processing is an important step for increasing the performance of ML methods. In bacterial image classification, researchers have used a variety of microscopic images, including Gram-stained images, ZN-stained sputum smear images and hyperspectral images. To extract reliable features, different image pre-processing techniques, i.e. background correction, image resizing, image enhancement and color-space transformation, have been employed. In addition, to reduce errors resulting from noise and artifacts, researchers have used filtering algorithms such as the adaptive median filter and the Gaussian filter. Another important task in pre-processing is segmenting images into meaningful regions for automatic region-of-interest extraction. For bacterial image segmentation, researchers have used a number of algorithms, broadly categorized as: (i) edge-detection-based techniques, (ii) threshold methods, (iii) region-based techniques, (iv) clustering-based methods, (v) NN-based approaches and (vi) Bayesian segmentation.
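As an illustration of two of the steps named above, noise filtering and threshold-based segmentation, the following is a minimal sketch in plain Python. It is not taken from any of the reviewed papers. It assumes an 8-bit greyscale image stored as a list of rows, uses a plain (non-adaptive) 3x3 median filter in place of the adaptive variant, uses Otsu's method as one example of a threshold method, and assumes bright objects on a dark background. The function names `median_filter3`, `otsu_threshold` and `segment` are hypothetical.

```python
from statistics import median_low


def median_filter3(img):
    """Plain 3x3 median filter; border pixels use only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median_low(window)  # median_low keeps integer grey levels
    return out


def otsu_threshold(img, levels=256):
    """Return the grey level that maximises the between-class variance (Otsu)."""
    hist = [0] * levels
    for row in img:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0  # pixel count and intensity sum of the class "value <= t"
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def segment(img, threshold):
    """Binary mask: True where a pixel is brighter than the threshold."""
    return [[value > threshold for value in row] for row in img]


# Toy image (an assumption for illustration): dark background, one bright
# 3x3 object and a single bright noise pixel in the top-right corner.
image = [[10] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 200
image[0][6] = 250

denoised = median_filter3(image)
mask = segment(denoised, otsu_threshold(denoised))
```

In this toy case the median filter removes the isolated noise pixel before thresholding, at the cost of eroding the corners of the object, which is one reason adaptive filters are often preferred for small targets.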
Feature Extraction and selection

Feature extraction and selection is another step, often considered the most important in an image classification task. The features most commonly used for bacterial classification in the literature are shape, size, texture and color. To extract these features, researchers have employed methods such as histogram of oriented gradients, scale-invariant feature transform, Haralick features, local binary patterns and color histograms. Some researchers have also explored CNNs for automatic feature extraction. These techniques sometimes produce long feature vectors, which increases the computational cost. To shorten the vectors and reduce redundancy, researchers in this field have also applied feature selection methods, namely genetic algorithms, principal component analysis and linear discriminant analysis.

Classification Techniques

Researchers have used various ML classifiers in bacterial image classification. The literature shows that the most widely used ML techniques are KNN, SVM, ANN, RF and DL, where DL includes deep neural networks, deep belief networks and CNNs. Fig. 4 shows, for each technique, the number of papers in the reviewed literature that used it for bacterial image classification. Variations of DL and SVM are clearly the techniques most commonly used by researchers in this domain.

Performance metrics

Performance metrics are essential for analyzing the application of ML techniques in any domain. Researchers have used metrics such as accuracy, precision, recall, sensitivity, specificity and F-score to measure classification performance.
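For reference, the metrics listed above can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the example counts are hypothetical and are not taken from any reviewed study.

```python
def _ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def f_score(precision, recall):
    """F-score (F1): harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # recall and sensitivity are the same quantity
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "f_score": f_score(precision, recall),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=95, fn=5)
```

As a consistency check, applying `f_score` to the precision and recall listed for Panicker et al. [27] (78.4% and 97.13%) gives roughly 86.8%, in line with the reported F-score of 86.76%.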
Table 5 shows the highest performance achieved by each ML technique studied in this review. The table also suggests that ML techniques have great potential in bacterial image classification.

ML has already proven its worth in areas such as prediction systems, image recognition, speech recognition, medical diagnosis and the financial industry. Applying ML approaches to microscopic bacterial classification, however, raises a significant number of challenges that remain to be solved. These can be summed up briefly: the datasets are small, image quality is low and the target objects are small. Because only small public datasets are available, researchers cannot study the large variety of bacterial species, which leaves a wide research gap. Small datasets also lead to the further problem of model overfitting. Several strategies, such as data augmentation, dropout and regularization, have proven successful in preventing the overfitting that arises from small datasets. Image rotation, vertical and horizontal flips, zooming in and out within a certain range, and random horizontal and vertical shifts are all effective data augmentation methods for microscopic bacterial images. Image noise, poor spatial resolution, small object size and low image contrast are among the other issues that plague microscopic images. To address these difficulties, image pre-processing techniques such as adaptive median filtering and Gaussian filtering are employed. Researchers also have opportunities to explore further image pre-processing techniques, such as the Wiener filter, unsharp mask filtering, deep neural networks such as autoencoders [70], deep residual dense networks [21] and CNNs [62], and linear contrast adjustment. More feature descriptors combining colour and texture features can also be applied to extract more significant features from bacterial images [16].
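The augmentation operations mentioned above (flips, rotations and small shifts) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline of any reviewed paper: images are lists of rows, rotations are restricted to multiples of 90 degrees so that no interpolation is needed, shifted-in pixels are filled with a constant, zooming is omitted, and all function names are hypothetical.

```python
import random


def hflip(img):
    """Horizontal flip: mirror each row."""
    return [row[::-1] for row in img]


def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def rotate90(img):
    """Rotate by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def shift(img, dx, dy, fill=0):
    """Translate by dx columns and dy rows; uncovered pixels get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def augment(img, rng, max_shift=1):
    """Apply a random combination of flips, 90-degree rotations and a shift."""
    if rng.random() < 0.5:
        img = hflip(img)
    if rng.random() < 0.5:
        img = vflip(img)
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    return shift(img,
                 rng.randint(-max_shift, max_shift),
                 rng.randint(-max_shift, max_shift))


# Generate a few augmented copies of a toy 3x3 image with a fixed seed.
rng = random.Random(0)
toy = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
augmented = [augment(toy, rng) for _ in range(4)]
```

Each augmented copy keeps the class label of its source image, so this only multiplies views of bacteria already in the dataset; it does not add new species or imaging conditions.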

Fig. 4

Number of research papers employing each technique

Table 5

Performance evaluation of ML techniques with the highest accuracy

Machine learning techniques	Performance metrics (%)	Types of feature set
Artificial Neural Network	Accuracy = 98	Geometric features
K-Nearest Neighbor	Accuracy = 100	Geometric and statistical features
Support Vector Machine	Accuracy = 99.5	Shape features
Random forest	Accuracy = 93.8	Shape features
Deep learning	Accuracy = 99.7	Deep features
K-means	Accuracy = 93.54	Shape features

Opportunities and Challenges

Researchers have used a wide range of ML algorithms in this domain for feature extraction and classification. However, both the quality and the quantity of data are insufficient for creating effective ML models. Some researchers [45, 62, 64, 65] combine one ML technique with others, such as SVM with a radial basis function, CNN with random projection, and DL with LSTM, ResNet and 1D-CNN, to produce hybrid models for bacterial image classification. However, the desired efficiency has yet to be attained because of poor data quality. The fundamental difficulty that ML-based research still faces is therefore the scarcity of high-quality datasets, which arises from the lack of standard datasets in online repositories. Most of the data used for experimentation was private data created by the authors themselves and lacks benchmarks. For this reason, only a few bacterial species have been classified using ML approaches to date, which restricts researchers from realizing the full potential of ML for solving complex problems in this field.

Conclusion & Future Scope

This paper gives an overview of the different ML techniques used in bacterial classification. ML techniques have been applied extensively in this field, and microbiology has benefited from them. They help microbiologists in many respects, i.e. in the identification and classification of bacteria and in the overall automation of these processes. The aim of the review is to explore the ML-based methodologies used by various researchers in bacterial image classification. The review included articles published in journals and conference proceedings from 1998 to 2020. Different imaging and data modalities have been explored in the domain of bacterial classification. The data collected in this paper highlights the efforts researchers have put into their studies. However, their methods and performances are difficult to compare because each study used different datasets and imaging modalities. The variation in performance may also be attributed to the different approaches and methods used. ML techniques have proved to be an effective approach, giving high accuracy and precision values in the classification of bacteria. DL methods have also been used extensively, including state-of-the-art algorithms such as deep belief networks, CNN and LSTM. Researchers studied very few species of bacteria because of the limited availability of datasets. Some researchers downloaded imaging datasets from Google, some created their own, and others collected them from laboratories or hospitals. ML techniques have given good results in many fields, but because of the lack of datasets their performance in bacterial classification does not yet meet requirements.

Some researchers combined DL techniques with other ML techniques, i.e. LSTM along with other image processing methods, to create a hybrid ML model suited to their requirements. The review shows that most researchers used 2D images, while a few also used 3D imaging data. ML has been used successfully in the classification of bacteria in the past few years. In future, researchers can combine DL techniques with other ML approaches to obtain better results. They can also explore other DL approaches, such as recursive neural networks and autoencoders, in the field of microbiology. Datasets can be enhanced by including different species of bacteria, and the data can then be classified with different ML techniques or combinations of such techniques. ML techniques have great future potential in the field of microbiology. Researchers continue to explore other domains, such as informatics, medicine, entomology and biology, where ML techniques can be used.
