Systematic review of artificial intelligence techniques in the detection and classification of COVID-19 medical images in terms of evaluation and benchmarking: Taxonomy analysis, challenges, future solutions and methodological aspects.

O S Albahri¹, A A Zaidan², A S Albahri³, B B Zaidan¹, Karrar Hameed Abdulkareem⁴, Z T Al-Qaysi⁵, A H Alamoodi¹, A M Aleesa⁶, M A Chyad¹, R M Alesa⁶, L C Kem¹, Muhammad Modi Lakulu¹, A B Ibrahim¹, Nazre Abdul Rashid¹.

Abstract

This study presents a systematic review of artificial intelligence (AI) techniques used in the detection and classification of coronavirus disease 2019 (COVID-19) medical images in terms of evaluation and benchmarking. Five reliable databases, namely, IEEE Xplore, Web of Science, PubMed, ScienceDirect and Scopus, were used to obtain relevant studies on the topic. Several filtering and scanning stages were performed according to the inclusion/exclusion criteria to screen the 36 studies obtained; only 11 studies met the criteria. A taxonomy was constructed, and the 11 studies were classified into two categories, namely, review and research studies. Then, a deep analysis and critical review were performed to highlight the challenges and critical gaps outlined in the academic literature on the subject. Results showed that no relevant study evaluated and benchmarked AI techniques utilised in classification tasks (i.e. binary, multi-class, multi-labelled and hierarchical classifications) of COVID-19 medical images. If evaluation and benchmarking are conducted, three future challenges will be encountered, namely, multiple evaluation criteria within each classification task, trade-off amongst criteria and importance of these criteria. Given these challenges, the process of evaluating and benchmarking AI techniques used in the classification of COVID-19 medical images is considered a multi-complex attribute problem. Thus, adopting multi-criteria decision analysis (MCDA) is an essential and effective approach to tackle the problem complexity. Moreover, this study proposes a detailed methodology for the evaluation and benchmarking of AI techniques used in all classification tasks of COVID-19 medical images as a future direction; this methodology is presented on the basis of three sequential phases.
Firstly, the identification procedure for the construction of four decision matrices, namely, binary, multi-class, multi-labelled and hierarchical, is presented on the basis of the intersection of evaluation criteria of each classification task and AI classification techniques. Secondly, the development of the MCDA approach for benchmarking AI classification techniques is provided on the basis of the integrated analytic hierarchy process and VlseKriterijumska Optimizacija I Kompromisno Resenje methods. Lastly, objective and subjective validation procedures are described to validate the proposed benchmarking solutions.

Keywords: Artificial intelligence; Benchmarking; COVID-19; Decision-making; Evaluation; MCDA; Medical image

Year: 2020 PMID: 32646771 PMCID: PMC7328559 DOI: 10.1016/j.jiph.2020.06.028

Source DB: PubMed Journal: J Infect Public Health ISSN: 1876-0341 Impact factor: 3.718

Introduction

In the last days of 2019, a group of patients infected with a novel coronavirus disease (coronavirus disease 2019, COVID-19) was recognised in Wuhan, China. Since then, the contagions of COVID-19 have spread around the world. COVID-19 affects people in different ways. Most infected patients develop common symptoms (i.e. fever, fatigue and dry cough) [1], and others may experience additional symptoms (i.e. aches and pains, nasal congestion, runny nose, sore throat and diarrhoea) [2]. COVID-19 exposed weaknesses in the healthcare system of many countries, and the inability of healthcare systems to manage patients has caused anxiety. One of the important reasons behind the rapid spread of COVID-19 is the lack of specificity in clinical detection methods [3]. Molecular approaches such as quantitative real-time reverse transcription–polymerase chain reaction (rRT-PCR) [4] and other methods such as serologic tests [5] and viral throat swab testing [6] are necessary and widely utilised for the detection of COVID-19. However, studies have shown that chest radiographs (X-rays) [7] and chest computed tomography (CT) scans [8] can assist and reveal anomalies indicative of different lung diseases, including COVID-19. CT scan and X-ray tests could be utilised as a primary detection tool to evaluate the severity of COVID-19, monitor the emergency case of infected patients and predict COVID-19 progression [9]. However, time is often limited in such emergencies and does not allow these experiments to be performed using existing traditional manual diagnosis [10]. These procedures require a specialist doctor and are susceptible to human error during testing or reading and interpreting findings, which are not acceptable in crucial cases. Given the recent spread of COVID-19, hospitals are filled with numerous patients who are either improving from the viral infection or becoming worse (dying) [11]. 
In this case, CT scan and X-ray tests should be performed with maximum speed and efficiency to save as many lives as possible [9]. Intelligent technologies can effectively help in the diagnosis and classification processes [7]. The use of artificial intelligence (AI) has increased in different fields, especially in medical detection [12]. AI has been widely used to gain more accurate detection results and decrease the burden on the healthcare system [13]. It can decrease the decision time associated with the detection process of traditional methods [14]. The development of AI techniques to recognise the risks of epidemic diseases is considered a key factor in the improvement of the prediction, prevention and detection of future global health risks [15]. Numerous types of AI classifiers have been reported by a few researchers with real COVID-19 datasets with different case studies and targets [9]. Although AI techniques can be beneficial in the diagnosis and classification of COVID-19, selecting the appropriate AI technique that can produce accurate results is challenging [16,17]. The large diversity amongst available AI techniques makes it difficult to decide which of them to use in the development of COVID-19 diagnosis and classification, particularly when no single AI technique is far better than the others. In addition, the majority of these techniques suffer from low accuracy and computational efficiency [18]. Furthermore, evaluation and comparison are difficult because multiple evaluation criteria are involved and the conflict between them increases the challenge [19]. The evaluation and benchmarking procedures of AI techniques are critical in acquiring a technique that can produce the best results [17,20].
Such a process is essential because it affects both persons suspected of having COVID-19 and medical organisations; failures in this process could result in loss of life and in the virus spreading to others. To evaluate and benchmark AI classification techniques that can be used in the detection of COVID-19 medical images, several requirements must be met to guarantee the reliability of these techniques, given that they are associated with patients’ lives. However, two main questions arise in this process. Firstly, what are the appropriate criteria that could be used in the evaluation? Secondly, what is the correct benchmarking procedure that could be used to select a suitable AI technique amongst others? Therefore, the present study aims to (i) shed light on and systematically review the research efforts on emerging and new technologies for COVID-19 medical image detection based on the AI approach; (ii) map related studies into a coherent taxonomy and highlight the AI techniques, datasets, case studies and AI classification types used; (iii) highlight and analyse different aspects such as research gaps and future challenges with respect to evaluation and benchmarking; and (iv) propose a potential pathway solution with a detailed methodology to tackle the identified research gaps and future challenges of evaluation and benchmarking of AI classification techniques used in COVID-19 medical image detection. The remaining sections of this study are presented as follows. Section “Methods” presents the methods used to review the literature on the topic systematically. Section “Results and discussion” presents the taxonomy analysis and highlight points of the final set of included studies. Section “Critical review and analysis” presents a critical review and analysis of the identified studies.
Section “Future challenges of the evaluation and benchmarking of AI classification techniques used in the detection of COVID-19 medical images” presents the future challenges related to the evaluation and benchmarking of AI classification techniques used in COVID-19 medical image detection. Section “Research proposal for potential future direction” presents a proposal of potential future solutions for the identified research gaps. Section “Methodology” provides the methodology of the proposed solutions. Finally, Section “Conclusion” presents the conclusion.

Methods

This study is based on a systematic literature review (SLR), which has been recognised for its role in acquiring a sufficient understanding of a topic of interest [21,22]. SLR has also been highlighted for its remarkable structured analytical methods for research synthesis and its ability to accommodate different types of studies from various scientific disciplines [23,24]. During the process, different academic digital databases were utilised to extract relevant literature, including (1) ScienceDirect, which offers scientific literature across all domains; (2) Scopus, which offers sufficient coverage of literature from all disciplines; (3) IEEE Xplore, which is recognised for its scientific reliability in covering multi-disciplinary sciences, engineering and computer science literature; (4) Web of Science, which demonstrates high coverage of different literature topics and studies across all domains; and (5) PubMed, which also covers a variety of disciplines with a multi-disciplinary focus on medicine- and technology-related literature [25,26,27]. These databases are deemed sufficient to cover the latest and most reliable literature for COVID-19 diagnosis and detection. The studies extracted from these databases are relevant and reliable for understanding the role of intelligent systems (i.e. AI) and their involvement in scientists’ efforts in relation to COVID-19. The literature search was comprehensively conducted on the five major databases over a span of 10 years, between 2010 and May 5, 2020. The databases were selected for their scientific reliability, soundness and coverage of literature from various domains with regard to deep learning efforts and COVID-19. Boolean operators were utilised during the process to gather as much relevant literature as possible.
The first group of keywords covered intelligent systems and their relevant keywords, the second group covered COVID-19 relevant keywords, whilst the third group covered medical images with different relevant keywords, to ensure that all literature associated with the three groups was included. In this SLR, different criteria were enforced for the selection of related literature. Articles were selected if they were written in English and published between 2010 and May 5, 2020. In terms of publication type, articles were selected if they were journal, conference or review papers [28,29,30]. As far as the topic of interest is concerned, this SLR only selected publications that discuss any form of AI and COVID-19 using medical images. The exact query is presented in Fig. 1.
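The way three keyword groups are combined with Boolean operators can be sketched as follows. This is only an illustration under stated assumptions: the terms below are placeholders, not the search terms of the study (those are given in Fig. 1), and `build_query` is a hypothetical helper rather than the search interface of any database.

```python
def build_query(*keyword_groups):
    """Join the terms inside each group with OR, then join the groups with AND."""
    clauses = []
    for group in keyword_groups:
        quoted = [f'"{term}"' for term in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Placeholder terms only (assumption); the actual terms are those in Fig. 1.
ai_terms = ["artificial intelligence", "deep learning"]
covid_terms = ["COVID-19", "coronavirus"]
image_terms = ["X-ray", "CT scan"]

query = build_query(ai_terms, covid_terms, image_terms)
print(query)
```

Using OR within a group widens the match to any synonym, whereas AND across groups restricts the results to publications that touch all three topics.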

Fig. 1

Method of SLR of the study topic.

The search was conducted in the middle of May 2020 using the advanced search boxes of the five digital databases. The initial search yielded 36 publications after the removal of six duplicated records. Next, the titles and abstracts of the publications were scanned. Twenty articles were excluded as they failed to meet the criteria. The remaining articles underwent another round of screening through full-text reading to investigate their relevance and determine whether they were suitable for inclusion in the final set. After this process, five articles were excluded, and only 11 articles met all criteria and were deemed suitable for inclusion in this review. Furthermore, the key demographic statistical findings from the articles are presented on the basis of two aspects, namely, database used and countries (Fig. 2).
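The screening counts reported above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the pre-deduplication total is implied by those figures rather than stated explicitly, and the variable names are illustrative.

```python
# Counts taken from the screening description in the text.
after_deduplication = 36
duplicates_removed = 6
excluded_title_abstract = 20
excluded_full_text = 5

# Implied total before duplicates were removed (not stated directly).
initial_records = after_deduplication + duplicates_removed

# Records remaining after each screening stage.
after_title_abstract = after_deduplication - excluded_title_abstract
final_set = after_title_abstract - excluded_full_text

print(initial_records, after_title_abstract, final_set)
```

The stages are consistent with one another: 16 articles reach full-text reading and 11 remain in the final set.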

Fig. 2

Statistics of the included studies by databases and countries.

The 11 articles identified from the literature are spread across the databases and countries. In terms of databases, four studies were obtained from ScienceDirect, three from IEEE, two from PubMed and two from Scopus. The only database with no identified articles was Web of Science. As for the countries of the corresponding authors, three studies came from China, three from Turkey, one from South Africa, one from Italy, one from the UK, one from Korea and one from Egypt.

Results and discussion

This section elaborates on the final set of 11 articles collected in this systematic review regarding AI techniques used in the detection and classification of COVID-19 medical images. This final set was divided into two clusters, namely, review cluster and research cluster. The taxonomy and classification of the related articles are shown in Fig. 3.

Fig. 3

Taxonomy of research literature on AI techniques used in the detection and classification of COVID-19 medical images.

Review

The primary aim of reviewing articles on AI techniques used in the detection and classification of COVID-19 medical images is to understand the current thinking in this field and justify the need for future research on related topics that have been overlooked or understudied. This cluster contained only one article. In Ref. [9], the study reviewed the rapid responses in the community of medical imaging (empowered by AI) towards COVID-19. The authors emphasised that AI-empowered image acquisition can significantly help automate the scanning procedure and reshape the workflow with minimal contact to patients, providing the best protection to the imaging technicians. They focused on the entire pipeline of medical imaging and analysis techniques involved with COVID-19, including image acquisition, segmentation, diagnosis and follow-up, using the integration of AI with X-ray and CT images.

Research studies

The second cluster focused on research studies and contained 10 articles, which consist of four sub-clusters: binary, multi-class, integrated multi-class and binary, and integrated hierarchical and multi-class.

Binary classification

Flat classification refers to binary classification problems with only two classes. One article fell into this sub-cluster. Ref. [31] demonstrated the ability of a deep learning method to diagnose COVID-19 on the basis of medical images acquired by CT. Regarding the class labels used to identify the existence of the infection, this study focused on false-negative (FN) results, which jeopardise the prevention and control of the epidemic and affect decisions on health monitoring or discharge. The dataset comprised information from 10 patients. Of the 10 negative cases, two were positively identified for COVID-19 using the rRT-PCR test. This result indicated an FN rate of almost 20% for rRT-PCR.

Multi-class classification

Numerous issues and challenges are linked to multi-class classification; however, each sample still has exactly one output. This sub-cluster includes four publications. The first work [32] developed a scoring tool for COVID-19 severity. The tool proved important in helping healthcare workers determine which patients with suspected or confirmed COVID-19 are in high need of respiratory interventions. The tool assigns patients to severity categories in line with the classifications of the WHO, namely, severe and moderate/mild. In other terms, patients at the critical stage need ventilation and patients at the severe stage need oxygen, whereas patients at the moderate stage do not need oxygen despite having pneumonia. Patients at the mild stage have only upper respiratory tract disease. The dataset was gathered from 13,500 COVID-19 patients. According to an early assessment, the tool correctly classified 93.6% of patients, underestimated the severity of 0.8% and overestimated that of 5.7%. Another work [33] introduced COVIDiagnosis-Net, an AI detection approach for COVID-19 based on deep SqueezeNet with Bayes optimisation. The deep learning technique exhibited 98.3% accuracy in detecting normal, pneumonia and COVID-19 cases. The technique also achieved 100% accuracy for the single detection of COVID-19 amongst other classes. Ref. [34] proposed a patch-based technique with a convolutional neural network. The technique uses a small number of training parameters for the diagnosis of COVID-19. The work was inspired by a statistical analysis of potential imaging biomarkers of chest X-rays.
Their results indicated that pre-processing for data normalisation helped in cross-database processing and significantly improved the accuracy of segmentation (Jaccard similarity coefficients from 0.932 to 0.943, p < 0.001). According to the results, pre-processing was a significant factor in ensuring segmentation performance across databases. In Ref. [35], COVID-19 was identified using the deep learning models MobileNetV2 and SqueezeNet, and the feature sets obtained by these models were processed using the social mimic optimisation method. A fuzzy colour technique was used to restructure the data classes as a pre-processing measure, and the structured images were stacked with the original images. Thereafter, efficient features were combined and classified using support vector machines (SVMs), with an overall classification rate of 99.27%.

Integrated multi-class and binary classifications

This sub-cluster contains three articles that focused on integrated multi-class and binary classification problems. Ref. [36] emphasised the deployment of AI to support the work of radiologists and indicated that the application of AI in COVID-19 infection will allow the course of the disease to be monitored. Ref. [37] indicated the importance of AI in containing the spread of COVID-19. Ref. [7] presented a model for detecting COVID-19 using 125 X-ray images, targeting accurate diagnosis in binary classification (COVID vs. no findings) and in multi-class classification (COVID vs. no findings vs. pneumonia). The accuracy of the model was 98.08% for binary classes and 87.02% for multi-class cases.

Integrated hierarchical and multi-class classifications

Another classification problem type is hierarchical classification, where the learning output is identified over a special class taxonomy. Ref. [38] defined hierarchical classification as follows: ‘the input is to be classified into one, and only one, each class which are be divided into subclasses or grouped into superclasses. The hierarchy is defined and cannot be changed during classification. Hierarchical classification can be transformed into flat classification.’ This sub-cluster contains two articles that focused on integrated hierarchical and multi-class classification problems. Ref. [39] identified COVID-19 pneumonia from various healthy lung types and developed a classification approach that takes into consideration multi-class and hierarchical perspectives. In addition, resampling algorithms were used to re-balance the distribution of the classes. The approach achieved a macro-average F1 score of 0.65 with the multi-class method and an F1 score of 0.89 for the identification of COVID-19 in the hierarchical classification scenario. Ref. [40] developed a hybrid model for detecting COVID-19 using an improved marine predators algorithm (IMPA) and a ranking-based diversity reduction strategy to acquire particle numbers that are not capable of finding a suitable solution within a consecutive number of iterations. Nine chest X-ray images were utilised to validate the performance of IMPA. The threshold levels were between 10 and 100, and IMPA was compared with five algorithms: (1) whale optimisation algorithm, (2) salp swarm algorithm, (3) sine cosine algorithm, (4) equilibrium optimiser and (5) Harris hawks algorithm. Results showed that the proposed hybrid model outperformed all other algorithms for a range of metrics. Furthermore, on all threshold levels, the performance was convergent in the Structural Similarity Index and Universal Quality Index metrics.
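Because a macro-average F1 score is reported above, a minimal sketch of how such a score is commonly computed may help. The labels and predictions below are invented toy data, not results from Ref. [39], and the function names are illustrative.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of one class treated as 'positive' against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


# Toy, imbalanced example (assumption): 4 normal, 2 pneumonia, 2 COVID-19.
y_true = ["normal"] * 4 + ["pneumonia", "pneumonia", "covid", "covid"]
y_pred = ["normal"] * 4 + ["pneumonia", "covid", "covid", "pneumonia"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case the accuracy is 0.75, whereas the macro-average F1 is about 0.67, because every class counts equally in the average regardless of how many samples it has. This is one reason why a macro-averaged score can look low on imbalanced data.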

Highlight points

The academic literature in the research cluster was further discussed from different points of view on the basis of three perspectives, namely, dataset, AI technique and case study used. The dataset that has been used or proposed to be used in the development and evaluation of the COVID-19 diagnosis system was categorised as primary or secondary. The primary dataset is the dataset that was collected during the research and approved by the ethical approval committee. Conversely, the secondary dataset was acquired online (public dataset) and published by researchers to help other researchers test their AI methods and techniques. Moreover, regarding the AI techniques, this study identified the AI algorithms into traditional machine learning algorithms, such as SVM and decision tree, or deep learning algorithms such as convolutional and deep neural networks. Table 1 presents a summary of the studies described in this cluster, focusing on their most important characteristics for COVID-19 diagnoses such as the type of datasets used and summarising AI techniques utilised to solve the case study problems for detecting the COVID-19 medical images.

Table 1

Summary of the perspectives of works described in research cluster studies.

Ref.	Primary data	Secondary data	Traditional machine learning techniques	Deep learning techniques	Case study
[31]	×	×	×	√	CT scan
[32]	√	×	√	×	CT scan
[36]	√	√	√	√	CT scan
[37]	√	√	√	√	X-ray
[7]	×	√	×	√	X-ray
[33]	×	√	×	√	X-ray
[34]	×	√	×	√	X-ray
[40]	×	√	√	×	X-ray
[35]	×	√	√	√	X-ray
[39]	√	√	√	√	X-ray

AI approaches for the detection of COVID-19 are considered one of the latest and most trending topics due to the growing pandemic. It is difficult to represent the true state of the art for this purpose considering that new works are emerging every day. Nevertheless, we concluded that the majority of the literature aimed to investigate hybrid AI techniques that combine deep learning and traditional machine learning, supported by different types of datasets. In addition, the standard image diagnosis tests for pneumonia are chest X-ray and CT scan. X-ray is used more often in these studies because it is cheaper, faster and more widespread than CT. The primary aim of the studies is to distinguish pneumonia caused by COVID-19 from other types using either X-ray or CT scan. Given that pneumonia can be structured as a hierarchy, a classification scheme that considers the multi-class and hierarchical perspectives requires attention and leads to the best COVID-19 recognition rate. The reason is that there is a hierarchy between the pathogens that cause pneumonia. However, only one study [39] considered the hierarchical classification approach in the literature.

Critical review and analysis

In the previous literature, classification tasks for COVID-19 differed in aspects related to the accuracy of results, apart from the differences in overall performance. Previous literature focused solely on accuracy enhancement, time reduction or overall performance improvements for the classification. Furthermore, differences exist in previous literature with respect to classification techniques, phases and classification procedures. On the one hand, the COVID-19 classification techniques developed in the analysed studies cover three COVID-19 classification tasks (i.e. binary classification, multi-class classification and hierarchical classification). On the other hand, Ref. [38] indicated that the distribution of relevant labels changes from one classification problem to another, which explains why four classification types can be performed with AI techniques, namely, binary, multi-class, multi-labelled and hierarchical classifications. Multi-labelled classification is described in Ref. [38] as follows: ‘the input is to be classified into several of non-overlapping classes. When the learning task is document topic classification, multi-labelling is often referred as multi-topic classification. In the multi-labelled classification problem, categories are isolated and their relations are not considered important.’ However, no study has provided multi-labelled classification for the detection of COVID-19 medical images. This is considered the first research gap identified in the literature reviewed. Furthermore, the growing number of classification techniques developed for COVID-19 is considered a major problem for health organisations and other treatment centres.
The reason is that medical organisations aiming to adopt classification techniques for the detection of COVID-19 will face the challenge of selecting the best and most appropriate classification technique, one that provides accurate and rapid detection of COVID-19 medical images. Apart from the disparity in COVID-19 classification techniques in terms of their overall performance, all results confirm the difficulty of deciding which technique is better than the others. In the analysed studies, no proposed solution has been shown to be superior to the rest. Moreover, although multi-labelled classification AI techniques for the detection of COVID-19 have not been developed, they might be developed in the near future. In that case, another important question will arise: ‘which classification technique is appropriate for such purpose?’ According to the final set of included articles that met the search query, no study has provided a comprehensive evaluation and benchmarking solution for AI classification techniques (i.e. binary, multi-class, multi-labelled and hierarchical classifications) used in the detection of COVID-19 medical images. This is considered the second research gap identified in the literature reviewed. Ref. [17] recommended that an evaluation and benchmarking solution for multi-labelled and/or hierarchical classification techniques could be beneficial and essential to determine which AI technique is appropriate amongst others. To explain the detailed solution for the identified gaps, two questions should be discussed: what are the evaluation criteria used in each classification type (i.e. binary, multi-class, multi-labelled and hierarchical classifications), and what are the calculation processes of these criteria? Each of these classification types has its own evaluation criteria.
The calculation procedure for each evaluation criterion differs from one classification type to another [17,38]. Thus, the evaluation and benchmarking procedure will differ for each classification type (the evaluation criteria and calculation procedures are specified in detail in the methodology section). This study attempts to fill the gap in the evaluation and benchmarking of different classification types that will be used in the detection of COVID-19. The proposed solution shall assist the administrations of health organisations in evaluating and benchmarking COVID-19 AI classification techniques. It can also ensure that the selected classification techniques meet all necessary requirements. To provide such a solution, three specific challenges need to be addressed in the process of evaluating and benchmarking classification techniques, which are described in the next section.

Future challenges of the evaluation and benchmarking of AI classification techniques used in the detection of COVID-19 medical images

This section discusses three future challenges that will be encountered in the process of evaluating and benchmarking AI classification techniques used in the detection of COVID-19 medical images, as described in the following subsections.

Challenge of multiple evaluation criteria

As stated in the previous section, four categories of classification tasks are identified. Each category differs in the type of criteria, and the calculation procedure differs for each evaluation criterion. Furthermore, the number of criteria differs across classification categories: six evaluation criteria for binary classification, eight for multi-class classification, four for multi-labelled classification and six for hierarchical classification [38]. In general, most evaluation processes for COVID-19 classification techniques need to consider more than one criterion. For example, the reliability of classification techniques can be measured on the basis of a confusion matrix that contains four parameters: true positive (TP), false positive (FP), true negative (TN) and FN. In other words, the numbers of correctly and incorrectly classified samples are obtained by comparing the actual class with the predicted class. Thus, the results will differ depending on whether only one parameter or the full set of parameters is considered in the evaluation process. However, no solutions have been suggested to handle these particular issues in the evaluation and benchmarking of COVID-19 AI classification techniques. Furthermore, the recommended solution must account for the fact that the evaluation of COVID-19 classification techniques is based on multiple evaluation criteria and for the differences amongst classification tasks in terms of type of criteria.
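To make the role of the four confusion-matrix parameters concrete, the sketch below derives several widely used binary criteria from TP, FP, TN and FN. The criteria shown are common choices and are an assumption here; they are not necessarily the exact six binary criteria listed in Ref. [38], and the counts are invented for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Derive common binary evaluation criteria from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Invented counts (assumption): 10 COVID-19 cases amongst 100 images.
metrics = binary_metrics(tp=8, fp=2, tn=88, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.96 whilst the recall is only 0.80, which illustrates why judging a technique on a single criterion can give a different picture from judging it on the full set.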

Challenge of criteria trade-off

A trade-off is a situation in which one quality or aspect of something decreases whilst another increases. Given the nature of the evaluation criteria used in AI techniques, researchers have applied different types of trade-off to different criteria, which in turn has confused decision-makers. In addition, within the scope of this study, the different degrees to which individual criteria are used affect, and thereby conflict with, the other criteria utilised by researchers. Thus, the conflict amongst evaluation criteria poses an important challenge to the creation of a COVID-19 classification approach. Fundamentally, these challenges stem from conflicts, especially between the criteria and the data. Thus, it is crucial to realise the advantages and disadvantages of a particular choice whilst making a decision. The term trade-off is frequently used in the context of evaluation, where the selection process acts as a decision-maker [44–46]. In the benchmarking and evaluation of AI classification techniques used for COVID-19, the trade-off, also known as the conflicting criteria problem, concerns reliability, the time complexity of the COVID-19 classification procedure and the error rate within the dataset. These criteria are the main requirements for evaluating COVID-19 classification techniques. Reliability should be high, whereas the time complexity of producing the output needs to be low. In addition, the apparent error rate obtained from training on the dataset has to be low at the same time. Conflicting data arise because the confusion matrix contains the parameters TP, FP, TN and FN, where TP and TN rise when FP and FN are minimised [47,48].
This phenomenon shows an apparent conflict amongst the probability criteria. These parameters have a considerable effect on the values of some of the remaining criteria because those criteria are computed from these four parameters. Therefore, the process of evaluation and benchmarking must take such requirements into consideration. As a result, a new and flexible evaluation approach that handles all conflicting criteria and data problems is needed. However, no solutions have been suggested to handle these particular issues.
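The conflict between criteria can be illustrated with a minimal sketch. It assumes a classifier that outputs a score per image and a decision threshold; the labels, scores and thresholds below are hypothetical values chosen for illustration only, and do not come from any study reviewed here.

```python
def confusion_counts(labels, scores, threshold):
    """Count tp, fp, tn, fn when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical data: 1 = COVID-19 positive image, 0 = negative image.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]


def sensitivity_specificity(threshold):
    """Return (sensitivity, specificity) for the hypothetical data above."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return tp / (tp + fn), tn / (tn + fp)


for t in (0.35, 0.55, 0.75):
    sens, spec = sensitivity_specificity(t)
    print(f"threshold={t:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold in this toy example lowers sensitivity whilst raising specificity, so no single threshold is best on both criteria at once.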

Challenge of criteria importance

Another challenge is associated with the importance of the criteria during the evaluation and benchmarking phases, irrespective of their conflict. In addition, this conflict between the criteria poses a significant challenge during the evaluation stage [49]. A suitable procedure needs to be developed for objectives of this kind, in which the significance of a certain evaluation criterion is increased whilst that of others is reduced [50]. Two key points must be considered. The first is to achieve a sufficient understanding of the behaviour of the COVID-19 classification technique when assigning significance to criteria in the design. The second is the evaluation approach, bearing in mind the issue of trade-off. However, a conflict might exist between the opinions of the evaluator and the objective of the developer, which affects the final evaluation of the needed approach [51]. From a technical point of view, the evaluation and benchmarking of COVID-19 classification techniques must consider multiple criteria simultaneously and then assign a suitable weight to each evaluation criterion. After the scores of all approaches are compared, the approaches with the most balanced rate should be assigned the highest priority level, whereas those with the least balanced rate should be assigned the lowest priority level. In addition, because COVID-19 classification techniques have to consider multiple criteria, including time and the error rate in the dataset, which could also be significantly important in COVID-19 classification, this task is difficult and challenging. Moreover, each decision-maker assigns a different weight to these criteria [52].
On the other hand, the experts in charge of scoring the COVID-19 classification techniques could assign more weight to some criteria than to those of less interest to them. By contrast, experts who aim to use a benchmarking method to address such problems may consider different criteria to be the most significant.
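The effect of criteria importance on the final ranking can be sketched as follows. This is only an illustration under stated assumptions: a plain weighted sum is used instead of any specific MCDA method, and the technique names, criteria values and the two weight profiles are hypothetical.

```python
# Hypothetical criteria values, both scaled to [0, 1] so that higher is better.
# "time_score" stands in for a time-complexity criterion (assumption).
techniques = {
    "Technique 1": {"accuracy": 0.95, "time_score": 0.40},
    "Technique 2": {"accuracy": 0.85, "time_score": 0.90},
}


def weighted_score(values, weights):
    """Weighted sum of the criteria values of one technique."""
    return sum(weights[c] * values[c] for c in weights)


def rank(weights):
    """Return technique names sorted by weighted score, best first."""
    return sorted(
        techniques,
        key=lambda name: weighted_score(techniques[name], weights),
        reverse=True,
    )


# Two hypothetical decision-makers with different views on importance.
evaluator = {"accuracy": 0.9, "time_score": 0.1}
developer = {"accuracy": 0.4, "time_score": 0.6}

print(rank(evaluator))  # the accuracy-focused profile
print(rank(developer))  # the time-focused profile
```

With these made-up numbers the two weight profiles produce opposite rankings, which is why the weighting step needs an explicit and consistent procedure.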

Research proposal for potential future direction

This section describes the potential future direction of the process of evaluating and benchmarking the COVID-19 classification techniques used in medical image detection. According to the future challenges discussed, such a process faces a multi-complex attribute problem, in which all the AI techniques are considered available alternatives from which a suitable technique is to be selected. Therefore, adopting explicit and structured techniques for decisions using multiple criteria could boost decision-making quality. Multi-criteria decision analysis (MCDA), which covers analysis, assessment and ranking, is considered a solution that helps decision-makers to organise and solve such problems [53,54].

Definition and significance of MCDA

MCDA is defined as ‘an extension of decision theory that covers any decision with multiple objectives. MCDA is a methodology for assessing alternatives on individual, often conflicting criteria, and combining them into one overall appraisal’ [55,107–109]. Decision-making techniques are widely recognised, and amongst them, MCDA is the most significant. It is also considered an important part of operations research that handles decision-making problems with respect to decision criteria [56]. The technique involves various processes, including structuring, planning and solving decision problems with the use of many criteria [57]. MCDA is increasingly being used because it can improve decision quality [58]. It does so by making the decision process more reasonable, efficient, clear and explicit compared with traditional processes [59,60]. The most significant goals of MCDA include helping data miners to choose the most suitable alternatives, ranking the alternatives in decreasing order of efficiency and classifying the applicable alternatives amongst groups of available alternatives [61,62]. On this basis, the most suitable alternative(s) can be identified from the ranking [63]. The fundamental terms of MCDA need to be defined, including the decision matrix (DM) and its associated criteria [64,65]. An evaluation matrix contains m alternatives and n criteria, which need to be identified [66,67]. The intersection of each alternative and criterion is denoted $z_{ij}$. Therefore, we have a matrix $(z_{ij})_{m \times n}$ expressed as follows:

\[
DM =
\begin{array}{c|cccc}
 & C_1 & C_2 & \cdots & C_n \\ \hline
A_1 & z_{11} & z_{12} & \cdots & z_{1n} \\
A_2 & z_{21} & z_{22} & \cdots & z_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A_m & z_{m1} & z_{m2} & \cdots & z_{mn}
\end{array}
\]

where $A_1, A_2, \ldots, A_m$ are the probable alternatives that decision-makers need to rank (i.e. COVID-19 classification AI techniques), $C_1, C_2, \ldots, C_n$ are the criteria against which the performance of each alternative is evaluated and $z_{ij}$ is the rating of alternative $A_i$ with respect to criterion $C_j$.
The decision-making process can be improved by involving decision-makers and stakeholders, which provides the process with support and structure [68,69]. Explicit and structured multi-criteria decision methods can help to improve decision-making quality [70,71]. These techniques can identify which of the criteria are relevant and provide information for evaluating the current alternatives [72]. In doing so, they improve the transparency, consistency and validity of decisions [73]. MCDA can contribute to fair, transparent and rational priority-setting processes [74]. MCDA has been widely used in many areas for different applications [75]. MCDA works by ranking the alternatives and finding a suitable solution amongst them in different domains [76,78–81], especially in the healthcare domain [82–84].

MCDA methods

Several MCDA methods can be found in the literature, including the analytic hierarchy process (AHP), weighted product method, hierarchical adaptive weighting, best–worst method, multiplicative exponential weighting, weighted sum model, simple additive weighting, analytic network process, VlseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR), technique for reorganisation of opinion order to interval levels and technique for order of preference by similarity to ideal solution (TOPSIS). Each technique uses different representations [85–88]. The diversity of MCDA techniques makes it challenging to select the most suitable method for a given scenario. Each technique has its own limitations and strengths [89,90]. Therefore, selecting the most appropriate MCDA technique is highly important. According to our analysis, none of the methods presented in the literature has been used for the evaluation and benchmarking of AI techniques for COVID-19 medical image classification. These methods do not adopt a requirement-driven approach, which makes them unsuitable for measurement and scoring in decision-making [75,88]. However, for cases that involve numerous alternatives and criteria, TOPSIS and VIKOR are applicable. VIKOR and TOPSIS are convenient to use when the given data are quantitative or objective. TOPSIS finds the solution with the shortest distance to the ideal solution and the largest distance from the negative-ideal solution. Nevertheless, it does not consider the relative significance of these distances [91]. VIKOR, on the other hand, is functionally related to discrete-alternative problems. TOPSIS and VIKOR are considered the most practical techniques for solving real-world problems. Their advantage is that they can rapidly identify the best alternative. Furthermore, they are suitable for cases with many alternatives and criteria [91].
Nevertheless, the major drawback of TOPSIS and VIKOR is the lack of provision for weight elicitation and for checking the consistency of judgments [91]. Thus, TOPSIS and VIKOR need an effective technique to acquire the relative importance of the various criteria with respect to the objective, and AHP is able to provide such a technique. However, AHP sets weights on the basis of the preferences of the stakeholder [92], and it is restricted mainly by the human capacity for information processing. Therefore, 7 ± 2 would be the comparison ceiling [93]. The latest trend in MCDA techniques is to integrate two or more techniques to compensate for the drawbacks of single techniques. AHP and VIKOR are commonly used MCDA approaches in various studies, especially in the medical domain [57]. To evaluate and benchmark AI classification techniques used in the detection of COVID-19 medical images, the present study recommends integrating AHP, which assigns weights to the evaluation criteria of each classification type subjectively by relying on the judgment of experts, with VIKOR, which provides a comprehensive ranking of COVID-19 AI classification techniques.

Methodology

This section describes and explains the evaluation and benchmarking methodology of AI classification techniques used in COVID-19 medical image detection. Fig. 4 illustrates all elements of our study in the overall proposed methodology.

Fig. 4

Proposed methodology for the evaluation and benchmarking of binary, multi-class, multi-labelled and hierarchical classification of COVID-19 AI classification techniques.

The proposed methodology comprises three phases for evaluating and benchmarking the COVID-19 AI classification techniques. The first phase is identification, which describes the datasets and required pre-processing and identifies the evaluation criteria used in the evaluation and benchmarking of COVID-19 AI classification techniques as well as the number and type of techniques. The output of this phase is four DMs, one for each classification type. In the second phase, the integration of MCDA methods is presented. The AHP method is used to weight the evaluation criteria subjectively, and the VIKOR method is used for benchmarking AI classification techniques. In the third phase, objective and subjective validations of the ranking of COVID-19 AI classification techniques are presented. Further details are provided in the following subsections.

Identification phase

In this phase, four main stages are conducted. First, the dataset and required pre-processing procedure (presented in Section “Dataset and pre-processing”) are identified. Second, the evaluation criteria within each type of classification (presented in Section “Evaluation criteria definition”) are identified. Third, the number and type of COVID-19 AI classification techniques are described in Section “AI classification techniques”. Fourth, the construction of the four types of DMs based on identified elements is described in Section “DM construction”.

Dataset and pre-processing

In this step, three main elements should be defined, namely, the target dataset, the pre-processing technique required for the dataset and the most suitable features for the classification task [94–98]. Different COVID-19 datasets can be found in the literature. Some are based on X-ray images [33], whilst others are based on CT scan images [32]. Each dataset has some limitations. For example, the number of training samples is small, the provided images are of low quality, and the images differ in size. Thus, pre-processing steps (e.g. using data augmentation [33] techniques to generate more medical image samples in order to provide comprehensive training) are needed to tackle such issues. Furthermore, because COVID-19 can overlap with other pneumonia cases, image segmentation [34] can be used as a pre-processing step to define the region of interest for further analysis of COVID-19 cases. The features extracted from images have a great impact on classification in terms of improving accuracy and minimising error rate, over-fitting and under-fitting issues [100,101]. Thus, all the mentioned scenarios will have a great impact on the results of evaluation and benchmarking for COVID-19 classification techniques. Accordingly, these three elements should be defined to achieve an efficient evaluation and benchmarking process for AI techniques for COVID-19 classification. To train and test COVID-19 classification techniques, the dataset will be separated into two parts. The first part will be used for training, whereas the second part will be used for testing.
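The separation into training and testing parts can be sketched as follows. The split fraction, the random seed and the image identifiers are assumptions made for illustration; the methodology itself does not fix any of them.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical image identifiers standing in for X-ray or CT scan files.
image_ids = [f"img_{i:03d}" for i in range(10)]
train_ids, test_ids = train_test_split(image_ids, test_fraction=0.3)
print(len(train_ids), len(test_ids))
```

In practice the split would usually be stratified by class so that both parts contain COVID-19 and non-COVID-19 cases; that refinement is omitted here to keep the sketch short.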

Evaluation criteria definition

As mentioned before, each classification type has its own evaluation criteria. Accordingly, in this section, the criteria within each classification type are identified; these criteria will be used in the DMs. As mentioned in the critical review and analysis section, classification tasks are divided into four types, namely, binary, multi-class, multi-labelled and hierarchical. On the basis of each classification task, the evaluation criteria of COVID-19 AI classification techniques are listed in Table 2.

Table 2

Evaluation criteria of binary, multi-class, multi-labelled and hierarchical AI classification techniques.

Binary classification
Evaluation criteria	Formula	Description
Accuracy	$\frac{tp + tn}{tp + fn + fp + tn}$	Overall effectiveness of a classifier
Precision	$\frac{tp}{tp + fp}$	Class agreement of the data labels with the positive labels given by the classifier
Recall (sensitivity)	$\frac{tp}{tp + fn}$	Effectiveness of a classifier to identify positive labels
F score	$\frac{(\beta^2 + 1)\,tp}{(\beta^2 + 1)\,tp + \beta^2 fn + fp}$	Relations between data positive labels and those given by a classifier
Specificity	$\frac{tn}{fp + tn}$	How effectively a classifier identifies negative labels
AUC	$\frac{1}{2}\left(\frac{tp}{tp + fn} + \frac{tn}{tn + fp}\right)$	Classifier’s ability to avoid false classification

tp = true positive, tn = true negative, fp = false positive, fn = false negative, AUC = area under the curve, μ = micro-averaging, M = macro-averaging, I = indicator function, Li = set of class labels, , = subclasses assigned by a classifier,,=.

As shown in Table 2, the evaluation criteria for COVID-19 classification techniques differ in number and calculation procedure across the types of classification. For example, binary classification has six evaluation criteria, multi-class classification has eight, hierarchical classification has six and multi-labelled classification has four. Furthermore, as shown in Table 2, the precision criterion in the binary type differs from the precisionμ criterion in the multi-class type because the formulas for the two are different, and other criteria belong to a single classification type. The usage of criteria depends on the target of classification (binary, multi-class, multi-labelled or hierarchical). Thus, different numbers and types of criteria will be involved in the DM of each classification task.
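The six binary criteria in Table 2 can be computed directly from the four confusion-matrix counts. The following is a minimal sketch; the function name, the default β = 1 and the example counts are assumptions for illustration, and the code assumes that no denominator is zero.

```python
def binary_criteria(tp, fp, tn, fn, beta=1.0):
    """Compute the six binary evaluation criteria from confusion counts.

    Assumes every denominator is non-zero (no guard is included here).
    """
    b2 = beta ** 2
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": tp / (tp + fp),
        "recall": recall,
        "f_score": (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp),
        "specificity": specificity,
        # AUC as the balanced average of recall and specificity.
        "auc": 0.5 * (recall + specificity),
    }


# Hypothetical counts for one technique on a test set of 100 images.
result = binary_criteria(tp=40, fp=10, tn=45, fn=5)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because all six values derive from the same four counts, a change in one count moves several criteria at once, which is the source of the conflict discussed earlier.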

AI classification techniques

In this step, the number and type of COVID-19 AI classification techniques are identified, which will be included in each DM type. In general, different types of COVID-19 classification techniques can be found in the literature. Some studies are based on traditional machine learning classification techniques (e.g. Ref. [32]). On the other hand, the majority of classification tasks are based on deep learning techniques (e.g. Refs. [7,33,35]). However, the classification techniques that belong to a similar type (e.g. traditional machine learning and deep learning techniques) need to be included for the evaluation and benchmarking process. Furthermore, the number of candidate classification techniques should be defined in the evaluation and benchmarking scenario. As mentioned in Section “Dataset and pre-processing”, the dataset is divided into training and testing sets. However, each individual instance is supposed to belong to a predefined class [18,97,102,103]. In the testing portion, if the classification technique performance looks ‘acceptable’, then the classification technique can be used to classify future data for which the class label is unknown. Ultimately, the classification technique that provides an acceptable result can be considered an ‘acceptable technique’. Furthermore, for more reliable classification techniques, the difference ratio between the performance of the technique in the training and validation stages in terms of accuracy and loss function is very important to avoid over-fitting and under-fitting issues.

DM construction

The DM is considered the main component in the proposed methodology for the evaluation and benchmarking of AI classification techniques used in COVID-19 medical images. A DM is composed of decision alternatives and identified criteria. In our case, the classification techniques for COVID-19 are the decision alternatives, and the criteria are the evaluation criteria identified for each classification task. As mentioned earlier, the AI domain has four types of classification tasks (binary, multi-class, multi-labelled and hierarchical). Each type has its own evaluation criteria; thus, each type should have a unique DM based on the distinction of the evaluation criteria. In this study, the DMs of COVID-19 medical image classifications will be constructed for four different types, namely, binary DM, multi-class DM, multi-labelled DM and hierarchical DM. The DM data of a specific classification type are generated from the intersection between the COVID-19 AI classification techniques and the evaluation criteria of that classification type as follows.

Binary DM: This DM is constructed on the basis of the intersection between decision alternatives (i.e. set of COVID-19 AI classification techniques) and six evaluation criteria (i.e. accuracy, precision, recall [sensitivity], F score, specificity, area under the curve) as presented in Table 3.

Table 3

DM of COVID-19 AI binary classification techniques.

Evaluation criteria	Accuracy	Precision	Recall (sensitivity)	F score	Specificity	Area under the curve
COVID-19 AI classification techniques
Technique 1	Av (T1/TS)	Pv (T1/TS)	Rv (T1/TS)	FSv (T1/TS)	Sv (T1/TS)	AUCv (T1/TS)
Technique 2	Av (T2/TS)	Pv (T2/TS)	Rv (T2/TS)	FSv (T2/TS)	Sv (T2/TS)	AUCv (T2/TS)
...	...	...	...	...	...	...
Technique n	Av (Tn/TS)	Pv (Tn/TS)	Rv (Tn/TS)	FSv (Tn/TS)	Sv (Tn/TS)	AUCv (Tn/TS)

T = classification technique; Av = accuracy value; Pv = precision value; Rv = recall (sensitivity) value; FSv = F score value; Sv = specificity value; AUCv = area under the curve value; TS = test samples; n = number of AI classification techniques.

Multi-class DM: This DM is constructed on the basis of the intersection between decision alternatives (i.e. set of COVID-19 AI classification techniques) and eight evaluation criteria (i.e. average accuracy, error rate, precisionμ, recallμ, F scoreμ, precisionM, recallM, F scoreM) as presented in Table 4.

Table 4

DM of COVID-19 AI multi-class classification techniques.

Evaluation criteria	Average accuracy	Error rate	Precisionμ	Recallμ	F scoreμ	PrecisionM	RecallM	F scoreM
COVID-19 AI classification techniques
Technique 1	AAv (T1/TS)	ERv (T1/TS)	Pμv (T1/TS)	Rμv (T1/TS)	FSμv (T1/TS)	PMV (T1/TS)	RMV (T1/TS)	FSMV (T1/TS)
Technique 2	AAv (T2/TS)	ERv (T2/TS)	Pμv (T2/TS)	Rμv (T2/TS)	FSμv (T2/TS)	PMV (T2/TS)	RMV (T2/TS)	FSMV (T2/TS)
...	...	...	...	...	...	...	...	...
Technique n	AAv (Tn/TS)	ERv (Tn/TS)	Pμv (Tn/TS)	Rμv (Tn/TS)	FSμv (Tn/TS)	PMV (Tn/TS)	RMV (Tn/TS)	FSMV (Tn/TS)

T = classification technique; AAv = average accuracy value; ERv = error rate value; Pμv = precisionμ value; Rμv = recallμ value; FSμv = F scoreμ value; PMV = precisionM value; RMV = recallM value; FSMV = F scoreM value; TS = test samples; n = number of AI classification techniques.
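The difference between the micro-averaged (μ) and macro-averaged (M) criteria in this DM can be sketched as follows. The multi-class formulas themselves are not reproduced in this text, so the code assumes the commonly used definitions: micro-averaging pools the per-class counts before dividing, whereas macro-averaging averages the per-class ratios. The per-class counts are hypothetical.

```python
def micro_macro_precision_recall(per_class):
    """Micro- and macro-averaged precision and recall.

    per_class: list of (tp, fp, fn) tuples, one tuple per class.
    Assumes every class has tp + fp > 0 and tp + fn > 0.
    """
    tp_sum = sum(tp for tp, _, _ in per_class)
    fp_sum = sum(fp for _, fp, _ in per_class)
    fn_sum = sum(fn for _, _, fn in per_class)
    n = len(per_class)
    return {
        # Micro: pool the counts over classes, then take the ratio.
        "precision_micro": tp_sum / (tp_sum + fp_sum),
        "recall_micro": tp_sum / (tp_sum + fn_sum),
        # Macro: take the ratio per class, then average over classes.
        "precision_macro": sum(tp / (tp + fp) for tp, fp, _ in per_class) / n,
        "recall_macro": sum(tp / (tp + fn) for tp, _, fn in per_class) / n,
    }


# Hypothetical counts: one large, well-classified class and one small, hard class.
averages = micro_macro_precision_recall([(90, 10, 10), (5, 5, 15)])
for name, value in averages.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the micro-averaged values are dominated by the large class, whereas the macro-averaged values are pulled down by the small class, so the two families of criteria can rank techniques differently.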

Multi-labelled DM: This DM is constructed on the basis of the intersection between decision alternatives (i.e. set of COVID-19 AI classification techniques) and four evaluation criteria (i.e. exact match ratio, labelling F score, retrieval F score and Hamming loss) as presented in Table 5.

Table 5

DM of COVID-19 AI multi-labelled classification techniques.

Evaluation criteria	Exact match ratio	Labelling F score	Retrieval F score	Hamming loss
COVID-19 AI classification techniques
Technique 1	EMv (T1/TS)	LFv (T1/TS)	RFv (T1/TS)	HLv (T1/TS)
Technique 2	EMv (T2/TS)	LFv (T2/TS)	RFv (T2/TS)	HLv (T2/TS)
...	...	...	...	...
Technique n	EMv (Tn/TS)	LFv (Tn/TS)	RFv (Tn/TS)	HLv (Tn/TS)

T = classification technique; EMv = exact match ratio value; LFv = labelling F score value; RFv = retrieval F score value; HLv = Hamming loss value; TS = test samples; n = number of AI classification techniques.
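Two of the multi-labelled criteria can be sketched under their commonly used definitions, which are assumed here because the corresponding formulas are not reproduced in this text: the exact match ratio is the fraction of samples whose whole label vector is predicted correctly, and the Hamming loss is the fraction of individual label slots predicted incorrectly. The label vectors are hypothetical.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    n_labels = len(y_true[0])
    wrong = sum(
        ti != pi
        for t, p in zip(y_true, y_pred)
        for ti, pi in zip(t, p)
    )
    return wrong / (len(y_true) * n_labels)


# Hypothetical label vectors: one row per image, one column per finding.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]]

print(exact_match_ratio(y_true, y_pred))
print(hamming_loss(y_true, y_pred))
```

The exact match ratio is a benefit criterion (larger is better), whereas the Hamming loss is a cost criterion (smaller is better), so the two must be treated differently when the DM is ranked.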

Hierarchical DM: This DM is constructed on the basis of the intersection between decision alternatives (i.e. set of COVID-19 AI classification techniques) and six evaluation criteria (i.e. precision↓, recall↓, F score↓, precision↑, recall↑ and F score↑) as presented in Table 6.

Table 6

DM of COVID-19 AI hierarchical classification techniques.

Evaluation criteria	Precision↓	Recall↓	F score↓	Precision↑	Recall↑	F score↑
COVID-19 AI classification techniques
Technique 1	P↓v (T1/TS)	R↓v (T1/TS)	FS↓v (T1/TS)	P↑v (T1/TS)	R↑v (T1/TS)	FS↑v (T1/TS)
Technique 2	P↓v (T2/TS)	R↓v (T2/TS)	FS↓v (T2/TS)	P↑v (T2/TS)	R↑v (T2/TS)	FS↑v (T2/TS)
...	...	...	...	...	...	...
Technique n	P↓v (Tn/TS)	R↓v (Tn/TS)	FS↓v (Tn/TS)	P↑v (Tn/TS)	R↑v (Tn/TS)	FS↑v (Tn/TS)

T = classification technique; P↓v = precision↓ value; R↓v = recall↓ value; FS↓v = F score↓value; P↑v = precision↑ value; R↑v = recall↑ value; FS↑v = F score ↑value; TS = test samples; n = number of AI classification techniques.

The data within the four DMs represent the evaluation values of each COVID-19 AI classification technique based on the identified evaluation criteria of each classification task. Practically, on the basis of these constructed DMs, three evaluation and benchmarking challenges will be encountered in the future (i.e. multiple criteria, trade-off amongst the criteria and criteria importance), as highlighted in Section “Future challenges of the evaluation and benchmarking of AI classification techniques used in the detection of COVID-19 medical images”. The evaluation and benchmarking of AI classification techniques used in COVID-19 medical images is considered a complex MCDA problem. To this end, the development of a decision-making approach is important to handle the complexity of the evaluation and benchmarking problem.

Development phase

To develop a methodology of evaluation and benchmarking of AI classification techniques used in COVID-19 medical image detection, integration of MCDA methods is presented. Such development is based on AHP method for subjective weighting of identified evaluation criteria within each constructed DM as presented in Section “AHP weighting method” and VIKOR method for benchmarking and selecting best alternatives (i.e. COVID-19 AI classification techniques) in the constructed DMs as presented in Section “VIKOR benchmarking method”.

AHP weighting method

This stage presents the process of assigning suitable weights to the multi-evaluation criteria within each DM subjectively based on the AHP method. The AHP approach involves several steps, which are applicable for any AI classification type of COVID-19 medical image detection. The procedure of AHP includes the following steps [55]. Step 1: The problem is modelled as a hierarchy to start the AHP approach. The hierarchy contains the decision goal and the criteria that must be designed. Pairwise comparison amongst the criteria in the DM of each classification type is conducted to obtain the weights subjectively. Examples of pairwise comparison for three criteria are illustrated in Fig. 5 .

Fig. 5

Pairwise comparison example.

Step 2: The AHP builds the pairwise comparison matrix in Eq. (1) to determine the weighting decision:

\[
A = (a_{ij})_{n \times n} =
\begin{pmatrix}
1 & a_{12} & \cdots & a_{1n} \\
a_{21} & 1 & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & 1
\end{pmatrix}
\tag{1}
\]

where $a_{ii} = 1$ and $a_{ji} = 1/a_{ij}$.

Step 3: This step involves designing a pairwise comparison questionnaire for each type of classification and distributing it to the experts. In this step, the number of experts to whom the questionnaire is sent should be defined. The target experts are those who have relevant experience with the case study and a sufficient period of experience in the same domain. Their preferences and judgments on the evaluation criteria of each classification type are collected for use in AHP.

Step 4: In this step, each element of matrix A in Eq. (1) is normalised by its column sum to construct the normalised matrix $A_{norm}$ as follows:

\[
\bar{a}_{ij} = \frac{a_{ij}}{\sum_{k=1}^{n} a_{kj}}
\tag{2}
\]

\[
A_{norm} = (\bar{a}_{ij})_{n \times n}
\tag{3}
\]

Step 5: In this step, the AHP pairwise comparison is processed mathematically to convert the judgments into a weight for each criterion of each AI classification type. The weights of the decision criteria can be calculated using Eq. (4):

\[
w_i = \frac{1}{n} \sum_{j=1}^{n} \bar{a}_{ij}, \qquad \sum_{i=1}^{n} w_i = 1
\tag{4}
\]

where n is the number of compared evaluation criteria of each COVID-19 AI classification type.

Step 6: In this step, Eq. (5) is utilised to check the consistency ratio (CR) of the pairwise comparison matrix as follows:

\[
CR = \frac{CI}{RI}
\tag{5}
\]

The consistency index (CI) is calculated using Eq. (6) as follows:

\[
CI = \frac{\lambda_{max} - n}{n - 1}
\tag{6}
\]

where $\lambda_{max}$ is the maximum eigenvalue of the judgement matrix. The random CI (RI) is calculated using Eq. (7) and depends only on the number of compared criteria n. A pairwise comparison matrix with a corresponding CR of no more than 10% or 0.1 is acceptable; otherwise it will be ignored.
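The AHP weighting and consistency check can be sketched in a few lines. This is an illustration under stated assumptions, not the study's implementation: weights are approximated by column normalisation followed by row averaging, the maximum eigenvalue is estimated from the weighted sums, and the RI values are taken from Saaty's commonly tabulated random indices because the RI formula is not reproduced in this text. The pairwise comparison matrix is hypothetical.

```python
# Saaty's commonly tabulated random index values (assumption, see lead-in).
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(A):
    """Approximate AHP weights: normalise each column, then average each row."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    norm = [[A[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return [sum(row) / n for row in norm]


def consistency_ratio(A, w):
    """CR = CI / RI, with lambda_max estimated as the mean of (A w)_i / w_i."""
    n = len(A)
    Aw = [sum(A[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(Aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    ri = RANDOM_INDEX[n]
    return ci / ri if ri > 0 else 0.0


# Hypothetical judgments for three criteria: the first is twice as important
# as the second and four times as important as the third.
A = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(A)
cr = consistency_ratio(A, weights)
print([round(x, 4) for x in weights], round(cr, 4))
```

The example matrix is perfectly consistent, so its CR is essentially zero and well below the 0.1 acceptance limit; real expert judgments will usually give a small positive CR.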

VIKOR benchmarking method

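The VIKOR method is utilised for benchmarking COVID-19 AI classification techniques because of its suitability for such a purpose. In addition, it can provide rapid results and determine which option is the most appropriate. The COVID-19 AI classification techniques can be benchmarked and ranked according to the VIKOR method using the criteria weights obtained from the AHP method. The VIKOR approach involves several steps [104,105].

Step 1: Identify the best value $f_i^*$ and worst value $f_i^-$ of all criteria within each DM, i = 1, 2, …, n, where $f_{ij}$ is the value of the ith criterion for alternative j. If the ith function represents a benefit criterion (the larger the better):

\[
f_i^* = \max_j f_{ij}, \qquad f_i^- = \min_j f_{ij}
\]

If it represents a cost criterion (the smaller the better):

\[
f_i^* = \min_j f_{ij}, \qquad f_i^- = \max_j f_{ij}
\]

Step 2: AHP is used to compute the weight of each criterion. The set of weights $w = (w_1, w_2, \ldots, w_n)$ from the decision-maker is accommodated in the DM; these weights sum to 1, that is, $\sum_{i=1}^{n} w_i = 1$. Each entry of the resulting matrix is computed as demonstrated in the following equation:

\[
WM_{ji} = w_i \, \frac{f_i^* - f_{ij}}{f_i^* - f_i^-}
\]

A weighted matrix is generated as follows:

\[
WM =
\begin{pmatrix}
w_1 \frac{f_1^* - f_{11}}{f_1^* - f_1^-} & w_2 \frac{f_2^* - f_{21}}{f_2^* - f_2^-} & \cdots & w_n \frac{f_n^* - f_{n1}}{f_n^* - f_n^-} \\
w_1 \frac{f_1^* - f_{12}}{f_1^* - f_1^-} & w_2 \frac{f_2^* - f_{22}}{f_2^* - f_2^-} & \cdots & w_n \frac{f_n^* - f_{n2}}{f_n^* - f_n^-} \\
\vdots & \vdots & \ddots & \vdots \\
w_1 \frac{f_1^* - f_{1J}}{f_1^* - f_1^-} & w_2 \frac{f_2^* - f_{2J}}{f_2^* - f_2^-} & \cdots & w_n \frac{f_n^* - f_{nJ}}{f_n^* - f_n^-}
\end{pmatrix}
\]

Step 3: Compute the $S_j$ and $R_j$ values, j = 1, 2, 3, …, J, i = 1, 2, 3, …, n, by using the following equations:

\[
S_j = \sum_{i=1}^{n} w_i \, \frac{f_i^* - f_{ij}}{f_i^* - f_i^-}
\]

\[
R_j = \max_i \left[ w_i \, \frac{f_i^* - f_{ij}}{f_i^* - f_i^-} \right]
\]

where $w_i$ is the weight of criterion i expressing its relative importance.

Step 4: Compute the values of $Q_j$ using the following relation:

\[
Q_j = v \, \frac{S_j - S^*}{S^- - S^*} + (1 - v) \, \frac{R_j - R^*}{R^- - R^*}
\]

where $S^* = \min_j S_j$, $S^- = \max_j S_j$, $R^* = \min_j R_j$, $R^- = \max_j R_j$ and v is introduced as the weight of the strategy of ‘the majority of criteria’ (or ‘the maximum group utility’); here, v = 0.5.

Step 5: Now the alternative set (i.e. COVID-19 AI classification techniques) can be benchmarked. This process is accomplished by sorting the S, R and Q values in ascending order. The lowest value indicates the optimal performance.

Step 6: Propose the alternative $a'$, which is ranked best by the measure Q (minimum), as a compromise solution if two conditions are satisfied. The conditions are as follows:

R1. ‘Acceptable advantage’:

\[
Q(a'') - Q(a') \geq DQ
\]

where $a''$ is the alternative in the second position in the ranking list by Q, DQ = 1/(J−1) and J is the number of alternatives.

R2. ‘Stability’ is acceptable within the decision-making context. Alternative $a'$ should also be the best as ranked by S and/or R. This compromise solution is stable within the decision-making process, which could be ‘voting by majority rule’ (v > 0.5), ‘by consensus’ (v ≈ 0.5) or ‘with veto’ (v < 0.5). Here, v is the decision-making strategy weight of ‘the majority of criteria’ (or ‘the maximum group utility’).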
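The VIKOR steps can be sketched as a short function. This is a minimal illustration that assumes the standard VIKOR formulation with v = 0.5; the function name, the DM values, the weights and the benefit/cost flags are hypothetical. It also assumes that no criterion is constant across alternatives and that the S and R values are not all equal, since either case would cause a division by zero.

```python
def vikor(dm, weights, benefit, v=0.5):
    """Compute the VIKOR measures S, R and Q for each alternative.

    dm[j][i]   : value of criterion i for alternative j
    weights[i] : criterion weights summing to 1 (e.g. obtained from AHP)
    benefit[i] : True if larger is better, False if smaller is better
    """
    n_alt, n_crit = len(dm), len(weights)
    best, worst = [], []
    for i in range(n_crit):
        col = [dm[j][i] for j in range(n_alt)]
        best.append(max(col) if benefit[i] else min(col))
        worst.append(min(col) if benefit[i] else max(col))

    S, R = [], []
    for j in range(n_alt):
        # Weighted, normalised distance of alternative j from the best value.
        terms = [
            weights[i] * (best[i] - dm[j][i]) / (best[i] - worst[i])
            for i in range(n_crit)
        ]
        S.append(sum(terms))   # group utility
        R.append(max(terms))   # individual regret

    s_best, s_worst = min(S), max(S)
    r_best, r_worst = min(R), max(R)
    Q = [
        v * (S[j] - s_best) / (s_worst - s_best)
        + (1 - v) * (R[j] - r_best) / (r_worst - r_best)
        for j in range(n_alt)
    ]
    return S, R, Q


# Hypothetical DM: three techniques, two benefit criteria and one cost criterion.
dm = [
    [0.95, 0.90, 0.05],
    [0.85, 0.95, 0.15],
    [0.80, 0.80, 0.20],
]
weights = [0.5, 0.3, 0.2]
benefit = [True, True, False]

S, R, Q = vikor(dm, weights, benefit)
ranking = sorted(range(len(dm)), key=lambda j: Q[j])  # lowest Q first
print(ranking, [round(q, 4) for q in Q])
```

The sketch stops at the Q ranking; the acceptable-advantage and stability conditions of Step 6 would still have to be checked before proposing the top alternative as the compromise solution.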

Validation phase

This phase presents the process of objective (Section “Objective validation”) and subjective (Section “Subjective validation”) validations for the results of benchmarking COVID-19 AI classification techniques. Further details are explained in the following subsections.

Objective validation

The results of the proposed methodology will be validated objectively, following an approach similar to that in [106]. To validate the ranking, the COVID-19 AI classification techniques will be divided into n groups on the basis of the ranking results acquired from the proposed methodology. Every group consists of a number of selected COVID-19 AI classification techniques, and the number of techniques within each group varies depending on the scenario. The validation result will not be influenced by the number of groups or by the number of AI classification techniques within each group.

To ensure that the benchmarking results of COVID-19 AI classification techniques are valid, this study utilises two statistical measures: the mean and the standard deviation. The mean ± standard deviation is calculated for each group and is used to check that the set of COVID-19 AI classification techniques is systematically ordered. The mean is the average result, calculated by dividing the sum of the observed results by the number of results. The standard deviation measures the amount of dispersion or variation in a set of values.

For example, consider four groups, each containing n COVID-19 AI classification techniques. In this scenario, the first group must attain the best values, which has to be demonstrated by measuring the mean and the standard deviation. Suppose the first group attains the best mean and standard deviation compared with the other three groups. The mean and standard deviation of the second group must then be poorer than those of the first group and better than those of the third and fourth groups, or equal to those of the third group. Accordingly, for the ranking results to be systematic, the first group must be shown to be the best compared with the other groups.
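The grouping check described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of [106]: the function names and the score values are hypothetical, a lower score is assumed to be better, the sample standard deviation (n - 1 denominator) is assumed, and "systematic ordering" is interpreted as the group means and standard deviations not improving from one group to the next.

```python
from statistics import mean, stdev


def split_into_groups(ranked_scores, n_groups):
    """Split scores, already ordered best-first, into contiguous groups.

    Any leftover items are spread over the first groups, so group sizes
    differ by at most one.
    """
    size, extra = divmod(len(ranked_scores), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(ranked_scores[start:end])
        start = end
    return groups


def group_statistics(groups):
    """Return (mean, sample standard deviation) for each group.

    Assumption: every group holds at least two scores, which stdev requires.
    """
    return [(mean(g), stdev(g)) for g in groups]


def is_systematically_ordered(stats):
    """Check that means and standard deviations never improve down the groups.

    Assumption: lower is better for both measures; equality between
    neighbouring groups is tolerated.
    """
    def non_decreasing(values):
        return all(a <= b for a, b in zip(values, values[1:]))

    means = [m for m, _ in stats]
    sds = [s for _, s in stats]
    return non_decreasing(means) and non_decreasing(sds)


# Hypothetical final scores of 12 techniques, listed in ranked order (best first).
scores = [0.10, 0.12, 0.13, 0.20, 0.24, 0.29, 0.40, 0.47, 0.55, 0.70, 0.82, 0.95]
groups = split_into_groups(scores, 4)
stats = group_statistics(groups)
for i, (m, s) in enumerate(stats, start=1):
    print(f"Group {i}: {m:.4f} ± {s:.4f}")
print("Systematically ordered:", is_systematically_ordered(stats))
```

If the benchmarking method produces scores where higher is better, the comparison direction would need to be reversed; the check itself stays the same.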

Subjective validation

This section describes the subjective validation process. The COVID-19 AI classification techniques will be evaluated by experts who specialise in the AI classification of medical cases. By examining the values of all evaluation criteria used, the experts can confirm the effectiveness of the benchmarking results of COVID-19 AI classification techniques obtained by the proposed decision-making approach.

Conclusion

The COVID-19 pandemic has had a tremendous impact on the lives of people around the world, and the number of infected patients has considerably increased. COVID-19 quickly gained a foothold, and nations, governments and scholars are attempting to address this worldwide crisis. Different medical tests are used in the detection of COVID-19. Several studies have used X-rays and CT scans to reveal anomalies indicative of COVID-19. CT scans and X-rays are utilised as initial detection tools to evaluate the severity of COVID-19, monitor the emergency conditions of patients and predict disease progression. The growing development of AI techniques has made it challenging to evaluate and benchmark these techniques and to determine which technique is suitable for the diagnosis and classification of COVID-19 medical images. Thus, this study presented a systematic review of AI techniques in the detection and classification of COVID-19 medical images in terms of evaluation and benchmarking. The results showed that only 11 studies utilised AI techniques in detecting and classifying COVID-19 with different case studies. However, this study showed that the evaluation and benchmarking of AI classification techniques (i.e. binary, multi-class, multi-labelled and hierarchical classifications), which could be used in the detection and diagnosis of COVID-19 medical images, is a critical gap in the related literature. The challenges of this gap were discussed, and the evaluation and benchmarking of COVID-19 AI classification techniques is considered a multi-complex attribute problem. Thus, using MCDA is essential. As a potential future research direction, this study provided a detailed methodology for the evaluation and benchmarking of AI classification techniques used in the detection of COVID-19 medical images. This methodology is presented on the basis of three sequential phases (i.e. identification, development and validation).

Funding

The authors would like to thank the Universiti Pendidikan Sultan Idris, Malaysia for funding this research under UPSI Rising Star Grant 2019, grant No: 2019-0125-109-01.

Competing interests

None declared.

Ethical approval

Not required.
