Literature DB >> 32457147

Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis.

Agostina J Larrazabal1, Nicolás Nieto1,2, Victoria Peterson2,3, Diego H Milone1, Enzo Ferrante4.   

Abstract

Artificial intelligence (AI) systems for computer-aided diagnosis and image-based screening are being adopted worldwide by medical institutions. In such a context, generating fair and unbiased classifiers becomes of paramount importance. The research community of medical image computing is making great efforts in developing more accurate algorithms to assist medical doctors in the difficult task of disease diagnosis. However, little attention is paid to the way databases are collected and how this may influence the performance of AI systems. Our study sheds light on the importance of gender balance in medical imaging datasets used to train AI systems for computer-assisted diagnosis. We provide empirical evidence supported by a large-scale study, based on three deep neural network architectures and two well-known publicly available X-ray image datasets used to diagnose various thoracic diseases under different gender imbalance conditions. We found a consistent decrease in performance for underrepresented genders when a minimum balance is not fulfilled. This raises the alarm for national agencies in charge of regulating and approving computer-assisted diagnosis systems, which should include explicit gender balance and diversity recommendations. We also establish an open problem for the academic medical image computing community which needs to be addressed by novel algorithms endowed with robustness to gender imbalance.
Copyright © 2020 the Author(s). Published by PNAS.

Entities:  

Keywords:  computer-aided diagnosis; deep learning; gender bias; gendered innovations; medical image analysis

Year:  2020        PMID: 32457147      PMCID: PMC7293650          DOI: 10.1073/pnas.1919012117

Source DB:  PubMed          Journal:  Proc Natl Acad Sci U S A        ISSN: 0027-8424            Impact factor:   11.205


Artificial intelligence (AI) influences almost every aspect of our daily life. The media articles we read, the movies we watch, even the driving road map we take are somehow influenced by these systems. In particular, the rise of AI in healthcare during the last few years is changing the way medical doctors diagnose, especially when dealing with medical images. AI systems cannot only augment the information provided by such images with useful annotations (1, 2), but they are also starting to take autonomous decisions by performing computer-aided diagnosis (CAD) (3, 4). Although the interest in performing fair and unbiased evaluations of AI medical systems has existed since the 1980s (5), the ethical aspects of AI have gained relevance in the last few years. It has been shown that human bias, such as gender and racial bias, may not only be inherited but also amplified by AI systems in multiple contexts (6–9). For example, face recognition systems have been shown to exhibit accuracy disparities depending on gender and ethnicity, with darker-skinned females being the most misclassified group (10). This tendency of AI systems to learn biased models, which reproduce social stereotypes and underperform in minority groups, is especially dangerous in the context of healthcare (11, 12). In recent years, the research community of gendered innovations has largely contributed to create awareness and integrate sex and gender analyses into all phases of basic and applied research (13). However, such assessment in the context of medical imaging and CAD remains largely unexplored. In this work, we perform a large-scale study that quantifies the influence of gender imbalance in medical imaging datasets used to train AI-based CAD systems. It is worth mentioning that most of the existing work dealing with imbalanced data in the context of deep learning focuses on cases where it is related to the target classes (14, 15). In our study, this would translate to an imbalance in terms of number of patients per pathology. However, note that, in this case, the imbalance is given by a demographic variable different from the target class: gender, which is generally neglected. Our results show that using gender-imbalanced datasets to train deep learning-based CAD systems may affect the performance in pathology classification for minority groups.

Results and Discussion

A model based on deep neural networks, which achieves state-of-the-art results when diagnosing 14 common thoracic diseases using X-ray images (16), was implemented to perform CAD. We employed the area under the receiver operating characteristic curve (AUC) (17) to quantify its performance. Fig. 1 shows the experimental results obtained when training the classifier under different gender imbalance ratios. In Fig. 1, the box plots aggregate the results for 20 experiments using fully imbalanced datasets. The blue boxes represent the performance for models trained only with male images, while orange boxes indicate training with female-only images. Both models are evaluated over male-only (Fig. 1 , Top) and female-only (Fig. 1 , Bottom) test images. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice-versa). The same tendency was confirmed when evaluating three different deep learning architectures in two X-ray datasets with different pathologies.
Fig. 1.

Experimental results for a DenseNet-121 (18) classifier trained with images from the NIH dataset (16, 19) for 14 thoracic diseases under different gender imbalance ratios. (A) The box plots aggregate the results for 20 folds, training with male-only (blue) and female-only (orange) patients. Both models are evaluated given male (Top) and female (Bottom) test folds. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). (B and C) AUC achieved for two exemplar diseases under a gradient of gender imbalance ratios, from 0% of female images in training data to 100%, with increments of 25%. In B, 1 and 2 show the results when testing on male patients, while, in C, 1 and 2 present the results when testing on female patients. Statistical significance according to Mann–Whitney U test is denoted by **** ( 0.00001), *** (0.00001 0.0001), ** (0.0001 0.001), * (0.001 , 0.01), and not significant (ns) ().

Experimental results for a DenseNet-121 (18) classifier trained with images from the NIH dataset (16, 19) for 14 thoracic diseases under different gender imbalance ratios. (A) The box plots aggregate the results for 20 folds, training with male-only (blue) and female-only (orange) patients. Both models are evaluated given male (Top) and female (Bottom) test folds. A consistent decrease in performance is observed when using male patients for training and female for testing (and vice versa). (B and C) AUC achieved for two exemplar diseases under a gradient of gender imbalance ratios, from 0% of female images in training data to 100%, with increments of 25%. In B, 1 and 2 show the results when testing on male patients, while, in C, 1 and 2 present the results when testing on female patients. Statistical significance according to Mann–Whitney U test is denoted by **** ( 0.00001), *** (0.00001 0.0001), ** (0.0001 0.001), * (0.001 , 0.01), and not significant (ns) (). We also explored intermediate imbalance scenarios, where both female and male patients were present in the training dataset but considering different proportions (0%/100, 25%/75, and ). Fig. 1 shows the average classification performance for two exemplar diseases, Pneumothorax and Atelectasis, under such gradient of gender imbalance ratios (indicated with the percentage of female patients used for training). We found that, even with a 25/75 imbalance ratio, the average performance across all diseases in the minority class is significantly lower than a model trained with a perfectly balanced dataset. Moreover, we did not find significant differences in performance between models trained with a gender-balanced dataset (50 male and 50 female) and an extremely imbalanced dataset from the same gender. In other words, a CAD system trained with a diverse (and balanced) dataset achieved the best performance for both genders. Altogether, our results indicate that diversity provides additional information and increases the generalization capability of AI systems. Thereafter, it also suggests that diversity should be prioritized when designing databases used to train machine learning-based CAD systems. Our study shows that gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis based on convolutional neural networks (CNNs), with significantly lower performance in underrepresented groups. We provide experimental evidence in the context of X-ray image classification for such potential bias, aiming to raise the alarm not only within the medical image computing community but also for national agencies in charge of regulating and approving medical systems. As an example, let us take the US Food and Drug Administration. Even though they have released several documents related to the importance of gender/sex issues in the design and evaluation of clinical trials and medical devices (21), when looking at the specific guidelines to obtain the certification to market medical computer-aided systems (22, 23), there is no explicit mention of gender/sex as one of the relevant demographic variables that should describe the sampled population. Similar issues are observed in the medical imaging community. Albeit a few datasets provide this information at the subject level, most public datasets of similar characteristics do not contain gender/sex information at the patient level to date [e.g., the recent MIMIC-CXR (24) x-ray dataset or the Retinal Fundus Glaucoma Challenge (REFUGE) database of ophtalmological images (25), just to name a few]. The same tendency is observed in many of the datasets included in a recent analysis of 150 databases from grand challenges on biomedical image analysis (26), which provides recommendations for database and challenge design, where there is no explicit mention of the importance of sex/gender demographic information. In general, it is well known that CNNs tend to learn representations useful to solve the task they are being trained for. When we go from male to female images (or vice versa), structural changes in the images appear, leading to a change in data distribution which explains the decrease in performance. Algorithmic solutions to such “domain adaptation” problems (27) should be engineered, especially in cases when it is difficult to obtain gender-balanced datasets [e.g., Autism Brain Imaging Data Exchange (ABIDE) I (28)].

Materials and Methods

Datasets.

We use the NIH Chest-XRay14 dataset (16, 19), which includes 112,120 chest X-ray images from 30,805 patients, labeled with 14 common thorax diseases (including hernia, pneumonia, fibrosis, emphysema, edema, cardiomegaly, pleural thickening, consolidation, mass, pneumothorax, nodule, atelectasis, effusion, and infiltration). Labeling was performed according to an automatic natural language processing analysis of the radiology reports. The dataset provides demographic information including the patient’s gender: 63,340 (56.5) images for male and 48,780 (43.5) images for female patients. Following the demographic variables reported in the original dataset publication (19), we used the term “gender” to characterize our imbalance study. However, given that some anatomical attributes are reflected in X-ray images, the term sex could be more accurate, according to the Sex and Gender Equity in Research guidelines (29). The CheXpert database (30) was also used to confirm that our observations generalize for different datasets. It contains 224,316 chest radiographs of 65,240 patients with diagnostic information ( male and female). The uncertainty labels included in CheXpert were interpreted as negative following the U-Zeros approach discussed in the original paper (30).

Deep Learning Model.

Deep neural networks are machine learning methods with multiple abstraction levels, which compose simple but nonlinear modules transforming representations at one level into a representation at a higher, slightly more abstract level (31). A special type of deep neural network, known as CNNs, was used to implement the CAD system (19, 20). Results shown in Fig. 1 correspond to a Densely Connected CNN (DenseNet) architecture with 14 outputs, one for each disease (18). We adopted a Keras implementation of the DenseNet-121 which has been shown to achieve state-of-the-art results in X-ray image classification (16). The network has 121 convolutional layers and a final fully connected layer producing a 14-dimensional output, after which we apply an element-wise sigmoid nonlinearity. A model pretrained on ImageNet (32) was used to initialize the network weights. We trained it end to end using Adam optimizer with standard parameters (1 = 0.9 and 2 = 0.999), a batch size of 32, and an initial learning rate of 0.001 that was decayed by a factor of 10 each time the validation loss plateaued after an epoch. Additionally, we evaluated two other CNN architectures, the ResNet (33) and the Inception-v3 (34), confirming that our observations generalize for different neural models.

Methodology.

Since images can be labeled with multiple diseases, we implemented an automatic method to construct random splits, which guarantees that male and female folds will have the same number of images per pathology. Given a frontal X-ray image, the CAD system predicts the presence or absence of the 14 thoracic diseases. Two models were trained in each experiment, one considering male-only datasets, while the other considered female-only training datasets. Intermediate imbalance scenarios were also analyzed, in which female and male images were presented in the training dataset at different proportions (0%/100, 25%/75 and ). To avoid other sources of bias, care was taken to guarantee, by training data construction, that male and female folds include the same number of pathological cases per class. For the NIH Chest-XRay14, every split included 48,568 images. For the CheXpert dataset, every split included 27,147 images. The same experiment was performed 20 times, using different random splits. In the testing phase, both models were evaluated in male and female patients separately. The classification performance was measured by the well-known AUC (17).

Data Availability.

The NIH Chest-XRay14 dataset is publicly available at https://nihcc.app.box.com/v/ChestXray-NIHCC. The CheXpert dataset is publicly available at https://stanfordmlgroup.github.io/competitions/chexpert/. The source code of the original CNNs is publicly available at https://github.com/brucechou1983/CheXNet-Keras. The modified version of this code with our auxiliary scripts, the data splits used in our experiments, and the additional results for all of the CNN architectures in both datasets can be accessed at https://github.com/N-Nieto/GenderBias_CheXNet.
  13 in total

Review 1.  Deep learning.

Authors:  Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal:  Nature       Date:  2015-05-28       Impact factor: 49.962

2.  AI can be sexist and racist - it's time to make it fair.

Authors:  James Zou; Londa Schiebinger
Journal:  Nature       Date:  2018-07       Impact factor: 49.962

3.  REFUGE Challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs.

Authors:  José Ignacio Orlando; Huazhu Fu; João Barbosa Breda; Karel van Keer; Deepti R Bathula; Andrés Diaz-Pinto; Ruogu Fang; Pheng-Ann Heng; Jeyoung Kim; JoonHo Lee; Joonseok Lee; Xiaoxiao Li; Peng Liu; Shuai Lu; Balamurali Murugesan; Valery Naranjo; Sai Samarth R Phaye; Sharath M Shankaranarayana; Apoorva Sikka; Jaemin Son; Anton van den Hengel; Shujun Wang; Junyan Wu; Zifeng Wu; Guanghui Xu; Yongli Xu; Pengshuai Yin; Fei Li; Xiulan Zhang; Yanwu Xu; Hrvoje Bogunović
Journal:  Med Image Anal       Date:  2019-10-08       Impact factor: 8.545

4.  Implementing Machine Learning in Health Care - Addressing Ethical Challenges.

Authors:  Danton S Char; Nigam H Shah; David Magnus
Journal:  N Engl J Med       Date:  2018-03-15       Impact factor: 91.245

Review 5.  Do no harm: a roadmap for responsible machine learning for health care.

Authors:  Jenna Wiens; Suchi Saria; Anna Goldenberg; Mark Sendak; Marzyeh Ghassemi; Vincent X Liu; Finale Doshi-Velez; Kenneth Jung; Katherine Heller; David Kale; Mohammed Saeed; Pilar N Ossorio; Sonoo Thadaney-Israni
Journal:  Nat Med       Date:  2019-08-19       Impact factor: 53.440

6.  Enhancing studies of the connectome in autism using the autism brain imaging data exchange II.

Authors:  Adriana Di Martino; David O'Connor; Bosi Chen; Kaat Alaerts; Jeffrey S Anderson; Michal Assaf; Joshua H Balsters; Leslie Baxter; Anita Beggiato; Sylvie Bernaerts; Laura M E Blanken; Susan Y Bookheimer; B Blair Braden; Lisa Byrge; F Xavier Castellanos; Mirella Dapretto; Richard Delorme; Damien A Fair; Inna Fishman; Jacqueline Fitzgerald; Louise Gallagher; R Joanne Jao Keehn; Daniel P Kennedy; Janet E Lainhart; Beatriz Luna; Stewart H Mostofsky; Ralph-Axel Müller; Mary Beth Nebel; Joel T Nigg; Kirsten O'Hearn; Marjorie Solomon; Roberto Toro; Chandan J Vaidya; Nicole Wenderoth; Tonya White; R Cameron Craddock; Catherine Lord; Bennett Leventhal; Michael P Milham
Journal:  Sci Data       Date:  2017-03-14       Impact factor: 6.444

Review 7.  Computer-aided detection in chest radiography based on artificial intelligence: a survey.

Authors:  Chunli Qin; Demin Yao; Yonghong Shi; Zhijian Song
Journal:  Biomed Eng Online       Date:  2018-08-22       Impact factor: 2.819

8.  MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.

Authors:  Alistair E W Johnson; Tom J Pollard; Seth J Berkowitz; Nathaniel R Greenbaum; Matthew P Lungren; Chih-Ying Deng; Roger G Mark; Steven Horng
Journal:  Sci Data       Date:  2019-12-12       Impact factor: 6.444

9.  Clinically applicable deep learning for diagnosis and referral in retinal disease.

Authors:  Jeffrey De Fauw; Joseph R Ledsam; Bernardino Romera-Paredes; Stanislav Nikolov; Nenad Tomasev; Sam Blackwell; Harry Askham; Xavier Glorot; Brendan O'Donoghue; Daniel Visentin; George van den Driessche; Balaji Lakshminarayanan; Clemens Meyer; Faith Mackinder; Simon Bouton; Kareem Ayoub; Reena Chopra; Dominic King; Alan Karthikesalingam; Cían O Hughes; Rosalind Raine; Julian Hughes; Dawn A Sim; Catherine Egan; Adnan Tufail; Hugh Montgomery; Demis Hassabis; Geraint Rees; Trevor Back; Peng T Khaw; Mustafa Suleyman; Julien Cornebise; Pearse A Keane; Olaf Ronneberger
Journal:  Nat Med       Date:  2018-08-13       Impact factor: 53.440

Review 10.  Sex and Gender Equity in Research: rationale for the SAGER guidelines and recommended use.

Authors:  Shirin Heidari; Thomas F Babor; Paola De Castro; Sera Tort; Mirjam Curno
Journal:  Res Integr Peer Rev       Date:  2016-05-03
View more
  48 in total

1.  Artificial Intelligence in COPD: New Venues to Study a Complex Disease.

Authors:  Raúl San José Estépar
Journal:  Barc Respir Netw Rev       Date:  2020 May-Dec

2.  Deep learning prediction of sex on chest radiographs: a potential contributor to biased algorithms.

Authors:  David Li; Cheng Ting Lin; Jeremias Sulam; Paul H Yi
Journal:  Emerg Radiol       Date:  2022-01-10

3.  Sensitivity of neural networks to corruption of image classification.

Authors:  Shimon Kaplan; Doron Handelman; Amir Handelman
Journal:  AI Ethics       Date:  2021-03-23

Review 4.  Navigating the pitfalls of applying machine learning in genomics.

Authors:  Sean Whalen; Jacob Schreiber; William S Noble; Katherine S Pollard
Journal:  Nat Rev Genet       Date:  2021-11-26       Impact factor: 53.242

5.  Ethical Machine Learning in Healthcare.

Authors:  Irene Y Chen; Emma Pierson; Sherri Rose; Shalmali Joshi; Kadija Ferryman; Marzyeh Ghassemi
Journal:  Annu Rev Biomed Data Sci       Date:  2021-05-06

Review 6.  Artificial intelligence and machine learning for medical imaging: A technology review.

Authors:  Ana Barragán-Montero; Umair Javaid; Gilmer Valdés; Dan Nguyen; Paul Desbordes; Benoit Macq; Siri Willems; Liesbeth Vandewinckele; Mats Holmström; Fredrik Löfman; Steven Michiels; Kevin Souris; Edmond Sterpin; John A Lee
Journal:  Phys Med       Date:  2021-05-09       Impact factor: 2.685

Review 7.  Preventing dataset shift from breaking machine-learning biomarkers.

Authors:  Jérôme Dockès; Gaël Varoquaux; Jean-Baptiste Poline
Journal:  Gigascience       Date:  2021-09-28       Impact factor: 6.524

8.  Towards gender equity in artificial intelligence and machine learning applications in dermatology.

Authors:  Michelle S Lee; Lisa N Guo; Vinod E Nambudiri
Journal:  J Am Med Inform Assoc       Date:  2022-01-12       Impact factor: 7.942

9.  An empirical characterization of fair machine learning for clinical risk prediction.

Authors:  Stephen R Pfohl; Agata Foryciarz; Nigam H Shah
Journal:  J Biomed Inform       Date:  2020-11-18       Impact factor: 6.317

10.  Effects of Neighborhood-level Data on Performance and Algorithmic Equity of a Model That Predicts 30-day Heart Failure Readmissions at an Urban Academic Medical Center.

Authors:  Gary E Weissman; Stephanie Teeple; Nwamaka D Eneanya; Rebecca A Hubbard; Shreya Kangovi
Journal:  J Card Fail       Date:  2021-05-26       Impact factor: 6.592

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.