Pedro L Moura, Johannes G G Dobbe, Geert J Streekstra, Minke A E Rab, Martijn Veldthuis, Elisa Fermo, Richard van Wijk, Rob van Zwieten, Paola Bianchi, Ashley M Toye, Timothy J Satchwell.
Year: 2020 PMID: 32627174 PMCID: PMC8221027 DOI: 10.1111/bjh.16868
Source DB: PubMed Journal: Br J Haematol ISSN: 0007-1048 Impact factor: 6.998
Fig 1. Different hereditary rare anaemias display distinct area and deformability profiles. (A) Design of the method for automatic sample classification. Whole blood is collected by the clinician, and a sample is obtained and processed using an Automated Rheoscope and Cell Analyzer (ARCA). Images acquired are subjected to computational analysis to determine cross‐sectional area and deformability of at least 1000 individual cells, and the resulting datasets are then classified through trained computational models, achieving a diagnosis in less than 30 min. (B) Contour plots of cross‐sectional area plotted against the deformability index (as measured by dividing cell length by cell width), visualizing the probability distribution of erythrocytes (RBCs), cultured reticulocytes (reticulocytes) and erythrocytes treated with an anti‐Glycophorin A antibody (BRIC256, International Blood Group Reference Laboratory) before analysis to induce membrane stiffening (BRIC256 RBCs). The control erythrocyte and cultured reticulocyte data shown in this panel were previously reported in Moura et al.⁹ A minimum of 1000 cells were analysed per sample. All samples were analysed using the ARCA. (C) Contour plots of cross‐sectional area plotted against the deformability index (as measured by dividing cell length by cell width), visualizing the probability distribution of patient samples overlaid to allow for comparison with healthy controls. A minimum of 1000 cells were analysed per blood sample. All samples were analysed using the ARCA. The samples are listed from left to right. Top row: healthy controls (n = 6), hereditary spherocytosis patients (n = 13), congenital dyserythropoietic anaemia II patients (n = 9). Bottom row: pyruvate kinase deficiency patients (n = 6), dehydrated stomatocytosis type 1 or hereditary xerocytosis patients (n = 10), dehydrated stomatocytosis type 2 or Gardos xerocytosis patients (n = 3).
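The two quantities plotted in panels (B) and (C) can be illustrated with a minimal sketch. It assumes only what the caption states: the deformability index is cell length divided by cell width, and the contours show a probability density estimated from per-cell measurements. The Gaussian kernel, the bandwidths `hx` and `hy`, the function names and the example measurements are all hypothetical choices made for illustration; they are not taken from the ARCA software or from the authors' analysis code.

```python
import math


def deformability_index(length, width):
    """Deformability index as defined in the caption: cell length / cell width."""
    if width <= 0:
        raise ValueError("cell width must be positive")
    return length / width


def gaussian_kde_2d(points, hx, hy):
    """Return a function estimating the 2D density of `points`.

    Uses a product Gaussian kernel with fixed bandwidths hx and hy
    (an assumption; the caption does not specify kernel or bandwidth).
    """
    n = len(points)
    if n == 0:
        raise ValueError("at least one point is required")
    norm = 1.0 / (n * 2.0 * math.pi * hx * hy)

    def density(x, y):
        total = 0.0
        for px, py in points:
            zx = (x - px) / hx
            zy = (y - py) / hy
            total += math.exp(-0.5 * (zx * zx + zy * zy))
        return norm * total

    return density


# Hypothetical (length, width, area) measurements, NOT values from the paper.
cells = [(12.0, 6.0, 52.0), (11.0, 5.5, 50.0), (9.0, 6.0, 47.0), (13.0, 5.2, 55.0)]
points = [(area, deformability_index(length, width)) for length, width, area in cells]
density = gaussian_kde_2d(points, hx=3.0, hy=0.2)

# Evaluate the density on a coarse (area, deformability index) grid;
# a plotting library could draw contour lines from these values.
grid = [[density(area, di) for area in range(40, 61, 5)] for di in (1.5, 2.0, 2.5)]
```

In practice a sample would contribute at least 1000 cells rather than four, and a library routine for kernel density estimation would replace the explicit loop.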
Fig 2. Machine‐learning‐based classification of automated rheoscopy datasets provides accurate diagnoses for unseen samples. (A) Flow diagram outlining the procedure for ARCA‐based data visualisation and automated sample classification. The sample is first analysed to produce a raw data table. These data are then reorganised into a Python pandas (“panel data”) data frame for ease of processing. If visualisation is required, samples from a given sample type are stochastically equalised in cell number, joined and subjected to kernel density estimation to estimate the probability density functions of analysed features (e.g. cross‐sectional area, deformability index, cell angle) and then visualised through contour plots or scatter plots. Data to be used for machine learning undergo feature extraction (removal of all non‐essential information) and a subsection is sampled randomly (without replacement) for creation of a testing set. The remaining data then undergo augmentation by generation of a series of randomly sampled datasets (with replacement, 10,000×) which will be used for training a supervised machine‐learning algorithm. After training, a predictive model (i.e. classifier) is generated, which is first tested with the previously generated testing set. Upon satisfactory results with the testing set, the classifier can then generate predictions for new unseen data. The final results consist of a sample label (or classification) and the certainty of that classification. (B) Comparison of the overall prediction accuracy of multiple supervised machine‐learning algorithms in ARCA‐based automated sample diagnosis as a function of the number of datasets per condition used for classifier training (from no datasets used, which should result in a random diagnosis, to a maximum of six datasets), comparing the samples analysed at the University of Bristol (except Gardos xerocytosis samples, which were too few to analyse).
Prediction accuracy is coloured on a percentage scale from red (0%) to blue (100%). The best‐performing algorithm for each number of datasets is shown in bold in the accuracy matrix. The graph displays the average prediction accuracy of all algorithms (blue). Error bars = ± standard deviation (SD). The prediction accuracies of the best‐performing algorithms are plotted in green, while the prediction accuracies of the worst‐performing algorithms are plotted in red. (C) Prediction accuracy of the best‐performing algorithm in (B). The samples used consist of healthy controls, congenital dyserythropoietic anaemia II patients (CDAII), hereditary spherocytosis patients (HS), hereditary xerocytosis patients (HX) and pyruvate kinase deficiency patients (PKD). Rows identify real samples provided, whilst columns identify the algorithm's prediction of the provided samples' identity. The blue diagonal indicates samples that were correctly diagnosed (true positives). Red cells in the surrounding matrix indicate incorrect diagnoses (i.e. two HS samples were misdiagnosed as CDAII and one HX sample was misdiagnosed as HS). Accuracy is provided as a percentage of the true positives within the total number of samples and is coloured on a percentage scale from red (0%) to blue (100%). Average accuracy is provided as an average of the accuracies for all sample types. Data for all other algorithms and sample numbers tested are provided in Figs S1–S7. (D) Comparison of the overall prediction accuracy of multiple supervised machine‐learning algorithms in ARCA‐based automated sample diagnosis as a function of the number of datasets used for training, comparing samples from healthy controls, hereditary spherocytosis patients and hereditary elliptocytosis patients analysed at Sanquin. The graph displays the average prediction accuracy of all algorithms (blue). Error bars = ±SD.
The prediction accuracies of the best‐performing algorithms are plotted in green, while the prediction accuracies of the worst‐performing algorithms are plotted in red.
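The workflow in panel (A) and the accuracy definition in panel (C) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier is a stand-in for the supervised algorithms compared in the figure, each augmented dataset is assumed to be a resample of one sample's cells summarised by two mean features, the number of resamples is reduced from 10,000 to 50, and the labels `class_a` and `class_b` and all measurements are synthetic. Every function name here is hypothetical.

```python
import random
import statistics


def summarise(cells):
    """Feature extraction: reduce one sample's (area, deformability) cells to two means."""
    return (statistics.fmean([a for a, _ in cells]),
            statistics.fmean([d for _, d in cells]))


def hold_out(samples, n_test, rng):
    """Draw n_test samples WITHOUT replacement to form the testing set."""
    test_idx = set(rng.sample(range(len(samples)), n_test))
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


def augment(cells, n_resamples, rng):
    """Augmentation: resample the cells WITH replacement, n_resamples times."""
    return [summarise(rng.choices(cells, k=len(cells))) for _ in range(n_resamples)]


def fit_centroids(train, n_resamples, rng):
    """Train the stand-in classifier: one mean feature vector per label (unscaled)."""
    by_label = {}
    for label, cells in train:
        by_label.setdefault(label, []).extend(augment(cells, n_resamples, rng))
    return {label: tuple(statistics.fmean(col) for col in zip(*vecs))
            for label, vecs in by_label.items()}


def predict(centroids, cells):
    """Return the label whose centroid is closest to the sample's features."""
    f = summarise(cells)
    return min(centroids,
               key=lambda lab: sum((x - c) ** 2 for x, c in zip(f, centroids[lab])))


def per_class_accuracy(truth, preds):
    """True positives divided by the number of samples of each true class."""
    totals, hits = {}, {}
    for t, p in zip(truth, preds):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


def synthetic_sample(label, mean_area, mean_di, n_cells, rng):
    """Generate a synthetic sample; values are illustrative, NOT patient data."""
    cells = [(rng.gauss(mean_area, 5.0), rng.gauss(mean_di, 0.2)) for _ in range(n_cells)]
    return (label, cells)


rng = random.Random(0)
samples = ([synthetic_sample("class_a", 50.0, 2.0, 200, rng) for _ in range(6)]
           + [synthetic_sample("class_b", 40.0, 1.5, 200, rng) for _ in range(6)])

train, test = hold_out(samples, 4, rng)
centroids = fit_centroids(train, 50, rng)

truth = [label for label, _ in test]
preds = [predict(centroids, cells) for _, cells in test]
accuracy = per_class_accuracy(truth, preds)
# Average accuracy: the mean of the per-class accuracies, as in panel (C).
average_accuracy = statistics.fmean(list(accuracy.values()))
```

Drawing the testing set before augmentation keeps resampled copies of a test sample out of the training data, which is the order of operations the caption describes.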