PRECISION.array: An R Package for Benchmarking microRNA Array Data Normalization in the Context of Sample Classification.

Huei-Chung Huang¹, Yilin Wu¹, Qihang Yang¹, Li-Xuan Qin¹.

Abstract

We present a new R package, PRECISION.array, for assessing the performance of data normalization methods in connection with methods for sample classification. It includes two microRNA microarray datasets for the same set of tumor samples, a re-sampling-based algorithm for simulating additional paired datasets under various designs of sample-to-array assignment and levels of signal-to-noise ratio, and a collection of numerical and graphical tools for method performance assessment. The package allows users to specify their own methods for normalization and classification, in addition to implementing three methods for training data normalization, seven methods for test data normalization, seven methods for classifier training, and two methods for classifier validation. It enables an objective and systematic evaluation of the operating characteristics of normalization and classification methods in microRNA microarrays. To our knowledge, this is the first such tool available. The R package can be downloaded freely at https://github.com/LXQin/PRECISION.array.

Keywords: benchmarking; classification; microRNA; microarray; normalization

Year: 2022 PMID: 35938023 PMCID: PMC9354575 DOI: 10.3389/fgene.2022.838679

Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772

Introduction

Sample classification is an important goal in precision oncology for informing practitioners on treatment decisions and trialists on patient stratification (Pencina and Peterson, 2016; Pencina et al., 2020). Many classifiers reported in the literature have suffered from irreproducibility, partly due to data artifacts that result from disparate handling of tissue specimens (Simon et al., 2003; Ransohoff, 2005; Akey et al., 2007; Ioannidis et al., 2009; McShane et al., 2013). While data normalization is routinely used to circumvent the negative impacts of these artifacts, its performance has been evaluated primarily for differential expression analysis and has yet to be thoroughly assessed for the development of sample classifiers (Rahman et al., 2015; Qin et al., 2016). To enable such an assessment, we utilized two datasets from the same set of tumor samples using Agilent microarrays for microRNAs (miRNAs), a class of small RNAs closely linked to carcinogenesis (Dillies et al., 2013; Maza et al., 2013). One dataset was collected with uniform handling and balanced array-to-sample assignment, and the other had the samples arrayed over time in the order of collection (Qin et al., 2014; Qin et al., 2018). To simulate additional paired datasets that mimic real-world data distribution, we used the first dataset to approximate the biological effects for each sample, serving as the “virtual samples”; we used the difference between the two arrays (one from each dataset) for the same sample to approximate the array effects for each array in the second dataset, serving as the “virtual arrays.” They can then be used for “virtual re-hybridization,” a re-sampling-based algorithm, to simulate data under various signal-to-noise ratios (Rahman et al., 2015; Qin et al., 2018).
We have built an R package PRECISION.array, PaiREd miCrorna analysIs of molecular clasSificatION for microarrays (https://github.com/LXQin/PRECISION.array), for interested researchers to use for assessing their choice of normalization methods in combination with various methods for sample classifier training and validation under a chosen level of signal-to-noise ratio.

Implementation

MiRNAs were profiled twice for 96 endometrioid endometrial and 96 serous ovarian tumor samples. One dataset used uniform handling (by one technician in one batch) and balanced array-to-sample assignment (via blocking and randomization), and the other used neither (by two technicians in multiple batches with the arrays assigned in the order of tumor sample collection) (Qin et al., 2014; Qin et al., 2018). The data for a random subset of the miRNAs are included in the PRECISION.array package for demonstration purposes. The full datasets can be loaded from the PRECISION.array.DATA package (https://github.com/LXQin/PRECISION.array.DATA), where the first dataset can be called using the function data.benchmark() and the second using data.test(). The uniformly handled dataset is used to approximate the biological effects for each sample by calling the function estimate.biological.effect(); the difference between the two arrays (one from each dataset) for a sample is used to estimate the array effects for each array in the non-uniformly handled dataset by calling the function estimate.handling.effect(). We will refer to the former as “virtual samples” and the latter as “virtual arrays.” For proof of principle, we use tumor type, endometrial versus ovarian, as the endpoint for classification. The level of biological signals can be adjusted by calling the function reduce.signal(); the extent of handling effects can be changed by calling the function amplify.handling.effect(). The 192 virtual samples are split randomly (balanced by tumor type) in a 2:1 ratio into a training set and a test set; the 192 virtual arrays are split nonrandomly, with the first 64 and last 64 arrays in the order of array processing for the training set and the middle 64 arrays for the test set.
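The estimation and splitting steps described above can be illustrated with a minimal Python sketch. It is not the package's R implementation: the function names, the dictionary-based data layout, and the sign convention for the array effect (non-uniform minus uniform) are assumptions made for illustration only.

```python
import random

def estimate_effects(uniform, nonuniform):
    """Approximate virtual samples and virtual arrays (hypothetical helper).

    uniform / nonuniform: dict sample_id -> list of log2 intensities, with
    markers in the same order. The array effect is taken as non-uniform minus
    uniform, which is an assumed sign convention.
    """
    biological = {s: list(v) for s, v in uniform.items()}
    handling = {
        s: [y - x for x, y in zip(uniform[s], nonuniform[s])]
        for s in uniform
    }
    return biological, handling

def split_samples(labels, seed=0):
    """Random 2:1 split of sample ids into training and test, within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels.values())):
        ids = sorted(s for s, c in labels.items() if c == cls)
        rng.shuffle(ids)
        cut = (2 * len(ids)) // 3
        train += ids[:cut]
        test += ids[cut:]
    return train, test

def split_arrays(array_order):
    """Nonrandom split: first and last thirds (in processing order) form the
    training set, the middle third forms the test set."""
    n = len(array_order)
    third = n // 3
    train = array_order[:third] + array_order[n - third:]
    test = array_order[third:n - third]
    return train, test
```

With 96 samples per tumor type, this yields 64 training and 32 test samples per type, and for 192 arrays the test set consists of the arrays at processing positions 65 to 128.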
Data are then simulated through “virtual re-hybridization” by reassigning arrays to samples and then summing the biological effects for a sample and the array effects for its assigned array by calling the function rehybridize(). The array-to-sample assignment can follow either a confounding or a balanced design (via blocking, randomization, and stratification), which can be the same or different for the training set and the test set. Data for the test set with sample effects only (i.e., without adding array effects) are used to assess the accuracy of a classifier and serve as the benchmark. Data preprocessing consists of the following three steps: (1) log2 transformation; (2) normalization for training data and frozen normalization for test data (i.e., mapping the empirical distribution of each individual test set sample to the “frozen” empirical distribution of the normalized training data), with or without batch effect correction; and (3) probe-replicate summarization using the median. 
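As a rough illustration of the two ideas in this paragraph, the Python sketch below sums sample and array effects under a given assignment and then applies a frozen quantile mapping to a test sample. The function names and data layout are hypothetical, the inputs are assumed to be on the log2 scale, and ties are broken by position for simplicity; the package's own R functions may differ in these details.

```python
def virtual_rehybridize(biological, handling, assignment):
    """Simulate data by adding each sample's biological effects to the
    effects of the array assigned to it.

    biological: dict sample_id -> list of values (virtual samples)
    handling:   dict array_id  -> list of values (virtual arrays)
    assignment: dict sample_id -> array_id (the chosen study design)
    """
    return {
        s: [b + h for b, h in zip(biological[s], handling[a])]
        for s, a in assignment.items()
    }

def quantile_reference(train_samples):
    """Mean of the sorted values across training samples; this serves as the
    'frozen' empirical distribution."""
    sorted_vals = [sorted(v) for v in train_samples]
    n = len(sorted_vals)
    length = len(sorted_vals[0])
    return [sum(col[i] for col in sorted_vals) / n for i in range(length)]

def map_to_reference(values, reference):
    """Replace each value with the reference quantile of the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = reference[rank]
    return out

# Usage: freeze the reference on training data, then map a test sample to it.
reference = quantile_reference([[1, 2, 3], [3, 5, 4]])
normalized_test = map_to_reference([10, 0, 5], reference)
```

Because the reference is computed from the training data alone, each test sample can be normalized one at a time without revisiting the training set.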
Our package currently includes the functions for (1) three normalization methods for training data, namely, median normalization, quantile normalization, and variance stabilizing normalization, plus no normalization as a reference; (2) seven normalization methods for test data: the aforementioned three normalization methods, each applied either to test data alone or frozen toward training data, and pooled quantile normalization (where the combination of training data and test data is quantile normalized), plus no normalization as a reference; (3) seven methods for classifier building: Prediction Analysis for Microarrays (PAM) (Tibshirani et al., 2002), logistic regression with the Least Absolute Shrinkage and Selection Operator (LASSO) penalty for variable selection (Tibshirani, 1996), Classification to Nearest Centroids (ClaNC) (Dabney, 2006), Diagonal Linear Discriminant Analysis (DLDA) (Dudoit et al., 2002), K-Nearest Neighbors (kNN) (Keller et al., 1985), Random Forest (Cutler and Stevens, 2006), and Support Vector Machine (SVM) (Noble, 2006); and (4) two methods for classifier validation, namely, cross-validation and external validation. The aforementioned methods for normalization and classification are chosen because of their popularity in the literature on transcriptomics data analysis. Our package can also accommodate additional methods chosen by the user via functions uni.handled.simulate() and precision.simulate(). The overall goal is to assess the accuracy (measured as the proportion of misclassified samples) of a classifier across various normalization and classification methods and between the two validation methods, as well as the interactions among these three choices of methods. The full pipeline of the assessment is provided by the wrapper precision.simulate.multiclass().
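To make the simplest normalization method and the accuracy measure concrete, here is a small Python sketch. It shows one common form of median normalization (shifting every array to the mean of the per-array medians) and the proportion of misclassified samples; both helpers are hypothetical illustrations, and the package's R functions may define median normalization differently.

```python
from statistics import median

def median_normalize(samples):
    """Shift each array so that all arrays share a common median.

    samples: dict sample_id -> list of log2 values. The common target is the
    mean of the per-array medians, which is an assumed convention.
    """
    medians = {s: median(v) for s, v in samples.items()}
    target = sum(medians.values()) / len(medians)
    return {
        s: [x - medians[s] + target for x in v]
        for s, v in samples.items()
    }

def misclassification_rate(truth, predicted):
    """Proportion of samples whose predicted class differs from the truth."""
    wrong = sum(1 for s in truth if predicted[s] != truth[s])
    return wrong / len(truth)

# Usage with toy labels: one of four samples is misclassified.
truth = {"a": "endometrial", "b": "ovarian", "c": "endometrial", "d": "ovarian"}
predicted = {"a": "endometrial", "b": "endometrial", "c": "endometrial", "d": "ovarian"}
error = misclassification_rate(truth, predicted)
```

Computing this rate once on test data with array effects and once on the benchmark test data without them gives a direct view of how much a given normalization method recovers.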

Summary

In this study, we introduce an R package called PRECISION.array, which assesses the performance of data normalization methods in combination with various classification methods and validation approaches under a number of sample-to-array assignment designs and a range of signal-to-noise ratios for miRNA arrays.

17 in total

Review 1. Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification.

Authors: Richard Simon; Michael D Radmacher; Kevin Dobbin; Lisa M McShane
Journal: J Natl Cancer Inst Date: 2003-01-01 Impact factor: 13.506

2. ClaNC: point-and-click software for classifying microarrays to nearest centroids.

Authors: Alan R Dabney
Journal: Bioinformatics Date: 2005-11-02 Impact factor: 6.937

Review 3. Bias as a threat to the validity of cancer molecular-marker research.

Authors: David F Ransohoff
Journal: Nat Rev Cancer Date: 2005-02 Impact factor: 60.716

Review 4. What is a support vector machine?

Authors: William S Noble
Journal: Nat Biotechnol Date: 2006-12 Impact factor: 54.908

5. On the design and analysis of gene expression studies in human populations.

Authors: Joshua M Akey; Shameek Biswas; Jeffrey T Leek; John D Storey
Journal: Nat Genet Date: 2007-07 Impact factor: 38.330

6. Repeatability of published microarray gene expression analyses.

Authors: John P A Ioannidis; David B Allison; Catherine A Ball; Issa Coulibaly; Xiangqin Cui; Aedín C Culhane; Mario Falchi; Cesare Furlanello; Laurence Game; Giuseppe Jurman; Jon Mangion; Tapan Mehta; Michael Nitzberg; Grier P Page; Enrico Petretto; Vera van Noort
Journal: Nat Genet Date: 2008-01-28 Impact factor: 38.330

7. Cautionary Note on Using Cross-Validation for Molecular Classification.

Authors: Li-Xuan Qin; Huei-Chung Huang; Colin B Begg
Journal: J Clin Oncol Date: 2016-11-10 Impact factor: 44.544

8. Diagnosis of multiple cancer types by shrunken centroids of gene expression.

Authors: Robert Tibshirani; Trevor Hastie; Balasubramanian Narasimhan; Gilbert Chu
Journal: Proc Natl Acad Sci U S A Date: 2002-05-14 Impact factor: 11.205

9. Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results.

Authors: Mumtahena Rahman; Laurie K Jackson; W Evan Johnson; Dean Y Li; Andrea H Bild; Stephen R Piccolo
Journal: Bioinformatics Date: 2015-07-24 Impact factor: 6.937

10. Criteria for the use of omics-based predictors in clinical trials.

Authors: Lisa M McShane; Margaret M Cavenagh; Tracy G Lively; David A Eberhard; William L Bigbee; P Mickey Williams; Jill P Mesirov; Mei-Yin C Polley; Kelly Y Kim; James V Tricoli; Jeremy M G Taylor; Deborah J Shuman; Richard M Simon; James H Doroshow; Barbara A Conley
Journal: Nature Date: 2013-10-17 Impact factor: 49.962