Arief Gusnanto, Charles C Taylor, Ibrahim Nafisah, Henry M Wood, Pamela Rabbitts, Stefano Berri.
Affiliations: Department of Statistics, University of Leeds, Leeds LS2 9JT, United Kingdom; Department of Statistics, Faculty of Science, King Saud University, Riyadh, Saudi Arabia; Leeds Institute Cancer and Pathology, University of Leeds, Leeds LS9 7TF, UK; and Illumina UK Ltd., Chesterford Research Park, Saffron Walden, CB10 1XL, UK.
Abstract
MOTIVATION: High-throughput sequencing has greatly transformed genome sequence analysis. In the context of very low-coverage sequencing (<0.1×), 'binning' or 'windowing' the mapped short sequences ('reads') is critical for extracting the genomic information of interest for further evaluation, such as copy-number alteration analysis. If the window size is too small, many windows will exhibit zero counts and almost no pattern can be observed. In contrast, if the window size is too large, the patterns or genomic features will be 'smoothed out'. Our objective is to identify an optimal window size between these two extremes.
RESULTS: We assume the read density to be a step function. Given this model, we propose a data-based estimation of the optimal window size based on Akaike's information criterion (AIC) and cross-validation (CV) log-likelihood. By plotting the AIC and CV log-likelihood curves as functions of window size, we are able to estimate the optimal window size as the one that minimizes AIC or maximizes CV log-likelihood. The proposed methods are general-purpose, and we illustrate their application using low-coverage next-generation sequence datasets from real tumour samples and simulated datasets.
AVAILABILITY AND IMPLEMENTATION: An R package to estimate optimal window size is available at http://www1.maths.leeds.ac.uk/~arief/R/win/.
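The window-size selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' R package, and every function name here (window_counts, window_widths, aic, cv_loglik) is hypothetical. The sketch assumes the step-function density is the usual histogram estimate (reads in a window divided by total reads times window width), that the AIC penalty counts the number of windows minus one, and that the CV log-likelihood is the leave-one-out form; the paper's exact formulas may differ. The simulated reads and the candidate window sizes are likewise made up for illustration.

```python
import math
import random


def window_counts(positions, window, length):
    """Count reads falling in each window of the region [0, length)."""
    n_windows = math.ceil(length / window)
    counts = [0] * n_windows
    for p in positions:
        # Clamp so a read at the very end of the region lands in the last window.
        counts[min(int(p // window), n_windows - 1)] += 1
    return counts


def window_widths(window, length):
    """Width of each window; the last one may be shorter than `window`."""
    n_windows = math.ceil(length / window)
    return [min(window, length - k * window) for k in range(n_windows)]


def aic(positions, window, length):
    """AIC of the histogram (step-function) density; assumed penalty: windows - 1."""
    counts = window_counts(positions, window, length)
    widths = window_widths(window, length)
    n = len(positions)
    loglik = sum(c * math.log(c / (n * w)) for c, w in zip(counts, widths) if c > 0)
    return -2.0 * loglik + 2.0 * (len(counts) - 1)


def cv_loglik(positions, window, length):
    """Leave-one-out CV log-likelihood of the histogram density (an assumed form)."""
    counts = window_counts(positions, window, length)
    widths = window_widths(window, length)
    n = len(positions)
    total = 0.0
    for c, w in zip(counts, widths):
        if c == 1:
            # A lone read has zero estimated density once it is left out.
            return -math.inf
        if c > 1:
            total += c * math.log((c - 1) / ((n - 1) * w))
    return total


# Simulated reads from a made-up step-function density: a gain in the middle segment.
rng = random.Random(0)
LENGTH = 1_000_000
segments = [(0, 400_000, 1.0), (400_000, 600_000, 2.0), (600_000, LENGTH, 1.0)]
weights = [(end - start) * height for start, end, height in segments]
reads = [rng.uniform(s, e) for s, e, _ in rng.choices(segments, weights=weights, k=2000)]

# Hypothetical grid of candidate window sizes (in base pairs).
candidates = [1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000, 200_000, 500_000]
best_aic = min(candidates, key=lambda w: aic(reads, w, LENGTH))
best_cv = max(candidates, key=lambda w: cv_loglik(reads, w, LENGTH))
print("window minimising AIC:", best_aic)
print("window maximising CV log-likelihood:", best_cv)
```

Very small windows leave many windows with zero or one read, so the AIC penalty grows and the leave-one-out score drops to minus infinity, while very large windows flatten the simulated gain; the selected size falls between these extremes. In practice the two criteria would be plotted against window size rather than only reporting the optimum.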