Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs).

M Defernez, E K Kemsley.   

Abstract

Complex data analysis is becoming more easily accessible to analytical chemists, including natural computation methods such as artificial neural networks (ANNs). Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter ρ, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks possessing different ρ values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by ρ, whereas for intermediate cases some dependence was found, although it was not possible to specify values of ρ which prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting from taking place. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although not eliminating it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining and a test set to detect and guard against overfitting are recommended.
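The parameter ρ described in the abstract can be illustrated with a small calculation. The sketch below is not the authors' code: it assumes a fully connected feed-forward network, and the function names, the `include_bias` option, and the example layer sizes (60 training specimens, 500 raw variables, 5 PC scores, 3 hidden units, 2 output units) are hypothetical values chosen for illustration. The abstract does not say whether bias weights count as connections, so both conventions are offered.

```python
def count_connections(layer_sizes, include_bias=False):
    """Count the weights in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input first,
    e.g. [500, 3, 2]. Whether bias weights count as connections is an
    assumption left to the caller; the abstract does not specify it.
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least an input layer and an output layer")
    total = 0
    for n_from, n_to in zip(layer_sizes, layer_sizes[1:]):
        total += n_from * n_to      # one weight per pair of units
        if include_bias:
            total += n_to           # one bias weight per receiving unit
    return total


def rho(n_train, layer_sizes, include_bias=False):
    """Ratio of training observations to network connections."""
    return n_train / count_connections(layer_sizes, include_bias)


if __name__ == "__main__":
    n_train = 60  # hypothetical number of training specimens

    # Raw high-dimensional input: 500 variables -> 3 hidden -> 2 outputs
    raw_layers = [500, 3, 2]
    # PCA-compressed input: 5 PC scores -> 3 hidden -> 2 outputs
    pca_layers = [5, 3, 2]

    for name, layers in [("raw data", raw_layers), ("PC scores", pca_layers)]:
        print(name, count_connections(layers), round(rho(n_train, layers), 3))
```

With these illustrative sizes, feeding raw variables gives 1506 connections and ρ far below 1, while five PC scores give 21 connections and ρ above 1. This shows how reducing the input layer changes ρ. The abstract is clear, though, that no value of ρ was found that rules out overfitting, so a held-out test set is still needed.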

Year:  1999        PMID: 26114398     DOI: 10.1039/a905556h

Source DB:  PubMed          Journal:  Analyst        ISSN: 0003-2654            Impact factor:   4.616


  5 in total

1.  Classification of lactate dehydrogenase of different origin by liquid chromatography-mass spectrometry and multivariate analysis.

Authors:  Dan Bylund; Jenny Samskog; Karin E Markides; Sven P Jacobsson
Journal:  J Am Soc Mass Spectrom       Date:  2003-03       Impact factor: 3.109

2.  Global Property Prediction: A Benchmark Study on Open-Source, Perovskite-like Datasets.

Authors:  Felix Mayr; Alessio Gagliardi
Journal:  ACS Omega       Date:  2021-05-03

3.  Evaluation of Three Machine Learning Algorithms for the Automatic Classification of EMG Patterns in Gait Disorders.

Authors:  Christopher Fricke; Jalal Alizadeh; Nahrin Zakhary; Timo B Woost; Martin Bogdan; Joseph Classen
Journal:  Front Neurol       Date:  2021-05-21       Impact factor: 4.003

4.  A Novel Hybrid Model Based on a Feedforward Neural Network and One Step Secant Algorithm for Prediction of Load-Bearing Capacity of Rectangular Concrete-Filled Steel Tube Columns.

Authors:  Quang Hung Nguyen; Hai-Bang Ly; Van Quan Tran; Thuy-Anh Nguyen; Viet-Hung Phan; Tien-Thinh Le; Binh Thai Pham
Journal:  Molecules       Date:  2020-07-31       Impact factor: 4.411

5.  Prediction model of artificial neural network for the risk of hyperuricemia incorporating dietary risk factors in a Chinese adult study.

Authors:  Jie Zeng; Junguo Zhang; Ziyi Li; Tianwang Li; Guowei Li
Journal:  Food Nutr Res       Date:  2020-01-20       Impact factor: 3.894
