
A Novel Encoder-Decoder Model for Multivariate Time Series Forecasting.

Huihui Zhang1,2, Shicheng Li3, Yu Chen3, Jiangyan Dai2, Yugen Yi3.   

Abstract

Time series are complex structured data characterized by high dimensionality, dynamic behavior, and high noise. Multivariate time series (MTS), which use historical data to forecast future trends, have become a crucial subject in data mining, and accurate MTS prediction has attracted much attention in the era of rapid information development and big data. In this paper, a novel deep learning architecture based on the encoder-decoder framework is proposed for MTS forecasting. First, the gated recurrent unit (GRU) is taken as the main unit structure of both the encoding and decoding procedures to extract useful successive feature information. Then, unlike existing models, an attention mechanism (AM) is introduced at the decoding stage to exploit the importance of different historical data for reconstruction. Meanwhile, feature reuse is realized by skip connections based on the residual network, alleviating the influence of earlier features on data reconstruction. Finally, to enhance the performance and discriminative ability of the new MTS, a convolutional structure and a fully connected module are established. Extensive experiments on two different types of MTS, stock data and shared bicycle data, demonstrate the effectiveness and feasibility of the proposed method.
Copyright © 2022 Huihui Zhang et al.

Year:  2022        PMID: 35463259      PMCID: PMC9023224          DOI: 10.1155/2022/5596676

Source DB:  PubMed          Journal:  Comput Intell Neurosci


1. Introduction

A time series is a sequence of numbers arranged according to occurrence time, which is also called a dynamic series. The time span can be years, quarters, months, hours, or other units [1]. In recent years, time series have been widely applied in various fields, such as economics, medicine, transportation, and environmental science, and have attracted much attention [2]. According to the number of observed variables, time series data can be divided into univariate and multivariate time series data [2]. Therefore, how to mine useful information from these time series data has become a very important task in data mining, machine learning, artificial intelligence, and other fields [3]. As a key branch of time series data analysis, time series prediction aims to accurately predict or estimate future events by exploring the past and current data of a single variable or several correlated variables [4]. The former is called univariate time series forecasting; the latter is called multivariate time series forecasting. For example, economists utilize the historical data of stock prices to forecast stock prices or trends [5], medical scientists make use of biological time data to predict diseases [6], transportation departments explore the historical data of traffic flow to predict congestion [7], and environmentalists employ atmospheric timing data to estimate environmental climate changes [8]. Nevertheless, time series data not only contain abundant information but also exhibit complex characteristics such as high dimensionality, nonlinearity, fluctuation, and spatiotemporal dependence, which make accurate time series prediction a challenging research hotspot [9]. In the past few decades, time series prediction has been widely studied and many methods have been proposed [10].
For instance, traditional statistics-based methods rely on relevant domain knowledge, while learning-based methods learn temporal dynamics in a purely data-driven manner. As a popular learning-based approach, deep learning can comprehensively learn deep latent features from the input data and has become a cutting-edge approach [11]. The traditional statistics-based methods include autoregressive (AR) [12], autoregressive moving average (ARMA) [13], autoregressive integrated moving average (ARIMA), and exponential smoothing models [14]. Although these methods can utilize statistical inference to describe and evaluate the relationships between variables, they assume a linear model structure and constant variance in the input data [15]. Therefore, they have limitations in dealing with complex time series data containing nonlinear and nonstationary structures and cannot effectively obtain accurate predictions. To overcome these shortcomings, many learning-based methods, including support vector machine (SVM) [16], genetic algorithm (GA) [17], AdaBoost [18], and artificial neural network (ANN) [19], which can simulate the complex structures of time series data, have been widely applied to time series prediction tasks. For example, Dong et al. [16] utilized SVM for predicting building energy consumption in tropical regions and found that it was superior to other neural networks in terms of performance and parameter selection. Yadav et al. [17] proposed a neuron model based on a polynomial structure and conducted forecasting experiments on Internet traffic and financial time series data, showing that the neural network (NN) model not only achieved better performance but also greatly reduced the computational complexity and running time compared with existing multilayer neural networks.
However, building an effective learning-based model needs a large amount of professional data, and the training process requires high-level computer hardware. Therefore, the application of traditional machine learning models is largely limited. In recent years, with improvements in data acquisition and computing power, a novel learning-based method called deep learning has attracted much attention. Deep learning [20] can obtain a higher-level representation of the original input via simple, nonlinear modules, which is conducive to learning feature representations. Convolutional neural networks (CNN) [21], recurrent neural networks (RNN) [22], and their variants have been successfully applied to time series prediction. Zhang et al. [23] proposed a deep spatiotemporal residual network model to predict the flow of people throughout a city. Jagannatha and Yu [24] developed a bidirectional recurrent neural network (BRNN) for medical event detection in electronic medical records. Nevertheless, RNN and BRNN are prone to the gradient vanishing and gradient exploding problems. To overcome these drawbacks, the long short-term memory network (LSTM) [25] and the gated recurrent unit (GRU) [26] were developed. Since both LSTM and GRU can keep historical information over longer time steps, they are widely used in time series analysis, prediction, and classification tasks. Compared with LSTM, the GRU has a simpler structure and fewer parameters, which can reduce the overfitting risk. For example, Shu et al. [27] presented a new neural network model based on an improved GRU to predict short-term traffic flow. As an unsupervised method, the autoencoder (AE) is also widely applied to feature representation learning [28]. In order to extract better features, the RNN is frequently combined with the AE.
Xu and Yoneda [29] first used a stacked autoencoder (SAE) to encode the key evolution patterns of urban weather systems and then adopted an LSTM network to predict the PM2.5 time series of multiple locations in a city. Zhang et al. [30] proposed an encoder-decoder model for real-time air pollutant prediction, in which LSTM was the main network; the experimental results indicated that the model can fully extract the data correlations and obtain higher prediction accuracy. In addition, the attention mechanism (AM) [31] has attracted extensive attention in time series analysis and prediction. Han et al. [32] combined LSTM with AM to predict time series, in which the AM captures temporal correlation by calculating weights between nodes and their neighboring nodes, achieving better performance and providing insight for multivariate time series prediction. Although abundant methods have been developed, their performance is limited by the high nonlinearity and nonstationarity of multivariate time series (MTS) data. To improve the prediction performance, a novel encoder-decoder prediction model is presented, and the contributions are as follows:

(1) The proposed model can sufficiently extract significant temporal features of MTS data.

(2) As the unit structure in the encoding and decoding procedures, the GRU is adopted to describe sequential characteristics while reducing model parameters.

(3) The AM is introduced into the decoding process to better acquire the reconstructed MTS data.

(4) To strengthen the prediction performance, a 1D-convolution operation and the AM are further applied to the reconstructed new MTS data, which possess discriminative and significant characteristics.

The outline of this paper is as follows. Section 2 reviews the related works, and time series data preprocessing is introduced in Section 3. Section 4 describes the proposed network structure in detail. Section 5 illustrates extensive experiments to verify the effectiveness and feasibility of the proposed model. Section 6 provides some conclusions and future works.

2. Related Works

Recently, researchers have proposed extensive time series (TS) and multivariate time series (MTS) prediction methods, which are classified into two categories including machine learning and deep learning methods [9].

2.1. Machine Learning Methods

The basic assumption of statistical methods is that TS and MTS have simple structures that are linear and stationary. However, in real applications, TS and MTS data are collected with complex structures of high nonlinearity and nonstationarity, which makes TS and MTS forecasting very difficult. Machine learning algorithms, which can analyze the behavior of data over time and extract complex nonlinear patterns independently of statistical distribution assumptions, are usually helpful for improving prediction accuracy [33]. Specifically, Li et al. [34] proposed a chaotic cloud simulated annealing genetic algorithm (CcatCSAGA) to optimize the parameters of robust support vector regression (RSVR) for improving ship traffic flow prediction. Sahoo et al. [35] designed a novel online multiple kernel regression (OMKR) method, which successively learns kernel-based regression in a scalable manner; its effectiveness was demonstrated on real data regression and time series prediction tasks. Ahmed et al. [33] adopted the multilayer perceptron (MLP), Bayesian neural networks (BNN), radial basis functions (RBF), general regression neural network (GRNN), k-nearest neighbors regression (KNNR), classification and regression trees (CART), support vector regression (SVR), and Gaussian process regression (GPR) in their experiments. This study revealed significant differences between the various methods in TS and MTS prediction, with MLP and GPR performing best. Besides, in order to improve performance, Domingos et al. [36] combined ARIMA with MLP and with SVR to predict time series, showing that the hybrid models were better than the single models. Rojas et al. [37] presented a hybrid method integrating an artificial neural network and an ARMA model, which achieved outstanding results.

2.2. Deep Learning Methods

Deep neural networks can learn complex data representations exceptionally well [38] and are widely utilized in many tasks, such as image classification, image segmentation, and natural language processing. The convolutional neural network (CNN) was originally designed for static image analysis and can capture invariant local relations across spatial dimensions [39]. Recently, CNN and its variants have also been developed for time series prediction [40], classification [41], anomaly detection [42], clustering [43], and so on. For example, Ding et al. [44] applied a CNN model to stock market prediction. Wang et al. [45] introduced deep learning to develop a probabilistic wind power generation prediction model: a wavelet transform decomposes the raw wind power data into different frequencies, a CNN learns nonlinear features in each frequency to improve prediction accuracy, and finally the probability distribution of wind power generation is predicted. Different from the above methods, Oord et al. [46] proposed a new network model called WaveNet, which uses dilated convolutions to meet the long-term dependence requirement of time series; the size of the receptive field increases exponentially with the depth of the layers. Afterward, Borovykh et al. [47] adopted WaveNet for multivariate financial time series forecasting. The recurrent neural network (RNN) is also widely exploited for time series prediction [22]. Because RNNs must capture long-term dependence during training, they suffer from gradient explosion and gradient vanishing. Therefore, introducing a gating mechanism into RNN, as in the long short-term memory (LSTM) [25] and gated recurrent unit (GRU) [26], has drawn much attention as a way to overcome these limitations and preserve long-term information of time series data.
The gated variants of RNN essentially preserve internal state memory through their recurrent feedback mechanism, which makes them very suitable for modeling time series data. Moreover, their ability to capture complex nonlinear dependence extends from the short term to the long term and across different variables in multivariate systems. Therefore, these models perform excellently in time series prediction tasks. Li et al. [48] built a model combining ARIMA and LSTM to improve the prediction accuracy of high-frequency financial time series. Pan et al. [49] applied an LSTM-based model to predict urban traffic flow and greatly improved the prediction via spatial correlation. Filonov et al. [50] proposed an LSTM-based model to monitor and detect faults in industrial multivariate time series data. Zhao et al. [51] established a two-layer LSTM model to learn gait patterns present in neurodegenerative diseases for diagnostic prediction. Jia et al. [52] developed a spatiotemporal learning framework with a dual memory structure based on LSTM to predict land cover. Huang et al. [53] proposed a sequence-to-sequence framework based on GRU to predict different types of abnormal events. Fu et al. [54] used LSTM and GRU to predict short-term traffic flow, showing that the RNN-based methods (LSTM and GRU) performed better than ARIMA. Zhang et al. [55] utilized four different neural networks, MLP, WNN, LSTM, and GRU, to monitor overflow in small watercourses. Furthermore, models combining CNN with LSTM or GRU have been frequently applied to time series prediction. Wu et al. [56] explored a GRU network to encode the time pattern of each sequence as a low-dimensional representation and then combined it with a convolutional network for modeling behavioral time series. Shi et al. [57] presented a ConvLSTM network to predict nearby precipitation, which captures spatiotemporal correlations well. The autoencoder (AE) has also been successfully applied to time series prediction and is generally combined with other deep learning methods [58]. Considering the inherent temporal and spatial correlation of traffic flow, Lv et al. [59] used an AE as one of the modules of a deep learning model. Yang et al. [60] proposed a new host load prediction method, which utilized an AE as the precyclic feature layer of an echo state network. Gensler et al. [61] combined AE with LSTM for renewable energy power prediction, outperforming artificial neural network and physical prediction models. Recently, Prenkaj et al. [62] combined AE and GRU in a new strategy for predicting student dropout from e-courses.

3. Time Series Data Preprocessing

Generally, time series data are collected manually or automatically, so it is difficult to avoid data redundancy, missing data, erroneous data, and other problems during collection and transmission. Therefore, data preprocessing is a crucial and necessary procedure for time series data analysis. It mainly includes four stages: data cleaning, data normalization, data sliding window, and data split [63]. The details are illustrated in Figure 1.
Figure 1

The process of time series data preprocessing.

Data Cleaning. The purpose of data cleaning is to deal with missing values, outliers, and redundant attributes in time series data. There are many ways to handle missing values and outliers. One way is to delete the affected data directly. However, when many attributes contain missing values or outliers, it is hard to retain enough useful attributes, and the resulting incomplete time series data will impair the learning and generalization ability of models. The other way treats outliers as missing values and applies a data filling technique to both. Data filling includes statistics-based and learning-based methods: the former generally adopts mean filling, while the latter adopts simple linear regression or a complex learning model (such as deep learning). In our work, mean filling is utilized to process missing values and outliers. Moreover, feature selection or feature extraction methods are generally adopted to handle redundant attributes. In particular, the proposed model is based on a deep learning framework with strong feature representation ability, so it is robust to data containing redundant attributes.

Data Normalization. Since different attributes of the data often have different measurement scales, the collected values may vary widely. To eliminate the influence of measurement scale and value range among different attributes, it is necessary to normalize the data, scaling it in a certain proportion, such as mapping values to [−1, 1] or [0, 1]. The popular normalization methods are minimum-maximum normalization and zero-mean normalization. Minimum-maximum normalization, also called deviation standardization, maps the values of the original data to [0, 1] via the linear transformation

x′ = (x − min)/(max − min),

where max and min represent the maximum and minimum values of the data, respectively. This method preserves the relationships that exist in the original data. Zero-mean normalization, also known as standard deviation standardization, produces data whose mean and standard deviation are 0 and 1, respectively:

x′ = (x − μ)/σ,

where μ and σ are the mean and standard deviation of the original data, respectively.

Data Sliding Window. This operation creates time series samples from the original series with a predefined sliding window size and step; in other words, it generates the data to be predicted at the next moment from historical data in a given interval. The specific operation is shown in Figure 2 [64]. Given a time series of length N, such as {1, 2, 3, 4, 5, …, N − 1, N}, when the sliding window size is set to L and the sliding step is 1, N − L data sets of length L + 1 are formed. The first L values of each set are regarded as training data, and value number L + 1 is the target value.
Figure 2

The process of data sliding window for creating a time series data.

Data Split. This stage divides the time series dataset into training data and test data. For example, in the experiments the first 60% of the data are used for training and the remaining 40% for testing.
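The preprocessing stages above (normalization, sliding window with window size L and step 1, and the 60/40 split) can be sketched in NumPy as follows; the function names and any detail beyond what the text specifies are illustrative assumptions:

```python
import numpy as np

def min_max_normalize(x):
    """Map each attribute of x to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

def zero_mean_normalize(x):
    """Standardize each attribute to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

def sliding_window(series, window):
    """Split a length-N series into N - window samples: the first `window`
    values of each sample are inputs, the next value is the target."""
    series = np.asarray(series)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

def train_test_split(X, y, train_ratio=0.6):
    """Use the first part of the data for training and the rest for testing."""
    n_train = int(len(X) * train_ratio)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```

For example, applying `sliding_window` with L = 3 to the series {1, 2, …, 10} yields 7 samples of length 3 plus their targets, which are then split chronologically.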

4. The Proposed Method

In this work, a novel time series prediction model based on the encoder-decoder framework is designed, which integrates a recurrent neural module, a convolutional module, an attention mechanism, and a fully connected module into a unified framework. As shown in Figure 3, the proposed model consists of three parts: the encoding, decoding, and prediction modules. In the encoding module, the gated recurrent unit (GRU) is taken as the main unit structure for extracting more effective time series features. In the decoding module, the attention mechanism (AM) is introduced to explore the importance of historical data collected at different times, so that better new time series data can be obtained. In addition, taking the influence of previous features on data reconstruction into account, feature reuse is realized by skip connections based on the residual network. In the prediction module, a convolution layer is adopted to extract effective features from the reconstructed time series; then the AM is applied to the convolutional feature maps owing to the influence of important information on prediction performance; finally, a multilayer fully connected network is established for prediction.
Figure 3

The structure of the proposed network model.
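A minimal tf.keras sketch of the three modules might look as follows. This is not the authors' implementation: all layer sizes, the use of the Luong-style `layers.Attention` over the encoder states, and the per-channel sigmoid gate standing in for the channel attention in the prediction head are assumptions for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(time_steps, n_features, hidden=64, filters=32):
    """Sketch: GRU encoder -> GRU decoder with attention and a skip
    connection -> Conv1D + channel gate + dense prediction head.
    All sizes here are illustrative, not the authors' settings."""
    inputs = layers.Input(shape=(time_steps, n_features))

    # Encoding: GRU extracts successive features from the input series.
    encoded = layers.GRU(hidden, return_sequences=True)(inputs)

    # Decoding: GRU plus attention over the encoder states, with a
    # residual skip connection realizing feature reuse.
    decoded = layers.GRU(hidden, return_sequences=True)(encoded)
    attended = layers.Attention()([decoded, encoded])
    decoded = layers.Add()([decoded, attended])

    # Prediction: 1D convolution, a per-channel sigmoid gate (standing in
    # for the channel attention), then fully connected layers.
    conv = layers.Conv1D(filters, 3, padding="same", activation="relu")(decoded)
    gate = layers.Dense(filters, activation="sigmoid")(
        layers.GlobalAveragePooling1D()(conv))
    conv = layers.Multiply()([conv, layers.Reshape((1, filters))(gate)])
    out = layers.Dense(64, activation="relu")(layers.Flatten()(conv))
    out = layers.Dense(1)(out)  # next-step prediction
    return Model(inputs, out)
```

Compiled with the Adam optimizer and MSE loss (as in Table 2), such a model maps a (batch, time_steps, n_features) window to one predicted value per sample.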

4.1. Deep Autoencoder (DAE)

Autoencoder (AE) is an unsupervised deep learning method frequently used in feature representation, data compression, image denoising, and other tasks [28]. The structure of an AE includes an encoder and a decoder, each containing only one fully connected hidden layer. To better extract features and reconstruct the original data, the Deep Autoencoder (DAE) [65] contains multiple hidden layers, as shown in Figure 4.
Figure 4

The structure of DAE.

4.2. LSTM and GRU

In general, the DAE is a multilayer feedforward neural network, but it does not consider the importance of the historical information of time series data for predicting or classifying unknown data. As a specific network structure, the recurrent neural network (RNN) [22] can adeptly utilize the historical information of time series data and adopts the backpropagation through time (BPTT) algorithm to learn its parameters. However, RNNs suffer from gradient vanishing or gradient exploding problems when handling time series with long time intervals [25]. In particular, the longer the time interval, the more severe the gradient vanishing or exploding, which makes it difficult to train effective RNN models for long-interval sequences. To solve these problems, RNN variants such as LSTM [25] and GRU [26] were designed, which capture the long-term dependence of time series data more easily. LSTM uses a gate mechanism to control the speed of information accumulation and can selectively update information and forget accumulated information. LSTM includes an input gate, a forget gate, and an output gate, as displayed in Figure 5. The forget gate f controls which information from the internal state of the previous moment needs to be forgotten. The input gate i controls which information from the current candidate state needs to be retained. The output gate o controls which information of the current internal state needs to be output.
Figure 5

The structure of the LSTM unit.

Different from LSTM, GRU is a simplified version of LSTM. It merges the forget gate and input gate into an update gate and retains the original reset gate, as shown in Figure 6. It can be observed that no additional memory units are needed in GRU, because the update gate can control how much information the current state retains from the historical state and how much it receives from the candidate state. The calculation formulas of GRU are

z_t = σ(W_z x_t + U_z h_{t−1} + b_z),
r_t = σ(W_r x_t + U_r h_{t−1} + b_r),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where z_t and r_t represent the update gate and reset gate, respectively; h_t is the state at the current moment t, and h̃_t indicates the candidate state; σ is the sigmoid activation function, which converts results to [0, 1]; tanh stands for the hyperbolic tangent activation function; the symbol ⊙ is the element-wise product of corresponding elements; x_t represents the input of the neural network at time t; W_z, W_r, W_h and U_z, U_r, U_h represent the parameter matrices and recurrent weights of the model; and b_z, b_r, b_h are the offset vectors. Compared with LSTM, GRU has a simpler structure and fewer parameters because it has fewer gates. Therefore, GRU not only reduces model training time and helps avoid overfitting but also achieves results equal to or even better than LSTM. In addition, BiGRU is a bidirectional variant of GRU; although BiGRU performs better than GRU in some cases, its parameter size is larger. In order to avoid the overfitting problem, the GRU is adopted as the main unit structure of the autoencoder.
Figure 6

The structure of the GRU unit.
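As a minimal sketch, one GRU step can be written in NumPy as below, following the standard update/reset-gate formulation h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t; the weight shapes and the length-5 toy sequence are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step. params holds W_z, W_r, W_h (input weights),
    U_z, U_r, U_h (recurrent weights), and b_z, b_r, b_h (offsets)."""
    W_z, W_r, W_h, U_z, U_r, U_h, b_z, b_r, b_h = params
    z = sigmoid(x_t @ W_z + h_prev @ U_z + b_z)              # update gate
    r = sigmoid(x_t @ W_r + h_prev @ U_r + b_r)              # reset gate
    h_cand = np.tanh(x_t @ W_h + (r * h_prev) @ U_h + b_h)   # candidate state
    return (1.0 - z) * h_prev + z * h_cand                   # new state

# Hypothetical sizes for illustration: 4 input features, 8 hidden units.
rng = np.random.default_rng(0)
d_in, d_hid = 4, 8
params = tuple(rng.normal(size=s) for s in
               [(d_in, d_hid)] * 3 + [(d_hid, d_hid)] * 3 + [(d_hid,)] * 3)
h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):  # run a length-5 toy sequence
    h = gru_step(x_t, h, params)
```

Because each new state is a convex combination of the previous state and a tanh-bounded candidate, every component of h stays inside (−1, 1).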

4.3. Attention Mechanism

The attention mechanism (AM) has been widely applied in natural language processing, computer vision, and other fields [66]. It is a resource allocation scheme that uses limited computing resources to process the more important information, addressing the information overload problem. Like artificial neural networks, AM originated from studies of human vision and borrows from the human visual attention mechanism. The core idea of AM is to select the more critical information from a large amount of information and ignore what is unimportant or irrelevant to the current task [66]. At present, plenty of attention mechanisms have been built to solve related tasks, such as spatial, channel, and mixed attention mechanisms [67]. In image understanding tasks, including image segmentation and target detection, the channel attention (CA) [68] module is mainly adopted to explore relationships between feature maps of different channels; its structure is shown in Figure 7. In the module, the feature map of each channel is taken as a feature detector that can determine which part of the features should receive more notice. It is well known that the time attribute is very important and also affects the prediction results. Therefore, in the proposed method we view each time attribute as a channel and integrate the channel attention (CA) mechanism to mine the significance of the time attributes.
Figure 7

The structure of CA.
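The squeeze-then-reweight flavor of channel attention described above can be sketched as follows: each channel of a (time_steps, channels) feature map is averaged, a small two-layer network produces one weight per channel, and the map is rescaled. The two-layer excitation network and all sizes are hypothetical, not the authors' exact CA module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, W1, W2):
    """Weight each channel of a (time_steps, channels) feature map.
    Squeeze: average each channel over time; excite: a two-layer network
    yields one weight in (0, 1) per channel; scale: reweight the map."""
    squeeze = feats.mean(axis=0)                            # (channels,)
    excite = sigmoid(np.maximum(squeeze @ W1, 0.0) @ W2)    # channel weights
    return feats * excite                                   # broadcast over time

# Hypothetical sizes: 10 time steps, 6 channels, reduction to 3.
rng = np.random.default_rng(1)
T, C, reduced = 10, 6, 3
feats = rng.normal(size=(T, C))
W1 = rng.normal(size=(C, reduced))
W2 = rng.normal(size=(reduced, C))
out = channel_attention(feats, W1, W2)
```

Since every channel weight lies in (0, 1), the reweighted map never exceeds the original in magnitude; channels deemed unimportant are simply damped.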

4.4. Prediction Module

In the prediction module, a 1D-convolution is first applied to extract features from the time series data reconstructed by the DAE. Then, in order to explore the different contributions of historical data to forecasting, the CA mechanism is performed on the feature maps produced by the previous layer. Finally, a multilayer dense network structure is constructed for prediction. The details are displayed in Figure 8.
Figure 8

The structure of the prediction module.

5. Experiments and Results Analysis

To verify the effectiveness of the proposed method, two series of experiments are conducted on public stock and shared bicycle datasets and compared with related methods. Extensive experimental results validate the effectiveness of our model.

5.1. Evaluation Metrics and Experimental Environment

In order to quantitatively analyze accuracy and superiority, the mean square error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) are adopted to evaluate the performance of the proposed model [69]. The calculation formulas are as follows:

MSE = (1/n) Σ_{i=1}^{n} (X_i − X′_i)²,
RMSE = √((1/n) Σ_{i=1}^{n} (X_i − X′_i)²),
MAE = (1/n) Σ_{i=1}^{n} |X_i − X′_i|,
MAPE = (100%/n) Σ_{i=1}^{n} |(X_i − X′_i)/X_i|,

where X_i and X′_i represent the actual and predicted values of the data and n is the number of samples. The smaller these values, the more accurate the prediction. The source code of the proposed method and the compared methods is implemented in TensorFlow with Python. The corresponding versions of the development software and the configuration of the hardware platform are listed in Table 1. Moreover, the settings of the key parameters during training are shown in Table 2.
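The four metrics can be computed as in the following sketch (which assumes, for MAPE, that all actual values are nonzero):

```python
import numpy as np

def mse(x, x_pred):
    """Mean square error between actual x and predicted x_pred."""
    x, x_pred = np.asarray(x, float), np.asarray(x_pred, float)
    return float(np.mean((x - x_pred) ** 2))

def rmse(x, x_pred):
    """Root mean square error."""
    return float(np.sqrt(mse(x, x_pred)))

def mae(x, x_pred):
    """Mean absolute error."""
    x, x_pred = np.asarray(x, float), np.asarray(x_pred, float)
    return float(np.mean(np.abs(x - x_pred)))

def mape(x, x_pred):
    """Mean absolute percentage error; actual values must be nonzero."""
    x, x_pred = np.asarray(x, float), np.asarray(x_pred, float)
    return float(np.mean(np.abs((x - x_pred) / x)) * 100.0)
```

For instance, with actual values [1, 2, 4] and predictions [1, 2, 2], the absolute errors are [0, 0, 2], so MAE = 2/3 and MAPE = 50/3 ≈ 16.67%.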
Table 1

The description of experimental environment.

Development software | Version
Python | 3.6.0
Tensorflow | 2.7.0
System | Windows 10 64 bit

Hardware platform | Configurations
PC machine | Intel Core i9-9900K
RAM | 32 GB
GPU | GeForce RTX 2080 Ti
Table 2

The settings of the key parameters in the training procedure.

Description | Value
Batch-size | 256
Optimizer | Adam
Epochs | 400
Loss function | MSE

5.2. Stock Data Prediction

5.2.1. Stock Data Description

The stock data used in the experiments are the Shanghai Composite Index 50 (SCI-50), CSI-300, and Shenzhen Component Index (SZCI). Each stock dataset records multiple attributes, such as the closing price, the highest price, the lowest price, the opening price, the previous day's closing price, change, and ups and downs. The closing price, the highest price, the lowest price, and the opening price represent the final, highest, lowest, and first trading prices of a stock, respectively. The previous day's closing price is the final price at which a stock was traded on the previous day. Change is the difference between the closing price and the previous day's closing price (i.e., closing price − previous day's closing price). The value of ups and downs is the change divided by the closing price (i.e., change/closing price). The details of the three stock datasets are listed in Table 3. Meanwhile, Tables 4 to 6 give some data instances and the corresponding statistical information for each stock, including the number of records, minimum, maximum, mean, standard deviation, and the 1/4, 1/2, and 3/4 quantile values of each attribute. From these tables, we can see that there are great differences and fluctuations in the stock data.
Table 3

The details of three stock datasets.

Stock name | Stock code | Start and end time | Number of records
SCI-50 | 000016 | 2004.01.02–2021.06.23 | 4245
CSI-300 | 399300 | 2002.01.07–2021.03.17 | 4657
SZCI | 399001 | 1991.04.04–2021.06.23 | 7349
Table 4

Some data and statistical information of SCI-50.

Date | Closing price | Highest price | Lowest price | Opening price | Previous day's closing price | Change | Ups and downs
2004/1/2 | 1011.347 | 1021.568 | 993.892 | 996.996 | 1000 | 11.347 | 1.1347
2004/1/5 | 1060.801 | 1060.898 | 1008.279 | 1008.279 | 1011.347 | 49.454 | 4.8899
2012/2/1 | 1713.684 | 1751.558 | 1709.536 | 1739.638 | 1744.708 | −31.024 | −1.7782
2012/2/2 | 1761.941 | 1761.941 | 1714.246 | 1719.999 | 1713.684 | 48.257 | 2.816
2015/2/2 | 2332.533 | 2376.426 | 2329.151 | 2337.196 | 2405.38 | −72.847 | −3.0285
2015/2/3 | 2405.76 | 2413.006 | 2335.107 | 2362.413 | 2332.533 | 73.227 | 3.1394
2021/6/21 | 3431.252 | 3455.565 | 3410.403 | 3440.744 | 3454.589 | −23.3363 | −0.6755
2021/6/22 | 3464.706 | 3469.808 | 3437.955 | 3444.75 | 3431.252 | 33.4535 | 0.975

Count | 4245 | 4245 | 4245 | 4245 | 4245 | 4245 | 4245
Mean | 2136.173 | 2156.355 | 2113.242 | 2134.459 | 2135.591 | 0.582177 | 0.043547
Std | 811.6843 | 820.2789 | 801.5931 | 811.7199 | 811.6128 | 40.36138 | 1.685313
Min | 700.434 | 706.879 | 693.528 | 699.266 | 700.434 | −296.696 | −9.4708
25% | 1600.299 | 1614.014 | 1586.092 | 1599.408 | 1599.012 | −13.545 | −0.7423
50% | 2127.203 | 2150.033 | 2101.088 | 2127.804 | 2127.094 | 0.493 | 0.0259
75% | 2692.54 | 2718.884 | 2666.817 | 2694.952 | 2692.181 | 16.22 | 0.8297
Max | 4731.826 | 4772.933 | 4688.263 | 4726.083 | 4731.826 | 296.077 | 9.6729
Table 6

Some data and statistical information of SZCI.

Date | Closing price | Highest price | Lowest price | Opening price | Previous day's closing price | Change | Ups and downs
1991/4/4 | 983.11 | 983.11 | 983.11 | 983.11 | 988.05 | −4.94 | −0.5
1991/4/5 | 978.27 | 978.27 | 978.27 | 978.27 | 983.11 | −4.84 | −0.4923
2010/1/4 | 13533.54 | 13782.81 | 13533.54 | 13766.1 | 13699.97 | −166.433 | −1.2148
2010/1/5 | 13517.38 | 13597.36 | 13324.56 | 13539.83 | 13533.54 | −16.162 | −0.1194
2016/1/4 | 11626.04 | 12659.41 | 11625.41 | 12650.72 | 12664.89 | −1038.85 | −8.2026
2016/1/5 | 11468.06 | 11687.48 | 11063.64 | 11116.9 | 11626.04 | −157.978 | −1.3588
2021/6/21 | 14641.29 | 14721.69 | 14468.74 | 14563.05 | 14583.67 | 57.6251 | 0.3951
2021/6/22 | 14696.29 | 14706.5 | 14564.5 | 14678.37 | 14641.29 | 54.9937 | 0.3756

Count | 7349 | 7349 | 7349 | 7349 | 7349 | 7349 | 7349
Mean | 6709.184 | 6778.63 | 6628.694 | 6704.283 | 6707.301 | 1.885397 | 0.05939
Std | 4325.842 | 4369.826 | 4270.334 | 4322.335 | 4325.31 | 153.7217 | 2.1302
Min | 402.5 | 408.02 | 397.67 | 401.57 | 402.5 | −1293.66 | −19.7807
25% | 3112.336 | 3134.055 | 3077.097 | 3112.637 | 3111.4 | −42.702 | −0.8978
50% | 4834.614 | 4867.142 | 4795.043 | 4836.637 | 4831.989 | 0.381 | 0.0112
75% | 10316.82 | 10410.65 | 10223.16 | 10315 | 10315.75 | 51.813 | 0.9835
Max | 19531.16 | 19600.03 | 19203.11 | 19554.58 | 19531.16 | 1254.795 | 26.1963
Table 5

Some data and statistical information on CSI-300.

DataClosing priceHighest priceLowest priceOpening pricePrevious day's closing priceChangeUps and downs
2002/1/71302.081302.081302.081302.081316.46−14.38−1.0923
2002/1/81292.711292.711292.711292.711302.08−9.37−0.7196
2009/3/22164.6662177.2942112.3362123.3672140.48924.1771.1295
2009/3/32142.1542168.2222100.6442109.8412164.666−22.512−1.04
2013/1/42524.4092558.5292498.8922551.8142522.9521.4570.0577
2013/1/72535.9852545.9692511.6032518.0472524.40911.5760.4586
2021/3/94970.9995094.3114917.9095066.1555080.025−109.025−2.1462
2021/3/105003.6125055.2794981.6165047.0594970.99932.61270.6561

Count4657465746574657465746574657
Mean2762.7112785.6042734.6022760.162761.8980.8126260.042789
Std1187.8771201.2011171.2181187.381187.57152.68161.65268
Min818.033823.86807.784816.546818.033−391.866−9.2398
25%1493.7761507.9721472.0011481.5821488.291−16.284−0.7247
50%2851.9152888.0932818.2482848.1552850.8291.33860.069
75%3607.9853648.0273560.6343605.3723606.92420.5340.8142
Max5877.2025930.9125815.6095922.0715877.202378.1799.3898

5.2.2. Parameters Analysis

The time interval (time step) is a significant factor affecting the prediction of time series data. Therefore, we test the performance of the proposed method with different steps, set to {5, 10, 15, 20, 25, 30}; the experimental results are displayed in Tables 7–9. In most cases, the value of each evaluation indicator first decreases as the step increases, indicating that the performance of the proposed model improves with a longer step: a longer interval provides more useful historical information for prediction. However, as the step continues to increase, the indicator values rise again and performance degrades. A possible reason is that time series data spanning too long an interval contains redundant information and high volatility, which makes it difficult to capture effective information for predicting future data.
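The time step can be viewed as the length of the sliding window from which each training sample is built. A minimal sketch of that windowing (illustrative only; not the authors' preprocessing code):

```python
def make_windows(series, step):
    """Split a series into (window, next-value) training pairs.

    `step` is the time step (window length) varied in Tables 7-9:
    each sample uses `step` consecutive observations to forecast
    the immediately following value.
    """
    samples = []
    for i in range(len(series) - step):
        window = series[i:i + step]   # `step` consecutive observations
        target = series[i + step]     # the value to forecast
        samples.append((window, target))
    return samples

prices = [1011.3, 1060.8, 1053.1, 1044.2, 1035.6, 1048.9, 1057.2]
print(len(make_windows(prices, step=5)))  # → 2
```

With N observations and step L, this yields N − L samples, so a larger step also slightly reduces the number of training pairs.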
Table 7

The results with different steps on SCI-50.

Time step | MSE | RMSE | MAE | MAPE
5 | 1682.935 | 41.024 | *27.188* | *1.023*
10 | *1673.594* | *40.910* | 27.266 | 1.025
15 | 1736.988 | 41.677 | 27.588 | 1.036
20 | 1757.061 | 41.917 | 28.304 | 1.062
25 | 1752.673 | 41.865 | 28.084 | 1.055
30 | 1780.636 | 42.198 | 28.547 | 1.072

An asterisk (*) marks the optimal result in each column.

Table 8

The results with different steps on CSI-300.

Time step | MSE | RMSE | MAE | MAPE
5 | 3157.709 | 56.193 | *36.205* | *1.001*
10 | *3085.284* | *55.545* | 36.214 | 1.005
15 | 3233.284 | 56.862 | 36.904 | 1.025
20 | 3287.964 | 57.341 | 37.026 | 1.026
25 | 3438.429 | 58.638 | 39.390 | 1.082
30 | 3393.111 | 58.250 | 38.940 | 1.069

An asterisk (*) marks the optimal result in each column.

Table 9

The results with different steps on SZCI.

Time step | MSE | RMSE | MAE | MAPE
5 | 34522.267 | 185.802 | 127.234 | 1.186
10 | *33851.601* | *183.988* | *127.063* | *1.180*
15 | 34495.000 | 185.730 | 128.299 | 1.190
20 | 34767.899 | 186.462 | 128.943 | 1.195
25 | 36065.960 | 189.910 | 132.394 | 1.226
30 | 36287.302 | 190.492 | 132.812 | 1.228

An asterisk (*) marks the optimal result in each column.
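The four indicators reported in Tables 7–9 (and in all later tables) can be computed as follows. These are the standard definitions, with MAPE assumed to be expressed in percent; the paper does not spell out its exact formulas:

```python
import math

def metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and MAPE (in percent) of a forecast."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n                 # mean squared error
    rmse = math.sqrt(mse)                                # root mean squared error
    mae = sum(abs(e) for e in errors) / n                # mean absolute error
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n * 100
    return mse, rmse, mae, mape

mse, rmse, mae, mape = metrics([100.0, 200.0], [90.0, 210.0])
print(mse, rmse, mae, round(mape, 4))  # → 100.0 10.0 10.0 7.5
```

All four are error measures, so lower values mean better prediction performance.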

5.2.3. Convergence Analysis

In order to verify the convergence of the proposed method, we plot the curves of the loss values (MSE) on the training set and the validation set for each dataset. From Figure 9, we can see that the model converges very quickly on the training set. On the validation set, the loss values fluctuate but remain basically stable once the number of epochs exceeds 400.
Figure 9

The curves of loss values (MSE) on the training set and validation set of three stock datasets. (a) SCI-50. (b) CSI-300. (c) SZCI.

5.2.4. Performance Analysis

In order to further test the performance of the proposed method, we compare it with GRU, BiGRU, GRU-AE, BiGRU-AE, GRU-AE-AM, and BiGRU-AE-AM. Tables 10–12 show the results of different methods on three stock datasets. The following conclusions can be drawn from the experimental results:
Table 10

The results with step value of 10 on SCI-50.

Method | MSE | RMSE | MAE | MAPE
GRU | 2356.925 | 48.548 | 35.958 | 1.328
BiGRU | 2267.462 | 47.618 | 35.466 | 1.342
GRU-AE | 2371.064 | 48.694 | 37.074 | 1.419
BiGRU-AE | 1964.477 | 44.322 | 31.129 | 1.164
GRU-AE-AM | 2040.477 | 45.172 | 31.334 | 1.164
BiGRU-AE-AM | 1814.952 | 42.602 | 28.483 | 1.062
Our method | *1673.594* | *40.910* | *27.266* | *1.025*

An asterisk (*) marks the optimal result in each column.

Table 11

The results with step value of 10 on CSI-300.

Method | MSE | RMSE | MAE | MAPE
GRU | 4262.664 | 65.289 | 46.799 | 1.264
BiGRU | 3614.219 | 60.118 | 41.625 | 1.137
GRU-AE | 3382.457 | 58.159 | 38.588 | 1.070
BiGRU-AE | 3828.393 | 61.874 | 44.642 | 1.244
GRU-AE-AM | 3798.575 | 61.633 | 41.392 | 1.127
BiGRU-AE-AM | 3726.034 | 61.041 | 39.880 | 1.084
Our method | *3085.284* | *55.545* | *36.214* | *1.005*

An asterisk (*) marks the optimal result in each column.

Table 12

The results with step value of 10 on SZCI.

Method | MSE | RMSE | MAE | MAPE
GRU | 37269.796 | 193.054 | 137.383 | 1.271
BiGRU | 35012.771 | 187.114 | *126.924* | *1.180*
GRU-AE | 34737.291 | 186.379 | 128.821 | 1.203
BiGRU-AE | 37163.139 | 192.777 | 134.793 | 1.257
GRU-AE-AM | 35198.463 | 187.613 | 130.611 | 1.218
BiGRU-AE-AM | 34946.619 | 186.940 | 131.015 | 1.216
Our method | *33851.601* | *183.988* | 127.063 | *1.180*

An asterisk (*) marks the optimal result in each column.

(1) The traditional GRU and BiGRU models perform worse than the other comparison methods. BiGRU not only exploits the useful information of historical data in the forward direction but also mines the dependence of current data on historical data in the reverse direction; therefore, BiGRU outperforms GRU. (2) The encoder-decoder models (GRU-AE and BiGRU-AE) are superior to the plain recurrent networks (GRU and BiGRU), indicating that introducing encoding-decoding into the recurrent neural network is beneficial to the prediction performance. (3) The attention-based encoder-decoder models (GRU-AE-AM and BiGRU-AE-AM) exceed the plain encoder-decoder models (GRU-AE and BiGRU-AE), demonstrating that the attention mechanism helps the recurrent neural network mine significant information in time series data. (4) The proposed model integrates encoding-decoding and the attention mechanism simultaneously into the recurrent neural network. Unlike GRU-AE-AM and BiGRU-AE-AM, it applies the attention mechanism in the decoding stage to capture the degree of importance of different intervals. Compared with the other methods, the proposed method therefore establishes clear advantages on the evaluation indicators.
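The weighting idea behind the attention mechanism in the decoding stage can be illustrated with a minimal dot-product attention sketch (pure Python; the scoring function and dimensions here are illustrative assumptions, not the paper's exact formulation):

```python
import math

def attention_context(decoder_state, encoder_states):
    """Score each encoder hidden state against the current decoder state,
    normalize the scores with softmax, and return the weighted context.

    Illustrative sketch of attention only; the paper's exact scoring
    function is not reproduced here.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(decoder_state, h) for h in encoder_states]
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax weights, sum to 1
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

w, _ = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(x, 3) for x in w])  # the aligned state receives the larger weight
```

Historical time steps whose hidden states align more strongly with the current decoding state thus contribute more to the reconstruction, which is the effect the comparison with GRU-AE-AM and BiGRU-AE-AM isolates.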

5.3. Demand Forecast for Shared Bicycle Data

5.3.1. Shared Bicycle Data Description

The datasets in this experiment are derived from the shared bicycle demand of three streets in Shenzhen, China: Longgang Central City, Pingshan Street, and Zhaoshang Street. Each dataset contains the historical travel data of shared bicycles, time-attribute data (such as the hour and whether the day is a working day), and weather data (such as temperature, rainfall, wind speed, and humidity). The details are listed in Table 13.
Table 13

The description of shared bicycle datasets.

Dataset | Time | Quantity by hour
Longgang Central City | 2016.6–2017.8 (except Dec.) | 6935
Pingshan Street | 2016.7–2017.8 (except Dec.) | 6935
Zhaoshang Street | 2016.7–2016.11 | 2907

5.3.2. Parameters Analysis

In this experiment, the influence of the time step on the prediction performance is also analyzed. The step settings are consistent with the stock price prediction experiments, and the experimental results are shown in Tables 14–16. The effect of the step differs from that observed on the stock data. Firstly, on the Longgang and Pingshan datasets, the evaluation indicator values first decrease as the step increases; however, this trend does not persist, and the opposite occurs when the step continues to increase, so the performance of the proposed model degrades again. Secondly, unlike Tables 7–9, on the Zhaoshang Street dataset the proposed method obtains the optimal results when the step is set to the minimum (L = 5). A possible cause is that this time series has strong dependence and a complex data structure.
Table 14

The results with different steps of shared bicycle data on Longgang.

Time step | MSE | RMSE | MAE | MAPE
5 | 684.764 | 26.168 | 17.429 | 102.576
10 | 663.629 | 25.761 | 16.881 | 89.556
15 | 672.780 | 25.938 | 16.598 | *79.102*
20 | *652.445* | *25.543* | *16.421* | 88.784
25 | 726.195 | 26.948 | 17.697 | 93.377
30 | 695.377 | 26.370 | 17.800 | 123.276

An asterisk (*) marks the optimal result in each column.

Table 15

The results with different steps of shared bicycle data on Pingshan.

Time step | MSE | RMSE | MAE | MAPE
5 | 240.870 | 15.520 | 11.778 | 20.356
10 | 227.618 | 15.087 | 11.386 | 17.991
15 | *222.815* | *14.927* | *11.247* | 17.931
20 | 238.981 | 15.459 | 12.046 | 22.808
25 | 228.705 | 15.123 | 11.449 | 19.670
30 | 224.910 | 14.997 | 11.343 | *17.497*

An asterisk (*) marks the optimal result in each column.

Table 16

The results with different steps of shared bicycle data on Zhaoshang.

Time step | MSE | RMSE | MAE | MAPE
5 | *1071.253* | *32.730* | *22.051* | 76.338
10 | 1084.648 | 32.934 | 22.286 | *58.965*
15 | 1282.643 | 35.814 | 24.423 | 63.207
20 | 1201.246 | 34.659 | 23.586 | 63.110
25 | 1322.340 | 36.364 | 24.477 | 62.043
30 | 1430.125 | 37.817 | 25.936 | 67.414

An asterisk (*) marks the optimal result in each column.

5.3.3. Performance Analysis

Similarly, the proposed method is compared with the other well-known methods, and the results are shown in Tables 17–19. On the whole, the experimental results are consistent with those of the stock experiments, except on the Longgang dataset, where the proposed method achieves its best performance with a step value of 20. This suggests that the data structure there is relatively simple, so the more complex models are prone to overfitting: the error values of the bidirectional recurrent models (BiGRU, BiGRU-AE, and BiGRU-AE-AM) are higher than those of the corresponding unidirectional models (GRU, GRU-AE, and GRU-AE-AM).
Table 17

The results with step 20 of shared bicycle data on Longgang.

Method | MSE | RMSE | MAE | MAPE
GRU | 717.634 | 26.789 | 17.791 | 76.668
BiGRU | 718.049 | 26.796 | 17.711 | *63.458*
GRU-AE | 784.713 | 28.013 | 18.660 | 106.267
BiGRU-AE | 904.714 | 30.078 | 22.273 | 163.759
GRU-AE-AM | 740.382 | 27.210 | 17.726 | 85.522
BiGRU-AE-AM | 828.928 | 28.791 | 18.109 | 91.725
Our method | *652.445* | *25.543* | *16.421* | 88.784

An asterisk (*) marks the optimal result in each column.

Table 18

The results with step 15 of shared bicycle data on Pingshan.

Method | MSE | RMSE | MAE | MAPE
GRU | 308.121 | 17.553 | 13.758 | 21.750
BiGRU | 229.627 | 15.153 | 11.481 | 17.345
GRU-AE | 275.273 | 16.591 | 12.690 | 19.307
BiGRU-AE | 270.095 | 16.435 | 12.397 | 17.419
GRU-AE-AM | 212.592 | 14.581 | *10.800* | *15.397*
BiGRU-AE-AM | *211.903* | *14.557* | 10.876 | 15.818
Our method | 222.815 | 14.927 | 11.247 | 17.931

An asterisk (*) marks the optimal result in each column.

Table 19

The results with step 5 of shared bicycle data on Zhaoshang.

Method | MSE | RMSE | MAE | MAPE
GRU | 1174.327 | 34.268 | 24.864 | 55.381
BiGRU | 1139.840 | 33.762 | 24.263 | 55.748
GRU-AE | 1124.142 | 33.528 | 24.134 | 86.665
BiGRU-AE | 1180.541 | 34.359 | 23.882 | 72.879
GRU-AE-AM | 1241.616 | 35.237 | 23.773 | 46.250
BiGRU-AE-AM | 1195.643 | 34.578 | 22.705 | *43.788*
Our method | *1071.253* | *32.730* | *22.051* | 76.338

An asterisk (*) marks the optimal result in each column.

6. Conclusions and Future Works

In this paper, to improve the accuracy of time series prediction, the autoencoder, recurrent neural network, attention mechanism, convolution module, and fully connected module are integrated into a novel prediction model based on an encoding-decoding framework. The prediction performance is evaluated on stock prices and shared bicycle demand, using three stock datasets and three shared bicycle datasets, respectively. In addition, comparisons with many related methods demonstrate that the proposed model achieves higher prediction accuracy on multiple quantitative indicators (MSE, RMSE, MAE, and MAPE). Future work mainly includes the following. (1) We will apply the proposed model to time series prediction tasks in other fields (such as medical, energy, environmental, and other industrial data). (2) Using the same core idea, we will extend the model to the anomaly detection task for time series data. (3) We will study how to combine traditional multivariate time series methods with deep learning to further improve the prediction performance in real applications.