
Original scientific article

https://doi.org/10.7906/indecs.18.4.6

Forecasting Stock Market Indices using Machine Learning Algorithms

Žmuk Berislav ; University of Zagreb – Faculty of Economics and Business
Jošić Hrvoje ; University of Zagreb – Faculty of Economics and Business



pp. 471-489



Abstract

In recent years machine learning algorithms have become a very popular tool for analysing financial
data and forecasting stock prices. The goal of this article is to forecast five major stock market
indexes (DAX, Dow Jones, NASDAQ, Nikkei 225 and S&P 500) using machine learning algorithms
(Linear regression, Gaussian Processes, SMOreg and neural network Multilayer Perceptron) on
historical data covering the period February 1, 2010, to January 31, 2020. The forecasts were made by
using historical data in different base period lengths and forecasting horizons. The precision of machine
learning algorithms was evaluated with the help of error metrics. The results of the analysis have
shown that machine learning algorithms achieved highly accurate forecasting performance. The overall
precision of all algorithms was better for shorter base period lengths and forecast horizons. The results
obtained from this analysis could help investors in determining their optimal investment strategy.
Stock price prediction remains, however, one of the most complex issues in the field of finance.

Keywords

machine learning, neural networks, stock market indices prediction

Hrčak ID:

255406

URI

https://hrcak.srce.hr/255406

Publication date:

30.10.2020.





INTRODUCTION

Recent years have seen remarkable progress in artificial intelligence, machine learning and neural networks. Machine learning, as a subset of artificial intelligence, is closely related to computational statistics and concerns the building of algorithms which learn from training data to make predictions. It is a current application of artificial intelligence concerned with the discovery of models, patterns and other regularities in data[1]. Machine learning is one of the most exciting recent technologies in artificial intelligence and has many applications that we use daily, such as virtual personal assistants, video surveillance, social media services, online customer support, etc.[2]. Machine learning algorithms have become a very popular tool for analysing financial data and forecasting stock prices in the last few years[3]. The rapid growth of information technology and the Internet has led to the fast development of computer science methods. Neural networks are efficient methods for stock market prediction, mostly implemented in forecasting stock prices and returns. The backpropagation algorithm is the most frequently used methodology. The benefit of artificial neural networks is their ability to predict stock price movements even in situations with uncertain data[4]. The prediction of stock market prices and indexes is, however, a difficult task because various factors affect stock price formation. The goal of this article is to forecast stock market indexes using machine learning algorithms implemented in Weka. Weka is a collection of machine learning algorithms for various data mining tasks such as data pre-processing, classification, regression, clustering, association rules, visualization and forecasting. The algorithms used are Linear regression, Gaussian Processes, SMOreg and the neural network Multilayer Perceptron.
In the article, the prediction of five major stock market indexes (DAX performance-index (DAX), Dow Jones Industrial Average (Dow Jones), NASDAQ Composite (NASDAQ), Nikkei 225 and S&P 500) will be made using historical data from February 1, 2010, to January 31, 2020. The prediction will be made separately for each of the observed indexes using historical (training) data for three different periods (ten, five and one years) and machine learning algorithms. The forecast will be made for 5, 10, 15 and 20 time units (days) into the future. In that way, it will be inspected how well the selected forecasting approaches perform for different forecasting horizons. The forecasting precision of the machine learning tools will be evaluated using the MAE, MSE and MAPE error metrics. It is expected that machine learning algorithms will have a high level of precision in predicting the future values of major stock market indexes. The novelty of this article with regard to previous research is a more rigorous analysis of stock market indices forecasting using machine learning algorithms. The efficiency of the machine learning algorithms was compared using historical training data over long, medium and short time periods (10, 5 and 1 year) and by dividing the evaluations into those based on all training data and those based on held-out data (a fraction of 0.3 of the training data) for five stock market indices (DAX, Dow Jones, NASDAQ, Nikkei 225 and S&P 500). The robustness of the analysis is evident in the use of various error metrics (MAE, MSE and MAPE) for evaluating the forecasting precision of machine learning algorithms over forecast horizons of five, ten, fifteen and twenty days. In this way the efficiency of machine learning algorithms was examined more comprehensively than in previous research. The article is structured in five chapters. After the introduction, the literature review elaborates on the application of machine learning techniques in stock market prediction.
In the methodology and data section, the main characteristics of the data and the data sources are explained. In addition, the preparation of data for the analysis and the main features of the machine learning algorithms are explained in detail. In the results and discussion section, descriptive statistics of the data are displayed first, after which the individual forecasting performance for the market indices and a comparison of forecasting results between the market indices are presented. The final chapter presents concluding remarks, the limitations of the article and guidelines for future research.

LITERATURE REVIEW

The authors of [5] built a model using a decision tree classifier and historical data of three major companies listed on the Amman Stock Exchange (ASE). The proposed model could be a helpful tool for investors in the stock market to decide when to buy or sell stocks. The stock market price prediction ability of artificial neural networks before and after demonetization in India was investigated in [6] by observing nine stocks and the CNX NIFTY50 index. Multilayered neural networks were trained by the Levenberg-Marquardt algorithm. The proposed networks efficiently predicted the close price and worked best for highly volatile market conditions. A predictive study of the principal index of the Brazilian stock market[7] was performed with the help of artificial neural networks and the adaptive exponential smoothing method. The objective was to compare the forecasting performance of both methods by evaluating their accuracy in predicting stock market returns. The results showed that both methods produced similar results in predicting the index returns. In [8] an ensemble learning algorithm was developed to increase predictive efficiency. Twelve indicators were ranked by market participants using the VIKOR method. The importance of each indicator was based on specified nose and output values. The results have shown that the OBV, CCI and EMA indicators are very important. Furthermore, the SVM method of machine learning showed superior prediction accuracy. Using the RapidMiner tool, the authors of [9] examined and applied different prediction techniques to historical stock market prices, giving recommendations for buying or selling in the stock market. Comparing different predictive functions, they found that the deep learning function predicted stock prices more accurately than the other functions.
According to [10], different techniques for stock prediction can be classified into time series, neural networks and their variations, and hybrid techniques (the combination of a neural network with different machine learning techniques). It was shown that the neural network was the best technique to predict stock prices, especially when de-noising schemes are applied with the neural network. Five methods of analyzing stocks to predict the day's closing price were combined in [11]. Those are Typical Price (TP), Bollinger Bands, Relative Strength Index (RSI), CMI and Moving Average (MA). The results showed that the algorithms predicted the closing price in more than 50 % of cases with a high level of significance. Recurrent neural networks with character-level language model pre-training for both intraday and interday stock market forecasting were explored[3]. It was shown that the use of character-level embeddings was promising and competitive with other complex models which use technical indicators and event extraction methods. The authors[12] predicted the Turkish stock market BIST 30 Index using deep learning where features are selected from common important technical indicators. Their trained and tested model outperformed other techniques such as an artificial neural network (ANN), and they concluded that deep learning has proved itself a promising solution for complex problem solving. A comprehensive survey of more than 150 articles on machine learning application to financial markets forecasting was made[13]. Machine learning algorithms tend to outperform traditional stochastic methods in financial market forecasting. Moreover, on average recurrent neural networks outperformed feed-forward neural networks as well as support vector machines. The profitability of artificial neural networks on the Taiwan Weighted Index and the S&P 500 was investigated[14].
The authors created an efficient and inexpensive method for investors to ensure a good investment return and found that the trading rule based on artificial neural networks generates higher returns than the buy-hold strategy. Neural networks were employed to forecast S&P and Gold futures over a period of 90 months[15]. The forecasted parameters for the networks relied on 15 months of patterns while network forecast performance was tested and evaluated over a period of 75 months. The networks were able to correctly predict the sign of the price change 61 % and 75 % of the time for the gold trade and the S&P index, respectively. A method of feature selection for stock indexes and a deep learning model for sentiment analysis were proposed[16]. The method chosen for accurate stock trend prediction was LSTM (Long Short-Term Memory). Two approaches to the measurement and forecasting of realized variance are the Heterogeneous AutoRegressive model (HAR-RV) and Feedforward Neural Networks (FNNs)[17]. The application was made for the DAX index. Compared to traditional models, FNN-HAR-type models had better accuracy but only on the sample data. The Conditional Value-at-Risk (CVaR) method was applied to the Croatian stock market on a sample of 29 stocks grouped into 8 sectors in three different periods. The results have shown that sectors that are risky in the period of economic growth are not the same sectors that are risky during the period of economic crisis or stagnation[18]. In this article a comprehensive approach to forecasting stock market indices will be taken by applying machine learning algorithms. The methodology applied in the article builds on previous attempts in the empirical literature by employing a comprehensive and extensive approach to the analysis of major stock market indices.
The comparison of machine learning algorithms' efficiency was made by using longer historical time series for five stock market indices, dividing the data into training and held-out datasets, expanding the forecast horizon from five to twenty days and implementing different error metrics (MAE, MSE and MAPE).

DATA AND METHODOLOGY

Given the research aim of the article, to forecast major stock market indexes using machine learning techniques, and in order to inspect the success and usability of different forecasting approaches, two main requirements should be fulfilled. The first requirement is the availability of sufficiently long time series. The second requirement is that there are not many time series breaks or periods without data. In order to meet both criteria, data on five major world market indices are observed and analysed in the article. The following five market indices were chosen: DAX performance-index (DAX), Dow Jones Industrial Average (Dow Jones), NASDAQ Composite (NASDAQ), Nikkei 225 and S&P 500. The data for the selected market indices are taken from the Yahoo! Finance web page[19][20][21][22][23]. Although all data are taken from the same source, the observed market index values are given in national currencies: DAX is given in euros, Dow Jones, NASDAQ and S&P 500 in US dollars, and Nikkei 225 in yen. The analysis in the article is based on historical data of various lengths. The reason for using historical time series of different lengths is to inspect the accuracy of the forecasting approaches when a different amount of training data is used as a base for calculating forecasts. Overall, three base periods are observed in the article: long, medium and short. The long base period includes historical data for 10 years, the medium base period includes data for five years, and the short base period includes historical data for just one year. The 10-year base period covers historical data from February 1, 2010, to January 31, 2020, the five-year base period covers data from February 1, 2015, to January 31, 2020, and the one-year base period covers historical data from February 1, 2019, to January 31, 2020.
It has to be emphasized that only daily close prices adjusted for splits are observed in the article. For the analysis, the data are converted from .xls and .csv formats into .arff format. The preparation of the DAX index data in .arff format is presented in a few rows in Figure 1.

@relation DAX

@attribute date date "yyyy-MM-dd"

@attribute close numeric

@data

2010-02-01,5654.479980

2010-02-02,5709.660156

2010-02-03,5672.089844

2010-02-04,5533.240234

2010-02-05,5434.339844

2020-01-27,13204.76953

2020-01-28,13323.69043

2020-01-29,13345.00000

2020-01-30,13157.12011

2020-01-31,12981.96972

Figure 1. Preparation of DAX stock market index data.
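The conversion into the .arff layout of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' procedure: the column names `Date` and `Close`, the handling of `null` entries and the helper name `csv_to_arff` are all assumptions introduced here.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text with 'Date' and 'Close' columns (assumed names) to ARFF text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        f"@relation {relation}",
        "",
        '@attribute date date "yyyy-MM-dd"',
        "@attribute close numeric",
        "",
        "@data",
    ]
    for row in reader:
        close = row["Close"].strip()
        # Assumption: days without a quote appear as an empty field or "null".
        if close in ("", "null"):
            continue
        lines.append(f"{row['Date'].strip()},{float(close):.6f}")
    return "\n".join(lines) + "\n"


# Hypothetical three-row input; the first two values are taken from Figure 1.
sample = (
    "Date,Close\n"
    "2010-02-01,5654.479980\n"
    "2010-02-02,5709.660156\n"
    "2010-02-03,null\n"
)
arff = csv_to_arff(sample, "DAX")
print(arff)
```

Rows without a close value are skipped, which keeps the resulting series free of missing entries, at the cost of an irregular calendar.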

Figure 2 illustrates the system framework for predicting major stock market indexes. It contains the input stock market data, timestamp, periodicity, lag and overlay, which are inserted into the system to produce forecasts.

Figure 2. System framework.

The forecasting is conducted in Weka version 3.8.4[24] with the timeseriesForecasting package version 1.0.27[25] installed. In the basic configuration window, forecasts for 5, 10, 15 and 20 time units (periods) into the future were selected. In that way, it will be inspected how well the selected forecasting approaches perform for different forecasting horizons. As a timestamp the option "Use an artificial time index" is used, and under periodicity the option "Daily" is selected. Under the advanced configuration window, four base learner configurations, that is, four different forecasting approaches, are selected. To enable comparability and repeatability of the research, the default settings of the base learner configurations are used. The following four base learner configurations are used: Gaussian processes for regression (Gaussian processes), linear regression for prediction (Linear regression), a multilayer perceptron trained with backpropagation (Multilayer perceptron) and a support vector machine for regression (SMOreg). Gaussian process regression is a Bayesian or nonparametric approach to regression which has become widely used in machine learning[26]. When Gaussian process regression is applied, the prior of the Gaussian process should be specified. Here the prior mean is assumed to be equal to the mean of the training data. On the other hand, linear regression is one of the most commonly used traditional predictive models[27]. In linear regression models, the association between the output variable and the explanatory variables is assumed to be described by a straight line. The line is chosen so that the distances between the actual data values and the regression line are minimized. Here a simple linear regression model is assumed in which the output variable is the close price of an observed market index whereas the explanatory variable is time.
The multilayer perceptron is a supervised learning algorithm that learns by training on a dataset. A multilayer perceptron consists of an input layer and an output layer. Between those two layers there are one or more nonlinear layers, which are called hidden layers.
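The simple linear regression on time described above can be illustrated with a short least-squares sketch. It is not the Weka implementation, which may also use lagged values of the series. The function names and the toy series are assumptions made for illustration only.

```python
def fit_linear_trend(closes):
    """Ordinary least squares fit of close = intercept + slope * t, t = 0..n-1."""
    n = len(closes)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(closes) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(closes))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def forecast_trend(closes, horizon):
    """Extrapolate the fitted line for `horizon` steps beyond the last observation."""
    intercept, slope = fit_linear_trend(closes)
    n = len(closes)
    return [intercept + slope * (n + k) for k in range(horizon)]


# Toy series (not market data): a perfectly linear sequence.
toy_closes = [100.0, 102.0, 104.0, 106.0]
toy_forecast = forecast_trend(toy_closes, 5)
print(toy_forecast)
```

Calling `forecast_trend` with a horizon of 5, 10, 15 or 20 mirrors the forecast horizons used in the article. A pure trend line cannot follow short-term fluctuations, which is one reason to compare it with the other approaches.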

Figure 3. Prediction of the DAX stock market index using Multilayer perceptron (made by the authors using the WEKA interface)

Backpropagation is a learning algorithm which is often used with the multilayer perceptron to find the minimum of the error function. A detailed explanation of the steps in a multilayer perceptron with a hidden layer is given in [28]. Sequential minimal optimization (SMO) is an iterative algorithm for solving regression problems using a support vector machine, proposed by [29]. The analysis is conducted in two ways. In the first, all data are included as base or training data. In the second, 30 % of the training data is held out from the end of the series to form an independent test set. To evaluate the forecasting approaches, the following forecasting errors are used: mean absolute error (MAE), mean squared error (MSE) and mean absolute percentage error (MAPE). All forecasting errors are calculated by comparing actual and forecasted values within a given forecast horizon (5, 10, 15 or 20 days). By observing forecast errors over different forecast horizons, it will be inspected how the precision of a given forecasting approach changes with the forecast horizon. In that way, it will be possible to conclude whether a given forecasting approach is appropriate for forecasting more periods into the future or should be used only for short forecasting horizons. Mean absolute error is calculated as the average of absolute differences between actual and forecasted values. Mean squared error is the average of squared differences between actual and forecasted values. Mean absolute percentage error is calculated as the average of absolute differences between actual and forecasted values divided by actual values, multiplied by 100. Because the observed market indices are not all given in the same units, mean absolute error and mean squared error are going to be used to evaluate forecasting approaches for each market index separately.
On the other hand, the mean absolute percentage error is also going to be used to compare results between the market indices. Equations 1-3 present the formulas for calculating the MAE, MSE and MAPE values.

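$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|Y_i - \bar{y}_i\right| \tag{1}$$

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i - \bar{y}_i\right)^2 \tag{2}$$

$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{Y_i - \bar{y}_i}{Y_i}\right| \tag{3}$$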

where ӯi is the predicted value and Yi is the observed value for the number N of observations. In Table 1 the interpretation of MAPE values according to the range of observed errors is explained.

Table 1. Interpretation of MAPE values
MAPE value | Interpretation
< 10 | Highly accurate forecasting
10-20 | Good forecasting
20-50 | Reasonable forecasting
> 50 | Inaccurate forecasting

The value of MAPE lower than 10 can be interpreted as highly accurate forecasting, the value of MAPE in the range of 10-20 can be interpreted as good forecasting, the value in the range of 20-50 is reasonable forecasting while the value of MAPE higher than 50 can be interpreted as inaccurate forecasting.
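The three error metrics, the interpretation of MAPE and the 30 % hold-out from the end of the series can be sketched as follows. The function names and the toy numbers are assumptions made for illustration. The article does not say which band the boundary values 20 and 50 belong to, so assigning them to the lower band is also an assumption.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def interpret_mape(value):
    """Map a MAPE value to the verbal scale of Table 1 (boundary handling assumed)."""
    if value < 10:
        return "Highly accurate forecasting"
    if value <= 20:
        return "Good forecasting"
    if value <= 50:
        return "Reasonable forecasting"
    return "Inaccurate forecasting"


def holdout_split(series, fraction=0.3):
    """Hold out the last `fraction` of the series as an independent test set."""
    cut = len(series) - int(round(len(series) * fraction))
    return series[:cut], series[cut:]


# Toy values (not market data).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mae(actual, predicted), mse(actual, predicted), mape(actual, predicted))
print(interpret_mape(mape(actual, predicted)))

train, test = holdout_split(list(range(10)))
print(train, test)
```

Because MAPE divides by the actual value, it is unit-free, which is why it suits comparisons between indices quoted on different scales.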

RESULTS AND DISCUSSION

DESCRIPTIVE STATISTICS

The close values of the five market indices (DAX, Dow Jones, NASDAQ, Nikkei 225 and S&P 500) are observed in three different periods. Therefore, three descriptive statistics analyses have been conducted. The descriptive statistics results are shown in Tables 2, 3 and 4. Table 2 gives the descriptive statistics for the 10-year period, whereas Table 3 gives them for the 5-year period. Table 4 presents the descriptive statistics for the close daily values of the observed market indices in the period of one year. The descriptive statistics results given in Table 2 include close daily values in the 10 years from 1.2.2010 to 31.1.2020. Therefore, those results present the situation in the long term. Due to a different number of working days, the count of daily data differs among the observed market indices. However, there are no large differences in the data count between the given market indices. Still, the coefficients of variation reveal that in the long term the close daily prices of the market indices have a high level of variability (or volatility). The highest variability in the observed period was recorded for NASDAQ (40 %) whereas the lowest was recorded for DAX (25 %). The distributions of close daily prices for the stock market indices DAX, Nikkei 225 and S&P 500 seem to be approximately symmetric whereas the data distributions of Dow Jones and NASDAQ seem to be weakly positively asymmetric. All five observed data distributions are flatter than the standardized normal distribution.
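The coefficient of variation, skewness and kurtosis discussed above can be computed as in the following sketch. The article does not state which estimators were used, so the population-moment formulas for skewness and excess kurtosis are an assumption, as is the helper name `describe`. Negative excess kurtosis corresponds to a distribution flatter than the normal one.

```python
import statistics


def describe(values):
    """Coefficient of variation (%), moment skewness and excess kurtosis."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # Central moments (population form, an assumption about the estimator).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "cv_percent": 100 * sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }


# Toy series (not market data): symmetric and flatter than a normal distribution.
toy_stats = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(toy_stats)

# Coefficients of variation recomputed from the 10-year mean and standard deviation.
nasdaq_cv = 100 * 1909 / 4800
dax_cv = 100 * 2387 / 9593
print(round(nasdaq_cv), round(dax_cv))
```

Recomputing the coefficient of variation from the published mean and standard deviation is a quick consistency check on the reported percentages.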

Table 2. Descriptive statistics of the five observed stock market indices daily values, close price, data from 1.2.2010 to 31.1.2020.
Statistics | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500
Count | 2,533 | 2,518 | 2,518 | 2,449 | 2,518
Mean | 9,593 | 17,755 | 4,800 | 15,887 | 1,980
Standard deviation | 2,387 | 5,207 | 1,909 | 4,959 | 596
Coefficient of variation (%) | 25 | 29 | 40 | 31 | 30
Median | 9,783 | 17,078 | 4,654 | 16,386 | 1,995
Minimum | 5,072 | 9,686 | 2,092 | 8,160 | 1,023
Maximum | 13,577 | 29,348 | 9,402 | 24,271 | 3,330
Skewness | -0.14 | 0.43 | 0.45 | -0.12 | 0.22
Kurtosis | -1.32 | -0.96 | -0.96 | -1.39 | -1.06

Table 3. Descriptive statistics of the five observed stock market indices daily values, close price, data from 1.2.2015 to 31.1.2020.
Statistics | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500
Count | 1,263 | 1,259 | 1,259 | 1,224 | 1,259
Mean | 11,644 | 21,923 | 6,381 | 20,167 | 2,473
Standard deviation | 1,076 | 3,788 | 1,313 | 2,192 | 368
Coefficient of variation (%) | 9 | 17 | 21 | 11 | 15
Median | 11,815 | 21,753 | 6,314 | 20,307 | 2,453
Minimum | 8,753 | 15,660 | 4,267 | 14,952 | 1,829
Maximum | 13,577 | 29,348 | 9,402 | 24,271 | 3,330
Skewness | -0.36 | 0.06 | 0.22 | -0.30 | 0.21
Kurtosis | -0.81 | -1.47 | -1.28 | -0.82 | -1.16

The descriptive statistics results given in Table 3 are calculated from close daily market index values in the period from 1.2.2015 to 31.1.2020. The results show that the variability of close daily prices is much lower in this 5-year period than in the 10-year period. The lowest variability in this 5-year period was recorded for DAX (9 %) whereas the highest was recorded for NASDAQ (21 %). All observed stock market indices had almost symmetric distributions of close daily prices except the DAX index, whose data distribution is weakly negatively asymmetric. As in the 10-year period, all data distributions are flatter than the standardized normal distribution. In Table 4 the results of the descriptive statistics are given for the period of just one year, where the close daily prices are observed from 1.2.2019 to 31.1.2020.

Table 4. Descriptive statistics of the five observed market indices daily values, close price, data from 1.2.2019 to 31.1.2020.
Statistics | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500
Count | 251 | 252 | 252 | 241 | 252
Mean | 12,319 | 26,773 | 8,128 | 21,948 | 2,969
Standard deviation | 658 | 1,046 | 496 | 1,048 | 150
Coefficient of variation (%) | 5 | 4 | 6 | 5 | 5
Median | 12,264 | 26,573 | 8,034 | 21,617 | 2,941
Minimum | 10,907 | 24,815 | 7,264 | 20,261 | 2,706
Maximum | 13,577 | 29,348 | 9,402 | 24,084 | 3,330
Skewness | 0.19 | 0.57 | 0.79 | 0.57 | 0.59
Kurtosis | -0.92 | -0.44 | 0.16 | -0.88 | -0.32

In this short term, the coefficients of variation show that close daily prices have low variability for all five observed market indices. However, all five data distributions of close daily prices are more skewed than they were in the medium (5 years) and long run (10 years). Only the distribution of close daily prices for the NASDAQ stock market index is less flat than the standardized normal distribution whereas the other four data distributions are flatter.

INDIVIDUAL FORECASTING PERFORMANCE FOR THE MARKET INDICES

In this chapter the most precise forecasting approach is emphasized for each observed stock market index. The best forecasting approaches are listed separately according to the mean absolute error and the mean squared error criteria. In other words, Tables A1-A5 in the Appendix list the forecasting approaches that achieved the lowest error values in the different situations. The best forecasting approaches are listed by taking into account forecast horizons of 5, 10, 15 and 20 days. Furthermore, base period lengths of 1, 5 and 10 years have been taken into account as well. Finally, whether all historical data or just 70 % of them were used to calculate the forecast values has also been taken into account. It has to be emphasized that the exact values of the mean absolute errors and mean squared errors are not given here due to article length limitations, but the data are available upon request. In Table A1 in the Appendix the best forecasting approaches for DAX are given. Both observed errors, mean absolute error and mean squared error, led to the choice of the same forecasting approach in all cases but the last one. If all data are used to calculate forecasts, the SMOreg approach proved to be the best solution when data from 10 years are observed. On the other hand, multilayer perceptron turned out to be the most precise forecasting approach when only data from one year are observed.
If 30 % of the training data is held out from the end of the series, linear regression turned out to be the most precise when data from 10 years are used, multilayer perceptron is the best solution for time series of 5 years, whereas SMOreg is the most appropriate for short time series with a length of one year. In Table A2 in the Appendix the best forecasting approaches for Dow Jones are listed. It has been shown that, when all data are observed, multilayer perceptron is the most precise forecasting approach when the base period length is 5 and 10 years. However, this is valid only if the forecast horizon is shorter than 20 days. On the other hand, when 30 % of the training data is held out from the end of the series, SMOreg turned out to be the most appropriate forecasting approach in most cases. According to the results from Table A3 in the Appendix, where the best forecasting approaches for NASDAQ are shown, when all data are observed multilayer perceptron is the best solution when forecasts are based on a long period (10 years), SMOreg for the medium-long period (5 years) and Gaussian processes for the short period (one year). When 30 % of the training data is held out from the end of the series, Gaussian processes turned out to be the most precise forecasting approach for the short period whereas in other situations SMOreg seems to be the best choice. Table A4 in the Appendix contains a list of the best forecasting approaches for Nikkei 225. When all data are used, it can be concluded that SMOreg is the best forecasting approach when forecasts are based on data from the medium-long period. However, no other pattern can be recognized.
On the other hand, when 30 % of the training data is held out from the end of the series, Gaussian processes turned out to be the most precise forecasting approach when forecasts are based on data from the short period (one year), SMOreg for forecasts based on data from the medium-long period (5 years) and linear regression is the best solution for forecasts based on data from the long period (10 years). In Table A5 in the Appendix the best forecasting approaches for S&P 500 are listed. It turned out that, when all data are used as a base for forecasts, multilayer perceptron is the best solution when forecasts are based on data from the long period (10 years). In other cases, the Gaussian processes approach seems to be the most precise. On the other hand, when 30 % of the training data is held out from the end of the series, Gaussian processes turned out to be the most precise forecasting approach when forecasts are based on data from the short period (one year), linear regression is appropriate for forecasts based on data from the medium-long period (5 years) and linear regression is the best solution for forecasts based on data from the long period (10 years).

COMPARISON OF FORECASTING RESULTS BETWEEN THE MARKET INDICES

To compare the best forecasting approaches between the observed market indices, the mean absolute percentage error was used. The main reason is that not all observed market indices are given in the same units (US dollars, euros, yen). In this way, a direct comparison between the observed market indices can be made. In the following tables, Tables 5-10, mean absolute percentage error values for the observed market indices for different base period lengths (1, 5 and 10 years) are given. In addition, a distinction is made between situations when all data are used as base or training data and when 30 % of the training data is held out from the end of the series.
In the aforementioned tables, the lowest values of mean absolute percentage errors for each observed market index and four forecast horizons are bolded.

Table 5. Mean absolute percentage errors for the five observed market indices, evaluation based on all data from 1.2.2010 to 31.1.2020. Base period length is 10 years, bolded values are the lowest values of market indices for certain forecast horizon.
Forecast horizon / Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500
5 days
Gaussian processes | 4.76 | 8.08 | 5.40 | 13.45 | 6.10
Linear regression | 3.09 | 2.32 | 4.37 | 1.56 | 3.43
Multilayer perceptron | 7.14 | 0.72 | 1.25 | 1.11 | 1.06
SMOreg | 2.75 | 2.14 | 3.14 | 1.32 | 2.24
10 days
Gaussian processes | 39.77 | 17.68 | 11.42 | 74.79 | 16.36
Linear regression | 4.20 | 2.84 | 5.97 | 1.98 | 4.61
Multilayer perceptron | 11.21 | 0.57 | 1.40 | 0.84 | 1.08
SMOreg | 3.68 | 2.53 | 4.10 | 1.37 | 2.87
15 days
Gaussian processes | 269.63 | 26.19 | 16.80 | 412.00 | 32.04
Linear regression | 4.67 | 2.70 | 6.41 | 1.74 | 4.77
Multilayer perceptron | 14.25 | 1.44 | 1.63 | 1.50 | 1.22
SMOreg | 3.97 | 2.37 | 4.10 | 1.32 | 2.83
20 days
Gaussian processes | 2,043.17 | 31.63 | 21.92 | 2,504.69 | 55.50
Linear regression | 4.32 | 4.06 | 5.31 | 3.11 | 4.54
Multilayer perceptron | 14.77 | 4.60 | 4.11 | 3.86 | 3.87
SMOreg | 4.06 | 4.10 | 4.35 | 3.33 | 3.97

According to the results from Table 5, it can be concluded that multilayer perceptron should be used as the most precise forecasting approach when long base periods are used. However, this choice is justified only for short forecast horizons. Furthermore, it should be mentioned that this conclusion is valid for four out of five observed market indices. Namely, in this case, SMOreg turned out to be the best choice for forecasting DAX.
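The selection of the most precise approach per index can be sketched as a minimum search over the MAPE values. The numbers below are the 5-day rows of Table 5. The data layout and the function name `best_approach_per_index` are assumptions made for illustration.

```python
# MAPE values for the 5-day horizon, 10-year base period (Table 5).
indices = ["DAX", "Dow Jones", "NASDAQ", "Nikkei 225", "S&P 500"]
mape_5_days = {
    "Gaussian processes": [4.76, 8.08, 5.40, 13.45, 6.10],
    "Linear regression": [3.09, 2.32, 4.37, 1.56, 3.43],
    "Multilayer perceptron": [7.14, 0.72, 1.25, 1.11, 1.06],
    "SMOreg": [2.75, 2.14, 3.14, 1.32, 2.24],
}


def best_approach_per_index(table, index_names):
    """Return, for each index, the approach with the lowest MAPE."""
    best = {}
    for col, name in enumerate(index_names):
        best[name] = min(table, key=lambda approach: table[approach][col])
    return best


best = best_approach_per_index(mape_5_days, indices)
for name, approach in best.items():
    print(f"{name}: {approach}")
```

For this horizon the search selects SMOreg for DAX and multilayer perceptron for the other four indices, in line with the conclusion drawn from Table 5.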

Table 6. Mean absolute percentage errors for the five observed market indices, evaluation based on all data from 1.2.2015 to 31.1.2020. Base period length is 5 years, bolded values are the lowest values of market indices for certain forecast horizon.
Forecast horizon / Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500
5 days
Gaussian processes | 1.08 | 7.30 | 4.97 | 5.55 | 0.46
Linear regression | 2.78 | 2.24 | 3.81 | 1.74 | 2.61
Multilayer perceptron | 2.03 | 1.33 | 4.51 | 2.31 | 2.14
SMOreg | 2.66 | 2.23 | 3.12 | 1.56 | 2.22
10 days
Gaussian processes | 1.26 | 13.82 | 10.23 | 11.65 | 0.61
Linear regression | 3.74 | 2.71 | 5.10 | 2.44 | 3.45
Multilayer perceptron | 2.37 | 1.26 | 6.25 | 3.88 | 2.80
SMOreg | 3.60 | 2.72 | 4.18 | 1.98 | 2.97
15 days
Gaussian processes | 2.83 | 18.22 | 14.47 | 18.06 | 0.76
Linear regression | 4.09 | 2.56 | 5.31 | 2.23 | 3.41
Multilayer perceptron | 2.31 | 1.26 | 6.80 | 4.43 | 2.80
SMOreg | 3.91 | 2.58 | 4.28 | 1.75 | 2.98
20 days
Gaussian processes | 6.77 | 21.47 | 18.51 | 24.27 | 3.00
Linear regression | 4.06 | 4.05 | 4.77 | 3.15 | 4.07
Multilayer perceptron | 3.40 | 3.64 | 5.63 | 3.74 | 3.85
SMOreg | 4.02 | 4.03 | 4.31 | 3.05 | 3.87

In Table 6 the base period length is reduced from 10 to 5 years, and the conclusions are no longer as straightforward. For NASDAQ and Nikkei 225 the most precise forecasting approach turned out to be SMOreg, whereas for Dow Jones it is the multilayer perceptron and for S&P 500 Gaussian processes. These conclusions hold for all four observed forecast horizons.

When the base period length is reduced to one year, a general conclusion is even more difficult to draw. The results in Table 7 are not consistent across the observed market indices. For the short forecast horizons, Gaussian processes and SMOreg forecast well. For longer forecast horizons, however, the multilayer perceptron turned out to be the most precise forecasting approach.

Table 8 gives the mean absolute percentage errors when the base period length is 10 years but 30 % of the training data has been held out from the end of the series. The results are consistent across all forecast horizons. For Dow Jones, NASDAQ and S&P 500 the most precise forecasting approach is SMOreg, whereas for DAX and Nikkei 225 it is linear regression.
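Holding out 30 % of the data from the end of the series, and scoring forecasts on that part with the mean absolute percentage error, can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of the Weka evaluation used in the article: the function names `mape` and `holdout_split` are hypothetical, the closing values are invented, and a naive "repeat the last value" forecaster stands in for the four algorithms.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def holdout_split(series, holdout_fraction=0.3):
    """Split chronologically: the last `holdout_fraction` of points is held out."""
    cut = len(series) - int(round(len(series) * holdout_fraction))
    return series[:cut], series[cut:]


# Hypothetical closing values, for illustration only (not data from the article).
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0, 108.0, 112.0, 115.0]
train, test = holdout_split(closes, 0.3)

# Naive stand-in forecaster: repeat the last training value over the hold-out part.
naive_forecast = [train[-1]] * len(test)

print(len(train), len(test))
print(round(mape(test, naive_forecast), 2))
```

The split is chronological rather than random, so the model never sees observations that come after the ones it is asked to forecast.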

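Table 7. Mean absolute percentage errors (%) for the five observed market indices, evaluated on all data from 1 February 2019 to 31 January 2020. The base period length is one year; bolded values are the lowest errors for each market index at a given forecast horizon.

| Forecast horizon | Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500 |
| --- | --- | --- | --- | --- | --- | --- |
| 5 days | Gaussian processes | 3,44 | 1,22 | 1,26 | 5,29 | 0,81 |
| 5 days | Linear regression | 1,82 | 1,27 | 1,56 | 1,17 | 1,00 |
| 5 days | Multilayer perceptron | 0,92 | 2,96 | 2,40 | 1,27 | 2,11 |
| 5 days | SMOreg | 1,53 | 1,00 | 1,72 | 1,00 | 1,04 |
| 10 days | Gaussian processes | 3,23 | 1,20 | 1,62 | 6,43 | 0,58 |
| 10 days | Linear regression | 2,35 | 1,22 | 1,77 | 0,98 | 1,01 |
| 10 days | Multilayer perceptron | 1,28 | 3,30 | 2,86 | 1,41 | 2,17 |
| 10 days | SMOreg | 1,84 | 0,75 | 2,05 | 1,58 | 1,14 |
| 15 days | Gaussian processes | 2,69 | 1,67 | 1,80 | 6,98 | 0,93 |
| 15 days | Linear regression | 2,43 | 1,29 | 1,90 | 1,34 | 1,15 |
| 15 days | Multilayer perceptron | 1,49 | 2,88 | 2,85 | 1,29 | 1,95 |
| 15 days | SMOreg | 1,66 | 1,26 | 2,11 | 3,20 | 1,25 |
| 20 days | Gaussian processes | 5,22 | 4,71 | 3,89 | 9,63 | 3,79 |
| 20 days | Linear regression | 3,46 | 3,97 | 4,14 | 3,53 | 3,74 |
| 20 days | Multilayer perceptron | 2,84 | 4,83 | 4,00 | 2,84 | 3,64 |
| 20 days | SMOreg | 3,40 | 4,29 | 4,22 | 6,40 | 3,83 |
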
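Table 8. Mean absolute percentage errors (%) for the five observed market indices, evaluated on the 30 % of the training data held out from the end of the series (data from 1 February 2010 to 31 January 2020). The base period length is 10 years; bolded values are the lowest errors for each market index at a given forecast horizon.

| Forecast horizon | Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500 |
| --- | --- | --- | --- | --- | --- | --- |
| 5 days | Gaussian processes | 8,34 | 23,24 | 154,28 | 92,72 | 80,44 |
| 5 days | Linear regression | 3,88 | 14,09 | 17,88 | 1,61 | 13,40 |
| 5 days | Multilayer perceptron | 6,74 | 19,20 | 35,15 | 7,95 | 17,34 |
| 5 days | SMOreg | 6,10 | 6,03 | 6,82 | 4,30 | 6,65 |
| 10 days | Gaussian processes | 88,06 | 123,35 | 1.156,68 | 666,65 | 607,60 |
| 10 days | Linear regression | 5,36 | 18,10 | 22,91 | 2,11 | 16,93 |
| 10 days | Multilayer perceptron | 7,68 | 24,10 | 40,82 | 9,21 | 22,02 |
| 10 days | SMOreg | 9,53 | 9,42 | 10,64 | 7,51 | 10,49 |
| 15 days | Gaussian processes | 926,64 | 645,28 | 9.869,29 | 5.383,52 | 5.128,98 |
| 15 days | Linear regression | 6,05 | 19,44 | 24,75 | 1,87 | 18,09 |
| 15 days | Multilayer perceptron | 8,17 | 25,77 | 41,79 | 9,80 | 23,86 |
| 15 days | SMOreg | 12,09 | 11,73 | 13,39 | 9,58 | 13,15 |
| 20 days | Gaussian processes | 11.624,54 | 3.757,37 | 97.448,91 | 50.289,99 | 50.370,14 |
| 20 days | Linear regression | 5,21 | 18,25 | 24,02 | 3,14 | 16,80 |
| 20 days | Multilayer perceptron | 10,86 | 24,84 | 40,58 | 12,23 | 23,01 |
| 20 days | SMOreg | 12,44 | 11,98 | 14,11 | 10,22 | 13,66 |
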
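Table 9. Mean absolute percentage errors (%) for the five observed market indices, evaluated on the 30 % of the training data held out from the end of the series (data from 1 February 2015 to 31 January 2020). The base period length is 5 years; bolded values are the lowest errors for each market index at a given forecast horizon.

| Forecast horizon | Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500 |
| --- | --- | --- | --- | --- | --- | --- |
| 5 days | Gaussian processes | 71,23 | 80,70 | 75,73 | 95,23 | 49,86 |
| 5 days | Linear regression | 14,03 | 3,10 | 8,17 | 4,21 | 0,44 |
| 5 days | Multilayer perceptron | 0,66 | 19,42 | 5,84 | 8,20 | 4,39 |
| 5 days | SMOreg | 6,57 | 3,53 | 1,21 | 2,17 | 1,69 |
| 10 days | Gaussian processes | 119,69 | 86,24 | 64,42 | 124,74 | 44,66 |
| 10 days | Linear regression | 20,89 | 5,03 | 9,60 | 5,96 | 0,59 |
| 10 days | Multilayer perceptron | 0,76 | 22,27 | 7,42 | 12,01 | 5,70 |
| 10 days | SMOreg | 10,08 | 5,02 | 2,29 | 3,40 | 2,10 |
| 15 days | Gaussian processes | 140,47 | 71,40 | 53,23 | 116,90 | 37,52 |
| 15 days | Linear regression | 25,40 | 7,06 | 10,71 | 8,14 | 1,22 |
| 15 days | Multilayer perceptron | 1,01 | 23,07 | 7,70 | 13,19 | 5,65 |
| 15 days | SMOreg | 12,58 | 5,61 | 3,71 | 5,56 | 2,07 |
| 20 days | Gaussian processes | 138,76 | 72,77 | 59,48 | 107,40 | 41,92 |
| 20 days | Linear regression | 27,21 | 11,16 | 14,01 | 11,71 | 4,05 |
| 20 days | Multilayer perceptron | 3,40 | 21,67 | 6,32 | 12,20 | 5,12 |
| 20 days | SMOreg | 12,77 | 4,76 | 7,26 | 9,33 | 3,81 |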

According to the results in Table 9, when the base period length is reduced to 5 years, SMOreg turned out to be the most precise forecasting approach in most cases across all four observed forecast horizons. When the base period length is reduced to one year and 30 % of the training data has been held out from the end of the series, the results in Table 10 suggest that Gaussian processes are the most precise forecasting approach. For forecasting DAX, however, the most precise method turned out to be SMOreg.

From the aforementioned results, it can be concluded that the machine learning algorithms achieved highly accurate forecasting performance, although in some cases the precision could only be classified as good forecasting. The exception is Gaussian processes, which showed some incompatibility with the data being predicted. However, the precision of this algorithm was better for shorter base period lengths and forecast horizons, i.e. a 1-year base period and a 5-day forecast horizon. The precision of all algorithms was, as expected, better for shorter base periods and shorter forecast horizons. Furthermore, the precision of all algorithms was much better when all data were included in the analysis than in the evaluations based only on the held-out 30 % of the training data. The results obtained from this analysis are in line with other research in this field: machine learning algorithms and neural networks can be characterized as efficient methods for stock market index prediction.
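The labels "highly accurate" and "good forecasting" correspond to bands of the mean absolute percentage error. The sketch below shows one way such a classification could be coded. The thresholds (below 10 %, 10 to 20 %, 20 to 50 %, above 50 %) are an assumption based on a widely quoted rule of thumb; this section does not state which cut-offs the article applied, and the function name `classify_mape` is hypothetical.

```python
def classify_mape(mape_percent: float) -> str:
    """Map a MAPE value (in %) to a qualitative label.

    The cut-offs are an assumed rule of thumb, not values taken from the article.
    """
    if mape_percent < 0:
        raise ValueError("MAPE cannot be negative")
    if mape_percent < 10:
        return "highly accurate"
    if mape_percent < 20:
        return "good"
    if mape_percent < 50:
        return "reasonable"
    return "inaccurate"


# The four NASDAQ values from the 5-day block of Table 8.
nasdaq_5_days = {
    "Gaussian processes": 154.28,
    "Linear regression": 17.88,
    "Multilayer perceptron": 35.15,
    "SMOreg": 6.82,
}
for approach, value in nasdaq_5_days.items():
    print(f"{approach}: {classify_mape(value)}")
```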

Table 10. Mean absolute percentage errors (%) for the five observed market indices, evaluated on the 30 % of the training data held out from the end of the series (data from 1 February 2019 to 31 January 2020). The base period length is one year; bolded values are the lowest errors for each market index at a given forecast horizon.

| Forecast horizon | Forecasting approach | DAX | Dow Jones | NASDAQ | Nikkei 225 | S&P 500 |
| --- | --- | --- | --- | --- | --- | --- |
| 5 days | Gaussian processes | 3,10 | 2,31 | 5,48 | 0,88 | 0,67 |
| 5 days | Linear regression | 1,87 | 4,62 | 8,02 | 4,44 | 6,30 |
| 5 days | Multilayer perceptron | 17,56 | 3,70 | 10,21 | 22,05 | 10,00 |
| 5 days | SMOreg | 0,96 | 4,52 | 7,30 | 1,01 | 5,35 |
| 10 days | Gaussian processes | 3,22 | 2,80 | 6,02 | 1,19 | 0,71 |
| 10 days | Linear regression | 3,12 | 6,16 | 11,10 | 6,97 | 8,43 |
| 10 days | Multilayer perceptron | 21,09 | 4,48 | 12,03 | 24,97 | 12,50 |
| 10 days | SMOreg | 0,73 | 6,42 | 10,53 | 0,92 | 7,65 |
| 15 days | Gaussian processes | 3,01 | 3,23 | 5,38 | 2,21 | 1,08 |
| 15 days | Linear regression | 4,32 | 6,44 | 12,39 | 9,93 | 9,12 |
| 15 days | Multilayer perceptron | 20,10 | 4,17 | 12,57 | 24,88 | 12,58 |
| 15 days | SMOreg | 0,91 | 6,98 | 12,18 | 2,23 | 8,60 |
| 20 days | Gaussian processes | 3,80 | 6,23 | 5,37 | 4,86 | 3,91 |
| 20 days | Linear regression | 7,83 | 5,40 | 11,20 | 14,38 | 7,52 |
| 20 days | Multilayer perceptron | 16,57 | 4,56 | 11,04 | 23,53 | 10,43 |
| 20 days | SMOreg | 3,72 | 5,82 | 11,39 | 5,27 | 7,20 |

CONCLUSIONS

The goal of the article was to forecast stock market indexes using machine learning algorithms. The results of the analysis have shown that machine learning algorithms achieved highly accurate forecasting performance, although in some cases, specifically for Gaussian processes, the precision was less than highly accurate. This could be explained by the algorithm’s incompatibility with the data being predicted. The overall precision of all algorithms was better for shorter base period lengths and shorter forecast horizons, and also when all data were included in the analysis rather than when the evaluation was based only on the held-out 30 % of the training data.

The limitations of the article relate to the use of only historical data for the prediction of stock market index values. This is, however, the most common approach in forecasting stock price movements. The use of historical data corresponds to the technical analysis of the stock market, which studies historical market data, including prices and volumes, in the form of chart patterns and technical indicators. In this article, fundamental analysis was left out of the framework. Another important limitation of using machine learning algorithms to predict stock market indexes arises with unexpected or Black Swan events, such as the spread of COVID-19, when forecasts cannot be expected to be accurate. The performance of the machine learning algorithms evaluated in this article could be improved by including fundamental analysis, which measures a security’s intrinsic value by examining related economic and financial factors.

Recommendations for future research relate to further optimization of the algorithms used and investigation of the impact of COVID-19 on stock market indexes. Stock price prediction remains one of the most complex issues in finance because the factors that influence stock price formation are complex and hard to predict. An optimal prediction method based on machine learning algorithms could help investors determine their best buy-sell strategy and maximize their profit.

Appendix

Table A1. The best forecasting approaches for DAX close daily values, mean absolute error and mean squared error criteria.

| Forecast horizon | Base period length | Mean absolute error | | Mean absolute error | |
| --- | --- | --- | --- | --- | --- |
| | | Mean absolute error | Mean squared error | Mean absolute error | Mean squared error |
| 5 days | 10 years | SMOreg | SMOreg | Linear regression | Linear regression |
| 5 days | 5 years | Gaussian processes | Gaussian processes | Multilayer perceptron | Multilayer perceptron |
| 5 days | 1 year | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 10 days | 10 years | SMOreg | SMOreg | Linear regression | Linear regression |
| 10 days | 5 years | Gaussian processes | Gaussian processes | Multilayer perceptron | Multilayer perceptron |
| 10 days | 1 year | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 15 days | 10 years | SMOreg | SMOreg | Linear regression | Linear regression |
| 15 days | 5 years | Multilayer perceptron | Multilayer perceptron | Multilayer perceptron | Multilayer perceptron |
| 15 days | 1 year | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 20 days | 10 years | SMOreg | SMOreg | Linear regression | Linear regression |
| 20 days | 5 years | Multilayer perceptron | Multilayer perceptron | Multilayer perceptron | Multilayer perceptron |
| 20 days | 1 year | Multilayer perceptron | Multilayer perceptron | SMOreg | Gaussian processes |

Table A2. The best forecasting approaches for Dow Jones close daily values, mean absolute error and mean squared error criteria.

| Forecast horizon | Base period length | Mean absolute error | | Mean absolute error | |
| --- | --- | --- | --- | --- | --- |
| | | Mean absolute error | Mean squared error | Mean absolute error | Mean squared error |
| 5 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 5 days | 5 years | Multilayer perceptron | Multilayer perceptron | Linear regression | Linear regression |
| 5 days | 1 year | SMOreg | SMOreg | Gaussian processes | Gaussian processes |
| 10 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 10 days | 5 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 10 days | 1 year | SMOreg | SMOreg | Gaussian processes | Gaussian processes |
| 15 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 15 days | 5 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 15 days | 1 year | SMOreg | Linear regression | Gaussian processes | Gaussian processes |
| 20 days | 10 years | Linear regression | Linear regression | SMOreg | SMOreg |
| 20 days | 5 years | Multilayer perceptron | SMOreg | SMOreg | SMOreg |
| 20 days | 1 year | Linear regression | Multilayer perceptron | Multilayer perceptron | Multilayer perceptron |

Table A3. The best forecasting approaches for NASDAQ close daily values, mean absolute error and mean squared error criteria.

| Forecast horizon | Base period length | Mean absolute error | | Mean absolute error | |
| --- | --- | --- | --- | --- | --- |
| | | Mean absolute error | Mean squared error | Mean absolute error | Mean squared error |
| 5 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 5 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 5 days | 1 year | Gaussian processes | Gaussian processes | Gaussian processes | Gaussian processes |
| 10 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 10 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 10 days | 1 year | Gaussian processes | Gaussian processes | Gaussian processes | Gaussian processes |
| 15 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 15 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 15 days | 1 year | Gaussian processes | Gaussian processes | Gaussian processes | Gaussian processes |
| 20 days | 10 years | Multilayer perceptron | SMOreg | SMOreg | SMOreg |
| 20 days | 5 years | SMOreg | SMOreg | Multilayer perceptron | Multilayer perceptron |
| 20 days | 1 year | Gaussian processes | Multilayer perceptron | Gaussian processes | Gaussian processes |

Table A4. The best forecasting approaches for Nikkei 225 close daily values, mean absolute error and mean squared error criteria.

| Forecast horizon | Base period length | Mean absolute error | | Mean absolute error | |
| --- | --- | --- | --- | --- | --- |
| | | Mean absolute error | Mean squared error | Mean absolute error | Mean squared error |
| 5 days | 10 years | Multilayer perceptron | Multilayer perceptron | Linear regression | Linear regression |
| 5 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 5 days | 1 year | SMOreg | SMOreg | Gaussian processes | Gaussian processes |
| 10 days | 10 years | Multilayer perceptron | Multilayer perceptron | Linear regression | Linear regression |
| 10 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 10 days | 1 year | Linear regression | Linear regression | SMOreg | SMOreg |
| 15 days | 10 years | SMOreg | SMOreg | Linear regression | Linear regression |
| 15 days | 5 years | SMOreg | SMOreg | SMOreg | SMOreg |
| 15 days | 1 year | Multilayer perceptron | Multilayer perceptron | Gaussian processes | Gaussian processes |
| 20 days | 10 years | Linear regression | Linear regression | Linear regression | Linear regression |
| 20 days | 5 years | SMOreg | Linear regression | SMOreg | SMOreg |
| 20 days | 1 year | Multilayer perceptron | Multilayer perceptron | Gaussian processes | Gaussian processes |

Table A5. The best forecasting approaches for S&P 500 close daily values, mean absolute error and mean squared error criteria.

| Forecast horizon | Base period length | Mean absolute error | | Mean absolute error | |
| --- | --- | --- | --- | --- | --- |
| | | Mean absolute error | Mean squared error | Mean absolute error | Mean squared error |
| 5 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 5 days | 5 years | Gaussian processes | Gaussian processes | Linear regression | Linear regression |
| 5 days | 1 year | Gaussian processes | Gaussian processes | Gaussian processes | Gaussian processes |
| 10 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 10 days | 5 years | Gaussian processes | Gaussian processes | Linear regression | Linear regression |
| 10 days | 1 year | Gaussian processes | Gaussian processes | Gaussian processes | Gaussian processes |
| 15 days | 10 years | Multilayer perceptron | Multilayer perceptron | SMOreg | SMOreg |
| 15 days | 5 years | Gaussian processes | Gaussian processes | Linear regression | Linear regression |
| 15 days | 1 year | Gaussian processes | Linear regression | Gaussian processes | Gaussian processes |
| 20 days | 10 years | Multilayer perceptron | SMOreg | SMOreg | SMOreg |
| 20 days | 5 years | Gaussian processes | SMOreg | SMOreg | SMOreg |
| 20 days | 1 year | Multilayer perceptron | Multilayer perceptron | Gaussian processes | Gaussian processes |
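
In a few rows of Tables A1 to A5 the two criteria name different approaches (for example, the 15-day, 1-year row of Table A2 lists SMOreg under mean absolute error and linear regression under mean squared error). The sketch below illustrates how that can happen: squaring penalizes a single large miss more heavily than many small ones. The two forecasters `steady` and `spiky` and all values are hypothetical and are not taken from the article.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical values, for illustration only.
actual = [100.0, 100.0, 100.0, 100.0]
steady = [103.0, 97.0, 103.0, 97.0]    # always off by 3
spiky = [100.0, 100.0, 100.0, 110.0]   # exact three times, then off by 10

print("MAE:", mae(actual, steady), mae(actual, spiky))
print("MSE:", mse(actual, steady), mse(actual, spiky))

best_by_mae = min(("steady", "spiky"), key=lambda n: mae(actual, {"steady": steady, "spiky": spiky}[n]))
best_by_mse = min(("steady", "spiky"), key=lambda n: mse(actual, {"steady": steady, "spiky": spiky}[n]))
print(best_by_mae, best_by_mse)
```

Here the forecaster with one large miss has the lower mean absolute error, while the forecaster with many small misses has the lower mean squared error.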

References

[1] Kulkarni A.D. 2016. Formulation of a Prediction Index with the Help of WEKA Tool for Guiding the Stock Market Investors. Oriental journal of Computer science & technology, 9(3), 212-225.

[2] Das S. 2015. Applications of Artificial Intelligence in Machine Learning: Review and Prospect. International Journal of Computer Applications, 115(9), 31-41. https://doi.org/10.5120/20182-2402

[3] Pinheiro L. Stock Market Prediction with Deep Learning: A Character-based Neural Language Model for Event-based Trading. Brisbane: Proceedings of the Australasian Language Technology Association Workshop, 2017.

[4] Zekić. Network Applications in Stock Market Predictions - A Methodology Analysis. 1st May 2020. http://www.machine-learning.martinsewell.com/ann/Zeki98.pdf

[5] Al-Radaideh L. Predicting stock prices using data mining techniques. Brisbane: The International Arab Conference on Information Technology, 2013.

[6] Chopra S. 2019. Artificial Neural Networks Based Indian Stock Market Price Prediction: Before and After Demonetization. International Journal of Swarm Intelligence and Evolutionary Computation, 8(1), 1-17.

[7] De Faria E.L. 2009. Predicting the Brazilian stock market through neural networks and adaptive exponential smoothing methods. Expert Systems with Applications, 36(10), 12506-12509. https://doi.org/10.1016/j.eswa.2009.04.032

[8] Emami S. 2018. Predicting Trend of Stock Prices by Developing Data Mining Techniques with the Aim of Gaining Profit. Journal of Accounting & Marketing, 7(4), 1-12.

[9] Garg P. 2019. An Efficient Prediction of Share Price using Data Mining Techniques. International Journal of Engineering and Advanced Technology (IJEAT), 8(6), 3110-3115. https://doi.org/10.35940/ijeat.F9085.088619

[10] Iqbal Z. 2013. Efficient Machine Learning Techniques for Stock Market Prediction. International Journal of Engineering Research and Applications, 3(6), 855-867.

[11] Kannan K.S. Financial Stock Market Forecast using Data Mining Techniques. Hong Kong: Proceedings of the International MultiConference of Engineers and Computer Scientists I, IMECS 2010, 2010.

[12] Raso H. 2019. Predicting the Turkish Stock Market BIST 30 Index using Deep Learning. International Journal of Engineering Research and Development, 11(1), 253-265.

[13] Ryll L. Evaluating the Performance of Machine Learning Algorithms in Financial Market Forecasting: A Comprehensive Survey. Papers, 2019.

[15] Grudnitski G. 1993. Forecasting S&P and gold futures prices: An application of neural networks. Journal of Futures Markets, 13(6), 631-643. https://doi.org/10.1002/fut.3990130605

[16] Xu. Stock Market Trend Prediction with Sentiment Analysis based on LSTM Neural Network. (2019) Proceedings of the International MultiConference of Engineers and Computer Scientists 2019 IMECS. 1st May 2020. https://www.semanticscholar.org/paper/Stock-market-trend-prediction-with-sentiment-based-Jia-wei-Murata/54827c06228b557ef64f7a14dfc092a3b31e2cf9

[17] Arnerić J. 2018. Neural Network Approach in Forecasting Realized Variance Using High-Frequency Data. Business Systems Research, 9(2), 18-34. https://doi.org/10.2478/bsrj-2018-0016

[18] Aljinović Z. 2018. CVaR in Measuring Sector’s Risk on the Croatian Stock Exchange. Business Systems Research, 9(2), 8-17. https://doi.org/10.2478/bsrj-2018-0015

[19] Yahoo! Finance. DAX performance-index (^GDAXI). 1st May 2020. https://finance.yahoo.com/quote/%5EGDAXI/history?p=%5EGDAXI

[20] Yahoo! Finance. Dow Jones Industrial Average (^DJI). 1st May 2020. https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI

[21] Yahoo! Finance. NASDAQ Composite (^IXIC). 1st May 2020. https://finance.yahoo.com/quote/%5EIXIC/history?p=%5EIXIC

[22] Yahoo! Finance. Nikkei 225 (^N225). 1st May 2020. https://finance.yahoo.com/quote/%5EN225/history?p=%5EN225

[23] Yahoo! Finance. S&P 500 (^GSPC). 1st May 2020. https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC

[24] University of Waikato. WEKA: The workbench for machine learning. 1st May 2020. https://www.cs.waikato.ac.nz/ml/weka

[25] Pentaho Data Mining. Time Series Analysis and Forecasting with Weka. 1st May 2020. https://wiki.pentaho.com/display/DATAMINING/Time+Series+Analysis+and+Forecasting+with+Weka

[26] Rasmussen C.E. Gaussian Processes for Machine Learning. MIT Press, 2006.

[27] Balaji Prabhu B.V. 2018. Performance Analysis of the Regression and Time Series Predictive Models using Parallel Implementation for Agricultural Data. Procedia Computer Science, 132, 198-207. https://doi.org/10.1016/j.procs.2018.05.187

[28] Popescu M.C. 2009. Multilayer Perceptron and Neural Networks. WSEAS Transactions on Circuits and Systems, 8(7), 579-588.

[29] Smola A.J. 2004. A tutorial on support vector regression. Statistics and Computing, 14, 199-222. https://doi.org/10.1023/B:STCO.0000035301.49549.88

