A Dynamic Systems Model for an Economic Evaluation of Sales Forecasting Methods

Sales forecasts are essential for a smooth workflow and cost optimization. Usually, they are assessed using statistical error measures, which might be misleading in a business context. This paper proposes a new dynamic systems model for an economic evaluation of sales forecasts. The model describes the development of the inventory level over time and derives the resulting overstock and shortage costs. It is tested on roughly 3,000 real-world time series and compared with the commonly used approach based on statistical measures. The experiments show that different statistical measures do not evaluate forecasts coherently, making them even less suitable for a practical economic application.


INTRODUCTION
Working with precise sales forecasts is crucial for the supply chains of production companies or retailers [1]. It enables a smooth workflow and reduces waste along the value chain [2]. The longer the lead times of products are, the more important it is to plan ahead accurately [3]. Due to current crises, the lead times of most products have increased substantially [4], amplifying the relevance of reliable forecasts. The challenge is to find the most suitable forecasting method among the many existing ones, each with its own benefits [5]. Usually, a forecasting method is chosen based on statistical accuracy measures, also called error metrics. However, choosing a sales forecasting method solely based on statistical accuracy measures can have disadvantages, as the following toy example illustrates: We assume a given demand for a product and we want to compare two forecasting models that forecast the demand for five periods. The forecast of Model 1 is one unit below the actual demand for the first three periods and one unit above the actual demand for the last two periods (see Fig. 1). The forecast of Model 2 alternates between one unit above and one unit below the actual demand (see Fig. 2). Both models have the same absolute deviations in every period. Thus, statistical measures such as the Root Mean Square Error (RMSE) or the Mean Absolute Percentage Error (MAPE) are the same for both models, suggesting an equally good performance. Nevertheless, from a business perspective, Model 2 outperforms Model 1 if the goods are non-perishable and their value does not decrease over time. If the company had acted upon Model 1, it would not have been able to meet the customers' demand in the first three periods (see Fig. 3). If the company had relied upon Model 2, it would have satisfied the customers' demand in every period.
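The toy example can be checked numerically. The following minimal Python sketch assumes a constant demand of 10 units per period (the example does not state the demand level) and a simplified, hypothetical ordering rule in which the company stocks exactly the forecast quantity each period and carries unsold goods over. The function names are illustrative.

```python
import math

def rmse(actual, forecast):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mape(actual, forecast):
    """Mean Absolute Percentage Error as a fraction (requires non-zero actuals)."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def shortage_and_leftover(actual, forecast):
    """Stock the forecast quantity each period; unsold goods carry over.

    Returns the total unmet demand and the sum of end-of-period stock levels.
    """
    stock = 0
    shortage = 0
    leftover = 0
    for demand, ordered in zip(actual, forecast):
        available = stock + ordered
        sold = min(demand, available)
        shortage += demand - sold      # demand that could not be served
        stock = available - sold       # goods carried into the next period
        leftover += stock
    return shortage, leftover

demand = [10, 10, 10, 10, 10]          # assumed demand level
model_1 = [9, 9, 9, 11, 11]            # one below for three periods, then one above
model_2 = [11, 9, 11, 9, 11]           # alternating one above / one below

for name, forecast in (("Model 1", model_1), ("Model 2", model_2)):
    print(name, rmse(demand, forecast), mape(demand, forecast),
          shortage_and_leftover(demand, forecast))
```

Under these assumptions both models have identical RMSE and MAPE values and the same accumulated leftover stock, but only Model 1 produces unmet demand.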
Thus, considering the costs of product shortage (which would have been higher for Model 1) and the costs of overstock (which would have been equal) over time seems to be a more suitable way for evaluating sales forecasts. Some papers already considered costs in the context of forecasting, but never the development of stock levels over time. In order to fill this gap, we introduce an intuitive dynamic systems model for an economic evaluation of sales forecasting methods, considering overstock and shortage costs that derive from the development of inventory levels.
The new method is tested on a real-world dataset of about 3,000 product sales time series of a German raw materials wholesaler. Sales forecasts are created and evaluated using both the dynamic systems model and statistical measures. The evaluations are compared with regard to their similarity and their preference for a certain forecasting method.

RELATED WORK

Sales Forecasting Methods
There are many different methods for sales forecasting [5]. They rely on time series analysis and forecast future demand based on historic sales [6]. External data can be integrated into forecasts to improve them [7]. The forecasting models are usually univariate point forecasts. The models can either be statistical time series models or machine learning models. The statistical models are easy to implement and intuitively explainable. Among what [8] considers classical sales forecasting methods are ARIMA (Autoregressive Integrated Moving Average) and Holt-Winters. As the name suggests, the ARIMA model combines an autoregressive component, which links past and present values in a way similar to how autocorrelation is computed, and a moving average component [9]. Holt-Winters is a seasonal smoothing method that can capture both trends and seasonal behavior within a time series [10,11]. Recently, a lot of research has been done on forecasting with machine learning methods [12]. They can provide forecasts with higher accuracy [13], but require more run time and are not intuitively explainable. However, as the no-free-lunch theorem suggests, there is no single best method [14]. Depending on the characteristics of a dataset, one method achieves more precise forecasts than another method. Consequently, it is advisable to test several forecasting algorithms and choose the most suitable one.

Statistical Sales Forecast Evaluation
The performance of sales forecasts can be evaluated by applying point forecast error metrics to a test dataset or multiple test datasets in case of (rolling) time series cross-validation. Ref. [15] conducted a survey on forecast error measures and found that 23 different measures are in use (see Tab. 1). A list providing the full names of the error measures, whose acronyms are given in the table, can be found in the appendix.
In the context of sales forecasting, the most commonly applied error measures are the RMSE [13,8,16,17,18], the MAPE [17,18,19] and the MAE (Mean Absolute Error) [16,18,20]. The MAPE is easy to interpret but cannot be computed if the time series contains zeros [21]. As a remedy, its symmetric version sMAPE (Symmetric Mean Absolute Percentage Error) can be applied. It has the further advantage of penalizing under-forecasts more severely than over-forecasts. As the time series for our experimental evaluation (see Section 4) contain zeros, we focus on the three error measures RMSE, MAE and sMAPE. The error measures are computed based on the actual sales $S_t$ and the forecasted sales $\hat{S}_t$ at time t over n periods [15]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(S_t - \hat{S}_t\right)^2} \tag{1}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|S_t - \hat{S}_t\right| \tag{2}$$

$$\mathrm{sMAPE} = \frac{1}{n}\sum_{t=1}^{n}\frac{2\left|S_t - \hat{S}_t\right|}{\left|S_t\right| + \left|\hat{S}_t\right|} \tag{3}$$

Beyond the challenge of choosing an error measure, the problem arises that none of them can provide us with a unique best forecasting method. What is an excellent RMSE value for one forecasting method in one dataset might be a mediocre value for another dataset. Depending on the demand stability, some products are easier to forecast than others. Thus, to achieve an objective evaluation of a forecasting method, it is advisable to create a baseline forecast, to which other more advanced methods can be compared in the process of benchmarking [5]. A very common baseline method is naïve forecasting, which simply assumes the future sales to be equal to the most recently observed sales [22].
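The three error measures and the naïve baseline can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments. The sMAPE variant shown (factor 2 in the numerator, no multiplication by 100) and the treatment of periods in which both actual and forecasted sales are zero are assumptions of this sketch.

```python
import math

def rmse(actual, forecast):
    """Root Mean Square Error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean Absolute Error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def smape(actual, forecast):
    """Symmetric Mean Absolute Percentage Error, returned as a fraction."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        # Assumption: zero sales with a zero forecast contribute no error.
        if denominator > 0:
            total += 2 * abs(a - f) / denominator
    return total / len(actual)

def naive_forecast(history):
    """Naive baseline: the next value equals the most recent observation."""
    return history[-1]

# Same absolute error of 50 units, once as under-forecast and once as over-forecast.
print(smape([100], [50]), smape([100], [150]))
```

The last line illustrates the asymmetry mentioned above: for the same absolute deviation, the under-forecast receives the larger sMAPE value because its denominator is smaller.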

Inventory-Related Costs
Holding inventory entails different kinds of costs. In order to minimize the overall costs, the different kinds need to be balanced. Literature on reducing inventory-related costs focuses on capital commitment costs, order costs, and shortage costs. Sometimes, costs for the inventory control system are also considered [23]. The capital commitment costs are usually priced with the weighted average cost of capital (WACC) [24]. Storage costs can usually be disregarded because most companies own their warehouses, making the storage costs fixed [23]. Only if companies are charged fees that vary with the inventory level should they consider these fees when deciding on inventory ordering policies [24]. Shortage costs measure the costs that occur from not being able to sell a product. These include the lost profit margin as well as reputational damage and a decrease in customers' loyalty [25].

Cost-Considering Sales Forecast Evaluation
Sales forecasts have rarely been evaluated in terms of costs. The authors of [26] focused on reducing overstock costs with the help of statistical forecasting methods but disregarded shortage costs. Ref. [27] compared the sales forecasts of five neural networks in terms of supply chain-related costs. They derived the costs by looking at the deviations between the forecast and the actual demand for every period individually, not considering the development of inventory levels.
However, the question of how many goods to order to maximize the profit has been studied extensively. The so-called newsvendor problem is a popular problem in logistics and many approaches to solving it have been published [28]. It also considers shortage costs and tries to optimize the order quantity [25], but it only regards one period. This is why it is not applicable in this paper.

THE DYNAMIC SYSTEMS MODEL
A dynamic systems model is an analytical model that describes the changes in a system over time [29]. The proposed dynamic systems model aims to describe the changes in inventory level and the resulting costs over time. To use the model as an economic evaluation for sales forecasting methods, some assumptions had to be made:
• Goods are non-perishable and their value does not decrease over time
• Storage costs are fixed costs
• Goods are ordered at the beginning of each period, order costs do not change
• The ordered goods are delivered at the beginning of period t + LT (lead time)
• The lead time per product is fixed
• The costs for holding a safety stock are calculated into the product price, thus only costs for an inventory level above the safety stock are considered overstock costs.
Ref. [30] reviewed and compared deterministic and statistical methods for the calculation of a safety stock. They found the statistical methods to be more accurate. Thus, we rely on the statistical method, for which the aforementioned assumptions hold, e.g. that the lead time is fixed. The dynamic systems model determines the safety stock $SS$ based on a service factor $Z$, which depends on the desired service level, the lead time $LT$, and the standard deviation of the product's demand $\sigma_D$:

$$SS = Z \cdot \sqrt{LT} \cdot \sigma_D$$

In the first step, the dynamic systems model describes the inventory level development, later the resulting costs. For every period t, the model calculates the inventory level at the end of that period ($I_t^{end}$) by subtracting the actual sales ($S_t$) from the inventory level at the beginning of that period ($I_t^{begin}$). Naturally, an inventory level can never be negative. Thus, the formula to calculate $I_t^{end}$ is given by

$$I_t^{end} = \max\left(I_t^{begin} - S_t,\ 0\right)$$

Here, $I_t^{begin}$ depends on the inventory level at the end of the prior period ($I_{t-1}^{end}$) and the goods that had been ordered and are delivered at the beginning of each period ($D_t$) via

$$I_t^{begin} = I_{t-1}^{end} + D_t$$

The number of goods delivered equals the number of goods that have been ordered in a previous period. The time gap between order and delivery is determined by the lead time of a product:

$$D_t = O_{t-LT}$$

The number of goods ordered ($O_t$) depends on the predicted sales for the period in which the goods will be delivered ($\hat{S}_{t+LT}$), the predicted sales for the period in which the goods are ordered ($\hat{S}_t$), the safety stock $SS$ and the inventory level at the moment of order ($I_t^{begin}$):

$$O_t = \hat{S}_{t+LT} + \hat{S}_t + SS - I_t^{begin}$$

The derivation of this formula can be found in the Appendix. As the dynamic systems model aims to describe the inventory level development over a certain period of time, periods prior to t = 0 need to be modeled to calculate the number of goods that are delivered in t = 0. For the periods prior to t = 0, a separate assumption is made. Moreover, due to the mutual dependency of $O_t$ and $I_t^{begin}$, an assumption needs to be made about the initial inventory level in the first modeled period $t_0$:

$$I_{t_0}^{begin} = \hat{S}_{t_0} + SS$$

Tab. 2 displays another toy example of how the development of the inventory level is calculated. The initial inventory level $I_{-2}^{begin}$ is the sum of the predicted sales for the period $\hat{S}_{-2}$, which are 51, and the safety stock $SS$, which is 634. The number of goods ordered in t = -2 is calculated as the sum of the predicted sales for that period, the predicted sales of the period in which the goods are meant to arrive (t = 0), and the safety stock. From this sum, only the inventory level at the beginning of the regarded period needs to be subtracted. Thus, $O_{-2} = 51 + 50 + 634 - 685 = 50$. $I_{-2}^{end}$ results from subtracting the actual sales of 40 from 685.
The example shows a sharp increase in sales in period t = 1. This leads to a stockout at the end of that period. However, due to the application of the naïve forecasting method, the increase in sales in t = 1 also leads to a very large order and to nearly 7,000 goods being delivered in period t = 4.
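The inventory recursion described above can be sketched as a short simulation. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation. The safety stock uses the common statistical formula (service factor times the square root of the lead time times the demand standard deviation). Deliveries in the first `lead_time` simulated periods, for which no order exists inside the simulation, are taken from a hypothetical `early_deliveries` argument that defaults to zero. Negative order quantities are not clamped because the text does not cover that case. Only the values 51, 50, 634 and 40 in the usage example come from the toy example; all other numbers are made up.

```python
import math

def safety_stock(service_factor, lead_time, demand_std):
    """Safety stock from service factor Z, lead time LT and demand std. deviation."""
    return service_factor * math.sqrt(lead_time) * demand_std

def simulate_inventory(actual, forecast, ss, lead_time, early_deliveries=None):
    """Simulate inventory levels; index 0 is the first modeled period.

    actual   : actual sales per simulated period
    forecast : predicted sales, covering len(actual) + lead_time periods
    ss       : safety stock
    Returns the lists (begin, end, orders, deliveries).
    """
    n = len(actual)
    if len(forecast) < n + lead_time:
        raise ValueError("forecast must cover len(actual) + lead_time periods")
    if early_deliveries is None:
        early_deliveries = [0] * lead_time   # assumption of this sketch
    begin, end, orders, deliveries = [], [], [], []
    for i in range(n):
        # Goods ordered lead_time periods earlier arrive now.
        delivered = early_deliveries[i] if i < lead_time else orders[i - lead_time]
        if i == 0:
            # Initial level: predicted sales plus safety stock (delivery not added).
            level = forecast[0] + ss
        else:
            level = end[i - 1] + delivered
        # Order covers this period, the arrival period and the safety stock.
        order = forecast[i + lead_time] + forecast[i] + ss - level
        begin.append(level)
        orders.append(order)
        deliveries.append(delivered)
        end.append(max(level - actual[i], 0))   # inventory cannot be negative
    return begin, end, orders, deliveries

begin, end, orders, deliveries = simulate_inventory(
    actual=[40, 45, 50], forecast=[51, 48, 50, 52, 49], ss=634, lead_time=2)
print(begin, end, orders, deliveries)
```

The first simulated period reproduces the figures of the toy example: an initial level of 685, an order of 50 and an end-of-period level of 645.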
In the second step, the dynamic systems model calculates overstock and shortage costs based on the previously calculated inventory level development. Overstock costs per period t ($C_t^{over}$) are calculated by multiplying the difference between the average inventory level of that period ($\bar{I}_t$) and the safety stock $SS$ with an overstock cost rate $w$, most likely the WACC. If the average inventory level is below the safety stock, which implies that there is no overstock, the product becomes negative. Therefore, the product is bounded below by zero:

$$C_t^{over} = \max\left(\left(\bar{I}_t - SS\right) \cdot w,\ 0\right)$$

The shortage costs per period t ($C_t^{short}$) are calculated by multiplying a shortage cost rate $m$, e.g. the product's profit margin, with the difference between the actual sales that could have taken place ($S_t$) and the inventory level at the beginning of that period ($I_t^{begin}$). Again, the product is bounded below by zero, as there are no shortage costs if the inventory level is higher than the sales:

$$C_t^{short} = \max\left(\left(S_t - I_t^{begin}\right) \cdot m,\ 0\right)$$

Finally, the overall costs per period t are calculated as the sum of overstock and shortage costs:

$$C_t = C_t^{over} + C_t^{short}$$

Tab. 3 displays the continuation of the second toy example from Tab. 2. Based on the inventory levels at the beginning and end of each period, $I_t^{begin}$ and $I_t^{end}$, the overstock and shortage costs are calculated. The enormous increase in sales in period t = 1, which can be observed in Tab. 2, leads to shortage costs of 327.24. The delivery of nearly 7,000 goods in period t = 4, however, leads to overstock costs in period t = 4 and the following periods.
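The cost step can be sketched in the same way. The following minimal Python function is an illustration, not the authors' code. It assumes that the average inventory level of a period is the mean of the levels at its beginning and end, and the cost rates and quantities in the usage example are hypothetical.

```python
def period_costs(begin, end, actual, ss, overstock_rate, shortage_rate):
    """Return (overstock, shortage, total) costs for one period.

    begin, end     : inventory level at the beginning and end of the period
    actual         : sales that could have taken place in the period
    ss             : safety stock
    overstock_rate : cost rate for stock above the safety stock (e.g. the WACC)
    shortage_rate  : cost rate for unmet demand (e.g. the profit margin)
    """
    # Assumption: the period's average level is the mean of begin and end.
    average_level = (begin + end) / 2
    overstock = max((average_level - ss) * overstock_rate, 0.0)
    shortage = max((actual - begin) * shortage_rate, 0.0)
    return overstock, shortage, overstock + shortage

# Period with stock above the safety stock and no shortage.
print(period_costs(1000, 900, 100, 634, 0.08, 0.2))
# Period with a stockout: demand of 150 meets a beginning level of 100.
print(period_costs(100, 0, 150, 634, 0.08, 0.2))
```

Summing the totals over all periods of the test horizon gives one cost figure per forecasting method, which can then be compared across methods.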

EXPERIMENTAL SET-UP
The dynamic systems model is empirically tested using data from a German raw materials wholesaler. The company provided documentation of its sales from 1st September 2016 until 31st July 2022. Every product sale can be assigned to one of five material divisions. For each division, estimated profit margins were given, which serve as shortage costs. The WACC was used to measure the overstock costs. Because some products were sold very infrequently, filters were applied to the data. Refs. [31,32] found that for time series forecasts on a monthly basis, at least 24 observations as training data are required in order to create reasonable forecasts. Rolling time series cross-validation was applied and for all time series, the test data were the sales from August 2021 until July 2022 (the last year of observations). The training data were accordingly the sales prior to the test data; the exact time period also depended on the lead time of the products.
Time series were included if they met the following criteria:
• Date of last sale was between May and July 2022 (to exclude products that are not sold anymore)
• Average sale frequency ≥ twice per month, meaning that the number of sales has to be higher than twice the number of months between the first and last sale (to exclude products with highly intermittent demand)
• The time series contained at least 36 observations plus the amount of lead time in observations (thus, at least 24 observations could be used for training the forecasting model).
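The inclusion criteria can be expressed as a simple filter. The sketch below is illustrative only: the function and parameter names are hypothetical, months are counted as calendar-month differences, and the frequency criterion is implemented as "at least twice per month", which is one reading of the criterion above.

```python
from datetime import date

def months_between(first, last):
    """Number of calendar months between two dates (assumed counting rule)."""
    return (last.year - first.year) * 12 + (last.month - first.month)

def is_eligible(first_sale, last_sale, n_sales, n_observations, lead_time):
    """Check the three inclusion criteria for one product time series."""
    # Last sale between May and July 2022.
    recently_sold = date(2022, 5, 1) <= last_sale <= date(2022, 7, 31)
    # On average at least two sales per month between first and last sale.
    frequent = n_sales >= 2 * months_between(first_sale, last_sale)
    # 24 training + 12 test observations, plus the lead time.
    long_enough = n_observations >= 36 + lead_time
    return recently_sold and frequent and long_enough

print(is_eligible(date(2018, 1, 15), date(2022, 6, 10),
                  n_sales=200, n_observations=54, lead_time=3))
```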
These requirements limited the dataset to time series from 2,911 different products. Tab. 4 displays how many product time series from which division and with which lead time were included. For these products, one-step rolling forecasts were created using the statistical forecasting methods naïve forecasting, ARIMA, and Holt-Winters' additive approach. They are considered classical sales forecasting methods and are easy to implement. As we propose a new dynamic systems model and focus on the comparison of forecast evaluation measures and not forecasting methods, we only consider statistical time series approaches. If forecasts happened to be negative, they were adjusted to zero. All forecasts were rounded to integers due to the application context. The forecasts were evaluated using the dynamic systems model and the statistical measures RMSE, MAE, and sMAPE. The evaluation with the dynamic systems model was performed assuming three different service levels and thus three different safety stocks per product. Considered service levels were 90%, 95%, and 99%, corresponding to service factors Z equal to 1.3, 1.6, and 2.3 [30].
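The forecast post-processing and the naïve one-step rolling forecast described above can be sketched as follows. This is a minimal illustration with hypothetical function names. Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding rule used in the experiments. The service factors are the values stated in the text.

```python
# Service level -> service factor Z, as stated in the text.
SERVICE_FACTORS = {0.90: 1.3, 0.95: 1.6, 0.99: 2.3}

def postprocess(forecasts):
    """Set negative forecasts to zero and round all forecasts to integers."""
    return [max(0, round(f)) for f in forecasts]

def naive_rolling_forecast(train, test):
    """One-step rolling naive forecasts over the test period.

    Each forecast equals the most recent observation; after every step the
    observed test value is appended to the history.
    """
    history = list(train)
    forecasts = []
    for observed in test:
        forecasts.append(history[-1])
        history.append(observed)   # roll the forecast origin forward
    return forecasts

print(postprocess([-1.2, 3.6, 2.4]))
print(naive_rolling_forecast([5, 7], [6, 8, 9]))
```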

RESULTS
Each evaluation metric assessed one forecasting method to be most suitable for each product. Tab. 5 displays how often these assessments were coherent and how often different evaluation measures came to different results. The upper part of the table displays the assessments for all 2,911 time series. The middle and lower parts show the results only for the divisions with either the lowest or highest profit margins. The highest coherence among the statistical measures can be found between the RMSE and the MAE. This is plausible because both measures are absolute forecasting errors. However, even these two similar measures found different best methods for a quarter of the time series. The coherence among the dynamic systems model's evaluations with different safety stocks is higher, up to 92.75%. This might imply that the safety stock level does not have a huge impact on the choice of the most suitable forecasting method. When comparing the dynamic systems model's evaluation with those of the statistical measures, it is striking that they rarely come to the same conclusions. The coherence between the RMSE and MAE and the dynamic systems model's evaluation ranges between 54% and 57%. For the sMAPE, it is even lower, with values between 41% and 44%.
The lower part of Tab. 5 displays the coherence for the divisions with the lowest and the highest profit margin (shortage costs). The comparison shows that the higher the profit margin, the more the results diverge. For the division with the lowest margin, there is a 92.90% coherence between the dynamic systems model with a service level of 95% and 99%. In the division with the highest profit margin, this value is 84.35%, which is considerably lower. The coherence with the statistical measures also decreases with increasing profit margins.
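The coherence figures reported above are shares of time series for which two evaluation measures select the same best forecasting method. A minimal Python sketch of this computation is given below; the data layout (one dictionary of scores per time series, lower is better) and the example values are assumptions for illustration.

```python
def best_method(scores):
    """Return the method with the lowest error or cost."""
    return min(scores, key=scores.get)

def coherence(scores_a, scores_b):
    """Share of time series on which two measures pick the same best method.

    scores_a, scores_b: lists with one {method: score} dict per time series.
    """
    same = sum(best_method(a) == best_method(b) for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)

# Hypothetical scores of two measures for two time series.
rmse_scores = [{"naive": 3.0, "ARIMA": 1.0, "HW": 2.0},
               {"naive": 5.0, "ARIMA": 4.0, "HW": 2.0}]
cost_scores = [{"naive": 9.0, "ARIMA": 7.0, "HW": 8.0},
               {"naive": 5.0, "ARIMA": 1.0, "HW": 2.0}]
print(coherence(rmse_scores, cost_scores))
```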
Tab. 6 displays the average of the ranks the evaluation measures assigned to the forecasting methods. The lower the rank, the better the performance of the method. Across all evaluation measures and all time series, ARIMA performed best. Interestingly, ARIMA is given a better evaluation by the statistical measures and a slightly worse evaluation by the dynamic systems models. The opposite happened for the Holt-Winters method: it was ranked better by the dynamic systems models than by the RMSE, MAE, and sMAPE. The naïve method is clearly the least accurate forecasting method. This is not surprising because it was used as a simple benchmark method.
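Average ranks of this kind can be computed by ranking the methods per time series and averaging the ranks per method. The following Python sketch is illustrative; the data layout and example values are hypothetical, and ties are broken by the order in which the methods appear, which is an assumption of this sketch.

```python
def ranks(scores):
    """Rank methods for one time series; rank 1 is the lowest score."""
    ordered = sorted(scores, key=scores.get)
    return {method: position for position, method in enumerate(ordered, start=1)}

def average_ranks(all_scores):
    """Average rank per method over a list of {method: score} dicts."""
    totals = {}
    for scores in all_scores:
        for method, rank in ranks(scores).items():
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(all_scores) for method, total in totals.items()}

# Hypothetical scores of one evaluation measure for two time series.
example_scores = [{"naive": 3.0, "ARIMA": 1.0, "HW": 2.0},
                  {"naive": 5.0, "ARIMA": 4.0, "HW": 2.0}]
print(average_ranks(example_scores))
```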
The comparison of the ranks for different lead times shows that the naïve method performs better for a short lead time according to all measures except the sMAPE. The change is intuitive because the naïve method assumes the next sales to be equal to the last observed sales. If the lead time is 1 and the demand quite stable, the naïve prediction can be reasonable. Moreover, it can be observed that the higher the lead time, the better Holt-Winters performs compared to ARIMA. For a lead time of l = 5, the dynamic systems models rank Holt-Winters better than ARIMA. However, the statistical measures still rate ARIMA as the best method.
Fig. 4 displays the development of the average overstock and shortage costs per forecasting method with an increased service level. Naturally, the overstock costs increase with the service level and the shortage costs decrease. However, the decrease in shortage costs seems higher than the increase in overstock costs, especially going from a service level of 95% to 99%. It is also striking that the overstock costs are more than double the shortage costs, and that the difference between the overstock costs of different forecasting methods is considerably higher than the difference in the shortage costs.
When comparing the forecasting methods, it is striking that the naïve method has both the highest overstock and the highest shortage costs. ARIMA has the lowest overstock costs, but only the second-lowest shortage costs. Holt-Winters consequently has the lowest shortage costs and the second-lowest overstock costs.
TECHNICAL JOURNAL 17, 3(2023), 397-404

DISCUSSION
The experiments showed a weakness of the traditional statistical evaluation measures. Even the commonly used RMSE, MAE, and sMAPE did not make coherent statements about the most suitable forecasting method for a time series. Moreover, for roughly half the time series, the cost-considering dynamic systems model came to a different conclusion about the best method than the evaluations based upon statistical measures. However, considering costs is crucial for every business context. The dynamic systems model describes the development of the inventory level over time and derives the resulting shortage and overstock costs. Nevertheless, the shortage costs might have been underestimated because in the real-world dataset, purchases that customers wanted to make but could not due to stockouts of the raw material wholesaler were not documented. Another weakness is that the model does not include order-related costs. It just assumes regular orders and fixed order costs. In practice, order costs vary and, for example, quantity discounts could encourage ordering a larger number of goods at once.
Moreover, the assumptions stated in section 3 could be questioned. One assumption, for example, is that lead times are fixed. Recently, during the pandemic, we have seen that lead times can change significantly and disturb a steady supply of goods.

CONCLUSION
Literature provides many different methods for evaluating a sales forecast. We have empirically shown on a real-world dataset of 2,911 time series that evaluations by the most commonly used measures RMSE, MAE and sMAPE (as an adaptation of MAPE) differ considerably. Moreover, this paper has introduced a dynamic systems model to evaluate sales forecasts based on shortage and overstock costs. The costs are measured as a percentage of a product's price and can be adapted to the company's needs. In the experiments in this paper, the shortage costs were assumed to be equal to a product's profit margin and the overstock costs to a company's WACC. However, they could also be adapted to include penalty shortage costs for reputation loss in case of a stockout.
The main advantage of the dynamic systems model is that it considers the development of the inventory level over time. It can reproduce the costs that would arise from ordering based on a certain sales forecast. Thus, it helps to choose the forecasting method that minimizes inventory-related costs, which in turn supports profit and a smooth inventory flow.
In the experiments, three different service levels were considered. It was shown that the evaluations of the forecasting methods do not differ considerably across the service levels. However, especially for products with a high profit margin, the service level does have an impact on the evaluation. In future work, the service level could be integrated into the dynamic systems model as a tuning parameter to optimize the overall costs. Moreover, the dynamic systems model could be extended into a more complex model that considers variable lead times, storage costs, and order costs.