Application of the bootstrap method on a large input data set - case study western part of the Sava Depression

The bootstrap method is a nonparametric statistical method that, by resampling the input data set, yields a new data set that is normally distributed. Due to various factors, large sets of deep geological data are difficult to obtain, and in most cases the available data are not normally distributed. Therefore, it is necessary to introduce a statistical tool that yields a data set on which statistical analyses can be performed. The bootstrap method was applied to field "A", reservoir "L", located in the western part of the Sava Depression. It was applied to the geological variable of porosity on a set of 25 data. The minimum number of resamplings required for a large sample to obtain a normal distribution is slightly more than 1000 (normality was reached between 1050 and 1100 resamplings). The interval estimate of porosity for reservoir "L" obtained by the bootstrap method is 0.1875 to 0.2144 at the 95% confidence level.


Introduction
Deep geological data are characterized by relatively small data sets (<20), and in most cases the input data for analysis are not normally distributed. The uneven distribution of input data is a consequence of the relatively small number of wells drilled in the analyzed area, the lack of logging measurements, the derivation of geological data from correlations with neighboring wells, etc. In small oil and gas fields, complex geological structures and pronounced tectonics very often mean that hydrocarbons are produced from smaller hydrodynamic units, which results in a smaller input data set for the analysis of geological variables. In order to obtain the most reliable data on the geological variables that are crucial in the geological development of the reservoir (porosity, permeability, fluid saturation), it is necessary to apply a reliable statistical tool. The bootstrap method is applicable to the interval estimation of individual geological variables and to assessing the reliability of those intervals. It has a wide application in various branches of science (Novoa and Mendez, 2009). In earlier work, the porosity and the cost of injection of formation water in reservoir "K" were analyzed for a small input data set. In this paper, a set of input data for the geological variable of porosity (25 data) in field "A", reservoir "L", which is located in the western part of the Sava Depression, is analyzed. The number of resamplings is increased until a normal distribution is obtained. Normality is tested with the Anderson-Darling (A-D) and Kolmogorov-Smirnov (K-S) statistical tests after each specified number of resamplings. Once the required number of resamplings has been determined (that is, a normal distribution of the data has been obtained), the interval value of the porosity of reservoir "L" is estimated.

Methods
This section describes the geological setting of the investigated area, the mathematical settings of the bootstrap method, and the tests for the existence of a normal distribution of a data set. These descriptions are needed to understand the purpose and application of the bootstrap method on a large data sample that is not normally distributed.

Geological settings of the investigated area
The investigated field "A" is located in the western part of the Sava Depression within the Croatian part of the Pannonian Basin System (CPBS). The area of the western part of the Sava Depression is 8000 km² (Malvić et al., 2020a), and it contains the oldest and the largest number of oil fields in the entire CPBS. The position of the Sava Depression and field "A" within the CPBS is shown in Figure 1.
Their total thickness in the deepest part of the Sava Depression reaches up to 800 meters, while it is 100-200 meters in the margins of the depression (Vrbanac et al., 2010). Neogene turbidites with lacustrine pelitic sedimentation formed thick heterogeneous sequences of sandstones and marls (totalling several hundreds to some thousands of metres in thickness in different depressional parts) of Upper Miocene age in northern Croatia (Malvić, 2016). The source of the turbidite material of today's deposits in the Sava Depression is the Eastern Alps (Malvić, 2012). Hydrocarbon reservoirs have been confirmed in all formations, except in the youngest Lonja Formation (see Figure 2). Most hydrocarbons were produced from the Upper Pannonian and Lower Pontian reservoirs in the Sava Depression.
In the investigated structure, hydrocarbon reservoirs were discovered in Upper Miocene sandstones, from which hydrocarbons are still produced by secondary recovery methods. This paper analyzes the porosity of the "L" reservoir in oil field "A" (see Figure 2). Porosity data were obtained by a combination of laboratory core measurements and interpretation of logging measurements.

Mathematical settings of the bootstrap method
The bootstrap method is a nonparametric statistical method that provides an interval estimate of the value of the analyzed variable by random repeated resampling of the input data set. A data set with fewer than 20 values is considered a small input set (Malvić et al., 2019a). With a sufficient number of resamplings of the input data set, the method is an ideal statistical tool for obtaining a normal distribution of the analyzed variable (provided that the variable by its nature exhibits such a property). Once a normal distribution is ensured, it is possible to make reliable basic statistical calculations of the interval estimate, expectation and variance, and to apply parametric statistical tests. The procedure for calculating the bootstrap method is shown in Figure 3. There are several types of bootstrap methods, including the Bayesian bootstrap, smooth bootstrap, parametric bootstrap, wild bootstrap, etc. In this paper, the smooth bootstrap method is applied, because it is applicable to the analysis of geological variables (Ivšinović et al., 2021). In the smooth bootstrap method, the input set does not change its size. Resampling fills a new set with data drawn at random, with replacement, from the input data set. The mean value of each new data set (which contains the same number of data as the original) is calculated and becomes an element of the bootstrap data set. The number of realizations depends on the nature of the input data set.
The mean value of each resampled input data set and of the bootstrap sample is calculated according to mathematical equations described in the literature. The quantities in these equations are: S_m - standard deviation of the bootstrap sample; the arithmetic mean of the bootstrap sample; z - value from the normal distribution; m - number of resampled data sets.
The confidence level usually chosen for the interval estimate is 95% (Dogan, 2017). The steps are repeated as many times as necessary for the new bootstrap sample, derived from an input data set that is not normally distributed, to become normally distributed.
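The resampling steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the porosity values are hypothetical, the function names are invented for illustration, and the interval is assumed to take the common normal-approximation form (bootstrap mean ± z · S_m), which may differ from the equations used in the paper.

```python
import random
import statistics
from statistics import NormalDist


def bootstrap_means(data, m, seed=0):
    """Return m means, each from a resample of len(data) values drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)  # the resampled set keeps the size of the input set
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(m)]


def bootstrap_interval(means, confidence=0.95):
    """Interval estimate, ASSUMED form: bootstrap mean +/- z * S_m."""
    centre = statistics.fmean(means)   # arithmetic mean of the bootstrap sample
    s_m = statistics.stdev(means)      # standard deviation of the bootstrap sample
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return centre - z * s_m, centre + z * s_m


# Hypothetical porosity values (parts of unit); NOT the data of reservoir "L".
porosity = [0.17, 0.22, 0.19, 0.25, 0.16, 0.21, 0.18, 0.23, 0.20, 0.24,
            0.15, 0.22, 0.19, 0.26, 0.17, 0.21, 0.20, 0.18, 0.23, 0.19,
            0.22, 0.16, 0.24, 0.20, 0.21]

means = bootstrap_means(porosity, m=1100)
low, high = bootstrap_interval(means)
print(f"Interval estimate of porosity: {low:.4f} to {high:.4f}")
```

Fixing the seed makes a run reproducible; a different seed gives slightly different bounds, which is one reason the number of resamplings m is increased until the results stabilize.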

Mathematical settings of data normality tests
In order to determine when the normal distribution is obtained, the data sets obtained by the bootstrap method must be tested for normality. The following tests for the existence of a normal distribution were applied to control and analyze the data: the Anderson-Darling (A-D) test and the Kolmogorov-Smirnov (K-S) test.

Anderson-Darling (A-D) test
The A-D test uses the following quantities: AD - value of the Anderson-Darling test; AD* - corrected value of the Anderson-Darling test; p - probability. For a large sample, the correction of the A-D test value is negligible. The minimum number of data in a tested set is 20. The minimum p-value for accepting normality with the A-D test is 0.10.
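A minimal Python sketch of how AD, AD* and p can be computed is given below. It assumes the standard form of the A-D statistic for a normal distribution with estimated mean and standard deviation, together with a widely used small-sample correction and piecewise p-value approximation commonly attributed to D'Agostino and Stephens (1986). These formulas are an assumption here and may differ in detail from those used in the paper; the function name and the sample are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def anderson_darling_normal(sample):
    """Return (AD, AD*, p) for a normality test with estimated mean and std."""
    x = sorted(sample)
    n = len(x)
    dist = NormalDist(fmean(x), stdev(x))
    # Clip the CDF so the logarithms below stay finite for extreme values.
    f = [min(max(dist.cdf(v), 1e-12), 1 - 1e-12) for v in x]
    s = sum((2 * i + 1) * (math.log(f[i]) + math.log(1 - f[n - 1 - i]))
            for i in range(n))
    ad = -n - s / n
    ad_star = ad * (1 + 0.75 / n + 2.25 / n ** 2)  # negligible for large n
    # Piecewise p-value approximation (assumed, see the text above).
    if ad_star >= 0.6:
        p = math.exp(1.2937 - 5.709 * ad_star + 0.0186 * ad_star ** 2)
    elif ad_star > 0.34:
        p = math.exp(0.9177 - 4.279 * ad_star - 1.38 * ad_star ** 2)
    elif ad_star > 0.2:
        p = 1 - math.exp(-8.318 + 42.796 * ad_star - 59.938 * ad_star ** 2)
    else:
        p = 1 - math.exp(-13.436 + 101.14 * ad_star - 223.73 * ad_star ** 2)
    return ad, ad_star, min(max(p, 0.0), 1.0)


# Hypothetical stand-in for a bootstrap sample of 1100 means.
rng = random.Random(0)
sample = [rng.gauss(0.20, 0.007) for _ in range(1100)]
ad, ad_star, p = anderson_darling_normal(sample)
print(f"AD={ad:.3f}, AD*={ad_star:.3f}, p={p:.3f}, accepted={p >= 0.10}")
```

For n = 1100 the correction factor is about 1.0007, which illustrates why the corrected and uncorrected values practically coincide for a large sample.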

Kolmogorov-Smirnov (K-S) test
The Kolmogorov-Smirnov (K-S) test is the most widely applied statistical test for checking the normal distribution of nonparametric input data. The expression for the value of the K-S test is (Lopes et al., 2007): $D = \max_x |F(x) - P(x)|$, where: D - value of the K-S test, F(x) - empirical distribution function, P(x) - cumulative function of the theoretical distribution of the K-S test. In the case of a distribution normality test, the samples are standardized and compared with the standard normal distribution. The advantages of the bootstrap method are its ease of application and that it allows the calculation of descriptive statistics for variables, which would not be possible without the application of this method. Its disadvantage is that, if the sample is not representative, a large amount of time is spent on processing the data without specific results.
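Returning to the K-S test value, the comparison of F(x) with P(x) on a standardized sample can be sketched in Python as follows. This is an illustrative sketch only: the function name and the test data are hypothetical, and it computes the test value alone, not the critical values or p-value needed to accept or reject normality.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(sample):
    """Return max |F(x) - P(x)| after standardizing the sample."""
    x = sorted(sample)
    n = len(x)
    mean, sd = fmean(x), stdev(x)
    std_normal = NormalDist()
    d = 0.0
    for i, v in enumerate(x):
        p = std_normal.cdf((v - mean) / sd)  # theoretical CDF P(x)
        # The empirical CDF F(x) jumps from i/n to (i+1)/n at x[i],
        # so both sides of the step are checked.
        d = max(d, (i + 1) / n - p, p - i / n)
    return d


# Hypothetical data: normal quantiles versus strongly skewed (exponential) quantiles.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
skewed = [-math.log(1 - (i + 0.5) / 200) for i in range(200)]
print(f"normal-like: {ks_statistic_normal(normal_like):.4f}")
print(f"skewed:      {ks_statistic_normal(skewed):.4f}")
```

The skewed sample gives a much larger test value than the normal-like one, which is the behaviour the test relies on.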

Results and discussion
The data used in this paper are taken from Malvić et al. (2019b). The analyzed variable is the porosity of reservoir "L" of field "A". The porosity data set contains 25 values, which makes it a large data set. The input data set needs to be tested for distribution normality (Table 1).
The numbers of repeated resamplings applied in this paper are 500, 1000, 1050, 1100 and 1250. The results of the normality tests are shown in Table 2.
As can be seen from Table 2, the K-S test is not applicable when more than 1000 values are tested, because no test value is obtained (a limitation of the macro in Microsoft Excel). For 500, 1000 and 1050 resamplings, the A-D test value increases and approaches the 0.10 limit for test acceptance. The case of 1250 resamplings was tested next, and the A-D test value increased to 0.64, which indicates the existence of a normal distribution. The case of 1100 resamplings was then tested additionally, and the A-D test value was 0.20. Normality of the data distribution for the porosity of the "L" reservoir is therefore reached between 1050 and 1100 resamplings. The calculated interval estimate of the "L" reservoir porosity expectation for 1100 and 1250 resamplings is shown in Table 3.
According to the estimated confidence intervals of the porosity of reservoir "L", the lower values of porosity for 1100 and 1250 realizations differ only in the fourth decimal place. This negligible difference between the estimated intervals for 1100 and 1250 leads to the conclusion that an estimate for 2000 resamplings is not necessary. A graphical representation of the results of the bootstrap method is shown in Figure 4, which shows how the histogram changes relative to the normal distribution curve (the red line). The number of classes is 22 for 500 realizations (class width of 0.001818 parts of unit) and 32 for 1000, 1050, 1100 and 1250 realizations (class width of 0.00125 parts of unit). The change in the normality of the data obtained by the bootstrap method is seen very clearly between 1050 and 1100 realizations. Figure 4 shows that the blue columns exceed the normal distribution curve (red line) less in the cases of 1100 and 1250 realizations, in which most of the blue columns are near or below the red curve. This was confirmed by the A-D test, according to which a normal distribution of the data is obtained after 1100 realizations.
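The two class widths quoted above are consistent with one common histogram range. As a quick arithmetic check in Python, assuming a range of 0.04 parts of unit (a value inferred here from the stated class counts and widths, not given in the paper):

```python
def class_width(value_range, n_classes):
    """Width of one histogram class for equally wide classes."""
    return value_range / n_classes


ASSUMED_RANGE = 0.04  # parts of unit; inferred, not stated in the paper

for n_classes in (22, 32):
    print(n_classes, round(class_width(ASSUMED_RANGE, n_classes), 6))
```

Under this assumption, 22 classes give a width of about 0.001818 and 32 classes give 0.00125, matching the values reported for Figure 4.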

Conclusions
The minimum number of resamplings for a large sample, on the example of the porosity of reservoir "L", is 1100. Normality of the data was obtained between 1050 and 1100 realizations.
When testing the normal distribution of a large sample obtained by the bootstrap method, it is recommended to use the Anderson-Darling (A-D) statistical test, because the Kolmogorov-Smirnov (K-S) statistical test, as implemented in the Microsoft Excel macro used here, gives no test value for a sample larger than 1000.
Interval estimation of porosity (reservoir "L") obtained by the bootstrap method is 18.75% to 21.44% with a 95% confidence level.
The bootstrap method is applicable to a large sample, as shown by the results for the porosity of the "L" reservoir, and is therefore applicable to the entire area of the Sava Depression where the geological characteristics are similar to those of the "L" reservoir. It is used to determine the primary value of reservoir porosity and is applicable to Kloštar-Ivanić Formation reservoirs.