Application of machine learning models in predicting initial gas production rate from tight gas reservoirs

Driven by advancements in technology, tight-gas field development has become a significant source of hydrocarbons for the energy industry. The amount of data generated in the process is immense, as most platforms are now being digitized. Machine learning tools can be used to analyse this data and identify patterns between several dependent and independent variables. Forecasting initial gas production rates has important implications for planning production and processing facilities for new wells, affects investment decisions, and is an important component of reporting to regulatory agencies. This study is based on the analysis of reservoir rock/fluid properties and selected well parameters to build decision-based models that can predict initial gas production rates for tight gas formations. Two machine learning predictive models, an Artificial Neural Network (ANN) and a Generalized Linear Model (GLM), were used to determine the expected recovery rate of planned new wells. Production data was retrieved from 224 wells and used in developing the models. The results obtained from these models were then compared to the actual recorded initial gas production rates of the wells. The analysis revealed a Mean Square Error (MSE) of 1.57 for the GLM model, whereas the ANN model gave an MSE of 1.24. The Key Performance Index (KPI) for the ANN model revealed that reservoir thickness made the highest contribution (36.5%) to the initial gas production rate, followed by the flowback rate (29%). The reservoir/fluid properties contributed 53% to the initial gas production rate, while the hydraulic fracture parameters contributed 47%.


Introduction
Technological advancements in the oil and gas sector, spurred by shale oil and gas development and smart field development, together with cheaper and more reliable data storage technologies, have led to an increase in the amount of data captured in the industry. For example, in developing tight gas formations, hydraulic fracturing is used to produce fractures in rock formations, which stimulate the flow of natural gas. Reservoir modelling in such systems is an extremely complicated task, given the need to simulate fluid flow in a network of induced natural fractures coupled to geo-mechanical effects and other processes such as water blocking, non-Darcy flow in nano-scale pores, and adsorption/desorption (Cipolla et al., 2010; Ding et al., 2014). Tight gas refers to natural gas trapped in a reservoir with a matrix permeability lower than 0.1 × 10⁻³ μm², which usually has no natural deliverability or a lower natural deliverability than the industrial standard, so stimulation or special treatment wells must be used to obtain commercial gas flow (National Energy Administration, 2011). Tight gas reservoirs can be divided into two types based on reservoir characteristics, reserves and structural positions: continuous-type and trap-type tight gas reservoirs (Da et al., 2012).
Oil and gas production companies use thousands of sensors installed in subsurface and surface facilities to provide continuous data collection and real-time monitoring of assets and environmental conditions (Abdelkadir and Luc, 2014). This data comes in structured, semi-structured and unstructured forms. According to Gupta (2016), analytics reveal patterns and relationships in this data in order to improve decision making. Analytical techniques are used to identify patterns in historical and even specific data, which can then be correlated to current or future data to identify risks and opportunities (Bravo et al., 2014). Machine learning has recently been employed successfully in different fields where huge amounts of data are prevalent to generate data-driven models for operational and business decision-making purposes (Hastie et al., 2001).
The Mining-Geology-Petroleum Engineering Bulletin and the authors ©, 2019, pp. 29-40, DOI: 10.17794/rgn.2019.3.4
The use of neural networks in data analysis is not new to the petroleum industry. Malvić and Prskalo (2007) applied a back propagation neural network in the processing of three seismic attributes (amplitude, phase and frequency) from 14 wells. The results were subsequently used to predict reservoir porosity. Cvetković et al. (2009) used two types of neural networks, the supervised-learning multilayer perceptron and the radial basis function neural network, to successfully predict the lithology and the hydrocarbon saturation of the Upper Pannonian sediments and Lower Pontian deposits in the Kloštar field. Malvić et al. (2010) utilized supervised neural algorithms for well log and seismic data analysis in three fields. The algorithms used in their work mainly consisted of a multi-layer perceptron architecture with a sigmoid or log-sigmoid activation function. However, a radial basis function was also used as the activation function for one network. This implies that different types of back propagation architecture and activation can be used. Šapina (2016) made an interesting comparison between mapping using an artificial neural network (ANN) and the ordinary kriging method. Although the ordinary kriging method had a lower mean square error in his work, this was attributed mainly to the fact that the ANN utilized a relatively small amount of data in comparison with kriging.
In this paper, machine learning is used to generate data-driven models for operational and business decision purposes in the oil and gas sector, especially in unconventional reservoirs. Some of the most commonly used machine learning algorithms include, but are not limited to, linear regression algorithms, support vector machines, artificial neural networks, clustering analysis, principal component analysis and fuzzy logic (Trent, 2016). The selection of any of these algorithms depends in part on the type of data, the type of problem (regression or classification) and whether the problem is a supervised or an unsupervised learning problem. An accurate forecast of the initial production rate of a well is necessary for estimating reservoir performance and for designing production systems. According to Zhou et al. (2017), two methods used to forecast the production rate of a well producing from an unconventional reservoir are:
i. Simulation: Simulation is one of the best means of forecasting the initial production rate of a given well in a reservoir. However, building a representative model for a successful simulation takes time. More data and many loops and iterations are also required.
ii. Analytical Method (Material Balance): The mathematical models that govern the flow of fluid in an unconventional reservoir are too complex to compute analytically. Moreover, the use of material balance models requires previous production data to estimate gas production rates.
Considering the limitations of these two methods of predicting the initial production rate for a tight gas formation, this paper seeks to explore an alternative method (predictive analytics) to forecast the initial gas production rate.
The choice of this method stems from the ability to use data obtained during the drilling and exploration phase to predict initial gas production rate, without prior production from a well.
The objective of this research is to build a predictive model that can estimate the initial gas production rate of a newly planned well in the field. The approach used in this research is based on the analysis of reservoir rock and fluid properties and selected well parameters to build decision-based models that can predict the initial gas production rate.

Geological setting of the reservoir
The Ordos Basin is China's second-largest sedimentary basin and covers an area of 370,000 km² across Shaanxi, Gansu and Shanxi provinces and Ningxia and Inner Mongolia in the mid-western region of China (China National Petroleum Corporation, 2008), as shown in Figure 1. The basement of the basin is composed of crystalline and metamorphic rocks of Middle-Lower Proterozoic and Archaeozoic age. The sedimentary cover roughly underwent five phases: aulacogen structure in the Middle-Late Proterozoic, epicontinental sea in the Early Paleozoic, continental-marine transition in the Late Paleozoic and fault depression in the Cenozoic (Tingting et al., 2014). Presently, three hydrocarbon-bearing sequences have been found in the basin: the Lower Palaeozoic, Upper Palaeozoic and Mesozoic, of which the Upper Triassic Yanchang Formation and Jurassic Yanan Formation are the main oil-bearing formations, as shown in Figure 2.
In this study, data from the Sulige, Daniudi, Yulin, Zizhou, and Wushenqi gas fields were used. All reser-

Methods
The methodology used in this work is summarized in Figure 3. A data set of 224 observations (one per well) and 10 variables was acquired and used for this analysis. The main variable of focus, prod_rate, represents the initial gas production rate of each well in the field. The other variables in the data set were divided into the two sets below:
i. Reservoir/fluid property variables: reservoir thickness, shale content, porosity, permeability of the formation and gas saturation.
ii. Well design variables: volume of fracture fluid, fracture pressure of the formation, fluid flow-back rate and hydraulic fracture liquid pump rate.
A brief section of the data set used in this research is shown in Table 3. The last variable in Table 3 is the production rate (the initial gas production rate of each well). For the purpose of this study, the reservoir/fluid properties and the well design variables are referred to as the 'explanatory variables', while the initial gas production rate (production rate) is referred to as the 'response variable' in the predictive models.

Correlation analysis
In correlation analysis, simple statistical methods are used to explore the variables in the data set, to establish the relationships that exist between them and to determine the degree of significance of each relationship (Schuetter et al., 2015). In examining the relationship between the variables, a correlation analysis with two outputs was generated: (i) the correlation matrix, which shows the coefficient of correlation between the variables, as shown in Table 4, and (ii) the p-values, which show the degree of significance of the correlations, as shown in Table 5. A p-value less than 0.05 indicates a statistically significant correlation between the two variables.
The correlation coefficient was used in the analysis of design parameters to determine how the production rate can be improved in remedial operations. As observed in Table 4, gas saturation and permeability showed the largest negative correlation, with a coefficient of -0.72, which means that an increase in permeability is associated with a decrease in gas saturation and vice versa. It should be noted, however, that the smaller the p-value, the more significant the relationship, whereas the larger the absolute value of the correlation coefficient, the stronger the relationship.
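The two outputs of the correlation analysis (a coefficient and a p-value for each pair of variables) can be illustrated with a minimal Python sketch. The paper does not state which significance test was used, so a permutation test is swapped in here as an assumption; the function names and the sample values are hypothetical and are not data from the study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up illustrative values, NOT data from the study.
perm = [0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00, 1.20]
gas_sat = [68.0, 66.5, 63.0, 61.0, 57.5, 56.0, 52.0, 50.5]

r = pearson_r(perm, gas_sat)
p = permutation_p_value(perm, gas_sat)
print(f"r = {r:.2f}, p = {p:.4f}")
```

A negative r with a small p-value would be read as a significant negative relationship; the sign and size of r describe direction and strength, while p describes significance.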

K-Mean clustering analysis
K-mean clustering is an unsupervised machine learning algorithm and one of the most commonly used clustering methods. It has been studied for many decades and serves as a basis for many newer, more sophisticated clustering algorithms (Lantz, 2015). In unsupervised learning, the cluster analysis has no prior knowledge of any shape or pattern that may be present in the data. However, for this analysis, a semi-supervised clustering was conducted: the values 5 and 3 were assigned to k, which may not correspond to the optimum number of clusters.
The prod_rate (initial gas production rate) of the wells in the tight gas formation was used in the clustering. The idea is to subset only the production rate of the wells from the data set and, using k = 5 and k = 3, group the wells into categories. For k = 5, the categories are: (i) poor, (ii) average, (iii) good, (iv) very good and (v) excellent. For k = 3, the categories are: (i) poor, (ii) average and (iii) good. The scatter plot of the production rate clusters for k = 5 is given in Figure 4, while the plot for k = 3 is given in Figure 5. The colour code in each scatter plot shows the different clusters and which observation belongs to each cluster. From the scatter plots, the 3-cluster grouping shows a clearer demarcation between clusters than the 5-cluster grouping. In the 5-cluster grouping, separating groups 1-3 is problematic, whereas the 3-cluster grouping displays a somewhat clear demarcation between the three groups.
The importance of the cluster lies in identifying the wells that are producing within the expected design conditions or below expectation. The results from the cluster analysis were further evaluated during the Look-Back Analysis to determine which of the wells actually fell into the categories presented above in the presence of other variables.
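The grouping of wells by production rate can be sketched with a small one-variable k-means routine. This is only an illustration under stated assumptions: the paper does not describe its initialization, so a deterministic quantile-based start is assumed here, and the function names and production rates are hypothetical.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda c: abs(value - centroids[c]))


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on a single variable; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed deterministic start: one centroid from the middle of each of k slices.
    centroids = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [nearest(v, centroids) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(v, centroids) for v in values]
    return centroids, labels


# Made-up initial production rates, NOT data from the study.
prod_rate = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0, 9.7, 10.2, 10.0]
names = ["poor", "average", "good"]  # k = 3 categories, lowest centroid first

centroids, labels = kmeans_1d(prod_rate, k=3)
for rate, label in zip(prod_rate, labels):
    print(f"{rate:5.1f} -> {names[label]}")
```

Because the starting centroids are sorted and the data are one-dimensional, the cluster indices stay in ascending order of production rate, which is what allows them to be mapped directly onto the category names.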

Predictive model analysis
The next phase involves the prediction of the initial gas production rate of the wells using the other numerical explanatory variables. In building the prediction models, two machine learning algorithms were employed. The first was an Artificial Neural Network (ANN) and the second was a Generalized Linear Model (GLM). The idea was to evaluate which of the two machine learning algorithms better forecasts the initial gas production rate.

Artificial Neural Network model
The ANN model was built to forecast the initial gas production rate given the reservoir parameters and well design parameters. The data set contained a total of 10 variables with 224 observations. In training the model, the last variable in the data set, prod_rate, is the output variable, while the other nine variables ('reservoir_thickness', 'shale content', 'perm', 'porosity', 'gas_sat', 'frac_fluid', 'Pump_rate', 'frac_press' and 'flowback') are the input variables. Sampling was used to split the data into training, validation and test sets in a ratio of 80:10:10, which gave 179 observations for the training data set, 22 wells for the validation data set and 23 for the test data set. The training set was used to train the model, the validation set was used to tune the model and ascertain its prediction rate, and the test set was used to make the actual predictions for the wells.
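The 80:10:10 split can be reproduced in a few lines. The following is a minimal sketch, assuming a simple random shuffle of row indices; the function name and the seed are hypothetical, and the paper does not say how its sampling was implemented.

```python
import random


def split_indices(n_obs, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle row indices and cut them into training, validation and test sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_train = int(n_obs * train_frac)  # 224 * 0.8 = 179.2, truncated to 179
    n_val = int(n_obs * val_frac)      # 224 * 0.1 = 22.4, truncated to 22
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]   # remainder: 224 - 179 - 22 = 23
    return train, val, test


train_idx, val_idx, test_idx = split_indices(224)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 179 22 23
```

Truncating the training and validation counts and giving the remainder to the test set is one way to arrive at the 179/22/23 partition reported above.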
A neural network works best when the input values to the network are scaled or normalized to a range of 0 to 1 or -1 to 1. The choice between normalization and scaling depends mostly on the nature of the data. For the purpose of this work, the normalization function in Equation (1) was used:

$$x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (1)$$

Where:
x - the observation at a particular point in a variable;
min(x) - the minimum observation value of each variable;
max(x) - the maximum observation value of each variable.
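Min-max normalization to the 0 to 1 range, and the reverse step needed to express model outputs in original units, can be sketched as follows. The function names and the sample column are hypothetical and chosen only for illustration.

```python
def min_max_normalize(values):
    """Scale a column to [0, 1]; also return the bounds needed to undo the scaling."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def denormalize(scaled_value, lo, hi):
    """Map a value in [0, 1] back to the original units of the column."""
    return scaled_value * (hi - lo) + lo


# Made-up column of values, NOT data from the study.
column = [10.0, 20.0, 40.0, 50.0]
scaled, lo, hi = min_max_normalize(column)
print(scaled)                        # [0.0, 0.25, 0.75, 1.0]
print(denormalize(0.75, lo, hi))     # 40.0
```

Keeping the minimum and maximum of each variable from the training data is what makes it possible to convert a normalized prediction back into a production rate later on.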
The model was trained over a range of hidden layers with different numbers of hidden nodes in order to select the best model. Table 6 shows some of the hidden node configurations and their prediction rates on the validation and test data sets, using the Mean Squared Error (MSE) as the measure of the quality of fit of the model. After running the model over a range of hidden layers and nodes, and using cross-validation, the model trained with 1 hidden layer and 1 node was selected as having the best prediction rate when tested with the test data set. The MSE for the selected model (one hidden layer and 1 node) is 1.2411, and this model was therefore used for the remaining part of this research. Figure 6 shows the schematics of the ANN model.
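Selecting among candidate networks by MSE can be sketched as below. The candidate names and all numbers are hypothetical placeholders for held-out predictions; they are not the values behind Table 6.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up held-out rates and candidate predictions, NOT results from the study.
actual = [4.0, 6.0, 5.0, 3.0]
candidates = {
    "1 hidden node": [4.5, 5.5, 5.0, 3.0],
    "3 hidden nodes": [5.0, 7.0, 4.0, 3.0],
}

scores = {name: mean_squared_error(actual, preds) for name, preds in candidates.items()}
best_name = min(scores, key=scores.get)  # candidate with the lowest MSE
print(scores, best_name)
```

The same function applies unchanged to comparing two different model families, since it only needs the actual and predicted rates.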

Generalized Linear Model (GLM) model
The second model was built using the generalized linear algorithm. The same procedure used in the neural network was employed for the GLM model. The only difference is that scaling or normalization was not necessary in this case. The model gave a test MSE of 1.57 which is above the MSE value for the ANN model. This means that the ANN model performed better than the GLM model.

Results
The results of the first 9 predictions of the ANN and GLM models using the validation and test data sets are presented in Table 7. The model predictions fit the data reasonably well, as observed in Table 7.
The plot of the ANN model predictions against the actual initial gas production rate for each well, using both the test and validation data sets, is given in Figure 7. The line through the plot is the 45° line used to measure the goodness of fit of the model. The closer the plotted points are to the 45° line, the better the model performance. The goodness of fit for the GLM model is given in Figure 8.
In the sensitivity analysis, only the variables that could be altered mechanically (hydraulic and well design parameters) were considered, namely flowback, frac_press, frac_fluid and pump_rate. The sensitivity plot was produced with quantiles of (0, 0.1, 0.5, 0.9, 1).
To illustrate the importance of the sensitivity analysis, the variable flowback for well 1 was analysed. The normalized flowback rate for well 1, as used in building the model, is 0.50; the actual flowback rate before normalization is 82.5, while the actual initial production rate for the well is 4.92. Table 8 shows the expected production rates for the different quantiles for well 1. To get the exact value, the value extracted from the sensitivity model was denormalized, since the data set used to train the ANN model was normalized before training.
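A quantile-based sensitivity sweep of this kind can be sketched as follows: one design variable is set to each quantile of its observed values while the other inputs of the well are held fixed, and the model is queried each time. The `toy_model` function is a hypothetical stand-in for the trained ANN, and every number is made up for illustration.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def sensitivity_profile(predict, well, variable, observed_values,
                        quantiles=(0, 0.1, 0.5, 0.9, 1)):
    """Predict the response with one input swept over its quantiles, others held fixed."""
    ordered = sorted(observed_values)
    profile = []
    for q in quantiles:
        scenario = dict(well)  # copy so the original well record is untouched
        scenario[variable] = quantile(ordered, q)
        profile.append((q, predict(scenario)))
    return profile


def toy_model(inputs):
    """Hypothetical stand-in for the trained ANN; NOT the model from the study."""
    return 2.0 + 3.0 * inputs["flowback"] + 1.0 * inputs["frac_press"]


well_1 = {"flowback": 0.5, "frac_press": 0.3}           # made-up normalized inputs
flowback_values = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]   # made-up normalized column

for q, rate in sensitivity_profile(toy_model, well_1, "flowback", flowback_values):
    print(f"quantile {q}: predicted rate {rate:.2f}")
```

With a model trained on normalized data, each predicted rate would still need to be denormalized before being compared with the recorded production rate.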

Key Performance Index (KPI)
The variable importance plot shows the level of contribution of each input variable to the model. The results from the Key Performance Index were summarized in two parts: (i) reservoir and fluid properties and (ii) well design properties (hydraulic fracture design properties). The reason for this classification is to quantify the impact of design properties on well performance in the event of alterations. This helps to select a better design model for remedial operations or to properly select well design properties for a new well in the formation.
To estimate the key performance index of the models, the following algorithms were used:
i. Garson Algorithm (ANN);
ii. Variable Importance (GLM).
The Garson algorithm is only used for a neural network with one hidden layer and a single response variable (Gevrey, 2003). The relative importance of a specific explanatory variable can be determined by identifying all weighted connections between the nodes of interest (from input to output). The connection weights are tallied for each input variable, which gives a single value describing that variable's relationship with the response. The algorithm expresses the importance of each explanatory (input) variable as an absolute magnitude from 0 to 1. The result of the variable importance using the Garson algorithm is given in Figure 9.
Figure 9 shows that reservoir thickness has the highest KPI in the model, followed by the flowback rate. The percentage contribution of each explanatory variable is given in the table below. The cumulative sum of the model KPI in the plot equals 1 or 100%. Table 9 shows the summary of the percentage contribution of the reservoir and fluid properties and of the well design parameters to the model.
Since the model with the lowest MSE is selected as the best model, it is also taken to produce the best KPI. Thus the KPI generated using the Garson algorithm is selected as the KPI that best explains the data set.
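The weight-tallying idea behind the Garson algorithm can be illustrated with a short sketch. This follows one commonly cited formulation for a single hidden layer and a single output (each input's share of the absolute weights into a hidden node, weighted by that node's absolute output weight); whether it matches the exact implementation used in the study is an assumption. Bias weights are ignored, and the weights shown are made up.

```python
def garson_importance(input_hidden, hidden_output):
    """Relative importance of each input for a one-hidden-layer, single-output network.

    input_hidden[i][j] is the weight from input i to hidden node j;
    hidden_output[j] is the weight from hidden node j to the output.
    """
    n_inputs = len(input_hidden)
    n_hidden = len(hidden_output)
    raw = [0.0] * n_inputs
    for j in range(n_hidden):
        # Total absolute weight flowing into hidden node j.
        column_total = sum(abs(input_hidden[i][j]) for i in range(n_inputs))
        for i in range(n_inputs):
            share = abs(input_hidden[i][j]) / column_total
            raw[i] += share * abs(hidden_output[j])
    total = sum(raw)
    return [value / total for value in raw]  # importances sum to 1


# Made-up weights for a network with one hidden node, NOT the fitted model.
inputs = ["reservoir_thickness", "flowback", "porosity"]
importance = garson_importance([[2.0], [-1.5], [0.5]], [3.0])
for name, value in zip(inputs, importance):
    print(f"{name}: {value:.1%}")
```

With a single hidden node, as in the selected model, this formulation reduces to each input's share of the absolute input-to-hidden weights, so the sign of a weight does not affect the ranking.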

Look-back analysis
The look-back analysis was conducted to determine the relative performance of each well with regard to the production rate and to determine the expected recovery in the event that some of the production parameters used in the model design were altered. In general, explanatory variables are analysed to determine whether the well is underperforming, performing as expected or exceeding expectations. It is generally accepted that certain reservoir and fluid properties do not change with time. For example, the reservoir thickness remains constant throughout the producing life of a field.
In conducting the look-back analysis, more emphasis was placed on the well design parameters (hydraulic fracturing parameters) that were used to train the model. The aim was to generate a set of random values for the design parameters of each well and, keeping the reservoir properties constant, feed the new data set into the model to obtain a new forecast rate for each well using the simulated data.
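The look-back procedure can be sketched as a small simulation: design parameters are redrawn at random while the reservoir properties of the well stay fixed, and each simulated record is passed to the model. The paper does not state the sampling distribution, so uniform draws are an assumption here; `toy_model`, the variable ranges and all values are hypothetical.

```python
import random


def look_back(predict, well, design_ranges, n_scenarios=100, seed=0):
    """Forecast rates for random design parameters, reservoir properties held constant."""
    rng = random.Random(seed)
    forecasts = []
    for _ in range(n_scenarios):
        scenario = dict(well)  # reservoir properties are copied unchanged
        for name, (low, high) in design_ranges.items():
            scenario[name] = rng.uniform(low, high)  # assumed uniform sampling
        forecasts.append(predict(scenario))
    return forecasts


def toy_model(inputs):
    """Hypothetical stand-in for the trained ANN; NOT the model from the study."""
    return (1.0 + 2.0 * inputs["reservoir_thickness"]
            + 1.5 * inputs["flowback"] + 0.5 * inputs["pump_rate"])


# Made-up normalized well record and design ranges.
well = {"reservoir_thickness": 0.6, "flowback": 0.5, "pump_rate": 0.4}
ranges = {"flowback": (0.0, 1.0), "pump_rate": (0.0, 1.0)}

forecasts = look_back(toy_model, well, ranges, n_scenarios=50)
print(f"forecast range: {min(forecasts):.2f} to {max(forecasts):.2f}")
```

Comparing a well's recorded rate with the spread of simulated forecasts gives one way to judge whether it is performing below, within or above what alternative designs would be expected to deliver.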

Conclusions
Two machine learning models have been presented in this study for the prediction of the initial gas production rate for tight gas reservoirs using selected reservoir and well parameters. Predictive analytics can add particular value in tight gas production management. The ANN model with one hidden layer was built by cross-validation with a minimum Mean Square Error of 1.24, while the GLM model gave a Mean Square Error of 1.57. This means that the ANN model outperformed the GLM model in forecasting the initial gas production rate and as such was used for the look-back analysis. The ANN model was also used to rank the KPI of the explanatory variables employed in the predictive model. The KPI shows that the reservoir thickness has the highest contribution to the initial gas production rate, followed by the flowback rate. The contribution of the reservoir/fluid properties to the initial gas production rate is 53%, while the contribution of the hydraulic fracture parameters is 47%. This study concludes

The results of the KPI of the variables used in training the GLM model are given in Figure 10. Table 10 shows the percentage contribution of the individual variables grouped under reservoir/fluid properties and well design properties. The result affirmed that the reservoir thickness contributed the most to the initial gas production. In selecting the best KPI, the MSE rule was used. The idea that the model with the lowest MSE is