Determining residuary resistance per unit weight of displacement with Symbolic Regression and Gradient Boosted Tree algorithms

Determining the residuary resistance per unit weight of displacement is one of the key factors in the design of vessels. In this paper, the authors utilize two novel methods, Symbolic Regression (SR) and Gradient Boosted Trees (GBT), to achieve a model which can be used to calculate the value of residuary resistance per unit weight of displacement from the longitudinal position of the center of buoyancy, prismatic coefficient, length-displacement ratio, beam-draught ratio, length-beam ratio, and Froude number. This data is given as the results of 308 experiments provided as a part of a publicly available dataset. The results are evaluated using the coefficient of determination (R²) and Mean Absolute Percentage Error (MAPE). Pre-processing, in the shape of correlation analysis combined with variable elimination and variable scaling, is applied to the dataset. The results show that while both methods achieve regression results, the result of regression of SR is relatively poor in comparison to GBT. Both methods provide slightly poorer, but comparable, results to previous research focusing on the use of "black-box" methods, such as neural networks. The elimination of variables does not show a high influence on the modeling performance in the presented case, while variable scaling does achieve better results compared to the models trained with the non-scaled dataset.


Introduction
Artificial Intelligence is a commonly used tool in today's scientific and engineering practice, as its modeling capabilities are extremely high, and may allow for the creation of high-precision models for many complex problems [1]. These techniques have been applied in many areas of maritime research. Examples include optimization of exergy analysis of internal ship systems [2] for steam turbines [3], modeling of propulsion system parameters [4,5], ship modeling [6], vessel route optimization [7], and vessel detection [8].
There are many more examples of artificial intelligence applications. Oslebo et al. (2020) [9] apply machine learning for fault detection of pulsed-energy mission loads. The authors address the issue of discerning faults from the sudden heavy loads commonly present during operation. Through classification using the proposed novel machine learning method, the authors manage to achieve 99.8% accuracy in waveform classification and 100% accuracy in general fault detection. Berghout et al. (2021) [10] demonstrate a supervised deep learning approach for addressing the problem of condition-based maintenance of naval propulsion systems. The authors manage to achieve highly precise models through the application of the so-called extreme learning machine. Jeong et al. (2020) [11] demonstrate the application of machine learning for shipbuilding master data management. The authors demonstrate how machine learning can be applied to address the problem of an ever-increasing amount of data present in modern shipbuilding. Shaeffer et al. (2020) [12] apply machine learning in early-stage hull form design. The authors demonstrate that, as in many other industries, shipbuilding can apply data-driven models for determining the basic parameters of hull forms. Barua et al. (2020) [13] review applications of machine learning for the problem of international freight transportation management. The authors review the most successful approaches and conclude that the development of this kind of system should continue, as their uses are highly beneficial. In this paper the authors consider the possibility of applying machine learning techniques to the modeling of yacht hydrodynamics, specifically the modeling of residuary resistance per unit weight of displacement. Previous approaches have been made using neural networks [14] such as the multilayer perceptron [15]. Both of those methods are so-called "black-box" methods, which suffer from the issue of inexplicability.
The high complexity of these models does not allow for their interpretation. Additionally, neural network models tend to require a specific programming language, or even a specific library, to be re-used and applied. The methods that the authors apply, GBT and Symbolic Regression, are explainable methods whose models take the shape of decision trees and equations, respectively. This allows them to be more easily implemented in various tools, without requiring specific function libraries. The novelty of this paper lies in the determination of the usability of the two used methods.
In the paper, first, the used dataset is presented, followed by brief descriptions of the methods and the used machine learning methodology. Finally, the results are presented and discussed, and conclusions are drawn.

Methodology
The utilized methods are presented in this section. First, the analysis of the dataset is presented, using correlation and distribution analysis.

Dataset
The used dataset is the Delft yacht hydrodynamics dataset, which was collected at the Delft Ship Hydromechanics Laboratory [16]. It consists of 308 full-scale experiments with 22 different hull forms. The dataset consists of six input variables and one output variable. The input variables are:
• the longitudinal position of the center of buoyancy,
• the prismatic coefficient,
• the length-displacement ratio,
• the beam-draught ratio,
• the length-beam ratio, and
• the Froude number.
All the input variables, as well as the output variable, are dimensionless. The output variable describes the residuary resistance per unit weight of displacement, which determines the resistance a ship hull form experiences in regard to the displacement of the hull [16].
Before machine learning is applied, three analyses are performed: first, the determination of standard statistical descriptors; second, the determination of variable distributions; and finally, the correlation analysis. Standard statistical descriptors are calculated for each variable and include the minimal and maximal values of the variable, the range of the variable, the median value of the variable, and the standard deviation. The distribution of the variables is then plotted for each of the variables. This is achieved by plotting histograms of the variables and is used to determine if there are any outliers in the data that may cause issues in the creation of the regression models. Finally, correlation analysis is performed to determine which variables influence the output of the dataset. This can be useful for eliminating the variables which do not have a high influence on the output, which may assist with easier model regression [17].
Table 1 shows the standard statistical measures for each variable. We can see that each variable has a different range, which can negatively affect the performance of the used regression algorithms [18]. It should also be noted that the standard deviations, when compared to the median value and range of the variable, are relatively low, except in the case of the output, residuary resistance per unit weight of displacement. This can indicate that the data in question has a relatively uneven distribution across its range, meaning that more data points are located on one end of the range [19,20]. This can be confirmed by viewing the distributions of the variables, by plotting the histograms of the data, which is shown in Figure 1. Figure 1 demonstrates the distribution of each input variable contained in the dataset, while Figure 2 shows the distribution of the output variable. It can be seen that all of the input variables have similarities to the normal or uniform distributions.
This is a good quality, which is commonly wanted within datasets used in machine learning applications [21]. The output is distributed exponentially, as shown in Figure 2, meaning that a larger amount of data is contained at the lower ranges of the dataset. This can cause issues with the models being better fitted for that data, as opposed to the general data [22]. Another element of note when observing Figures 1 and 2 is that the data is continuously distributed across each of the histogram bins, signifying that there are no outliers contained within the data of each variable, meaning that the analysis and removal of such values is unnecessary.
The final performed analysis is the correlation analysis. Correlation analysis provides information on the inter-influence of individual variables within the dataset on one another. If x and y represent two datasets of length n for which we are trying to determine the correlation, then the correlation coefficient r is calculated according to the equation [23,24]:

r = (n Σxᵢyᵢ − Σxᵢ Σyᵢ) / √[(n Σxᵢ² − (Σxᵢ)²) · (n Σyᵢ² − (Σyᵢ)²)].

The values for each variable achieved using the described methodology are given in Table 2. The table in question can allow us to perform variable elimination on the dataset to allow for easier regression using the machine learning methods used in the presented work.
The dataset is finally split into two subsets: the training and the testing set. The training set is used for the training of the regression models, while the testing set represents previously unseen data for the models. This unseen data is used to evaluate the models, according to the metrics described in the following sections. In the presented research, the dataset was split into a 90:10 train/test ratio. This means that 277 data points have been used for the training part of the dataset, and 31 data points have been used for the testing part of the dataset.
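The correlation analysis and the train/test split described above can be sketched as follows. This is a minimal illustrative sketch in pure Python, not the authors' implementation; the function names and the fixed seed are assumptions.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences,
    using the computational form of the equation above."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den

def train_test_split(data, test_ratio=0.10, seed=42):
    """Shuffle a list of samples and split it into train and test subsets."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_test = round(len(data) * test_ratio)
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test
```

With the 308 samples of the dataset, a 90:10 split produces the 277 training and 31 testing points stated above.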
The first variable is the longitudinal position of the center of buoyancy, which describes the position of the buoyancy center relative to the length of the vessel. The second variable is the prismatic coefficient, which describes the distribution of displacement along the hull. The third input variable is the length-displacement ratio, which describes the proportion of the vessel length to its displacement, while the fourth input, the beam-draught ratio, describes the relationship between the waterline beam and the vessel draft. The fifth input variable is the length-beam ratio, which describes the proportion between the vessel length and beam. The final input variable is the Froude number, which defines the ratio of the flow inertia to the external force field. The output, residuary resistance per unit weight of displacement, describes the amount of resistance experienced in dependence on the vessel displacement expressed in unit weights [15].

AI regression
In this section the pre-processing applied to the data is described, followed by a brief overview of the used methods and their application. Finally, the method evaluation metrics are given.

Pre-processing
Two types of pre-processing are applied to the dataset in an attempt to improve the results. These are the scaling of the values in the dataset and the elimination of values that show a poor correlation to the output variable.
The min-max scaling is performed by taking the maximal and minimal values of each variable (given in Table 1) and transforming the values based on them to set the range of the variable to [0,1]. For a variable x, transformation into the scaled variable x' is done with [25,26]:

x' = (x − x_min) / (x_max − x_min).

The variable elimination process is based on the values of the correlation coefficient between variables, as seen in Table 2. Due to the low correlation of most input variables with the output variable (observe the last row or column), the elimination will be done in two cases. First, only variables with |r| ≥ 0.05 will be kept, while in the second, the variables with |r| ≥ 0.01 will be kept. Observing Table 2, it can be noted that for the first elimination only the Froude number will be kept, while in the second case the variables Froude number, beam-draught ratio, prismatic coefficient, and length-beam ratio will be kept and used in the regression modeling.
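Both pre-processing steps can be sketched briefly. The correlation values passed to the elimination helper below are placeholders for illustration, not the actual values of Table 2:

```python
def min_max_scale(column):
    """Scale a list of values into the [0, 1] range, per the equation above."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def eliminate_variables(correlations, threshold):
    """Keep only the variables whose absolute correlation with the
    output variable meets or exceeds the chosen threshold."""
    return [name for name, r in correlations.items() if abs(r) >= threshold]
```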
One of the common steps in data science applications is the use of cross-validation, in which the data is split into multiple folds, and the training is performed multiple times with each of the folds being used for training [27]. The reason this process has not been applied in the presented research is two-fold. The first reason is that the data shows distributions that are close to normal or uniform, despite the small amount of data. The second is to allow for a more direct comparison to previous research applying machine learning methods to the dataset used in this research, as the research in question did not use cross-validation, but instead used the standard train-test data split [15,16] used in the presented research.

Random Search procedure
The selection of hyperparameters for both methods is performed through a random search. This means that the selection of each hyperparameter is performed uniformly randomly across the given range, the model is trained with the randomly selected parameters, and the quality of the trained model is evaluated. This process is then repeated until either a satisfactory quality is achieved, or the execution is terminated due to the number of elapsed iterations reaching the pre-set value, which was set to 500 in the presented research. The selection of this procedure, as opposed to more strictly defined hyperparameter searches such as grid search, was done due to the nature of the algorithms used. As opposed to algorithms such as neural networks, which may have a highly discrete hyperparameter space, the hyperparameter spaces of SR and GBT are more finely granular, which means that comparatively small hyperparameter changes could result in significant model performance changes [27,28]. The values used as bounding ranges for individual hyperparameters have been selected according to previous research and common practices [27][28][29][30][31].
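The random search loop can be sketched as follows. The `evaluate` callback stands in for training and scoring a model with the sampled hyperparameters; the function names and the example ranges are illustrative assumptions, while the 500-iteration default matches the value used in the paper:

```python
import random

def random_search(ranges, evaluate, n_iter=500, seed=0):
    """Uniform random hyperparameter search keeping the best-scoring set.

    ranges: dict mapping hyperparameter name -> (low, high) bounds.
    Integer bounds are sampled with randint, float bounds with uniform.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {}
        for name, (low, high) in ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)
            else:
                params[name] = rng.uniform(low, high)
        score = evaluate(params)  # e.g. R² of the model on validation data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```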

Symbolic Regression
SR, also known as Genetic Programming (GP), is a method that utilizes the principles of evolutionary computing to develop regression models [27,28]. SR creates an initial set of random solutions, called the population, formatted as tree-shaped equations. The fitness of each of the solutions is determined. This means that the quality of each solution is ascertained, according to how well it models the data. Then, three different operations are applied to this set of random solutions: crossover, mutation, and reproduction. Crossover is applied to two separate candidate solutions. The tree-shaped equations are split, and the new, child solutions are accomplished via the recombination of the split parts. The equations for the crossover operation are selected with a probability proportional to their fitness. This means that the probability of being selected for crossover and producing a child solution is higher for better solutions. Applying this operation repeatedly should, in theory, by combining the higher-quality solutions, lead to better solutions being found from one generation to the next [29,30]. Still, applying only crossover may cause some issues, such as narrowing the solution space search area and converging to a locally optimal solution [31,32]. For this reason, two other operations are applied. The mutation will randomly modify a single, randomly selected solution and include it in the next population iteration. The modification will either be done to a single node of the solution (point mutation), a subtree of the solution (subtree mutation), or through subtree removal (hoist mutation). The reproduction will in turn simply copy an existing solution into the next population iteration to preserve the health of the gene pool [37]. The solution selected for reproduction is not selected fully randomly, but proportionally to its fitness. The probabilities of each of these operations being performed are the key hyperparameters of the SR method.
The ranges of the hyperparameters used are given in Table 3.
In addition to the evolutionary operation probabilities described previously, the hyperparameters include the population size, which is the size of the initial set of solutions; the number of generations, which controls the number of iterations in which the evolutionary operations are applied; and the initial tree depth, which describes the maximal size of the equation trees in the initial population. It can be noted that the reproduction probability is derived from the probabilities of the other operations. This is done because the operation probabilities need to add up to 1, as one of the operations needs to be performed in each of the iterations to allow the models to achieve better regression [38].
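The constraint that the operation probabilities sum to 1 can be expressed directly; the helper below is an illustrative sketch (the function name is an assumption), deriving the reproduction probability from the three probabilities that are sampled by the random search:

```python
def reproduction_probability(p_crossover, p_point, p_subtree, p_hoist):
    """Derive the reproduction probability as the remaining probability
    mass, so that all evolutionary operation probabilities sum to 1."""
    p_repro = 1.0 - (p_crossover + p_point + p_subtree + p_hoist)
    if p_repro < 0:
        raise ValueError("operation probabilities must not exceed 1")
    return p_repro
```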

GBT
GBT is a tree-based AI ensemble method [39]. Like SR, it is based on trees, but instead of the trees describing equations, they describe decision paths [40]. Each node of the tree describes a split into two paths depending on the value of a parameter, leading down to tree leaves that contain regressed values. GBT being an ensemble method means it uses a voting system, in which many trees are generated, and the output value of the model is calculated not based on a single tree but as a weighted average of all trees in the ensemble. Gradient boosting is a process in which the models are trained based on the residual error of the models. The error is calculated in each of the iterations and the gradient is adjusted based on it. The training speed is then adjusted proportionally to the error [41]. This approach allows faster model convergence and avoids the problem of skipping possible optimal solutions due to the training process slowing down when solutions near the optimum are found [42].
Random search is also applied for the hyperparameters of GBT, with the possible values given in Table 4.
Across the hyperparameters, the number of estimators represents the number of tree models included in the ensemble. Maximum features describes the method used to determine the number of features considered at each split, while the maximal depth of each tree is contained in the hyperparameter value Maximal Tree Depth. Minimal samples for leaf and split describe the minimal number of data points needed for the creation of a tree split or leaf within the model. Finally, the training algorithm describes the algorithm used for the calculation of gradients [43].
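The gradient boosting principle described above, where each new tree fits the residual error of the current ensemble, can be illustrated with a minimal sketch. This is not the implementation used in the paper: it uses single-feature decision stumps and squared loss for brevity, and all names are illustrative assumptions.

```python
def fit_stump(x, residuals):
    """Fit the best single-split decision stump on one feature,
    minimizing the squared error of the residuals."""
    best = None
    for threshold in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, t, l, r = best
    return lambda v, t=t, l=l, r=r: l if v <= t else r

def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Build an ensemble where each stump is fitted to the residuals
    of the model built so far, scaled by the learning rate."""
    base = sum(y) / len(y)           # initial prediction: the mean
    stumps, pred = [], [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump(xi) for p, xi in zip(pred, x)]
    return lambda v: base + learning_rate * sum(s(v) for s in stumps)
```

Each boosting step shrinks the remaining residuals, which is what allows the ensemble to converge toward the target values gradually rather than overshooting them.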

Quality determination
Quality determination has been performed using two metrics: the coefficient of determination (R²) and the Mean Absolute Percentage Error (MAPE). Both compare the real dataset values y to the set of predicted dataset values ŷ. R² is calculated according to the equation [40,41]:

R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²,

where ȳ is the mean of the real values. Using the same notation, MAPE is calculated using the equation [46]:

MAPE = (100/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|,

where n is the number of elements in the test set. R² is a dimensionless value that defines the amount of variance of the real data that is captured by the predicted data [43,44]. This value is commonly used in the evaluation of regression models, as it provides a good description of how well the model follows the variation of the output variable in the real dataset. MAPE is in turn expressed as a percentage. It describes the average error the model achieves across the test set. The benefit of MAPE is that, as the error is given as a percentage, it is easy to interpret and understand its value. Beyond that, its performance is extremely similar to the more commonly used MAE [48]. Figures 2 and 3 show the results achieved by the algorithms. In both figures, the scores are given for both GBT and SR algorithms, across all possible variations of dataset pre-processing (scaling and variable elimination).
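The two evaluation metrics can be computed directly from the equations above; this is a minimal pure-Python sketch with illustrative function names:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: share of the output variance
    of the real data captured by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error over the test set, in percent."""
    n = len(y_true)
    return 100.0 / n * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
```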

Results and Discussion
As can be noticed from Figure 2, GBT achieves higher R² scores than SR. The highest R² score is achieved by GBT with data scaling applied, regardless of the variable elimination criterion applied. Higher R² scores are achieved on the scaled data with both algorithms, while variable elimination does show somewhat improved R² scores, although those differences are not as visible as with data scaling.
The MAPE scores are presented in the same manner as the R 2 scores. The best error achieved, of 1.48%, corresponds to the highest R 2 score, as it is achieved by the GBT method on scaled data, regardless of the variable elimination correlation criterion.
Tables 5 and 6 present the hyperparameters that were used by the best solutions. It has to be noted that many solutions achieved similar scores, as minor hyperparameter variations can lead to extremely similar models. The scores of these best solutions correspond to those shown in Figures 2 and 3, so the achieved scores have not been repeated.
Observing the hyperparameters for the best SR models, it can be noted that all models utilized a relatively high crossover probability, equal to or higher than 0.9. High values have also been used for the population size, number of generations, and initial tree depth. The use of the higher range of hyperparameters for the best solutions suggests that the regression problem is relatively hard.
From Table 6 it is noticeable that relatively high values are used for the number of estimators and maximal tree depth. As with SR, this also tends to suggest a relatively hard regression problem. Interestingly, all the best solutions use the square root for calculating the maximum number of features, and GBTree for the training algorithm.

Conclusion
The results suggest that SR and GBT can be used for regression of the residuary resistance per unit weight of displacement. Of the two methods, the models regressed with GBT show a higher quality regression. While in comparison to previous research [15] the achieved models are of a lower quality, the findings point out that both methods, especially GBT, could be used to address the presented, or similar, problems. Observing the hyperparameters selected during the training process, it can be noted that they tended towards the higher end of the hyperparameter ranges. This suggests that further increasing the hyperparameter ranges could achieve better results.
Analysis of the dataset shows that the output values of the dataset, for the value of the residuary resistance per unit weight of displacement, are not uniformly or normally distributed, which could be part of the issues causing the SR and GBT methods to fail to regress the problem with higher quality. In the use of neural networks, this can be addressed by using a large number of weighted neuron connections, which successfully make up for the lacking correlation information between the input and output variables. However, due to the lower complexity of the SR and GBT methods, the same is not the case when regression is performed using them. Future research should focus on the analysis of the dataset and its composition, so it could be more easily determined whether further statistical analysis and data pre-processing could be applied to reduce the problem complexity.
An important note on the dataset composition is that variable elimination does not show a significant lowering of the scores, even when a total of five out of six variables are removed. This confirms the performed correlation analysis and suggests that the Froude number may be a key factor in AI modeling of the residuary resistance of ship hull forms, while the other input variables from the dataset, especially the length-beam and length-displacement ratios, may not be necessary for modeling. Future work could focus on applying further validation on the variable-removed dataset, for example using k-fold cross-validation or similar approaches, to test the generalization properties of the models.
Finally, it can be concluded that explainable models can be used to solve relatively complex problems in the maritime environment and modeling, and such approaches should be given consideration by the researchers. Future work in the field may include the application of SR, GBT, or similar AI modeling techniques in the modeling of energy [49], environmental effects [50], or internal systems [51] within the domain of maritime applications.

Funding:
The presented research has not been funded by external sources.