INTRODUCTION
The tourism sector has been profoundly transformed by the digitization of processes (Gretzel et al., 2015); as a consequence, it is now commonplace for prospective travelers to request quotations from multiple hotels and pick the best offer. Correspondence management has therefore become one of the most important tasks in hotel management. However, creating and sending offers is time-consuming and causes high employee costs. The potential to optimize the sales and acquisition process is reinforced by the fact that only a small percentage of the offers made actually turns into bookings; most of the effort spent on creating and writing offers is therefore wasted on non-booking customers. Thus, the ability to accurately predict whether a guest will book or not, given a request for quotation, can help to prioritize the correspondence management workload and increase conversion rates (i.e., the share of customers booking an offer). Depending on the likelihood of an actual booking, appropriate actions can be taken to increase the chances of converting a request into a booking.
Nevertheless, such a problem seems to be under-explored in the current tourism-related scientific literature (Egger, 2022). We therefore aim to bridge this research gap by proposing an experimental investigation of automated algorithms that predict the likelihood of a request for quotation turning into a booking, based on its features. We built our methodology upon previous work on booking cancellation prediction (Antonio et al., 2017) and on revenue and booking forecasting (Pan et al., 2012; Li et al., 2017). We leveraged real-world logged data to evaluate the classification effectiveness of different algorithms (i.e., tree-based, Bayesian, and neural network algorithms) and techniques (i.e., class subsampling and feature selection), as well as to provide qualitative insights into customer behavior.
In particular, we collaborated with an IT company that provided access to a data log of anonymized past interactions with its correspondence management software. The software is employed by tourism service providers, such as accommodation facilities, to collect, manage, and reply to incoming requests for quotation in an effective and practical manner. Requests may originate from different channels, such as e-mails, phone calls, or web portals. We focused our analysis on a specific hotel with a long-lasting usage history of the correspondence software, namely a 4-star hotel with both summer and winter tourism.
The major goal of our research is the application of data-driven machine learning algorithms to predict whether a customer submitting a request for quotation will turn into a booker. This problem naturally suffers from an imbalance in the values of the class we want to predict (i.e., whether a request will convert into a booking or not).
Namely, booking conversions represent a small percentage of all the requests for quotation processed by the software and included in our data log. We investigate how this problem impacts classification performance and how to cope with it (Ali et al., 2013). For the analysis, we focus on request features (i.e., the independent variables), such as the duration of stay and customer demographics like the number of adults and children, in order to predict actual conversion into a booking (i.e., the dependent variable). More specifically, we want to assess the effectiveness of different machine learning techniques on the booking prediction task, as well as the impact of different features, inspired by the work by Cezar & Ogüt (2016). Thus, we are guided by the following research questions:
• RQ 1 Is it possible to predict whether a request for quotation will turn into a booking via supervised classification algorithms?
• RQ 2 How can we address the class imbalance (i.e., the proportion of non-booking requests in the data is much higher than bookings) to improve classification performance?
• RQ 3 Can we identify the features that contribute most to classification accuracy?
• RQ 4 Which feature values best discriminate between booking and non-booking requests?
The rest of the paper is organized as follows: in Section 1 we present related applications of machine learning techniques in the area of tourism; in Section 2 we describe the data used in the experiments, the context in which they are generated, and the applied pre-processing steps; in Section 3 we describe the machine learning experiments and discuss the results, before drawing conclusions.
1. RELATED WORK
The optimization of sales processes and correspondence management in the accommodation industry with the help of machine learning techniques is clearly under-investigated in scientific research (Egger, 2022). Therefore, we survey machine learning applications for similar tasks in tourism-related scientific literature. The works presented in this review, categorized by topic, are reported in Table 1.
Booking prediction for managing overbooking is one of the key techniques of revenue management. After the Airline Deregulation Act of 1978 deregulated pricing policies, revenue management started to play an important role in the aviation industry. However, as early as 1966, American Airlines developed their first computer-based revenue management system (Chiang et al., 2007). Inspired by the aviation industry, the hotel industry also started to implement revenue management and the prediction of cancellations (Chiang et al., 2007; Kimes & Wirtz, 2003). Related research conducted by Antonio et al. (2017) demonstrated the possibility of predicting booking cancellations with an accuracy of 90% in a real-world setting. Their experiments were performed on the data of four hotels in the resort region of Algarve, Portugal. Similarly to the features available in our research, they used information such as the duration of stay, the number of guests, and demographic information, such as nationality, as well as characteristics of the booking process, such as the time between the booking and the arrival.
In the recommendation context, and specifically for flight-based travel booking, an application of a cascaded machine learning model was proposed by Thomas et al. (2019). The goal was to select the best set of hotels to recommend to the traveler, based on the estimated probability of conversion for each candidate hotel and on an analysis of the feature importance derived from flight information. In a different direction, machine learning was used to determine users' e-trust in adopting web-based travel intermediaries (Çiftçi & Cizel, 2020); specifically, hierarchical linear regression was applied to investigate which factors affect tourists' e-trust perception. Similarly, Wong et al. (2020) applied partial least squares and importance-performance map analysis to identify the relationship between service quality and hotel guest satisfaction.
In the tourism domain, some effort has also been dedicated to better understanding the booking conversion problem with respect to specific browsing-activity features. For instance, Cezar & Ogüt (2016) proposed an analysis of the impact of browsing-related variables, namely review ratings (on location and service aspects), recommendations, and search rankings, on the conversion of a browsing activity into a booking. Similarly, Xie & Lee (2020) investigated how informational cues displayed in an online hotel search process, including quality indicators, brand affiliation, incentives (discounted price and promotion), and position in search results, influence consumer conversion at different stages of a browsing session. Finally, the importance of environmental and social media features in tourism attraction prediction was studied by Khatibi et al. (2020).
Another relevant problem, both in our work and in tourism-related machine learning classification in general, is class imbalance (Ali et al., 2013). Generally, the data points related to the positive event (i.e., booking) are much rarer than those of the corresponding negative event (i.e., non-booking); data-driven classifiers can therefore struggle to correctly recognize positive instances. Many different techniques have been proposed in the literature to address this problem (Thabtah et al., 2020). For instance, Adil et al. (2021) applied an oversampling technique to the minority class in order to facilitate the supervised classification of hotel booking cancellations. This contrasts with our proposal, where a down-sampling of the majority class at different ratios is analyzed instead.
A similar topic concerns the forecasting of revenue (Webb et al., 2020; Fiori & Foroni, 2020) and of occupancy/booking (Schwartz et al., 2016; Pan et al., 2012) by means of automated algorithms. Indeed, a large share of related work exists in the context of forecasting tourism demand. This refers, for instance, to the regression problem of predicting the number of arrivals at vacation destinations based on past collected data (Assaf et al., 2019; Pan & Yang, 2017). Chen & Wang (2007) compared support vector regression with neural networks and autoregressive integrated moving average models for the task of forecasting tourism demand in China in the years 1985 to 2001. Related works by Yang et al. (2015) and Li et al. (2017) combined search engine data and visitor volumes to boost the accuracy of tourism demand forecasting for popular locations in China. Similar research was conducted by Claveria & Torra (2014), where autoregressive methods are compared with artificial neural networks to forecast tourism arrivals in Catalonia from 2001 to 2009. Höpken et al. (2017) exploited mining approaches (i.e., nearest neighbors and linear regression) on big data, such as web search traffic, to predict tourist arrivals in Åre (Sweden) for the period 2005-2012. Finally, hotel profitability forecasting was addressed by means of partial least squares and clustering techniques (Lado-Sestayo & Vivel-Búa, 2018), as well as through deep neural network regressors (Lado-Sestayo & Vivel-Búa, 2019).
2. DATA DESCRIPTION
In this section we explain the context, structure, and format of the data used in the analysis. Moreover, we describe the software and the procedure through which the data are generated during booking.
2.1. Booking Procedure
The correspondence software, from which the data for the booking prediction are derived, is a hotel management software. It is used by a total of 92 accommodation businesses: 44 3-star, 13 4-star, and one 5-star hotel, as well as 9 guest houses, 7 garni hotels, and 18 residences. Most of them are located in South Tyrol, Italy, in ski and/or hiking destinations. The database contains log data spanning from 2014 up to the first months of 2019.
The procedure for booking a room through the correspondence software is displayed in the sequence diagram of Figure 1 and consists of the following steps.
1. The guest generates a request for quotation which may come from a web portal, an e-mail or a direct call to the hotel.
2. The hotel staff responsible for handling the request send a letter (i.e., an e-mail) that contains one or multiple offers.
3. The guest may choose one of the received offers and confirm the reservation (i.e., she books the offer).
4. The guest may decide to cancel a booked offer, if allowed by the hotel policy.
Step 1 is the key step, in which most of the data used in our analysis are collected. Here the customers provide demographic information about themselves and their fellow travelers, the period of stay, and specific requests described in free text. Furthermore, meta-information, such as the source of communication and the preferred type of communication, is collected. In Step 2 the receptionist of the hotel generates a set of offers to answer the specific request; the added information for each offer concerns the number of rooms, the room types, the boarding types, and the price. Finally, in the optional Steps 3 and 4, the status of each request is updated based on the guest's response.
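To make the collected information concrete, the following minimal sketch models a request record as it could look after Step 1; the class and field names are illustrative assumptions that only loosely mirror the dataset features described in Section 2.2:

```python
from dataclasses import dataclass

@dataclass
class QuotationRequest:
    """Illustrative shape of the data collected in Step 1 (hypothetical names)."""
    adults: int              # number of adult travelers
    children: int            # number of children
    arrival: str             # requested arrival date (ISO format)
    departure: str           # requested departure date
    country_code: int        # anonymized country identifier
    language_id: int         # preferred communication language
    source_of_business: int  # channel the request originated from
    free_text: str = ""      # specific wishes, in free text
```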
2.2. Dataset
After a pre-processing and data cleaning phase we obtained the final dataset used in our experiments. Table 2 lists all features included in the dataset, and Table 3 summarizes the descriptive statistics for the numeric independent variables. For the categorical variables, the number of unique values in the data is: 66 for “CG_CountryCode”, 30 for “CR_SourceOfBusiness”, 4 for “CG_Gender” (man, woman, company, group), 3 for “CG_LanguageId” (German, Italian, English), and 2 for “CR_Season” (winter, summer).
The experimentation focused on the data of a specific 4-star hotel with winter and summer seasons, targeting skiing and hiking tourists. By restricting our analysis to a single hotel with an intense and long-lasting usage history of the correspondence management software, we keep the workflow of the booking process under control. On the other hand, we are aware that constraining the analysis to a single hotel may introduce biases (e.g., the country of origin of the inquirer is geographically conditioned on the location of the hotel), which might represent a limitation of the proposed approach. We stress that the data were anonymized, such that it was not possible to consider the personalized and historical characteristics of returning customers.
The dataset contains a total of 57054 requests after pre-processing, of which 4919 ended up in a booking while 52135 did not. Thus, 8.62% of the incoming requests finally turned into bookings and belong to the positive class, while the remaining 91.38% belong to the negative class. The dataset clearly suffers from a serious class imbalance problem; in order to cope with it, class subsampling was applied and its effect analyzed.
3. EXPERIMENTS AND RESULTS
The main focus of our experiments is to evaluate and compare the effectiveness of various machine learning techniques applied to the prediction of booking requests. For the sake of clarity, we divided the experiments into three main sub-phases: (i) we compare a set of classification algorithms to select the best models for the task; (ii) we evaluate the effect of sub-sampling in alleviating the class imbalance problem; (iii) we apply feature selection to understand the importance of the different features and their impact on the classification task. To ensure replicability and transparency, for each of these steps we describe in detail how the experiments were set up, and we present and discuss the results. For all experiments, the Python library scikit-learn (https://scikit-learn.org/stable/) was used for data processing, classification, and evaluation.
3.1. Model Selection
For the model selection task we provide a comprehensive analysis of different classification approaches from the literature (Tan, 2005), comparing classification performance across a wide range of algorithms, each with its own peculiarities. Specifically, 16 different classifiers were tested, out of which the five best were selected for deeper analysis. Appendix B fully reports the mean and standard deviation results for all 16 classifiers; for the sake of clarity, here we report and discuss only the results of the five best performing models, namely Random Forest and Extra Tree (Rokach & Maimon, 2014), Gaussian Naive Bayes (Murphy, 2012), Multi-layer Perceptron (MLP) (Gurney, 1997), and Support Vector Classifier (SVC) with a radial basis function kernel (Schölkopf & Smola, 2001). The results were also benchmarked against two baselines, i.e., dummy classifiers with the most-frequent and stratified strategies. All classifiers were tested and evaluated with the default parameters provided by the scikit-learn library; please refer to the documentation (https://scikit-learn.org/stable/) for the complete list of parameters and their values.
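As an illustration, the sketch below shows how the compared models can be instantiated with scikit-learn defaults; the dictionary and its labels are ours, while all constructors are standard scikit-learn classes:

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# The five selected classifiers plus the two dummy baselines,
# all with scikit-learn default parameters.
models = {
    "Random Forest": RandomForestClassifier(),
    "Extra Tree": ExtraTreesClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Multi-layer Perceptron": MLPClassifier(),
    "SVC (RBF kernel)": SVC(kernel="rbf"),  # rbf is also the default
    "Baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Baseline (stratified)": DummyClassifier(strategy="stratified"),
}
```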
A brief description of each of those five algorithms is provided in Appendix A, together with a formal definition of the metrics used in the evaluation.
We designed a robust evaluation protocol to test and compare the performance of each classifier; this protocol guarantees a high degree of generalization and significance of the reported results. Namely, for each model we performed a 10-fold cross-validation on the whole dataset, repeated 10 times, shuffling the data at the beginning of each repetition before the 10 folds were sampled. To preserve the original class distribution, we applied stratified random sampling over the folds (Tan, 2005). The mean and standard deviation of the classification performance scores, namely precision, recall, F1-score, and accuracy, over the 10 repetitions are shown in Table 4. The speed index indicates how fast each of the five classifiers is trained in comparison to the others. Note that one-hot encoding was applied to the categorical features in order to fit every type of training algorithm. Each of the selected classifiers on which the analysis is conducted outperforms all the others with respect to at least one metric.
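A minimal sketch of this protocol, assuming the pre-processed data are available as a pandas DataFrame with a hypothetical binary target column named "booked":

```python
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("requests.csv")                 # hypothetical export of the log
X = pd.get_dummies(df.drop(columns=["booked"]))  # one-hot encode categoricals
y = df["booked"]

# 10 repetitions of a shuffled, stratified 10-fold cross-validation.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_validate(GaussianNB(), X, y, cv=cv,
                        scoring=["precision", "recall", "f1", "accuracy"])
for metric in ["precision", "recall", "f1", "accuracy"]:
    values = scores["test_" + metric]
    print(f"{metric}: {values.mean():.3f} +/- {values.std():.3f}")
```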
From the analysis it immediately emerges that there is no clear winner among the different classifiers: each one has its strengths and weaknesses. The Extra Tree classifier, for instance, produces balanced precision and recall (around 33% and 34%) and the second-best F1 score; furthermore, it is among the fastest. Random Forest achieves the best F1 score of 35%, good precision (48%), and good accuracy (91%), but below-average recall. Naive Bayes has the highest recall of 65%, but a low precision of 21%. The opposite holds for the MLP, which has a high precision (66%) and, together with SVC, the highest accuracy of 92%, but a low recall (20%). The Support Vector Classifier outperforms all other classifiers with respect to precision (75%) and accuracy, but it yields the lowest recall and is much slower than the other classifiers.
It can, therefore, be concluded that all five classifiers provide significantly better results than the baselines and therefore have the potential to be implemented in a real-world setting, depending on the desired metrics to optimize. To sum up the results, the following list shows the best classifier for each measure:
• Precision: Support Vector Classifier
• Recall: Gaussian Naive Bayes
• F1: Random Forest
• Accuracy: Multi-layer Perceptron
Since this machine learning application is built for the specific task of predicting booking events in the tourism domain, we focus on the peculiarities of this application. In general, this is a low-risk task: wrongly predicting a positive instance (i.e., considering as a booker a person who will not convert her request into a booking) is not very harmful and only wastes the few minutes needed to send an offer. Missing a positive instance, instead, and thus ignoring a potential booker without taking proper actions, would strongly hurt the hotel's revenue. It is therefore reasonable to consider recall (i.e., the proportion of positive cases correctly retrieved) as the most important measure to optimize, which makes Naive Bayes the most promising classifier to be tested in a real-world scenario. In the next section we apply a subsampling technique to cope with the imbalanced class proportions in the original data and to optimize for the recall metric.
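For reference, the standard definitions of these metrics (formally introduced in Appendix A), in terms of true positives (TP), false positives (FP), and false negatives (FN), are:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```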
3.2. Class Imbalance Subsampling
We now study the effect of subsampling to cope with the class imbalance problem (Japkowicz & Stephen, 2002). In real-world scenarios the positive class (i.e., the class we want to predict) is usually outnumbered by the negative class, and in most cases it is more relevant to correctly retrieve the positive instances than the negative ones. However, due to the imbalanced numbers of examples belonging to the two classes, many classifiers are biased towards predicting the majority class. To counter this problem, the proportion of positive and negative examples in the training set can be adjusted to remove the bias. As already mentioned, the analyzed dataset suffers strongly from class imbalance: only 8.62% of instances belong to the positive class (booking).
Hence, in order to improve the recall of the positive class, we ran a set of experiments enforcing different class proportions on the training set. Namely, the training set was modified to achieve a more balanced ratio of positive to negative cases, i.e., 1:5, 1:3, 1:2, and 1:1. For each ratio, the performance of the classifiers was tested using a 75/25 hold-out approach, repeated 10 times (Tan, 2005). In each iteration, the positive-to-negative ratio of the 75% training set is modified by randomly filtering out negative instances; the classifier is then trained on the modified set and finally tested on the remaining 25% test set, which keeps the original proportion of positive and negative observations. As in the previous experimental setup, averaging the results over 10 repetitions with random train/test splits ensures more robust and significant measurements.
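A minimal sketch of the per-iteration subsampling step, assuming NumPy arrays and a 0/1 target (the helper name is ours):

```python
import numpy as np

def subsample_majority(X_train, y_train, ratio, seed=0):
    """Keep all positives and at most `ratio` negatives per positive,
    so that the training set approaches a 1:ratio class proportion."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_train == 1)
    neg = np.flatnonzero(y_train == 0)
    n_neg = min(len(neg), ratio * len(pos))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    rng.shuffle(keep)
    return X_train[keep], y_train[keep]

# Example: train on a 1:3 subsample; the test set stays untouched.
# X_sub, y_sub = subsample_majority(X_train, y_train, ratio=3)
```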
In Figure 2, we plot the results obtained by subsampling with different positive/negative ratios to alleviate the class imbalance problem; the results on the original dataset without subsampling are also reported for comparison. Focusing on precision and accuracy, it can be observed that all five classifiers performed best on the original dataset: removing negative observations decreases precision and accuracy, which reach their worst scores at the 1:1 ratio, i.e., with the same number of positive and negative observations. In terms of recall, by contrast, all classifiers perform worst on the original, imbalanced data and best with perfect class balance. The same trend can be observed for all five classifiers, i.e., removing negative entries increases recall, as expected. When comparing the highest recall values (on the balanced proportions) across the five classifiers, we notice that Naive Bayes and the tree-based approaches achieve the lowest results, between 73% and 76%, whereas MLP and SVC greatly improve their recall compared to the original data distribution, scoring 83% and 87% respectively. Precision scores, on the other hand, decrease as the original data are subsampled, with the Gaussian Naive Bayes classifier again performing best on the original proportion. For four out of five classifiers, the F1 score reaches its best value at the 1:3 proportion, i.e., this ratio of positive to negative examples yields the best trade-off between precision and recall.
To summarize, even though subsampling deteriorates precision and accuracy, this technique, i.e., finding the best ratio of positive to negative instances, can help improve positive-class recall and achieve a better recall/precision trade-off, while increasing the number of correctly predicted positive examples. The most balanced ratio (i.e., 1:1) produces the highest recall for each classifier, with the overall optimum achieved by MLP and SVC (i.e., above 82%); this combination of methods is recommended when the target is to optimize recall, as discussed in Section 3.1. The 1:3 ratio should instead be considered the optimal proportion for training the classifiers in order to improve the recall/precision balance: with this ratio, Random Forest and MLP produced the highest F1 scores (around 42%).
The major business insight derived from this analysis is that, in an idealized scenario fitted to the collected results, approximately 10% of the incoming requests turn into a booking and we can apply a classifier achieving a recall of 80% at a precision of 20%. This means that, by using our approach as a pre-filter and prioritizing the 40% of incoming requests automatically selected by the classifier, the receptionist would correctly serve 80% of all actual bookers while saving 60% of their working time. This intuition is schematized in the graph in Figure 3.
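The arithmetic behind this estimate follows directly from the metric definitions: the share of incoming requests flagged by the classifier is

```latex
\text{flagged share} = \frac{\text{base rate} \times \text{recall}}{\text{precision}}
                     = \frac{0.10 \times 0.80}{0.20} = 0.40,
```

i.e., 40% of the requests are prioritized; the remaining 60% of the correspondence workload can be handled later, while still reaching 80% of the actual bookers.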
3.3. Feature Selection
Feature selection is a technique that reduces the number of features in a dataset with the aim of improving classification performance by counteracting overfitting (Liu & Motoda, 2007). Two main categories of feature selection exist: filter approaches and wrapper approaches. In the filter approach, redundant or irrelevant features are removed before applying the classification algorithm (Guyon & Elisseeff, 2003); features are filtered, regardless of the classifier, using score functions such as chi-squared, mutual information (entropy), or the Gini index. Wrapper methods, instead, select the best combination of features by comparing the performance of different feature subsets for a specific classification model (Kohavi & John, 1997).
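As an illustration, a filter selection of this kind can be expressed in a few lines of scikit-learn; this is a generic sketch (the choice of mutual information and of k = 8 is ours), not the exact procedure used in our experiments:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Filter approach: score features independently of any classifier and
# keep the 8 highest-scoring ones (k is an illustrative choice).
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_selected = selector.fit_transform(X, y)  # X, y as in the earlier sketches
```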
In our work, we focused on the filter approach applied to the five selected classifiers. We evaluated the classification performance over the i best features, with i = 1, 2, ..., n, where n is the total number of features in our dataset. Furthermore, we discuss the feature ranking that emerges from the filter method in order to provide more insights into the data. The experiments in this section were performed using a standard time-based 75/25 hold-out split: the older 75% of the data were used as a training set to create the ranking over the features and to learn the model on the selected subset of features, while the more recent 25% were used to test the trained classifier on unseen data. This evaluation protocol was required to keep the computational cost of replicating the experiments for all the feature subsets manageable. We always use the same portion of the dataset to learn the feature ranking and to train each model; this ensures consistency across the different models and generates a unique set of features at each step. Furthermore, since the feature selection is performed by an external measure (i.e., independent of the classifiers), the feature ranking is fixed across all the classification algorithms.
In Table 5, we report the ranking of the features based on the entropy score computed on the older 75% of the data. This feature importance score is derived from the decision tree representation of the classification task: the score is calculated as the decrease in node entropy after a split, weighted by the probability of reaching that node, then averaged for each feature and normalized such that the scores of all features sum to 1. The feature “CR_RequestedDaysBeforeArrival” clearly outscores the other features, with an importance score of 0.31. The feature “CG_CountryCode” also obtains a reasonably high score of 0.2. Furthermore, both “CR_Duration” and “CR_SourceOfBusiness” still perform reasonably well compared to the less important features (around 0.12 each). The features from position 5 to 16, in contrast, score poorly in terms of information gain (below 0.1), meaning that they provide a relatively small contribution to discriminating between the classes.
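One way to reproduce such an entropy-based ranking and the subsequent top-i evaluation is sketched below, reusing X and y from the earlier sketches and assuming the rows are sorted by request date; using a single decision tree's impurity-based importances is our interpretation of the described score:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# Time-based 75/25 hold-out: the older 75% for ranking and training.
split = int(0.75 * len(X))
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Entropy-based importances: mean decrease in node entropy, weighted by
# the probability of reaching the node, normalized to sum to 1.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
ranking = np.argsort(tree.feature_importances_)[::-1]

for i in range(1, X.shape[1] + 1):
    cols = X.columns[ranking[:i]]  # the i best-ranked features
    model = RandomForestClassifier().fit(X_train[cols], y_train)
    print(i, f1_score(y_test, model.predict(X_test[cols])))
```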
In Figure 4, we show the classification performance of the five selected classifiers trained on the best i features (with i = 1, 2, ..., n, displayed on the x-axis), derived from the entropy-based ranking. The results were produced in a single run of the time-based hold-out evaluation protocol.
It is interesting to notice the different behaviors of the classifiers with respect to feature selection that emerge from the graphs. The tree-based classifiers (i.e., Random Forest and Extra Tree) are less affected by the selection of features, as depicted in Figures 4a and 4b: from 4 up to 16 features they reach a stable and optimal performance on the metrics related to the positive class. The only exception is that Random Forest achieves its absolute best precision with just one feature; this is because, under the strong class imbalance, the classifier predicts very few positive instances, which makes a high precision much easier to obtain. A similar pattern is observed for the Support Vector Classifier in Figure 4e: a significant boost in precision (around 30%) occurs when the 8 best features are selected, after which precision stays stable up to the complete set of features. The two remaining classifiers, in contrast, are much more affected by feature selection. For both the Multi-layer Perceptron and Naive Bayes (Figures 4c and 4d), the best F1 results are achieved between 8 and 14 features. MLP presents a more stable behavior across the optimal range of features, while NB behaves in a more idiosyncratic way, with a peak in precision at 9 features and a peak in recall at 13 features.
Figure 4. Classification performance over the i best features for: (a) Extra Tree; (b) Random Forest; (c) Gaussian Naive Bayes; (d) Multi-layer Perceptron; (e) Support Vector Classifier.
3.4. Feature Value Analysis
We now qualitatively explore the semantics of the features identified as important in the feature selection phase. In particular, we aim to enhance the comprehension of the booking process and to increase the business value of the analysis. Thus, we inspect differences between the feature values of observations belonging to the two class bins (i.e., “Book” and “NotBook”).
In Table 6, we report the average value of each numeric feature for the positive and the negative observations. We notice that the averages of several values differ between the positive and the negative records. The average number of adults (“CR_Adults”) is, for instance, slightly lower for booked requests than for non-booked requests. Similarly, the number of children and all three age spans (“CR_Children”, “CR_Age_0-3”, “CR_Age_4-10”, “CR_Age_11-17”) show slightly lower averages for booked requests. This highlights a general tendency of larger groups of people (or bigger families) to be more exploratory in their requests, showing less propensity to actually book than smaller ones. Furthermore, the average duration of stay (“CR_Duration”) of booked requests is more than half a day shorter than that of non-booked requests, corroborating the intuition that requests for longer (thus more expensive) stays are in general less likely to be converted into bookings. The most significant difference in the table can be observed in the average number of days between the request and the arrival date (“CR_RequestedDaysBeforeArrival”): positive requests were sent, on average, more than 8 days later than negative ones, so the closer the request date is to the trip date, the more likely the conversion.
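Such a comparison can be obtained with a single grouped aggregation; a minimal sketch, assuming the DataFrame df and the hypothetical target column "booked" from the earlier sketches:

```python
numeric_cols = ["CR_Adults", "CR_Children", "CR_Age_0-3", "CR_Age_4-10",
                "CR_Age_11-17", "CR_Duration", "CR_RequestedDaysBeforeArrival"]
# Average of each numeric feature per class (cf. Table 6).
print(df.groupby("booked")[numeric_cols].mean().T)
```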
In Table 7, we report relevant class frequency counts for the categorical features, i.e., we compute the percentage of bookings given specific feature values. It becomes evident that significant differences in booking probability exist for some of these categorical values. For instance, if we consider the feature “CG_CountryCode”, we notice markedly different behaviors with respect to the measured booking-rate. The country code 65 (unknown) corresponds to a booking-rate of just 1%: a request in which the country code is not specified by the user is very unlikely to be converted into a booking. Vice versa, specific codes largely increase the probability of conversion: for the country codes 17 (Germany) and 34 (Italy) a booking-rate of 18% was registered in the database. The rate is even higher for country code 28 (Croatia), with a 30% booking-rate, albeit with fewer observations than the other mentioned countries. The attribute “CG_Gender”, instead, has only a slight impact on the booking probability: the booking-rate conditioned on each gender type (namely, 0: male, 1: female, 2: company, 3: family) is close to the baseline of 8%. Similar conclusions can be drawn from the “CR_Season” attribute, where the booking-rate is slightly higher for season 0 (winter), at around 10%. Furthermore, we can see that the language (“CG_LanguageId”) selected for the communication with the hotel has a larger impact on the booking behavior: German-speaking (id: 1) and Italian-speaking (id: 2) persons have booking-rates of 7% and 9% respectively (in line with the average of the dataset), whereas English-speaking (id: 3) persons reach 21%. This suggests that booking requests coming from outside the surrounding area of the region (i.e., not from German or Italian speakers) have a much higher chance of turning into an actual reservation. Looking at the booking-rate of the feature “CR_SourceOfBusiness”, we see the most diverse results. The largest difference can be noticed between categories 6 and 20: category 6 has a booking-rate of 91%, while category 20 has a booking-rate of just 1%. The former is mapped to a source of request that is manually entered by receptionists and thus might represent requests coming in via phone call; the latter is mapped to a request portal related to a specific skiing area. Finally, the last three listed features are related to specific parameters of the system. In particular, when a recall e-mail was sent (“L_RecallEmailSended” = 1) the booking-rate is lower; the same applies when the customer is allowed to send reminder e-mails (“L_SendNachhakEmail” = 1). We can also see that the booking-rate for traditional e-mail offers (“L_UseWebTamplate” = 0) is higher than for offers in the form of a website (“L_UseWebTamplate” = 1). From these observations it can be inferred that more traditional and personal communication channels result in a higher likelihood of an actual booking.
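The per-value booking rates discussed above can be computed analogously; again a sketch under the same assumptions:

```python
categorical_cols = ["CG_CountryCode", "CG_LanguageId", "CG_Gender",
                    "CR_Season", "CR_SourceOfBusiness"]
for col in categorical_cols:
    # Booking rate and number of observations per value (cf. Table 7).
    rates = df.groupby(col)["booked"].agg(["mean", "size"])
    print(rates.sort_values("mean", ascending=False))
```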
DISCUSSION AND CONCLUSION
We described a real-world application of machine learning techniques and data analysis in the context of e-tourism. To the best of our knowledge, this is a first attempt to leverage automated data mining to predict the likelihood of an incoming request for quotation converting into a booking. Our work was inspired by and grounded in previous research on booking cancellation prediction (Antonio et al., 2017) and tourism demand forecasting (Pan et al., 2012).
Namely, we conducted an extensive experimental study on a dataset constructed from the requests for quotation collected by a 4-star hotel in South Tyrol (Italy) during the period 2014-2019. The task was to predict whether a request will convert into a booking or not, given a set of 16 engineered and semantically meaningful features. A large exploratory study with 16 different supervised classification models was conducted to select the 5 best performing ones according to different classification metrics, which were then exploited in more insightful analyses. The best performing models achieved convincing results in the booking prediction task (RQ 1). Specifically, Multi-layer Perceptron and Random Forest emerged as the most promising models overall, registering the highest F1 score (around 42%), derived from a high recall (between 50% and 60%) against a reasonable precision (above 30%). These results clearly outperformed the benchmark techniques and confirmed the findings of Antonio et al. (2017) that tree-based and neural network algorithms are the most suitable for tourist behavior prediction.
We further optimized the classification performance towards the business-critical metric of recall (i.e., the percentage of booking events correctly retrieved) by running a set of experiments centered on majority-class subsampling to cope with class imbalance (RQ 2). Indeed, as discussed in Section 3.2, the most important insight derived from this analysis is that an MLP classifier trained on an equally re-balanced dataset can optimally support the receptionist's work, correctly identifying 80% of the actual bookers while saving 60% of the receptionist's working time. This trend is mildly corroborated by case studies in the related field of tourism demand and revenue forecasting, where MLP-based forecasting algorithms produced convincing results (Claveria & Torra, 2014; Lado-Sestayo & Vivel-Búa, 2019). On the other hand, Höpken et al. (2017) proved the capability of a simpler nearest neighbor approach to outperform standard statistical approaches, like linear regression.
Finally, a thorough analysis of feature selection and feature importance was performed, with the dual goal of optimizing classification performance (RQ 3) and providing more insights into the features' semantics and their impact on the prediction task (RQ 4). Time-related features of the request, such as the number of days prior to the stay and the duration of the stay, as well as qualitative features, such as the country of the inquirer and the business source of the request, turned out to be the most important attributes for discriminating a booking request. These findings were partially confirmed by previous experimental research: for instance, Antonio et al. (2017) reported the country of the customer as one of the most relevant independent variables for predicting booking cancellation. In contrast with our results, however, the duration of the stay and the booking time showed a marginal impact on the conversion or cancellation rate (Antonio et al., 2017; Cezar & Ogüt, 2016). In conclusion, promising results emerged from this exploratory analysis of an automated approach, in the e-tourism domain, used as a filtering support to identify the requests for quotation that will most probably convert into a booking. This work could easily be extended and replicated by practitioners and hotel managers, applying standard machine learning techniques to the real-world data of an accommodation facility, thereby improving the capability of the management software to prioritize the more profitable incoming requests and, ultimately, increasing revenues.
The main limitation of the presented research concerns the limited amount of data, related to a single hotel, and the scarce set of (non-personalized) features involved in the study. Future directions of this research should therefore include enriching the dataset with data from different hotels, as well as with additional features, such as customers' demographics, room characteristics, or features describing the request and booking process. For instance, it would be interesting to process the communication (i.e., e-mails and chats) between the accommodation and the customer with natural language processing techniques in order to extract text-related features. An additional direction is to exploit more sophisticated techniques to infer, in a systematic way, the impact of the features and their importance in the prediction task.