Comparison of Job Satisfaction Prediction Models for Construction Workers: CART vs. Neural Network

Abstract: To establish a suitable prediction model of construction workers' job satisfaction, this study compares two widely used prediction models, CART (Classification and Regression Tree) and NN (Neural Network), and identifies the main factors influencing construction workers' job satisfaction in occupational health and safety training. An analysis of 280 cases of empirical survey data shows that, on categorical variables, the CART model predicts better than the NN model in terms of both Kappa value and accuracy, and that the main factors affecting job satisfaction are job category, working days per week, and the latest training time. The main innovation of this paper is to add an actual value set of empirical data to the usual training, verification, test, and prediction sets, and to draw conclusions from the Kappa agreement between the predicted and actual values.


INTRODUCTION
The job satisfaction of construction workers is a prominent topic in occupational health and safety training and management in the construction industry. Many prediction models are available, such as the support vector machine, decision tree, neural network, logistic regression, and Bayesian network, and these can be combined with stepwise regression, which has been reported to improve the classification accuracy of almost all data mining technologies [1]. Prediction models have also been developed in many fields and industries, for example machine learning (ML) models for weather prediction. Using different input combinations of meteorological variables, four ML models for evaporation prediction were established: classification and regression tree (CART), cascade correlation neural network (CCNN), gene expression programming (GEP), and support vector machine (SVM). The results show that all four ML models predict the evaporation at the study site well [2]. Comparative prediction with multiple models is therefore common practice, and this study follows the same approach.
In biochemistry, principal component analysis (PCA), multiple linear regression (MLR), and artificial neural networks (ANN) have been used to study the toxicity and risk assessment of chemical compounds, and a quantitative model was proposed on this basis. The results show that MLR is suitable for toxicity prediction, but the ANN predictions are better and more effective [3], so the ANN model performs better in some specific settings. Other scholars used a CART model to estimate the probability of in-hospital death from acute myocardial infarction (AMI). They induced the model with CART, evaluated its performance by the area under the ROC curve (AUC) with a 95% confidence interval (CI), and found that the CART model is easier to use and explain because the generated decision rules can be applied without mathematical calculation [4,5]. In addition, prediction analysis of the gene chip, artificial neural networks, and classification and regression trees were used in a meta-analysis of gene expression profiles, and a two-layer genetic screening method was used to reduce the number of variables, leading to a high accuracy rate (close to 100%) [6]. In particular, some scholars applied a hidden Markov model to residential energy consumption prediction, used energy consumption data collected from four multi-story buildings in Seoul, South Korea for model verification and result analysis, and compared the model's predictions with three commonly used prediction algorithms, namely support vector machine (SVM), artificial neural network (ANN), and classification and regression tree (CART) [7]. CART and NN models are thus widely used, so this study applies these two models to construction enterprise sites and seeks to build on previous research by analyzing and predicting workers' job satisfaction.
CART is a decision-tree-based intelligent discriminant analysis method [8,9] that is widely used in the social sciences. Research shows that the free statistical modelling of machine learning and equation artificial intelligence is a very promising comprehensive tool [10], and that CART analysis can also be applied to clinical-pathological monitoring in medicine [11]. For soil and water conservation, agricultural production, and the biodiversity of ecological functions, the importance of analysis factors can be determined by CART [12]. In engineering, CART is mainly used for equipment improvement and for analyzing geotechnical engineering characteristics. In engineering construction, CART can be combined with multiple regression to monitor and record the operation process of a TBM (tunnel boring machine) and its evaluation system [13], and it can be used to evaluate important problems of dam operation, such as the design and safety of the dam structure when estimating dam storage capacity [14]. Moreover, CART has the following advantages:
(1) Large model capacity: the model selects independent variables for analysis according to their contribution among all independent variables, so it can automatically handle a large number of independent variables without concern that unrelated variables will interfere with the model.
(2) Wide range of application: the target variable can be either discrete or continuous. It can also effectively deal with the problem of exact variables.
(3) The model hierarchy is clear, readable, and understandable.
Therefore, this model is selected as the main analysis method for the empirical data.
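As an illustration of how a CART-style tree ranks independent variables by their contribution, the following minimal sketch scores one-vs-rest binary splits on categorical predictors by the decrease in Gini impurity. It is not the SPSS Modeler implementation used in this study: the choice of the Gini criterion, the function names, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(rows, labels):
    """Return (gain, feature, value) for the best 'feature == value' split.

    rows   : list of dicts mapping variable name -> category
    labels : list of class labels of the dependent variable
    """
    n = len(labels)
    parent = gini(labels)
    best = (0.0, None, None)
    for feature in rows[0]:
        for value in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] == value]
            right = [y for row, y in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue  # a split must send cases to both children
            children = len(left) / n * gini(left) + len(right) / n * gini(right)
            gain = parent - children
            if gain > best[0]:
                best = (gain, feature, value)
    return best


# Hypothetical toy data: job category (X3) separates the classes, gender (X14) does not.
rows = [
    {"X3": "woodworking", "X14": "male"},
    {"X3": "woodworking", "X14": "female"},
    {"X3": "steel fixing", "X14": "male"},
    {"X3": "steel fixing", "X14": "female"},
]
labels = ["satisfied", "satisfied", "dissatisfied", "dissatisfied"]
gain, feature, value = best_binary_split(rows, labels)
```

In this toy example the uninformative variable yields no impurity decrease and is never chosen, which mirrors the first advantage listed above.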
NNs were used early on in algorithm learning and optimization: an improved integrated learning algorithm was formed and applied to diagnose and improve track quality, providing an important guarantee for safe track operation. At the same time, by improving the resource utilization of rolling bearings, the operation cost was greatly reduced [15,16]. Some scholars applied neural networks to the performance analysis of materials and established a tunnel risk evaluation model by combining fuzzy mathematics with a BP neural network [17]. In addition, the response factors of equipment were modelled to improve the accuracy of numerical simulation and the thermoforming process [18], and neural networks were applied to measurement and monitoring research: electromechanical impedance (EMI) technology and backpropagation neural networks (BPNNs) were used to monitor bolt looseness inside bolt-ball joints [19]. Some scholars have also gradually applied neural networks to biology [20], ecology, and medical chemistry [21]. The applications of neural networks in engineering construction that are most closely related to this study include project risk assessment [17,22], structural strength analysis, stability analysis [23], environmental safety and other factors [24], and early warning analysis of influencing factors [25]. The BP (backpropagation) neural network is a concept put forward in 1986 by a group of scientists led by Rumelhart and McClelland. It is a multilayer feedforward neural network trained with the error backpropagation algorithm and is the most widely used neural network at present. It is composed of a group of interconnected operation units, each of which has a corresponding weight. A BP neural network consists of three parts: the input layer, the middle hidden layer (one or more layers), and the output layer. The models in this paper are designed according to the principle of the BP neural network.
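The three-part structure and the error backpropagation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the network built in SPSS Modeler: it assumes sigmoid units, a squared-error loss, no bias terms, a single hidden layer, and a hypothetical learning rate, and all names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Input layer -> hidden layer -> single output unit."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One backpropagation update for the loss 0.5 * (output - target) ** 2."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output unit (derivative of loss times sigmoid slope).
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error back through the current output weights.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w_out, hidden)]
    new_w_out = [w - lr * delta_out * h for w, h in zip(w_out, hidden)]
    new_w_hidden = [
        [w - lr * d * xi for w, xi in zip(ws, x)]
        for ws, d in zip(w_hidden, delta_hidden)
    ]
    return new_w_hidden, new_w_out


# Hypothetical example: 3 inputs, 2 hidden units, 1 output, target value 1.0.
x = [1.0, 0.0, 1.0]
w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.1, -0.1]]
w_out = [0.2, -0.3]
for _ in range(50):
    w_hidden, w_out = train_step(x, 1.0, w_hidden, w_out)
```

Repeating the update moves the output toward the target, which is the sense in which the weights are "trained according to the error backpropagation algorithm".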
Although CART and NN have many applications in engineering construction, few studies have used them to predict and analyze the satisfaction of construction workers [26]. In addition, this study adds a comparison between the predicted values and the actual values to the CART and NN modelling process, which better reflects the prediction effect of the established models.

MATERIAL AND METHODS

Data Sources
The data of this study come from a one-to-one field questionnaire survey. A questionnaire of 22 questions on occupational health and safety training and the job satisfaction of construction site workers was designed (21 multiple-choice questions and 1 open-ended suggestion question). It was piloted with 12 construction workers, the answer options were revised, and the formal questionnaire was then finalized. The questionnaire was designed and administered from December 2018 to May 2019. To ensure representativeness, the survey covered four representative regions (provinces) in China, namely East China (Shandong Province), South China (Hainan Province), Central China (Hubei Province), and North China (Hebei Province). Across 10 construction projects, 299 workers were randomly interviewed face to face, yielding 280 valid questionnaires. The valid questionnaire data are divided into two parts: the first part, 239 cases, forms the training, verification, and test sets for CART and NN, and the second part, 41 cases, forms the prediction set. The analysis tools are IBM SPSS Statistics 23 [27] and IBM SPSS Modeler 18.0 [28].
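The division of the 280 valid questionnaires can be sketched as below. The text does not state how cases were assigned to the 239-case modelling part and the 41-case prediction set, so the random assignment, the fixed seed, and the function name are assumptions made only for illustration.

```python
import random


def split_cases(case_ids, n_first, seed=0):
    """Shuffle case identifiers and split them into two disjoint parts.

    The random shuffle and the seed are illustrative assumptions; the
    source only reports the sizes of the two parts.
    """
    ids = list(case_ids)
    if not 0 <= n_first <= len(ids):
        raise ValueError("n_first must lie between 0 and the number of cases")
    random.Random(seed).shuffle(ids)
    return ids[:n_first], ids[n_first:]


# 280 valid questionnaires -> 239 cases for modelling, 41 for the prediction set.
modelling_part, prediction_set = split_cases(range(1, 281), 239)
```

The same helper could be reused to divide the modelling part further into training, verification, and test sets.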

Molecular Descriptors
In this paper, 21 questions in the questionnaire are coded as 20 independent variables (X1-X20) and 1 dependent variable (Y). See Tab. 1 for the variable table.
For the convenience of analysis, the independent variables in the variable table are divided into three parts: the first part is X1-X7, the second part is X8-X13 (X8-X10, X11-X13), and the third part is X14-X20.

Tab. 1 (partial)
... another related training is more appropriate
X10  Have you ever witnessed an accident
X11  Cumulative working life in the construction industry
X12  Have you ever experienced an accident
X13  Do you have a vocational skill certificate
X14  Your gender
X15  Your age
X16  Your level of education
X17  Your marital status
X18  Have you worked in other industries
X19  Do you often pay attention to information
X20  plays a role in job responsibility awareness
Y    Your job satisfaction
*OHS: Occupational Health and Safety

In this study, descriptive statistics of the survey population are crucial to the final conclusion. Therefore, a quantitative statistical description of five variables of the survey population (X14, X15, X16, X17, X3) is given in Tab. 2. All 280 questionnaires are included in this table.

Statistical Analysis
The statistical and modelling analysis process is shown in Fig. 1. The statistical analysis data come from OHS (20190403).sav and OHS (0201test-PRE).sav. The modelling uses IBM SPSS Modeler 18.0 and is divided into two stages: model establishment and model prediction application. In each stage, CART and NN are analyzed separately. Finally, the CART and NN models are compared and optimized.

Data Set for Analysis
First, the 20 independent variables of the 239 survey cases are preliminarily modelled in IBM SPSS Modeler 18.0. The predictor importance results (see Tab. 3) show that 7 of the 10 main variables of CART (X1-X10) and NN (X1-X7, X11-X13) are the same (X1-X7), and the table also shows that the weights of the independent variables in CART and NN are consistent. Therefore, the later comparative analysis is carried out with 20, 13, 10, and 7 independent variables. Second, the relevant literature shows that Kappa analysis is well suited to analyzing the consistency of paired data. The focus of this paper is to compare the consistency between the predicted values and the actual values for the two modelling methods (CART and NN) and to judge each model's prediction by the size of its Kappa value.
According to the number of independent variables, this paper defines four categories, K20 (X1-X20), K13 (X1-X13), K10, and K7 (X1-X7), and analyzes and compares the Kappa values of each category (see Tab. 4). In the table, R stands for the CART predicted value, N for the NN predicted value, and SA for the actual value; the R*SA row gives the Kappa statistics (Value, Asymptotic Standardized Error, Approximate T, Approximate Significance). The P-value of N*SA is significant only in K13, so this paper attempts to further optimize the NN model under K13 (see the later part of Tab. 4). The NN model is compared across different proportions of the training, verification, and test sets. When the ratio of the training set to the test set is 9:1, the Kappa value between the predicted and actual values of the NN model (N*SA) is the highest (Value = .398, Approximate Significance = .011), and the P-value is significant (P < 0.05).
Finally, the prediction accuracy of the two models (see Tab. 5) shows that in the K13 category (with a 9:1 ratio for the NN model), the accuracy of CART is 76.15% and the accuracy of NN is 71.70%, so CART predicts more accurately than NN. In conclusion, for predicting the job satisfaction of construction workers, both the prediction consistency and the accuracy of the CART model are higher than those of the NN model. The two optimized prediction models are analyzed further below.
Tab. 4 shows that, among the four categories (K20, K13, K10, K7), the Kappa values of K13 for R*SA (.570), N*SA (.340), and R*N (.250) are the largest, and their P-values are significant (P < 0.05). The results for R*SA are identical in the first three categories (K20, K13, K10) (Value = .570, Asymptotic Standardized Error = .123, Approximate T = 3.815, Approximate Significance = .000), which shows that the performance of the CART modelling method is stable. The Kappa value of CART is greater than that of NN in all categories, so CART modelling gives better results than the NN method in analyzing the job satisfaction of construction workers.
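The Kappa comparison between predicted and actual values can be sketched with Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the marginal frequencies. The sketch below is a minimal illustration on hypothetical labels, not a reproduction of Tab. 4; it also omits the standard error and significance test that SPSS reports alongside the value.

```python
from collections import Counter


def cohen_kappa(predicted, actual):
    """Cohen's kappa between two equally long sequences of class labels."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("sequences must be non-empty and of equal length")
    # Observed agreement: share of cases where both labels coincide.
    observed = sum(p == a for p, a in zip(predicted, actual)) / n
    # Chance agreement from the marginal class frequencies of both sequences.
    pred_counts = Counter(predicted)
    actual_counts = Counter(actual)
    classes = set(pred_counts) | set(actual_counts)
    expected = sum(pred_counts[c] * actual_counts[c] for c in classes) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when only one class occurs
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: CART prediction (R) against the actual value (SA).
r_values = ["satisfied", "satisfied", "dissatisfied", "dissatisfied"]
sa_values = ["satisfied", "satisfied", "dissatisfied", "satisfied"]
kappa_r_sa = cohen_kappa(r_values, sa_values)
```

A larger value indicates agreement further above chance, which is how the R*SA and N*SA rows are compared in the text.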

Classification and Regression Tree (CART) and Neural Networks (NN)
Fig. 2 shows that the probability of variable X1 is 100%. Of the two branches below it, the branch leading to X3 (> 3 months) has a probability of 53.61%, higher than the 46.38% of the other branch leading to X2, indicating that more workers last participated in training more than 3 months ago. The colour map of the CART branches shows that the variables with the higher probability value at each layer of the two branches are X1-X3-X8-X5-(X5-2) and X1-X2-X3-X4-X9-(X9-2). X3 appears in both branches, indicating that X3 (job category) has a greater impact on job satisfaction, and woodworking workers account for a large proportion of all job categories. In the rightmost branch of the figure, the branch conditions of X3-X8 and X5-(X5-2) are "not received", indicating that the high-probability group of workers has received neither relevant training nor OHS special training. In the left branch X2-X3, the probability of working more than 3 days per week is high, and the value of X4-X9 is Good, indicating that the effectiveness of training is not very good, and that the teaching methods under X9-(X9-2), confirmatory video and apprenticeship, are good.
The Predictor Importance (weight) ranking in Tab. 6 shows that the top three variables in the CART column are X3, X2, and X1, and the top three in the NN column are X5, X4, and X9, while the high-probability variables in Fig. 2 are X1, X2, X3, X8, X4, X5, and X9. On this basis, this paper divides these variables into two groups, X3-X2-X1-X8 and X5-X4-X9: the main factors affecting job satisfaction are job category (X3), working days per week (X2), the latest training time (X1), and participation in relevant training (X8), and the secondary factors are receiving OHS special training (X5), training effectiveness (X4), and training method (X9). The NN model is shown in Fig. 3. With a 9:1 ratio of training set to test set, the network input layer has the 10 independent variables with the larger weights (X5, X4, X9, X2, X11, X1, X7, X6, X10, X3), the middle hidden layer has two units (N1, N2), and the output layer is Y. The weight of each input variable to the hidden layer is shown in Tab. 6.
The above analysis shows that many factors affect job satisfaction (X3, X2, X1, X8, X5, X4, X9). SPSS is used to run a Chi-Square test of independence. The Chi-Square values and P-values between variable Y and the other variables indicate that the relationships between the variables are significant (P < 0.05), as shown in Tab. 7.
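The Chi-Square test of independence compares the observed cell counts of a contingency table with the counts expected if the two variables were independent. The sketch below computes only the Pearson statistic and its degrees of freedom from a hypothetical table; the P-value is then read from the chi-square distribution, a step left to statistical software such as SPSS. The function name and the counts are illustrative assumptions, not values from Tab. 7.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    table: list of rows of observed counts; every row and column total
    is assumed to be greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table: rows are levels of one predictor, columns are levels of Y.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
```

A statistic that is large relative to its degrees of freedom corresponds to a small P-value, that is, to a significant relationship between the two variables.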

External Validation
Based on the CART and NN models established under K13, the 41 cases in the prediction set are used to produce comparison charts (see Fig. 4, Fig. 5, and Fig. 6). The curves of $N, $R, and SA coincide completely at data points 12-21 in Fig. 4; the curves of $R and SA coincide completely at data points 1-2, 5-9, 12-21, and 27-29 in Fig. 5; and the curves of $N and SA coincide completely at data points 12-21 in Fig. 6. At the other data points the curves clearly do not coincide. The predicted values of CART ($R) therefore coincide with SA at more data points than those of NN ($N), which means that the prediction effect of the CART model is better than that of the NN model.
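The coincidence check behind these charts can be sketched as follows: for each data point, the predicted value is compared with the actual value, and the coinciding points are listed and counted. The values below are hypothetical and do not reproduce the 41 cases of the prediction set; the function names are illustrative.

```python
def coincident_points(predicted, actual):
    """1-based indices of data points where the predicted and actual values coincide."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have equal length")
    return [
        index
        for index, (p, a) in enumerate(zip(predicted, actual), start=1)
        if p == a
    ]


def coincidence_rate(predicted, actual):
    """Share of data points where the prediction equals the actual value."""
    if not actual:
        raise ValueError("at least one data point is required")
    return len(coincident_points(predicted, actual)) / len(actual)


# Hypothetical coded satisfaction levels for four data points.
sa = [1, 2, 2, 3]        # actual values (SA)
r_pred = [1, 2, 3, 3]    # CART predictions ($R)
n_pred = [2, 2, 3, 1]    # NN predictions ($N)
r_points = coincident_points(r_pred, sa)
n_points = coincident_points(n_pred, sa)
```

Comparing the two lists of coinciding points gives the same kind of evidence as overlaying the curves in the figures.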

DISCUSSION
To establish a suitable prediction model of the job satisfaction of construction workers, this paper makes a comparative analysis of the CART and NN models. The research data come from a face-to-face survey of field workers, so the reliability and validity of the data source are high. Most of the questions in the questionnaire are categorical variables and only a few are continuous variables, which differs from the continuous-variable analyses in the previous literature. For instance, one study compared the CART and NN models and found that the NN model performed better in predicting the costs of colorectal cancer patients [26]. However, in that study and in others [29,30], there is little comparison of Kappa values against actual-value data.
The main innovations of this study are as follows. First, an actual value set of empirical data is added to the usual training, verification, test, and prediction sets, and conclusions are drawn by comparing the Kappa value between the predicted and actual values; evaluating the modelling effect through this Kappa analysis is one of the main innovations of this paper. Second, the study finds that the CART model based on categorical variables has a better Kappa value and accuracy, and that the main factors affecting job satisfaction are job category, training time, and working days per week. Finally, the CART model is used to find the hierarchical relationships among the main influencing factors and to reveal the weights of the sub-items, which helps guide the management of construction workers in practice.
Regarding the main influencing factors of construction workers' job satisfaction, job category has the greatest impact (importance value 0.2699, see Tab. 6). According to Fig. 2, woodworking and similar jobs account for 51.40%. It is therefore recommended to use classified management to strengthen the post management of woodworking and similar jobs, in order to improve overall job satisfaction and ensure the safety and efficiency of engineering production. Second, the importance value of working days per week is 0.2026 (see Tab. 6), and Fig. 2 shows that workers who work more than 3 days per week have the greatest impact on satisfaction. Correlation analysis shows that the longer the working hours per week, the lower the satisfaction. Working days per week should therefore be arranged reasonably, taking into account the actual situation of construction enterprises and posts. Finally, the latest training time has a great impact on job satisfaction. Occupational health and safety training in construction enterprises covers not only the routine learning and training of daily management but also regular learning and training on the "four new" items, namely new technology, new materials, new processes, and new equipment, so the training frequency should be increased. It is suggested that the training interval be 3-6 months.
In addition, this study also tried dimension reduction by stepwise regression, but the classification accuracy of the models did not improve. A 2019 study [1] found that stepwise regression can reduce the dimension of the variables and improve the classification accuracy of almost all data mining technologies. The two conclusions are inconsistent, indicating that this dimension reduction method has limitations in variable selection. Therefore, it is not generally applicable to CART and NN models.
In this study, the CART and NN models are selected for prediction and comparative analysis. Many other prediction models exist, and the support vector machine and the Bayesian network in particular are widely used, so a comparative analysis with these models will be added in a follow-up study. In addition, performance indicators such as precision, recall, and F1-score will be added to the model analysis.
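For reference, the three indicators named above can be computed for one class of the dependent variable as in the following minimal sketch. The labels are hypothetical, the function name is illustrative, and treating one satisfaction level as the positive class is an assumption about how a multi-class outcome would be evaluated.

```python
def precision_recall_f1(predicted, actual, positive):
    """Precision, recall, and F1-score for one class treated as positive."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have equal length")
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # true positives
    fp = sum(p == positive and a != positive for p, a in pairs)  # false positives
    fn = sum(p != positive and a == positive for p, a in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions and actual values for four workers.
predicted = ["satisfied", "satisfied", "dissatisfied", "dissatisfied"]
actual = ["satisfied", "dissatisfied", "satisfied", "dissatisfied"]
precision, recall, f1 = precision_recall_f1(predicted, actual, "satisfied")
```

Unlike overall accuracy, these indicators show how well each satisfaction level is recognized, which matters when the classes are unevenly represented.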
The CART model established in this paper can be used not only to predict and evaluate the job satisfaction of workers on site but also to explore the main influencing variables and their hierarchical relationships. The variable weights output by the prediction model are directly related to the scale design and the number of input variables; that is, the subjective design choices of different researchers will affect the prediction effect of the model. In future work, the author will explore and verify the prediction effect and applicability of the CART model in other industries.
This study is based on the specific setting of the construction site, but working environments around the world vary, and no two settings of construction workers or industrial workers are the same. This limitation needs to be kept in mind: it cannot be claimed that the CART model will always provide better results regardless of the research environment, especially for a specific study.

CONCLUSION
In predicting the job satisfaction of construction workers with categorical variables, the CART model outperforms the NN model in terms of both Kappa value and accuracy. The main occupational health and safety factors influencing the job satisfaction of construction site workers are job category (X3), working days per week (X2), and the latest training time (X1).
This paper suggests that, in occupational health and safety training for construction workers, managers should organize training by post category, increase the training frequency, and arrange weekly working days reasonably, so as to improve the job satisfaction of construction workers.