Movie Score Prediction Model Based on Multiple Nonlinear Regression

Abstract: In the movie industry, the ability to predict a movie's score before its theatrical release can decrease its financial risk. However, accurate predictions are not easily obtained. To improve the accuracy and scientific rigor of movie score prediction, this paper proposes a multiple nonlinear regression movie score prediction model (MSPM) in exponential form. Firstly, the factors influencing film scores are analyzed. Existing prediction models suffer from relying on a single type of variable. This model combines the metadata variables of the film itself with the characteristic variables of film members to conduct quantitative and qualitative analysis of the factors affecting film scores. Secondly, MSPM is established and the concept of index (exponent) is introduced. To avoid redundancy of explanatory variables in the MSPM, the AIC values of the MSPM and its five sub-models are also calculated to confirm that each selected explanatory variable is necessary. The Douban data set is used to predict movie scores. Finally, the actual movie scores are compared with the values predicted by MSPM, a linear regression model ($Y_s$) and an equal scale model ($Y_M$). The results show that MSPM has the best predictive performance. Experiments show that the model is effective and robust, and that it reveals the relationship between film scores and related variables. Real-world data confirm that the MSPM is a timely and appropriate framework for measuring movie scores.


INTRODUCTION
"When Robbers Meet the Monster", a 200-million-yuan blockbuster starring Louis Koo, was released in 2019, but the result surprised everyone: it scored only 3.5 points on Douban. "Better Days", directed by young director Zeng Guoxiang, grossed more than 800 million yuan in five days and scored 8.3 points on Douban. Film marketing executives are often overwhelmed by such results [1]. A related question arises: "What are movie scores related to?" Social media data may help predict movie scores [2]. The rapid development of big data technology provides a scientific basis for the prediction of online movie scores and box office [3].
In previous studies, many researchers have used movie-related variables with methods such as multiple linear regression to predict movie scores. Palomba [4] mined information about movies and movie consumers to build a movie consumption prediction model. Zhou [5] extracted information from film posters to enrich the variables available for film prediction. Ivasic [6] combined the features of movie posters with other film-related data to provide more information for predicting box office revenue. However, feature extraction in these methods is usually done manually, generalizes poorly, and relies on prior knowledge. In summary, the variables used in the above studies lack diversity, and the relationship between movie-related variables and movie ratings is not completely linear. Because there are no objective indicators for measuring a film's score, the choice of relevant variables is often controversial. It is therefore of great value to establish a quantitative model for evaluating a film, whether for the market or for the general public. In addition, some studies apply machine learning algorithms. Wang [7] used a multi-layer perceptron neural network to establish a film prediction model. Ghiassi [8] proposed a model based on a dynamic artificial neural network (DAN2) for film revenue prediction during production. Hur [9] used three machine learning algorithms to predict movie scores. Lucey [10] built models to predict film scores by analyzing the body and facial expressions of the audience while watching the movie. Jiang [11] collected data on 1266 online movies from 2013 to 2015 and divided them into three categories (high, medium and low) by view count; after feature selection and feature creation, an ordered support vector machine (SVM) model was used to predict movies. However, such approaches are sometimes not very efficient.
At home and abroad, many research institutions and researchers have devoted considerable effort to this problem [5], but accurate prediction is not easy to obtain [12]. There are two main reasons: (1) insufficient variable diversity [13][14][15][16][17]; (2) overly simple prediction algorithms [18].
Therefore, this paper takes the Douban website as the research object, combines the metadata variables of the film itself with the characteristic variables of the film staff, and constructs a nonlinear exponential model, MSPM, to predict the popularity of a film. Verification on the data set and comparison with other models show that MSPM predicts better, so it can provide scientific guidance for film investors and cinemas [19][20][21].

Data Sources and Preprocessing
This paper selects a total of 9220 movies released from 2000 to 2019 from the movie platform of the Douban website. During data cleaning and preprocessing, we deleted duplicate records and records with missing data, and finally obtained 5062 valid records covering 2000 to 2019.
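The cleaning step described above (removing duplicate and incomplete records) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout and the field names (`title`, `year`, `duration`, `raters`, `score`) are hypothetical, and treating two records with the same title and year as duplicates is an assumption, since the source does not state its duplicate criterion.

```python
# Hypothetical field names; the source does not describe its record layout.
REQUIRED_FIELDS = ("title", "year", "duration", "raters", "score")


def clean_records(records):
    """Drop records with missing required fields, then drop duplicates.

    Two records count as duplicates when they share title and year
    (an assumption made for this sketch).
    """
    seen = set()
    cleaned = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue  # record with missing data
        key = (rec["title"], rec["year"])
        if key in seen:
            continue  # repeated record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Made-up records for illustration only.
raw = [
    {"title": "A", "year": 2019, "duration": 120, "raters": 1000, "score": 7.5},
    {"title": "A", "year": 2019, "duration": 120, "raters": 1000, "score": 7.5},
    {"title": "B", "year": 2018, "duration": None, "raters": 500, "score": 6.1},
    {"title": "C", "year": 2017, "duration": 95, "raters": 200, "score": 8.0},
]
print(len(raw), "raw records ->", len(clean_records(raw)), "valid records")
```

The first occurrence of a duplicated record is kept, so the order of the remaining records matches the order of the raw data.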
We considered two groups of factors when selecting 10 movie-related variables from the Douban website. The first group consists of metadata variables directly related to the movie itself, such as title, year, genre, country, release time and movie duration. The second group consists of characteristic variables related to the movie's members, such as the popularity of the director, writer and leading actors.
Because people's life preferences are diverse, they also differ in the types of movies they favor, as shown in Fig. 1. On the whole, drama movies are the most numerous, followed by comedy movies and then horror movies. We took the movie score as the indicator of a movie's popularity and as the explained variable. The relevant influencing factors, namely the movie duration, the number of raters, the launch time, the director, the writer and the actors, are taken as explanatory variables. The names and symbols of the variables are shown in Tab. 1.

Table 1 Explanatory Variables and Symbols
Category                 Name               Symbol
Explanatory variables    duration           $x_1$
                         number of raters   $x_2$
                         launch time        $x_3$
                         director           $x_4$
                         writer             $x_5$
                         actor              $x_6$
Explained variable       score              $y$

Construction of Model
Regression Analysis for Single Factor
According to the above analysis, six variables affect the movie score: the movie duration, the number of raters, the movie launch time, the director, the writer and the actors. We first studied the influence of each single indicator on the score, and hypothesized a simple (univariate) linear relationship between each indicator $x_i$ and the score $y$, where $y$ represents the movie score and $x_i$ ($i = 1, \dots, 6$) represents each of the six variables. The fitting results and t-tests of the variables are shown in Tab. 2. The coefficient of determination $R^2$ for the director and for the writer against the movie score is below 0.3, and the $R^2$ for each of the other indicators is below 0.1. Therefore, the influence of any single explanatory variable on the score is not very pronounced, and several explanatory variables should be combined to explain the change in the explained variable, the score $y$. During fitting, it was also found that the movie score is negatively correlated with the launch time; in the following study, we therefore use the reciprocal $1/x_3$ of the launch time to model its relationship with the movie score.
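A single-factor fit of the kind described above can be sketched in a few lines. The code below is only an illustration: it assumes the usual least-squares form with an intercept and a slope (the paper's exact equation did not survive in this copy), and the data are made up.

```python
def simple_linear_fit(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r2), where r2 is the coefficient of determination.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Made-up data for illustration: duration in minutes against score.
duration = [90, 100, 110, 120, 130, 150]
score = [6.8, 6.5, 7.1, 6.9, 7.4, 7.2]
intercept, slope, r2 = simple_linear_fit(duration, score)
print(f"intercept={intercept:.3f} slope={slope:.4f} R^2={r2:.3f}")
```

Running the same function once per explanatory variable gives one $R^2$ per variable, which is the quantity the text compares against the thresholds 0.3 and 0.1.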

Movie Score Prediction Model-MSPM
Dietz [22] established the STIRPAT (Stochastic Impacts by Regression on Population, Affluence and Technology) model and introduced the concept of index (exponent) when studying carbon dioxide emission assessment models. The model is expressed as:
$$I = a P^{b} A^{c} T^{d} \varepsilon$$
where $I$ is the environmental impact, $P$, $A$ and $T$ are population, affluence and technology, $a$ is the constant term of the ratio, $b$, $c$ and $d$ are the exponents of the variables (their coefficients in the logarithmic form), and $\varepsilon$ represents the random error. The logarithmic form of this model is more commonly used in actual analysis, namely:
$$\ln I = \ln a + b \ln P + c \ln A + d \ln T + \ln \varepsilon$$
This paper therefore adopts the concept of index from the construction principle of the STIRPAT model and extends the model: the original three explanatory variables are increased to six, namely movie duration, number of raters, movie launch time, director, writer and actor, each of which influences the movie score differently.
Table 3 Pearson Correlation Coefficient Matrix for the 6 Variables
Tab. 3 shows that $x_4$ and $x_5$ have the largest correlation among the variables, and even this correlation is weak. The linear relationship between any two of the six explanatory variables is therefore weak, so the six explanatory variables can be considered independent of each other.
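A Pearson correlation matrix like the one summarized in Tab. 3 can be computed as sketched below. The column names and values are made up for illustration; only the formula (covariance divided by the product of standard deviations) is standard.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up popularity values for three of the six variables.
columns = {
    "director": [7.1, 6.4, 8.0, 7.5, 6.9],
    "writer": [6.8, 6.9, 7.7, 7.0, 7.2],
    "actor": [7.4, 6.1, 6.6, 7.9, 7.0],
}
for name, row in correlation_matrix(columns).items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Weak off-diagonal values support treating the explanatory variables as approximately independent, although a small pairwise correlation does not by itself rule out nonlinear dependence.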
Therefore, we propose a multiple nonlinear regression model in exponential form, MSPM. The model is expressed as:
$$y = a\, x_1^{a_1} x_2^{a_2} \left(\frac{1}{x_3}\right)^{a_3} x_4^{a_4} x_5^{a_5} x_6^{a_6}\, \varepsilon$$
where $x_1$ represents the movie duration, $x_2$ represents the number of raters, $1/x_3$ is the reciprocal of the launch time, $x_4$ represents the popularity of the director, $x_5$ represents the popularity of the writer, and $x_6$ represents the popularity of the actors. $a$ is a constant, $a_i$ ($i = 1, \dots, 6$) is the exponent of the corresponding explanatory variable, and $\varepsilon$ represents the random error of the model. In the actual regression fitting and analysis, the model is converted into logarithmic form, which is linear in the parameters, namely:
$$\ln y = a_0 + a_1 \ln x_1 + a_2 \ln x_2 + a_3 \ln \frac{1}{x_3} + a_4 \ln x_4 + a_5 \ln x_5 + a_6 \ln x_6$$
where $a_0 = \ln a + \ln \varepsilon$ represents the sum of the logarithm of the constant term and the logarithm of the error term.
Technical Gazette 28, 3(2021), 914-921
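Because the logarithmic form is linear in $a_0, a_1, \dots, a_6$, it can be fitted by ordinary least squares. The sketch below is one way to do this with the standard library only; it is not the authors' implementation. The helper names (`solve_linear_system`, `ols`, `mspm_design_row`, `fit_mspm`) are hypothetical, the normal-equations solver is chosen for brevity rather than numerical robustness, and the data are synthetic and noise-free, generated from made-up parameters so that the fit can be checked against them.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def mspm_design_row(x):
    """Map (x1, ..., x6) to [1, ln x1, ln x2, ln(1/x3), ln x4, ln x5, ln x6]."""
    x1, x2, x3, x4, x5, x6 = x
    return [1.0, math.log(x1), math.log(x2), math.log(1.0 / x3),
            math.log(x4), math.log(x5), math.log(x6)]


def fit_mspm(samples, scores):
    """Fit the logarithmic form; returns [a0, a1, ..., a6]."""
    X = [mspm_design_row(x) for x in samples]
    y = [math.log(s) for s in scores]
    return ols(X, y)


# Synthetic, noise-free data generated from known (made-up) parameters.
random.seed(0)
true_params = [math.log(2.0), 0.05, 0.03, 0.10, 0.20, 0.15, 0.10]  # a0, a1..a6
samples = [
    (random.uniform(80, 150), random.uniform(1e3, 1e5), random.uniform(1, 20),
     random.uniform(5, 9), random.uniform(5, 9), random.uniform(5, 9))
    for _ in range(200)
]
scores = [
    math.exp(sum(p * v for p, v in zip(true_params, mspm_design_row(x))))
    for x in samples
]
estimated = fit_mspm(samples, scores)
print([round(v, 4) for v in estimated])
```

All inputs must be strictly positive for the logarithms to exist, so records with a zero or missing value in any variable would need to be handled before fitting.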

Test of Model
Fitting Goodness Test of MSPM
Next, the movie data from 2000 to 2019 were substituted into the logarithmic form of the MSPM, and regression fitting gave a goodness of fit of $R^2 = 0.2848$.
The adjusted coefficient of determination $\bar{R}^2$ was also obtained. This indicates that the obtained model has a relatively high goodness of fit and can better relate the six variables to the score.
Meanwhile, the p-value of the F-test in the regression results of this model is 0.000, which is less than 0.05, indicating that the regression coefficient of at least one explanatory variable is not 0. The p-value of the t-test for each explanatory variable is shown in Tab. 4. The p-values of the t-tests of the six explanatory variables are all less than 0.01, so the corresponding coefficients differ significantly from 0. All six explanatory variables pass the t-test at the 1% significance level.
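The adjusted coefficient of determination and the overall F statistic both follow from $R^2$, the sample size $n$ and the number of explanatory variables $k$. The sketch below uses the standard textbook formulas. It takes $R^2 = 0.2848$ and $k = 6$ from the text and assumes $n = 5062$ (the number of valid records); the source does not state the sample size actually used in the regression, so the printed numbers are illustrative and are not the paper's reported values.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k explanatory variables and an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic testing that all k slope coefficients are zero."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


r2 = 0.2848  # reported in the text
k = 6        # six explanatory variables
n = 5062     # assumption: every valid record enters the regression

print(f"adjusted R^2 = {adjusted_r2(r2, n, k):.4f}")
print(f"F statistic  = {f_statistic(r2, n, k):.1f}")
```

With thousands of observations and only six variables, the adjustment changes $R^2$ very little, and even a moderate $R^2$ yields a large F statistic, which is consistent with the very small p-value reported for the F-test.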

Akaike Information Criterion (AIC) Test of MSPM
To avoid redundancy of explanatory variables in the MSPM and to verify that all six explanatory variables should be included, we also calculated the AIC values of the MSPM and its sub-models.
Next, we used subsets of the explanatory variables of the MSPM to establish six nonlinear models, which are special forms (sub-models) of the MSPM.
Then, regression fitting was carried out for the MSPM and the six sub-models, and the goodness of fit and the ΔAIC and ΔBIC values of each model were computed for comparative analysis. The results for the MSPM and the sub-models are shown in Tab. 5.
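The comparison above can be sketched with the least-squares form of the two criteria. The code assumes Gaussian errors, drops additive constants that are identical across models, and defines Δ as the difference from the smallest value among the candidates; the source does not state its exact definitions, so these are assumptions. The residual sums of squares and the model names are made up.

```python
import math


def aic_bic(rss, n, n_params):
    """AIC and BIC of a least-squares fit (Gaussian errors, constants dropped)."""
    base = n * math.log(rss / n)
    return base + 2 * n_params, base + n_params * math.log(n)


def deltas(values):
    """Difference of each model's criterion from the smallest one."""
    best = min(values.values())
    return {name: value - best for name, value in values.items()}


# Made-up residual sums of squares; n_params counts the intercept as well.
n = 5062  # assumption: every valid record is used for each model
models = {
    "MSPM (6 variables)": (310.0, 7),
    "sub-model A": (325.0, 6),
    "sub-model B": (340.0, 6),
}
aic = {name: aic_bic(rss, n, k)[0] for name, (rss, k) in models.items()}
bic = {name: aic_bic(rss, n, k)[1] for name, (rss, k) in models.items()}
delta_aic, delta_bic = deltas(aic), deltas(bic)
for name in models:
    print(f"{name}: dAIC={delta_aic[name]:.2f} dBIC={delta_bic[name]:.2f}")
```

A model with Δ equal to 0 is the preferred one under that criterion. If the full MSPM attains it, dropping any variable costs more in fit than it saves in complexity, which is the sense in which no explanatory variable is redundant.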

EMPIRICAL ANALYSIS
In this paper, the prediction of movie score is based on the following assumptions.
1. Score on Douban movie website reflects the movie quality. The higher the score, the better the movie.
2. For most people, good movies are preferable to bad ones.

Movie Score
On the whole, the distribution of movie scores is unimodal: the number of movies first increases and then decreases as the score rises, which is in line with the general law of nature. There are few movies with scores below 5, while many movies score between 6 and 7.5. After the peak, the number of movies scoring above 8 declines rapidly, which means that both excellent and bad movies are relatively few, as shown in Fig. 2.

Relationship between Movie Score and Other Variables
Movie score is an important indicator of the quality of a movie in the movie and TV industry. A movie with a high score attracts more viewers, thus promoting market supply. Movie score is in turn influenced by many factors, such as duration, the number of raters, launch time and the popularity of the director, writer and actors. Therefore, this paper analyzes in depth how movie scores differ along these dimensions.

Movie Score and Duration
To study the relationship between movie score and duration, scatter plots of movie score and duration are drawn, with the result shown in Fig. 3. The scatter plot of average score and duration is shown in Fig. 4.

Figure 3 Analysis of Movie Score and Duration
Most movies last about 80-150 min, and their scores span a wide range, with the overall distribution between 4.5 and 9.5 points. The scores are most concentrated around 7 to 7.5 points, which reflects the general characteristics of movie duration. It can be seen from Fig. 4 that the average score gradually increases for durations of 0-50 min. Beyond 50 min, the score first decreases as duration increases, reaches its minimum at about 100 min, and then increases again with duration.

Movie Score and the Number of Raters
To study the relationship between movie score and the number of raters, we made a scatter plot of the two elements, as shown in Fig. 5.

Figure 5 Analysis of Average Score and People
It can be seen from Fig. 5 that there is a positive correlation between the number of raters and the score: the more raters a movie has, the more likely it is to have a high score. A large number of raters also reflects a movie's high popularity. On the whole, the number of raters for most movies is under 100,000. The number of raters is an important indicator for users choosing which movies to watch and reflects the excellence of a movie.

Movie Score and Date
To study the relationship between movie score and the date, we made a scatter plot of movie score and date, and that of average movie score and the date. It can be seen from Fig. 6 that the overall movie score shows a downward trend. Specifically, it can be seen from Fig. 7 that the movie scores from 2010 to 2019 are significantly lower than those from 2000 to 2009. But this does not mean that the movie quality from 2010 to 2019 must be lower than that of previous movies. With the continuous progress of the times, the shooting conditions and transmission channels have been changed substantially.
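The average-score-by-date view described above amounts to grouping movies by year and averaging their scores. A minimal sketch follows; the `(year, score)` pairs are made up, and grouping by calendar year is an assumption about how the date was binned.

```python
from collections import defaultdict


def average_score_by_year(records):
    """Return {year: mean score}, with years in ascending order."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [sum of scores, count]
    for year, score in records:
        totals[year][0] += score
        totals[year][1] += 1
    return {year: s / c for year, (s, c) in sorted(totals.items())}


# Made-up (year, score) pairs for illustration only.
sample = [(2005, 7.8), (2005, 7.2), (2015, 6.4), (2015, 6.9), (2019, 6.1)]
for year, mean in average_score_by_year(sample).items():
    print(year, round(mean, 2))
```

The same grouping applies to the other average-score plots in this section by replacing the year with a binned value of the variable of interest, for example duration rounded to the nearest 10 minutes.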

Movie Score and Director
To study the relationship between movie score and director, we made a scatter plot of movie score against the director characteristic variable, and one of average movie score against the director characteristic variable. As shown in Fig. 8, the director variable is distributed from 6 to 9 points. It can be seen from Fig. 8 and Fig. 9 that the director can greatly influence the movie score, which increases with the popularity of the director. The director is like the supreme commander of an army: the quality of a movie largely depends on the director's ability and cultivation. The style of a movie often reflects the artistic style and character of the director, and furthermore manifests the director's values. Therefore, a good director has a strong influence on a movie.

Movie Score and Writer
To study the relationship between movie score and the writer, we made a scatter plot of movie score against the writer characteristic variable, and one of average movie score against the writer characteristic variable. It can be seen from Fig. 8 and Fig. 10 that the influences of the writer and the director on movie scores are similar: the higher the level of the writer and director, the higher the movie score. The positioning of a movie and the idea of its script determine the height it can reach during directing, in post-production and in the finished movie.

Movie Score and Popularity of Actors
An excellent movie cannot be separated from the performance of excellent actors, so we next study the relationship between movie score and actors. It can be seen from Fig. 12 that the influence of actors on movie score is polarized. According to Fig. 12 and Fig. 13, an actor of average level may well appear in a movie with a high score, and a good actor may well appear in a bad movie.
In the linear regression model, $Y_s$ is the movie score, $E$ is the coefficient of an explanatory variable, and $\varepsilon$ is the error term. The $Y_s$ model is a multiple linear regression of the movie score on the six explanatory variables. In the equal scale model, the two sides of the equation change in equal proportion; comparing it with the MSPM constructed in this paper further confirms that the influence of different factors on the popularity of movies is not equal. In this model, $Y_M$ is the movie score. It is transformed into a linear relation by mathematical transformation.

Comparative Analysis of Multiple Models
The data from 2000 to 2019 were substituted into the MSPM, the $Y_s$ model and the $Y_M$ model respectively. The goodness of fit and the F-test results of the three models are shown in Tab. 6.

Predictability of MSPM
To evaluate the robustness of MSPM, we substituted training data into the three models (MSPM, $Y_M$ and $Y_s$) to test the predictive ability of MSPM. The observed data of each of the 20 years from 2000 to 2019 were used in turn as training data to fit the equation, giving the fitted parameters $a_1, a_2, a_3, a_4, a_5, a_6$ of the MSPM under the training sample set of each year. The variation trends of $a_i$ ($i = 1, 2, 3, 4, 5, 6$) and $a_0$, where $a_0 = \ln a + \ln \varepsilon$, are shown in Fig. 14 and Fig. 15. It can be seen from Fig. 14 that the coefficients $a_i$ fluctuate gently over time, and Fig. 15 shows that $a_0$ fluctuates around $a_0 = 0$.
Therefore, we used the observation data of 2019 to fit the MSPM, and then substituted the fitted values of $a_0$ and $a_i$ ($i = 1, 2, 3, 4, 5, 6$) into the equation. The actual score $Y$ of each movie was then compared with the predicted value $Y_E$. Fig. 16 is the scatter plot of the actual score $Y$ against the predicted value $Y_E$ based on the MSPM, and Fig. 17 is the scatter plot of the difference between the predicted and actual movie scores in 2019. The red points are the data points for five highly popular movies: Better Days, The Wandering Earth, Captain Marvel, Frozen 2 and X-Men: Dark Phoenix. Fig. 17 shows that most of the data points are distributed near the line $Y = Y_E$, which means that the predicted values are consistent with the observed data. The red star points, which represent the actual and predicted values of the five popular movies, are also distributed near the line $Y = Y_E$.
Fig. 18 and Fig. 19 are the corresponding scatter plot and difference plot for one of the comparison models. Although the points in Fig. 18 are uniformly distributed around the reference line, the differences in Fig. 19 are mostly above 0, unlike those in Fig. 17. This reveals that the prediction accuracy of MSPM is higher than that of this model.
Most of the points in Fig. 20 lie above the reference line, indicating that the predicted value of the other comparison model is generally greater than the actual movie score. As shown in Fig. 21, the difference is between -5 and -10, which is a large error. Therefore, the predictive power of MSPM is better. To sum up, MSPM has the most accurate prediction ability of the three models.
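The comparison of actual and predicted scores can be sketched as follows. The fitted parameters and the movies are hypothetical, and the difference is taken as actual minus predicted, which is inferred from the text (a model that over-predicts is reported with negative differences) rather than stated explicitly.

```python
import math


def predict_mspm(params, x):
    """Predicted score Y_E = exp(a0 + sum_i a_i * ln(t_i)), where t_3 = 1/x3."""
    a0, exponents = params[0], params[1:]
    x1, x2, x3, x4, x5, x6 = x
    terms = [x1, x2, 1.0 / x3, x4, x5, x6]
    return math.exp(a0 + sum(a * math.log(t) for a, t in zip(exponents, terms)))


def differences(actual, predicted):
    """Actual minus predicted score for each movie."""
    return [y - ye for y, ye in zip(actual, predicted)]


# Hypothetical fitted parameters [a0, a1, ..., a6] and made-up movies
# given as (duration, raters, launch time, director, writer, actor).
params = [0.1, 0.05, 0.02, 0.1, 0.3, 0.3, 0.2]
movies = [(120, 50000, 1, 7.5, 7.0, 7.2), (95, 8000, 1, 6.0, 6.5, 6.1)]
actual = [7.9, 6.2]
predicted = [predict_mspm(params, m) for m in movies]
for y, ye, d in zip(actual, predicted, differences(actual, predicted)):
    print(f"actual={y:.2f} predicted={ye:.2f} difference={d:+.2f}")
```

Points with a difference close to 0 lie near the line $Y = Y_E$ in a scatter plot of actual against predicted scores.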

CONCLUSIONS AND FUTURE WORK
The prediction can help audiences choose a movie according to its online score, serve as a basis for movie theaters to plan screenings, and provide direction and reference for peripheral industries. Building on previous studies, this paper explores the factors that influence movie score and summarizes the relationship between movie score and three characteristic factors, namely the director, the writer and the actors. It also proposes MSPM, a prediction model for online movie scores based on Douban data. Compared with the other models considered, the constructed MSPM has a better prediction effect. MSPM may also be used to evaluate performance and observed values in other fields where the relevant variables can be measured independently. For example, in scientific research, the reputation of some scholars is closely related to their publication records and research findings. In follow-up work, experiments are expected to draw on more movie score platforms to continuously improve the adaptability and prediction accuracy of the model.