Skip to the main content

Preliminary communication

https://doi.org/10.31298/sl.148.3-4.5

LineFit.xls: a Microsoft Excel template for fitting 11 regression models to Y-X data

Kyriaki Kitikidou orcid id orcid.org/0000-0003-3198-9387 ; Democritus University, Department of Forestry and Management of the Environment and Natural Resources, 68200, Orestiada, Greece
Elias Milios ; Democritus University, Department of Forestry and Management of the Environment and Natural Resources, 68200, Orestiada, Greece


Full text: english pdf 789 Kb

page 163-169

downloads: 292

cite

Download JATS file


Abstract

An essential challenge in any environmental field (forestry, agriculture, etc.), which places an emphasis on data analysis for the purpose of decision making and problem solving, is the estimation of a dependent environmental variable (Y) through an independent one (X). In this work, a Microsoft Excel template is proposed for assessing a set of eleven popular regression Y-X models. Any researcher can use LineFit.xls as a modeling tool for assessing these eleven regression models and selecting the one that best fits their data by running tests on all regression assumptions and comparing models using the most common fitting comparison criteria. Microsoft Excel, being a widely used and user-friendly program, makes it easy to update, expand, and personalize the tests to meet specific needs.

Keywords

data fitting,data modeling,fit comparasion, MS Office software

Hrčak ID:

315488

URI

https://hrcak.srce.hr/315488

Publication date:

26.3.2024.

Article data in other languages: croatian

Visits: 1.130 *




Introduction

Uvod

All environmental sciences rely heavily on methods that estimate the value of a dependent variable (Y) using a value for a related independent variable (X).

For instance, Freese (1964) noted that in the field of forestry, the volume of a tree (Y) may be described as a function of its breast height diameter (X), the strength of wood (Y) as a function of its specific gravity (X), and the cost of logging (Y) as a function of its proximity to hard-surfaced roads (X).

In agriculture, for example, soil organic matter (Y) may be expressed as a function of time (X), net photosynthesis (Y) can be related to irradiance (X), and respiration (Y) can be described as a function of temperature (X) (Archontoulis and Miguez, 2015).

Environmentalists might greatly benefit from a modeling tool that allows them to compare and assess a set of common regression models using the most frequently used fitting comparison criteria and conduct tests on all regression assumptions to see which one best fits their data.

Given Excel's popularity and ease of use, it would be an excellent option to provide a template for assessing a collection of such regression models.

Materials and Methods

Materijali i metode

Within the LineFit.xls template, we suggest conducting tests on 11 different regression models (SPSS, 2007), as shown in Table 1

Table 1. Models assessed with the LineFit template

Tablica 1. Modeli ocijenjeni pomoću predloška LineFit

image1.png

Linear regression is a useful statistical approach for estimating the value of a dependent variable (Y) through an independent variable (X). However, the following five assumptions must be satisfied before we are able to assess regression models' fit to our data:

1. Linearity: bi regression coefficients should be statistically significantly different from zero.

2. Independence: Residuals should be independent, non-correlated.

3. Homoscedasticity: Residuals should have constant (homogeneous) variance across all Xi values.

4. Zero error: Residuals should have zero average.

5. Normality: Residuals should approximate a normal distribution.

Table 2 summarizes the five aforementioned assumptions along with the relevant literature.

Table 3 provides three statistical comparison criteria and corresponding literature, for selecting the best fitted regression model for a given set of data

Table 2. Regression assumptions examined with the LineFit template

Tablica 2. Pretpostavke regresije ispitane u skladu s predloškom LineFit

image2.png

Table 3. Statistical criteria for the comparison of regression models calculated with the LineFit template

Tablica 3. Statistički kriteriji za usporedbu regresijskih modela izračunanih pomoću predloška LineFit

image3.png

Results

Rezultati

LineFit.xls includes 13 spreadsheets. In the “data” spreadsheet, one can enter their own Y-X data in columns A and B, respectively (highlighted in blue). The amount of data can be up to 65535 entries (pairs of Y-X data). Summary statistics (count, mean, standard deviation, min, and max values) are given in cells D1:I4.

For demonstration purposes, columns A and B in the “data” spreadsheet are filled with a random number generator. By pressing the F9 key, one can observe all changes throughout the template.

In the spreadsheets from “1” up to “11”, the statistics of Tables 2 and 3 are calculated, for each of the 11 regression models of Table 1. The linearity test is calculated across cells C5:F5, the independence test in cell M2, the homoscedasticity test in cell AO2, the zero error test in cell AM5, and the normality test in cell AM8. When the regression assumption is not satisfied, the font in these cells turns red. In addition, the graph of the residuals’ distribution and the Y-X scatterplot, along with the fitted line, are given for each of the 11 regression models.

An example of these graphs produced from volume (V, m3) – diameter at breast height (DBH, cm) data (Y-X data) (Softa, 2023) is given in Table 4.

Table 4. Graphs produced with the LineFit template from sample Y-X data

Tablica 4. Grafikoni proizvedeni pomoću predloška LineFit iz uzorka podataka Y-X

image4.pngimage5.pngimage6.png

Regarding the statistical criteria for the comparison of the 11 regression models, the coefficient of determination R2 is calculated in cell J2, the Standard Error of the Estimate (SEE) in cell K2, and the Root of the Mean Squared Error (RMSE) in cell L2 in the corresponding spreadsheet for each of the 11 regression models.

The regression assumptions listed in Table 2 and the statistical comparison criteria listed in Table 3 are both visible in their aggregated forms, in the “comparison” spreadsheet. At this point, the researcher has the opportunity to choose a particular regression model that will serve their needs in the most efficient way. When a regression assumption is not met, the font in these cells becomes red as well. Especially for the linearity test, when at least one regression coefficient is not statistically significantly different from zero, the “violated” message is displayed in red font. In the event that all regression coefficients are statistically significantly different from zero, the message “OK” is displayed.

Regarding the independence test, cells C3:C13 are conditionally highlighted. When a value is between 1.5 and 2.5, the cell is highlighted in dark green (optimum value), for a value between 1 and 1.5 or 2.5 and 3, the corresponding cell is highlighted in light green (acceptable value), and for a value >3 or <1, the corresponding cell is highlighted in yellow (the assumption of independence is not met).

The cells where the three statistical comparison criteria are calculated (i.e., the cells H3:J13) are highlighted on a graduated scale, from dark green (best values) to yellow (worst values).

For the data by Softa (2023), the results of the “comparison” spreadsheet are given in Table 5. Based on these results, one might decide (subjectively, not necessarily) that the best fitted model is the power (no 7) model:

V = 0.000046 · DBH2.6513

because the linearity, independence and residuals’ zero mean assumptions are not violated, and the comparison criteria have satisfactory values, compared with those of the other 10 models.

Table 5. Results of the “comparison” spreadsheet

Tablica 5. Rezultati proračunske tablice usporedbe (“comparison”)

image7.png

Discussion and Conclusion

Rasprava i zaključak

LineFit.xls is an all-in-one template whose outputs represent the behavior of eleven regression models, allowing any researcher to assess the fit of these models to their Y-X data. It is important to remember, however, that regression, as a parametric statistical process, may still perform pretty well even if some of the assumptions behind it are violated (Kitikidou et al., 2012; Kitikidou et al., 2013). Before rejecting a regression model because the assumptions are not met and choosing a non-parametric analysis, one should critically consider that non-parametric analyses employ rankings of values rather than the values themselves and hence cannot provide usable, quantitative estimations (Altman, 2009).

The LineFit.xls MS Excel template is available as a downloadable file on this journal's website to offer the scientific community an option to test and evaluate the most popular Y-X regression models with an all-in-one modeling tool.

References

1 

Altman, D., 2009;Parametric v non-parametric methods for data analysis,Brit. Med. J. 3383167:

2 

Archontoulis, S.V., F.E., Miguez, 2015;Nonlinear regression models and applications in Agricultural Research,. Agron. J. 107(2):786–798

3 

Draper, N.R., H. Smith, 2019. Applied regression analysis,. Wiley India Private Limited,; Delhi, India.:

4 

Durbin, J., G.S., Watson, 1950;Testing for serial correlation in least squares regression. I. Biometrika. 3734:409–428

5 

Durbin, J., G.S., Watson, 1951;Testing for serial correlation in least squares regression. II. Biometrika. 3812:159–178

6 

Ezequiel, M., K.A., Fox, 1959:Methods of correlation and regression analysis: Linear and curvilinear,. John Wiley and Sons,; New York, USA.:

7 

Field, A.P., 2009. Discovering statistics using SPSS,. SAGE,; California, USA.:

8 

Freese, F., 1964. Linear regression methods for Forest Research, U.S. Dept. of Agriculture, Forest Service, Forest Products Laboratory.;

9 

Jarque, C.M., 2011. Jarque-Bera Test,in: Lovric, M. (ed) , editor. International Encyclopedia of Statistical Science. Springer,; Berlin, Germany,: p. 701–702

10 

Kitikidou, K., 2005. Applied statistics using the SPSS statistical package,Greek,. Tziola publications,; Thessaloniki.:

11 

Kitikidou, K., E., Milios, L., Iliadis, M., Kaymakis, 2012;Combination of M-estimators and neural network model to analyze inside/outside bark tree diameters,IFIP Adv. Inf. Comm. Te. 3811118:

12 

Kitikidou, K., E. Milios, L. Iliadis, M., Kaymakis, 2013;Pilot neural modeling of the inside bark tree diameter. A comparative study with robust regression,. Eng. Intell. Syst. 23:125–131

13 

Koenker, R., G., Bassett, 1982;Robust tests for heteroscedasticity based on regression quantiles,. Econometrica. 50(1):43–61

14 

Mathews, J.H., 1987. Numerical methods for computer science, engineering, and Mathematics,. Prentice-Hall,; New Jersey, USA.:

15 

Softa, E. 2023. Dendrometrical characteristics of Robinia pseudoacacia in degraded areas of Xanthi region. MSc thesis,. Democritus University of Thrace,; Greece.:

16 

SPSS, Inc., 2007. SPSS statistics base 17.0 user's guide,. Chicago, USA.:

17 

Student, 1908;The probable error of a mean,. Biometrika. 6(1):1–25

18 

Wackerly, D.D., W., Mendenhall, R.L., Scheaffer, 2012:Mathematical statistics with applications,. Brooks/Cole, California, USA.:


This display is generated from NISO JATS XML with jats-html.xsl. The XSLT engine is libxslt.