PREDICTION OF MAGNETIC SUSCEPTIBILITY CLASS OF SOIL USING DECISION TREES

Original scientific paper
Magnetic susceptibility (MS) is a dimensionless proportionality constant that indicates the degree of magnetization of a material in response to an applied magnetic field. MS values depend on the composition and grain size of magnetic minerals and on their source, which may be lithogenic, pedogenic or anthropogenic. The focus of this study is to predict the MS class of soil using data mining algorithms. We applied two data mining classification algorithms, ID3 and C4.5, to predict the MS class and the degree of pollution in the Izmit area of Turkey. The algorithms assign possible MS classes according to the heavy metal concentrations (Pb, Cu, Zn, Co, Cd, Ni) related to MS. The aim is to construct a decision tree and a set of rules from which the MS class can be obtained. In this way, errors resulting from changes in ambient conditions and from measurement difficulties are avoided. Using the extracted rules we reached 82 % accuracy, and the predicted values were shown to be compatible with the measured values.


Introduction
The minerals present in soil are either of natural origin (through lithogenesis or pedogenesis) or of anthropogenic origin (industrial residues). The magnetic mineral content of the soil can be expressed in very broad terms by its magnetic susceptibility [1]. Magnetic susceptibility is a measure of the iron-bearing components in a material; it can be used to identify the type of material tested as well as the amount of iron-bearing minerals it contains [2]. Many studies in the literature have investigated heavy metal contamination and the industrial activities that cause soil, air or water pollution [3 ÷ 7]. In addition, magnetic susceptibility has been shown to be a highly useful indicator of industrial pollution, traffic-related gas emissions and other atmospheric pollutants [1, 8 ÷ 20].
In recent years, data mining studies have found a place in environmental geophysics publications. These studies offer investigators different interpretations, especially in the evaluation of field measurement results. Data mining methods have thus moved environmental geophysics studies into a new dimension.
Some studies in the literature combine data mining and environmental geophysics. For example, Hanesch et al. [21] analysed topsoil data using fuzzy C-means cluster analysis and non-linear mapping techniques and observed a link between magnetic susceptibility and heavy metal content. Vibha et al. [22] presented an efficient hybrid model that first clusters the data and then classifies it, using the spatial conceptual information extracted from the environmental variables. Preetz et al. [23] introduced a classification system to assess soil magnetic susceptibilities from geoscientific maps. Canbay et al. [24] applied the C4.5 data mining classification algorithm to predict the MS class and the degree of pollution in the Izmit area of Turkey. Beyond these, our survey of the literature did not turn up any study on magnetic susceptibility (MS) prediction with data mining methods.
Pollution is a subject of current interest, and several fields of research are developing monitoring techniques to analyse the distribution and reach of contaminants around their sources. Although the man-made contribution of heavy metals and other pollutants can be studied by careful chemical methods, these are time-consuming, laborious and costly, and magnetic monitoring constitutes an alternative tool for pollution studies. The relationship between the two kinds of variables is complex and non-linear. Consequently, multivariate techniques have become necessary, and multivariate statistical analyses have been investigated for magnetic monitoring in soils. Furthermore, classification may require prior knowledge and a long time, and the actual physical properties of soil samples can sometimes be lost when the measurement is not taken on site. A soil database comprising pedological, geochemical, geological and magnetic susceptibility data has to be evaluated with a combined multivariate study. We suggest using data mining techniques instead of these multivariate techniques.
The focus of our study is to predict the MS classification of the soil using data mining techniques.
MS values depend on the composition and grain size of magnetic minerals and on their source, which may be lithogenic, pedogenic or anthropogenic. In this paper, we applied two data mining classification algorithms, ID3 and C4.5, to predict the MS class and the degree of pollution in the Izmit area of Turkey. Possible MS classes are obtained according to the heavy metal concentrations (Pb, Cu, Zn, Co, Cd, Ni) related to MS, and the predicted values are shown to be compatible with the measured values. The main aim of this paper is to determine, with data mining methods, the relationship between heavy metal contamination and magnetic susceptibility measurements in the Kocaeli area of Turkey.

ID3 and C4.5 decision tree algorithms
Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning and other areas [25]. Developments in computer technology have produced too much data but too little information, so useful information has to be extracted from large volumes of data. The knowledge discovery process basically has seven steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Steps 1 through 4 are different forms of data pre-processing, in which the data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base [26].
Data mining models can be divided into two main groups, predictive and descriptive. In a predictive model, a model is first built from data whose outcomes are known; this model can then predict outcomes that are not yet known. In a descriptive model, the patterns extracted from existing data are used to support decisions.
Data mining methods are grouped into classification, association rules and clustering. Classification methods include decision trees, neural networks, Bayesian networks and Bayesian classification.
In this study, we used the decision tree method, with the ID3 [27] and C4.5 [28] algorithms, to classify soil magnetic susceptibility.
The ID3 algorithm selects the best attribute for developing the tree based on the concepts of entropy and information gain. The C4.5 algorithm acts similarly to ID3 but improves some of its behaviours:
• The possibility of using continuous data.
• The handling of unknown (missing) values, which are marked by "?".
• The possibility of using attributes with different weights.
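The first improvement in the list, the use of continuous data, is commonly realised by testing candidate thresholds on a numeric attribute and keeping the one with the highest information gain. The following is a minimal sketch of that idea, assuming Shannon entropy with base-2 logarithms and midpoints between consecutive distinct values as candidates. The function names and the toy Pb values and class labels are hypothetical and are not taken from the dataset of this study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary split `value <= threshold`
    with the highest information gain, or (None, 0.0) if no split helps."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [c for v, c in zip(values, labels) if v <= t]
        right = [c for v, c in zip(values, labels) if v > t]
        # Size-weighted entropy of the two sides after the split.
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical Pb concentrations and MS classes, for illustration only.
pb = [12.0, 18.0, 25.0, 40.0, 55.0, 70.0]
ms_class = ["few", "few", "few", "high", "high", "high"]
threshold, gain = best_threshold(pb, ms_class)
print(threshold, gain)
```

In this toy case the two classes are perfectly separated by a threshold halfway between 25 and 40, so the split removes all uncertainty about the class.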
In the ID3 algorithm, the amount of information required to classify the given data is calculated with Eq. (1), where p_i is the probability that an arbitrary data object belongs to class C_i: p_i = s_i/s [30]. Here (a_1, a_2, ..., a_v) are the v different values of an attribute A; (S_1, S_2, ..., S_v) are the subsets into which S is divided by A; and S_j contains the data samples of S for which A has the value a_j.
If A is selected as the test attribute, the entropy required to divide the sample set by A is calculated as the size-weighted average of the information of the subsets S_j, and the information gain of A is the difference between the two quantities. The ID3 algorithm pseudo code is given in Tab. 1.
The C4.5 algorithm uses the divide-and-conquer paradigm, based on multi-branched recursion, and achieves high accuracy as a result. In the C4.5 algorithm, T is the dataset, C is the collection of classes (C_1, C_2, C_3, ..., C_k), and V is an attribute with n distinct values v_i that divides T into subsets.
The occurrence probability of C_j can be calculated with Eq. (4).
The occurrence probability of V = v_i can be calculated with Eq. (5).
The conditional probability of class C_j in the cases where V = v_i is obtained in the same manner. The entropy of information can be calculated with Eq. (7).
Conditional entropy can be found with Eq. (8).
Information gain can be calculated with Eq. (9).
The information entropy of the attribute is found with Eq. (10), and finally the gain ratio is calculated with Eq. (11).
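For orientation, the quantities referred to in this section are usually written as below. This is a sketch in standard textbook notation rather than a reproduction of the numbered equations: m (the number of classes), s_{ij} (the number of samples of class C_i in subset S_j), T_i (the subset of T with V = v_i) and the use of base-2 logarithms are assumptions introduced here.

```latex
% ID3: expected information, entropy of attribute A, information gain
\begin{align*}
I(s_1,\dots,s_m) &= -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad p_i = \frac{s_i}{s} \\
E(A) &= \sum_{j=1}^{v} \frac{s_{1j}+\dots+s_{mj}}{s}\; I(s_{1j},\dots,s_{mj}) \\
\mathrm{Gain}(A) &= I(s_1,\dots,s_m) - E(A)
\end{align*}

% C4.5: probabilities, entropies, information gain, gain ratio
\begin{align*}
P(C_j) &= \frac{|C_j|}{|T|}, \qquad
P(v_i) = \frac{|T_i|}{|T|}, \qquad
P(C_j \mid v_i) = \frac{|C_j \cap T_i|}{|T_i|} \\
H(C) &= -\sum_{j=1}^{k} P(C_j) \log_2 P(C_j) \\
H(C \mid V) &= -\sum_{i=1}^{n} P(v_i) \sum_{j=1}^{k} P(C_j \mid v_i) \log_2 P(C_j \mid v_i) \\
I(C, V) &= H(C) - H(C \mid V) \\
H(V) &= -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i) \\
\mathrm{GainRatio}(V) &= \frac{I(C, V)}{H(V)}
\end{align*}
```

Dividing the gain by H(V) penalises attributes with many distinct values, which plain information gain tends to favour.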
The C4.5 algorithm pseudo code is given in Tab. 2.
    if D is pure or stopped
        break
    end if
    for all attribute a ∈ D do
        calculate information-theoretic criteria
    end for
    a* = best attribute
    T* = create decision node for finding a* in the root
    D* = sub-datasets from D based on a*
    for all D* do
        T* = C4.5(D*)
        add T* to the corresponding branch of T
    end for
    return (T)
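The recursion in the pseudo code can be sketched in a few lines of Python. This is an illustrative simplification and not the implementation used in this study: it handles only categorical attributes, selects the split by gain ratio, and leaves out continuous thresholds, missing values and pruning. The function names, the dictionary-based tree layout and the toy rows are all hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attr, target):
    """Information gain of `attr` divided by the entropy of `attr` itself."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[attr] for r in rows])
    if split_info == 0:  # attribute takes a single value: no useful split
        return 0.0
    return (entropy([r[target] for r in rows]) - conditional) / split_info


def build_tree(rows, attrs, target):
    """Return a class label (leaf) or a dict describing a decision node."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    best = max(attrs, key=lambda a: gain_ratio(rows, a, target))
    if gain_ratio(rows, best, target) <= 0:
        return majority
    node = {"attr": best, "default": majority, "branches": {}}
    remaining = [a for a in attrs if a != best]
    # One branch (and one recursive call) per observed value of the best attribute.
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target)
    return node


def classify(tree, row):
    """Walk the tree; unseen attribute values fall back to the node majority."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree


# Hypothetical, already discretised samples (illustration only).
rows = [
    {"Pb": "high", "Cu": "low", "MS": "high"},
    {"Pb": "high", "Cu": "high", "MS": "high"},
    {"Pb": "low", "Cu": "high", "MS": "medium"},
    {"Pb": "low", "Cu": "low", "MS": "few"},
]
tree = build_tree(rows, ["Pb", "Cu"], "MS")
print(tree["attr"], classify(tree, {"Pb": "low", "Cu": "high"}))
```

On these toy rows Pb has the larger gain ratio and becomes the root, and Cu is tested only on the branch where Pb is low.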

Magnetic susceptibility
Soil samples were collected vertically from a depth of 0 ÷ 30 cm at 13 stations spread over a 300 km² area of the Kocaeli region, with an average grid spacing of 10 km. Some stations were located on the middle Eocene Çaycuma formation and the rest on Quaternary alluvium. The Çaycuma formation is composed of sandstone, claystone, marl, limestone and pebbles. Particle size varies between 0,0013 and 40 mm in this area. Samples were taken with plastic tubes at different depths within the 30 cm investigation depth at each station and then mixed into a composite sample for chemical analysis. Some stations were chosen in a rural area around the roads, and the others were selected close to site-specific pollution sources such as industrial plants near the main roads.
Susceptibility can be a useful, highly sensitive and fast indicator of mineralogy and granulometry. Environmental scientists have applied magnetic techniques with demonstrable success in pollution studies. Many anthropogenic emissions contain various particles that cause heavy metal pollution of soils in industrial areas. Fundamentally, magnetic susceptibility can give a general view of the degree of pollution. Today, this method is also used very often in agriculture.
Different authors have studied the soil, air and water pollution caused by heavy metal contamination and other industrial activities.
Magnetic susceptibility has been shown to be highly useful in investigating industrial pollutants, traffic emissions and other atmospheric pollutants. The use of magnetic measurements as a proxy for heavy metal pollution is based on the fact that the origins of heavy metals and magnetic particles are genetically related. Environmental magnetism studies have demonstrated the relationship between heavy metal contents and the magnetic, lithological and pedological properties of soils. Several studies have confirmed a direct correlation between the magnetic susceptibility of contaminated soils and the presence of hydrocarbons and certain heavy metals (Pb, Zn etc.).

Materials and methods
Magnetic susceptibility measurements were collected at 13 stations in different environmental settings: a heavy industrial area with main roads carrying heavy traffic, and a rural area around the roads. 93 samples were taken vertically from 3 different layers (5, 10 and 15 cm) at the 13 stations. The surface measurements were performed at the stations using an SM-20 and an MS-2 Bartington loop sensor with a diameter of 185 mm. The penetration depth is about 30 cm. After the magnetic susceptibility measurements in the laboratory, the heavy metal (Pb, Cu, Zn, Co, Cd and Ni) concentrations of the samples were determined using a Shimadzu model 6001 atomic absorption spectrometer. Samples were taken with plastic tubes at different depths within the 30 cm investigation depth at each station and then mixed into a composite sample for chemical analysis. Soil samples of 1,0 ± 0,09 g were weighed and placed in platinum or porcelain crucibles. In the ash furnace, the temperature was gradually increased to 900 °C. The samples were left to cool in the furnace and then transferred to 100 ml beakers, to which 10 ml of HNO3 and 30 ml of HCl (both concentrated, i.e. aqua regia) were added. The mixture was evaporated to dryness in a fume cupboard; 5 ml of concentrated HCl was then added and the mixture was evaporated to dryness again. The residue was dissolved in just enough HCl to dissolve the sample. The volume was finally brought up to 250 ml with HCl solution (5 %).
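Because the procedure fixes both the sample mass (about 1,0 g) and the final solution volume (250 ml), a spectrometer reading of the solution can be converted to a concentration in the soil by simple arithmetic. The sketch below assumes the reading is expressed in mg/l and ignores any further dilution or blank correction, neither of which is described in the text; the function name and the example reading are hypothetical.

```python
def soil_concentration_mg_per_kg(reading_mg_per_l, volume_ml=250.0, mass_g=1.0):
    """Convert a solution reading (mg/l) to mg of metal per kg of weighed soil."""
    metal_mg = reading_mg_per_l * volume_ml / 1000.0  # mg of metal in the final solution
    return metal_mg / (mass_g / 1000.0)               # divide by the soil mass in kg


# Hypothetical reading of 0.2 mg/l for a 1 g sample made up to 250 ml.
print(soil_concentration_mg_per_kg(0.2))
```

With these default values every 1 mg/l in the solution corresponds to 250 mg/kg in the soil, so the example reading works out to about 50 mg/kg.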
Magnetic susceptibility values measured in the field (topsoil magnetic susceptibility measurements), mass-specific magnetic susceptibility values measured in the laboratory and heavy metal concentrations are given in Tab. 3.

Improved method
In this study we developed an improved method that classifies soil magnetic susceptibility with classification algorithms and also predicts the class of new data. We report the prediction success rate of each algorithm. Our dataset contains Pb, Cu, Zn, Co, Cd and Ni heavy metal measurements of soil samples. The reference values for heavy metal concentrations are given in Tab. 4 [31].
The classification of soil magnetic susceptibility results is based on the thresholds given in Tab. 5, which were determined by experimental measurements in the city of Kocaeli. The Çaycuma formation is composed of sandstone, claystone, marl, limestone and pebbles. Particle size generally varies between 0,0013 and 40 mm in this area. The magnetic minerals in the sampled region have a very small grain size and a high penetration capacity. The soil samples were taken as mixed (composite) samples. We also took a large number of samples from rocks with different features, so that the improved method can extract more general rules about the magnetic susceptibility classification of these soil samples.
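Assigning an MS class from thresholds amounts to a lookup in a sorted list of class boundaries. The sketch below shows one way to do this; the two boundary values are placeholders chosen for illustration only and are not the thresholds of Tab. 5, the class names follow the "few", "medium" and "high" labels used for the extracted rules, and the function name is hypothetical.

```python
from bisect import bisect_right

# Placeholder boundaries (hypothetical, NOT the values of Tab. 5).
BOUNDARIES = [50.0, 100.0]
CLASSES = ["few", "medium", "high"]


def ms_class(ms_value, boundaries=BOUNDARIES, classes=CLASSES):
    """Return the class whose interval contains ms_value.

    A value exactly equal to a boundary is assigned to the upper class.
    """
    return classes[bisect_right(boundaries, ms_value)]


print(ms_class(10.0), ms_class(50.0), ms_class(150.0))
```

Keeping the boundaries in a separate list means the same function can be reused unchanged once the measured thresholds are substituted for the placeholders.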
Tab. 6 gives some examples from the training dataset (the dataset.xlsx file). Heavy metal measurements are used as attributes. The "results for C4.5" column is the target class attribute of the dataset and is used only to build the C4.5 decision tree; in the same way, the ID3 decision tree can be built using the "results for ID3" column. We used all of the data in this dataset to extract the classification rules.
We also obtained new rules from the C4.5 algorithm for the same dataset using the improved method. These new rules are given in Tab. 8. With the C4.5 algorithm, Pb was found to be the decision point for our dataset. As can be seen from Tab. 8, Pb is the most decisive criterion for the classification result. Tab. 7 and Tab. 8 essentially express the same results with different class names because of the different algorithms: "1" and "few", "2" and "medium", and "3" and "high" have the same meaning, as shown in Tab. 5. These rules can be applied to new test datasets to obtain classification results. With the improved algorithm, we can also draw individual decision trees from the rules of the ID3 and C4.5 algorithms. The C4.5 decision tree is given in Fig. 1; the ID3 decision tree is not shown because it is too large.
According to Fig. 1, the concentrations of Pb, Cu, Zn and Cd are important for predicting the MS class with the C4.5 algorithm. The most important metal is Pb; the tree then tests the Cu concentration, followed by Zn and Cd, and finally Zn again. The improved method also finds the classification result of new data: when new values are entered for each attribute, the algorithm classifies the new sample and returns its class. We used soil samples from different sampling stations for testing. Their average heavy metal concentrations, topsoil MS field measurements and mass-specific MS laboratory measurements are given in Tab. 9 [31]. We took the "topsoil MS field measurements" column of Tab. 9 into account. These 13 samples were used as test data for the rules extracted in Tab. 7 and Tab. 8. With the improved method we aimed to reproduce the class of that column as defined by Tab. 5. In Tab. 10, the results obtained from the ID3 and C4.5 algorithms are compared with the real topsoil MS field measurements given in Tab. 9. As can be seen in Tab. 10, the correct prediction rate of the C4.5 algorithm is 12 % better than that of ID3 for these samples.
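Applying extracted rules to a new sample can be sketched as a first-match search over a list of rules, where each rule is a set of threshold tests on heavy metal concentrations. The rules below are hypothetical placeholders written only to show the mechanism; their thresholds and outcomes are not those of Tab. 7, Tab. 8 or Fig. 1, and the names used in the code are assumptions.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

# Hypothetical rules: (list of (attribute, comparison, limit), resulting class).
# Thresholds and classes are placeholders, not the rules extracted in the study.
RULES = [
    ([("Pb", "<=", 30.0)], "few"),
    ([("Pb", ">", 30.0), ("Cu", "<=", 40.0)], "medium"),
    ([("Pb", ">", 30.0), ("Cu", ">", 40.0)], "high"),
]


def apply_rules(sample, rules=RULES, default=None):
    """Return the class of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(OPS[op](sample[attr], limit) for attr, op, limit in conditions):
            return label
    return default  # no rule matched


# Hypothetical new sample.
print(apply_rules({"Pb": 50.0, "Cu": 10.0}))
```

Each path from the root of a decision tree to a leaf corresponds to one such rule, so the rule list and the tree are two views of the same classifier.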
Additionally, we applied the extracted rules to 150 new soil measurements. According to our observations, the prediction rate of the C4.5 algorithm is about 43 % higher than that of the ID3 algorithm. As a result, the magnetic susceptibility class can be estimated more accurately with the C4.5 algorithm in this method, and overall we reached 82 % accuracy on soil MS values.
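The comparison of prediction rates can be illustrated with a short accuracy calculation. The class lists below are invented for illustration and do not reproduce Tab. 10 or the 150 additional measurements. The sketch also shows that a difference between two rates can be stated either in percentage points or as a relative improvement, and the two readings give different numbers.

```python
def accuracy(predicted, observed):
    """Fraction of samples whose predicted class equals the observed class."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("predicted and observed must be non-empty and of equal length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


# Hypothetical classes for five samples (illustration only).
observed = ["few", "medium", "high", "high", "medium"]
id3_pred = ["few", "high", "high", "medium", "medium"]
c45_pred = ["few", "medium", "high", "medium", "medium"]

acc_id3 = accuracy(id3_pred, observed)
acc_c45 = accuracy(c45_pred, observed)

# Two ways of expressing how much better C4.5 is in this toy case.
difference_points = 100 * (acc_c45 - acc_id3)       # percentage points
relative_gain = 100 * (acc_c45 - acc_id3) / acc_id3  # relative improvement in %
print(acc_id3, acc_c45, difference_points, relative_gain)
```

In the toy data ID3 is right for three of five samples and C4.5 for four of five, which is a difference of about 20 percentage points but a relative improvement of about 33 %.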

Conclusion
The variety of elements, which results from the soil structure changing between sampling locations, also changes the magnetic susceptibility. Another important result observed during the measurement stage was that the magnetic susceptibility decreases in samples with a high variety of contents. In addition, extreme minimum and maximum magnetic susceptibility values can be encountered at closely spaced measurement locations where the rock type and the pollution source do not vary. Another frequent observation was that changing element properties also change the anisotropic properties.
After the field measurements, the samples were taken to the laboratory without delay, so that the measurement process was completed as far as possible without the samples losing their properties. Nevertheless, the characteristics of the natural environment cannot be fully preserved, so numerical differences occur between the measured values.
In this study, the results of the chemical analyses and the field measurements were considered together, and rules were extracted from them. With these rules, the field measurement class of new heavy metal values (concentrations) can be predicted. The field measurements, namely the topsoil MS field measurements, are used as the target class of our dataset. The ID3 and C4.5 classification algorithms have been applied to predict the magnetic susceptibility of the soil. From the heavy metal concentrations and the topsoil magnetic susceptibility values, the improved method constructs the decision tree and the rules and then predicts the MS class of a new soil sample.
As mentioned previously, the rules given in Tab. 7 and Tab. 8 were extracted. These rules will change when a new or different dataset is used. Cd is the most influential heavy metal for the classification result according to the ID3 algorithm, whereas Pb is the most influential heavy metal in the C4.5 algorithm. This difference comes from the mathematical calculations used in the two algorithms.
Beyond the present study, a decision tree built from thousands of soil samples may be more stable. The soil samples used here belong to one region; with more samples from different regions of the country, more general rules could be obtained. Using these rules, the magnetic susceptibility class can be obtained by entering measured heavy metal values. Moreover, it is then not necessary to bring soil samples to the laboratory to measure magnetic susceptibility. Thus, errors resulting from changes in ambient conditions are avoided.
As future work, other classification methods, for example the Random Forest algorithm or Naive Bayes, can be applied. In addition, the methods can be compared on the same MS values to decide on the best algorithm. The algorithms should also be applied to thousands of heavy metal values.