
Original scientific paper

https://doi.org/10.17559/TV-20211102113116

A Study on the Features Selection Algorithm Based on the Measurement Method of the Distance Between Normal Distributions for Classification in Machine Learning

Byungju Shin ; k2soft (06140) 4th Floor, 39, Teheran-ro, 33-gil, Gangnam-gu, Seoul, Republic of Korea
Minwoo Kim ; k2soft (06140) 4th Floor, 39, Teheran-ro, 33-gil, Gangnam-gu, Seoul, Republic of Korea
Bohyun Wang ; Gachon University, (13120) 531 AI Building, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si, Gyeonggi-do, Republic of Korea
Joon S. Lim orcid id orcid.org/0000-0003-3112-2644 ; Gachon University, (13120) 531 AI Building, 1342, Seongnam-daero, Sujeong-gu, Seongnam-si, Gyeonggi-do, Republic of Korea


Full text: English, PDF, 1,090 KB

pp. 852-860

Downloads: 239



Abstract

Feature selection is an important technique that simplifies machine learning models so that they are easier to understand, shortens learning time, and reduces over-fitting or under-fitting. This paper presents a feature selection algorithm based on measuring the similarity between the sampled feature values of the classification variables (classes), on the premise that the lower the similarity, the more useful the feature is for class classification. The confidence interval of a normal distribution is used to measure similarity: the more the confidence intervals of the classes overlap, the higher the similarity is judged to be, while the smaller the overlap, the lower the similarity, and a feature with low similarity can serve as a criterion for classification. We therefore propose an equation that applies this method. To confirm the usefulness of the equation, a colorectal cancer dataset with about 2000 genes was used, and comparative experiments were performed against other feature selection algorithms: the Gini index (10 features), mRMR (10 features), and the relation matrix algorithm (7 features). An artificial neural network was used as the machine learning algorithm, and the comparison was verified with leave-one-out cross-validation. In the experiment, the proposed method achieved 88.71% accuracy by selecting 10 features, outperforming the Gini index (85.487%), mRMR (87.09%), and the relation matrix algorithm (87.09%). In addition, experiments on the iris, wine, glass, music emotions, seeds, and Japanese vowels datasets were conducted for multi-class classification problems. For wine, the accuracy was 98.8% when all features were used, but rose to 99.4% after six features were removed. For music emotions, the accuracy was 51.7% when all 54 features were used, but improved to 61.3% when 20 features were removed. For seeds, reducing the number of features from 7 to 5 slightly improved the accuracy from 93.3% to 93.8%. For iris, glass, and Japanese vowels, the accuracy did not increase when features were removed. Therefore, it can be concluded that features can be selected easily and effectively for multi-class classification problems using the method proposed in this paper.
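The abstract describes ranking features by how much the per-class confidence intervals of a normal distribution overlap, with less overlap indicating a more useful feature. The sketch below is a minimal illustration of that idea only: the 95% interval (z = 1.96) and the use of total pairwise overlap length as the score are assumptions made here for illustration, not the exact equation proposed in the paper, which is given in the full text.

```python
import numpy as np

def ci_overlap(mu1, sd1, mu2, sd2, z=1.96):
    """Length of the overlap between two z*sd confidence intervals.

    The 95% multiplier and the overlap-length score are illustrative
    assumptions; the paper defines its own equation for this step.
    """
    lo1, hi1 = mu1 - z * sd1, mu1 + z * sd1
    lo2, hi2 = mu2 - z * sd2, mu2 + z * sd2
    return max(0.0, min(hi1, hi2) - max(lo1, lo2))

def rank_features(X, y, z=1.96):
    """Rank features by total confidence-interval overlap over all class pairs.

    A smaller total overlap means the per-class distributions of the feature
    are farther apart, so the feature is assumed to be more useful for
    classification and is ranked earlier.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        total = 0.0
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                xa = X[y == classes[a], j]
                xb = X[y == classes[b], j]
                total += ci_overlap(xa.mean(), xa.std(), xb.mean(), xb.std(), z)
        scores[j] = total
    return np.argsort(scores)  # least-overlapping features first

# Example usage: keep the 10 least-overlapping features of a dataset (X, y).
# selected = rank_features(X, y)[:10]
```

In this sketch the feature subset size (e.g. 10, as in the colorectal cancer experiment) is chosen by the user; the paper's own selection rule may differ.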

Keywords

classification; distance; feature selection; Gaussian distribution; similarity

Hrčak ID:

275300

URI

https://hrcak.srce.hr/275300

Publication date:

19 April 2022

Views: 580