Original scientific paper
https://doi.org/10.2478/bsrj-2018-0017
Number of Instances for Reliable Feature Ranking in a Given Problem
Marko Bohanec
orcid.org/0000-0002-5295-5111
Salvirt Ltd., Ljubljana, Slovenia
Mirjana Kljajić Borštnar
orcid.org/0000-0003-4608-9090
Faculty of Organizational Sciences, University of Maribor, Kranj, Slovenia
Marko Robnik-Šikonja
orcid.org/0000-0002-1232-3320
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
Abstract
Background: In the practical use of machine learning models, users may add new features to an existing classification model, reflecting their (changed) empirical understanding of the field. New features can potentially increase the model's classification accuracy or improve its interpretability. Objectives: We introduce a guideline for determining the sample size needed to reliably estimate the impact of a new feature. Methods/Approach: Our approach is based on the feature evaluation measure ReliefF and on bootstrap-based estimation of confidence intervals for feature ranks. Results: We test our approach on real-world qualitative business-to-business sales forecasting data and on two UCI data sets, one of which contains missing values. The results show that new features with a high or a low rank can be detected using a relatively small number of instances, but features ranked near the boundary of useful features require larger samples to determine their impact. Conclusions: A combination of the feature evaluation measure ReliefF and bootstrap-based estimation of confidence intervals can be used to reliably estimate the impact of a new feature in a given problem.
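The Methods paragraph suggests a concrete procedure: evaluate features with ReliefF, then bootstrap the resulting ranks to obtain confidence intervals. The sketch below illustrates that general idea in Python; it is not the authors' exact procedure. It assumes the third-party scikit-rebate package (skrebate) for ReliefF, and the dataset, parameter values, and the helper name bootstrap_feature_ranks are illustrative choices.

    import numpy as np
    from skrebate import ReliefF                      # ReliefF implementation (assumed dependency)
    from sklearn.datasets import load_breast_cancer   # illustrative stand-in for the paper's data

    def bootstrap_feature_ranks(X, y, n_boot=100, n_neighbors=10, seed=0):
        """Return an (n_boot, n_features) array of ReliefF ranks,
        one row per bootstrap resample (rank 1 = most important)."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        ranks = np.empty((n_boot, p), dtype=int)
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)          # resample instances with replacement
            est = ReliefF(n_neighbors=n_neighbors)
            est.fit(X[idx], y[idx])
            # a higher ReliefF weight means a better feature, i.e. a smaller rank number
            order = np.argsort(-est.feature_importances_)
            ranks[b, order] = np.arange(1, p + 1)
        return ranks

    data = load_breast_cancer()
    ranks = bootstrap_feature_ranks(data.data, data.target, n_boot=50)
    lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)  # 95% confidence interval of each rank
    for name, l, h in zip(data.feature_names, lo, hi):
        print(f"{name}: rank in [{int(l)}, {int(h)}]")

Wide rank intervals correspond to the borderline features mentioned in the Results: their intervals narrow only as the number of instances grows, which is how a sample-size guideline can be read off the bootstrap estimates.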
Keywords
machine learning; feature ranking; feature evaluation
Hrčak ID: 203480
Publication date: 11.7.2018.