Feature Selection: Benefits and Methods. How to Choose a Feature Selection Method?



Feature selection, one of the main components of feature engineering, is the process of selecting the most relevant features to use as input to a machine learning algorithm. Feature selection techniques reduce the number of input variables by eliminating redundant or irrelevant features, narrowing the feature set down to those most relevant to the machine learning model.
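
As a quick illustration, here is a minimal sketch of this narrowing-down step using scikit-learn's SelectKBest filter on the built-in Iris data; the ANOVA F-test scoring function and k=2 are illustrative choices, not a general recommendation.

```python
# Minimal feature selection sketch: keep the k features most related to the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 4 input features
selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring
X_reduced = selector.fit_transform(X, y)           # shape: (150, 2)

print(selector.get_support())  # boolean mask marking the selected features
```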

The main benefits of performing feature selection in advance, rather than letting the machine learning model figure out which features are most important, include:

  • simpler models: simple models are easier to explain; a model that is too complex to interpret provides little practical value

  • shorter training times: a smaller, more relevant subset of features reduces the time needed to train a model

  • variance reduction: fitting fewer input variables reduces the variance of the model's estimates, increasing the precision obtainable from a given amount of data

  • avoid the curse of dimensionality: as the number of features (dimensions) grows, the volume of the feature space increases so quickly that the available data becomes sparse; dimensionality-reduction techniques such as PCA can be used to counter this (see the sketch below)
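
As mentioned in the last bullet, PCA is one common way to fight the curse of dimensionality. Strictly speaking, PCA performs feature extraction (it builds new component features) rather than feature selection, but the effect on dimensionality is the same. Here is a minimal sketch with scikit-learn, where retaining 95% of the variance is an illustrative choice:

```python
# Reduce dimensionality with PCA: project the data onto fewer components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)    # 64 pixel features per image
pca = PCA(n_components=0.95)           # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # far fewer than 64 dimensions remain
```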


In short, three key benefits of performing feature selection on your data are:

  • Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.

  • Improves Accuracy: Less misleading data means modeling accuracy improves.

  • Reduces Training Time: Less data means that algorithms train faster.



The most common input variable data types include: Numerical Variables, such as Integer Variables and Floating Point Variables; and Categorical Variables, such as Boolean Variables, Ordinal Variables, and Nominal Variables. Popular implementations include scikit-learn's sklearn.feature_selection module in Python and packages such as caret in R.
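
These data types matter because they determine which statistical measures are appropriate. As a hedged sketch, the pairing below follows common scikit-learn practice: the chi-squared test for non-negative inputs with a categorical target, and the F-test for numerical inputs with a numerical target; the k values are arbitrary.

```python
# Matching the filter scoring function to the input and target data types.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Non-negative inputs, categorical target -> chi-squared test
X_cls, y_cls = load_iris(return_X_y=True)
X_cls_sel = SelectKBest(score_func=chi2, k=2).fit_transform(X_cls, y_cls)

# Numerical inputs, numerical target -> F-test for regression
X_reg, y_reg = load_diabetes(return_X_y=True)
X_reg_sel = SelectKBest(score_func=f_regression, k=5).fit_transform(X_reg, y_reg)

print(X_cls_sel.shape, X_reg_sel.shape)
```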

Feature Selection Methods by Label Information

Feature selection algorithms are categorized as either supervised, which can be used with labeled data, or unsupervised, which can be used with unlabeled data. Supervised techniques are further classified as filter methods, wrapper methods, embedded methods, or hybrid methods: