12 ways of Feature Selection/Dimension Reduction

Yanlin Chen
4 min read · Jan 12, 2019


It’s important to choose the best features for a machine learning model. Removing irrelevant features results in:

Accuracy: a better performing model

Interpretability: easier to understand

Speed: Model can run faster

There are various ways of selecting features; which one to use depends on the data itself and the result we want to get.

Dimensionality Reduction Techniques:

Percent missing values

Drop variables that have a high % of missing values

If a feature has a lot of missing values, it is hard for machine learning models to learn anything from it.


  1. Even if you drop the variable, you might encode its missingness as a feature in order to keep as much information as the original data. Missingness itself can actually turn out to be a useful feature. For example, you can create a binary indicator like “is missing” to denote missing (or non-missing) values.
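A minimal sketch with pandas (synthetic data, hypothetical column names): drop a mostly-missing column, but first keep its missingness as a binary indicator. The 50% threshold is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "sensor_b" is 75% missing.
df = pd.DataFrame({
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [np.nan, np.nan, np.nan, 7.0],
})

# Fraction of missing values per column.
missing_frac = df.isna().mean()

# Keep the missingness signal as a binary "is missing" indicator
# before dropping the raw column.
df["sensor_b_is_missing"] = df["sensor_b"].isna().astype(int)

# Drop columns whose missing fraction exceeds the chosen threshold (50% here).
to_drop = missing_frac[missing_frac > 0.5].index
df = df.drop(columns=to_drop)
```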

Amount of variation

If a feature takes mostly the same value or has very low variation, then it is also hard for the model to learn anything from it.


  1. Standardize all variables, or use standard deviation to account for variables with different scales.
  2. Drop variables with zero variation (unary)
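A sketch with pandas (synthetic data): compute each column’s standard deviation and drop unary columns, whose standard deviation is zero.

```python
import pandas as pd

# Synthetic data: "constant" is unary (zero variation).
df = pd.DataFrame({
    "constant": [5, 5, 5, 5],
    "useful":   [1, 9, 3, 7],
})

# Standard deviation per column; a unary column has std == 0.
stds = df.std()

# Keep only columns with nonzero variation.
df = df[stds[stds > 0].index]
```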

Pairwise correlation (between features)

Many variables are correlated with each other, and hence redundant. If you drop one of them, you won’t lose much information.


  1. If two variables are highly correlated, keeping only one will help reduce dimensionality without much loss of information.
  2. Which one to keep? The one that has a higher correlation coefficient with the target.
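A sketch with pandas and NumPy on synthetic data: find highly correlated pairs and, for each pair, drop the member with the weaker correlation to the target. The names and the 0.95 threshold are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),
})
target = pd.Series(2 * df["a"].to_numpy() + rng.normal(scale=0.1, size=200))

# Upper triangle of the absolute correlation matrix (each pair counted once).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# For each pair above the threshold, drop the feature that is
# less correlated with the target.
drop = set()
for i in upper.index:
    for j in upper.columns:
        if upper.loc[i, j] > 0.95:
            drop.add(i if abs(df[i].corr(target)) < abs(df[j].corr(target)) else j)

df_reduced = df.drop(columns=list(drop))
```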


Multicollinearity

When two or more variables are highly correlated with each other.


  1. Dropping one or more of the variables should help reduce dimensionality without a substantial loss of information.
  2. Which variable(s) to drop? Use the condition index.
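One way to compute condition indices, sketched with NumPy on synthetic data where x3 is nearly a linear combination of x1 and x2. The threshold of ~30 is a common rule of thumb, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)  # near-linear combination
X = np.column_stack([x1, x2, x3])

# Standardize the columns, then take the singular values of the design matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(Xs, compute_uv=False)

# Condition index: largest singular value divided by each singular value.
condition_indices = singular_values.max() / singular_values
# Indices above ~30 flag harmful collinearity; variables loading on that
# near-degenerate direction are candidates to drop.
```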

Principal Component Analysis (PCA)

A dimensionality reduction technique which emphasizes variation.

It uses an orthogonal transformation to project the data onto uncorrelated components, which helps eliminate multicollinearity, but explicability is compromised.

When to use:

  1. Excessive multicollinearity
  2. Explanation of the predictors is not important: the components PCA outputs are not original, identifiable features, but mixes of all the features.
  3. A slight overhead in implementation is okay
  4. More suitable for unsupervised learning
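A minimal PCA sketch with scikit-learn on synthetic rank-2 data. Standardizing first is a common practice since PCA is scale-sensitive; passing a float to `n_components` keeps enough components to explain that fraction of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# 100 samples, 5 features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.05, size=(100, 5))

# Standardize first: PCA is sensitive to feature scales.
Xs = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(Xs)
```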

Cluster analysis

Dimensionality reduction technique which emphasizes correlation/similarity.

It identifies groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with variables in other clusters.

It reduces multicollinearity — and explicability is not (always) compromised.

When to use:

  1. Excessive multicollinearity
  2. Explanation of the predictors is important

It will return a pooled value for each cluster (i.e., the centroids), which can then be used to build a supervised learning model.
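scikit-learn’s `FeatureAgglomeration` is one implementation of this idea: it clusters similar features and returns one pooled (mean) column per cluster. A sketch on synthetic data with two groups of correlated features:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(3)
g1 = rng.normal(size=(200, 1))
g2 = rng.normal(size=(200, 1))
# Two groups of three correlated features each.
X = np.hstack([
    g1 + rng.normal(scale=0.1, size=(200, 3)),
    g2 + rng.normal(scale=0.1, size=(200, 3)),
])

# Cluster the 6 features into 2 groups; pool each group into its mean.
agglo = FeatureAgglomeration(n_clusters=2)
X_pooled = agglo.fit_transform(X)  # shape: (200 samples, 2 pooled columns)
```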

Correlation (with the target)

Drop variables that have a very low correlation with the target.

If a variable has a very low correlation with the target, it’s not going to be useful for the model (prediction).

However, like any of these techniques, this can miss a useful feature, because there might be a feature interaction. Variable A may not correlate with the target, and Variable B may not correlate with the target, but if you turn A and B together into a combined feature, that feature may correlate with the target.
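A sketch with pandas on synthetic data: compute each feature’s absolute correlation with the target and keep those above a chosen threshold (0.2 here, purely illustrative).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise":  rng.normal(size=300),
})
target = pd.Series(3 * df["signal"].to_numpy()
                   + rng.normal(scale=0.5, size=300))

# Absolute Pearson correlation of each feature with the target.
target_corr = df.corrwith(target).abs()

# Keep features above an (illustrative) threshold.
selected = target_corr[target_corr > 0.2].index.tolist()
```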

Forward Selection

  1. Identify the best variable (e.g., based on Model Accuracy)
  2. Add the next best variable into the model
  3. And so on, until some predefined criterion is satisfied
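The steps above can be sketched with scikit-learn’s `SequentialFeatureSelector` on synthetic data, where only the first two columns drive the target:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Greedily add the variable that most improves cross-validated score,
# stopping at the predefined criterion (2 features here).
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
)
sfs.fit(X, y)
chosen = np.flatnonzero(sfs.get_support())
```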

Backward Selection/elimination (RFE)

  1. Start with all variables included in the model
  2. Drop the least useful variable (e.g., based on the smallest drop in Model Accuracy)
  3. And so on, until some predefined criterion is satisfied
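scikit-learn implements this as `RFE` (recursive feature elimination). A sketch on synthetic data where only column 2 matters:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 4 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Start from all variables; repeatedly refit and drop the weakest
# (smallest-coefficient) feature until one remains.
rfe = RFE(LinearRegression(), n_features_to_select=1)
rfe.fit(X, y)
kept = np.flatnonzero(rfe.support_)
```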

Stepwise selection

Similar to the forward selection process, but a variable can also be dropped if it is deemed no longer useful after a certain number of steps.


LASSO: Least Absolute Shrinkage and Selection Operator

Two birds, one stone: Variable Selection + Regularization

LASSO is actually an algorithm for fitting a regularized linear model. A nice property of LASSO is how its regularization parameter behaves: when it is zero (or very small), there is essentially no regularization and you just have a plain linear model. As you increase it, LASSO shrinks coefficients and drives some of them all the way to zero. A coefficient of zero means the feature has been dropped, so it essentially does feature selection for you.
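A sketch with scikit-learn’s `Lasso` on synthetic data: only the first two features are informative, and `alpha` is the regularization parameter discussed above (0.1 is an illustrative value).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=300)

# A moderate alpha shrinks uninformative coefficients exactly to zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Nonzero coefficients mark the selected features.
selected = np.flatnonzero(lasso.coef_)
```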

Tree-based selection

A forest of randomized decision trees can evaluate the importance of features: it fits a number of decision trees/ensembled trees on various sub-samples of the dataset and uses averaging to rank-order the features by importance.

You can set a threshold and say if a given feature importance is below a certain threshold, then remove it from the model.

These last two ideas are only directly useful if a LASSO or tree-based model is the model you’re actually building. Alternatively, you could use a tree-based model just to score feature importance, and then build your final model with a different algorithm.
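A sketch with a scikit-learn random forest on synthetic data, keeping features whose importance exceeds the mean importance (an illustrative threshold):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(300, 6))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only column 0 matters

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Keep features whose averaged importance exceeds the mean importance.
importances = forest.feature_importances_
keep = np.flatnonzero(importances > importances.mean())
```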


