# 12 ways of Feature Selection/Dimension Reduction

---

It’s important to choose the best features for a machine learning model. Removing irrelevant features results in:

**Accuracy**: a better-performing model

**Interpretability**: easier to understand

**Speed**: the model can run faster

There are various ways to select features; which one is best depends on the data itself and the result we want to get.

# Dimensionality Reduction Techniques:

## Percent missing values

Drop variables that have a high percentage of missing values.

If a variable has a lot of missing values, it is hard for machine learning models to learn from it.

Solution:

- Even if you drop the variable, you might encode the missingness as a feature in order to keep as much of the original information as possible. Missingness itself can actually turn out to be a useful feature. For example, you can create a binary indicator such as “is missing” to denote missing (or non-missing) values.
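A minimal sketch of this idea with pandas (the column names, values, and the 50% cutoff are all hypothetical): record a binary missingness indicator before dropping the sparse column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "income" is mostly missing.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [50000, np.nan, np.nan, np.nan],
})

# Keep the information before dropping the sparse column:
# a binary "is missing" indicator per row.
df["income_is_missing"] = df["income"].isna().astype(int)

# Drop columns whose share of missing values exceeds a threshold.
threshold = 0.5
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > threshold].index)
```

The 0.5 cutoff is an arbitrary choice for illustration; pick a threshold that fits your data.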

## Amount of variation

If a feature is mostly the same value, or has very low variation, the model will also find it hard to learn from.

Solution:

- Standardize all variables, or use the standard deviation to account for variables with different scales.
- Drop variables with zero variation (unary)
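One way to drop zero-variance (unary) features is scikit-learn’s `VarianceThreshold`; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [1.0, 0.0, 3.1],
    [1.0, 0.4, 2.9],
    [1.0, 0.2, 3.0],
])  # first column is unary (zero variance)

# The default threshold of 0.0 removes features that take the
# same value in every sample.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)
```

Raise `threshold` to also drop near-constant features, but remember that raw variance depends on scale, so compare variances only after putting variables on comparable scales.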

## Pairwise correlation (between features)

Many variables are often correlated with each other, and hence are redundant. So if you drop one of them, you won’t lose that much information.

Solution:

- If two variables are highly correlated, keeping only one will help reduce dimensionality without much loss of information.
- Which one to keep? The one that has a higher correlation coefficient with the target.
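A common sketch of this with pandas (synthetic data; the 0.95 cutoff is an arbitrary choice): take the absolute correlation matrix, look only at the upper triangle so each pair is checked once, and drop one column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),
})

corr = df.corr().abs()
# Upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X = df.drop(columns=to_drop)
```

This simple version drops the later column of each pair; a target-aware refinement would instead keep whichever member of the pair has the higher absolute correlation with the target.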

## Multicollinearity

When two or more variables are highly correlated with each other.

Solution:

- Dropping one or more variables should help reduce dimensionality without a substantial loss of information.
- Which variable(s) to drop? Use the condition index.
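The condition index can be computed from the singular values of the column-scaled design matrix; here is a sketch with synthetic, near-collinear data (the “above ~30” rule of thumb is a common convention, not a hard rule):

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.001, size=100)  # nearly collinear
X = np.column_stack([x1, x2, x3])

# Scale each column to unit length, then take the singular values.
Xs = X / np.linalg.norm(X, axis=0)
sv = np.linalg.svd(Xs, compute_uv=False)

# Condition index: largest singular value over each singular value.
# A value above ~30 is a common rule of thumb for serious
# multicollinearity.
condition_indices = sv[0] / sv
```

Here the largest condition index is huge because `x3` is almost an exact linear combination of `x1` and `x2`.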

## Principal Component Analysis (PCA)

A dimensionality reduction technique that emphasizes variation.

It helps eliminate multicollinearity, but explainability is compromised, because it applies an orthogonal transformation to the original features.

When to use:

- Excessive multicollinearity
- Explanation of the predictors is not important: the output of PCA is not the original features you can identify, but linear combinations (a mix) of all of them
- A slight overhead in implementation is okay
- More suitable for unsupervised learning
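A minimal sketch with scikit-learn’s `PCA` on synthetic data containing redundant columns; passing a fraction to `n_components` keeps the smallest number of components that explain that share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=200)  # redundant column
X[:, 4] = X[:, 1] + 0.01 * rng.normal(size=200)  # redundant column

# Keep enough orthogonal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Because two columns are near-duplicates, fewer components than original features are needed; but note the resulting components are mixes of all five inputs, not any single original variable.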

## Cluster analysis

A dimensionality reduction technique that emphasizes correlation/similarity.

It identifies groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with variables in other clusters.

It reduces multicollinearity, and explainability is not (always) compromised.

When to use:

- Excessive multicollinearity
- Explanation of the predictors is important

It will return a pooled value for each cluster (i.e., the centroids), which can then be used to build a supervised learning model.
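One implementation of this idea is scikit-learn’s `FeatureAgglomeration`, which clusters similar variables and pools each cluster (by default with the mean); a sketch on synthetic data with two groups of correlated variables:

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.default_rng(0)
base1 = rng.normal(size=(300, 1))
base2 = rng.normal(size=(300, 1))
# Two clusters of correlated variables plus small noise.
X = np.hstack([
    base1 + 0.05 * rng.normal(size=(300, 2)),
    base2 + 0.05 * rng.normal(size=(300, 2)),
])

# Group the 4 variables into 2 clusters and pool each cluster
# (the default pooling function is the mean).
agglo = FeatureAgglomeration(n_clusters=2)
X_pooled = agglo.fit_transform(X)
```

Each pooled column still corresponds to an identifiable group of original variables, which is why explainability survives better here than with PCA.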

## Correlation (with the target)

Drop variables that have a very low correlation with the target.

If a variable has a very low correlation with the target, it’s not going to be useful for the model (prediction).

However, this technique, like any of these techniques, can miss a useful feature, because of feature interactions. Variable A may not correlate with the target, and variable B may not correlate with the target, yet if you turn A and B together into a combined feature, that feature may correlate with the target.
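A sketch of the basic filter with pandas (synthetic data; the 0.15 cutoff is an arbitrary choice): compute each feature’s absolute correlation with the target via `corrwith` and keep only those above a threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["f1", "f2", "f3"])
y = 3 * X["f1"] + rng.normal(scale=0.1, size=500)  # only f1 drives the target

# Absolute Pearson correlation of each column with the target.
target_corr = X.corrwith(y).abs()
keep = target_corr[target_corr > 0.15].index.tolist()
```

Remember the caveat above: a feature can fail this filter individually yet matter in combination with another, so treat this as a coarse first pass.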

## Forward Selection

- Identify the best variable (e.g., based on **Model Accuracy**)
- Add the next best variable into the model
- And so on until some predefined criterion is satisfied
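The steps above can be sketched with scikit-learn’s `SequentialFeatureSelector` in forward mode, which greedily adds one feature at a time scored by cross-validation (synthetic data; here only features 0 and 2 truly drive the target):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Greedily add features one at a time, scoring each candidate set
# by cross-validated R^2, until 2 features are selected.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=3
)
sfs.fit(X, y)
selected = sfs.get_support()
```

`n_features_to_select` here plays the role of the “predefined criterion”; in practice you might instead stop when the score stops improving.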

## Backward Selection/elimination (RFE)

- Start with all variables included in the model
- Drop the least useful variable (e.g., based on the smallest drop in **Model Accuracy**)
- And so on until some predefined criterion is satisfied
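scikit-learn ships this as `RFE` (recursive feature elimination): start from all features and repeatedly discard the weakest one, judged here by the smallest absolute linear-model coefficient (same synthetic setup as before):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Start from all 5 features and recursively eliminate the weakest
# (smallest absolute coefficient) until 2 remain.
rfe = RFE(LinearRegression(), n_features_to_select=2)
rfe.fit(X, y)
support = rfe.support_
```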

## Stepwise selection

Similar to the forward selection process, but a variable can also be dropped if it is deemed no longer useful after a certain number of steps.

## LASSO

LASSO: Least Absolute Shrinkage and Selection Operator

Two birds, one stone: Variable Selection + Regularization

LASSO is actually an algorithm for fitting a regularized linear model. A nice property of LASSO is how its regularization parameter behaves: when the parameter is zero, there is no regularization and you just have a plain linear model. As you increase the parameter, the regularization shrinks the coefficients, and with LASSO it shrinks some of them all the way to zero. A coefficient of zero means the feature has been dropped, so LASSO essentially does feature selection for you.
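A sketch with scikit-learn’s `Lasso` on synthetic data (the `alpha=0.5` value is an arbitrary choice for illustration; in practice you would tune it, e.g., with `LassoCV`):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

# alpha is the regularization strength: alpha=0 would be plain
# least squares; larger alpha shrinks more coefficients to exactly 0.
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
kept = np.flatnonzero(lasso.coef_)  # indices of surviving features
```

The coefficients of the three noise features are driven exactly to zero, so only the two informative features survive.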

## Tree-based selection

A forest of trees can evaluate the importance of features: it automatically computes feature importance by fitting a number of randomized decision trees (an ensemble) on various sub-samples of the dataset and averaging the results to rank-order the features.

You can set a threshold: if a given feature’s importance falls below it, remove that feature from the model.

These last two ideas are only useful if that is the model you are actually using; alternatively, you could use a tree-based model just to look at feature importance, without using a tree-based model as the model you are actually building.
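The threshold idea above can be sketched with scikit-learn’s `RandomForestRegressor` plus `SelectFromModel` (synthetic data; the 0.05 importance cutoff is an arbitrary choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Fit an ensemble of randomized trees; feature_importances_ is the
# importance of each feature averaged over the trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Keep only features whose averaged importance clears the threshold.
selector = SelectFromModel(forest, threshold=0.05, prefit=True)
X_reduced = selector.transform(X)
```

As the closing paragraph notes, the reduced `X_reduced` can then be fed to any downstream model, not necessarily a tree-based one.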