What is Regularization in Machine Learning?

The Tech Platform
Apr 27, 2023

Machine learning is a subset of artificial intelligence that involves building models that can learn from data and make predictions or decisions without being explicitly programmed. One common problem these models face is overfitting. In this article, we look in detail at what regularization is and how it addresses that problem.

What is Overfitting?

Overfitting occurs when a model fits a particular training dataset so closely that it becomes too specific to that dataset and performs poorly on new, unseen data. It typically happens when a model is too complex or is trained for too many epochs. A complex model has many parameters, which makes it more likely to fit the training data too closely, while training for too many epochs gives the model enough exposure to the training data to memorize it, noise included, instead of learning the underlying patterns.

Overfitting is a common problem in machine learning, and it can lead to poor performance on new data. To address this problem, various techniques such as regularization are used to improve the model's ability to generalize to new data.

What is Regularization?

Regularization in machine learning refers to a set of techniques used to prevent overfitting and improve the generalization performance of a model by adding constraints or penalties to the model's optimization function. The primary objective of regularization is to reduce the complexity of the model and improve its ability to generalize to new, unseen data.
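
As a rough sketch in symbols, the penalized objective described above can be written as follows. The notation is an assumption introduced here for illustration: L(w) is the unregularized loss, w the model weights, lambda >= 0 the regularization strength, and Omega the penalty function.

```latex
J(w) = L(w) + \lambda \, \Omega(w), \qquad
\Omega_{\mathrm{L1}}(w) = \sum_j |w_j|, \qquad
\Omega_{\mathrm{L2}}(w) = \sum_j w_j^2
```

Setting lambda to 0 recovers the unregularized model, while larger values of lambda penalize large weights more heavily.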

Importance of Regularization:

Reduces the variance of the model by constraining the parameter values, making it less likely to fit the noise in the training data.
Improves the model's ability to generalize to new data by preventing it from memorizing the training data and instead learning the underlying patterns.
Applicable to a wide range of machine learning algorithms, including linear regression, logistic regression, and neural networks.
Can be combined with other techniques such as cross-validation and hyperparameter tuning to improve the model's performance further.

Types of Regularization Techniques

This article covers five commonly used regularization techniques in machine learning:

L1 Regularization
L2 Regularization
Elastic Net Regularization
Dropout
Early Stopping

1. L1 Regularization (or Lasso Regression)

L1 regularization is a technique that helps to shrink the coefficients of some features to zero, effectively removing them from the model. This can help to simplify the model and reduce overfitting. L1 regularization adds a penalty term to the loss function of the machine learning model that is proportional to the absolute value of the weights.

Advantages:

Produce sparse models with few non-zero coefficients, which can help with feature selection and model interpretability.
Handle multicollinearity by shrinking some coefficients to zero and selecting one feature among a group of correlated features.
Can be applied when the number of features is larger than the number of observations, although in that case it selects at most n features in a data set with n observations.

Disadvantages:

Unstable and inconsistent when there are multiple features with similar predictive power, as it may arbitrarily select one and ignore the others.
Has no closed-form solution, because the absolute-value penalty is not differentiable at zero, so it requires iterative optimization techniques such as coordinate descent or quadratic programming.
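
The sparsity effect can be illustrated with a small sketch. It assumes scikit-learn's Lasso estimator and a synthetic dataset invented here for illustration (10 features, of which only the first two influence the target); the value alpha=0.1 is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): only features 0 and 1 matter
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Unregularized least squares versus L1-regularized regression
ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Non-zero OLS coefficients:  ", np.count_nonzero(ols.coef_))
print("Non-zero lasso coefficients:", np.count_nonzero(lasso.coef_))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
```

Ordinary least squares assigns a small non-zero weight to every feature, whereas the lasso fit should keep the two informative features and set most or all of the remaining coefficients exactly to zero.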

2. L2 Regularization (or Ridge Regression)

L2 regularization is a technique that helps to keep the weights of the model small and prevent overfitting. This can improve the model’s ability to generalize to new data. L2 regularization adds a penalty term to the loss function of the machine learning model that is proportional to the square of the weights.

Advantages:

Handle multicollinearity by shrinking the coefficients of correlated features.
Work well when the number of features is larger than the number of observations, as it can uniquely identify a model in this case.
Reduce the variance of the estimates by shrinking all coefficients toward zero without eliminating any of them.

Disadvantages:

Produce dense models with many non-zero coefficients, which can reduce model interpretability.
Cannot perform feature selection, as it does not set any coefficient to zero.
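
A minimal NumPy sketch of this behavior, assuming a linear model without an intercept and the standard closed-form ridge solution w = (X^T X + lambda I)^(-1) X^T y. The helper name ridge_fit and the synthetic data are hypothetical, introduced only for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights (no intercept): solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.0, 0.0]) + rng.normal(size=50)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:>5}: ||w|| = {norms[-1]:.3f}, zero coefficients = {np.sum(w == 0)}")
```

The weight norm shrinks as lambda grows, but no coefficient becomes exactly zero, which is why ridge regression does not perform feature selection.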

3. Elastic Net Regularization

Elastic net regularization is a technique that combines both the L1 and L2 penalties of the lasso and ridge regression methods. It can overcome some of the limitations of both methods, such as selecting too few features or performing poorly when there are correlated features. Elastic net regularization adds a penalty term to the loss function of the machine learning model which is a weighted sum of the absolute value and the square of the weights. The weight parameter r controls the ratio of the L1 and L2 penalties. When r = 0, the elastic net becomes ridge regression, and when r = 1, it becomes lasso regression.

Advantages:

Perform feature selection by shrinking some coefficients to zero, like lasso regression.
Handle multicollinearity by shrinking the coefficients of correlated features, like ridge regression.
Achieve a better trade-off between bias and variance than lasso and ridge regression by tuning the regularization parameters.
Can be applied to various types of models, such as linear, logistic, or Cox regression.

Disadvantages:

Computationally expensive and time-consuming to tune, as two regularization parameters must be chosen, typically by cross-validation.
Can inherit the instability of lasso when r is close to 1, as it may then arbitrarily select one of several features with similar predictive power and ignore the others.
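
The penalty term described in this section can be sketched as a small function. The name elastic_net_penalty and the exact scaling are assumptions for illustration; libraries differ in the constants they use (scikit-learn's ElasticNet, for example, calls the ratio l1_ratio and halves the L2 term).

```python
import numpy as np

def elastic_net_penalty(w, lam, r):
    """Weighted sum of the L1 and L2 penalties; r in [0, 1] is the mixing ratio."""
    l1 = np.sum(np.abs(w))   # lasso part: sum of absolute values
    l2 = np.sum(w ** 2)      # ridge part: sum of squares
    return lam * (r * l1 + (1.0 - r) * l2)

w = np.array([0.5, -2.0, 0.0, 1.5])
print(elastic_net_penalty(w, lam=0.1, r=0.0))  # pure L2 (ridge) penalty
print(elastic_net_penalty(w, lam=0.1, r=1.0))  # pure L1 (lasso) penalty
print(elastic_net_penalty(w, lam=0.1, r=0.5))  # an even mix of the two
```

With r = 0 only the squared term remains and with r = 1 only the absolute-value term remains, matching the two limiting cases described above.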

4. Dropout

Dropout works by randomly disabling neurons and their corresponding connections. This prevents the network from relying too much on single neurons and forces all neurons to learn to generalize better. This can be seen as a way of approximating training a large number of neural networks with different architectures in parallel. It can be applied to different types of layers in a neural network, such as input layers or hidden layers. It can also be tuned by changing the probability of dropping out neurons.

Advantages:

Reduce the dependencies among neurons and make them more robust to noise.
Can approximate training a large number of neural networks with different architectures in parallel and average their predictions.
Work with a variety of neural network architectures, such as feedforward, convolutional, or recurrent networks.
Simple to implement and does not require many computational resources.

Disadvantages:

Increase the training time, as the network needs more epochs to converge.
Introduce variability in the performance of the network, as it depends on which units are dropped out.
Reduce the interpretability of the network, as it obscures the contribution of each neuron.
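
A minimal NumPy sketch of the mechanism, assuming the common "inverted dropout" variant in which surviving activations are rescaled during training so that nothing needs to change at inference time. The function name and signature are hypothetical; deep learning frameworks provide their own dropout layers.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob while training."""
    if not training or drop_prob == 0.0:
        return activations  # inference: every neuron stays active
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the surviving units so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # stand-in for the output of a hidden layer
train_out = dropout(hidden, drop_prob=0.5, rng=rng, training=True)
eval_out = dropout(hidden, drop_prob=0.5, rng=rng, training=False)
print(train_out)  # a random mix of 0.0 (dropped) and 2.0 (kept and rescaled)
print(eval_out)   # unchanged
```

A different random mask is drawn on every call, so each training step effectively trains a different sub-network.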

5. Early Stopping

Early stopping regularization is a technique to avoid overfitting when training a neural network with an iterative method, such as gradient descent. This works by monitoring the performance of the model on a validation set during the training process and stopping the training when the validation error starts to increase. It can be seen as a way of choosing the optimal number of training epochs that prevents the model from learning the noise in the training data.

Advantages:

Improve the generalization ability and reduce the error rate on unseen data by preventing overfitting and variance of the model.
Conserve time and resources by halting the training process early when the validation error starts to increase.
Limit the effective complexity of the model by capping the number of training epochs, without changing the architecture or the number of parameters.
Increase the robustness and stability of the model while diminishing sensitivity to noise or outliers by stopping before the model starts to fit them.

Disadvantages:

Introduce variability in the performance of the model, as it depends on the choice of the validation set and the performance measure.
Sensitive to the trigger to stop training, as it may stop too early or too late depending on the threshold or patience value.
Difficult to apply to models whose validation error fluctuates from epoch to epoch (recurrent networks and networks with batch normalization are sometimes cited as examples), as a temporary rise may trigger stopping too early.
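
The stopping rule can be sketched independently of any framework. The function below is a hypothetical helper that takes a list of per-epoch validation losses (made-up numbers here) and applies a patience-based trigger; in real training the losses would be computed one epoch at a time and the best weights saved along the way.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch and reset the counter
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop training here
    return len(val_losses) - 1, best_epoch  # never triggered: ran all epochs

# Illustrative validation losses: they improve, then start to rise
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(f"Stopped at epoch {stop_epoch}; best weights were from epoch {best_epoch}")
```

A larger patience value tolerates more noise in the validation curve at the cost of extra training epochs, which is the sensitivity mentioned in the disadvantages above.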

Understanding the Bias-Variance Tradeoff

The bias-variance tradeoff is a fundamental concept in machine learning that refers to the tradeoff between two sources of prediction error, bias and variance, both of which depend on the complexity of the model. A model with a high bias makes strong assumptions about the data and tends to underfit the data, meaning it has low accuracy on both the training and the test sets. A model with high variance does not make many assumptions about the data and tends to overfit the data, meaning it has high accuracy on the training set but low accuracy on the test set.

The bias-variance tradeoff can be illustrated by decomposing the expected mean squared error (MSE) of a model into three components: bias, variance, and irreducible error. Irreducible error is the error that cannot be reduced by any model, such as noise or randomness in the data. The bias is the difference between the average prediction of the model (averaged over different training sets) and the true value. The variance is the variability of the model prediction for a given data point across those training sets. The MSE can be written as:

MSE = Bias^2 + Variance + Irreducible Error

The bias-variance tradeoff implies that there is an optimal level of complexity for a model that minimizes the MSE. If the model is too simple, it will have high bias and low variance, resulting in high MSE. If the model is too complex, it will have low bias and high variance, resulting in high MSE.
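
The decomposition can be estimated empirically with a small simulation. This is a sketch under stated assumptions: the true function (a sine wave), the noise level, and the use of polynomial degree as the measure of complexity are all choices made here for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_repeats=200, n_train=30, noise_std=0.3):
    """Estimate squared bias and variance of a polynomial fit, averaged over test points."""
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Each repeat draws a fresh noisy training set from the same true function
        y_train = true_fn(x_train) + rng.normal(0.0, noise_std, n_train)
        preds[i] = Polynomial.fit(x_train, y_train, degree)(x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

The straight line (degree 1) should show the largest squared bias and the smallest variance, while the degree-9 polynomial shows the opposite pattern.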

How does Regularization help in the Bias-Variance Tradeoff?

Regularization helps in the bias-variance tradeoff by reducing the complexity of the model and preventing overfitting. Penalties that can set parameters exactly to zero, such as L1, also perform feature selection, which means keeping only the most relevant features for the prediction task.

Regularization can help to achieve a better tradeoff between bias and variance by reducing the variance of the model without increasing the bias too much. A well-regularized model can have a lower mean squared error than an unregularized model by balancing the bias and variance components.

However, regularization is not a magic bullet that can solve all problems. If the model is too simple or has a high bias, regularization may not help much. If the regularization strength is too high, it may cause underfitting or high bias. Therefore, regularization should be used with care and tuned properly to find the optimal level of complexity for the model.

Implementing Regularization in Machine Learning

Here we will implement regularization using cross-validation to choose the regularization parameter. Cross-validation helps to find the regularization parameter that minimizes the validation error, which is an estimate of the generalization error. It also avoids setting aside a single fixed validation set, which would reduce the amount of data available for training. With cross-validation, every observation in the training data is used for both training and validation across the folds, and the final model is evaluated on a separate test set (if available) to estimate its generalization performance.

Follow the steps below to implement it:

1. Choose a regularization technique (such as L1, L2, dropout, early stopping, etc.) and a regularization parameter (such as lambda, dropout probability, etc.).
2. Split the available data into k folds or subsets, where k is a positive integer (usually between 5 and 10).
3. For each fold, use it as a validation set and train the model on the remaining k-1 folds using the chosen regularization technique and parameter.
4. Evaluate the performance of the model on the validation set using a suitable metric (such as accuracy, mean squared error, etc.).
5. Repeat steps 3 and 4 for all k folds and calculate the average performance across all validation sets. This is the cross-validation score of the model with the chosen regularization technique and parameter.
6. Repeat steps 1 to 5 for different regularization techniques and parameters and compare their cross-validation scores. Choose the regularization technique and parameter that gives the best cross-validation score (the highest for a metric such as accuracy, the lowest for an error such as mean squared error).
7. Train the final model on the entire data using the chosen regularization technique and parameter and evaluate it on a separate test set (if available) to estimate its generalization performance.
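
To make the fold mechanics explicit, here is a from-scratch sketch of steps 2 to 7 in NumPy. It assumes L2-regularized linear regression without an intercept as the model; the helper names ridge_fit and cross_val_mse and the synthetic data are hypothetical, introduced only for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights (no intercept), used here as the model being tuned."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cross_val_mse(X, y, lam, k=5, seed=0):
    """Steps 2 to 5: split into k folds, train on k-1, validate on the rest, average."""
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)
    fold_errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        fold_errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(fold_errors))

# Synthetic data standing in for a real dataset (an assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(size=120)

# Step 6: compare candidate parameters and keep the one with the lowest CV error
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cross_val_mse(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

# Step 7: refit on all the data with the chosen parameter
final_w = ridge_fit(X, y, best_lam)
print(f"best lambda = {best_lam}, CV MSE = {scores[best_lam]:.3f}")
```

Because the metric here is an error, the best parameter is the one with the lowest score; the same fold seed is reused for every candidate so that all candidates are compared on identical splits.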

Here is an example of implementing L2 regularization in machine learning using cross-validation in Python for a linear regression model. We use the scikit-learn library to perform the tasks.

STEP 1: Import the libraries

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import mean_squared_error

STEP 2: Load the data

data = pd.read_csv('housing.csv')  # a sample dataset of housing prices
X = data.drop('MEDV', axis=1)  # the input features
y = data['MEDV']  # the output variable

STEP 3: Split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

STEP 4: Choose a range of values for lambda, the L2 regularization parameter (called alpha in scikit-learn)

lambdas = [0.01, 0.1, 1, 10, 100]

STEP 5: Choose the number of folds for cross-validation

k = 5

STEP 6: Create a KFold object

kf = KFold(n_splits=k, shuffle=True, random_state=42)

STEP 7: Initialize an empty list to store the cross-validation scores for each lambda value

cv_scores = []

STEP 8: Loop over the lambda values

for l in lambdas:
    # Create a Ridge object with the current lambda value
    ridge = Ridge(alpha=l)

    # Perform cross-validation on the training set and calculate the average MSE across the validation sets
    # (scikit-learn returns the negated MSE, so the sign is flipped back)
    mse = -cross_val_score(ridge, X_train, y_train, cv=kf, scoring='neg_mean_squared_error').mean()

    # Append the MSE to the cv_scores list
    cv_scores.append(mse)

STEP 9: Find the index of the minimum MSE in the cv_scores list

best_index = np.argmin(cv_scores)

STEP 10: Find the best lambda value based on the minimum MSE

best_lambda = lambdas[best_index]

STEP 11: Print the best lambda value and its corresponding MSE

print(f'The best lambda value is {best_lambda} with MSE = {cv_scores[best_index]}')

STEP 12: Train the final model on the entire training set using the best lambda value

ridge = Ridge(alpha=best_lambda) 
ridge.fit(X_train, y_train)

STEP 13: Evaluate the final model on the test set and calculate the MSE

y_pred = ridge.predict(X_test) 
test_mse = mean_squared_error(y_test, y_pred)

STEP 14: Print the test MSE

print(f'The test MSE is {test_mse}')
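
As a more compact alternative to the manual loop in steps 4 to 12, the same search can be sketched with scikit-learn's GridSearchCV. Since housing.csv is not reproduced in this article, the sketch below assumes synthetic data from make_regression instead; substitute your own X and y.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic data (an assumption; replace with your own features and target)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search the same grid of regularization strengths with 5-fold cross-validation
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1, 10, 100]},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # by default, refits the best model on all of X_train

best_alpha = search.best_params_["alpha"]
cv_mse = -search.best_score_  # flip the sign of the negated MSE
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"best alpha = {best_alpha}, CV MSE = {cv_mse:.2f}, test MSE = {test_mse:.2f}")
```

The manual loop makes each step visible, while GridSearchCV handles the bookkeeping and the final refit in one object.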

Conclusion

Regularization is a technique in machine learning that helps prevent overfitting and improve the generalization performance of a model. Regularization achieves this by adding constraints or penalties to the model's optimization function, which reduces the complexity of the model and encourages it to learn the underlying patterns in the data rather than fitting the noise.