In machine learning, overfitting is a common enemy: your model performs great on training data but flops when faced with unseen inputs. The model hugs the training data so closely that it fails to produce good results on data it has never seen.
Regression models are among the most fundamental and widely used techniques in statistical modeling and machine learning. Their appeal lies in their simplicity, interpretability, and effectiveness when the underlying assumptions are met. However, traditional regression models such as linear regression often struggle in real-world applications where the dataset includes a large number of features, multicollinearity, or noisy observations. In such scenarios, the model tends to overfit, capturing noise rather than meaningful patterns, which leads to poor generalization on unseen data.
Model overfitting occurs for several reasons, such as:
- The model is too complex relative to the amount of training data (for example, too many parameters or features for too few samples)
- The training data is noisy, and the model learns the noise instead of the underlying signal
- Sensitive models like decision trees are prone to overfitting noisy data
Imagine fitting a 15-degree polynomial to just 10 data points. The curve may pass through all points exactly but will wildly oscillate between them, producing absurd predictions on new data.
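To make this concrete, here is a minimal, self-contained sketch (the synthetic data and degree choices are purely illustrative, not part of the original example) comparing a degree-3 and a degree-9 polynomial fit on 10 noisy points. The high-degree fit drives the training error to nearly zero but typically does worse on fresh samples from the same curve.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def make_data(n):
    # Noisy samples from a smooth underlying curve
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(10)    # very few training points
x_test, y_test = make_data(200)     # unseen data from the same distribution

for degree in (3, 9):
    # Polynomial.fit rescales x internally, keeping the fit well conditioned
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_mse:.4f}, test MSE={test_mse:.4f}")
```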
Now, model underfitting occurs when the model is too simplistic to learn the underlying patterns in the data, leading to poor performance on both training and test sets.
Imagine trying to fit a straight line (linear regression) to a clearly curved dataset: the model will miss the true relationship and produce large errors.
We should always remember that when we create any model, be it a regression or classification model, we should aim for a generalized model. The model should have low bias and low variance, which means it should perform decently on training, test, and validation data.
This is where regularization techniques come into play.
Regularization techniques combat overfitting and increase a model’s ability to generalize. They achieve this by adding a penalty to the linear regression cost function, which limits the model’s complexity and improves generalization.
Ridge Regression and Lasso Regression stand out as popular regularization methods. These are essentially linear regression extensions that incorporate a regularization term. This term helps reduce the model’s variance, potentially increasing bias slightly, ultimately leading to a more favorable balance between bias and variance.
Ridge Regression, also known as L2 regularization, penalizes the sum of the squared coefficients. It is particularly effective when all features are relevant, but multicollinearity is an issue. Ridge shrinks coefficients toward zero but never exactly to zero.
Lasso Regression, or L1 regularization, penalizes the sum of the coefficients’ absolute values. It not only helps with overfitting but also performs feature selection by shrinking some coefficients exactly to zero, effectively removing them from the model.
To start with, let us take a simple linear regression and define its cost or loss function.
In simple linear regression, the cost function quantifies the error between the predicted values (ŷ) and the actual observed values (y). The most commonly used cost function is the Mean Squared Error (MSE).
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Where:
- $y_i$ is the actual observed value
- $\hat{y}_i$ is the predicted value
- $n$ is the number of observations
Further, the below equation represents the sum of squared errors (SSE) or the residual sum of squares (RSS), which is commonly used in regression analysis to measure the discrepancy between the predicted and actual values.
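In the same notation, the RSS is:

$$\text{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$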
Our main aim when creating a model is to lower this discrepancy, that is to reduce the difference between the observed and the predicted value.
However, in the case of overfitting, this value is closer to zero in the training dataset and increases in the test and validation datasets.
Ridge Regression, also known as Tikhonov regularization or L2 regularization, adds a penalty term to the loss function (the residual sum of squares), shrinking the model coefficients and reducing model complexity. The Ridge loss function is:

$$\text{Ridge Loss} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Where:
- $\lambda \ge 0$ is the regularization strength; larger values shrink the coefficients more
- $\beta_j$ are the model coefficients (the slopes)
- $p$ is the number of features
Here, our aim is to minimize the entire expression: the residual sum of squares plus the penalty term.
Overfitting can produce a training fit where the residual sum of squares is near zero and the slope of the regression line is very steep. Ridge Regression addresses this by introducing a penalty term into the Ridge loss function. This penalty grows with the steepness of the slope, discouraging it from becoming excessively large. Consequently, to minimize the overall Ridge loss in such scenarios, the best-fit line is adjusted, thereby mitigating the effects of overfitting.
In summary:
The line that balances a low residual sum of squares with a low penalty gets selected. Because this choice is made over multiple candidate fits, the chance of overfitting is much lower, and Ridge Regression produces a more generalized model.
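As a rough illustration of how the penalty reshapes the solution, here is a minimal NumPy sketch (synthetic data, hypothetical variable names) of the closed-form Ridge estimate, where λ is added to the diagonal of XᵀX before solving the normal equations. Notice how the coefficients shrink toward zero as λ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # 50 samples, 3 standardized features
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.5, size=50)  # linear signal plus noise

def ridge_coefficients(X, y, lam):
    """Closed-form Ridge solution: (X^T X + lam * I)^-1 X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>5}: coefficients = {ridge_coefficients(X, y, lam).round(3)}")
```

With λ = 0 this reduces to ordinary least squares; larger values of λ trade a little bias for a larger reduction in variance.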
Lasso Regression employs L1 regularization, incorporating the absolute value of the coefficients (the magnitude of the slope) into the loss function. It is also known as the Least Absolute Shrinkage and Selection Operator.
Unlike Ridge, Lasso can shrink some coefficients to exactly zero, making it great for feature selection. Suppose there are multiple features in a dataset with coefficients $\beta_1, \dots, \beta_p$. The Lasso loss function is:

$$\text{Lasso Loss} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|$$

Where:
- $\lambda \ge 0$ is the regularization strength
- $\beta_j$ are the model coefficients

The penalty term for L1 regularization is therefore $\lambda \sum_{j=1}^{p} |\beta_j|$, the sum of the absolute values of the coefficients scaled by $\lambda$.
Now, unlike Ridge (which only shrinks coefficients), Lasso can make some weights exactly 0.
This means the model completely ignores those features. Features with non-zero coefficients are kept, while features whose coefficients are driven to zero are effectively removed, so you are left only with the features that genuinely help in predicting the output.
This often reduces the complexity of the model and makes it faster and easier to interpret. Thus, it helps avoid overfitting, especially when there are many features.
Imagine you have 100 features, but only 10 of them are actually useful. Lasso will:
- shrink the coefficients of the irrelevant features until many of them become exactly zero
- keep non-zero coefficients only for the features that genuinely help predict the target
- leave you with a sparser, simpler model, as the sketch below illustrates
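Here is a minimal scikit-learn sketch of that scenario (synthetic data generated with make_regression; the alpha value is only illustrative): 100 features, of which only 10 are informative, and Lasso zeroes out most of the rest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 100 features, only 10 of which actually influence the target
X, y = make_regression(n_samples=500, n_features=100, n_informative=10,
                       noise=10.0, random_state=42)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

kept = np.sum(lasso.coef_ != 0)
print(f"Features kept by Lasso: {kept} out of {X.shape[1]}")
```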
Feature | Ridge Regression | Lasso Regression |
---|---|---|
Type of penalty | L2 (squared magnitude) | L1 (absolute magnitude) |
Feature selection | ❌ No | ✅ Yes |
When to use | Many small effects | Few strong effects |
Coefficient shrinkage | Yes | Yes (can become zero) |
Model interpretability | Moderate | High (fewer features) |
Elastic Net is a type of regression that combines both Lasso (L1) and Ridge (L2) regularization techniques.
Elastic Net improves on the limitations of Lasso, especially when working with high-dimensional data and a small number of samples. Lasso tends to select just one variable from a group of highly correlated features and ignore the rest, which can be a problem when all those features carry useful information.
To fix this, Elastic Net adds a quadratic term (the L2 norm, like in Ridge Regression) to the penalty. This makes the loss function more stable and convex, and helps include more relevant variables rather than dropping them entirely. Essentially, Elastic Net combines the strengths of both Lasso and Ridge—it selects important features while handling correlated ones better.
The process of finding the (naive) Elastic Net coefficients happens in two stages:
1. For a fixed λ2, the Ridge regression coefficients are found (L2 shrinkage).
2. A Lasso-type shrinkage is then applied using λ1 (L1 shrinkage).
Because it applies two layers of shrinkage, this naive approach can sometimes increase bias and reduce predictive accuracy. To balance things out, the final coefficients are rescaled by multiplying them with (1+λ2), correcting the double shrinkage effect.
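In symbols, using the same λ2 as above, the corrected coefficients are:

$$\hat{\beta}_{\text{elastic net}} = (1 + \lambda_2)\,\hat{\beta}_{\text{naive}}$$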
In short, it’s especially useful when:
- the dataset is high-dimensional, with many features relative to the number of samples
- several features are strongly correlated with one another
- you want Lasso-style feature selection together with Ridge-style stability
The loss function for Elastic Net is:

$$\text{Elastic Net Loss} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda_1 \sum_{j=1}^{p}\left|\beta_j\right| + \lambda_2 \sum_{j=1}^{p}\beta_j^2$$
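For a quick, hedged sketch of how this looks in practice (synthetic data and parameter values chosen purely for illustration), scikit-learn's ElasticNet exposes the combined penalty through alpha (overall strength) and l1_ratio (the mix between the L1 and L2 parts):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic data with correlated features and relatively few samples
X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       effective_rank=5, noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# l1_ratio=1.0 is pure Lasso; values close to 0 behave more like Ridge
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print("Non-zero coefficients:", (enet.coef_ != 0).sum(), "of", X.shape[1])
```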
Let’s walk through a Ridge and Lasso regression example using the California Housing dataset with scikit-learn.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the California Housing dataset
X, y = fetch_california_housing(return_X_y=True)
# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
💡 Scaling is important for Ridge and Lasso since they are sensitive to feature magnitude.
from sklearn.linear_model import Ridge, Lasso
# Ridge and Lasso with their regularization strengths (alpha)
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)

# Fit both models on the scaled training data
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
from sklearn.metrics import mean_squared_error
# Predict on the test set and compare mean squared errors
ridge_pred = ridge.predict(X_test)
lasso_pred = lasso.predict(X_test)
print("Ridge MSE:", mean_squared_error(y_test, ridge_pred))
print("Lasso MSE:", mean_squared_error(y_test, lasso_pred))
import matplotlib.pyplot as plt
plt.plot(ridge.coef_, label='Ridge')
plt.plot(lasso.coef_, label='Lasso')
plt.legend()
plt.title("Ridge vs Lasso Coefficients")
plt.xlabel("Feature Index")
plt.ylabel("Coefficient Value")
plt.grid(True)
plt.show()
You can easily tune Ridge Regression using RidgeCV:
from sklearn.linear_model import RidgeCV
# 5-fold cross-validation over a small grid of candidate alphas
ridge_cv = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)
ridge_cv.fit(X_train, y_train)
print("Optimal alpha:", ridge_cv.alpha_)
Optimal alpha: 0.1
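Lasso’s alpha can be tuned the same way with LassoCV; this sketch reuses the scaled X_train and y_train from above, and the alpha grid is only illustrative:

```python
from sklearn.linear_model import LassoCV

# 5-fold cross-validation over a small grid of candidate alphas
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.1, 1.0], cv=5)
lasso_cv.fit(X_train, y_train)
print("Optimal alpha:", lasso_cv.alpha_)
```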
What is Ridge Regression in machine learning?
It’s a regularized form of linear regression that prevents overfitting by penalizing large coefficients using L2 regularization.
How does Ridge Regression prevent overfitting?
By adding a penalty term to the cost function, it discourages the model from fitting noise in the data.
Can Ridge Regression perform feature selection?
No. It shrinks coefficients but does not zero them out like Lasso does.
When should I use Ridge Regression over Lasso?
Use Ridge when all features are likely to contribute and multicollinearity is an issue.
How do I implement Ridge Regression in Python?
Using Ridge() from scikit-learn. See the code example above!
Ridge Regression vs ElasticNet—what’s the difference?
ElasticNet combines Ridge (L2) and Lasso (L1), balancing shrinkage and sparsity.
Regularization is not a fancy buzzword—it’s a must-have in real-world machine learning. Ridge Regression gives you stability in high dimensions, while Lasso can give you simplicity through feature selection.
Lasso and Ridge regression are both powerful regularization techniques, each with its own strengths: Lasso for feature selection and sparsity, and Ridge for handling multicollinearity and stabilizing models.
However, neither is perfect on its own. There is always a need to iterate through each step and process to find the method that works best for the specific model, dataset, and features. This often involves experimenting with different regularization strengths, tuning hyperparameters, and evaluating model performance using cross-validation.
The choice between Lasso, Ridge, or Elastic Net depends on the problem at hand—whether you need feature selection, stability with multicollinearity, or a balance of both. Ultimately, thoughtful experimentation and model evaluation are key to selecting the most effective regularization technique for your machine learning task.
Elastic Net shines when there is a need to combine the best of both worlds by applying both L1 and L2 penalties. This allows it to select important features like Lasso while maintaining the grouping effect and robustness of Ridge.
In high-dimensional datasets with correlated features, Elastic Net offers a more balanced and reliable approach, reducing overfitting and improving generalization. By understanding when and how to use these regularization techniques, you can build more interpretable and efficient machine learning models.