How to Kernelize the Ridge Regression Algorithm
The ridge regression algorithm learns a linear function mapping input points to real numbers by minimizing an objective function. The optimization problem in ridge regression is given by:

$$\min_{w \in \mathbb{R}^d} \; \sum_{i=1}^{n} \left( \langle w, x_i \rangle - y_i \right)^2 + \lambda \lVert w \rVert_2^2$$

Here, $\lambda > 0$ denotes the tradeoff constant. The first term in the objective function is the training error (also called the empirical risk of a predictor function), measured as the linear least squares error. The second term is the regularization term, which penalizes functions with a large squared L2-norm.
Linear methods like ridge regression work best when the underlying relationship in the dataset is approximately linear. When we use a linear method to model a non-linear dataset, the model tends to underfit, because the model class is not expressive enough to predict well on such data.
Kernel methods were introduced to overcome this problem. They allow us to apply linear methods to a non-linear dataset by implicitly embedding the input points into a high-dimensional Euclidean space in which the data can be fit well by a linear function.
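To see what "implicitly embedding" means in practice, here is a minimal sketch (the points and the kernel choice are illustrative assumptions): for the polynomial kernel $k(x, z) = \langle x, z \rangle^2$, evaluating the kernel directly in the original 2-D input space gives exactly the same number as taking an inner product after an explicit degree-2 feature map, without ever materializing the high-dimensional features.

```python
import numpy as np

# Two illustrative 2-D input points
x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

def phi(v):
    # Explicit degree-2 feature map for 2-D input:
    # phi(v) = (v1^2, v2^2, sqrt(2) * v1 * v2)
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

explicit = phi(x) @ phi(z)   # inner product in the embedded feature space
implicit = (x @ z) ** 2      # kernel evaluated directly in the input space

print(explicit, implicit)    # both print 121.0
```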
In this blog, we will learn how to kernelize the Ridge Regression algorithm and also see an implementation of it using the scikit-learn library.
Before we begin, let's familiarize ourselves with two terms that we will use later on.
- A kernel function takes two points in the input space and maps them to a real number. Mathematically speaking, a kernel function is a function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is the input space and $\mathbb{R}$ is the set of all real numbers. Intuitively, a kernel function measures how "similar" two points are in the feature space.

Now that we have understood what kernel functions are, let us understand the kernel matrix.

- Given a kernel function $k$ and a set of points $x_1, \dots, x_n \in \mathcal{X}$, the corresponding kernel matrix $K \in \mathbb{R}^{n \times n}$ is defined as $K_{ij} = k(x_i, x_j)$. So the $(i, j)$-th entry of the kernel matrix is the value of the kernel function at the points $x_i$ and $x_j$, as the short sketch below illustrates.
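As a concrete illustration, here is a minimal sketch that builds the kernel matrix entry by entry for the RBF (Gaussian) kernel; the points and the bandwidth parameter `gamma` are illustrative assumptions, not part of the definitions above.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

# A few illustrative 2-D input points
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
n = X.shape[0]

# K[i, j] = k(x_i, x_j)
K = np.array([[rbf_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print(K)  # symmetric, with ones on the diagonal since k(x, x) = 1
```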
We are now ready to derive the kernel version of the Ridge Regression algorithm.
The idea is to rewrite the optimization problem of the Ridge Regression algorithm in terms of the kernel function or the kernel matrix. To this end, we will invoke the Representer theorem, which states that for an optimization problem of the form:

$$\min_{w \in \mathcal{H}} \; L\big( \langle w, \phi(x_1) \rangle, \dots, \langle w, \phi(x_n) \rangle \big) + \lambda \, R(\lVert w \rVert)$$

where:
- $L$ is the empirical risk (training error)
- $R$ is the (non-decreasing) regularizer function
- $\lambda > 0$ is the tradeoff constant
- $\mathcal{H}$ is the Euclidean feature space
- $w$ is a vector in $\mathcal{H}$

the solution exists and is given by:

$$w^* = \sum_{i=1}^{n} \alpha_i \, \phi(x_i)$$

where:
- $\alpha_i$ is a real number for all $i \in \{1, \dots, n\}$
- $\phi$ is a function from the input space $\mathcal{X}$ to $\mathcal{H}$.
Notice that the optimization problem in Ridge Regression has exactly the form required by the Representer theorem (with the squared loss as $L$ and $R(\lVert w \rVert) = \lVert w \rVert^2$). Hence, we can apply the theorem and obtain a solution of the Ridge Regression problem as a linear combination of the input points in the feature space:

$$w = \sum_{i=1}^{n} \alpha_i \, \phi(x_i)$$

Substituting this back into the above ridge regression problem, and using the fact that $\langle \phi(x_i), \phi(x_j) \rangle = k(x_i, x_j) = K_{ij}$, yields the following optimization problem in terms of the kernel matrix $K$:

$$\min_{\alpha \in \mathbb{R}^n} \; \lVert K\alpha - y \rVert_2^2 + \lambda \, \alpha^T K \alpha$$
The solution to this optimization problem can be derived analytically. Let’s take a look at the derivation:
Step 1: The objective function in the above optimization problem is

$$J(\alpha) = \lVert K\alpha - y \rVert_2^2 + \lambda \, \alpha^T K \alpha$$

Since the kernel matrix $K$ is positive semi-definite, $J$ is a convex function; therefore, any local minimum is also a global minimum. Hence, it is enough to find one local minimum of the objective function.
Step 2: In order to find a local minimum, we take the derivative of the objective function with respect to $\alpha$ and set it to zero:

$$\nabla_\alpha J(\alpha) = 2K(K\alpha - y) + 2\lambda K\alpha = 0$$

(Note that $K$ is symmetric, i.e. $K^T = K$.)

This can be rewritten as $K\big( (K + \lambda I)\alpha - y \big) = 0$, which is satisfied by

$$\alpha = (K + \lambda I)^{-1} y$$

(the inverse exists because $K$ is positive semi-definite and $\lambda > 0$).
And we have obtained the solution of the Kernel Ridge Regression problem!!
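In code, this closed-form solution amounts to a single linear solve. Here is a minimal NumPy sketch; the kernel matrix, target vector, and regularization constant `lam` are illustrative assumptions.

```python
import numpy as np

# Illustrative 3x3 kernel matrix K and target vector y
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
y = np.array([1.0, 2.0, 3.0])
lam = 0.1  # tradeoff constant lambda

# alpha = (K + lambda * I)^{-1} y, computed via a linear solve
# (numerically preferable to forming the inverse explicitly)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
print(alpha)
```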
Now let's also formulate the evaluation function in terms of the kernel function; after all, the essence of kernelizing an algorithm is to write both the optimization problem and the evaluation (target) function in terms of the kernel function or kernel matrix. In the feature space, the evaluation function is

$$f(x) = \langle w, \phi(x) \rangle$$
Substituting the value of $w$ (given by the Representer theorem) into the above equation yields the following:

$$f(x) = \sum_{i=1}^{n} \alpha_i \, \langle \phi(x_i), \phi(x) \rangle = \sum_{i=1}^{n} \alpha_i \, k(x_i, x)$$

We have finally obtained the target function for the kernel ridge regression algorithm. Observe that the function depends only on the kernel and not on the feature map $\phi$ explicitly, as discussed earlier.
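Putting the two pieces together gives a complete from-scratch sketch of Kernel Ridge Regression: fit by solving $(K + \lambda I)\alpha = y$, then predict with $f(x) = \sum_i \alpha_i k(x_i, x)$. The dataset, kernel choice, and function names below are illustrative assumptions, not library code.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # RBF kernel: k(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def fit_krr(X, y, lam=0.1, gamma=1.0):
    # Build the kernel matrix and solve (K + lambda * I) alpha = y
    n = X.shape[0]
    K = np.array([[rbf(X[i], X[j], gamma) for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def predict_krr(X_train, alpha, x, gamma=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x)
    return sum(a * rbf(xi, x, gamma) for a, xi in zip(alpha, X_train))

# Tiny illustrative 1-D dataset sampled from y = x^2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])

alpha = fit_krr(X, y, lam=1e-3)
# The prediction at x = 1.5 should fall between the neighboring targets 1 and 4
print(predict_krr(X, alpha, np.array([1.5])))
```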
Now, let's look at the implementation of the Kernel Ridge Regression algorithm using the scikit-learn library.
We will begin by installing the necessary dependencies. Run the following command in the terminal:

```
pip install scikit-learn
```
For demonstration, we will use the diabetes toy dataset available in the scikit-learn library.
Let's now import the required utilities.
```python
from sklearn.datasets import load_diabetes
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
```
We will now load the diabetes dataset and extract the data matrix and the target array.
```python
diabetes = load_diabetes()
data = diabetes.data
target = diabetes.target
```
It's time to split the dataset into train and test sets in order to evaluate the generalization performance of the Kernel Ridge Regression (KRR) model and to avoid overfitting.

```python
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
```
Note: it is usually good practice to standardize the features (`X_train` and `X_test`) before training a model. However, we omit this step here because the feature matrix `data` provided by the `load_diabetes()` function is already standardized.
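For a dataset that is not already standardized, a typical sketch using scikit-learn's StandardScaler would look like this (fitting the scaler on the training split only, so that no test-set statistics leak into training):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std on the training set
X_test = scaler.transform(X_test)        # reuse the training statistics
```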
We will tune the hyperparameters of the KRR model (the regularization constant `alpha`, which plays the role of the tradeoff constant $\lambda$ from our derivation, and the kernel) using grid search cross-validation.

```python
krr_model = KernelRidge()
param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 2],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
}
grid_search = GridSearchCV(krr_model, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=5)
```
Let's train our KRR model on the diabetes dataset and get the best hyperparameter values along with the trained model.

```python
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(best_params)
best_model = grid_search.best_estimator_
```
Now that we have trained our KRR model, it's time to make predictions on the test set and compute the performance metrics on both the training and test sets.

```python
predictions = best_model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
test_mse = mean_squared_error(y_test, predictions)

train_predictions = best_model.predict(X_train)
train_mae = mean_absolute_error(y_train, train_predictions)
train_mse = mean_squared_error(y_train, train_predictions)

print(f"Test MAE : {test_mae} and Test MSE : {test_mse}")
print(f"Train MAE : {train_mae} and Train MSE : {train_mse}")
```
Here’s the full code:
```python
from sklearn.datasets import load_diabetes
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Load the diabetes dataset
diabetes = load_diabetes()
data = diabetes.data
target = diabetes.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Tune hyperparameters with grid search cross-validation
krr_model = KernelRidge()
param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3, 1e-2, 0.1, 1, 2],
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
}
grid_search = GridSearchCV(
    krr_model, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=5
)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print(best_params)
best_model = grid_search.best_estimator_

# Evaluate on the test and training sets
predictions = best_model.predict(X_test)
test_mae = mean_absolute_error(y_test, predictions)
test_mse = mean_squared_error(y_test, predictions)

train_predictions = best_model.predict(X_train)
train_mae = mean_absolute_error(y_train, train_predictions)
train_mse = mean_squared_error(y_train, train_predictions)

print(f"Test MAE : {test_mae} and Test MSE : {test_mse}")
print(f"Train MAE : {train_mae} and Train MSE : {train_mse}")
```
Woohoo! We have come a long way. I hope you found this blog helpful. Thanks for reading!