import numpy as np
from scipy.optimize import minimize

# The functions below are methods of the KernelLogisticRegression class.
def sigmoid(self, z):
    return 1 / (1 + np.exp(-z))

def logistic_loss(self, y_hat, y):
    return -y*np.log(self.sigmoid(y_hat)) - (1-y)*np.log(1-self.sigmoid(y_hat))

def fit(self, X, y):
    self.X_train = X
    X_ = self.pad(X)
    v0 = np.random.rand(X.shape[0])
    km = self.kernel(X_, X_, **self.kernel_kwargs)

    def empirical_risk(km, y, v, loss):
        y_hat = km@v
        return loss(y_hat, y).mean()

    result = minimize(lambda v: empirical_risk(km, y, v, self.logistic_loss), x0 = v0)
    self.v = result.x

Goal
In this blog post we implement and test kernel logistic regression for binary classification. The benefit of kernel logistic regression is that, unlike regular logistic regression, it can handle non-linear decision boundaries. We test kernel logistic regression on data with different geometries and varying amounts of noise.
Link to Code
Background Information and Implementation
In kernel logistic regression we still perform empirical risk minimization, but with a modified empirical risk: \[L_k(v) = \frac{1}{n} \sum_{i=1}^n l(\langle v \; , \; k(x_i) \rangle, y_i)\] where \(v \in \mathbb{R}^n\), \(l\) is the logistic loss, and \(k(x_i)\) is a modified feature vector that depends on a kernel function.
To implement this, we use the minimize function from scipy.optimize. As in regular logistic regression, we start by padding \(X\) and initializing a random vector \(v\) (in this case of size \(n\)). Then we compute the kernel matrix, whose \(i\)-th row is the modified feature vector \(k(x_i)\). To calculate the empirical risk, we take the matrix product of the kernel matrix and \(v\) and average the logistic loss of the resulting scores. Note that we are able to do this because the predictor function is still an inner product.
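The steps above can be sketched outside the class as follows. This is a minimal illustration rather than the post's implementation: it uses NumPy only, skips the padding step, hand-rolls an RBF kernel (the helper name `rbf` and the toy data are assumptions), and writes the logistic loss in an algebraically equivalent form that avoids taking the log of a saturated sigmoid.

```python
import numpy as np

def rbf(A, B, gamma):
    """Hypothetical helper: matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def logistic_loss(y_hat, y):
    """Logistic loss on raw scores y_hat for labels y in {0, 1}."""
    # -y*log(sigmoid(s)) - (1-y)*log(1-sigmoid(s)) simplifies to log(1+exp(s)) - y*s
    return np.logaddexp(0.0, y_hat) - y * y_hat

def empirical_risk(km, y, v):
    """Mean logistic loss of the scores km @ v, one score per training point."""
    return logistic_loss(km @ v, y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))           # toy data, for illustration only
y = np.array([0, 1, 0, 1, 1, 0])
km = rbf(X, X, gamma=1.0)             # row i plays the role of k(x_i)

# With v = 0 every score is 0, so each point contributes a loss of log(2).
risk_at_zero = empirical_risk(km, y, np.zeros(len(X)))
print(risk_at_zero, np.log(2))
```

A function like `empirical_risk` is what gets passed to an optimizer such as `scipy.optimize.minimize`, with `v` as the variable being optimized.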
Basic Checks
If we test our implementation, we find that the model is able to handle non-linear decision boundaries successfully. Here, “rbf_kernel” is the kernel function and “gamma” is a parameter of that kernel function that controls how wiggly the decision boundary can be. In other words, a large gamma should result in an overfitted model.
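To see why gamma controls the wiggliness, it helps to look at the kernel value for a single pair of points. The sketch below assumes the standard RBF form \(k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)\), which is the form scikit-learn documents for rbf_kernel; the helper `rbf_similarity` and the two points are made up for illustration.

```python
import numpy as np

def rbf_similarity(x, x_prime, gamma):
    """RBF kernel value exp(-gamma * ||x - x'||^2) for a single pair of points."""
    return float(np.exp(-gamma * np.sum((x - x_prime) ** 2)))

# Two hand-picked points at distance 0.5 (illustrative, not from the moons data).
x = np.array([0.0, 0.0])
x_prime = np.array([0.5, 0.0])

gammas = [0.1, 10.0, 1000.0]
sims = [rbf_similarity(x, x_prime, g) for g in gammas]
for g, s in zip(gammas, sims):
    print(f"gamma = {g:>6}: similarity = {s:.3g}")
```

As gamma grows, the similarity between distinct points shrinks towards zero, so each training point influences only its immediate neighbourhood. The model can then fit every training label individually, which is one way to understand the overfitting.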
%reload_ext autoreload
%autoreload 2
from kernel_logistic import KernelLogisticRegression
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.datasets import make_moons, make_circles
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt

X, y = make_moons(200, shuffle = True, noise = 0.2)
KLR = KernelLogisticRegression(rbf_kernel, gamma = 10)
KLR.fit(X, y)
KLR.score(X, y)

plot_decision_regions(X, y, clf = KLR)
title = plt.gca().set(title = f"Accuracy = {KLR.score(X, y)}",
                      xlabel = "Feature 1",
                      ylabel = "Feature 2")

# new data with the same rough pattern
X, y = make_moons(200, shuffle = True, noise = 0.2)
plot_decision_regions(X, y, clf = KLR)
title = plt.gca().set(title = f"Accuracy = {KLR.score(X, y)}",
                      xlabel = "Feature 1",
                      ylabel = "Feature 2")

However, as predicted before, a large gamma results in an overfitted model:
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt

KLR = KernelLogisticRegression(rbf_kernel, gamma = 100000)
KLR.fit(X, y)
print(KLR.score(X, y))

plot_decision_regions(X, y, clf = KLR)
title = plt.gca().set(title = f"Accuracy = {KLR.score(X, y)}",
                      xlabel = "Feature 1",
                      ylabel = "Feature 2")
1.0
# new data with the same rough pattern
X, y = make_moons(200, shuffle = True, noise = 0.2)
plot_decision_regions(X, y, clf = KLR)
title = plt.gca().set(title = f"Accuracy = {KLR.score(X, y)}",
                      xlabel = "Feature 1",
                      ylabel = "Feature 2")

Experiment
Let's investigate which value of gamma is best for our model by plotting the training and validation scores against different gamma values:
import pandas as pd
import numpy as np
np.random.seed()

def experiment(noise, data_geometry):
    gamma_values = 10.0**np.arange(-1, 7)
    df = pd.DataFrame({"gamma": [], "train" : [], "test" : []})

    for _ in range(10): # we perform 10 runs of the experiment and take the mean
        X_train, y_train = data_geometry(100, shuffle = True, noise = noise)
        X_test, y_test = data_geometry(100, shuffle = True, noise = noise)

        for gamma in gamma_values:
            KLR = KernelLogisticRegression(rbf_kernel, gamma = gamma)
            KLR.fit(X_train, y_train)
            to_add = pd.DataFrame({"gamma" : [gamma],
                                   "train" : [KLR.score(X_train, y_train)],
                                   "test" : [KLR.score(X_test, y_test)]})
            df = pd.concat((df, to_add))

    means = df.groupby("gamma").mean().reset_index()

    plt.xscale("log")
    plt.plot(means["gamma"], means["train"], label = "training")
    plt.plot(means["gamma"], means["test"], label = "validation")
    plt.legend()
    labs = plt.gca().set(xlabel = "Value of gamma",
                         ylabel = "Accuracy")

experiment(0.2, make_moons)

From this plot, we find that a gamma of around 100 is best for this particular model, judging by the validation score. Now, what if we vary the noise of the data? Intuitively, low noise should bring the training and validation scores closer together, since the pattern of the training data would not differ significantly from that of the testing data. On the other hand, higher noise should result in lower accuracy, and it should take larger values of gamma for the training score to converge.
experiment(0.1, make_moons)
experiment(0.4, make_moons)
While the results fit our predictions, these graphs suggest that the best value of gamma is still around 100 at each noise level. In other words, for this data pattern the best gamma appears to be roughly independent of the noise!
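Rather than reading the best gamma off each plot by eye, the choice can be made explicit by taking the gamma with the highest mean validation accuracy. The sketch below is an assumption about how one might do this, not part of the original experiment: `best_gamma` is a hypothetical helper, and the scores are made-up placeholders, not results from this post.

```python
import numpy as np

def best_gamma(gammas, val_scores):
    """Return the gamma whose mean validation accuracy is highest.

    If several gammas tie, np.argmax picks the first one in the grid.
    """
    gammas = np.asarray(gammas, dtype=float)
    val_scores = np.asarray(val_scores, dtype=float)
    return float(gammas[np.argmax(val_scores)])

gammas = 10.0 ** np.arange(-1, 7)   # same grid as in experiment()

# Made-up validation accuracies for illustration only, NOT results from this post.
placeholder_scores = [0.70, 0.82, 0.93, 0.88, 0.81, 0.74, 0.66, 0.60]

chosen = best_gamma(gammas, placeholder_scores)
print(chosen)
```

In the experiment itself, the same helper could be applied to the "gamma" and "test" columns of the averaged data frame if the function were changed to return it.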
A Different Data Pattern
Finally, we try a different data pattern, concentric circles, to see how the results change.

X, y = make_circles(200, shuffle = True, noise = 0.1)
KLR = KernelLogisticRegression(rbf_kernel, gamma = 100)
KLR.fit(X, y)
KLR.score(X, y)

plot_decision_regions(X, y, clf = KLR)
title = plt.gca().set(title = f"Accuracy = {KLR.score(X, y)}",
                      xlabel = "Feature 1",
                      ylabel = "Feature 2")

experiment(0.1, make_circles)
experiment(0.2, make_circles)
Here, we still find that the best gamma value does not appear to depend on the noise! However, the best gamma is around 1 in this case, which differs from the previous pattern.