Perceptron

Let's start with a simple linear model for binary classification. We have an input vector $x$, and we take its dot product with a weight vector $θ$ to get a score; the sign of that score tells us which class the input belongs to, $1$ or $-1$.

$$ f(x; θ) = sign(θ^{T} x) $$

Note: In this model, we don't use a bias term; we omit the $ b $ term from the standard $ A x + b $ form, since appending a constant $1$ to the input vector produces the same effect without requiring any additional logic. This trick no longer works cleanly once regularization is applied (support vector machines, logistic regression, etc.), because the absorbed bias would then be penalized along with the weights.
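To make the equivalence concrete: append a constant $1$ to a $d$-dimensional input and use a $(d+1)$-dimensional weight vector, so the last weight plays the role of the bias.

$$ θ^{T} \begin{bmatrix} x \\ 1 \end{bmatrix} = \sum_{i=1}^{d} θ_{i} x_{i} + θ_{d+1} $$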

The perceptron learning algorithm takes this simple linear model and adjusts the weights to try to classify all the input vectors correctly. For each input vector $ x_{t} $ that is misclassified (i.e. $ y_{t} \cdot θ^{T} x_{t} \leq 0 $), the weights are updated as follows:

$$ θ^{'} = θ + y_{t} x_{t} $$

Intuitively, this just means we keep nudging the weight vector towards each misclassified example (in the direction of its correct label) until every example is classified correctly.
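As a concrete illustration, here is a minimal sketch of the perceptron learning algorithm in NumPy (the function and variable names are my own, not from any particular library); it appends the constant feature mentioned above and applies the update rule whenever a sample is misclassified.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Minimal perceptron sketch.

    X: (n, d) array of input vectors.
    y: (n,) array of labels in {-1, +1}.
    Returns a (d + 1,) weight vector (the last entry acts as the bias).
    """
    # Append a constant 1 to every input so the bias is folded into the weights.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    theta = np.zeros(X.shape[1])

    for _ in range(max_epochs):
        mistakes = 0
        for x_t, y_t in zip(X, y):
            # Misclassified (or on the boundary): apply theta' = theta + y_t * x_t.
            if y_t * np.dot(theta, x_t) <= 0:
                theta += y_t * x_t
                mistakes += 1
        if mistakes == 0:  # converged: every sample is classified correctly
            break
    return theta

def predict(theta, X):
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(X @ theta)
```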

We can also derive the update above from the derivative of an error function. We start with the zero-one loss, which returns $0$ if the sample is correctly classified and $1$ otherwise.

$$ E(θ) = \sum_{t=1}^{n} Loss( y_{t}, f(x_{t}; θ) ) $$

The zero-one loss isn't differentiable, so instead we drop the correctly classified vectors from the sum and replace the loss on each misclassified vector with the perceptron criterion $ -y_{t} \cdot θ^{T} x_{t} $, which is positive exactly when a sample is misclassified:

$$ E(θ) = \sum_{t \in M} - y_{t} \cdot θ^{T} x_{t} $$

where $ M $ is the set of misclassified samples.

Taking the derivative of this expression with respect to the weight vector is trivial, and since the gradient points in the direction of steepest increase, we simply move in the opposite direction to minimize the error function.
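Spelling that out for a single misclassified sample, the gradient of its term in the error is

$$ \frac{\partial}{\partial θ} \left( - y_{t} \cdot θ^{T} x_{t} \right) = - y_{t} x_{t} $$

so a gradient step with a step size of $1$ recovers exactly the update rule $ θ^{'} = θ + y_{t} x_{t} $.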

Although the above derivation is enough to convince me that the perceptron will eventually converge on a solution, assuming one exists, we can also prove this more rigorously by analyzing how the cosine similarity between an ideal weight vector and the "current" weight vector changes.

Convergence Proof

We assume that there is an ideal weight vector $ θ^{*} $ which correctly classifies all the samples. Then the cosine similarity between the current weight vector and the target vector can be computed as follows.

$$ similarity = \frac{θ^{*} \cdot θ^{k}}{ ||θ^{*}|| \cdot ||θ^{k}|| } $$

By proving that, after each adjustment, the numerator grows at least linearly in the number of adjustments while the denominator grows at most like its square root, we can show that the similarity would exceed its upper bound of $1$ if the adjustments went on indefinitely, so the number of adjustments must be finite.

To show that the numerator grows linearly, we expand the weight vector after the $k$-th adjustment in terms of the one before it:

$$ θ^{*} \cdot θ^{k} = θ^{*} \cdot (θ^{k-1} + y_{t} x_{t}) $$

As shown above, each adjustment changes the numerator by $ y_{t} \cdot (θ^{*} \cdot x_{t}) $, and since we assume that $ θ^{*} $ correctly classifies every sample, this change must be positive ($ y_{t} $ and $ θ^{*} \cdot x_{t} $ have the same sign when the prediction is correct).

If we further require that the ideal solution classifies each example correctly by some margin, we can offer stronger guarantees.

$$ y_{t} \cdot θ^{*} \cdot x_{t} \geq γ $$

By requiring that our ideal solution has a margin of at least $ γ $ on every training sample, each adjustment must increase the numerator by at least $ γ $.

$$ θ^{*} \cdot θ^{k} \geq k \cdot γ $$
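This bound follows by unrolling the recursion, assuming the weights start at $ θ^{0} = 0 $ so that each of the $k$ adjustments contributes at least $ γ $:

$$ θ^{*} \cdot θ^{k} \geq θ^{*} \cdot θ^{k-1} + γ \geq θ^{*} \cdot θ^{k-2} + 2 \cdot γ \geq \dots \geq θ^{*} \cdot θ^{0} + k \cdot γ = k \cdot γ $$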

To show that the denominator grows at most like the square root of the number of adjustments, we can look at how the squared magnitude of the weight vector changes:

$$ ||θ^{k}||^2 = ||θ^{k-1} + y_{t} x_{t}||^2 $$
$$ = ||θ^{k-1}||^2 + 2 \cdot y_{t} \cdot (θ^{k-1} \cdot x_{t}) + ||x_{t}||^2 $$

Since the weights are only adjusted for misclassified samples, we can guarantee that $ y_{t} \cdot (θ^{k-1} \cdot x_{t}) $ is never positive, leading to the following inequality, where $ R $ represents the largest possible value of $ ||x_{t}|| $.

$$ ||θ^{k}||^2 \leq k \cdot R^2 $$
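This follows by dropping the non-positive cross term, bounding $ ||x_{t}||^2 $ by $ R^2 $, and again unrolling the recursion from $ θ^{0} = 0 $:

$$ ||θ^{k}||^2 \leq ||θ^{k-1}||^2 + R^2 \leq ||θ^{k-2}||^2 + 2 \cdot R^2 \leq \dots \leq k \cdot R^2 $$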

Using these inequalities, we can bound our original cosine similarity as follows:

$$ 1 \geq \frac{k \cdot γ}{\sqrt{k \cdot R^2} \cdot ||θ^{*}|| } $$

$$ k \leq \frac{R^2 \cdot ||θ^{*}||^2}{ γ^2 } $$

This gives us an upper bound on the number of adjustments the perceptron can make before everything is classified correctly... unfortunately, evaluating the bound requires knowing the ideal solution.

However, this inequality is still useful for reasoning about the geometric margin ($ \frac{γ}{||θ||} $) and the complexity of the problem.

Support Vector Machine

Unlike the perceptron, support vector machines don't just aim to find a hyperplane that separates the two classes - they aim to find the optimal hyperplane that maximizes the margin to the nearest points.

Using the bound above, we aim to maximize the geometric margin ($ \frac{γ}{||θ||} $) or, equivalently, minimize its inverse $ \frac{||θ||}{γ} $. Since rescaling the weight vector doesn't change the classifier, we are free to fix the functional margin $γ$ to $1$.
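To see why fixing $ γ = 1 $ is harmless: dividing the weights by $ γ $ leaves both the classifier and the geometric margin unchanged,

$$ y_{t} \cdot \left( \tfrac{θ}{γ} \right)^{T} x_{t} \geq 1, \qquad \frac{γ}{||θ||} = \frac{1}{||θ / γ||} $$

so maximizing the geometric margin is equivalent to minimizing the norm of the rescaled weight vector subject to a functional margin of $1$.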

This results in the standard hard-margin SVM formulation shown below, where we minimize the norm of the weights without violating any of the constraints.

$$ minimize \phantom{2} \frac{1}{2}||θ||^2 \phantom{2} subject \ to \phantom{2} y_{t} \cdot θ^{T} x_{t} \geq 1 $$

However, the bias term should not be regularized: we don't care where the hyperplane sits, only that it achieves the maximum margin. So instead of absorbing the bias into the weight vector, we reintroduce it as an explicit, unregularized term $ θ_{0} $ to get the revised formulation.

$$ minimize \phantom{2} \frac{1}{2}||θ||^2 \phantom{2} subject \ to \phantom{2} y_{t} \cdot (θ^{T} x_{t} + θ_{0}) \geq 1 $$

If we relax the constraints, we get the soft-margin SVM in its standard form, where $C$ controls the trade-off between the size of the margin and the number and magnitude of margin violations.

$$ minimize \phantom{2} \frac{1}{2}||θ||^2 + C \cdot \sum_{t=1}^{n} max(0, 1 - y_{t} \cdot (θ^{T} x_{t} + θ_{0})) $$

As usual, when trying to find a minimum, we can take the (sub)derivative, revealing that the stochastic gradient descent update for a linear support vector machine is extremely similar to the perceptron update, just with an additional weight-decay term that pulls the weight vector towards $0$.
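Here is a rough sketch of that per-sample stochastic subgradient step for the soft-margin objective above, in the same NumPy style as the perceptron snippet (the function name, learning rate, and epoch count are my own illustrative choices, not from any particular library):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=100):
    """SGD sketch for: minimize 0.5 * ||theta||^2 + C * sum(hinge losses).

    X: (n, d) inputs; y: (n,) labels in {-1, +1}.
    Returns (theta, theta_0); the bias theta_0 is not regularized.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0

    for _ in range(epochs):
        for i in np.random.permutation(n):
            x_t, y_t = X[i], y[i]
            margin = y_t * (np.dot(theta, x_t) + theta_0)
            if margin < 1:
                # Hinge loss is active: perceptron-like step plus weight decay.
                theta -= lr * (theta - C * y_t * x_t)
                theta_0 += lr * C * y_t
            else:
                # Only the regularizer contributes: pull the weights towards 0.
                theta -= lr * theta
    return theta, theta_0
```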

Logistic Regression

So far, both of the linear models we've looked at have attempted to find a dividing hyperplane which splits the feature space into positive and negative samples. Logistic regression takes a different approach: it is a maximum likelihood estimator which attempts to model the probability that an input vector belongs to a certain class.

The probability of a class, given the input vector, is computed by applying the logistic (sigmoid) function to the output of the linear model.

$$ P(y \mid x) = (1 + e^{-y \cdot (θ^{T} x + θ_{0})})^{-1} $$

By maximizing the likelihood function below, we get the weights that maximize the overall likelihood of the observed labels (which also tends to make the model more robust to noisy input values than a support vector machine).

$$ maximize \phantom{2} \prod_{t=1}^{n} P(y_t | x_t) $$

Since a long product is awkward to optimize directly, we work with the negative log-likelihood instead, which turns the product into a summation to minimize.

$$ minimize \phantom{2} \sum_{t=1}^{n} - log(P(y_t | x_t)) $$

$$ minimize \phantom{2} \sum_{t=1}^{n} log(1 + e^{-y_{t} \cdot (θ^{T} x_{t} + θ_{0})}) $$

If we add the regularization term back in, we see that the final form of the logistic regression model is closely related to that of the support vector machine, just with the logistic loss function instead of the hinge loss function.

$$ minimize \phantom{2} \frac{1}{2}||θ||^2 + C \cdot \sum_{t=1}^{n} log(1 + e^{-y_{t} \cdot (θ^{T} x_{t} + θ_{0})}) $$
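A matching sketch for regularized logistic regression only swaps the hinge subgradient for the gradient of the logistic loss (again, the names and hyperparameters below are illustrative assumptions, not a specific library's API):

```python
import numpy as np

def train_logistic_regression(X, y, C=1.0, lr=0.01, epochs=100):
    """SGD sketch for: minimize 0.5 * ||theta||^2 + C * sum(log(1 + exp(-y_t * (theta.x_t + theta_0)))).

    X: (n, d) inputs; y: (n,) labels in {-1, +1}.
    Returns (theta, theta_0); the bias theta_0 is not regularized.
    """
    n, d = X.shape
    theta = np.zeros(d)
    theta_0 = 0.0

    for _ in range(epochs):
        for i in np.random.permutation(n):
            x_t, y_t = X[i], y[i]
            z = y_t * (np.dot(theta, x_t) + theta_0)
            # d/dz log(1 + exp(-z)) = -1 / (1 + exp(z)), i.e. -(1 - P(y_t | x_t)).
            g = -1.0 / (1.0 + np.exp(z))
            theta -= lr * (theta + C * g * y_t * x_t)
            theta_0 -= lr * C * g * y_t
    return theta, theta_0

def predict_proba(theta, theta_0, X, label=1):
    """P(y = label | x) under the fitted model."""
    return 1.0 / (1.0 + np.exp(-label * (X @ theta + theta_0)))
```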