
Understanding the mathematics behind Linear Discriminant Analysis (LDA)

We are already familiar with the Logistic Regression classification algorithm, which works well for two-class problems. When there are more than two classes, however, logistic regression is typically not preferred, and we turn to another linear classification technique: Linear Discriminant Analysis (LDA).

Before we get into the details of LDA, let’s first review the Naive Bayes classification algorithm, which forms the basis for LDA.

Naive Bayes

Given a feature vector $X = (x_1, x_2, ..., x_n)$ and a class variable $C_k$, Bayes' Theorem states that:

$$P(C_k|X) = \frac{P(X|C_k)\,P(C_k)}{P(X)}, \quad \text{for } k = 1, 2, ..., K$$

We call $P(C_k|X)$ the posterior probability, $P(X|C_k)$ the likelihood, $P(C_k)$ the prior probability of the class, and $P(X)$ the prior probability of the predictor. We are interested in calculating the posterior probability from the likelihood and the prior probabilities.
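To make this concrete, here is a small worked example with hypothetical numbers. Suppose $K = 2$, with priors $P(C_1) = 0.3$ and $P(C_2) = 0.7$, and suppose the likelihoods at an observed $X$ are $P(X|C_1) = 0.5$ and $P(X|C_2) = 0.2$. Then $P(X) = 0.5 \cdot 0.3 + 0.2 \cdot 0.7 = 0.29$, and the posterior for class 1 is

$$P(C_1|X) = \frac{0.5 \cdot 0.3}{0.29} \approx 0.52,$$

so class 1 is slightly more probable than class 2 despite its smaller prior.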

Using the naive independence assumption, which states that $P(X|C_k) = P(x_1, ..., x_n|C_k) = \prod_{i=1}^{n} P(x_i|C_k)$, the posterior probability can then be written as:

$$P(C_k|X) = \frac{P(C_k)\prod_{i=1}^{n} P(x_i|C_k)}{P(X)}$$

The Naive Bayes classification problem then becomes: for the different class values of $C_k$, find the maximum of $P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$. This can be formulated as:

$$\hat{C} = \arg\max_{C_k} P(C_k)\prod_{i=1}^{n} P(x_i|C_k)$$
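Since the theorem itself does not fix the form of the per-feature likelihoods, here is a minimal sketch of this argmax rule assuming univariate Gaussian likelihoods for each feature; the function name and the toy numbers are illustrative, not a reference implementation.

```python
import numpy as np

def naive_bayes_predict(x, priors, means, variances):
    """Return the class k maximizing P(C_k) * prod_i P(x_i | C_k).

    Works in log space (log P(C_k) + sum_i log P(x_i | C_k)) to avoid
    numerical underflow from multiplying many small likelihoods.
    Assumes each per-feature likelihood is univariate Gaussian with
    the given per-class means and variances.
    """
    log_posteriors = []
    for k in range(len(priors)):
        # Log of the univariate Gaussian density, one term per feature
        log_lik = (-0.5 * np.log(2 * np.pi * variances[k])
                   - 0.5 * (x - means[k]) ** 2 / variances[k])
        log_posteriors.append(np.log(priors[k]) + log_lik.sum())
    return int(np.argmax(log_posteriors))  # C-hat = argmax over classes

# Toy usage: two classes, two features
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
variances = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
print(naive_bayes_predict(np.array([1.8, 2.1]), priors, means, variances))  # -> 1
```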

Discriminant analysis

Let us now turn to the broader concept: discriminant analysis. Discriminant analysis can be viewed as a five-step procedure:

  • Step 1: Calculate prior probabilities. The prior probability of class $C_k$, $P(C_k)$, can be calculated as the relative frequency of class $C_k$ in the training data.
  • Step 2: Test homogeneity of variances. Use Bartlett’s test to test whether the $K$ samples come from populations with equal variance-covariance matrices (a sketch of such a check follows this list). The result of this test determines whether to use Linear or Quadratic Discriminant Analysis.
    • Use Linear Discriminant Analysis for homogeneous variance-covariance matrices: $\Sigma_1 = \Sigma_2 = ... = \Sigma_K = \Sigma$.
    • Use Quadratic Discriminant Analysis for heterogeneous variance-covariance matrices: $\Sigma_i \neq \Sigma_j$ for some $i \neq j$.
  • Step 3: Estimate parameters of the likelihoods. We estimate the parameters (e.g. $\mu_i$ and $\Sigma$) of the conditional probability density functions $P(X|C_k)$ from the training data. Here, we shall make the standard assumption that the data are multivariate normally distributed.
  • Step 4: Compute discriminant functions. This is the rule to classify the new object into one of the known populations.
  • Step 5: Estimate classification performance. Use cross-validation to estimate misclassification probabilities. We can then make predictions for new observations.
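For Step 2, a commonly used Bartlett-type statistic for comparing several covariance matrices is Box’s M test. Below is a sketch of its textbook chi-square approximation; the function and variable names are my own, and the usual caveat applies that the test is sensitive to non-normality.

```python
import numpy as np
from scipy import stats

def box_m_test(X, y):
    """Box's M test for equality of class covariance matrices.

    Returns the M statistic and an approximate p-value based on the
    textbook chi-square approximation. A small p-value suggests the
    covariance matrices differ, favoring QDA over LDA.
    """
    classes = np.unique(y)
    K, (N, p) = len(classes), X.shape
    ns = np.array([np.sum(y == c) for c in classes])
    covs = [np.cov(X[y == c], rowvar=False) for c in classes]
    # Pooled covariance: weighted average of within-class covariances
    S_pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (N - K)
    M = ((N - K) * np.log(np.linalg.det(S_pooled))
         - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs)))
    # Scaling factor and degrees of freedom for the chi-square approximation
    c = ((2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (K - 1))
         * (np.sum(1 / (ns - 1)) - 1 / (N - K)))
    df = p * (p + 1) * (K - 1) / 2
    return M, stats.chi2.sf(M * (1 - c), df)
```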

LDA (Linear Discriminant Analysis)

Now let's look at LDA in detail.

We assume that in population $\pi_i$ the probability density function of $x$ is multivariate normal with mean vector $\mu_i$ and variance-covariance matrix $\Sigma$ (the same for all populations). The formula for this normal probability density function is:

$$P(X|\pi_i) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(X-\mu_i)^T\Sigma^{-1}(X-\mu_i)\right]$$
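As a sanity check, the density above can be evaluated either directly from the formula or with `scipy.stats.multivariate_normal`; the mean vector, covariance matrix, and query point below are made-up values for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters for p = 2 features
mu_i = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
X = np.array([1.5, 1.0])

# Library evaluation of the density
print(multivariate_normal(mean=mu_i, cov=Sigma).pdf(X))

# Direct evaluation of the formula above, for comparison
p = len(mu_i)
diff = X - mu_i
density = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
           / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))))
print(density)  # matches the scipy value
```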

Following the Bayes classification rule, we classify an example into the population for which $P(\pi_i)P(X|\pi_i)$ is the maximum. For ease of calculation, we also take a log transform.

LDA is used when the variance-covariance matrices for all populations are homogeneous. In LDA, our decision rule is based on the linear score function, which is a function of the population means $\mu_i$ for each of our $g$ populations and the pooled variance-covariance matrix.

The Linear Score Function is defined as:

$$s_i^L(X) = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \mu_i^T\Sigma^{-1}X + \log P(\pi_i) = d_{i0} + \sum_{j=1}^{p} d_{ij}x_j + \log P(\pi_i) = d_i^L(X) + \log P(\pi_i)$$

where $d_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i$ and $d_{ij}$ is the $j$th element of $\mu_i^T\Sigma^{-1}$. We call $d_i^L(X)$ the linear discriminant function.

As we can see from the formula above, the far right-hand expression is similar to a linear regression with intercept $d_{i0}$ and coefficients $d_{ij}$.

Given an example with feature vector $X = (x_1, x_2, ..., x_p)$, the linear score function is computed for each population, and we classify the example into the population with the largest linear score. This is equivalent to classifying it into the population for which the posterior probability of membership is largest.
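Here is a minimal sketch of this decision rule, computing $d_{i0}$, the coefficients $d_{ij}$, and the full score for each population; the function name and the toy numbers are illustrative.

```python
import numpy as np

def lda_classify(X_new, priors, means, Sigma):
    """Classify X_new by the linear score function
    s_i(X) = -1/2 mu_i' Sigma^-1 mu_i + mu_i' Sigma^-1 X + log P(pi_i).

    `means` holds the population mean vectors mu_i and `Sigma` is the
    common (pooled) variance-covariance matrix.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for prior, mu in zip(priors, means):
        d_i0 = -0.5 * mu @ Sigma_inv @ mu   # intercept d_i0
        d_ij = mu @ Sigma_inv               # coefficient vector d_ij
        scores.append(d_i0 + d_ij @ X_new + np.log(prior))
    return int(np.argmax(scores))  # population with the largest score

# Toy usage with two populations
priors = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
print(lda_classify(np.array([2.5, 0.8]), priors, means, Sigma))  # -> 1
```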

We should also notice that the linear score function depends on the unknown parameters $\mu_i$ and $\Sigma$, so we have to estimate their values from the training data. Discriminant analysis requires estimates of the following (a sketch of estimating them follows this list):

  • Prior probabilities: $P(\pi_i)$;
  • Population means: $\mu_i = E(X|\pi_i)$;
  • Variance-covariance matrix: $\Sigma = \text{var}(X|\pi_i)$.
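Below is a sketch of estimating all three from a labeled training set $(X, y)$, using relative class frequencies for the priors, class sample means for $\mu_i$, and the pooled within-class covariance for $\Sigma$; names are illustrative. The returned values plug directly into a scoring routine like `lda_classify` above.

```python
import numpy as np

def estimate_lda_parameters(X, y):
    """Estimate priors, class means, and the pooled covariance matrix.

    Priors are relative class frequencies, means are per-class sample
    means, and Sigma is the pooled within-class covariance
    (each class needs at least two training examples).
    """
    classes = np.unique(y)
    N, p = X.shape
    priors, means = [], []
    S_pooled = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        priors.append(len(Xc) / N)         # P(pi_i)
        means.append(Xc.mean(axis=0))      # mu_i
        S_pooled += (len(Xc) - 1) * np.cov(Xc, rowvar=False)
    S_pooled /= N - len(classes)           # pooled Sigma
    return priors, means, S_pooled
```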
#machine learning
#algorithms
Written by Shuzhan Fan on Jul 23, 2018