
Understanding the mathematics behind Naive Bayes

Naive Bayes, also called the Naive Bayes classifier, is a classifier based on Bayes' Theorem with the naive assumption that features are independent of each other. Without further ado, let's get straight to the derivation of the model.

Bayes Theorem

Given a feature vector $X = (x_1, x_2, \dots, x_n)$ and a class variable $C_k$, Bayes' Theorem states that:

$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}, \quad \text{for } k = 1, 2, \dots, K$$

We call $P(C_k \mid X)$ the posterior probability, $P(X \mid C_k)$ the likelihood, $P(C_k)$ the prior probability of class, and $P(X)$ the prior probability of predictor. We are interested in calculating the posterior probability from the likelihood and prior probabilities.
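
To make the formula concrete, here is a tiny numerical example in Python. The two classes, the single observed feature value, and all of the probability values are made up purely for illustration:

```python
# Hypothetical numbers for two classes C1, C2 and one observed feature value X.
prior = {"C1": 0.3, "C2": 0.7}        # P(Ck)
likelihood = {"C1": 0.8, "C2": 0.1}   # P(X | Ck)

# Prior probability of the predictor, P(X), via the law of total probability.
p_x = sum(prior[c] * likelihood[c] for c in prior)

# Posterior P(Ck | X) = P(X | Ck) * P(Ck) / P(X) for each class.
posterior = {c: likelihood[c] * prior[c] / p_x for c in prior}
print(posterior)  # {'C1': 0.774..., 'C2': 0.225...}
```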

Using the chain rule, the likelihood $P(X \mid C_k)$ can be decomposed as:

$$P(X \mid C_k) = P(x_1, \dots, x_n \mid C_k) = P(x_1 \mid x_2, \dots, x_n, C_k)\, P(x_2 \mid x_3, \dots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\, P(x_n \mid C_k)$$
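
For example, with $n = 3$ features the chain rule expansion reads:

$$P(x_1, x_2, x_3 \mid C_k) = P(x_1 \mid x_2, x_3, C_k)\, P(x_2 \mid x_3, C_k)\, P(x_3 \mid C_k)$$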

Naive independence assumption

The above conditional probabilities can be hard and expensive to calculate. Fortunately, with the naive conditional independence assumption, which states that:

$$P(x_i \mid x_{i+1}, \dots, x_n, C_k) = P(x_i \mid C_k)$$

We can get:

$$P(X \mid C_k) = P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$
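
Continuing the three-feature example above, the assumption collapses the chain of conditionals into a simple product of per-feature likelihoods:

$$P(x_1, x_2, x_3 \mid C_k) = P(x_1 \mid C_k)\, P(x_2 \mid C_k)\, P(x_3 \mid C_k)$$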

And the posterior probability can then be written as:

$$P(C_k \mid X) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)}{P(X)}$$

Naive Bayes model

Since the prior probability of predictor $P(X)$ is constant given the input, we can get:

$$P(C_k \mid X) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

where $\propto$ means "is directly proportional to".

The Naive Bayes classification problem then becomes: for the different class values of $C_k$, find the maximum of $P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$. This can be formulated as:

$$\hat{C} = \underset{C_k}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

The prior probability of class $P(C_k)$ can be calculated as the relative frequency of class $C_k$ in the training data.
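
To tie the pieces together, here is a minimal sketch of the decision rule for categorical features. It assumes both the class priors and the per-feature likelihoods are estimated as relative frequencies from the training data; real implementations typically add smoothing and work with log probabilities to avoid underflow:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(Ck) and P(xi = v | Ck) as relative frequencies (categorical features)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}  # P(Ck)
    likelihoods = {}                                # likelihoods[c][i][v] = P(xi = v | Ck)
    for c in classes:
        Xc = X[y == c]
        likelihoods[c] = [
            {v: np.mean(Xc[:, i] == v) for v in np.unique(X[:, i])}
            for i in range(X.shape[1])
        ]
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return the class Ck that maximizes P(Ck) * prod_i P(xi | Ck)."""
    scores = {
        c: priors[c] * np.prod([likelihoods[c][i].get(v, 0.0) for i, v in enumerate(x)])
        for c in priors
    }
    return max(scores, key=scores.get)

# Toy usage with made-up data: two binary features, two classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array(["spam", "spam", "ham", "ham"])
priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([1, 0]), priors, likelihoods))  # -> "spam"
```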

Different Naive Bayes classifiers

Practically speaking, the likelihoods $P(x_i \mid C_k)$ are usually modeled using the same class of probability distribution. The different Naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of $P(x_i \mid C_k)$.

Some commonly used Naive Bayes classifiers include the Gaussian Naive Bayes classifier, the Multinomial Naive Bayes classifier, and the Bernoulli Naive Bayes classifier.

Multinomial Naive Bayes and Bernoulli Naive Bayes are two classic naive Bayes classifiers used in text classification. It's hard to say in general which one performs better; it's best to run and evaluate both models.
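
As a sketch of how such a comparison might look with scikit-learn, the following fits both models on the same bag-of-words features (the toy corpus and labels are made up, and a real comparison would of course evaluate on held-out data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Made-up toy corpus purely for illustration.
docs = ["free money now", "meeting at noon", "win money prizes", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

X = CountVectorizer().fit_transform(docs)  # word-count features

for model in (MultinomialNB(), BernoulliNB()):
    model.fit(X, labels)
    print(type(model).__name__, model.predict(X))
```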

Extra words about Naive Bayes

Remember that we made a naive independence assumption about the features of the data. While this may seem overly simplistic or naive, in practice naive Bayes classifiers have worked quite well in many real-world applications.

Because of the independence assumption, naive Bayes classifiers can quickly learn to use high dimensional features with limited training data compared to more sophisticated methods. This can be useful in situations where the dataset is small compared to the number of features, such as images or texts.

The computational efficiency of Naive Bayes lies in the fact that the runtime complexity of the Naive Bayes classifier is $O(nK)$, where $n$ is the number of features and $K$ is the number of label classes. This is especially useful when dealing with very high dimensional data, such as a large-corpus text dataset or a high-resolution image dataset.
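
To see where the $O(nK)$ cost comes from, note that scoring one example is just a loop over the $K$ classes with a sum over the $n$ features inside. Here is a sketch in log space, assuming the log priors and per-feature log-likelihoods have already been tabulated (the data structures are hypothetical):

```python
import math

def score_example(x, log_prior, log_likelihood):
    """O(nK) scoring: K classes times n features, one lookup and add each.

    log_prior[c] = log P(Ck); log_likelihood[c][i][v] = log P(xi = v | Ck)."""
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():               # K classes
        score = lp + sum(log_likelihood[c][i][v]  # n features
                         for i, v in enumerate(x))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```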

#machine learning
#algorithms
Written by Shuzhan Fan on Jun 8, 2018