Simple and Complete Tutorial for Understanding Naive Bayes Classifier

For datasets with fewer than 10,000 rows and a very large number of features, the Naive Bayes classifier might be your best bet, provided you are reasonably sure that the features of the data set are independent of each other.

The algorithm is not computationally expensive and can give high prediction accuracy in this setting.

Table of Contents

Step by step approach to master Naive Bayes Classifier.

  1. Understanding Bayes Theorem
  2. Bayes theorem in terms of machine learning
  3. Understanding Naive Bayes Classifier
  4. Understanding Priors in Naive Bayes Classifier
  5. Different Kind of Distributions in Naive Bayes Classifier
  6. Assumptions we undertake in Naive Bayes Classifier
  7. Loss function in Naive Bayes
  8. Bias-variance tradeoff in Naive Bayes Classifier



Before we go into Bayes theorem, let’s first brush up on our probability knowledge.

The first thing we need to understand is conditional probability: what is the probability that an event A occurs, given that B has already occurred?

This is important because in Naive Bayes we are concerned with the probability of getting the data when we already know the parameters of the model (the target function that generated the data set), and also the other way around: the probability that the data we have already observed was generated by a specific model.


$$ P(\theta\mid Data ) = \frac{P(Data \mid \theta) \, P(\theta)}{P(Data)} $$

Here $\theta$ represents the parameters of the target function that generated the sample data set.


Example To Visualize The Bayes Theorem

Let’s calculate the probability of you having a problem in your car given that the check engine light is on.

To start off, let’s define the individual probabilities:

P(You have an issue in the engine) = 1/12

P(Your check engine light is on) = 4/12

For the sake of simplicity, I am going to call the probability of having an engine issue P(A) and the probability of the check engine light being on P(B).

Probability, simply put, is the number of desirable outcomes divided by the total number of possible outcomes.

So, in conditional probability, the set of possible outcomes is restricted to event B, because we know that B has already occurred. The conditional probability of the desirable event A, given that B has already occurred, is the probability of the intersection of A and B divided by the probability of B.

Based on this, we can calculate the conditional probability we were looking for earlier:

$$ P(\text{engine issue} \mid \text{check engine light is on}) = \frac{P(\text{engine issue} \cap \text{check engine light is on})}{P(\text{check engine light is on})} $$

If we know the probability of events A and B occurring together, we can find the conditional probability.

Here is another way of understanding it.

There are only 4 different ways in which the two events can occur

Event A: Your engine has a problem

Event B: Your check engine light is on

The diagram above provides a very good intuition for how the multiplication rule works.

Image From Wikipedia

Let’s try to put the first instance in the picture into words. For events A and B to occur together, one of the two events has to occur first. Let’s say A occurs first; now event B has to occur given that A has already occurred.

To find the probability of the intersection of two events, we simply multiply the two probabilities.

$$ P(A \cap B) = P(A) \cdot P(B \mid A) $$

Another way in which we can reach the same intersection is that event B occurs first and then event A occurs, conditioned upon the fact that event B has already occurred.

$$ P(A \cap B) = P(B) \cdot P(A \mid B) $$

This gets us to the foundation of the Bayes theorem. If we equate the two equations above, we get:

$$ P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B) $$

$$ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} $$
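To make the formula concrete, here is a minimal sketch that plugs the car example into Bayes theorem. The values 1/12 and 4/12 come from the example; the value of P(B | A) (how often the light is on when the engine really has an issue) is not given in this tutorial, so the 9/10 used here is purely an assumption for illustration.

```python
from fractions import Fraction

# Probabilities taken from the car example.
p_issue = Fraction(1, 12)   # P(A): the engine has an issue
p_light = Fraction(4, 12)   # P(B): the check engine light is on

# ASSUMPTION (not in the tutorial): the light is on in 9 out of 10 cases
# where the engine really has an issue, i.e. P(B | A) = 9/10.
p_light_given_issue = Fraction(9, 10)


def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_issue_given_light = bayes(p_light_given_issue, p_issue, p_light)
print(p_issue_given_light)          # 9/40
print(float(p_issue_given_light))   # 0.225
```

Under that assumed value, seeing the light raises the probability of an engine issue from 1/12 (about 8%) to 9/40 (22.5%).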


Naive Bayes Classifier

Now let us take the intuition that we developed from the Bayes theorem and apply it to machine learning.

$$ P(\theta\mid Data ) = \frac{P(Data \mid \theta) \, P(\theta)}{P(Data)} $$

Here $\theta$ represents the parameters of the target function that generated the sample data set.

Now let us look at what each of these terms represents individually.

$P(\theta)$ is called the prior. It captures any information that we know beforehand about the distribution, such as whether it is binomial, Gaussian or multinomial. It represents our beliefs about the distribution before we have looked at any data.

$P(Data \mid \theta)$ is the likelihood. It signifies how likely a given model is to produce the observed data.

$P(\theta \mid Data)$ is the posterior distribution. It represents our beliefs about the model after we have taken all the data points into account.

The Naive Bayes classifier goes one step further than a non-Bayesian model, as it includes a prior that is built upon previous knowledge about the distribution.

Maybe we already have information about the distribution from which the data is drawn.

For a model like logistic regression, we just maximize the likelihood of the model producing the data; hence we call it maximum likelihood estimation.

In a Bayesian approach, we also take the prior into account, so we solve it by maximum a posteriori (MAP) estimation.

Here is where Bayesian and Non-Bayesian approaches differ. For Bayesian statisticians, Non-Bayesian statisticians are just doing the same thing with uninformative priors.

The counter-argument from Non-Bayesian statisticians is that if the data set is large, the impact of the prior is reduced.
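Here is a small sketch of that argument, using a coin flip instead of a classifier because the formulas are short. It compares the maximum likelihood estimate with the MAP estimate under a Beta(a, b) prior. The prior Beta(2, 10), the "true" rate of 70% heads and the idealised data are all assumptions I made up for the illustration.

```python
def mle(heads, n):
    """Maximum likelihood estimate of P(heads): the observed frequency."""
    return heads / n


def map_estimate(heads, n, a, b):
    """MAP estimate of P(heads) under a Beta(a, b) prior (requires a, b > 1)."""
    return (heads + a - 1) / (n + a + b - 2)


# Hypothetical prior that (wrongly) believes heads is rare.
a, b = 2, 10

for n in (10, 100, 10_000):
    heads = 7 * n // 10   # idealised data: exactly 70% heads
    print(n, mle(heads, n), round(map_estimate(heads, n, a, b), 4))
```

With 10 flips the wrong prior pulls the MAP estimate down to 0.4, while with 10,000 flips it is within 0.001 of the maximum likelihood estimate of 0.7.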

Without getting too much into the debate, I will point out that the algorithm is very useful for data sets with high dimensionality, where it can sometimes perform better than more sophisticated algorithms.

A more Practical Look On the Bayes Classifier

To study the Bayes classifier further, let’s look at it in a different way. Rather than working with the parameter theta, let’s work with a data set that has k classes, and try to predict the class in which a new data point will lie. Naive Bayes is usually used for classification tasks.

$$ P(C_k \mid x_1, x_2, x_3, \ldots) \propto P(C_k) \, P(x_1, x_2, x_3, \ldots \mid C_k) $$

x1, x2, x3, x4, … are the different features of the new data point and C1, C2, C3, …, Ck are the candidate classes.

We leave out the value P(Data) because the data has already been observed, so P(Data) is the same for every class: it is just a normalizing constant. It does not change the maximum a posteriori estimate.

We get the final classification by choosing the class with the maximum posterior probability:

$$ \hat{y} = \arg\max_{k} \; P(C_k) \, P(x_1, x_2, x_3, \ldots \mid C_k) $$

That is, the predicted value of y is the class which has the highest probability.

In a machine learning problem, we will generally have multiple features, i.e. x1, x2, x3, x4, …, and multiple classes C1, C2, C3, …, Ck.

But the tricky part that we still have to work around is $P(x_1, x_2, x_3, \ldots \mid C_k)$.

Let’s try to calculate this conditional probability. Just think sequentially: first, we know that the outcome has to belong to a particular class. Conditional on that, what is the probability of getting the value xn? Then we calculate the probability of getting the value xn-1, conditioned on the fact that xn and the class label have already occurred, and so on.

We get:-

$$ P(x_1, x_2, x_3, \ldots \mid C_k) = P(x_1 \mid x_2, x_3, \ldots, C_k) \, P(x_2 \mid x_3, \ldots, C_k) \cdots P(x_n \mid C_k) $$


In Naive Bayes we work with the assumption that all features are independent of each other given the class, so each factor in the product reduces to the probability of a single feature given the class.

It simply boils down to:

$$ P(C_k \mid x_1, x_2, x_3, \ldots) \propto P(x_1 \mid C_k) \, P(x_2 \mid C_k) \cdots P(x_n \mid C_k) \, P(C_k) $$
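The factorised form is short enough to write out by hand. The sketch below is not how sklearn implements it; it is a toy version with two classes and two binary features, where all priors and likelihoods are made-up numbers and the names `priors`, `likelihoods` and `classify` are my own.

```python
# Hypothetical toy model: two classes, two binary features.
priors = {"spam": 0.4, "ham": 0.6}

# likelihoods[c][i] = P(x_i = 1 | C = c); the values are made up.
likelihoods = {
    "spam": [0.8, 0.1],
    "ham": [0.2, 0.5],
}


def unnormalised_posterior(x, c):
    """Return P(C_k) * prod_i P(x_i | C_k) for a binary feature vector x."""
    score = priors[c]
    for x_i, p in zip(x, likelihoods[c]):
        score *= p if x_i == 1 else (1 - p)
    return score


def classify(x):
    """Pick the class with the highest score and also return the posteriors."""
    scores = {c: unnormalised_posterior(x, c) for c in priors}
    total = sum(scores.values())   # plays the role of P(Data)
    posteriors = {c: s / total for c, s in scores.items()}
    return max(scores, key=scores.get), posteriors


label, posteriors = classify([1, 0])
print(label, posteriors)
```

Note that dividing by `total` changes the numbers but not the winner, which is why the normalizing constant can be dropped when we only care about the predicted class.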

In practice, the assumption often does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus often leaving the classification task largely unaffected.

The assumption of independence is the reason why the classifier is called naive. But we have to make that assumption, because estimating $P(x_1, x_2, x_3, \ldots \mid C_k)$ directly is difficult: there are not many examples in the dataset for each combination of all attributes.

However, it is important to check the multicollinearity between different features before starting with the analysis. If the features have high multicollinearity, it is recommended to use another algorithm to find a baseline.

The final representation of the algorithm is this product of the prior and the per-feature likelihoods, evaluated for every class.

If the individual probabilities P(xi | Ck) are small, we can take the log of the probabilities. This is done because the final product would otherwise be extremely small.

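$$ \log P(C_k \mid x_1, x_2, x_3, \ldots) = \log P(x_1 \mid C_k) + \log P(x_2 \mid C_k) + \cdots + \log P(x_n \mid C_k) + \log P(C_k) + \text{const} $$

Here the constant is $-\log P(Data)$, which is the same for every class.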
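A quick way to see why the log is useful: multiply enough small probabilities together and a floating point number simply runs out of range. The sketch below uses a hypothetical data point with 1000 features, each with a likelihood of 0.01; both numbers are arbitrary.

```python
import math

# Hypothetical data point: 1000 features, each with P(x_i | C_k) = 0.01.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p   # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # roughly -4605.17
```

Once the product is 0.0 for every class, the classes can no longer be compared, whereas the sums of logs remain perfectly comparable.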

Before we go any further, let’s just reiterate the two reasons why the Naive Bayes Classifier is naive:

  1. We determine the prior by looking at the empirical distribution
  2. We assume independence of predictors

Few Pointers About Implementing Naive Bayes With Python

Working With the Prior

While implementing the Naive Bayes algorithm with the sklearn library or another machine learning library, you have these three choices when it comes to the selection of the prior.

  1. Uniform Prior – We effectively leave the prior out by giving every class the same prior probability, so it does not influence which class wins.
  2. Fit Prior – We choose the value of the prior based on the class probabilities in our data set.
  3. Class Prior – If we have additional details about the population and not just our sample dataset, we can provide an array of probabilities as an input to the machine learning algorithm.
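In sklearn these three choices map onto the `fit_prior` and `class_prior` arguments of the Naive Bayes estimators. The following is a minimal sketch with `MultinomialNB` on a tiny made-up count matrix (three samples of class 0 and one of class 1); the data and the 0.9/0.1 prior are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count data: 3 samples of class 0, 1 sample of class 1.
X = np.array([[2, 0], [3, 1], [1, 0], [0, 4]])
y = np.array([0, 0, 0, 1])

uniform = MultinomialNB(fit_prior=False).fit(X, y)          # uniform prior
fitted = MultinomialNB(fit_prior=True).fit(X, y)            # prior from class frequencies
custom = MultinomialNB(class_prior=[0.9, 0.1]).fit(X, y)    # prior supplied by us

for name, clf in [("uniform", uniform), ("fit", fitted), ("class_prior", custom)]:
    # class_log_prior_ stores the log of the prior, so exponentiate to read it.
    print(name, np.exp(clf.class_log_prior_))
```

The printed priors should be 0.5/0.5 for the uniform case, 0.75/0.25 when fitted from the data, and 0.9/0.1 when passed in explicitly.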

Let’s take three examples of how the prior can impact our final estimation.

a) We choose a prior that does not work.

The prior causes the posterior probability to change too much and it does not give the right output.

b) The total number of samples N in the dataset is large

If the total number of samples is large enough, then despite the prior being wrong we end up with the correct estimate of the posterior probability.

c) We choose the right prior

We can see that the prior acts to regularize the likelihood function, while normalization makes sure that the posterior probabilities sum to 1.


In the second image, our choice of the prior affects our final posterior distribution, but in the fourth image, where the sample size is large, the wrong prior has almost no effect on the distribution.

Choosing Distribution of Input Variables

Based on the distribution of the input variables,  we have three different types of Naive Bayes algorithms to choose from in sklearn

  • Gaussian: It is used in classification and it assumes that features follow a normal distribution.
  • Multinomial: It is used for discrete counts. For example, in a text classification problem, we go one step further than Bernoulli trials: instead of “word occurs in the document”, we count how often the word occurs in the document. You can think of it as “number of times outcome number x_i is observed over the n trials”.
  • Bernoulli: The Bernoulli model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with a ‘bag of words’ model where the 1s and 0s are “word occurs in the document” and “word does not occur in the document” respectively.
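As a rough sketch of how the three variants line up with the type of input, the snippet below fits `GaussianNB`, `MultinomialNB` and `BernoulliNB` on three tiny made-up feature matrices: one continuous, one with word counts, and one with word presence flags. All the numbers are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Three made-up views of four samples.
X_real = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.2], [2.8, 2.5]])   # continuous
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 5]])                 # word counts
X_binary = (X_counts > 0).astype(int)                                  # word present?

pred_gaussian = GaussianNB().fit(X_real, y).predict([[3.0, 2.0]])
pred_multinomial = MultinomialNB().fit(X_counts, y).predict([[0, 3]])
pred_bernoulli = BernoulliNB().fit(X_binary, y).predict([[0, 1]])

print(pred_gaussian, pred_multinomial, pred_bernoulli)
```

The API is the same for all three; what changes is the distribution each estimator assumes for P(xi | Ck), so the choice should follow the kind of features you have.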

Issues with mixed classes in naive Bayes

Working with mixed types of variables (for example, some continuous and some discrete) is an issue in naive Bayes. Each sklearn Naive Bayes algorithm only accommodates variables with a similar distribution.

So, in that case, you might need to change the variables around to make it work. For example, if one of the variables is the income of the person, you can encode the variable in buckets like below 50,000, 50,000 to 75,000, and so on.

You get my point.
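For instance, the income bucketing could be sketched like this with nothing but the standard library. The bucket edges and the helper name `income_bucket` are my own choices for the illustration, not something prescribed by sklearn.

```python
from bisect import bisect_right

# Hypothetical bucket edges: below 50,000 | 50,000 to 74,999 | 75,000 and above
EDGES = [50_000, 75_000]


def income_bucket(income):
    """Return the index of the bucket that an income falls into."""
    return bisect_right(EDGES, income)


incomes = [30_000, 50_000, 62_000, 75_000, 120_000]
buckets = [income_bucket(v) for v in incomes]
print(buckets)   # [0, 1, 1, 2, 2]
```

The resulting bucket indices are discrete, so they can sit next to other discrete features in a categorical or count-based Naive Bayes model.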

Loss Function

The loss function of naive Bayes is the 0-1 loss function.

This loss function penalizes incorrect classifications; a solution with more correct classifications will have a smaller loss.

But remember that in Naive Bayes we are not iterating over the solution to minimize the loss. We just make one pass over the data to figure out the posterior probability.
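The 0-1 loss is simple enough to write by hand. The sketch below reports it as the fraction of misclassified samples; the labels are made up.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mistakes = sum(t != p for t, p in zip(y_true, y_pred))
    return mistakes / len(y_true)


y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(zero_one_loss(y_true, y_pred))   # 0.4
```

Two of the five predictions are wrong, so the loss is 0.4; a perfect classifier would score 0.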

Other ways to check the performance of a naive Bayes algorithm:

ROC curve

ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point – a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.
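One way to build intuition for the AUC is that it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. Here is a small sketch of that pairwise definition in plain Python; it is meant for understanding, not as a replacement for a library implementation, and the labels and scores are made up.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))   # 0.75
```

Three of the four positive/negative pairs are ranked correctly here, hence 0.75. A value of 0.5 corresponds to random ranking and 1.0 to the "ideal" top left corner.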


Likelihood ratio

Likelihood ratio tests can be useful for comparing the likelihood of two different hypotheses. They give you a sense of how much more likely one hypothesis is than the other.

For example, if an instance has a probability of 0.02 for class A and 0.001 for class B, the normalized probability of the instance belonging to class A is (0.02/(0.02+0.001))*100, which is about 95.24%.
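The arithmetic in this example can be checked in a couple of lines. The helper name `normalise` is mine; the two scores are the ones from the example.

```python
def normalise(scores):
    """Rescale class scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


scores = {"A": 0.02, "B": 0.001}
posterior = normalise(scores)
ratio = scores["A"] / scores["B"]   # how many times more likely A is than B

print(round(100 * posterior["A"], 2))   # 95.24
print(round(ratio, 2))
```

The ratio says that class A is about 20 times as likely as class B for this instance, which is the same information as the 95.24% expressed differently.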

Bias and variance tradeoff

The Naive Bayes classifier employs a very simple (linear) hypothesis function, the function it uses to model data. It suffers from high bias, or error resulting from inaccuracies in its hypothesis class, because its hypothesis function is so simple that it cannot accurately represent many complex situations.

On the other hand, it exhibits low variance (variance being the failure to generalize to unseen data because the model follows its training set too closely), because the simplicity of its hypothesis class prevents it from overfitting to its training data.

Bagging and boosting are unlikely to help much in this case, as the variance is already low.

Some Pointers To Be Careful With Naive Bayes Algorithm

  • If continuous features do not have a normal distribution, we should use a transformation or other methods to bring them closer to a normal distribution.
  • If the test data set has a zero frequency issue, apply a smoothing technique such as “Laplace Correction” to predict the class of the test data set.
  • Remove correlated features, as highly correlated features are effectively counted twice in the model, which can over-inflate their importance.
  • Naive Bayes classifiers have limited options for parameter tuning, like alpha=1 for smoothing and fit_prior=[True|False] to learn class prior probabilities or not, plus some other options (see the library documentation for details). I would recommend focusing on your pre-processing of the data and on feature selection.
  • You might think of applying a classifier combination technique like ensembling, bagging or boosting, but these methods are unlikely to help much. Their main purpose is to reduce variance, and Naive Bayes has very little variance to reduce.
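To illustrate the zero frequency issue from the list above, here is a minimal sketch of additive (Laplace) smoothing for word likelihoods. The vocabulary, the word list and the helper name `smoothed_likelihoods` are all made up for the example.

```python
from collections import Counter


def smoothed_likelihoods(words, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    counts = Counter(words)
    total = len(words) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


vocabulary = ["free", "offer", "meeting"]
spam_words = ["free", "free", "offer"]   # "meeting" never appears in spam

probs = smoothed_likelihoods(spam_words, vocabulary)
print(probs)
```

Without smoothing (alpha=0), “meeting” would get a probability of exactly zero and a single occurrence of it would wipe out the whole product for the spam class. With alpha=1 it gets a small but non-zero probability of 1/6.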

Practical implementation of naive Bayes in sklearn

To see all the concepts that we discussed above implemented in practice, let us work on the task of email classification.

  1. Describing the Data – We have two main columns in the data set that have relevant information. We can just drop the last three columns, as they don’t have any valuable information.

2. Encoding word vectors into numerical values

Both the v1 and v2 columns in the data set are strings. To work with the naive Bayes algorithm we need to convert these columns into numerical encodings.

For column v1, we just define a function that returns 0 if the value is spam and 1 if it is ham.

def string_to_numeric(x):
    """Encode the label column: spam -> 0, ham -> 1."""
    if x == "spam":
        return 0
    if x == "ham":
        return 1

Here is a histogram of the data.

Now, we need to convert the email text into a bag of words using a count vectorizer. Simply put, it is similar to one-hot encoding, except that it stores word counts rather than 0/1 flags.

The count vectorizer creates a matrix representation of all the email text, where each value of a row vector represents the count of a specific word.

It is also recommended to use TfidfTransformer to transform the count matrix into a normalized tf or tf-idf representation.

Tf means term frequency, while tf-idf means term frequency times inverse document frequency.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

vec = CountVectorizer()  # sklearn.feature_extraction.text.CountVectorizer
data_BagOfWords = vec.fit_transform(data["v2"])

3. Apply the Naive Bayes classifier

After pre-processing, the dataset has dimensions 5572 x 8404. This is very high-dimensional data, and the data set is also smaller than 10,000 rows.

So, it is useful to work with the Multinomial Naive Bayes algorithm. We choose the multinomial variant because our features are discrete word counts.

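from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# The remaining arguments of train_test_split are cut off in the source ("...").
X_train, X_test, y_train, y_test = train_test_split(data_preprocessed, ...)
clf = MultinomialNB(alpha=0, class_prior=None, fit_prior=False).fit(X_train, y_train)
predicted = clf.predict(X_test)
metrics.confusion_matrix(y_test, predicted)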

4. Checking the Output of our Naive Bayes Algorithm

Loss Function

We calculate the 0-1 loss using sklearn and get a small value, 0.045739910313901344, meaning that about 4.6% of the test emails are misclassified.

from sklearn import metrics 
metrics.zero_one_loss(y_test, predicted)


We calculate the ROC AUC score. The value we get is 0.955, which indicates that our algorithm performs well.

metrics.roc_auc_score(y_test, predicted) 

About the author


Mastering Data Engineering/ Data science one project at a time. I have worked and developed multiple startups before, and this blog is for my journey as a startup in itself where I iterate and learn.
