A Complete Guide to Naive Bayes Algorithm in Python

Naive Bayes is a classification algorithm for both binary and multiclass problems. It applies Bayes' theorem with a "naive" assumption that the features are independent of one another, and it is especially popular for text classification.

What is Probability?

The chance of something happening is called ‘probability.’

Probability = No. of favourable outcomes / Total no. of outcomes

  • Flipping a coin: P(H) = 1/2 (an independent event)
  • Rolling a die: P(5) = 1/6 (an independent event)
  • Taking both events together: P(H, 5) = P(H) * P(5) = 1/12 (a joint event)

Now take a bag of 5 marbles, 3 of them blue, drawn without replacement:

  • Event A = drawing a blue marble first: P(A) = 3/5
  • Event B = drawing a blue marble second: P(B|A) = 2/4 = 1/2

P(2nd marble is blue | 1st blue marble has already been taken out) = 1/2

Events A and B are dependent events, hence we use conditional probability, as sketched below.
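A minimal sketch of this calculation in Python, assuming the 5-marble bag described above (3 blue marbles, drawn without replacement):

p_first_blue = 3 / 5           # P(A): the first marble drawn is blue
p_second_given_first = 2 / 4   # P(B|A): the second is blue, given the first was
p_both_blue = p_first_blue * p_second_given_first  # P(A and B) = P(A) * P(B|A)
print(p_both_blue)  # 0.3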

General Formula

Bayes' theorem gives the conditional probability of an event A given that an event B has occurred:

P(A|B) = P(B|A) * P(A) / P(B)

Let's look at an example.
Task: based on the weather conditions, predict whether the player will play or not.

Step 1: Make a Frequency table

Weather     Yes   No   Total
Sunny        3     2     5
Overcast     4     0     4
Rainy        2     3     5
Total        9     5    14

Step 2 : Create a Likelihood table

Weather     P(Weather|Yes)   P(Weather|No)   P(Weather)
Sunny            3/9              2/5           5/14
Overcast         4/9              0/5           4/14
Rainy            2/9              3/5           5/14
                P(Yes) = 9/14    P(No) = 5/14

The problem statement: will the players play if the weather is sunny?

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
               = (3/9 * 9/14) / (5/14)
               = 0.60

Similarly, for no play when it is overcast:

P(No | Overcast) = P(Overcast | No) * P(No) / P(Overcast)
                 = (0/5 * 5/14) / (4/14)
                 = 0
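This arithmetic is easy to verify with a few lines of Python, using the counts from the frequency table above:

p_sunny_given_yes = 3 / 9   # P(Sunny|Yes)
p_yes = 9 / 14              # P(Yes)
p_sunny = 5 / 14            # P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny  # Bayes' theorem
print(p_yes_given_sunny)    # 0.6, so the players are likely to play when it is sunny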

Naive Bayes Algorithm Explained

Feature Vector Creation

Feature vector creation converts raw text into numeric feature vectors that the algorithm can work with. It is commonly done with:

  • Bag of Words
  • TF-IDF

Bag of Words

Step 1: Tokenization (removing stopwords)

Sentence 1: He is a good boy.            → good boy
Sentence 2: She is a good girl.          → good girl
Sentence 3: Both are good boy and girl   → good boy girl

Step 2: Feature vector creation (each unique token becomes a feature)

            good    boy     girl
Sent1        1       1       0
Sent2        1       0       1
Sent3        1       1       1
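As a quick check, scikit-learn's CountVectorizer (used again in the implementation section below) produces the same bag-of-words matrix for these three sentences. Note that it orders the feature columns alphabetically (boy, girl, good) rather than in order of first appearance, and that get_feature_names_out requires a recent scikit-learn version:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["He is a good boy.", "She is a good girl.", "Both are good boy and girl"]
cv = CountVectorizer(stop_words="english")  # drops stopwords like "he", "is", "a"
bow = cv.fit_transform(sentences)
print(cv.get_feature_names_out())  # ['boy' 'girl' 'good']
print(bow.toarray())               # one row per sentence, one column per token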

TF-IDF

TF-IDF is short for Term Frequency-Inverse Document Frequency.

TF = (No. of times the word appears in a sentence) / (Total no. of words in the sentence)

IDF = log(Total no. of sentences / No. of sentences containing the word)

Step 1: Tokenization (removing stopwords), as before

Sentence 1: good boy
Sentence 2: good girl
Sentence 3: good boy girl

Step 2: Creating the TF table

            Sent1   Sent2   Sent3
good         1/2     1/2     1/3
boy          1/2      0      1/3
girl          0      1/2     1/3

Step 3: Creating the IDF table

good: log(3/3) = log(1) = 0
boy:  log(3/2)
girl: log(3/2)

Step 4: Feature vector creation (TF * IDF)

            good        boy               girl
Sent1     1/2 * 0   1/2 * log(3/2)         0
Sent2     1/2 * 0        0            1/2 * log(3/2)
Sent3     1/3 * 0   1/3 * log(3/2)   1/3 * log(3/2)
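A small pure-Python sketch that reproduces this hand calculation from the TF and IDF formulas above. (scikit-learn's TfidfVectorizer uses a smoothed IDF and L2 normalisation by default, so its numbers will differ from these.)

import math

docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
vocab = ["good", "boy", "girl"]
n_docs = len(docs)

for i, doc in enumerate(docs, start=1):
    row = []
    for word in vocab:
        tf_val = doc.count(word) / len(doc)        # term frequency
        df = sum(1 for d in docs if word in d)     # sentences containing the word
        idf = math.log(n_docs / df)                # inverse document frequency
        row.append(round(tf_val * idf, 3))
    print(f"Sent{i}:", row)  # Sent1: [0.0, 0.203, 0.0], etc.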

3 Types of Naive Bayes in Scikit Learn

Gaussian

  • Used for classification with continuous features; it assumes the features follow a normal (Gaussian) distribution.

Multinomial

  • Used for discrete counts, such as word counts in text.

Bernoulli

  • Used for binary features (i.e., zeros and ones), such as whether a word occurs in a document.
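All three variants live in sklearn.naive_bayes and share the same fit/predict API. A small illustrative sketch, using made-up toy data chosen only to show which input type suits each variant:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # toy labels

X_real = np.array([[1.2], [0.9], [3.1], [2.8]])          # continuous features -> GaussianNB
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])    # word counts -> MultinomialNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])    # 0/1 features -> BernoulliNB

print(GaussianNB().fit(X_real, y).predict([[3.0]]))        # [1]
print(MultinomialNB().fit(X_counts, y).predict([[0, 5]]))  # [1]
print(BernoulliNB().fit(X_binary, y).predict([[1, 0]]))    # [0]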

Pros of Naive Bayes

  • Naive Bayes is a fast and highly scalable algorithm.
  • Naive Bayes handles both binary and multi-class classification, with variants such as GaussianNB, MultinomialNB, and BernoulliNB.
  • The algorithm is simple and essentially depends on doing a set of counts.
  • A strong choice for text classification problems; widely used in spam mail classification.
  • It trains well even on small datasets.

Cons of Naive Bayes

  • It cannot learn relationships between features because it assumes all features are independent of one another.

Application of Naive Bayes

  • Naive Bayes is broadly used for text classification.
  • News article classification (SPORTS, TECHNOLOGY, etc.).
  • Spam or ham: Naive Bayes is one of the most popular algorithms for mail filtering.

Python Implementation of Naive Bayes

Importing the required libraries:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt

Loading the dataset

data = pd.read_csv("spam.tsv", sep="\t", names=["Class", "Message"])
data.head(8)  # View the first 8 records of our dataset

Output:

The mails are categorized into 2 classes, i.e., spam and ham.

# Let's see the count of each class
data.groupby('Class').count()

Output:

Text Preprocessing

# Let's assign ham as 1
data.loc[data['Class'] == "ham", "Class"] = 1

# Let's assign spam as 0
data.loc[data['Class'] == "spam", "Class"] = 0

Removing punctuations

# The default list of punctuation characters
import string
string.punctuation

Output:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Let's remove the punctuation
def remove_punct(text1):
    text1 = "".join([c for c in text1 if c not in string.punctuation])
    return text1

data['text_clean'] = data['Message'].apply(lambda x: remove_punct(x))
data.head()

Output:

# Tokenization
# CountVectorizer is used to convert text to numerical data.
# Initialize the CountVectorizer object
CV = CountVectorizer(stop_words="english")

Splitting x and y

xSet = data['text_clean'].values
ySet = data['Class'].values
ySet

Splitting Train and Test Data

xSet_train, xSet_test, ySet_train, ySet_test = train_test_split(xSet, ySet, test_size=0.2, random_state=10)

xSet_train_CV = CV.fit_transform(xSet_train)  # fit the vectorizer on the training messages
xSet_train_CV

Feature vectorizer

# Text preprocessing and feature vectorization
# To extract features from the messages, we use TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()  # object creation
X = tf.fit_transform(xSet)  # fit and transform the cleaned messages into TF-IDF vectors
y = ySet.astype('int')  # the 0/1 labels created earlier

# Training the model
# Creating the training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

# Model creation
from sklearn.naive_bayes import MultinomialNB

# Model object creation
nb = MultinomialNB()

# Fitting the model
nb.fit(X_train, y_train)

# Prediction: getting the predictions on the test set
y_hat = nb.predict(X_test)
y_hat

# Evaluation: evaluating the model
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_hat))

Spam Classification Application

msg = input("Enter Message: ")  # get the input message
msgInput = tf.transform([msg])  # vectorize it with the fitted TF-IDF vectorizer
predict = nb.predict(msgInput)
if predict[0] == 0:
    print("---- MESSAGE SENT [CHECK SPAM FOLDER] ----")
else:
    print("---- MESSAGE SENT [CHECK INBOX] ----")

Output:

Being a prominent data science institute, DataMites provides specialised training in topics including deep learning, machine learning, artificial intelligence, the internet of things, and Python. Our Machine Learning Courses at DataMites have been authorised by the International Association for Business Analytics Certification (IABAC), a body with a strong reputation and high appreciation in the analytics field.
