Naive Bayes is a classification algorithm for binary and multiclass classification problems. It applies Bayes' theorem under the "naive" assumption that the features are independent of one another given the class. It is widely used for text classification.
What is Probability?
The chance of something happening is called 'probability'.
Probability = No. of favourable outcomes / Total no. of possible outcomes
- Flipping a coin: P(H) = ½ (independent event)
- Rolling a die: P(5) = ⅙ (independent event)
Let's take both events together:
P(H, 5) = P(H) * P(5) = ½ * ⅙ = 1/12 (joint event)
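The product rule for independent events can be checked by counting outcomes. The following is a minimal sketch that assumes a fair coin and a fair six-sided die; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair is equally likely, so probabilities are simple counts.
outcomes = list(product(coin, die))

p_heads = Fraction(sum(1 for c, _ in outcomes if c == "H"), len(outcomes))
p_five = Fraction(sum(1 for _, d in outcomes if d == 5), len(outcomes))
p_heads_and_five = Fraction(outcomes.count(("H", 5)), len(outcomes))

print(p_heads, p_five, p_heads_and_five)  # 1/2 1/6 1/12
print(p_heads_and_five == p_heads * p_five)  # True
```

Because the joint probability equals the product of the two individual probabilities, the coin flip and the die roll are independent.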
- Event A = taking a blue marble out of a bag
Event B = taking a second blue marble without putting the first one back: P(B | A) = 2/4 = ½
P(getting a 2nd blue marble | the 1st blue marble has already been taken out) = ½
Events A and B are dependent, because the first draw changes what is left in the bag, so we use conditional probability.
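The conditional probability above can be reproduced by enumerating draws. This sketch assumes a bag of 3 blue and 2 red marbles; the text does not state the bag's contents, but that composition is what a probability of 2/4 after removing one blue marble implies.

```python
from fractions import Fraction
from itertools import permutations

# Assumed bag contents (not given in the text): 3 blue and 2 red marbles.
bag = ["blue", "blue", "blue", "red", "red"]

# All ordered ways to draw two marbles without replacement.
draws = list(permutations(range(len(bag)), 2))

first_blue = [(i, j) for i, j in draws if bag[i] == "blue"]
both_blue = [(i, j) for i, j in first_blue if bag[j] == "blue"]

p_a = Fraction(len(first_blue), len(draws))              # P(A)
p_b_given_a = Fraction(len(both_blue), len(first_blue))  # P(B | A)
p_a_and_b = Fraction(len(both_blue), len(draws))         # P(A and B)

print(p_a, p_b_given_a, p_a_and_b)  # 3/5 1/2 3/10
```

Note that P(A and B) = P(A) * P(B | A) = 3/5 * 1/2 = 3/10, which is the general product rule for dependent events.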
Let's look at an example:
Task: Based on the weather conditions, predict whether the player will play or not.
Step 1: Make a frequency table
Step 2: Create a likelihood table
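The question is whether the players would play if the weather is sunny.
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
= (3/9 * 9/14) / (5/14) = 3/5 = 0.6
Since 0.6 is greater than 0.5, the prediction is that the players will play on a sunny day.
The same rule gives the probability of not playing when the weather is overcast:
P(No | Overcast) = P(Overcast | No) * P(No) / P(Overcast)
= (0/5 * 5/14) / P(Overcast) = 0, because none of the 5 "No" days was overcast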
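The sunny-day calculation can be written as a small function. This is a minimal sketch using only the fractions quoted in the text (3/9, 9/14 and 5/14); the function name `posterior` is an illustrative choice, not part of any library.

```python
from fractions import Fraction


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(class | x) = P(x | class) * P(class) / P(x)."""
    return likelihood * prior / evidence


p_sunny_given_yes = Fraction(3, 9)  # P(Sunny | Yes)
p_yes = Fraction(9, 14)             # P(Yes)
p_sunny = Fraction(5, 14)           # P(Sunny)

p_yes_given_sunny = posterior(p_sunny_given_yes, p_yes, p_sunny)
print(p_yes_given_sunny, float(p_yes_given_sunny))  # 3/5 0.6
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand calculation without rounding.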
Naive Bayes Algorithm Explained
Feature Vector Creation
Feature vector creation converts text data into numerical feature vectors. It can be done with bag of words or TF-IDF.
Bag of words
Step 1: Tokenization (removing stopwords)
Sentence 1: He is a good boy. -> good boy
Sentence 2: She is a good girl. -> good girl
Sentence 3: Both are good boy and girl -> good boy girl
Step 2: Feature vector creation (each unique token becomes a feature)
        f1     f2    f3
        good   boy   girl
Sent1   1      1     0
Sent2   1      0     1
Sent3   1      1     1
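The two bag-of-words steps can be sketched in plain Python. The stopword set below is a small hand-picked assumption that covers only these three sentences; real projects usually rely on a library-provided list.

```python
import string

# Illustrative stopword list (an assumption, just enough for these sentences).
STOPWORDS = {"he", "she", "is", "a", "both", "are", "and"}


def tokenize(sentence):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]


sentences = ["He is a good boy.", "She is a good girl.", "Both are good boy and girl"]
tokens = [tokenize(s) for s in sentences]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence_tokens in tokens:
    for word in sentence_tokens:
        if word not in vocabulary:
            vocabulary.append(word)

# One count per vocabulary word for each sentence.
vectors = [[sentence_tokens.count(word) for word in vocabulary] for sentence_tokens in tokens]

print(vocabulary)  # ['good', 'boy', 'girl']
for vector in vectors:
    print(vector)
# [1, 1, 0]
# [1, 0, 1]
# [1, 1, 1]
```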
TF-IDF is short for Term Frequency-Inverse Document Frequency.
TF = (No. of times the word occurs in the sentence) / (Total no. of words in the sentence)
IDF = log(Total no. of sentences / No. of sentences containing the word)
Step 1: Tokenization (removing stopwords)
Sentence 1: good boy
Sentence 2: good girl
Sentence 3: good boy girl
Step 2: Creating the TF table
        Sent1   Sent2   Sent3
Good    1/2     1/2     1/3
Boy     1/2     0       1/3
Girl    0       1/2     1/3
Step 3: Creating the IDF table
Good    log(3/3) = log(1) = 0
Boy     log(3/2)
Girl    log(3/2)
Step 4: Feature vector creation (TF * IDF)
        Good      Boy               Girl
Sent1   1/2 * 0   1/2 * log(3/2)    0
Sent2   1/2 * 0   0                 1/2 * log(3/2)
Sent3   1/3 * 0   1/3 * log(3/2)    1/3 * log(3/2)
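The TF, IDF and feature-vector tables can be reproduced with a few lines of plain Python. This sketch assumes the natural logarithm, since the text does not state the base of the log, and it starts from the already-tokenized sentences.

```python
import math

# Tokenized sentences after stopword removal, matching the TF table above.
docs = [["good", "boy"], ["good", "girl"], ["good", "boy", "girl"]]
vocabulary = ["good", "boy", "girl"]


def tf(word, doc):
    """Term frequency: occurrences of the word / total words in the sentence."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """Inverse document frequency: log(total sentences / sentences containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)  # natural log is an assumption


tfidf = [[tf(word, doc) * idf(word, docs) for word in vocabulary] for doc in docs]

for row in tfidf:
    print([round(value, 3) for value in row])
# [0.0, 0.203, 0.0]
# [0.0, 0.0, 0.203]
# [0.0, 0.135, 0.135]
```

The word "good" appears in every sentence, so its IDF is 0 and it contributes nothing to any vector. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook formula.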
3 Types of Naive Bayes in Scikit-learn
- GaussianNB: used for classification problems with continuous features; it assumes that the features follow a normal distribution.
- MultinomialNB: used for discrete counts, such as word counts.
- BernoulliNB: used for binary features (i.e., zeros and ones).
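As a rough illustration of how the three scikit-learn variants (GaussianNB, MultinomialNB and BernoulliNB) match different kinds of features, here is a minimal sketch. The tiny arrays are made-up toy data chosen so that the two classes are clearly separated; they are not taken from any real dataset.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # toy class labels

# Continuous features -> GaussianNB
X_continuous = np.array([[1.0], [1.2], [5.0], [5.2]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.1], [5.1]])

# Discrete counts (e.g. word counts) -> MultinomialNB
X_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0], [0, 4]])

# Binary features (present / absent) -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0], [0, 1]])

print(gaussian_pred, multinomial_pred, bernoulli_pred)  # [0 1] [0 1] [0 1]
```

All three classes share the same `fit` / `predict` interface, so switching between them is mostly a matter of matching the variant to the feature type.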
Pros of Naive Bayes
- Naive Bayes is a fast and highly scalable algorithm.
- Naive Bayes can be used for both binary and multi-class classification. Scikit-learn provides several variants, such as GaussianNB, MultinomialNB, and BernoulliNB.
- The algorithm is simple: training mostly amounts to counting how often each feature value occurs with each class.
- A strong baseline for text classification problems. Widely used in spam mail classification.
- Training on small datasets is easy.
Cons of Naive Bayes
- It cannot learn relationships between features, because it assumes that all features are independent of one another given the class.
Application of Naive Bayes
- Naive Bayes is widely used for text classification.
- News article classification (sports, technology, etc.).
- Spam or ham: Naive Bayes is one of the most popular algorithms for mail filtering.
Python Implementation of Naive Bayes
# Importing the required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt
Loading the dataset
data = pd.read_csv("spam.tsv", sep="\t", names=["Class", "Message"])
data.head(8)  # View the first 8 records of our dataset
The mails are categorized into 2 classes, i.e., spam and ham.
# Let's see the count of each class
data["Class"].value_counts()
# Let's assign ham as 1
data.loc[data["Class"] == "ham", "Class"] = 1
# Let's assign spam as 0
data.loc[data["Class"] == "spam", "Class"] = 0
# The default list of punctuation characters
string.punctuation
# Let's remove the punctuation
def remove_punct(text):
    text1 = "".join([c for c in text if c not in string.punctuation])
    return text1
data["text_clean"] = data["Message"].apply(remove_punct)
# CountVectorizer is used to convert text to numerical data.
# Initialize the CountVectorizer object
CV = CountVectorizer(stop_words="english")
Splitting x and y
xSet = data["text_clean"].values
ySet = data["Class"].values
Splitting Train and Test Data
xSet_train,xSet_test,ySet_train,ySet_test = train_test_split(xSet,ySet,test_size=0.2, random_state=10)
xSet_train_CV = CV.fit_transform(xSet_train)
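# Text preprocessing and feature vectorizer
# TfidfVectorizer (imported above) is an alternative to CountVectorizer that weights counts by TF-IDF
tf = TfidfVectorizer()  # object creation
xSet_train_TF = tf.fit_transform(xSet_train)  # fitting and transforming the training text into vectors
# Training the model (here on the CountVectorizer features)
# The labels were assigned into an object-dtype column, so cast them to int for scikit-learn
ySet_train, ySet_test = ySet_train.astype(int), ySet_test.astype(int)
# Model object creation
NB = MultinomialNB()
# Fitting the model
NB.fit(xSet_train_CV, ySet_train)
# Getting the prediction
xSet_test_CV = CV.transform(xSet_test)  # reuse the fitted vocabulary, do not refit on test data
ySet_pred = NB.predict(xSet_test_CV)
# Evaluating the model
from sklearn.metrics import classification_report, confusion_matrix
print(accuracy_score(ySet_test, ySet_pred))
print(confusion_matrix(ySet_test, ySet_pred))
print(classification_report(ySet_test, ySet_pred))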
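msg = input("Enter Message: ")  # to get the input message
msgInput = CV.transform([msg])  # vectorize it with the already-fitted CountVectorizer
predict = NB.predict(msgInput)  # 1 = ham, 0 = spam
print("Ham" if predict[0] == 1 else "Spam")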
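For readers who do not have the spam.tsv file at hand, the same vectorize, fit and predict flow can be tried on a handful of inline messages. This is only a sketch: the six messages are invented for illustration, and the labels follow the convention used above (1 = ham, 0 = spam).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, not taken from the real dataset.
messages = [
    "win free prize now",
    "free cash offer",
    "claim your free prize",
    "are we meeting today",
    "see you at lunch",
    "call me when you are home",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = spam, 1 = ham

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(messages)  # learn the vocabulary and count words
model = MultinomialNB().fit(X_train, labels)


def classify(message):
    """Vectorize a new message with the fitted vocabulary and predict its class."""
    features = vectorizer.transform([message])
    return "ham" if model.predict(features)[0] == 1 else "spam"


print(classify("free prize offer"))  # spam
print(classify("see you today"))     # ham
```

A corpus this small says nothing about real-world accuracy; it only shows how the pieces fit together.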