Overfitting and Underfitting in Machine Learning Algorithms


You have probably heard the terms overfitting and underfitting many times when testing a machine learning model. So let’s understand each of them.

Consider a course that includes a syllabus, students and an examination.

Think of the syllabus as the features (independent variables) of the dataset, the content of the syllabus as the training dataset, the student as the model and the examination as the test dataset.

CASE 1:- When a student prepares well from the given syllabus and achieves good marks in the examination, we say the student performed well.

Machine Learning:- When the training metrics are good and the test metrics are equally good or close to the training metrics, we say the model generalizes well to unseen data.

CASE 2:- When the student prepared well but failed or scored poorly in the examination, we say the student is not performing well.

Machine Learning:- When the training metrics are good but the test metrics are much worse, the model is OVERFITTING. It has fitted the training data too closely and fails to generalize to new data.
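To make the overfitting case concrete, here is a minimal sketch of a "student" who only memorizes. The MemorizingModel class, the accuracy helper and the tiny even/odd dataset are all made up for this illustration; they are not taken from any library.

```python
# Toy illustration of overfitting: a "model" that memorizes its training data.
# MemorizingModel, accuracy and the data below are hypothetical, for this sketch only.

class MemorizingModel:
    """Stores every training example and recalls labels by exact lookup."""

    def fit(self, X, y):
        self.memory = dict(zip(X, y))
        # Fallback for inputs never seen in training: the most common training label.
        self.default = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.memory.get(x, self.default) for x in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# The true rule: the label is 1 for even numbers and 0 for odd numbers.
X_train = [1, 3, 5, 7, 9, 2, 11, 13]
y_train = [int(x % 2 == 0) for x in X_train]
X_test = [4, 6, 8, 10, 15, 17]
y_test = [int(x % 2 == 0) for x in X_test]

model = MemorizingModel().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The model scores perfectly on the data it has already seen, but on new numbers it can only fall back to its most common answer, so the test accuracy is far lower. It memorized the answers without learning the even/odd rule.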

CASE 3:- When a student didn’t even prepare the syllabus and did not achieve good scores in the examination, we say the student’s performance is poor.

Machine Learning:- When both the training metrics and the test metrics are poor, we say the model is UNDERFITTED.
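The three cases above can be summed up as a rough rule of thumb in code. This is only a sketch: the function name diagnose_fit and the thresholds good_score and max_gap are assumptions chosen for illustration, and sensible values depend on the metric and the problem.

```python
def diagnose_fit(train_score, test_score, good_score=0.80, max_gap=0.10):
    """Label a model from its train and test scores (higher is better).

    good_score and max_gap are illustrative assumptions, not standard values.
    """
    if train_score < good_score:
        # The model does not even do well on data it was trained on.
        return "underfitting"
    if train_score - test_score > max_gap:
        # Good on training data, clearly worse on unseen data.
        return "overfitting"
    return "good generalization"


print(diagnose_fit(0.95, 0.93))  # CASE 1
print(diagnose_fit(0.99, 0.70))  # CASE 2
print(diagnose_fit(0.60, 0.58))  # CASE 3
```

With these example scores, the three calls correspond to CASE 1, CASE 2 and CASE 3 in that order.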

CASE 4:- Suppose a student prepared the whole syllabus, but before the exam someone gave him/her the questions that were going to be asked. In this case the student will almost certainly score well on that exam. But when a new test paper is given, he/she may perform worse than in the leaked exam. This is called leakage of the question paper.

Machine Learning:- When information from the test set, or information about the target variable, is exposed to the model during training, the scenario is known as data leakage. It happens for 2 main reasons:

  1. Target or label leakage
  2. Train and test set contamination.
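Both causes can be checked for in a crude way. The sketch below uses two hypothetical helpers, columns_equal_to_target and find_contamination, on made-up data. Exact matching like this only catches the most obvious cases; real leakage is often subtler.

```python
def columns_equal_to_target(features, target):
    """Target leakage check: names of feature columns that exactly copy the target."""
    return [name for name, values in features.items() if values == target]


def find_contamination(train_rows, test_rows):
    """Contamination check: test rows that also appear in the training rows."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Made-up example data for illustration only.
features = {"age": [25, 40, 31], "paid_flag": [1, 0, 1]}
target = [1, 0, 1]
print(columns_equal_to_target(features, target))

train_rows = [(5.1, 3.5, "a"), (4.9, 3.0, "a"), (6.2, 2.9, "b")]
test_rows = [(4.9, 3.0, "a"), (5.9, 3.2, "b")]
leaked = find_contamination(train_rows, test_rows)
print(leaked)
```

Here the paid_flag column is a copy of the target, and one test row is also present in the training rows, so both checks flag a problem.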

Data leakage is a big concept in itself and deserves a separate article.
