# Resampling Methods in Machine Learning

## Introduction

Real-world data is complex, and if a machine-learning algorithm fails to identify the hidden pattern, chances are high that we have built a model that overfits or underfits, as judged by its performance on test (unseen) data.

In regression, along with the R-squared score, we check the predicted R-squared and adjusted R-squared scores to confirm that the model also generalizes to test data. Similarly, when solving classification problems, we check precision, recall, and the Type I and Type II error rates to see how many mispredictions have been made.

When you have few features and a simple dataset, it is easy to get good scores (not just accuracy), but real datasets are complex and have many features, so our algorithm should be able to find the underlying hidden pattern. If we simply expose our entire data directly to the algorithm, there is a high chance it will not generalize to test data. The question is how to handle this problem, and the answer is resampling techniques. Sampling is the process of collecting samples for a domain problem; resampling is the process of creating subsamples from the samples collected.

Resampling is an indispensable tool of modern statistical analysis. It involves repeatedly drawing samples from the training data and refitting the model of interest on each sample in order to obtain additional information about the fitted model.

Now, let’s focus on the resampling methods available.

**The major types of resampling are:**

- Cross-Validation
- Bootstrap Resampling

Let’s understand each of these methods.

### Cross-Validation

Cross-validation is a process of evaluating models on different training subsets. It can be used to estimate how an algorithm will perform on unseen data and to compare two algorithms’ flexibility on a given dataset.

There are several cross-validation techniques; the most common ones are discussed below.

**HoldOut Method**

In this method, the original data is randomly divided into a training set and a test set, where the test data is known as the holdout data. The algorithm is fitted and trained on the training set. Once training is complete, the model is evaluated on the test (holdout) data.

In the sklearn library, it is available in the `model_selection` module, and the import can be done as

`from sklearn.model_selection import train_test_split`

and the train and test sets can be obtained as

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)`
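Putting this together, a minimal holdout-validation sketch might look like the following (the iris dataset and logistic regression are illustrative choices, not prescribed by the method):

```python
# Minimal holdout-validation sketch: split once, fit on the training
# split, then score on the held-out test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # train only on the training split
print(model.score(X_test, y_test))        # accuracy on the unseen holdout data
```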

When using this validation scheme, it is not clear which part of the data ends up being used for training: one subset of the data may be heavily represented in training while the test set contains entirely different samples, so the score depends on the particular random split. This is the main drawback of train_test_split. Let’s explore another resampling method.

**Leave-One-Out Cross-Validation**

It overcomes the drawback of the holdout method. The set of observations is still split into two parts, but instead of creating two subsets of comparable size, a single observation (x1, y1) is used as the validation set, and the remaining observations {(x2, y2), …, (xn, yn)} make up the training set.

So instead of training the model on n samples, training happens on n-1 samples while the single remaining sample is used to test the model; because of this, it provides an approximately unbiased estimate of test error. Training is repeated so that each sample is held out once, and the errors on the test samples are averaged, leading to nearly unbiased results.

`scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)` (regression)

`scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)` (classification)
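A complete, runnable LOOCV sketch, with the `cv` splitter defined explicitly via `LeaveOneOut` (the iris dataset and logistic regression are illustrative choices):

```python
# LOOCV sketch: each of the n samples is held out exactly once, so
# cross_val_score returns one score per observation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = LeaveOneOut()

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(len(scores))     # one score per observation: 150 for iris
print(scores.mean())   # averaged LOOCV accuracy
```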

**k-Fold Cross-Validation**

This is an alternative to LOOCV. Here the set of observations is randomly divided into k groups, or folds, of roughly equal size. One fold is treated as the validation set while the other k-1 folds are used to train the model, and a score is calculated on the held-out fold. This procedure is repeated k times, with each fold held out once, and the final score is the average of the k held-out-fold scores.

It gives a good bias-variance trade-off at k=5 or k=10.
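A minimal k-fold sketch with k=5, using sklearn’s `KFold` splitter (the dataset, model, and shuffling settings are illustrative):

```python
# k-fold sketch: the data is split into 5 folds; each fold is held out
# once and scored, and the fold scores are averaged.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print(scores)          # one accuracy value per fold
print(scores.mean())   # final cross-validated score
```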

### Bootstrap

Bootstrap sampling is a method that involves repeatedly drawing sample data with replacement from a data source in order to estimate a population parameter.

The bagging technique in the ensemble approach is based on the bootstrap: because samples are drawn with replacement, the same observation can reappear in the data used to build each model, and averaging over the resulting models reduces variance.

## Comparison of all resampling techniques

| | Holdout Validation | Leave-One-Out Cross-Validation | k-Fold Cross-Validation | Bootstrap |
|---|---|---|---|---|
| When to use | Data with few features | Medium-sized dataset | Large dataset with many features | Any shape of data |
| Bias | High chance of bias | Less bias | Good bias-variance trade-off | Good bias-variance trade-off |
| Generalization | High chance of overfitting | Generalized model | Generalized model | Generalized model |

**DataMites** provides a Machine Learning with Python course. Join now to become an expert in machine learning.