# Resampling Methods in Machine Learning

## Introduction

Real-world data is complex, and if a machine-learning algorithm fails to identify the hidden pattern, chances are high that we have built a model that overfits or underfits, as judged by its performance on test (unseen) data.

In regression, along with the R-squared score, we check the predicted R-squared and adjusted R-squared scores to confirm that the model also generalizes to test data. Similarly, when solving classification problems, we check precision, recall, and the Type I and Type II error rates to see how many mispredictions have been made.

When you have few features and a simple dataset, it is easy to get good scores (not just accuracy), but real datasets are complex and have many features, so our algorithm should be able to find the underlying hidden pattern. If we simply expose our entire data directly to the algorithm, there is a high chance it will not generalize to test data. The question is how to handle this problem, and the answer is resampling techniques. Sampling is the process of collecting samples for a domain problem; resampling is the process of creating subsamples from the samples collected.

Resampling is an indispensable tool of modern statistical analysis. It involves repeatedly drawing samples from the training data and refitting the model of interest on each sample in order to obtain additional information about the fitted model.

Now, let’s focus on the resampling methods available.

**The major types of resampling are:**

- Cross-Validation
- Bootstrap Resampling

Let’s understand each of these methods.

### Cross-Validation

Cross-validation is a process of evaluating models on different training subsets. It can be used to estimate how an algorithm will perform on unseen data and to compare two algorithms’ flexibility on a given dataset.

There are several cross-validation techniques; the most common ones are discussed below.

**HoldOut Method**

In this method, the original data is randomly divided into a training set and a test set, where the test data is known as the holdout data. The algorithm is fitted and trained on the training set. Once training is complete, the model is evaluated on the test (holdout) data.

In the sklearn library, it is available in the `model_selection` module, and the import can be done as

`from sklearn.model_selection import train_test_split`

and the train and test sets can be obtained as

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)`
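Putting this together, a minimal holdout-validation sketch might look like the following (the iris dataset and logistic regression are illustrative choices, not prescribed by the method):

```python
# Minimal holdout-validation sketch: split once, fit on the training
# split, then score on the held-out test split.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)               # train only on the training split
print(model.score(X_test, y_test))        # accuracy on the unseen holdout data
```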

When using this validation scheme, it is not clear which part of the data ends up being used for training: one subset of the data may be heavily represented in training while the test set contains entirely different samples, so the score depends on the particular random split. This is the main drawback of train_test_split. Let’s explore another resampling method.

**Leave-One-Out Cross-Validation**

It overcomes the drawback of the holdout method. The set of observations is still split into two parts, but instead of creating two subsets of comparable size, a single observation (x1, y1) is used as the validation set, and the remaining observations {(x2, y2), …, (xn, yn)} make up the training set.

So instead of training the model on n samples, training happens on n-1 samples while the single remaining sample is used to test the model; because of this, it provides an approximately unbiased estimate of test error. Training is repeated so that each sample is held out once, and the errors on the test samples are averaged, leading to nearly unbiased results.

`scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)` (regression)

`scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)` (classification)
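A complete, runnable LOOCV sketch, with the `cv` splitter defined explicitly via `LeaveOneOut` (the iris dataset and logistic regression are illustrative choices):

```python
# LOOCV sketch: each of the n samples is held out exactly once, so
# cross_val_score returns one score per observation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = LeaveOneOut()

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print(len(scores))     # one score per observation: 150 for iris
print(scores.mean())   # averaged LOOCV accuracy
```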

**k-Fold Cross-Validation**

This is an alternative to LOOCV. Here the set of observations is randomly divided into k groups, or folds, of roughly equal size. One fold is treated as the validation set while the other k-1 folds are used to train the model, and a score is calculated on the held-out fold. This procedure is repeated k times, with each fold held out once, and the final score is the average of the k held-out-fold scores.

It gives a good bias-variance trade-off at k=5 or k=10.
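A minimal k-fold sketch with k=5, using sklearn’s `KFold` splitter (the dataset, model, and shuffling settings are illustrative):

```python
# k-fold sketch: the data is split into 5 folds; each fold is held out
# once and scored, and the fold scores are averaged.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print(scores)          # one accuracy value per fold
print(scores.mean())   # final cross-validated score
```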

### Bootstrap

Bootstrap sampling is a method that involves repeatedly drawing sample data with replacement from a data source in order to estimate a population parameter.

The bagging technique in the ensemble approach is based on the bootstrap: because samples are drawn with replacement, the same observation can reappear in the data used to build each model, and averaging over the resulting models reduces variance.

## Comparison of all resampling techniques

| | Holdout Validation | Leave-One-Out Cross-Validation | k-Fold Cross-Validation | Bootstrap |
|---|---|---|---|---|
| When to use | Data with few features | Medium-sized dataset | Large dataset with many features | Any shape of data |
| Bias | High chance of bias | Less bias | Good bias-variance trade-off | Good bias-variance trade-off |
| Generalization | High chance of overfitting | Generalized model | Generalized model | Generalized model |

**DataMites** provides a Machine Learning with Python course. Join now to become an expert in machine learning.