Stratified K-Fold in Python

Stratified K-Fold Cross-Validation

In machine learning, when we want to train a model, we typically split the entire dataset into a training_set and a test_set using the train_test_split() function from sklearn. We then train the model on the training_set and evaluate it on the test_set. This approach has two problems:

Whenever we change the random_state parameter of train_test_split(), we get a different accuracy, so we cannot pin down a single accuracy figure for the model.
By default, train_test_split() splits the dataset into training_set and test_set by random sampling; stratified sampling is not performed.
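
Both problems can be observed on a toy dataset. The sketch below is only an illustration: the 80/20 label array, the placeholder feature column, and the range of seeds are assumptions chosen for the demonstration. It counts how many positive samples land in the test set for several random_state values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data: 80 negative (0) and 20 positive (1) labels.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

positives_in_test = []
for seed in range(10):
    # Plain random split: no stratify argument is passed.
    _, _, _, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)
    positives_in_test.append(int(y_test.sum()))

# A perfectly proportional split would put 4 positives in every test set.
print(positives_in_test)
```

The counts typically vary from seed to seed, which is one reason the measured accuracy changes with random_state.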

What are random sampling and stratified sampling?
Suppose you want to run a survey and decide to call 1000 people from a particular state. If you pick them purely at random, you might end up with 1000 males, 1000 females, or 900 females and 100 males. A sample that skewed does not reflect the state's population, so you cannot infer the opinion of the entire state on your product from those 1000 answers. This is the risk of random sampling.
In stratified sampling, the sample is drawn so that it mirrors the population. Suppose the state's population is 51.3% male and 48.7% female. To choose 1000 people, you pick 513 males (51.3% of 1000) and 487 females (48.7% of 1000), for a total of 1000 people. This group represents the entire state. This is called stratified sampling.

Why is random sampling not preferred in machine learning?
Let's consider a binary classification problem. Suppose our dataset consists of 100 samples, of which 80 belong to the negative class (0) and 20 belong to the positive class (1).

Random sampling:
Suppose we split the dataset into training_set and test_set in an 8:2 ratio by random sampling. In the worst case, all 80 negative samples land in the training_set and all 20 positive samples land in the test_set. If we then train the model on the training_set and evaluate it on the test_set, we will obviously get a bad accuracy score, because the model has never seen a positive example.

Stratified Sampling:
In stratified sampling, the training_set consists of 64 negative samples (80% of 80) and 16 positive samples (80% of 20), i.e. 64 + 16 = 80 samples, which preserves the class proportions of the original dataset. Similarly, the test_set consists of 16 negative samples (20% of 80) and 4 positive samples (20% of 20), i.e. 16 + 4 = 20 samples, again in the same proportions. A split like this gives a more reliable accuracy estimate.
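
The 64/16 and 16/4 split described above can be reproduced with train_test_split() by passing the labels to its stratify parameter. This is a minimal sketch: the label array, the placeholder features, and random_state=0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed toy data matching the example: 80 negatives (0), 20 positives (1).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y asks for the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_counts = np.bincount(y_train)  # samples per class in the training set
test_counts = np.bincount(y_test)    # samples per class in the test set
print("train:", train_counts)  # train: [64 16]
print("test: ", test_counts)   # test:  [16  4]
```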

What is the solution to these problems?
The solution to the first problem, where different random_state values give different accuracy scores, is K-Fold Cross-Validation. But K-Fold Cross-Validation still suffers from the second problem, because its folds are not stratified.
The solution to both problems is Stratified K-Fold Cross-Validation.

What is Stratified K-Fold Cross Validation?
Stratified k-fold cross-validation is the same as k-fold cross-validation, except that it builds the folds with stratified sampling instead of random sampling.
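
The difference can be seen by counting the positive samples in each test fold. The sketch below assumes a toy label array sorted by class (80 negatives followed by 20 positives) and a hypothetical helper, positives_per_test_fold, written only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed toy labels, sorted by class: 80 negatives then 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

def positives_per_test_fold(splitter):
    """Count positive samples in each test fold produced by the splitter."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]

kfold_counts = positives_per_test_fold(KFold(n_splits=5))
stratified_counts = positives_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", kfold_counts)
print("StratifiedKFold:", stratified_counts)
```

With these unshuffled, class-sorted labels, KFold places all 20 positives in the last test fold, while StratifiedKFold places 4 positives in every test fold.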

Code: Python code implementation of Stratified K-Fold Cross-Validation
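
A minimal sketch of such an implementation follows. The dataset (the breast cancer data bundled with sklearn), the StandardScaler plus LogisticRegression pipeline, and the choice of 10 shuffled folds are illustrative assumptions, not requirements of the method.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed example data: a binary classification dataset shipped with sklearn.
X, y = load_breast_cancer(return_X_y=True)

# 10 stratified folds; shuffling with a fixed seed keeps the run reproducible.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scores = []
for train_index, test_index in skf.split(X, y):
    # Fit a fresh model per fold so no information leaks between folds.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_index], y[train_index])
    scores.append(model.score(X[test_index], y[test_index]))

scores = np.array(scores)
print("Accuracy per fold:", np.round(scores, 3))
print("Mean accuracy:", scores.mean())
print("Standard deviation:", scores.std())
```

Reporting the mean and standard deviation over the folds gives one summary figure instead of a score that depends on a single split.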

sklearn.model_selection.StratifiedKFold

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Parameters : n_splits int, default=5

Number of folds. Must be at least 2.

Changed in version 0.22: n_splits default value changed from 3 to 5.

shuffle bool, default=False

Whether to shuffle each class's samples before splitting into batches. Note that the samples within each split will not be shuffled.

random_state int, RandomState instance or None, default=None

When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See the Glossary.

See also : RepeatedStratifiedKFold

Repeats Stratified K-Fold n times.

The implementation is designed to:

  • Generate test sets such that all contain the same distribution of classes, or as close as possible.
  • Be invariant to class label: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
  • Preserve order dependencies in the dataset ordering, when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
  • Generate test sets where the smallest and largest differ by at most one sample.

Changed in version 0.22: The previous implementation did not follow the last constraint.
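
The size constraint can be checked on a dataset whose size is not divisible by the number of folds. This sketch assumes scikit-learn 0.22 or later and a made-up label array of 10 and 7 samples per class.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy labels: 10 samples of class 0 and 7 of class 1 (17 in total).
y = np.array([0] * 10 + [1] * 7)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=3)
test_sizes = []
class_counts = []
for _, test_index in skf.split(X, y):
    test_sizes.append(len(test_index))
    # Samples per class inside this test fold.
    class_counts.append(np.bincount(y[test_index], minlength=2).tolist())

print("test fold sizes:", test_sizes)
print("class counts per test fold:", class_counts)
```

Since 17 samples cannot be divided evenly into 3 folds, the test folds cannot all be the same size; the constraint says the largest and smallest differ by at most one sample.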

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2)
>>> skf.get_n_splits(X, y)
2
>>> print(skf)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
>>> for i, (train_index, test_index) in enumerate(skf.split(X, y)):
...     print(f"Fold {i}:")
...     print(f"  Train: index={train_index}")
...     print(f"  Test:  index={test_index}")
Fold 0:
  Train: index=[1 3]
  Test:  index=[0 2]
Fold 1:
  Train: index=[0 2]
  Test:  index=[1 3]

Methods

get_metadata_routing () Get metadata routing of this object.
get_n_splits ([X, y, groups]) Returns the number of splitting iterations in the cross-validator
split (X, y[, groups]) Generate indices to split data into training and test set.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns : routing MetadataRequest

A MetadataRequest encapsulating routing information.

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator

Parameters : X object

Always ignored, exists for compatibility.

y object

Always ignored, exists for compatibility.

groups object

Always ignored, exists for compatibility.

Returns : n_splits int

Returns the number of splitting iterations in the cross-validator.

split(X, y, groups=None)

Generate indices to split data into training and test set.

Parameters : X array-like of shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

y array-like of shape (n_samples,)

The target variable for supervised learning problems. Stratification is done based on the y labels.

groups object

Always ignored, exists for compatibility.

Yields : train ndarray

The training set indices for that split.

test ndarray

The testing set indices for that split.
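
The note that y alone is sufficient to generate the splits can be verified by comparing the folds obtained with real features against those obtained with an np.zeros placeholder. The arrays below are made-up values used only for this check.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumed toy data: 8 samples, 2 features, balanced binary labels.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_real = np.arange(16).reshape(8, 2)
X_placeholder = np.zeros(len(y))  # only the number of samples matters

skf = StratifiedKFold(n_splits=2)
real = [(tr.tolist(), te.tolist()) for tr, te in skf.split(X_real, y)]
placeholder = [(tr.tolist(), te.tolist()) for tr, te in skf.split(X_placeholder, y)]

# The indices depend on y, not on the feature values.
identical = real == placeholder
print(identical)  # True
```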

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

sklearn.model_selection.StratifiedKFold (documentation for an older release, before version 0.22)

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

n_splits : int, default=3

Number of folds. Must be at least 2.

Changed in version 0.20: n_splits default value will change from 3 to 5 in v0.22.

shuffle : boolean, optional

Whether to shuffle each stratification of the data before splitting into batches.

random_state : int, RandomState instance or None, optional, default=None

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Used when shuffle == True.

RepeatedStratifiedKFold Repeats Stratified K-Fold n times.
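
A short sketch of RepeatedStratifiedKFold, reusing the four-sample arrays from the StratifiedKFold examples; n_repeats=3 and random_state=0 are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

# 2 stratified folds, repeated 3 times with different shuffles.
rskf = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=0)

n_iterations = rskf.get_n_splits(X, y)  # n_splits * n_repeats
splits = list(rskf.split(X, y))

print(n_iterations)  # 6
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```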

Notes

Train and test sizes may be different in each fold, with a difference of at most n_classes.

Examples

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2)
>>> skf.get_n_splits(X, y)
2
>>> print(skf)
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in skf.split(X, y):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]

Methods

get_n_splits ([X, y, groups]) Returns the number of splitting iterations in the cross-validator
split (X, y[, groups]) Generate indices to split data into training and test set.

__init__(n_splits='warn', shuffle=False, random_state=None)

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator

X : object

Always ignored, exists for compatibility.

y : object

Always ignored, exists for compatibility.

groups : object

Always ignored, exists for compatibility.

n_splits : int

Returns the number of splitting iterations in the cross-validator.

split(X, y, groups=None)

Generate indices to split data into training and test set.

X : array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

y : array-like, shape (n_samples,)

The target variable for supervised learning problems. Stratification is done based on the y labels.

groups : object

Always ignored, exists for compatibility.

train : ndarray

The training set indices for that split.

test : ndarray

The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
