Using cross_val_score in sklearn, simply explained

Cross_val_score is a common function to use during the testing and validation phase of your machine learning model development. In this post I will explain what it is, what you can use it for, and how to implement it in Python.

Stephen Allwright

What is cross_val_score in sklearn?

Cross_val_score is a function in the scikit-learn package which trains and tests a model over multiple folds of your dataset. This cross validation method gives you a better understanding of model performance over the whole dataset instead of just a single train/test split.

The process that cross_val_score uses is typical for cross validation and follows these steps (a code sketch of the equivalent manual loop follows the illustration below):

  1. The number of folds is defined (by default this is 5)
  2. The dataset is split up according to these folds, where each fold has a unique set of testing data
  3. A model is trained and tested for each fold
  4. Each fold returns a metric for its test data
  5. The mean and standard deviation of these metrics can then be calculated to provide a single metric for the process

An illustration of how this works:

[Figure: the cross_val_score cross-validation process]
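
For intuition, here is a minimal sketch of roughly what cross_val_score does for a regressor, written as an explicit KFold loop (the real implementation also handles scorers, stratification, and parallelism; the sklearn diabetes dataset is used purely for illustration):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Each fold has a unique test set; the remaining folds form the training set
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # default metric: R^2

print(f"mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")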

What is cross_val_score used for?

Cross_val_score is used as a simple cross validation technique to prevent over-fitting and promote model generalisation.

The typical process of model development is to train a model on one portion of the data and then test it on another. But how do we know that this single test dataset is representative? This is why we use cross_val_score, and cross validation more generally: to train and test our model on multiple folds, so we can be confident our model generalises well across the whole dataset and not just a single portion.

If we see that the metrics for all folds in cross_val_score are uniform, then it can be concluded that the model is able to generalise. However, if there are significant differences between them, this may indicate over-fitting to certain folds and would need to be investigated further.

How many folds should I use in cross_val_score?

By default, cross_val_score uses a 5-fold strategy; however, this can be adjusted via the cv parameter.

But how many folds should you choose?

There are unfortunately no hard and fast rules when it comes to how many folds you should choose. A general rule of thumb, though, is to use as many folds as possible while still leaving each split with enough observations for the model to learn from and be tested on.
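
As a rough illustration of trying different fold counts (the dataset here is just an example):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Larger cv means more folds, each with a smaller test set
for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k)
    print(f"cv={k}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")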

Can I train my model using cross_val_score?

A common question developers have is whether cross_val_score can also function as a way of training the final model. Unfortunately, this is not the case. Cross_val_score is a way of assessing a model and its parameters, and cannot be used for final training. Final training should take place on all available data, with performance checked on a set of data that has been held back from the start.
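
A minimal sketch of that workflow, with a test set held back from the start (the split size and dataset are illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold back a final test set before any model assessment
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

# cross_val_score assesses the model on the development data only
print("CV mean R^2:", cross_val_score(model, X_dev, y_dev, cv=5).mean())

# Final training is a separate step, on all development data
model.fit(X_dev, y_dev)
print("Held-out test R^2:", model.score(X_test, y_test))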

Can I use cross_val_score for classification and regression?

Cross_val_score is a function which can be used for both classification and regression models. The only major difference between the two is that, by default, cross_val_score uses StratifiedKFold for classification and plain KFold for regression.
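
For example, the same call works unchanged for a classifier, where an integer cv is interpreted as stratified folds (a sketch using the built-in iris dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 means StratifiedKFold here, because the estimator is a classifier
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5))  # one accuracy score per fold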

Which metrics can I use in cross_val_score?

By default cross_val_score uses the chosen model's default scoring metric, but this can be overridden with your metric of choice via the scoring parameter. Common options include the following (an example follows the list):

  • ‘accuracy’
  • ‘balanced_accuracy’
  • ‘roc_auc’
  • ‘f1’
  • ‘neg_mean_absolute_error’
  • ‘neg_root_mean_squared_error’
  • ‘r2’
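
For instance, a regression model can be scored on mean absolute error instead of its default R^2. Note that sklearn negates error metrics so that higher is always better, hence the 'neg_' prefix (dataset chosen for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Override the default metric via the scoring parameter
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", -scores)  # flip the sign back to positive errors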

How to implement cross_val_score in Python

  1. Create a dataset
  2. Run hyper-parameter tuning
  3. Create model object with desired parameters
  4. Run cross_val_score to test model performance
  5. Train final model on full dataset

Therefore, in order to use this function we need to first have an idea of the model we want to use and a prepared dataset to test it on. Let’s look at how this process would look in Python using a Linear Regression model and the Diabetes dataset from sklearn:
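
A minimal sketch of these steps might look as follows (plain linear regression has no hyper-parameters to tune, so that step is skipped here):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 1. Create a dataset
X, y = load_diabetes(return_X_y=True)

# 2-3. Create the model object with the desired parameters
model = LinearRegression()

# 4. Run cross_val_score to test model performance
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", scores)
print("Mean:", np.mean(scores), "Std:", np.std(scores))

# 5. Train the final model on the full dataset
model.fit(X, y)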

Function parameters for cross_val_score

  • estimator — The model object to use to fit the data
  • X — The data to fit the model on
  • y — The target of the model
  • scoring — The error metric to use
  • cv — The number of splits to use
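
Putting these parameters together in a single call (a sketch; Ridge is just an example estimator):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

scores = cross_val_score(
    estimator=Ridge(),  # the model object used to fit the data
    X=X,                # the data to fit the model on
    y=y,                # the target of the model
    scoring="r2",       # the error metric to use
    cv=5,               # the number of splits to use
)
print(scores)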

Summary of the cross_val_score function

Cross_val_score is a function which runs cross validation on a dataset to test whether the model can generalise over the whole dataset. It returns an array with one score per split, and the average of these scores can be calculated to provide a single metric value for the dataset. This is a function and a technique which you should add to your workflow to make sure you are developing highly performant models.


I’m a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I’ve picked up along the way.


sklearn.model_selection.cross_val_score

sklearn.model_selection.cross_val_score(estimator, X, y=None, *, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)

Evaluate a score by cross-validation.

Parameters:

estimator : estimator object implementing 'fit'

The object to use to fit the data.

X : array-like of shape (n_samples, n_features)

The data to fit. Can be for example a list, or an array.

y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

The target variable to try to predict in the case of supervised learning.

groups : array-like of shape (n_samples,), default=None

Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a “Group” cv instance (e.g., GroupKFold ).

scoring : str or callable, default=None

A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y) which should return only a single value.

Similar to cross_validate but only a single metric is permitted.

If None , the estimator’s default scorer (if available) is used.

cv : int, cross-validation generator or an iterable, default=None

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None , to use the default 5-fold cross validation,
  • int, to specify the number of folds in a (Stratified)KFold ,
  • CV splitter ,
  • An iterable that generates (train, test) splits as arrays of indices.

For int / None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

Refer to the User Guide for the various cross-validation strategies that can be used here.

Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.

n_jobs : int, default=None

Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbose : int, default=0

The verbosity level.

fit_params : dict, default=None

Parameters to pass to the fit method of the estimator.

pre_dispatch : int or str, default='2*n_jobs'

Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

  • None , in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
  • An int, giving the exact number of total jobs that are spawned
  • A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’

error_score : 'raise' or numeric, default=np.nan

Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised.

Returns:

scores : ndarray of float of shape=(len(list(cv)),)

Array of scores of the estimator for each run of the cross validation.

See Also:

cross_validate : To run cross-validation on multiple metrics and also to return train scores, fit times and score times.

cross_val_predict : Get predictions from each split of cross-validation for diagnostic purposes.

make_scorer : Make a scorer from a performance metric or loss function.

Examples:

>>> from sklearn import datasets, linear_model
>>> from sklearn.model_selection import cross_val_score
>>> diabetes = datasets.load_diabetes()
>>> X = diabetes.data[:150]
>>> y = diabetes.target[:150]
>>> lasso = linear_model.Lasso()
>>> print(cross_val_score(lasso, X, y, cv=3))
[0.3315057  0.08022103 0.03531816]
