  • Hackers Realm

Mastering Cross-Validation Techniques in Python

Updated: 3 days ago

Cross-validation techniques in Python are fundamental tools in machine learning and statistics. They address one of the key challenges in data analysis: how to check that a model generalizes well to unseen data. By repeatedly evaluating a model on data it was not trained on, cross-validation gives a more reliable estimate of its performance and helps expose issues like overfitting and underfitting.

KFold Cross Validation and Repeated Stratified KFold Cross Validation

In this exploration of cross-validation techniques, we will delve into the various methods and strategies used to ensure reliable model evaluation, discussing their advantages, disadvantages, and when to employ them.


You can watch the video-based tutorial with step by step explanation down below.


1) KFold Cross Validation


KFold cross-validation is a widely used technique for evaluating machine learning models. The dataset is partitioned into K subsets (folds); the model is trained on K-1 folds and tested on the remaining one, and this is repeated K times so that each fold serves as the test set once. It is particularly useful when data is limited, because every sample contributes to both training and testing.


Let us delve into the mechanics of this technique, its advantages, and its applications.
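The mechanics can be seen on a tiny example before touching the real dataset. The following is a minimal sketch, assuming ten toy samples (the array 'X_toy' is illustrative and not part of the bike sharing data); it prints which row indices land in the training and test part of each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, indexed 0..9 (illustrative data, not the bike sharing set).
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

test_indices = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy), start=1):
    # Each fold holds out 2 samples for testing and trains on the other 8.
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    test_indices.extend(test_idx.tolist())

# Across the 5 folds, every sample is held out exactly once.
print(sorted(test_indices))
```

Each sample appears in exactly one test fold, which is what lets the K scores be averaged into a single estimate.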


Load the Dataset


First, we will load the data.

import pandas as pd

df = pd.read_csv('data/bike sharing dataset.csv')
df = df.drop(columns=['instant', 'dteday', 'casual', 'registered'], axis=1)
df.head()
First 5 rows of the dataframe
  • df = pd.read_csv('data/bike sharing dataset.csv'): This line reads a CSV file named 'bike sharing dataset.csv' from the 'data' directory and stores it as a DataFrame called 'df'. This is a common way to import data into Python for analysis using the pandas library.

  • df = df.drop(columns=['instant', 'dteday', 'casual', 'registered'], axis=1): This line removes specific columns from the DataFrame 'df' using the 'drop' method. The columns being dropped are 'instant', 'dteday', 'casual', and 'registered'. Because the 'columns' keyword already says that columns (as opposed to rows) are being dropped, the 'axis=1' parameter is redundant here, though harmless.

  • df.head(): Finally, 'df.head()' is used to display the first few rows of the modified DataFrame. This is a common practice to quickly inspect the dataset and verify that the columns you wanted to remove have been successfully dropped.


Split input and output


Next we will prepare data for predicting the target variable 'cnt' based on the remaining columns in your DataFrame 'df'.

X = df.drop(columns=['cnt'], axis=1)
y = df['cnt']
  • X = df.drop(columns=['cnt'], axis=1): This line creates a new DataFrame 'X' by removing the 'cnt' column from your original DataFrame 'df'. As before, 'columns=' already identifies what is dropped, so 'axis=1' is redundant but harmless.

  • y = df['cnt']: Here, you are creating a Series 'y' which contains the values of the 'cnt' column from the original DataFrame 'df'. This Series 'y' represents your target variable, the one you want to predict.

  • By splitting your data into 'X' (features) and 'y' (target), you are setting up your dataset for a typical supervised machine learning problem, where 'X' contains the input features, and 'y' contains the corresponding target values.


Implement KFold cross-validation with a Random Forest Regressor model


Next we will perform KFold cross-validation with a Random Forest Regressor model.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, random_state=42, shuffle=True)
model = RandomForestRegressor()
# With no 'scoring' argument, a regressor is scored with R^2 (higher is better)
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
print(f"R2 Mean: {np.mean(scores)} R2 Std: {np.std(scores)}")

R2 Mean: 0.9451816375903522 R2 Std: 0.0034610555321333914

  • import numpy as np: You import NumPy, which is used at the end to compute the mean and standard deviation of the scores.

  • from sklearn.ensemble import RandomForestRegressor: You import the RandomForestRegressor class from scikit-learn, which is an ensemble machine learning model used for regression tasks.

  • from sklearn.model_selection import KFold, cross_val_score: You import KFold and cross_val_score, which are essential for performing cross-validation.

  • cv = KFold(n_splits=5, random_state=42, shuffle=True): You create a KFold cross-validation object named 'cv'. This object will be used to split the data into 5 folds (n_splits=5), shuffling the data before splitting with 'shuffle=True' and making the splits reproducible with 'random_state=42'.

  • model = RandomForestRegressor(): You create an instance of the RandomForestRegressor model, which you'll use to build and evaluate the predictive model. No 'random_state' is set on the model itself, so the scores can vary slightly between runs even though the splits are fixed.

  • scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1): This line performs cross-validation using the specified model, features 'X', target 'y', and the KFold object 'cv'. The 'n_jobs=-1' parameter allows the cross-validation to use all available CPU cores for parallel processing.

  • print(f"R2 Mean: {np.mean(scores)} R2 Std: {np.std(scores)}"): Finally, you calculate and print the mean and standard deviation of the cross-validation scores. Because no 'scoring' argument is given, cross_val_score falls back to the estimator's own score method, which for a regressor is R², so these values are not errors. A mean of about 0.945 means the model explains roughly 94.5% of the variance in 'cnt' on the held-out folds, and the small standard deviation shows that the folds agree closely.
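When no 'scoring' argument is passed, cross_val_score uses the estimator's default scorer, which for a regressor is R² (higher is better) rather than an error. If an actual error metric is wanted, it can be requested explicitly. The following is a minimal sketch, assuming a synthetic dataset from make_regression as a stand-in for 'X' and 'y' (the names 'X_demo' and 'y_demo' and the choice of mean absolute error are illustrative).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and y (assumption: any numeric regression data works).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=0)

# Default scoring for a regressor is R^2 (higher is better).
r2_scores = cross_val_score(model, X_demo, y_demo, cv=cv)

# scikit-learn reports error metrics as negated values so that "higher is
# better" holds for every scorer; flip the sign to recover the error itself.
neg_mae = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")
mae_scores = -neg_mae

print(f"R2 Mean: {r2_scores.mean():.3f} R2 Std: {r2_scores.std():.3f}")
print(f"MAE Mean: {mae_scores.mean():.3f} MAE Std: {mae_scores.std():.3f}")
```

Reporting an error in the units of the target alongside R² often makes the result easier to interpret.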


Advantages

  • Reduced Variance in Model Evaluation: By dividing the dataset into multiple subsets and repeatedly training and testing the model on different portions of the data, K-Fold cross-validation provides a more stable and reliable estimate of a model's performance. It helps reduce the variance in the evaluation compared to a single train-test split.

  • Utilizes the Entire Dataset: Every data point is used for testing exactly once and for training in the other K-1 iterations. No observation is permanently set aside, so the evaluation makes full use of the available data.

  • Helps Detect Data Anomalies: If there are anomalies or outliers in the data, K-Fold cross-validation can reveal their impact on the model's performance across different folds.

  • Reduces the Impact of Data Ordering: In some cases, the order of data points can affect model performance. With 'shuffle=True', K-Fold cross-validation randomizes the rows before splitting, so no fold is tied to one contiguous block of the data.

  • Helps Detect Overfitting: K-Fold cross-validation does not prevent overfitting by itself, but by repeatedly testing the model on different data subsets it makes overfitting harder to miss, since the model must perform consistently across different data splits.


Disadvantages

  • Increased Computational Cost: K-Fold cross-validation requires training and testing the model multiple times, which can be computationally expensive, especially when working with large datasets or complex models. It may not be feasible for extremely resource-intensive models.

  • Data Leakage in Feature Engineering: When performing feature engineering, there is a risk of data leakage if not done correctly. If you perform feature engineering before cross-validation and use information from the entire dataset, information from the test set may inadvertently influence the training process.

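  • Sensitivity to the Value of K: The choice of the number of folds (K) can impact the results. A small K trains each model on a smaller share of the data, which tends to make the performance estimate pessimistic, while a large K increases computational cost and can make the estimate more variable. Values of 5 or 10 are common starting points, but selecting the best K for a given dataset can be a challenge.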

  • Not Ideal for Time-Series Data: K-Fold cross-validation is not well-suited for time-series data, where the order of data points is important. In such cases, techniques like time series cross-validation or walk-forward validation are more appropriate.

  • Lack of Interpretability: K-Fold cross-validation focuses on performance evaluation and may not provide insights into the interpretability of the model or the importance of individual features.
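Two of the points above, data leakage and time-ordered data, have standard remedies in scikit-learn. The following is a minimal sketch, assuming synthetic data and a Ridge model chosen only for illustration (the names 'X_demo', 'y_demo' and 'pipeline' are hypothetical): preprocessing is placed inside a Pipeline so it is fitted on the training folds only, and TimeSeriesSplit is used when row order matters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: rows are treated as time-ordered below).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# The scaler lives inside the pipeline, so it is re-fitted on the training
# part of every fold and never sees statistics from the held-out fold.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
kfold_cv = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold_cv)

# For ordered data, TimeSeriesSplit trains on earlier rows and tests on later ones.
tscv = TimeSeriesSplit(n_splits=5)
ordered_ok = all(train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X_demo))
ts_scores = cross_val_score(pipeline, X_demo, y_demo, cv=tscv)

print(f"KFold R2 Mean: {kfold_scores.mean():.3f}")
print(f"TimeSeriesSplit R2 Mean: {ts_scores.mean():.3f}")
print(f"Training rows always precede test rows: {ordered_ok}")
```

Fitting a scaler or encoder on the full dataset before cross-validation is the typical way leakage creeps in; wrapping the steps in a pipeline avoids it without extra bookkeeping.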


2) Repeated Stratified KFold Cross Validation


Repeated Stratified K-Fold Cross Validation is an advanced and robust technique for evaluating machine learning models, especially in scenarios where ensuring both representative sampling and rigorous assessment of model performance is critical. This methodology combines the strengths of Stratified K-Fold Cross Validation and repetition to provide a comprehensive and statistically sound approach to model validation.
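What "stratified" and "repeated" mean is easiest to see on class labels. The following is a minimal sketch, assuming a toy imbalanced label array (80 samples of class 0 and 20 of class 1; the arrays 'X_toy' and 'y_toy' are illustrative): it counts how many minority samples fall in each test fold across all repetitions.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy imbalanced labels: 80 samples of class 0 and 20 of class 1.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # feature values do not influence the splitting

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

minority_counts = []
for train_idx, test_idx in rskf.split(X_toy, y_toy):
    # Count the class-1 samples in this 20-sample test fold.
    minority_counts.append(int(y_toy[test_idx].sum()))

print(len(minority_counts))  # 5 folds x 3 repeats = 15 splits
print(set(minority_counts))  # {4}: every test fold keeps the 80/20 ratio
```

Every test fold holds 4 of the 20 minority samples, so the 80/20 class ratio is preserved in each split, and the whole procedure is run 3 times with different shuffles.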


Implement Repeated Stratified KFold Cross Validation


Let us perform Repeated Stratified K-Fold Cross Validation with a RandomForestRegressor model.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
model = RandomForestRegressor()
# With no 'scoring' argument, a regressor is scored with R^2 (higher is better)
scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1)
print(f"R2 Mean: {np.mean(scores)} R2 Std: {np.std(scores)}")

R2 Mean: 0.9450597389924781 R2 Std: 0.0036935612975137313

  • import numpy as np: You import NumPy, which is used at the end to compute the mean and standard deviation of the scores.

  • from sklearn.ensemble import RandomForestRegressor: You import the RandomForestRegressor class from scikit-learn, which is an ensemble machine learning model used for regression tasks.

  • from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score: You import the RepeatedStratifiedKFold cross-validation class and cross_val_score function from scikit-learn, which are essential for performing repeated stratified K-Fold cross-validation.

  • cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42): You create a RepeatedStratifiedKFold cross-validation object named 'cv'. This object will be used to split the data into 5 stratified folds, and this process will be repeated 3 times (n_repeats=3), giving 15 train/test splits in total. The 'random_state=42' parameter makes the splits reproducible. Note that stratification is defined on class labels: with the integer target 'cnt', scikit-learn treats each distinct count as its own class and may emit a warning about classes with too few members, so the stratification carries little meaning for this regression target. For regression, RepeatedKFold, or stratifying on a binned version of the target, is usually the more natural choice.

  • model = RandomForestRegressor(): You create an instance of the RandomForestRegressor model, which you'll use to build and evaluate the predictive model.

  • scores = cross_val_score(model, X, y, cv=cv, n_jobs=-1): This line performs repeated stratified K-Fold cross-validation using the specified model, features 'X', target 'y', and the RepeatedStratifiedKFold object 'cv'. The 'n_jobs=-1' parameter allows the cross-validation to use all available CPU cores for parallel processing.

  • print(f"R2 Mean: {np.mean(scores)} R2 Std: {np.std(scores)}"): Finally, you calculate and print the mean and standard deviation of the cross-validation scores across all repetitions and folds. As with plain KFold, no 'scoring' argument is given, so the values are R² scores rather than errors. The result is very close to the KFold result above, which suggests the estimate is stable.
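Stratification is defined on class labels, so for a continuous or count target one common workaround is to stratify on a binned copy of the target while still fitting the model on the original values. The following is a minimal sketch of that idea, assuming synthetic data from make_regression and five quantile bins (the names 'X_demo', 'y_demo' and 'y_bins' and the number of bins are illustrative choices, not part of the original project).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (assumption: a continuous regression target).
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)

# Bin the target into 5 quantile groups; the bins are used only to build the splits.
y_bins = pd.qcut(y_demo, q=5, labels=False)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
# Materialize the (train, test) index pairs using the binned target for stratification.
splits = list(cv.split(X_demo, y_bins))

# The model is still trained and scored on the original continuous target.
model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=splits)

print(f"R2 Mean: {np.mean(scores):.3f} R2 Std: {np.std(scores):.3f}")
```

Passing a precomputed list of index pairs as 'cv' is supported by cross_val_score, which makes it possible to split on one array and score on another.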


Advantages

  • Robust Model Evaluation: By repeating the Stratified K-Fold Cross Validation process multiple times with different random splits, Repeated Stratified K-Fold Cross Validation provides a more robust and stable assessment of a model's performance. It helps reduce the impact of random data splitting, ensuring a more reliable evaluation.

  • Stratified Sampling: Like standard Stratified K-Fold Cross Validation, this method maintains class distribution balance in each fold. It is especially valuable for classification tasks with imbalanced datasets, preventing biased model evaluation and ensuring that each fold has a representative sample of all classes.

  • Helps Detect Overfitting and Underfitting: The repetition in Repeated Stratified K-Fold Cross Validation makes overfitting or underfitting easier to spot. If a model performs well consistently across different splits, it is a stronger indication of its generalization capability.

  • Effective for Imbalanced Datasets: Repeated Stratified K-Fold Cross Validation is particularly effective for imbalanced datasets, where some classes have significantly fewer samples. Because every fold keeps a proportional share of the minority classes, the scores reflect performance on them, and a model that only fits the majority class is less likely to look good by chance.

  • Better Insight into Model Performance: Repeated Stratified K-Fold Cross Validation provides a more comprehensive view of a model's performance. It helps identify whether a model consistently performs well or if its success is limited to specific data splits.


Disadvantages

  • Increased Computational Cost: The repeated nature of this technique can significantly increase the computational cost, especially when performing numerous repetitions or using complex models. It may not be practical for very large datasets or resource-intensive algorithms.

  • Choice of Repetitions: Selecting the appropriate number of repetitions can be challenging. Too few repetitions may not provide sufficient robustness, while too many repetitions may be computationally expensive. The choice depends on the specific dataset and analysis goals.

  • Unstable Estimates on Small or Noisy Data: If the dataset is small or noisy, model performance may fluctuate significantly across repetitions. Repetition exposes this instability rather than removing it, so it can remain challenging to draw reliable conclusions.

  • Difficulty in Model Interpretability: Repeated Stratified K-Fold Cross Validation focuses on model evaluation and performance assessment but does not provide insights into model interpretability or feature importance.
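One way to judge whether the number of repetitions is sufficient, and whether the estimate is stable, is to look at the mean score of each repetition separately. The following is a minimal sketch, assuming a synthetic imbalanced classification problem and a LogisticRegression model chosen only for illustration (the names 'X_demo', 'y_demo' and 'repeat_means' are hypothetical).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

n_splits, n_repeats = 5, 3

# Synthetic imbalanced classification data (roughly 80% / 20%).
X_demo, y_demo = make_classification(
    n_samples=100, n_features=6, weights=[0.8, 0.2], random_state=0
)

cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

# Splits are generated one repetition at a time, so each consecutive block of
# n_splits scores belongs to the same repetition.
per_repeat = scores.reshape(n_repeats, n_splits)
repeat_means = per_repeat.mean(axis=1)

print(f"Accuracy per repetition: {np.round(repeat_means, 3)}")
print(f"Spread between repetitions: {repeat_means.std():.4f}")
```

If the per-repetition means differ noticeably, more repetitions (or more data) may be needed before comparing models on the overall average.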


Final Thoughts

  • KFold Cross Validation and Repeated Stratified KFold Cross Validation are both valuable techniques in the field of machine learning for model evaluation, selection, and hyperparameter tuning.

  • KFold Cross Validation is a fundamental and widely used technique that provides a robust estimate of a model's performance. It is straightforward to implement and cheaper to run than its repeated variants, since the model is trained only K times. Repeated Stratified KFold Cross Validation builds upon the strengths of KFold Cross Validation and introduces stratification and repetition to provide a more stable assessment of model performance.

  • KFold Cross Validation is a reliable and widely used technique, while Repeated Stratified KFold Cross Validation offers enhanced robustness, particularly in cases of imbalanced datasets and when you need more stable performance estimates.

In conclusion, the choice between KFold Cross Validation and Repeated Stratified KFold Cross Validation depends on the specific characteristics of your dataset and your goals in model evaluation. Both methods play a crucial role in helping machine learning practitioners assess and improve the quality of their models. It is important to carefully consider the advantages and disadvantages of each technique and select the one that best suits your particular analysis.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
