In machine learning projects, achieving optimal model performance requires paying attention to various steps in the training process. But before focusing on the technical aspects of model training, it is important to define the problem, understand the context, and analyze the dataset in detail.
Once you have a solid grasp of the problem and data, you can proceed to implement strategies that’ll help you build robust and efficient models. Here, we outline five actionable tips that are essential for training machine learning models.
Let’s get started.
1. Preprocess Your Data Efficiently
Data preprocessing is one of the most important steps in the machine learning pipeline. Properly preprocessed data can significantly enhance model performance and generalization. Here are some key preprocessing steps:
- Handle missing values: Use techniques such as mean/mode imputation or more advanced methods like K-Nearest Neighbors (KNN) imputation (a KNN sketch follows the pipeline example below).
- Normalize or standardize features: Scale features if you’re using algorithms that are sensitive to feature scaling.
- Encode categorical variables: Convert categorical variables into numerical values using techniques like one-hot encoding or label encoding.
- Split into training and test sets: Split your data into training and test sets before applying any preprocessing steps to avoid data leakage.
The following code snippet shows a sample data preprocessing pipeline:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Read data from a CSV file
data = pd.read_csv('your_data.csv')

# Specify the column name of the target variable
target_column = 'target'

# Split into features and target
X = data.drop(target_column, axis=1)
y = data[target_column]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing steps for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Define preprocessing steps for categorical features
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(drop='first'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Fit on the training data and transform it
X_train_processed = preprocessor.fit_transform(X_train)

# Transform the test data using statistics learned from the training data
X_test_processed = preprocessor.transform(X_test)
Preprocessing steps are defined for numerical and categorical features: numerical features are imputed with the mean and scaled using StandardScaler, while categorical features are one-hot encoded. These preprocessing steps are combined using ColumnTransformer and applied to both the training and test sets while avoiding data leakage.
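As noted earlier, mean imputation can be swapped for KNN imputation. Here's a minimal sketch of an alternative numeric transformer using scikit-learn's KNNImputer (n_neighbors=5 is simply the library default, made explicit here for illustration):

from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute each missing value from the 5 most similar rows instead of the column mean
numeric_transformer_knn = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=5)),
    ('scaler', StandardScaler())
])

Dropping this in place of numeric_transformer inside the ColumnTransformer leaves the rest of the pipeline unchanged.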
2. Focus on Feature Engineering
Feature engineering is the systematic process of modifying existing features and creating new ones to improve model performance. Effective feature engineering can significantly boost the performance of machine learning models. Here are some key techniques.
Create Interaction Features
Interaction features capture the relationships between different variables. These features can provide additional insights that single features may not reveal.
Suppose you have ‘price’ and ‘qty_sold’ as features. An interaction feature could be the product of these two variables, indicating the total sales of the product:
# Create interaction feature
data['price_qty_interaction'] = data['price'] * data['qty_sold']
Extract Info from Date and Time Features
Date and time data can be decomposed into meaningful components such as year, month, day, and day of the week. These components can reveal temporal patterns in the data.
Say you have a ‘date’ feature. You can extract various components—year, month, and day of the week—from this feature as shown:
# Parse the 'date' column and extract date components
data['date'] = pd.to_datetime(data['date'])
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek
Binning
Binning involves converting continuous features into discrete bins. This can help reduce the impact of outliers and create more representative features.
Suppose you have the ‘income’ feature. You can create bins to categorize the income levels into low, medium, and high as follows:
# Bin the continuous 'income' feature into three categories
data['income_bin'] = pd.cut(data['income'], bins=3, labels=['Low', 'Medium', 'High'])
By focusing on feature engineering, you can create more informative features that help the model understand the data better, leading to improved performance and generalization. For more actionable advice, read Tips for Effective Feature Engineering in Machine Learning.
3. Handle Class Imbalance
Class imbalance is a common problem in real-world datasets, where the target variable doesn't have a uniform representation of all classes. Performance metrics of models trained on imbalanced datasets can be misleading: a model can achieve high accuracy simply by always predicting the majority class.
Handling class imbalance is necessary to ensure that the model performs well across all classes. Here are some techniques.
Resampling Techniques
Resampling techniques involve modifying the dataset to balance the class distribution. There are two main approaches:
- Oversampling: Increase the number of instances in the minority class by duplicating them or creating synthetic samples. Synthetic Minority Over-sampling Technique (SMOTE) is a popular method for generating synthetic samples.
- Undersampling: Decrease the number of instances in the majority class by randomly removing some of them (see the sketch after the SMOTE example below).
Here’s an example of using SMOTE to oversample the minority class:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = data.drop('target', axis=1)
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)

# Apply SMOTE to the training set only
smote = SMOTE(random_state=10)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
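For the undersampling route, the imbalanced-learn library follows the same fit_resample pattern. Here's a minimal sketch using RandomUnderSampler, assuming the same X_train and y_train from above:

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority-class samples until the classes are balanced
undersampler = RandomUnderSampler(random_state=10)
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)

In both cases, resampling is applied only to the training set; the test set keeps its original distribution so that evaluation reflects real-world conditions.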
Adjusting Class Weights
Adjusting class weights in machine learning algorithms can help to penalize misclassifications of the minority class, making the model more sensitive to the minority class.
Consider the following example:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read data from a CSV file
data = pd.read_csv('your_data.csv')

# Specify the column name of the target variable
target_column = 'target'

# Split into features and target
X = data.drop(target_column, axis=1)
y = data[target_column]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)
You can compute class weights such that the minority class is assigned a higher weight—inversely proportional to class frequencies—and then use those weights when instantiating the classifier like so:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights (inversely proportional to class frequencies)
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights_dict = dict(zip(classes, class_weights))

print(f"Class weights: {class_weights_dict}")

# Train a random forest that penalizes minority-class misclassifications more heavily
model = RandomForestClassifier(class_weight=class_weights_dict, random_state=10)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(classification_report(y_test, y_pred))
By using these techniques, you can handle class imbalance effectively, ensuring that your model performs well across all classes. To learn more about handling class imbalance, read 5 Effective Ways to Handle Imbalanced Data in Machine Learning.
4. Use Cross-Validation and Hyperparameter Tuning
Cross-validation and hyperparameter tuning are essential techniques for selecting the best model and avoiding overfitting. They help ensure that your model generalizes to unseen data rather than merely fitting the training set.
Cross-Validation
Using a single train-test split yields a high-variance estimate of model performance, one that is influenced (more than desired) by the specific samples that end up in the train and test sets.
Cross-validation is a technique used to assess the performance of a model by dividing the data into multiple subsets or folds and training and testing the model on these folds.
The most common method is k-fold cross-validation, where the data is split into k subsets, and the model is trained and evaluated k times. One fold is used as the test set and the remaining (k-1) folds are used as the training set each time.
Let’s reuse the boilerplate code from before:
import pandas as pd

# Read data from a CSV file
data = pd.read_csv('your_data.csv')

# Specify the column name of the target variable
target_column = 'target'

# Split into features and target
X = data.drop(target_column, axis=1)
y = data[target_column]
Here’s how you can use k-fold cross-validation to evaluate a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Initialize the model
model = RandomForestClassifier(random_state=10)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

# Print cross-validation scores
print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean CV Score: {cv_scores.mean():.2f}')
Hyperparameter Tuning
Hyperparameter tuning involves finding the optimal hyperparameters for your model. The two common techniques are:
- Grid search: Exhaustively evaluates every combination in a chosen parameter grid. This can be computationally expensive for large grids.
- Randomized search: Randomly samples parameter values from specified distributions, evaluating only a fixed budget of combinations (see the sketch after the grid search example below).
To learn more about hyperparameter tuning, read Hyperparameter Tuning: GridSearchCV and RandomizedSearchCV, Explained.
Here’s an example of how you can use grid search to find the best hyperparameters:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Initialize the model
model = RandomForestClassifier(random_state=10)

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Cross-Validation Score: {grid_search.best_score_:.2f}')
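Randomized search follows the same pattern. Here's a minimal sketch using RandomizedSearchCV; the parameter distributions and the n_iter budget of 20 are illustrative choices, not recommendations:

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Define distributions to sample parameter values from
param_distributions = {
    'n_estimators': randint(50, 300),
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': randint(2, 11)
}

model = RandomForestClassifier(random_state=10)

# Sample 20 parameter combinations and evaluate each with 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=10
)
random_search.fit(X, y)

print(f'Best Parameters: {random_search.best_params_}')
print(f'Best Cross-Validation Score: {random_search.best_score_:.2f}')

Because it evaluates a fixed number of sampled combinations rather than the full grid, randomized search is often the more practical choice when the search space is large.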
Cross-validation ensures that the model performs optimally on unseen data, while hyperparameter tuning helps in optimizing the model parameters for better performance.
5. Choose the Best Machine Learning Model
While you can use hyperparameter tuning to optimize a chosen model, selecting an appropriate model in the first place is just as important. It's worth evaluating multiple models and choosing the one that best fits your dataset and the problem you're trying to solve.
Cross-validation provides a reliable estimate of model performance on unseen data. So comparing different models using cross-validation scores helps in identifying the model that performs best on your data.
Here’s how you can use cross-validation to compare logistic regression and random forest classifiers (leaving out the starter code):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Split into features and target
X = data.drop('target', axis=1)
y = data['target']

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=10),
    'Random Forest': RandomForestClassifier(random_state=10),
}

# Compare models using cross-validation
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name} CV Score: {cv_scores.mean():.2f}')
You can also use ensemble methods that combine multiple models to improve performance. They are particularly effective at reducing overfitting, resulting in more robust models. You may find Tips for Choosing the Right Machine Learning Model for Your Data helpful to learn more about model selection.
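As one example of an ensemble, here's a minimal sketch of a soft-voting classifier that combines the two models compared above, reusing the same X and y (the choice of estimators and of soft voting is illustrative):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Combine the two models into a soft-voting ensemble
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=10)),
        ('rf', RandomForestClassifier(random_state=10)),
    ],
    voting='soft'  # average the predicted class probabilities
)

cv_scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(f'Voting Ensemble CV Score: {cv_scores.mean():.2f}')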
Summary
I hope you learned a few helpful tips to keep in mind when training your machine learning models. Let’s wrap up by reviewing them:
- Handle missing values, scale features, and encode categorical variables as needed. Split data into training and test sets before applying any preprocessing to avoid leakage.
- Create interaction features, extract useful date/time features, and use binning and other techniques to create more representative features.
- Handle class imbalance using resampling techniques or by adjusting class weights.
- Implement k-fold cross-validation and hyperparameter optimization techniques like grid search or randomized search for robust model evaluation.
- Compare models using cross-validation scores and consider ensemble methods for improved performance.
Happy model building!