Feature engineering is an important step in the machine learning pipeline. It is the process of transforming data in its native format into meaningful features to help the machine learning model learn better from the data.
If done right, feature engineering can significantly enhance the performance of machine learning algorithms. Beyond the basics of understanding your data and preprocessing, effective feature engineering involves creating interaction terms, generating indicator variables, and binning features into buckets.
These techniques help extract relevant information from the data and help build robust machine learning solutions. In this guide, we’ll explore these feature engineering techniques by spinning up a sample housing dataset.
Note: You can code along to this tutorial in your preferred Jupyter notebook environment. You can also follow along with the Google Colab notebook for this tutorial.
1. Understand Your Data
Before jumping into feature engineering, you should first thoroughly understand your data. This includes:
- Exploring and visualizing your dataset to get an idea of the distributions of, and relationships between, variables
- Knowing the types of features you have (categorical, numerical, datetime, and more) and understanding their significance in your analysis
- Using domain knowledge to understand what each feature represents and how it might interact with other features; this insight can guide you in creating meaningful new features
Let’s create a sample housing dataset to work with:
import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create sample data
n_samples = 1000
data = {
    'price': np.random.normal(200000, 50000, n_samples).astype(int),
    'size': np.random.normal(1500, 500, n_samples).astype(int),
    'num_rooms': np.random.randint(2, 8, n_samples),
    'num_bathrooms': np.random.randint(1, 4, n_samples),
    'age': np.random.randint(0, 40, n_samples),
    'neighborhood': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples),
    'income': np.random.normal(60000, 15000, n_samples).astype(int)
}

df = pd.DataFrame(data)

print(df.head())
In addition to getting basic information on the dataset, you can generate distribution plots and count plots for numeric and categorical variables, respectively. The following code snippets show basic exploratory data analysis on the dataset.
First, we get some basic information on the dataframe:
# Basic data exploration on the entire dataset
print(df.head())
print(df.info())
print(df.describe())
Next, visualize the distribution of the numeric features 'size' and 'income' in the dataset:
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize distributions using histplot for 'size' and 'income'
plt.figure(figsize=(8, 6))
sns.histplot(df['size'], kde=True)
plt.title('Distribution of House Sizes')
plt.xlabel('Size')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(8, 6))
sns.histplot(df['income'], kde=True)
plt.title('Distribution of Household Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
For categorical variables, a count plot can help you understand how the different values are distributed:
# Count plot for 'neighborhood'
plt.figure(figsize=(8, 6))
sns.countplot(x='neighborhood', data=df, order=df['neighborhood'].value_counts().index)
plt.title('Count of Houses per Neighborhood')
plt.xlabel('Neighborhood')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
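You can also inspect relationships between the numeric features with a correlation heatmap. Here's a minimal sketch (the styling choices are just one option):

# Correlation heatmap for the numeric columns only
numeric_cols = df.select_dtypes(include='number')
plt.figure(figsize=(8, 6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Between Numeric Features')
plt.show()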
By understanding your data, you can identify key features and relationships between features that will inform the subsequent feature engineering steps. This step ensures that your preprocessing and feature creation efforts are grounded in a thorough understanding of the dataset.
2. Preprocess the Data Effectively
Effective preprocessing involves handling missing values and outliers, scaling numerical features, and encoding categorical variables. The choice of preprocessing techniques also depends on the data and on the requirements of the machine learning algorithms.
We don’t have any missing values in the example dataframe. For most real-world datasets, you can handle missing values using suitable imputation strategies.
Before you go ahead with preprocessing, split the dataset into train and test sets:
from sklearn.model_selection import train_test_split

# Split data into features X and target label y
X = df.drop('price', axis=1)
y = df['price']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
To bring the numeric features to the same scale, you can use min-max or standard scaling. Here's a generic code snippet to impute missing values and scale numeric features:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Handle missing values by imputing with the training-set mean
imputer = SimpleImputer(strategy='mean')
X_train[['features_to_impute']] = imputer.fit_transform(X_train[['features_to_impute']])
X_test[['features_to_impute']] = imputer.transform(X_test[['features_to_impute']])

# Scale features to zero mean and unit variance
scaler = StandardScaler()
X_train[['features_to_scale']] = scaler.fit_transform(X_train[['features_to_scale']])
X_test[['features_to_scale']] = scaler.transform(X_test[['features_to_scale']])
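If you prefer scaling to a fixed [0, 1] range rather than standardizing, MinMaxScaler works the same way. A minimal sketch, reusing the same placeholder column name:

from sklearn.preprocessing import MinMaxScaler

# Scale features to the [0, 1] range instead of zero mean / unit variance
minmax_scaler = MinMaxScaler()
X_train[['features_to_scale']] = minmax_scaler.fit_transform(X_train[['features_to_scale']])
X_test[['features_to_scale']] = minmax_scaler.transform(X_test[['features_to_scale']])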
In these snippets, replace 'features_to_impute' and 'features_to_scale' with the specific columns you'd like to impute and scale, respectively. We'll also look at creating more representative features from the existing features in the next sections.
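The 'neighborhood' column in our dataset is categorical, so it also needs to be encoded before most models can use it. One option is one-hot encoding; here's a minimal sketch using scikit-learn's OneHotEncoder (assuming scikit-learn 1.2+ for the sparse_output argument):

from sklearn.preprocessing import OneHotEncoder

# One-hot encode 'neighborhood'; handle_unknown='ignore' guards against unseen categories at test time
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
neigh_train = encoder.fit_transform(X_train[['neighborhood']])
neigh_test = encoder.transform(X_test[['neighborhood']])

neigh_cols = encoder.get_feature_names_out(['neighborhood'])
X_train = X_train.join(pd.DataFrame(neigh_train, columns=neigh_cols, index=X_train.index))
X_test = X_test.join(pd.DataFrame(neigh_test, columns=neigh_cols, index=X_test.index))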
In summary, effective preprocessing prepares your data for all downstream tasks by ensuring consistency and addressing any issues with the raw data. This step is essential for getting accurate and reliable results from your machine learning models.
3. Create Interaction Terms
Creating interaction terms involves generating new features that capture the interactions between existing features.
For our example dataset, we’ll generate interaction terms for ‘size’ and ‘num_rooms’ using PolynomialFeatures from scikit-learn:
from sklearn.preprocessing import PolynomialFeatures

# Create interaction features on the training set, then apply the same transform to the test set
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_terms_train = poly.fit_transform(X_train[['size', 'num_rooms']])
interaction_terms_test = poly.transform(X_test[['size', 'num_rooms']])

feature_names = poly.get_feature_names_out(['size', 'num_rooms'])
interaction_df_train = pd.DataFrame(interaction_terms_train, columns=feature_names, index=X_train.index)
interaction_df_test = pd.DataFrame(interaction_terms_test, columns=feature_names, index=X_test.index)

# Add only the new interaction term; 'size' and 'num_rooms' already exist in the data
X_train = pd.concat([X_train, interaction_df_train[['size num_rooms']]], axis=1)
X_test = pd.concat([X_test, interaction_df_test[['size num_rooms']]], axis=1)
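You don't have to rely on PolynomialFeatures, either; when domain knowledge suggests a specific combination, you can hand-craft it directly. A small sketch (the 'size_per_room' feature and its name are just illustrative):

# Hand-crafted ratio feature: average size per room
X_train['size_per_room'] = X_train['size'] / X_train['num_rooms']
X_test['size_per_room'] = X_test['size'] / X_test['num_rooms']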
Creating interaction terms can improve your model by capturing complex relationships, such as the combined effect of a house's size and its number of rooms, that the individual features don't express on their own.
4. Create Indicator Variables
You can create indicator variables to flag certain conditions or mark thresholds in your data. These variables take on a value of 0 or 1, indicating the absence or presence of a particular condition.
For example, suppose you have a loan default prediction dataset in which many of the defaults are on student loans. It can be helpful to create an 'is_student' feature from the 'professions' categorical column, as sketched below.
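A minimal sketch of that idea (the 'loans' DataFrame and its 'professions' values are hypothetical, not part of our housing data):

# Hypothetical loan dataset: flag borrowers whose profession is 'student'
loans['is_student'] = (loans['professions'] == 'student').astype(int)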
In the housing dataset, we can create an indicator variable to denote if the houses are over 30 years old and create a count plot on the indicator variable ‘age_indicator’:
import seaborn as sns
import matplotlib.pyplot as plt

# Create an indicator variable for houses older than 30 years
X_train['age_indicator'] = (X_train['age'] > 30).astype(int)
X_test['age_indicator'] = (X_test['age'] > 30).astype(int)

# Visualize the indicator variable
plt.figure(figsize=(10, 6))
sns.countplot(x='age_indicator', data=X_train)
plt.title('Count of Houses Based on Age Indicator (>30 years)')
plt.xlabel('Age Indicator')
plt.ylabel('Count')
plt.show()
As seen, creating indicator variables can help encode additional information for machine learning models. You can create an indicator variable from the number of rooms, the 'num_rooms' column, as well, as sketched below.
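For instance (the five-room threshold here is arbitrary and chosen only for illustration):

# Flag larger houses; the threshold of 5 rooms is an assumption, not derived from the data
X_train['many_rooms'] = (X_train['num_rooms'] >= 5).astype(int)
X_test['many_rooms'] = (X_test['num_rooms'] >= 5).astype(int)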
5. Create More Representative Features with Binning
Binning features into buckets involves grouping continuous variables into discrete intervals. Sometimes grouping features like age and income into bins can help find patterns that are hard to identify within continuous data.
For the example housing dataset, let's bin the age of the house and the household income into bins with descriptive labels. You can use the pandas cut() function to bin a feature into equal-width intervals and qcut() to bin it into quantile-based intervals, like so:
# Creating age bins (equal-width intervals)
# For simplicity, bin edges are computed separately on each split
X_train['age_bin'] = pd.cut(X_train['age'], bins=3, labels=['new', 'moderate', 'old'])
X_test['age_bin'] = pd.cut(X_test['age'], bins=3, labels=['new', 'moderate', 'old'])

# Creating income bins (quantile-based intervals)
X_train['income_bin'] = pd.qcut(X_train['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])
X_test['income_bin'] = pd.qcut(X_test['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])
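A quick way to sanity-check the new features is to look at how many rows fall into each bucket:

# Inspect how the training rows are distributed across the new buckets
print(X_train['age_bin'].value_counts())
print(X_train['income_bin'].value_counts())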
Binning continuous features into discrete intervals can simplify their representation and sometimes yields features with more predictive power.
Summary
In this guide, we went over the following tips for effective feature engineering:
- Perform EDA and use visualizations to understand your data.
- Preprocess effectively by handling missing values, encoding categorical variables, removing outliers, and ensuring a proper train-test split.
- Create interaction terms that combine features to capture meaningful interactions.
- Create indicator variables as needed based on thresholds and specific values to capture key categorical information.
- Bin features into buckets or discrete intervals to create more representative features.
Be sure to test out these feature engineering tips in your next machine learning project. Happy feature engineering!