Consistent with the principle of Occam’s razor, starting simple often leads to the most profound insights, especially when piecing together a predictive model. In this post, using the Ames Housing Dataset, we will first pinpoint the key features that shine on their own. Then, step by step, we’ll layer these insights, observing how their combined effect enhances our ability to forecast accurately. As we delve deeper, we will harness the power of the Sequential Feature Selector (SFS) to sift through the complexities and highlight the optimal combination of features. This methodical approach will guide us to the “sweet spot” — a harmonious blend where the selected features maximize our model’s predictive precision without overburdening it with unnecessary data.
Let’s get started.
Overview
This post is divided into three parts; they are:
- From Individual Strengths to Collective Impact
- Diving Deeper with SFS: The Power of Combination
- Finding the Predictive “Sweet Spot”
From Individual Strengths to Collective Impact
Our first step is to identify which of the many features in the Ames dataset stand out as powerful predictors on their own. To do this, we fit a separate simple linear regression model for each feature and rank the features by their cross-validated predictive power for housing prices.
# Load the essential libraries and Ames dataset
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import pandas as pd

Ames = pd.read_csv("Ames.csv").select_dtypes(include=["int64", "float64"])
Ames.dropna(axis=1, inplace=True)
X = Ames.drop("SalePrice", axis=1)
y = Ames["SalePrice"]

# Initialize the Linear Regression model
model = LinearRegression()

# Prepare to collect feature scores
feature_scores = {}

# Evaluate each feature with cross-validation
for feature in X.columns:
    X_single = X[[feature]]
    cv_scores = cross_val_score(model, X_single, y)
    feature_scores[feature] = cv_scores.mean()

# Identify the top 5 features based on mean CV R² scores
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)
top_5 = sorted_features[0:5]

# Display the top 5 features and their individual performance
for feature, score in top_5:
    print(f"Feature: {feature}, Mean CV R²: {score:.4f}")
This will output the top 5 features that can be used individually in a simple linear regression:
Feature: OverallQual, Mean CV R²: 0.6183
Feature: GrLivArea, Mean CV R²: 0.5127
Feature: 1stFlrSF, Mean CV R²: 0.3957
Feature: YearBuilt, Mean CV R²: 0.2852
Feature: FullBath, Mean CV R²: 0.2790
Curiosity leads us further: what if we combine these top features into a single multiple linear regression model? Will their collective power surpass their individual contributions?
# Extracting the top 5 features for our multiple linear regression
top_features = [feature for feature, score in top_5]

# Building the model with the top 5 features
X_top = Ames[top_features]

# Evaluating the model with cross-validation
cv_scores_mlr = cross_val_score(model, X_top, y, cv=5, scoring="r2")
mean_mlr_score = cv_scores_mlr.mean()

print(f"Mean CV R² Score for Multiple Linear Regression Model: {mean_mlr_score:.4f}")
The initial findings are promising; each feature indeed has its strengths. However, when combined in a multiple regression model, we observe a “decent” improvement—a testament to the complexity of housing price predictions.
Mean CV R² Score for Multiple Linear Regression Model: 0.8003
This result hints at untapped potential: Could there be a more strategic way to select and combine features for even greater predictive accuracy?
Diving Deeper with SFS: The Power of Combination
As we expand our use of the Sequential Feature Selector (SFS) from $n=1$ to $n=5$, an important concept comes into play: the power of combination. Let's illustrate this by building on the code above:
# Perform Sequential Feature Selection with n=5, building on the code above
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(model, n_features_to_select=5)
sfs.fit(X, y)

selected_features = X.columns[sfs.get_support()].to_list()
print(f"Features selected by SFS: {selected_features}")

scores = cross_val_score(model, Ames[selected_features], y)
print(f"Mean CV R² Score using SFS with n=5: {scores.mean():.4f}")
Choosing $n=5$ doesn’t merely mean selecting the five best standalone features. Rather, it’s about identifying the set of five features that, when used together, optimize the model’s predictive ability:
Features selected by SFS: ['GrLivArea', 'OverallQual', 'YearBuilt', '1stFlrSF', 'KitchenAbvGr']
Mean CV R² Score using SFS with n=5: 0.8056
This outcome is particularly enlightening when we compare it to the top five features selected based on their standalone predictive power. The attribute “FullBath” (not selected by SFS) was replaced by “KitchenAbvGr” in the SFS selection. This divergence highlights a fundamental principle of feature selection: it’s the combination that counts. SFS doesn’t just look for strong individual predictors; it seeks out features that work best in concert. This might mean selecting a feature that, on its own, wouldn’t top the list but, when combined with others, improves the model’s accuracy.
If you wonder why this is the case, recall that the features in a good combination should complement one another rather than be highly correlated. That way, each new feature provides the predictor with new information instead of repeating what is already known. A quick way to check this intuition is shown below.
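As a quick sanity check of this idea, you can inspect how correlated the selected features are with one another, and how the swapped-in "KitchenAbvGr" compares to the displaced "FullBath". The short sketch below builds on the variables defined above (X and selected_features); it is an optional check rather than part of the selection workflow, and the exact numbers will depend on the data.

# Optional check: pairwise correlations among the SFS-selected features.
# Highly correlated features carry overlapping information, so a good
# combination tends to show relatively low correlations.
print(X[selected_features].corr().round(2))

# Compare the swapped-in feature with the one it displaced,
# relative to GrLivArea (above-grade living area)
print(f"Corr(GrLivArea, KitchenAbvGr): {X['GrLivArea'].corr(X['KitchenAbvGr']):.2f}")
print(f"Corr(GrLivArea, FullBath): {X['GrLivArea'].corr(X['FullBath']):.2f}")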
Finding the Predictive “Sweet Spot”
The journey to optimal feature selection begins by pushing our model to its limits. By initially considering the maximum possible features, we gain a comprehensive view of how model performance evolves by adding each feature. This visualization serves as our starting point, highlighting the diminishing returns on model predictability and guiding us toward finding the “sweet spot.” Let’s start by running a Sequential Feature Selector (SFS) across the entire feature set, plotting the performance to visualize the impact of each addition:
# Performance of SFS from 1 feature to the maximum, building on the code above
import matplotlib.pyplot as plt

# Prepare to store the mean CV R² scores for each number of features
mean_scores = []

# Iterate over a range from 1 feature to the maximum number of features available
for n_features_to_select in range(1, len(X.columns)):
    sfs = SequentialFeatureSelector(model, n_features_to_select=n_features_to_select)
    sfs.fit(X, y)
    selected_features = X.columns[sfs.get_support()]
    score = cross_val_score(model, X[selected_features], y, cv=5, scoring="r2").mean()
    mean_scores.append(score)

# Plot the mean CV R² scores against the number of features selected
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(X.columns)), mean_scores, marker="o")
plt.title("Performance vs. Number of Features Selected")
plt.xlabel("Number of Features")
plt.ylabel("Mean CV R² Score")
plt.grid(True)
plt.show()
The plot below demonstrates how model performance improves as more features are added but eventually plateaus, indicating a point of diminishing returns:
From this plot, you can see that using more than ten features brings little benefit, while using three or fewer is clearly suboptimal. You can apply the "elbow method" to judge where the curve bends and pick the number of features there. This is ultimately a subjective call; the plot suggests anywhere from 5 to 9 features looks reasonable. A rough numeric cue is sketched below.
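If you prefer a numeric cue to complement eyeballing the curve, you can look at the marginal gain from each additional feature and note where it first drops below a small threshold. The sketch below is an optional addition that reuses mean_scores from the loop above; the 0.005 threshold is an arbitrary choice made here to mirror the tolerance used in the next step.

# Rough numeric cue for the "elbow": marginal gain from each added feature
import numpy as np

gains = np.diff(mean_scores)  # gains[i] = improvement going from i+1 to i+2 features
threshold = 0.005             # arbitrary; chosen to mirror the tol used below

small_gains = np.where(gains < threshold)[0]
if len(small_gains) > 0:
    print(f"Gain first drops below {threshold} when moving beyond {small_gains[0] + 1} features")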
Armed with the insights from our initial exploration, we apply a tolerance (tol=0.005) to our feature selection process. With this setting, SFS keeps adding features only as long as each addition improves the cross-validated score by at least 0.005, which lets us determine the optimal number of features objectively and robustly:
# Apply Sequential Feature Selector with tolerance = 0.005, building on the code above
sfs_tol = SequentialFeatureSelector(model, n_features_to_select="auto", tol=0.005)
sfs_tol.fit(X, y)

# Get the number of features selected with tolerance
n_features_selected = sum(sfs_tol.get_support())

# Prepare to store the mean CV R² scores for each number of features
mean_scores_tol = []

# Iterate over a range from 1 feature to the sweet spot
for n_features_to_select in range(1, n_features_selected + 1):
    sfs = SequentialFeatureSelector(model, n_features_to_select=n_features_to_select)
    sfs.fit(X, y)
    selected_features = X.columns[sfs.get_support()]
    score = cross_val_score(model, X[selected_features], y, cv=5, scoring="r2").mean()
    mean_scores_tol.append(score)

# Plot the mean CV R² scores against the number of features selected
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_features_selected + 1), mean_scores_tol, marker="o")
plt.title("The Sweet Spot: Performance vs. Number of Features Selected")
plt.xlabel("Number of Features")
plt.ylabel("Mean CV R² Score")
plt.grid(True)
plt.show()
This strategic move allows us to concentrate on those features that provide the highest predictability, culminating in the selection of 8 optimal features:
We can now conclude our findings by showing the features selected by SFS:
# Print the selected features and their performance, building on the above
selected_features = X.columns[sfs_tol.get_support()]
print(f"Number of features selected: {n_features_selected}")
print(f"Selected features: {selected_features.tolist()}")
print(f"Mean CV R² Score using SFS with tol=0.005: {mean_scores_tol[-1]:.4f}")
Number of features selected: 8
Selected features: ['GrLivArea', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', '1stFlrSF', 'BedroomAbvGr', 'KitchenAbvGr']
Mean CV R² Score using SFS with tol=0.005: 0.8239
By focusing on these 8 features, we achieve a model that balances complexity with high predictability, showcasing the effectiveness of a measured approach to feature selection.
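As an optional follow-up, not part of the original analysis, you could refit a plain linear regression on these eight features and inspect the learned coefficients, reusing X, y, and sfs_tol from above. Keep in mind that the features are on different scales, so the coefficient magnitudes are not directly comparable without standardization.

# Optional: fit the final model on the 8 selected features and inspect coefficients
final_features = X.columns[sfs_tol.get_support()]
final_model = LinearRegression()
final_model.fit(X[final_features], y)

# Coefficients are in the original (unscaled) units of each feature
for name, coef in zip(final_features, final_model.coef_):
    print(f"{name}: {coef:,.2f}")
print(f"Intercept: {final_model.intercept_:,.2f}")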
Summary
Through this three-part post, you have embarked on a journey from assessing the predictive power of individual features to harnessing their combined strength in a refined model. Our exploration has demonstrated that while more features can enhance a model’s ability to capture complex patterns, there comes a point where additional features no longer contribute to improved predictions. By applying a tolerance level to the Sequential Feature Selector, you have homed in on an optimal set of features that propels the model’s performance to its peak without overcomplicating the predictive landscape. This sweet spot, identified as eight key features, epitomizes the strategic melding of simplicity and sophistication in predictive modeling.
Specifically, you learned:
- The Art of Starting Simple: Beginning with simple linear regression models to understand each feature’s standalone predictive value sets the foundation for more complex analyses.
- Synergy in Selection: The transition to the Sequential Feature Selector underscores the importance of not just individual feature strengths but their synergistic impact when combined effectively.
- Maximizing Model Efficacy: The quest for the predictive sweet spot through SFS with a set tolerance teaches us the value of precision in feature selection, achieving the most with the least.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.