Consistent with the principle of Occam’s razor, starting simple often leads to the most profound insights, especially when piecing together a predictive model. In this post, using the Ames Housing Dataset, we will first pinpoint the key features that shine on their own. Then, step by step, we’ll layer these insights, observing how their combined effect enhances our ability to forecast accurately. As we delve deeper, we will harness the power of the Sequential Feature Selector (SFS) to sift through the complexities and highlight the optimal combination of features. This methodical approach will guide us to the “sweet spot” — a harmonious blend where the selected features maximize our model’s predictive precision without overburdening it with unnecessary data.
Let’s get started.
Overview
This post is divided into three parts; they are:
- From Individual Strengths to Collective Impact
- Diving Deeper with SFS: The Power of Combination
- Finding the Predictive “Sweet Spot”
From Individual Strengths to Collective Impact
Our first step is to identify which of the many features in the Ames dataset stand out as powerful predictors on their own. To do this, we fit a separate simple linear regression model for each feature and rank the features by their cross-validated predictive power for housing prices.
# Load the essential libraries and Ames dataset
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import pandas as pd

Ames = pd.read_csv("Ames.csv").select_dtypes(include=["int64", "float64"])
Ames.dropna(axis=1, inplace=True)
X = Ames.drop("SalePrice", axis=1)
y = Ames["SalePrice"]

# Initialize the Linear Regression model
model = LinearRegression()

# Prepare to collect feature scores
feature_scores = {}

# Evaluate each feature with cross-validation
for feature in X.columns:
    X_single = X[[feature]]
    cv_scores = cross_val_score(model, X_single, y)
    feature_scores[feature] = cv_scores.mean()

# Identify the top 5 features based on mean CV R² scores
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)
top_5 = sorted_features[0:5]

# Display the top 5 features and their individual performance
for feature, score in top_5:
    print(f"Feature: {feature}, Mean CV R²: {score:.4f}")
This will output the top 5 features that can be used individually in a simple linear regression:
Feature: OverallQual, Mean CV R²: 0.6183
Feature: GrLivArea, Mean CV R²: 0.5127
Feature: 1stFlrSF, Mean CV R²: 0.3957
Feature: YearBuilt, Mean CV R²: 0.2852
Feature: FullBath, Mean CV R²: 0.2790
Curiosity leads us further: what if we combine these top features into a single multiple linear regression model? Will their collective power surpass their individual contributions?
# Extracting the top 5 features for our multiple linear regression
top_features = [feature for feature, score in top_5]

# Building the model with the top 5 features
X_top = Ames[top_features]

# Evaluating the model with cross-validation
cv_scores_mlr = cross_val_score(model, X_top, y, cv=5, scoring="r2")
mean_mlr_score = cv_scores_mlr.mean()

print(f"Mean CV R² Score for Multiple Linear Regression Model: {mean_mlr_score:.4f}")
The initial findings are promising; each feature indeed has its strengths. However, when combined in a multiple regression model, we observe a “decent” improvement—a testament to the complexity of housing price predictions.
Mean CV R² Score for Multiple Linear Regression Model: 0.8003
This result hints at untapped potential: Could there be a more strategic way to select and combine features for even greater predictive accuracy?
Diving Deeper with SFS: The Power of Combination
As we expand our use of the Sequential Feature Selector (SFS) from $n=1$ to $n=5$, an important concept comes into play: the power of combination. Let's illustrate this by building on the code above:
# Perform Sequential Feature Selection with n=5, building on the code above
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(model, n_features_to_select=5)
sfs.fit(X, y)

selected_features = X.columns[sfs.get_support()].to_list()
print(f"Features selected by SFS: {selected_features}")

scores = cross_val_score(model, Ames[selected_features], y)
print(f"Mean CV R² Score using SFS with n=5: {scores.mean():.4f}")
Choosing $n=5$ doesn’t merely mean selecting the five best standalone features. Rather, it’s about identifying the set of five features that, when used together, optimize the model’s predictive ability:
Features selected by SFS: ['GrLivArea', 'OverallQual', 'YearBuilt', '1stFlrSF', 'KitchenAbvGr']
Mean CV R² Score using SFS with n=5: 0.8056
This outcome is particularly enlightening when we compare it to the top five features selected based on their standalone predictive power. The attribute “FullBath” (not selected by SFS) was replaced by “KitchenAbvGr” in the SFS selection. This divergence highlights a fundamental principle of feature selection: it’s the combination that counts. SFS doesn’t just look for strong individual predictors; it seeks out features that work best in concert. This might mean selecting a feature that, on its own, wouldn’t top the list but, when combined with others, improves the model’s accuracy.
If you wonder why this is the case, recall that the features in a good combination should complement one another rather than be highly correlated. That way, each new feature provides the predictor with new information instead of repeating what is already known. A quick way to check this intuition is shown below.
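As a quick sanity check of this idea, you can inspect how correlated the selected features are with one another, and how the swapped-in "KitchenAbvGr" compares to the displaced "FullBath". The short sketch below builds on the variables defined above (X and selected_features); it is an optional check rather than part of the selection workflow, and the exact numbers will depend on the data.

# Optional check: pairwise correlations among the SFS-selected features.
# Highly correlated features carry overlapping information, so a good
# combination tends to show relatively low correlations.
print(X[selected_features].corr().round(2))

# Compare the swapped-in feature with the one it displaced,
# relative to GrLivArea (above-grade living area)
print(f"Corr(GrLivArea, KitchenAbvGr): {X['GrLivArea'].corr(X['KitchenAbvGr']):.2f}")
print(f"Corr(GrLivArea, FullBath): {X['GrLivArea'].corr(X['FullBath']):.2f}")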
Finding the Predictive “Sweet Spot”
The journey to optimal feature selection begins by pushing our model to its limits. By initially considering the maximum possible features, we gain a comprehensive view of how model performance evolves by adding each feature. This visualization serves as our starting point, highlighting the diminishing returns on model predictability and guiding us toward finding the “sweet spot.” Let’s start by running a Sequential Feature Selector (SFS) across the entire feature set, plotting the performance to visualize the impact of each addition:
# Performance of SFS from 1 feature to the maximum, building on the code above
import matplotlib.pyplot as plt

# Prepare to store the mean CV R² scores for each number of features
mean_scores = []

# Iterate over a range from 1 feature to the maximum number of features available
for n_features_to_select in range(1, len(X.columns)):
    sfs = SequentialFeatureSelector(model, n_features_to_select=n_features_to_select)
    sfs.fit(X, y)
    selected_features = X.columns[sfs.get_support()]
    score = cross_val_score(model, X[selected_features], y, cv=5, scoring="r2").mean()
    mean_scores.append(score)

# Plot the mean CV R² scores against the number of features selected
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(X.columns)), mean_scores, marker="o")
plt.title("Performance vs. Number of Features Selected")
plt.xlabel("Number of Features")
plt.ylabel("Mean CV R² Score")
plt.grid(True)
plt.show()
The plot below demonstrates how model performance improves as more features are added but eventually plateaus, indicating a point of diminishing returns:
From this plot, you can see that using more than ten features brings little benefit, while using three or fewer is clearly suboptimal. You can apply the "elbow method" to judge where the curve bends and pick the number of features there. This is ultimately a subjective call; the plot suggests anywhere from 5 to 9 features looks reasonable. A rough numeric cue is sketched below.
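If you prefer a numeric cue to complement eyeballing the curve, you can look at the marginal gain from each additional feature and note where it first drops below a small threshold. The sketch below is an optional addition that reuses mean_scores from the loop above; the 0.005 threshold is an arbitrary choice made here to mirror the tolerance used in the next step.

# Rough numeric cue for the "elbow": marginal gain from each added feature
import numpy as np

gains = np.diff(mean_scores)  # gains[i] = improvement going from i+1 to i+2 features
threshold = 0.005             # arbitrary; chosen to mirror the tol used below

small_gains = np.where(gains < threshold)[0]
if len(small_gains) > 0:
    print(f"Gain first drops below {threshold} when moving beyond {small_gains[0] + 1} features")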
Armed with the insights from our initial exploration, we apply a tolerance (tol=0.005) to our feature selection process. With this setting, SFS keeps adding features only as long as each addition improves the cross-validated score by at least 0.005, which lets us determine the optimal number of features objectively and robustly:
# Apply Sequential Feature Selector with tolerance = 0.005, building on the code above
sfs_tol = SequentialFeatureSelector(model, n_features_to_select="auto", tol=0.005)
sfs_tol.fit(X, y)

# Get the number of features selected with tolerance
n_features_selected = sum(sfs_tol.get_support())

# Prepare to store the mean CV R² scores for each number of features
mean_scores_tol = []

# Iterate over a range from 1 feature to the sweet spot
for n_features_to_select in range(1, n_features_selected + 1):
    sfs = SequentialFeatureSelector(model, n_features_to_select=n_features_to_select)
    sfs.fit(X, y)
    selected_features = X.columns[sfs.get_support()]
    score = cross_val_score(model, X[selected_features], y, cv=5, scoring="r2").mean()
    mean_scores_tol.append(score)

# Plot the mean CV R² scores against the number of features selected
plt.figure(figsize=(10, 6))
plt.plot(range(1, n_features_selected + 1), mean_scores_tol, marker="o")
plt.title("The Sweet Spot: Performance vs. Number of Features Selected")
plt.xlabel("Number of Features")
plt.ylabel("Mean CV R² Score")
plt.grid(True)
plt.show()
This strategic move allows us to concentrate on those features that provide the highest predictability, culminating in the selection of 8 optimal features:
We can now conclude our findings by showing the features selected by SFS:
# Print the selected features and their performance, building on the above
selected_features = X.columns[sfs_tol.get_support()]
print(f"Number of features selected: {n_features_selected}")
print(f"Selected features: {selected_features.tolist()}")
print(f"Mean CV R² Score using SFS with tol=0.005: {mean_scores_tol[-1]:.4f}")
Number of features selected: 8
Selected features: ['GrLivArea', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', '1stFlrSF', 'BedroomAbvGr', 'KitchenAbvGr']
Mean CV R² Score using SFS with tol=0.005: 0.8239
By focusing on these 8 features, we achieve a model that balances complexity with high predictability, showcasing the effectiveness of a measured approach to feature selection.
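As an optional follow-up, not part of the original analysis, you could refit a plain linear regression on these eight features and inspect the learned coefficients, reusing X, y, and sfs_tol from above. Keep in mind that the features are on different scales, so the coefficient magnitudes are not directly comparable without standardization.

# Optional: fit the final model on the 8 selected features and inspect coefficients
final_features = X.columns[sfs_tol.get_support()]
final_model = LinearRegression()
final_model.fit(X[final_features], y)

# Coefficients are in the original (unscaled) units of each feature
for name, coef in zip(final_features, final_model.coef_):
    print(f"{name}: {coef:,.2f}")
print(f"Intercept: {final_model.intercept_:,.2f}")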
Summary
Through this three-part post, you have embarked on a journey from assessing the predictive power of individual features to harnessing their combined strength in a refined model. Our exploration has demonstrated that while more features can enhance a model’s ability to capture complex patterns, there comes a point where additional features no longer contribute to improved predictions. By applying a tolerance level to the Sequential Feature Selector, you have homed in on an optimal set of features that propels the model’s performance to its peak without overcomplicating the predictive landscape. This sweet spot, identified as eight key features, epitomizes the strategic melding of simplicity and sophistication in predictive modeling.
Specifically, you learned:
- The Art of Starting Simple: Beginning with simple linear regression models to understand each feature’s standalone predictive value sets the foundation for more complex analyses.
- Synergy in Selection: The transition to the Sequential Feature Selector underscores the importance of not just individual feature strengths but their synergistic impact when combined effectively.
- Maximizing Model Efficacy: The quest for the predictive sweet spot through SFS with a set tolerance teaches us the value of precision in feature selection, achieving the most with the least.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.