The Strategic Use of Sequential Feature Selector for Housing Price Predictions

To understand housing prices better, simplicity and clarity in our models are key. Our aim with this post is to demonstrate how straightforward yet powerful techniques in feature selection and engineering can lead to creating an effective, simple linear regression model. Working with the Ames dataset, we use a Sequential Feature Selector (SFS) to identify the most impactful numeric features and then enhance our model’s accuracy through thoughtful feature engineering.

Let’s get started.

Photo by Mahrous Houses. Some rights reserved.

Overview

This post is divided into three parts; they are:

Identifying the Most Predictive Numeric Feature
Evaluating Individual Features’ Predictive Power
Enhancing Predictive Accuracy with Feature Engineering

Identifying the Most Predictive Numeric Feature

In the initial segment of our exploration, we embark on a mission to identify the most predictive numeric feature within the Ames dataset. This is achieved by applying Sequential Feature Selector (SFS), a tool designed to sift through features and select the one that maximizes our model’s predictive accuracy. The process is straightforward, focusing solely on numeric columns and excluding any with missing values to ensure a clean and robust analysis:


# Load only the numeric columns from the Ames dataset
import pandas as pd
Ames = pd.read_csv('Ames.csv').select_dtypes(include=['int64', 'float64'])

# Drop any columns with missing values
Ames = Ames.dropna(axis=1)

# Import Linear Regression and Sequential Feature Selector from scikit-learn
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Initialize the Linear Regression model
model = LinearRegression()

# Perform Sequential Feature Selection
sfs = SequentialFeatureSelector(model, n_features_to_select=1)
X = Ames.drop('SalePrice', axis=1)  # Features
y = Ames['SalePrice']  # Target variable
sfs.fit(X, y)  # Uses a default of cv=5
selected_feature = X.columns[sfs.get_support()]
print("Feature selected for highest predictability:", selected_feature[0])

This will output:

Feature selected for highest predictability: OverallQual


This result challenges the initial presumption that area would be the most predictive feature for housing prices. Instead, it points to overall quality as the strongest single numeric predictor in this dataset, which suggests that quality weighs heavily in what buyers pay. Note that the Sequential Feature Selector uses cross-validation with a default of five folds (cv=5) to evaluate each candidate feature subset. The selected feature is the one with the highest mean cross-validation R² score, so the choice reflects performance on held-out folds rather than fit to the training data alone, which makes it more likely to generalize to unseen data.
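To make the selection rule concrete, here is a minimal sketch on synthetic data (the columns "quality", "area", and "noise" are made up for illustration and are not the Ames columns). Assuming the default forward direction, asking the selector for a single feature should amount to scoring every column on its own with 5-fold cross-validation and keeping the one with the highest mean R²:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (hypothetical columns, not Ames)
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    "quality": rng.normal(size=n),
    "area": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# By construction, the target depends most strongly on "quality"
y_demo = 3.0 * X_demo["quality"] + 1.0 * X_demo["area"] + rng.normal(scale=0.5, size=n)

model_demo = LinearRegression()

# Manual version: mean 5-fold CV R² for each column used on its own
manual_scores = {
    col: cross_val_score(model_demo, X_demo[[col]], y_demo, cv=5).mean()
    for col in X_demo.columns
}
manual_best = max(manual_scores, key=manual_scores.get)

# The selector, with the same settings as in the post (cv=5 written out)
sfs_demo = SequentialFeatureSelector(model_demo, n_features_to_select=1, cv=5)
sfs_demo.fit(X_demo, y_demo)
sfs_best = X_demo.columns[sfs_demo.get_support()][0]

print("Best feature by manual CV scoring:", manual_best)
print("Best feature by SFS:", sfs_best)
```

Both approaches should agree here because the selector, when limited to one feature, has nothing else to compare but the single-column models.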

Evaluating Individual Features’ Predictive Power

Building upon our initial findings, we delve deeper to rank features by their predictive capabilities. Employing cross-validation, we evaluate each feature independently, calculating their mean R² scores from cross-validation to ascertain their individual contributions to the model’s accuracy.


# Building on the earlier block of code:
from sklearn.model_selection import cross_val_score

# Dictionary to hold feature names and their corresponding mean CV R² scores
feature_scores = {}

# Iterate over each feature, perform CV, and store the mean R² score
for feature in X.columns:
    X_single = X[[feature]]
    cv_scores = cross_val_score(model, X_single, y, cv=5)
    feature_scores[feature] = cv_scores.mean()

# Sort features based on their mean CV R² scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)

# Print the top 3 features and their scores
top_3 = sorted_features[0:3]
for feature, score in top_3:
    print(f"Feature: {feature}, Mean CV R²: {score:.4f}")

This will output:


Feature: OverallQual, Mean CV R²: 0.6183

Feature: GrLivArea, Mean CV R²: 0.5127

Feature: 1stFlrSF, Mean CV R²: 0.3957

These findings underline the key role of overall quality (“OverallQual”), as well as the importance of living area (“GrLivArea”) and first-floor space (“1stFlrSF”) in the context of housing price predictions.
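A mean score alone does not show how much the result varies from fold to fold. As a rough sketch, the same ranking can be wrapped in a small helper that also reports the standard deviation across folds. The helper name rank_single_features and the synthetic columns below are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def rank_single_features(X, y, cv=5):
    """Return mean and std of CV R² for each column used alone, best first."""
    rows = []
    for col in X.columns:
        scores = cross_val_score(LinearRegression(), X[[col]], y, cv=cv)
        rows.append({"feature": col, "mean_r2": scores.mean(), "std_r2": scores.std()})
    ranked = pd.DataFrame(rows).sort_values("mean_r2", ascending=False)
    return ranked.reset_index(drop=True)


# Synthetic data with a known ordering of feature strength (not Ames)
rng = np.random.default_rng(42)
n = 400
X_toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "medium": rng.normal(size=n),
    "weak": rng.normal(size=n),
})
y_toy = (4.0 * X_toy["strong"] + 2.0 * X_toy["medium"]
         + 0.5 * X_toy["weak"] + rng.normal(size=n))

ranking = rank_single_features(X_toy, y_toy)
print(ranking)
```

A feature whose mean R² is high but whose spread across folds is large deserves more caution than one with a similar mean and a small spread.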

Enhancing Predictive Accuracy with Feature Engineering

In the final stride of our journey, we employ feature engineering to create a novel feature, “Quality Weighted Area,” by multiplying ‘OverallQual’ by ‘GrLivArea’. This fusion aims to synthesize a more powerful predictor, encapsulating both the quality and size dimensions of a property.


# Building on the earlier blocks of code:
Ames['QualityArea'] = Ames['OverallQual'] * Ames['GrLivArea']

# Setting up the feature and target variable for the new 'QualityArea' feature
X = Ames[['QualityArea']]  # New feature
y = Ames['SalePrice']

# 5-Fold CV on Linear Regression
model = LinearRegression()
cv_scores = cross_val_score(model, X, y, cv=5)

# Calculating the mean of the CV scores
mean_cv_score = cv_scores.mean()
print(f"Mean CV R² score using 'Quality Weighted Area': {mean_cv_score:.4f}")

This will output:

Mean CV R² score using ‘Quality Weighted Area’: 0.7484


Compared with the mean CV R² of 0.6183 for “OverallQual” alone, this single engineered feature reaches 0.7484. The increase shows how combining features can capture aspects of the data that neither feature captures on its own, and it makes a compelling case for the thoughtful application of feature engineering in predictive modeling.
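Why can a product of two features beat either one in a simple linear regression? One possible explanation is that price per unit of area itself depends on quality, which a straight line in either variable cannot express. The sketch below illustrates this on synthetic data built under exactly that assumption. The columns and the price formula are invented for the demonstration and say nothing about the real Ames data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "quality": rng.integers(1, 11, size=n),    # rating from 1 to 10
    "area": rng.uniform(500, 3000, size=n),    # living area
})
# Assumption for this demo: price per unit of area scales with quality
toy["price"] = 10.0 * toy["quality"] * toy["area"] + rng.normal(scale=10_000, size=n)

# Engineered feature: the product of the two raw features
toy["quality_area"] = toy["quality"] * toy["area"]


def mean_cv_r2(columns):
    """Mean 5-fold CV R² of a linear regression on the given columns."""
    scores = cross_val_score(LinearRegression(), toy[columns], toy["price"], cv=5)
    return scores.mean()


r2_quality = mean_cv_r2(["quality"])
r2_area = mean_cv_r2(["area"])
r2_product = mean_cv_r2(["quality_area"])

print(f"quality alone:  {r2_quality:.4f}")
print(f"area alone:     {r2_area:.4f}")
print(f"quality x area: {r2_product:.4f}")
```

When the target really is driven by the product, the engineered column lets a one-variable linear model fit a relationship that would otherwise need an interaction term.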

Summary

Through this three-part exploration, you have walked through the process of pinpointing and enhancing predictors for housing prices with an emphasis on simplicity. Because the goal was to create the best simple linear regression model, categorical features were excluded to keep the analysis streamlined. Using a Sequential Feature Selector (SFS), we found that overall quality is the most predictive single numeric feature, and we then evaluated the individual contributions of living area and first-floor space. Creating “Quality Weighted Area,” a feature blending quality with size, notably enhanced our model’s accuracy. With the right techniques, even simple models can yield useful insights into complex datasets like the Ames housing data.

Specifically, you learned:

The value of Sequential Feature Selection in revealing the most important predictors for housing prices.
The importance of quality over size when predicting housing prices in Ames, Iowa.
How merging features into a “Quality Weighted Area” enhances model accuracy.

Do you have experiences with feature selection or engineering you would like to share, or questions about the process? Please ask your questions or give us feedback in the comments below, and I will do my best to answer.


About Vinod Chugani

Born in India and nurtured in Japan, I am a Third Culture Kid with a global perspective. My academic journey at Duke University included majoring in Economics, with the honor of being inducted into Phi Beta Kappa in my junior year. Over the years, I’ve gained diverse professional experiences, spending a decade navigating Wall Street’s intricate Fixed Income sector, followed by leading a global distribution venture on Main Street.
Currently, I channel my passion for data science, machine learning, and AI as a Mentor at the New York City Data Science Academy. I value the opportunity to ignite curiosity and share knowledge, whether through Live Learning sessions or in-depth 1-on-1 interactions.
With a foundation in finance/entrepreneurship and my current immersion in the data realm, I approach the future with a sense of purpose and assurance. I anticipate further exploration, continuous learning, and the opportunity to contribute meaningfully to the ever-evolving fields of data science and machine learning, especially here at MLM.
