โ† Back to Portfolio

Machine Learning · Python · Kaggle

House Price Prediction

A blended regression model predicting residential sale prices, built for the Kaggle House Prices competition. The final submission scored in the top 5% of the leaderboard using an ensemble of nine machine learning algorithms.

Python · XGBoost · LightGBM · Scikit-learn · Feature Engineering · EDA
Top 5% Kaggle leaderboard finish
Kaggle leaderboard result
Final leaderboard position – top 5% of all submissions on the House Prices: Advanced Regression Techniques competition.

Exploratory Data Analysis

The provided training set contains 1,460 rows and 81 columns – one prediction target (sale price), one ID column, and 79 possible features. Before any modelling, it's important to understand what the data actually looks like.

Feature Distributions

Univariate distributions reveal high skew in several numeric features – BsmtUnfSF is a clear example. Many categorical features also show very uneven distributions: Street is almost entirely "Pave", with "Grvl" appearing only rarely. Features like this offer little predictive value to a model.

Univariate distributions
Numeric feature distributions – several show significant positive skew.
Categorical distributions
Categorical features – many are heavily imbalanced with one dominant value.
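Both checks are easy to automate. A minimal sketch with toy stand-in data (the column names match the competition data; the 0.5 skew threshold is the one used later in feature engineering):

```python
import pandas as pd

# Toy stand-in columns; names match the competition data.
df = pd.DataFrame({
    "BsmtUnfSF": [0, 100, 150, 200, 250, 300, 400, 600, 900, 2300],
    "Street":    ["Pave"] * 9 + ["Grvl"],
})

# High positive skew flags numeric candidates for a log transform.
skew = df["BsmtUnfSF"].skew()

# Share of the most common category; near 1.0 means little signal.
top_share = df["Street"].value_counts(normalize=True).iloc[0]

print(f"BsmtUnfSF skew: {skew:.2f}, Street dominant share: {top_share:.0%}")
```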

Outliers & Correlation

Box plots surface a large number of potential outliers – many caused by the skewed distributions themselves. With only ~1,460 rows, dropping them aggressively would hurt more than help, so most are retained.

Two specific points in the GrLivArea chart break the otherwise linear relationship between living area and sale price – very large homes sold at unusually low prices, likely anomalies. Removing them strengthens the correlation.
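Expressed as a filter, the removal looks roughly like this (the thresholds and toy data are illustrative, not the project's exact cut-offs):

```python
import pandas as pd

# Toy frame; the two large-but-cheap homes mimic the anomalies in the data.
train = pd.DataFrame({
    "GrLivArea": [1500, 1800, 2200, 4700, 5600],
    "SalePrice": [180_000, 210_000, 260_000, 160_000, 185_000],
})

# Very large living areas paired with low sale prices break the linear trend.
anomaly = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300_000)
train = train[~anomaly].reset_index(drop=True)
print(len(train))
```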

Box plots showing outliers
Features vs sale price
Numeric features vs sale price. Red points in GrLivArea are removed before modelling.
Correlation heatmap
Correlation matrix – coloured blocks show pairs above 0.75. One from each pair is dropped to reduce multicollinearity.
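A common way to implement the pairwise drop is via the upper triangle of the absolute correlation matrix. A sketch with synthetic columns (names borrowed from the dataset; GarageCars and GarageArea are made nearly collinear on purpose):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "GarageCars": a,
    "GarageArea": a + rng.normal(scale=0.1, size=200),  # near-duplicate
    "LotArea":    rng.normal(size=200),                 # independent
})

# Upper triangle of the absolute correlation matrix; drop one column
# from every pair correlated above 0.75.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.75).any()]
df = df.drop(columns=to_drop)
print(to_drop)
```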

Feature Engineering

With the data understood, the next step is transforming it into a form models can learn from effectively. This involves five stages:

  1. Handle missing data – Ordinal columns get "NA"; categorical columns take the mode; numeric columns take the median grouped by neighbourhood.
  2. Add new features – Total bathrooms (upstairs + downstairs) often matters more than either component alone.
  3. Remove low-signal features – Columns with near-zero correlation to sale price, or a single value in 99%+ of rows, are dropped.
  4. Transform skewed features – A log transform is applied to any feature with skew above 0.5.
  5. Encode categorical features – Unordered categories use one-hot encoding; ordinal ratings are mapped to integers.
Log transformation of LotArea
Before and after log transform on LotArea – skew is substantially reduced.
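The five stages can be sketched as a single transform function. This is an illustrative outline following the description above, not the project's exact code; the quality mapping and the handful of columns shown are assumptions:

```python
import numpy as np
import pandas as pd

# Assumed ordinal mapping for quality ratings.
QUAL_MAP = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def engineer(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # 1. Missing data: "NA" for ordinal ratings, mode for categoricals,
    #    neighbourhood-grouped median for numerics.
    df["BsmtQual"] = df["BsmtQual"].fillna("NA")
    df["MSZoning"] = df["MSZoning"].fillna(df["MSZoning"].mode()[0])
    df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
        lambda s: s.fillna(s.median()))
    # 2. New features: total bathrooms across floors.
    df["TotalBath"] = df["FullBath"] + df["BsmtFullBath"]
    # 3. Low-signal columns would be dropped here (omitted in this sketch).
    # 4. Log-transform numeric features with skew above 0.5.
    for col in df.select_dtypes("number"):
        if df[col].skew() > 0.5:
            df[col] = np.log1p(df[col])
    # 5. Ordinal ratings map to integers; the rest are one-hot encoded.
    df["BsmtQual"] = df["BsmtQual"].map(QUAL_MAP)
    return pd.get_dummies(df)
```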

Model Training

Nine models are trained and scored on a held-out validation set they never saw during training, using log RMSE – the Kaggle leaderboard metric. Individual models cluster between 0.105 and 0.115. The blended model, combining all nine with optimised weights, reaches 0.099.

Blending approach: 250,000 random weight combinations are tested. The combination with the lowest validation RMSE is kept for the final submission.
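A sketch of the metric and the random search (the prediction arrays here are stand-ins; the real run uses the nine models' validation-set prediction vectors):

```python
import numpy as np

def log_rmse(y_true, y_pred):
    # Kaggle metric: RMSE between the logs of predicted and actual prices.
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))

def blend_search(preds, y_true, n_iter=250_000, seed=0):
    """Random search over weight vectors that sum to 1."""
    rng = np.random.default_rng(seed)
    preds = np.asarray(preds)          # shape: (n_models, n_samples)
    best_w, best_score = None, np.inf
    for _ in range(n_iter):
        w = rng.random(len(preds))
        w /= w.sum()                   # convex combination of models
        score = log_rmse(y_true, w @ preds)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

Because the weights are normalised to sum to one, every candidate blend is a convex combination of the individual model predictions.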

Model               Validation RMSE
XGBoost             ~0.108
LightGBM            ~0.110
Gradient Boosting   ~0.112
SVR                 ~0.113
Random Forest       ~0.113
Ridge Regression    ~0.114
Lasso               ~0.115
Elastic Net         ~0.115
Blended              0.099
RMSE comparison
Validation RMSE per model. Blended model wins at 0.099.

Result & what's next

The blended model achieves a top 5% leaderboard finish – well past the original goal of top 10%. Knowing when to stop is itself a useful skill; chasing marginal gains on a competition dataset rarely translates to real-world value.

Areas to revisit if returning to this project:

  • Hyperparameter tuning on the individual models
  • Proper gradient-based weight optimisation for the blend (currently random search)
  • More aggressive feature engineering
  • Additional outlier removal based on residual analysis
  • Alternative scaling methods (Box-Cox transformation)