Machine Learning · Python · Kaggle
A blended regression model predicting residential sale prices, built for the Kaggle House Prices competition. The final submission scored in the top 5% of the leaderboard using an ensemble of 9 machine learning algorithms.
The provided training set contains 1,460 rows and 81 columns: one prediction target (sale price), one ID column, and 79 possible features. Before any modelling, it's important to understand what the data actually looks like.
Univariate distributions reveal high skew in several numeric features; BsmtUnfSF is a clear example. Many categorical features also show very uneven distributions: Street is almost entirely "Pave", with "Grvl" appearing only rarely. Features like this offer little predictive value to a model.
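The two checks described above, numeric skew and category dominance, can be sketched with pandas. This is a minimal illustration on synthetic data, not the project's actual code: the frame `df` and its values are made up, and only the column names BsmtUnfSF and Street come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "BsmtUnfSF": rng.exponential(scale=500.0, size=1000),  # long right tail
    "Street": ["Pave"] * 995 + ["Grvl"] * 5,               # one dominant level
})

# Sample skewness: values well above 0 indicate a long right tail.
skewness = df["BsmtUnfSF"].skew()

# Share of rows taken by the most common category.
top_share = df["Street"].value_counts(normalize=True).iloc[0]

print(f"BsmtUnfSF skew: {skewness:.2f}")
print(f"Most common Street value covers {top_share:.1%} of rows")
```

A feature whose top category covers nearly every row carries almost no variation for a model to learn from, which is why such columns are candidates for removal.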
Box plots surface a large number of potential outliers, many caused by the skewed distributions themselves. With only ~1,400 rows, dropping them aggressively would hurt more than help, so most are retained.
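To see why skew alone inflates the outlier count, the sketch below applies the usual box-plot whisker rule (1.5 times the interquartile range) to a symmetric and a skewed sample of the same size. The helper `iqr_outlier_mask` and the synthetic samples are assumptions made for illustration; the original analysis used plots, not this function.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside the box-plot whiskers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Two synthetic samples with the same number of rows as the training set.
rng = np.random.default_rng(0)
symmetric = rng.normal(size=1460)
skewed = rng.lognormal(size=1460)

n_sym = int(iqr_outlier_mask(symmetric).sum())
n_skew = int(iqr_outlier_mask(skewed).sum())

print(f"Flagged in symmetric sample: {n_sym}")
print(f"Flagged in skewed sample:    {n_skew}")
```

The skewed sample gets far more points flagged even though nothing is wrong with it, so a flagged point is a prompt for inspection rather than a reason to delete the row.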
Two specific points in the GrLivArea chart break the otherwise linear relationship between living area and sale price: very large homes sold at unusually low prices, likely anomalies. Removing them strengthens the correlation.
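One way to drop such points is a simple boolean filter. The sketch below uses a tiny made-up frame, and the cut-offs `AREA_CUTOFF` and `PRICE_CUTOFF` are hypothetical values chosen for the example; the write-up does not state which thresholds were actually used.

```python
import pandas as pd

# Tiny illustrative frame; the last two rows mimic very large, cheap homes.
train = pd.DataFrame({
    "GrLivArea": [1200, 1500, 1800, 2400, 3000, 4700, 5600],
    "SalePrice": [140000, 175000, 210000, 280000, 350000, 185000, 160000],
})

# Hypothetical cut-offs for this sketch, not taken from the original project.
AREA_CUTOFF = 4000
PRICE_CUTOFF = 300000

# Flag rows that are both very large and unusually cheap, then drop them.
is_anomaly = (train["GrLivArea"] > AREA_CUTOFF) & (train["SalePrice"] < PRICE_CUTOFF)
cleaned = train.loc[~is_anomaly].reset_index(drop=True)

corr_before = train["GrLivArea"].corr(train["SalePrice"])
corr_after = cleaned["GrLivArea"].corr(cleaned["SalePrice"])

print(f"Rows removed: {int(is_anomaly.sum())}")
print(f"Correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

Filtering on both conditions keeps large homes that sold at proportionally high prices, since those follow the trend rather than breaking it.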
With the data understood, the next step is transforming it into a form models can learn from effectively. This involves five stages:
Nine models are trained and evaluated using log RMSE, the Kaggle leaderboard metric. All models are evaluated on a held-out validation set they never saw during training. Individual models cluster between 0.105 and 0.115. The blended model, combining all nine with optimised weights, reaches 0.099.
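For reference, log RMSE is the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual price. A minimal sketch follows; the function name `log_rmse` is chosen here and is not from the original code.

```python
import numpy as np

def log_rmse(y_true, y_pred):
    """RMSE between log(predicted price) and log(actual price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A uniform 10% over-prediction gives the same score at any price level.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([500_000], [550_000])
print(f"{cheap:.4f} {expensive:.4f}")
```

Because the error is taken on the log scale, mistakes on cheap and expensive homes are penalised in proportion to the price rather than in absolute dollars.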
Blending approach: 250,000 random weight combinations are tested. The combination with the lowest validation RMSE is kept for the final submission.
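A random weight search of this kind might look like the sketch below. Several details are assumptions rather than the project's actual implementation: weights are drawn from a flat Dirichlet distribution so they are non-negative and sum to 1, predictions are assumed to already be in log-price space, the validation data is synthetic, and the demo runs 5,000 trials rather than 250,000 to keep it quick.

```python
import numpy as np

def random_weight_search(val_preds, y_val, n_trials, seed=0):
    """Try random blend weights and keep the set with the lowest validation RMSE.

    val_preds: array of shape (n_models, n_samples), predictions in log-price space.
    y_val:     array of shape (n_samples,), true log prices.
    """
    rng = np.random.default_rng(seed)
    val_preds = np.asarray(val_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    n_models = val_preds.shape[0]

    best_rmse, best_weights = np.inf, None
    for _ in range(n_trials):
        # Assumption: non-negative weights that sum to 1.
        weights = rng.dirichlet(np.ones(n_models))
        blend = weights @ val_preds
        rmse = np.sqrt(np.mean((blend - y_val) ** 2))
        if rmse < best_rmse:
            best_rmse, best_weights = rmse, weights
    return best_weights, float(best_rmse)

# Synthetic validation set: nine "models" with independent errors.
rng = np.random.default_rng(42)
y_val = rng.normal(loc=12.0, scale=0.4, size=300)
val_preds = np.stack([y_val + rng.normal(scale=0.11, size=300) for _ in range(9)])

single_rmses = np.sqrt(np.mean((val_preds - y_val) ** 2, axis=1))
weights, blended_rmse = random_weight_search(val_preds, y_val, n_trials=5000)

print(f"Best single model RMSE: {single_rmses.min():.4f}")
print(f"Blended RMSE:           {blended_rmse:.4f}")
```

The synthetic models here have fully independent errors, so the blend improves far more than it would with real models, whose errors are usually correlated.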
The blended model achieves a top 5% leaderboard finish, well past the original goal of top 10%. Knowing when to stop is itself a useful skill; chasing marginal gains on a competition dataset rarely translates to real-world value.
Areas to revisit if returning to this project: