Real Estate Price Prediction

This project grew out of my own self-study. The task is to use California census data to build a model of housing prices in the state. I first visualize the data and select variables, then choose among several ML models based on their performance.

Dataset Summary

The data includes metrics such as population, median income, location, and median housing price for each block group in California. The dataset has 20,640 instances and 10 attributes, covering both numerical and categorical variables.

Link to the data source:

https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz
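As a minimal sketch of how the archive could be fetched and loaded with pandas: the function names, the local `datasets/housing` folder, and the assumption that the archive contains a member called `housing.csv` are mine and are not stated above.

```python
import tarfile
import urllib.request
from pathlib import Path

import pandas as pd

HOUSING_URL = (
    "https://raw.githubusercontent.com/ageron/handson-ml2/"
    "master/datasets/housing/housing.tgz"
)


def fetch_housing_archive(url=HOUSING_URL, dest_dir="datasets/housing"):
    """Download the archive once and return its local path."""
    dest = Path(dest_dir)  # hypothetical local folder
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / "housing.tgz"
    if not archive_path.exists():
        urllib.request.urlretrieve(url, archive_path)
    return archive_path


def load_housing_csv(archive_path, csv_name="housing.csv"):
    """Read one CSV member of the .tgz archive into a DataFrame.

    csv_name is an assumption about the archive layout.
    """
    with tarfile.open(archive_path) as archive:
        member = archive.extractfile(csv_name)
        return pd.read_csv(member)


# Usage (needs network access):
# housing = load_housing_csv(fetch_housing_archive())
# housing.info()
```

Calling `housing.info()` afterwards is a quick way to check the row count, the column types, and which columns have missing values.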

Here is a sample of the dataset that I will use in this project:

Sample data structure

Data Visualization

Since the data contains both location and population information, a plot of the relationship among location, population density, and housing price is a natural place to start.

Data Visualization

In the plot above, red dots are expensive areas and blue dots are cheap ones. Larger circles indicate a larger population in that area. The plot suggests that housing price is related to both location and population density.
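A plot of this kind could be drawn roughly as follows. This is only a sketch: the column names (`longitude`, `latitude`, `population`, `median_house_value`), the `jet` color map, and the marker scaling are my assumptions, not details given above.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_price_map(housing):
    """Scatter block groups by location; size = population, color = price."""
    fig, ax = plt.subplots(figsize=(10, 7))
    points = ax.scatter(
        housing["longitude"],
        housing["latitude"],
        s=housing["population"] / 100,    # marker area grows with population
        c=housing["median_house_value"],  # color encodes the price
        cmap="jet",                       # blue = cheap, red = expensive
        alpha=0.4,                        # transparency reveals dense areas
    )
    fig.colorbar(points, ax=ax, label="Median house value")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    return fig, ax
```

The returned figure can be written to disk with `fig.savefig(...)`; drop the `matplotlib.use("Agg")` line when working interactively.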

Variables Correlation and Data Cleaning

For variable selection, I start with the correlation map, which shows the relationship between the dependent variable and each independent variable.

Correlation Coefficients

I then select the features with the largest absolute correlation coefficients and draw a matrix plot of them.

Stacked Histogram Plot
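The ranking step can be sketched like this, assuming the target column is named `median_house_value` (the function name is hypothetical). Non-numeric columns are skipped because Pearson correlation is only defined for numbers.

```python
import pandas as pd


def rank_by_correlation(df, target="median_house_value"):
    """Return numeric features ordered by absolute correlation with target."""
    # Pearson correlation of every numeric column with the target column.
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    # Order by magnitude but keep the sign, so negative relationships stay visible.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)
```

The top few names returned by `rank_by_correlation(housing)` could then be passed to `pandas.plotting.scatter_matrix` to draw the matrix plot.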

We now have a basic sense of the relationship between the dependent variable and the numerical variables. After imputing the missing data, we transformed the features: the categorical variable (ocean_proximity in this case) was one-hot encoded, and the numerical variables were standardized to zero mean and unit variance. With that, the data is ready for training the ML models.
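One way to express these preprocessing steps with sklearn is a `ColumnTransformer`. This is a sketch under assumptions: the imputation strategy is not stated above, so the median is used here as a placeholder, and the function name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Impute and standardize numeric columns; one-hot encode categorical ones."""
    numeric_pipeline = Pipeline([
        # Assumption: fill missing values with the column median.
        ("impute", SimpleImputer(strategy="median")),
        # Rescale each column to zero mean and unit variance.
        ("scale", StandardScaler()),
    ])
    return ColumnTransformer([
        ("num", numeric_pipeline, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])


# Usage:
# preprocessor = build_preprocessor(numeric_cols, ["ocean_proximity"])
# X_prepared = preprocessor.fit_transform(housing_features)
```

Fitting the preprocessor on the training data only, and reusing it with `transform` on the testing data, keeps test-set statistics from leaking into the scaling and imputation.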

Model Training and Selection

After preprocessing, we trained three models with the sklearn package in Python: linear regression, a decision tree, and a random forest. We measured model performance with root mean squared error (RMSE) and 10-fold cross-validation. The results are shown in the table below:

 

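Model         | Training RMSE | Training CV scores mean | Training CV score std | Testing RMSE | Testing CV scores mean | Testing CV score std
--------------|---------------|-------------------------|-----------------------|--------------|------------------------|---------------------
Linear reg    | 69050         | 69223                   | 2657                  | 67352        | 67630                  | 2568
Decision Tree | 0             | 69927                   | 2690                  | 74400        | 73493                  | 5311
Random Forest | 18307         | 49482                   | 1859                  | 53983        | 53950                  | 2425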
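Cross-validation columns like the ones in this table could be computed along the following lines. It is a sketch, not the exact script behind the numbers: the hyperparameters, the random seed, and the function names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse(model, X, y, folds=10):
    """Return (mean, std) of the RMSE across the cross-validation folds."""
    # sklearn maximizes scores, so the MSE is reported with a negative sign.
    neg_mse = cross_val_score(
        model, X, y, scoring="neg_mean_squared_error", cv=folds
    )
    rmse = np.sqrt(-neg_mse)
    return rmse.mean(), rmse.std()


def compare_models(X, y, folds=10, seed=42):
    """Cross-validate the three model families on the same data."""
    models = {
        "Linear reg": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    return {name: cv_rmse(model, X, y, folds) for name, model in models.items()}
```

Each entry of the returned dictionary is a `(mean, std)` pair, which corresponds to the "CV scores mean" and "CV score std" columns.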

We start with the training set. The decision tree has the smallest training RMSE (0), but that is a sign of overfitting rather than accuracy: an unconstrained tree can memorize the training data, which leads to higher variance on unseen data. That is why we also use CV scores to measure performance. In the training-set columns of the table, the decision tree has the worst CV scores mean (69927) of the three models.

Now compare the linear model and the random forest. A random forest trains many decision trees on random subsets of the data and averages their predictions. Here it has both a smaller RMSE and a smaller CV scores mean than the linear model, so on the training set we consider the random forest the best of the three.

We also need to check the testing set. Unsurprisingly, the decision tree has the highest RMSE (74400) because of overfitting, and the random forest is again the best model, as it was on the training set.
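The overfitting pattern can be reproduced on any dataset by comparing the RMSE on the data a model was fitted on with the RMSE on held-out data. The helper below is a hypothetical sketch; the 80/20 split and the seed are assumptions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_vs_holdout_rmse(model, X, y, test_size=0.2, seed=42):
    """Fit on one split; return (training RMSE, held-out RMSE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    model.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    return train_rmse, test_rmse


# Usage:
# from sklearn.tree import DecisionTreeRegressor
# train_vs_holdout_rmse(DecisionTreeRegressor(), X, y)
```

A training RMSE near zero paired with a much larger held-out RMSE is the signature of a model that has memorized its training data.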

Importance of variables

Once we have tuned our best model, we can check the importance of each independent variable.

Stacked Histogram Plot

As we can see, median income contributes the most to house price, followed by house location. Note that an importance score such as the 0.48 for median income is a relative share of the model's total importance, not a slope, so it does not mean that a 1-unit increase in median income raises the house price by 0.48 units. The number of households seems less useful.
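For a fitted sklearn random forest, the importances can be read from its `feature_importances_` attribute. The helper below is a hypothetical sketch that pairs the scores with column names; the names must be given in the same order as the columns the forest was trained on.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def ranked_importances(fitted_forest, feature_names):
    """Pair each feature name with its impurity-based importance, largest first."""
    importances = pd.Series(
        fitted_forest.feature_importances_, index=feature_names
    )
    return importances.sort_values(ascending=False)


# Usage:
# forest = RandomForestRegressor(random_state=42).fit(X_prepared, y)
# print(ranked_importances(forest, feature_names))
```

The scores sum to 1, which is why each one reads as a share of the total rather than as a per-unit effect on price.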

Summary

Using a Python script, we fetched and cleaned the data and built three ML models to predict median house price in California. Based on RMSE and 10-fold cross-validation scores, the random forest was the best of the three. We also found that median income is the most important variable for determining house price. In future work, I may try population density instead of population, and total rooms per person instead of the total rooms and households columns, to see whether that changes the results.
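The ratio features could be derived as sketched below, assuming the columns are named `total_rooms`, `population`, and `households`. A true population density would also need the area of each block group, which is not among the metrics described above, so `population_per_household` is included only as a hypothetical stand-in for crowding.

```python
import pandas as pd


def add_ratio_features(housing):
    """Return a copy of the data with ratio columns added."""
    out = housing.copy()  # leave the caller's DataFrame untouched
    out["rooms_per_person"] = out["total_rooms"] / out["population"]
    # Assumption: a crowding proxy, not a real density (no area is available).
    out["population_per_household"] = out["population"] / out["households"]
    return out
```

Rerunning the correlation ranking and the cross-validation with these columns in place of the raw counts would show whether they change the results.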