      Adrian's Data Science Blog

      I am a programmer and IT professional turned Data Scientist out of fascination for data-driven work, problem-solving, and the high societal impact of big data.


Project Three House Prices

19 Feb 2017

Predicting House Prices Using Linear Regression

Main Problem and Objectives

  • Build a prediction model for house pricing in Ames, IA
  • Where are most sales taking place?
  • Where are the most expensive houses located?
  • Discuss Possible Improvements

Describing the Data and Limitations

  • Target Prediction Feature: Sale Price
  • Number of Instances: 1460
  • Number of Attributes Allowed: 18
  • Years of data collected: 2006 - 2010
  • Missing

Limitations

  • The attributes provided are not necessarily the best indicators of the house pricing
  • The data collected is mostly around a particular unstable time in the market

Understanding the Data

Looking at Correlations

[Figure: CorrelationMap]

Quality & Price Correlation

[Figure: SaleVsQual]
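As a rough illustration of what a quality and price correlation measures, here is a minimal sketch of the Pearson coefficient in plain Python. The variable names and the numbers are made up for the example and are not taken from the Ames data; the post does not say which tool produced its correlation map.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up illustration values, NOT taken from the Ames data.
overall_qual = [4, 5, 6, 7, 8]
sale_price = [110000, 135000, 160000, 210000, 275000]

r = pearson(overall_qual, sale_price)
print(f"correlation between quality and price: {r:.3f}")
```

A value close to 1 means price tends to rise with the quality rating; a correlation map is simply this number computed for every pair of attributes.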

Looking at Sale Prices Across Neighborhoods

[Figure: Sale Price Box Plot]

Where are the most sales happening?

  • Most sales happening in:
    • North Ames
  • How many happened?
    • 225 (about 15.41% of total sales)
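The count and share above can be obtained by tallying sales per neighborhood. The sketch below uses a tiny placeholder list (the neighborhood labels other than North Ames are hypothetical) and then re-checks the reported share from the two figures given in this post, 225 sales out of 1460 instances.

```python
from collections import Counter

def top_neighborhood(neighborhoods):
    """Return (name, count, percent of all sales) for the most common value."""
    counts = Counter(neighborhoods)
    name, count = counts.most_common(1)[0]
    return name, count, 100 * count / len(neighborhoods)

# Placeholder data for illustration only; "Other A" and "Other B" are
# hypothetical labels, not real Ames neighborhoods.
sales = ["North Ames"] * 3 + ["Other A"] * 2 + ["Other B"]
name, count, pct = top_neighborhood(sales)
print(f"{name}: {count} sales ({pct:.2f}% of total)")

# Cross-check of the share reported in the post: 225 of 1460 sales.
reported_share = round(100 * 225 / 1460, 2)
print(reported_share)
```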

Where are the most expensive homes?

Creating a Regression model

  • Type of Regression: Linear
  • Attributes Dropped: Utilities
  • Dummy Variable Selection: All except Lot Area and GrLivArea
  • Accuracy Testing:
    • R Squared: 0.899
    • Mean Absolute Error: 16261.24
    • Mean Squared Error: 637850318.41
    • Root Mean Squared Error: 25255.70
    • Cross Validation Score: 0.738
  • Limitations of the Model

There were many outliers in the data, which caused the RMSE and MSE to be quite high. The cross-validation score was lower than that of the initial model, although R squared improved; this could be a sign of some overfitting in this model.
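To make the accuracy figures easier to interpret, here is a minimal sketch of how the four error metrics are defined, written in plain Python. The prices are made up for illustration (with one deliberately large miss) and the function name is hypothetical; libraries such as scikit-learn provide equivalent functions, and the post does not say which was used.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R squared, MAE, MSE and RMSE for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"r2": r2, "mae": mae, "mse": mse, "rmse": sqrt(mse)}

# Made-up prices for illustration only, NOT Ames data.
# The last prediction misses by 70000 to mimic an outlier.
actual = [100000, 150000, 200000, 400000]
predicted = [110000, 140000, 210000, 330000]
metrics = regression_metrics(actual, predicted)
print(metrics)

# RMSE is the square root of MSE, so the reported figures can be cross-checked.
rmse_from_reported_mse = round(sqrt(637850318.41), 2)
print(rmse_from_reported_mse)
```

Because MSE squares each error, a single large miss raises MSE and RMSE far more than it raises MAE, which is consistent with the outlier effect described above.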

[Figure: Regression]
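The comparison between an in-sample R squared and a cross-validation score can be sketched as follows. This is a simplified illustration under several assumptions: a single feature (living area) instead of the full attribute set, made-up numbers with one expensive outlier, and a simple "every k-th row" fold scheme. It is not the procedure used for the results in this post.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def r_squared(ys, preds):
    """Coefficient of determination for actual values ys and predictions preds."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def k_fold_r2(xs, ys, k):
    """Mean held-out R squared over k folds (fold i holds out every k-th row)."""
    rows = list(zip(xs, ys))
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train = [row for i, row in enumerate(rows) if i not in test_idx]
        test = [row for i, row in enumerate(rows) if i in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        preds = [a + b * x for x, _ in test]
        scores.append(r_squared([y for _, y in test], preds))
    return sum(scores) / k

# Made-up living areas (sq ft) and prices with one expensive outlier.
living_area = [900, 1100, 1300, 1500, 1700, 1900, 2100, 2300, 2500, 2700]
sale_price = [95000, 118000, 130000, 155000, 168000,
              190000, 215000, 228000, 250000, 420000]

a, b = fit_line(living_area, sale_price)
in_sample = r_squared(sale_price, [a + b * x for x in living_area])
cross_val = k_fold_r2(living_area, sale_price, k=5)
print(f"in-sample R squared: {in_sample:.3f}")
print(f"5-fold mean R squared: {cross_val:.3f}")
```

The in-sample score is computed on the same rows the line was fitted to, while each fold score is computed on rows the model never saw. A held-out score well below the in-sample one is the usual hint of overfitting.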

Possible Improvements

  • More location-based metrics, such as surrounding businesses, schools, police stations, etc.

  • More insight into the overall condition and quality metrics

  • More data points for expensive homes, to improve predictions for those homes


