---
title: "San Francisco Housing Predictions: Predictive Modeling Using Linear Regression"
author: "Keon Monroe & Scott Mullarkey"
date: "10/18/2019"
output:
  md_document:
    variant: markdown
---

Machine Learning and Housing Predictions

Using linear regression modeling, we aim to predict individual housing prices for a hold-out sample dataset (holdOut == 1).

Data

Using publicly available data, drawn almost exclusively from San Francisco Open Data, we compiled a series of variables that we suspected would be indicative of housing markets and the prices of homes. We looked for different types of variables to suggest different trends, while also considering a variety of data structures and spatial patterns.

Summary Statistics Table

Below is a table describing each of the variables included in our prediction model, along with each variable's fitted intercept and coefficient.

| Variable Name | Category | Description | Intercept | Coefficient |
|---|---|---|---|---|
| badcrime_nn5 | Amenities/Public Service | Average nearest-neighbor distance to the 5 nearest reported crimes (5/9/2018 to 5/15/2018) that were larceny/theft, drug/narcotic, or assault. | 2107735487 | -133.0914557 |
| Baths | Internal | Number of bathrooms. | 779088.7063 | 203415.7342 |
| Beds | Internal | Number of bedrooms. | 1023145.263 | 71613.759 |
| bookedcrime_nn40 | Amenities/Public Service | Average nearest-neighbor distance to the 40 nearest reported crimes (5/9/2018 to 5/15/2018) whose resolution ended in a booking/arrest of the perpetrator. | 2107735488 | -133.0914557 |
| cherryp_nn40 | Amenities/Public Service | Average nearest-neighbor distance to the 40 nearest Cherry Plum trees. | 2107735487 | -133.0914557 |
| Downtown | Amenities/Public Service | Distance from each property to the city's downtown area. | 2107734161 | -133.0914559 |
| FerryTerminals | Amenities/Public Service | Distance from each property to the ferry terminals. | 2107734152 | -133.0914555 |
| graffiti_nn10 | Amenities/Public Service | Average nearest-neighbor distance to the 10 nearest graffiti incidents reported to the 311 hotline. | 2107735482 | -133.0914556 |
| MedianIncome | Spatial Structure | The median income reported for the containing census tract. | 148094.7636 | 19.41095716 |
| nzxmas_nn30 | Amenities/Public Service | Average nearest-neighbor distance to the 30 nearest New Zealand Christmas trees. | 2107735487 | -133.0914557 |
| OceanBeach | Amenities/Public Service | Distance from each property to the beach. | 2107734164 | -133.0914557 |
| PAbyCTM | Spatial Structure | Each property's area divided by the mean property area of its containing census tract. | 611717.5512 | 532759.4612 |
| Presidio | Amenities/Public Service | Distance from each property to Presidio Park. | 2107734159 | -133.0914559 |
| SaleYr | Internal | The sale year of each property. | -865537.9438 | 149476.3447 |
| smag_nn20 | Amenities/Public Service | Average nearest-neighbor distance to the 20 nearest Southern Magnolia trees. | 2107735487 | -133.0914557 |
| SPbyCTM | Spatial Structure | Each property's sale price divided by the mean sale price of its containing census tract. | -155940.7471 | 1210760.858 |
| streetdefects_nn20 | Spatial Structure | Average nearest-neighbor distance to the 20 nearest street defects reported to the 311 hotline (10/6/2016 to 2019), within 500 meters. | 2107735482 | -133.0914556 |
| USF | Amenities/Public Service | Distance from each property to the University of San Francisco. | 2107734160 | -133.0914557 |

Correlation Matrix

Below is a correlation matrix covering each of the variables included in our prediction model.

Home Price Correlation Scatterplots

Below are four scatterplots of variables we found interesting.

Presidio Park, street defects, booked crime, and distance from USF.

Map of Home Prices in San Francisco

Maps of 3 Independent Variables

Other Interesting Charts/Graphics


Methods

Program Setup. Our machine learning program was set up in R. First, a variety of libraries were installed and loaded; these provide the functions used throughout our program.

#change sci notation
options(scipen=999)

#load libraries
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(corrplot)
library(viridis)
library(stargazer)
library(tigris)
library(ggmap)
library(sp)
library(dplyr)
library(stringr)
library(lwgeom)
library(tidycensus)

Map and plot themes were defined for use in visual presentations.

# define map and plotting themes
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}
#this function converts a column into quintiles, to be used for mapping.
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # unrounded version: format the quintile breaks to 3 significant digits
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

# define color palettes for quintile functions
palette5 <- c("#FA7800","#C48C04","#8FA108","#5AB60C","#25CB10")
palette6 <- c("#25CB10", "#5AB60C", "#8FA108","#C48C04", "#FA7800")

Participants. Our base data included 10,133 residential home sales in San Francisco over a three-year period (2012-2015). These homes had a variety of internal characteristics, including year built (range of 1870 to 2016; average of 1934), number of bedrooms (range of 0 to 20; average of 2), number of bathrooms (range of 0 to 25; average of 2), and total property area (range of 187 to 24,308 square feet; average of 1,642). Addresses were included for all sales – as well as longitude and latitude – so the sales could be plotted on a map for analysis. Sale prices for 9,433 of these sales were included in our analysis (range of $100,001 to $4,750,003; average of $1,144,668), with the remaining 700 sales placed in a hold-out category with no sale price, used to test the accuracy of our model.

## BASE MAP
## Realtor Neighborhoods.geojson
## https://data.sfgov.org/Geographic-Locations-and-Boundaries/Realtor-Neighborhoods/5gzd-g9ns
baseMap <-  st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson", crs = 4326)


#--------- MIDTERM DATA
midtermDataLocation <- "data/midterm_data_sf.shp"

bayHousing <- 
  st_read(midtermDataLocation) 

# add in price per sq ft and per room
bayHousing <- bayHousing %>%
  #filter(PropArea != 0) %>%
  #filter(Rooms != 0) %>%
  mutate(PricePerSqFt = SalePrice / PropArea,
         PricePerRoom = SalePrice / Rooms)

# transform on basemap
bayHousing <- 
  bayHousing %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))
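
As a quick sanity check, the hold-out split described in the Participants paragraph can be confirmed directly once the data is loaded (a minimal sketch using the holdOut flag):

# sketch: confirm the 9,433 priced sales (holdOut == 0) and 700 hold-out
# sales (holdOut == 1) described above
table(bayHousing$holdOut)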

Materials. A variety of open data was collected to provide local context on amenities/public services as well as the spatial structure of San Francisco. The US Census Bureau provided current demographic data, which was pulled down in R using the tidycensus library, with assistance in data selection provided by Census Reporter. DataSF provided the following open data, which was pulled down either via a direct link from the site or via a downloaded csv file converted to a direct link using a web resource: crime data for analysis Data, street tree locations Data, 311 cases for analysis Data, neighborhood boundaries Data, park locations Data, and census tract boundaries [Data](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods-2010-census-tracts-assigned/bwbp-wk3r). The Metropolitan Transportation Commission was an additional source of open data, which was pulled down using a direct link: transit stops Data.

Several variables (badcrime_nn5, bookedcrime_nn40, cherryp_nn40, graffiti_nn10, nzxmas_nn30, smag_nn20, streetdefects_nn20) were created using a nearest-neighbor function in R. Here the program generates the average distance between a sale and a specified number of events. For instance, the variable “badcrime_nn5” is the average distance between a sale and the nearest five bad crimes (larceny/theft, drugs/narcotics, or assault). So the closer a sale is to bad crimes, the lower the variable value will be, since the distance to bad crimes in its area is shorter. A toy illustration follows the function definition below.

# define nn function for later processing: returns, for each point in
# measureFrom, the mean distance to its k nearest points in measureTo
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <- as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}
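
To make the measure concrete, here is a toy illustration with hypothetical coordinates (not project data):

# three sales along a line at x = 0, 10, 20 and two events at x = 1 and 2;
# with k = 2 the first sale's value is mean(c(1, 2)) = 1.5, and values grow
# as sales sit further from the events
sales  <- cbind(c(0, 10, 20), c(0, 0, 0))
events <- cbind(c(1, 2), c(0, 0))
nn_function(sales, events, 2)  # 1.5  8.5 18.5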

# FEATURE - Crime incidents
policeIncds.sf <- st_read("https://data.sfgov.org/resource/tmnf-yvry.geojson", crs = 4326)

badcrimes <- policeIncds.sf %>%
  filter(category == "LARCENY/THEFT" | category == "ASSAULT" | category == "DRUG/NARCOTIC"  )

bookedcrimes <- policeIncds.sf %>%
  filter(resolution == "ARREST, BOOKED")

# run nearest neighbor on the defined datasets
bayHousing$badcrime_nn5 =
  nn_function(st_coordinates(bayHousing), st_coordinates(badcrimes), 5)
bayHousing$bookedcrime_nn40 =
  nn_function(st_coordinates(bayHousing), st_coordinates(bookedcrimes), 40)
bayHousing$bookedcrime_nn10 =
  nn_function(st_coordinates(bayHousing), st_coordinates(bookedcrimes), 10)


# FEATURE - TREES
treesgJSON = "https://data.sfgov.org/api/geospatial/tkzw-k3nq?method=export&format=GeoJSON"
trees <- st_read(treesgJSON, crs = 4326)

trees <- trees %>%
  mutate(long = unlist(map(trees$geometry, 1)),
         lat = unlist(map(trees$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat)) # drop records missing either coordinate

newzxmas <- trees %>% filter(qspecies == "Metrosideros excelsa :: New Zealand Xmas Tree")
smag <- trees %>% filter(qspecies == "Magnolia grandiflora :: Southern Magnolia")
cherryp <- trees %>% filter(qspecies == "Prunus cerasifera :: Cherry Plum")
greengem <- trees %>% filter(qspecies == "Ficus microcarpa nitida 'Green Gem' :: Indian Laurel Fig Tree 'Green Gem'")

bayHousing$nzxmas_nn30 =
  nn_function(st_coordinates(bayHousing), st_coordinates(newzxmas), 30)   
bayHousing$smag_nn20 =
  nn_function(st_coordinates(bayHousing), st_coordinates(smag), 20)
bayHousing$cherryp_nn40 =
  nn_function(st_coordinates(bayHousing), st_coordinates(cherryp), 40)

## FEATURE - 311 CASES
# SF311 cases created since 7/1/2008 with location information. 
# filtered for 10/6/16 to 2019
csv311 = "data/311_Cases.csv"
sf311 <- st_read(csv311, options=c("X_POSSIBLE_NAMES=Longitude","Y_POSSIBLE_NAMES=Latitude"))

st_crs(sf311) = 4326 # coordinates read from the csv are WGS84 lon/lat

#remove some cols to make df smaller
takeOuts <- c('Opened', 'Closed', 'Status.Notes', 'Request.Details', 'Address', 'Street', 'Media.URL')

sf311 <- 
  sf311 %>% 
  dplyr::select(-all_of(takeOuts))


#isolate graffiti and street defects as separate dfs
graffiti <- sf311 %>% filter(Category == "Graffiti")
streetdefects <- sf311 %>% filter(Category == "Street Defects")


## ADD IN NEAREST NEIGHBOR FOR GRAFFITI AND STREET DEFECTS
graffiti <- graffiti %>% # drop graffiti records missing coordinates
  mutate(long = unlist(map(graffiti$geometry, 1)),
         lat = unlist(map(graffiti$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat))

streetdefects <- streetdefects %>% # drop street-defect records missing coordinates
  mutate(long = unlist(map(streetdefects$geometry, 1)),
         lat = unlist(map(streetdefects$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat))

Additional variables (Downtown, FerryTerminals, OceanBeach, Presidio, and USF) were created using a distance function in R. Here the program generates the actual distance in meters between a sale and a specified location. For instance, the variable “Downtown” is the actual distance between a sale and downtown San Francisco. So the further a sale is from downtown, the higher the variable value will be.

## Create Distance Function: buffers the target feature x by 10m, clips the
## buffer to the city, measures the distance from every sale in y to that
## buffer (stored in column "a" and renamed after each call), and plots the
## relationship with sale price under title z
distanceFunction <- function(x, y, z) {
  st_crs(x) = st_crs(y)
  # project to UTM zone 10N (San Francisco) so the buffer distance is in meters
  rapid_M = st_transform(x, "+proj=utm +zone=10 +datum=WGS84 +units=m")
  x_buffer <- st_buffer(rapid_M, 10)
  x_buffer <- st_union(x_buffer)
  x_buffer <- 
    x_buffer %>% 
    st_transform(st_crs(baseMap)) %>%
    st_intersection(., st_union(baseMap))
  y$a <- st_distance(y, x_buffer) 
  y$a <- as.numeric(as.character(y$a))
  bayHousing <<- y # write the updated sales back to the global data frame
  bayHousing %>%
    filter(holdOut == 0) %>%
    dplyr::select(a, SalePrice) %>%
    st_set_geometry(NULL) %>%
    gather(Variable, Value, -SalePrice) %>%
    ggplot(aes(Value, SalePrice)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
    facet_wrap(~Variable) +
    labs(title = z)
}

## Create Downtown Variable
neighborhood <-  st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson")
neighborhood <- 
  neighborhood %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

downtown <- neighborhood %>% filter(nbrhood == "Downtown") 
title = "Correlations: Sales Prices x Downtown"
distanceFunction(downtown,bayHousing,title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "Downtown"
#Downtown.reg <- lm(SalePrice ~ Downtown, data = filter(bayHousing, holdOut == 0)); summary(Downtown.reg)


## Create USF Variable
parks <- st_read("https://gist.githubusercontent.com/smullarkUPENN/04e8262a7d9c41abd2dd669864b9a1fc/raw/134f34726c3ac4380d9a1275034807ad3080e19f/map.geojson")

parks <- 
  parks %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

USF <- parks %>% filter(map_label == "Angelo J. Rossi Playground") # used as a proxy for the University of San Francisco
title = "Correlations: Sales Prices x USF"
distanceFunction(USF,bayHousing,title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "USF"
#USF.reg <- lm(SalePrice ~ USF, data = filter(bayHousing, holdOut == 0))
#summary(USF.reg)


## Create Presidio Variable
# Runs off of neighborhood 

Presidio <- neighborhood %>% filter(nbrhood == "Presidio") 
title = "Correlations: Sales Prices x Presidio Park"
distanceFunction(Presidio,bayHousing,title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "Presidio"
#Presidio.reg <- lm(SalePrice ~ Presidio, data = filter(bayHousing, holdOut == 0))
#summary(Presidio.reg)


# ## Create Ferry Terminals Variable
# ferryTerminals <- st_read("https://opendata.arcgis.com/datasets/85e01015980d4d669bca8f125154edaf_0.geojson")
# 
# ferryTerminals <- ferryTerminals %>% filter(cityname_a == "San Francisco")
# ferryTerminals <- ferryTerminals[!(ferryTerminals$facility=="Alcatraz Ferry Terminal"),]
# title = "Correlations: Sales Prices x Ferry Terminals"
# distanceFunction(ferryTerminals,bayHousing,title)
# colnames(bayHousing)[which(names(bayHousing) == "a")] <- "FerryTerminals"
# #FerryTerminals.reg <- lm(SalePrice ~ FerryTerminals, data = filter(bayHousing, holdOut == 0)) 
# #summary(FerryTerminals.reg)
# 
# 
# 
# ## Create Ocean Beach Variable
# simpleRoutes <- st_read("https://opendata.arcgis.com/datasets/efd75b7bb3c04dbda06c6e7cd73e9336_0.geojson")
# 
# simpleRoutes <- 
#   simpleRoutes %>% 
#   st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
#   st_transform(st_crs(baseMap))
# 
# oceanBeach <- simpleRoutes %>% filter(station_na == "Judah St & La Playa St") 
# title = "Correlations: Sales Prices x Ocean Beach"
# distanceFunction(oceanBeach,bayHousing,title)
# colnames(bayHousing)[which(names(bayHousing) == "a")] <- "OceanBeach"
# #OceanBeach.reg <- lm(SalePrice ~ OceanBeach, data = filter(bayHousing, holdOut == 0)) 
# #summary(OceanBeach.reg)

A demographic variable (MedianIncome) was created using the tidycensus library in R. Here the program pulls the median income of each census tract down from the US Census Bureau and then assigns each sale the median income of its census tract. For instance, the sale with ParcelID 0823015 has a “MedianIncome” value of 64,375, the dollar amount of the median income for its census tract (census tract # 016400). A spot-check appears after the code below.

## FEATURE - Add In Census Tracts WITH ACS DATA
##Analysis Neighborhoods.geojson
##https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods-2010-census-tracts-assigned/bwbp-wk3r
censusTract <-  st_read("https://gist.githubusercontent.com/smullarkUPENN/3ab9ffc1338f0b7d0374180b07ce7c37/raw/fd5fba047a7c536f9298d2649db3141a240236aa/map.geojson")

censusTract <- censusTract[!(censusTract$nhood=="Treasure Island"),]

censusTract.sf <- 
  censusTract %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(bayHousing))

census_api_key("ADD KEY HERE") 

acsData <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B992523_001","B06011_001"), 
          year = 2017, state=06, county=075, geometry=T) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         OwnerOccupied = B992523_001,
         MedianIncome = B06011_001) 

acsData.sf <- 
  acsData %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap)) 

censusTract.sf$TotalPop <- acsData.sf$TotalPop[match(censusTract.sf$geoid, acsData.sf$GEOID)]
censusTract.sf$OwnerOccupied <- acsData.sf$OwnerOccupied[match(censusTract.sf$geoid, acsData.sf$GEOID)]
censusTract.sf$MedianIncome <- acsData.sf$MedianIncome[match(censusTract.sf$geoid, acsData.sf$GEOID)]

inTract <- st_intersection(bayHousing, censusTract.sf)
nonData <- c('nhood', 'geoid','shape_len') 

inTract <- 
  inTract %>% 
  dplyr::select(-all_of(nonData))
colnames(inTract)[which(names(inTract) == "tractce10")] <- "CensusTract"
colnames(inTract)[which(names(inTract) == "shape_area")] <- "CensusTractArea"

st_erase = function(x, y) st_difference(x, st_union(st_combine(y)))
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
outTract <- st_erase(bayHousing, inTract)
# placeholder values for sales that fall outside every census tract
outTract$CensusTract <- "1"
outTract$TotalPop <- 1
outTract$OwnerOccupied <- 1
outTract$MedianIncome <- 1
outTract$CensusTractArea <- 1

bayHousing <- rbind(inTract, outTract)
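
As referenced above, the assignment can be spot-checked for the example parcel (a sketch; assumes ParcelID was read in as character):

# sketch: verify that parcel 0823015 picked up the median income of tract 016400
bayHousing %>%
  filter(ParcelID == "0823015") %>%
  st_set_geometry(NULL) %>%
  dplyr::select(ParcelID, CensusTract, MedianIncome)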

Two variables (PAbyCTM and SPbyCTM) were created using formulas in R. Here the average property area and sale price were calculated for each census tract, and each sale's actual property area and sale price were then divided by its census tract average. This shows how close each sale's property area and price are to the averages of its census tract: a value of 1 means it equals the average, below 1 means below average, and above 1 means above average. A worked example follows the code below.

# FEATURE - CENSUS AVERAGES
#Generates Mean Sale Price by Census Tract
censusTractMean <- aggregate( SalePrice ~ CensusTract, bayHousing, mean)

#Generates Mean Property Area by Census Tract
PropAreaMean <- aggregate( PropArea ~ CensusTract, bayHousing, mean)


#Creates column called CensusTractMean in bayHousing and assigns each sale
#the mean sales price of its census tract
bayHousing$CensusTractMean <- censusTractMean$SalePrice[match(bayHousing$CensusTract, 
                                                              censusTractMean$CensusTract)]
bayHousing$PropAreaMean <- PropAreaMean$PropArea[match(bayHousing$CensusTract,
                                                  PropAreaMean$CensusTract)]

#Creates column called SPbyCTM which divides sales price by mean sales price of census tract
bayHousing$SPbyCTM<- bayHousing$SalePrice / bayHousing$CensusTractMean

bayHousing$PAbyCTM <- bayHousing$PropArea / bayHousing$PropAreaMean
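
As a worked example of the interpretation: in a tract whose mean sale price is $1,000,000, a $1,200,000 sale gets SPbyCTM = 1,200,000 / 1,000,000 = 1.2, i.e. 20% above its tract average. The distribution can be inspected directly:

# sketch: SPbyCTM centers on 1 by construction; the spread shows how far
# individual sales stray from their tract averages
summary(bayHousing$SPbyCTM)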

These variables were created and then tested in R to determine whether they would add value to our model. A scatterplot function was applied to each variable so its relationship with sale price could be assessed graphically. A regression function was also applied to each variable, summarizing the relationship between the average sale price and the variable's values. Finally, a correlation function was applied to all variables, generating a matrix of the relationships between sale price and every variable.

##Sample scatterplot code
bayHousing %>%
  filter(holdOut == 0) %>%
  dplyr::select(starts_with("Downtown"), SalePrice) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -SalePrice) %>%
  ggplot(aes(Value, SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se=F, colour = "#25CB10") +
  facet_wrap(~Variable) +
  labs(title = "Correlations: Sales Prices x Downtown")


##Sample regression code
Downtown.reg <- lm(SalePrice ~ Downtown, data = filter(bayHousing, holdOut == 0))

summary(Downtown.reg)

knitr::opts_chunk$set(cache=TRUE)

##CORRELATION MATRIX
matrixVars <- c('SalePrice', 'badcrime_nn5', 'Baths', 'Beds', 'bookedcrime_nn40',
                'cherryp_nn40', 'Downtown', 'FerryTerminals', 
                'graffiti_nn10', 'MedianIncome',
                'nzxmas_nn30','OceanBeach', 'PAbyCTM',
                'Presidio', 'SaleYr', 'smag_nn20',
                'SPbyCTM', 'streetdefects_nn20','USF')
bayHousingCor2 <- 
  bayHousing %>% 
  filter(holdOut == 0) %>%
  filter(!is.na(PAbyCTM)) %>%
  st_set_geometry(NULL) %>% #converts to df
  dplyr::select(any_of(matrixVars))

M <- cor(bayHousingCor2)

corrplot(M, order = "alphabet")

A variable, Neighborhood, was created for use in the regression analysis. Each sale was assigned the name of the neighborhood in which it resides.

##Add in neighborhood

neighborhoodMap <-  st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson")

neighborhoodMap.sf <- 
  neighborhoodMap %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))


inHood <- st_intersection(bayHousing, neighborhoodMap.sf)
nonData <- c('nid', 'sfar_distr') 

inHood <- 
  inHood %>% 
  dplyr::select(-all_of(nonData))
colnames(inHood)[which(names(inHood) == "nbrhood")] <- "Neighborhood"
outHood <- st_erase(bayHousing, inHood)
outHood$Neighborhood<-"NA"

bayHousing <- rbind(inHood, outHood)

Procedures

The data was split into training (TRAIN) and test (TEST) sets with identical columns, and a linear regression was fit and evaluated with 100-fold cross-validation; the output below lists the columns in each set and the resulting fit statistics.

## [1] "Columns in TRAIN:"
##  [1] "ParcelID"           "Address"            "PropClassC"        
##  [4] "SaleDate"           "LandUse"            "ZoningCode"        
##  [7] "X"                  "Y"                  "long"              
## [10] "lat"                "SalePrice"          "ConstType"         
## [13] "LotArea"            "PropArea"           "BuiltYear"         
## [16] "Stories"            "Units"              "Rooms"             
## [19] "Beds"               "Baths"              "SaleYr"            
## [22] "holdOut"            "PricePerSqFt"       "PricePerRoom"      
## [25] "Downtown"           "USF"                "Presidio"          
## [28] "FerryTerminals"     "OceanBeach"         "badcrime_nn5"      
## [31] "bookedcrime_nn40"   "bookedcrime_nn10"   "nzxmas_nn30"       
## [34] "smag_nn20"          "cherryp_nn40"       "CensusTractArea"   
## [37] "CensusTract"        "TotalPop"           "OwnerOccupied"     
## [40] "MedianIncome"       "graffiti_nn10"      "streetdefects_nn20"
## [43] "CensusTractMean"    "PropAreaMean"       "SPbyCTM"           
## [46] "PAbyCTM"            "Neighborhood"       "geometry"
## [1] "Columns in TEST:"
##  [1] "ParcelID"           "Address"            "PropClassC"        
##  [4] "SaleDate"           "LandUse"            "ZoningCode"        
##  [7] "X"                  "Y"                  "long"              
## [10] "lat"                "SalePrice"          "ConstType"         
## [13] "LotArea"            "PropArea"           "BuiltYear"         
## [16] "Stories"            "Units"              "Rooms"             
## [19] "Beds"               "Baths"              "SaleYr"            
## [22] "holdOut"            "PricePerSqFt"       "PricePerRoom"      
## [25] "Downtown"           "USF"                "Presidio"          
## [28] "FerryTerminals"     "OceanBeach"         "badcrime_nn5"      
## [31] "bookedcrime_nn40"   "bookedcrime_nn10"   "nzxmas_nn30"       
## [34] "smag_nn20"          "cherryp_nn40"       "CensusTractArea"   
## [37] "CensusTract"        "TotalPop"           "OwnerOccupied"     
## [40] "MedianIncome"       "graffiti_nn10"      "streetdefects_nn20"
## [43] "CensusTractMean"    "PropAreaMean"       "SPbyCTM"           
## [46] "PAbyCTM"            "Neighborhood"       "geometry"
## [1] 358567.8
## [1] "36.04 %"

## Linear Regression 
## 
## 10744 samples
##    13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 10637, 10636, 10638, 10636, 10637, 10636, ... 
## Resampling results:
## 
##   RMSE    Rsquared   MAE     
##   508966  0.4664593  358200.8
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
##        RMSE  Rsquared      MAE Resample
## 1  597718.5 0.5129947 400185.6  Fold001
## 2  479586.4 0.5276294 356532.2  Fold002
## 3  529729.5 0.4821907 378535.4  Fold003
## 4  606055.0 0.4085681 401637.7  Fold004
## 5  484053.3 0.4737479 338838.4  Fold005
## 6  443232.9 0.5214877 320547.5  Fold006
## 7  517003.6 0.4952176 341656.4  Fold007
## 8  475326.2 0.5024867 319383.8  Fold008
## 9  459137.5 0.4662155 337992.3  Fold009
## 10 536572.4 0.4171872 388396.8  Fold010
## [1] 34249.22
## [1] 1149641

Results

In-Sample (Training) Results

-Provide a polished table of your in-sample (training set) model results.

Out-of-Sample (Test) Results

-Provide a polished table of R^2, mean absolute error, and MAPE for the test set. Check out the “kable” function for markdown to create tables.
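
A minimal sketch of such a table, assuming a hypothetical sf object testTable holding test-set observed SalePrice and predicted SalePrice.Predict columns (testTable is an assumed name, not defined in the code above):

# sketch: compute R^2, MAE, and MAPE on the test set and render with kable
testTable %>%
  st_set_geometry(NULL) %>%
  summarize(R2   = cor(SalePrice, SalePrice.Predict)^2,
            MAE  = mean(abs(SalePrice - SalePrice.Predict)),
            MAPE = mean(abs(SalePrice - SalePrice.Predict) / SalePrice)) %>%
  knitr::kable(digits = 3)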

Cross-Validation Results

-Provide the results of your cross-validation tests on the training set, including the mean and standard deviation of MAE. Do 100 folds and plot your cross-validation MAE as a histogram. Is your model overfit?
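
A sketch of the requested histogram, assuming the caret train() object behind the 100-fold output above is named reg.cv (an assumed name):

# sketch: caret stores per-fold metrics in the train object's $resample slot
reg.cv$resample %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 25, fill = "#25CB10", colour = "black") +
  labs(title = "Distribution of MAE across 100 cross-validation folds",
       x = "Mean Absolute Error", y = "Count") +
  plotTheme()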

Predicted vs. Observed

-Plot predicted prices as a function of observed prices.

ggplot(trainTable, aes(SalePrice.Predict, SalePrice)) + 
  geom_point() +
  stat_smooth(data=trainTable, aes(SalePrice, SalePrice), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(data=trainTable, aes(SalePrice, SalePrice.Predict), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  labs(title="Predicted sale price as a function of\nobserved price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  theme(plot.title = element_text(size = 18,colour = "black"))

Map - Model Residuals and Moran’s I test

-Provide a map of your residuals for your test set. Include a Moran’s I test.

grid.arrange(
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = trainTable2, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                     labels=qBr(trainTable,"SalePrice"),
                     name="Quintile\nBreaks") +
    labs(title="Test set sale prices") +
    mapTheme(),
  
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = trainTable2, aes(colour = q5(SalePrice.Error)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                     labels=qBr(trainTable,"SalePrice.Error"),
                     name="Quintile\nBreaks") +
    labs(title="Sale price errors on the test set") +
    mapTheme(),
  ncol = 2)
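
The histogram below references a moranTest object that is not defined in the chunks shown. A minimal sketch of how it could be produced with spdep, assuming a 5-nearest-neighbor spatial weights matrix on the test-set points:

# sketch: permutation test for spatial autocorrelation in the test-set errors
coords    <- st_coordinates(trainTable2)
neighbors <- knn2nb(knearneigh(coords, k = 5)) # 5 nearest neighbors
weights   <- nb2listw(neighbors, style = "W")  # row-standardized weights
moranTest <- moran.mc(trainTable2$SalePrice.Error, weights, nsim = 999)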

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in red",
       x="Moran's I",
       y="Count",
       caption="Public Policy Analytics, Figure 6.8") +
  plotTheme()

Map - Predicted Values

-Provide a map of your predicted values for the entire dataset

ggplot() +
  geom_sf(data = baseMap, fill = "grey40") +
  geom_sf(data = trainTable3, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = 1) +
  scale_colour_manual(values = palette5,
                   labels=qBr(trainTable,"SalePrice.Predict"),
                   name="Quintile\nBreaks") +
  labs(title="Entire set predicted sale prices") +
  mapTheme()

Map - Mean Absolute Percentage Error (MAPE) by Neighborhood

-Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood.
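
A sketch of such a map, again assuming a hypothetical sf object testTable of test-set predictions carrying the Neighborhood column created earlier:

# sketch: average the absolute percentage error within each neighborhood,
# then join it back onto the neighborhood polygons for mapping
mapeByHood <- testTable %>%
  st_set_geometry(NULL) %>%
  group_by(Neighborhood) %>%
  summarize(MAPE = mean(abs(SalePrice - SalePrice.Predict) / SalePrice))

neighborhoodMap.sf %>%
  left_join(mapeByHood, by = c("nbrhood" = "Neighborhood")) %>%
  ggplot() +
  geom_sf(aes(fill = MAPE)) +
  scale_fill_viridis(name = "MAPE") +
  labs(title = "Mean absolute percentage error by neighborhood, test set") +
  mapTheme()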

Scatterplot - Mean Absolute Percentage Error (MAPE) by Neighborhood

-Provide a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood.
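
A sketch of the scatterplot under the same testTable assumption, adding mean observed price per neighborhood:

# sketch: does the model err more, in percentage terms, in cheaper or
# pricier neighborhoods?
testTable %>%
  st_set_geometry(NULL) %>%
  group_by(Neighborhood) %>%
  summarize(meanPrice = mean(SalePrice),
            MAPE = mean(abs(SalePrice - SalePrice.Predict) / SalePrice)) %>%
  ggplot(aes(meanPrice, MAPE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "MAPE by neighborhood as a function of mean sale price",
       x = "Mean sale price", y = "MAPE") +
  plotTheme()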

Generalizability

-Using tidycensus, split your city into two groups (perhaps by race or income) and test your model’s generalizability.

acsData2 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B02008_001","B06011_001"), 
          year = 2017, state=06, county=075, geometry=T) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B02008_001,
         MedianIncome = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedianIncome > 32322, "High Income", "Low Income"))

acsData2 <- acsData2[!(acsData2$GEOID=="06075017902"),]

grid.arrange(
  ggplot() + 
    geom_sf(data = na.omit(acsData2), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  
  ggplot() + 
    geom_sf(data = na.omit(acsData2), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"), ncol = 2)

Is your model generalizable?
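
To answer this, test-set errors can be compared across the two contexts. A sketch, once more assuming the hypothetical sf object testTable of test-set predictions (incomeContext works identically):

# sketch: join each test-set sale to its tract's race context, then compare
# average percentage errors between the two groups
st_join(testTable, st_transform(acsData2, st_crs(testTable))) %>%
  st_set_geometry(NULL) %>%
  group_by(raceContext) %>%
  summarize(meanMAPE = mean(abs(SalePrice - SalePrice.Predict) / SalePrice,
                            na.rm = TRUE)) %>%
  knitr::kable(digits = 3)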