---
title: "San Francisco Housing Predictions: Predictive Modeling Using Linear Regression"
author: "Keon Monroe & Scott Mullarkey"
date: "10/18/2019"
output:
  md_document:
    variant: markdown
---
Using linear regression modeling, we aim to predict the individual housing prices of a hold-out sample dataset (holdOut == 1).

Using publicly available data, drawn almost exclusively from San Francisco Open Data, we compiled a series of variables that we suspected would be indicative of housing markets and home prices. We looked for different types of variables to capture different trends, while also considering a variety of data structures and spatial patterns.

Below is a table describing each of the variables included in our prediction model, along with its category and its regression intercept and coefficient.
Variable Name | Category | Description | Intercept | Coefficient |
---|---|---|---|---|
badcrime_nn5 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 5 crimes reported (from 5/9/2018 to 5/15/2018) that were either larceny/theft, drugs/narcotics, or assault. | 2107735487 | -133.0914557 |
Baths | Internal | Number of bathrooms. | 779088.7063 | 203415.7342 |
Beds | Internal | Number of bedrooms. | 1023145.263 | 71613.759 |
bookedcrime_nn40 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 40 crimes reported (from 5/9/2018 to 5/15/2018), where the resolution ended in a booking/arrest of the perpetrator. | 2107735488 | -133.0914557 |
cherryp_nn40 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 40 Cherry Plum trees. | 2107735487 | -133.0914557 |
Downtown | Amenities/Public Service | A calculation of the distance from each property to the city’s downtown area. | 2107734161 | -133.0914559 |
FerryTerminals | Amenities/Public Service | A calculation of the distance from each property to the ferry terminals. | 2107734152 | -133.0914555 |
graffiti_nn10 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 10 graffiti incidents reported to the 311 hotline. | 2107735482 | -133.0914556 |
MedianIncome | Spatial Structure | The median income reported for the containing census tract. | 148094.7636 | 19.41095716 |
nzxmas_nn30 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 30 New Zealand Christmas trees. | 2107735487 | -133.0914557 |
OceanBeach | Amenities/Public Service | A calculation of the distance from each property to the beach. | 2107734164 | -133.0914557 |
PAbyCTM | Spatial Structure | The property area of each individual sale, divided by the average property area of its containing census tract. | 611717.5512 | 532759.4612 |
Presidio | Amenities/Public Service | A calculation of the distance from each property to Presidio Park. | 2107734159 | -133.0914559 |
SaleYr | Internal | The sale year of each property. | -865537.9438 | 149476.3447 |
smag_nn20 | Amenities/Public Service | An average nearest-neighbor calculation for the nearest 20 Southern Magnolia trees. | 2107735487 | -133.0914557 |
SPbyCTM | Spatial Structure | The sale price of each individual sale, divided by the average sale price of its containing census tract. | -155940.7471 | 1210760.858 |
streetdefects_nn20 | Spatial Structure | An average nearest-neighbor calculation for the nearest 20 street defects reported to the 311 hotline (from 10/6/2016 to 2019), within 500 meters. | 2107735482 | -133.0914556 |
USF | Amenities/Public Service | A calculation of the distance from each property to the University of San Francisco (USF). | 2107734160 | -133.0914557 |
Below is a correlation matrix comprising each of the variables included in our prediction model.
Below are four scatterplots of variables we found interesting: distance from Presidio Park, street defects, booked crime, and distance from USF.
## Program Setup

Our machine learning program was set up in R. First, a variety of libraries were installed and loaded; these are used to run the various functions within our program.
```r
# turn off scientific notation
options(scipen = 999)

# load libraries
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(corrplot)
library(viridis)
library(stargazer)
library(tigris)
library(ggmap)
library(sp)
library(dplyr)
library(stringr)
library(lwgeom)
library(tidycensus)
```
Map and plot themes were defined for use in visual presentations.
```r
# define map and plotting themes
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, size = 2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, size = 2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size = 12),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}
```
```r
# this function converts a column into quintile breaks, to be used for mapping
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    # the probabilities and na.rm belong to quantile(), not as.character()
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

# define color palettes for quintile functions (palette6 is palette5 reversed)
palette5 <- c("#FA7800", "#C48C04", "#8FA108", "#5AB60C", "#25CB10")
palette6 <- c("#25CB10", "#5AB60C", "#8FA108", "#C48C04", "#FA7800")
```
## Participants

Our base data included 10,133 residential home sales in San Francisco over a three-year period (2012-2015). These homes had a variety of internal characteristics, including year built (range of 1870 to 2016; average of 1934), number of bedrooms (range of 0 to 20; average of 2), number of bathrooms (range of 0 to 25; average of 2), and total property area (range of 187 to 24,308; average of 1,642). Addresses were included for all sales, as well as longitude and latitude, so the sales could be plotted on a map for analysis. Sale prices for 9,433 of these sales were included in our analysis (range of $100,001 to $4,750,003; average of $1,144,668), with the remaining 700 sales placed in a hold-out category with no sale price, used to test the accuracy of our model.
```r
## BASE MAP
## Realtor Neighborhoods.geojson
## https://data.sfgov.org/Geographic-Locations-and-Boundaries/Realtor-Neighborhoods/5gzd-g9ns
baseMap <- st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson", crs = 4326)

#--------- MIDTERM DATA
midtermDataLocation <- "data/midterm_data_sf.shp"
bayHousing <- st_read(midtermDataLocation)

# add in price per sq ft and per room
bayHousing <- bayHousing %>%
  #filter(PropArea != 0) %>%
  #filter(Rooms != 0) %>%
  mutate(PricePerSqFt = SalePrice / PropArea,
         PricePerRoom = SalePrice / Rooms)

# transform onto the basemap's coordinate reference system
bayHousing <-
  bayHousing %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))
```
## Materials

A variety of open data was collected to provide local context on amenities/public services as well as the spatial structure of San Francisco. The US Census Bureau provided current demographic data, which was pulled down in R using the tidycensus library, with assistance in data selection provided by Census Reporter. DataSF provided the following open data, which was pulled down using a direct link from the site (or downloaded as a csv file and converted to a direct link using a web resource): crime data for analysis, street tree locations, 311 cases for analysis, neighborhood boundaries, park locations, and census tract boundaries ([Data](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods-2010-census-tracts-assigned/bwbp-wk3r)). The Metropolitan Transportation Commission was an additional source of open data, also pulled down using a direct link: transit stops.
Several variables (badcrime_nn5, bookedcrime_nn40, cherryp_nn40, graffiti_nn10, nzxmas_nn30, smag_nn20, streetdefects_nn20) were created using a nearest-neighbor function in R. Here the program generates the average distance between a sale and a specified number of events. For instance, the variable "badcrime_nn5" is the average distance between a sale and the nearest five bad crimes (larceny/theft, drugs/narcotics, or assault). So the closer a sale is to bad crimes, the lower the variable's value will be, since the distance to the bad crimes in its area is shorter.
```r
# define nearest neighbor function for later processing
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # distance from each origin point to its k nearest destination points
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <- as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
```
```r
# FEATURE - Crime incidents
policeIncds.sf <- st_read("https://data.sfgov.org/resource/tmnf-yvry.geojson", crs = 4326)

badcrimes <- policeIncds.sf %>%
  filter(category == "LARCENY/THEFT" | category == "ASSAULT" | category == "DRUG/NARCOTIC")
bookedcrimes <- policeIncds.sf %>%
  filter(resolution == "ARREST, BOOKED")

# run nearest neighbor on the defined datasets
bayHousing$badcrime_nn5 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(badcrimes), 5)
bayHousing$bookedcrime_nn40 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(bookedcrimes), 40)
bayHousing$bookedcrime_nn10 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(bookedcrimes), 10)
```
```r
# FEATURE - TREES
treesgJSON <- "https://data.sfgov.org/api/geospatial/tkzw-k3nq?method=export&format=GeoJSON"
trees <- st_read(treesgJSON, crs = 4326)

# extract coordinates and drop rows missing either coordinate
trees <- trees %>%
  mutate(long = unlist(map(trees$geometry, 1)),
         lat = unlist(map(trees$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat))

newzxmas <- trees %>% filter(qspecies == "Metrosideros excelsa :: New Zealand Xmas Tree")
smag     <- trees %>% filter(qspecies == "Magnolia grandiflora :: Southern Magnolia")
cherryp  <- trees %>% filter(qspecies == "Prunus cerasifera :: Cherry Plum")
greengem <- trees %>% filter(qspecies == "Ficus microcarpa nitida 'Green Gem' :: Indian Laurel Fig Tree 'Green Gem'")

bayHousing$nzxmas_nn30 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(newzxmas), 30)
bayHousing$smag_nn20 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(smag), 20)
bayHousing$cherryp_nn40 <-
  nn_function(st_coordinates(bayHousing), st_coordinates(cherryp), 40)
```
```r
## FEATURE - 311 CASES
# SF311 cases created since 7/1/2008 with location information,
# filtered for 10/6/16 to 2019
csv311 <- "data/311_Cases.csv"
sf311 <- st_read(csv311, options = c("X_POSSIBLE_NAMES=Longitude", "Y_POSSIBLE_NAMES=Latitude"))
st_crs(sf311) <- 4326

# remove some columns to make the data frame smaller
takeOuts <- c('Opened', 'Closed', 'Status.Notes', 'Request.Details', 'Address', 'Street', 'Media.URL')
sf311 <-
  sf311 %>%
  dplyr::select(-all_of(takeOuts))

# isolate graffiti and street defects as separate data frames
graffiti <- sf311 %>% filter(Category == "Graffiti")
streetdefects <- sf311 %>% filter(Category == "Street Defects")

## ADD IN NEAREST NEIGHBOR FOR GRAFFITI AND STREET DEFECTS
graffiti <- graffiti %>% # fix broken geometries on graffiti
  mutate(long = unlist(map(graffiti$geometry, 1)),
         lat = unlist(map(graffiti$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat))

streetdefects <- streetdefects %>% # fix broken geometries on streetdefects
  mutate(long = unlist(map(streetdefects$geometry, 1)),
         lat = unlist(map(streetdefects$geometry, 2))) %>%
  filter(!is.na(long) & !is.na(lat))
```
Additional variables (Downtown, FerryTerminals, OceanBeach, Presidio, and USF) were created using a distance function in R. Here the program generates the actual distance in meters between a sale and a specified location. For instance, the variable "Downtown" is the actual distance between a sale and downtown San Francisco. So the further a sale is from downtown, the higher the variable's value will be.
```r
## Create Distance Function
distanceFunction <- function(x, y, z) {
  st_crs(x) <- st_crs(y)
  # project to UTM zone 10N (the zone containing San Francisco)
  # so the buffer distance is in meters
  rapid_M <- st_transform(x, "+proj=utm +zone=10 +datum=WGS84 +units=m")
  x_buffer <- st_buffer(rapid_M, 10)
  x_buffer <- st_union(x_buffer)
  x_buffer <-
    x_buffer %>%
    st_transform(st_crs(baseMap)) %>%
    st_intersection(., st_union(baseMap))
  y$a <- st_distance(y, x_buffer)
  y$a <- as.numeric(as.character(y$a))
  bayHousing <<- y  # write the augmented data back to the global environment
  bayHousing %>%
    filter(holdOut == 0) %>%
    dplyr::select(a, SalePrice) %>%
    st_set_geometry(NULL) %>%
    gather(Variable, Value, -SalePrice) %>%
    ggplot(aes(Value, SalePrice)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
    facet_wrap(~Variable) +
    labs(title = z)
}
```
```r
## Create Downtown Variable
neighborhood <- st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson")
neighborhood <-
  neighborhood %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

downtown <- neighborhood %>% filter(nbrhood == "Downtown")
title = "Correlations: Sales Prices x Downtown"
distanceFunction(downtown, bayHousing, title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "Downtown"
#Downtown.reg <- lm(SalePrice ~ Downtown, data = filter(bayHousing, holdOut == 0)); summary(Downtown.reg)

## Create USF Variable
parks <- st_read("https://gist.githubusercontent.com/smullarkUPENN/04e8262a7d9c41abd2dd669864b9a1fc/raw/134f34726c3ac4380d9a1275034807ad3080e19f/map.geojson")
parks <-
  parks %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

# the playground serves as a proxy location for the University of San Francisco
USF <- parks %>% filter(map_label == "Angelo J. Rossi Playground")
title = "Correlations: Sales Prices x USF"
distanceFunction(USF, bayHousing, title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "USF"
#USF.reg <- lm(SalePrice ~ USF, data = filter(bayHousing, holdOut == 0))
#summary(USF.reg)

## Create Presidio Variable
# Runs off of neighborhood
Presidio <- neighborhood %>% filter(nbrhood == "Presidio")
title = "Correlations: Sales Prices x Presidio Park"
distanceFunction(Presidio, bayHousing, title)
colnames(bayHousing)[which(names(bayHousing) == "a")] <- "Presidio"
#Presidio.reg <- lm(SalePrice ~ Presidio, data = filter(bayHousing, holdOut == 0))
#summary(Presidio.reg)
```
```r
# ## Create Ferry Terminals Variable
# ferryTerminals <- st_read("https://opendata.arcgis.com/datasets/85e01015980d4d669bca8f125154edaf_0.geojson")
# ferryTerminals <- ferryTerminals %>% filter(cityname_a == "San Francisco")
# ferryTerminals <- ferryTerminals[!(ferryTerminals$facility=="Alcatraz Ferry Terminal"),]
# title = "Correlations: Sales Prices x Ferry Terminals"
# distanceFunction(ferryTerminals,bayHousing,title)
# colnames(bayHousing)[which(names(bayHousing) == "a")] <- "FerryTerminals"
# #FerryTerminals.reg <- lm(SalePrice ~ FerryTerminals, data = filter(bayHousing, holdOut == 0))
# #summary(FerryTerminals.reg)

# ## Create Ocean Beach Variable
# simpleRoutes <- st_read("https://opendata.arcgis.com/datasets/efd75b7bb3c04dbda06c6e7cd73e9336_0.geojson")
# simpleRoutes <-
#   simpleRoutes %>%
#   st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
#   st_transform(st_crs(baseMap))
# oceanBeach <- simpleRoutes %>% filter(station_na == "Judah St & La Playa St")
# title = "Correlations: Sales Prices x Ocean Beach"
# distanceFunction(oceanBeach,bayHousing,title)
# colnames(bayHousing)[which(names(bayHousing) == "a")] <- "OceanBeach"
# #OceanBeach.reg <- lm(SalePrice ~ OceanBeach, data = filter(bayHousing, holdOut == 0))
# #summary(OceanBeach.reg)
```
A demographic variable (MedianIncome) was created using the tidycensus library in R. Here the program pulls the median income of each census tract from the US Census Bureau and assigns each sale the median income of its containing tract. For instance, the sale with ParcelID 0823015 has a "MedianIncome" value of 64,375, which is the median income in dollars for its census tract (census tract #016400).
```r
## FEATURE - Add In Census Tracts WITH ACS DATA
## Analysis Neighborhoods.geojson
## https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods-2010-census-tracts-assigned/bwbp-wk3r
censusTract <- st_read("https://gist.githubusercontent.com/smullarkUPENN/3ab9ffc1338f0b7d0374180b07ce7c37/raw/fd5fba047a7c536f9298d2649db3141a240236aa/map.geojson")
censusTract <- censusTract[!(censusTract$nhood == "Treasure Island"),]
censusTract.sf <-
  censusTract %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(bayHousing))

census_api_key("ADD KEY HERE")
acsData <-
  get_acs(geography = "tract", variables = c("B01001_001E", "B992523_001", "B06011_001"),
          year = 2017, state = "06", county = "075", geometry = TRUE) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         OwnerOccupied = B992523_001,
         MedianIncome = B06011_001)
acsData.sf <-
  acsData %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

# assign each tract its ACS values by matching GEOIDs
censusTract.sf$TotalPop <- acsData.sf$TotalPop[match(censusTract.sf$geoid, acsData.sf$GEOID)]
censusTract.sf$OwnerOccupied <- acsData.sf$OwnerOccupied[match(censusTract.sf$geoid, acsData.sf$GEOID)]
censusTract.sf$MedianIncome <- acsData.sf$MedianIncome[match(censusTract.sf$geoid, acsData.sf$GEOID)]

# tag each sale with its tract; sales outside every tract get placeholder values
inTract <- st_intersection(bayHousing, censusTract.sf)
nonData <- c('nhood', 'geoid', 'shape_len')
inTract <-
  inTract %>%
  dplyr::select(-all_of(nonData))
colnames(inTract)[which(names(inTract) == "tractce10")] <- "CensusTract"
colnames(inTract)[which(names(inTract) == "shape_area")] <- "CensusTractArea"

st_erase <- function(x, y) st_difference(x, st_union(st_combine(y)))
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
outTract <- st_erase(bayHousing, inTract)
outTract$CensusTract <- "1"
outTract$TotalPop <- 1
outTract$OwnerOccupied <- 1
outTract$MedianIncome <- 1
outTract$CensusTractArea <- 1
bayHousing <- rbind(inTract, outTract)
```
Two variables (PAbyCTM and SPbyCTM) were created using formulas in R. Here the average property area and the average sale price were calculated for each census tract, and each sale's actual property area and sale price were then divided by its census tract averages. This shows how close each sale's property area and sale price are to the averages of its census tract: a value of 1 means it equals the average, below 1 means it is below the average, and above 1 means it is above the average. For example, a $1,200,000 sale in a tract whose mean sale price is $1,000,000 has an SPbyCTM of 1.2.
```r
# FEATURE - CENSUS AVERAGES
# Generates mean sale price by census tract
censusTractMean <- aggregate(SalePrice ~ CensusTract, bayHousing, mean)
# Generates mean property area by census tract
PropAreaMean <- aggregate(PropArea ~ CensusTract, bayHousing, mean)

# Creates a CensusTractMean column in bayHousing, assigning each sale
# the mean sale price of its census tract (likewise for property area)
bayHousing$CensusTractMean <- censusTractMean$SalePrice[match(bayHousing$CensusTract,
                                                              censusTractMean$CensusTract)]
bayHousing$PropAreaMean <- PropAreaMean$PropArea[match(bayHousing$CensusTract,
                                                       PropAreaMean$CensusTract)]

# SPbyCTM divides each sale price by the mean sale price of its census tract;
# PAbyCTM does the same for property area
bayHousing$SPbyCTM <- bayHousing$SalePrice / bayHousing$CensusTractMean
bayHousing$PAbyCTM <- bayHousing$PropArea / bayHousing$PropAreaMean
```
These variables were created and then tested in R to determine whether they would add value to our model. A scatterplot function was applied to each variable so its relationship with sale price could be assessed graphically. A regression function was also applied to each variable, providing a summary of the relationship between the average sale price and the variable's values. Finally, a correlation function was applied to all variables, generating a matrix of the relationships between sale price and every variable.
```r
## Sample scatterplot code
bayHousing %>%
  filter(holdOut == 0) %>%
  dplyr::select(starts_with("Downtown"), SalePrice) %>%
  st_set_geometry(NULL) %>%
  gather(Variable, Value, -SalePrice) %>%
  ggplot(aes(Value, SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  facet_wrap(~Variable) +
  labs(title = "Correlations: Sales Prices x Downtown")

## Sample regression code
Downtown.reg <- lm(SalePrice ~ Downtown, data = filter(bayHousing, holdOut == 0))
summary(Downtown.reg)
```
```r
knitr::opts_chunk$set(cache = TRUE)

## CORRELATION MATRIX
matrixVars <- c('SalePrice', 'badcrime_nn5', 'Baths', 'Beds', 'bookedcrime_nn40',
                'cherryp_nn40', 'Downtown', 'FerryTerminals',
                'graffiti_nn10', 'MedianIncome',
                'nzxmas_nn30', 'OceanBeach', 'PAbyCTM',
                'Presidio', 'SaleYr', 'smag_nn20',
                'SPbyCTM', 'streetdefects_nn20', 'USF')

bayHousingCor2 <-
  bayHousing %>%
  filter(holdOut == 0) %>%
  filter(!is.na(PAbyCTM)) %>%
  st_set_geometry(NULL) %>% # converts to a plain data frame
  dplyr::select(any_of(matrixVars))

M <- cor(bayHousingCor2)
corrplot(M, order = "alphabet")
```
A Neighborhood variable was created for use in the regression analysis; each sale was assigned the name of the neighborhood in which it resides.
```r
## Add in neighborhood
neighborhoodMap <- st_read("https://gist.githubusercontent.com/smullarkUPENN/82ecee5ce1c267e5f8594b39bfb41193/raw/9c569e4293c1303bae8a8414cb7dfe99047c9b2b/map.geojson")
neighborhoodMap.sf <-
  neighborhoodMap %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326, agr = "constant") %>%
  st_transform(st_crs(baseMap))

inHood <- st_intersection(bayHousing, neighborhoodMap.sf)
nonData <- c('nid', 'sfar_distr')
inHood <-
  inHood %>%
  dplyr::select(-all_of(nonData))
colnames(inHood)[which(names(inHood) == "nbrhood")] <- "Neighborhood"

# sales falling outside every neighborhood polygon get a placeholder label
outHood <- st_erase(bayHousing, inHood)
outHood$Neighborhood <- "NA"
bayHousing <- rbind(inHood, outHood)
```
## Procedures

Our data was split into training and test sets. The columns carried into each set are listed below, followed by summary accuracy output and the results of a 100-fold cross-validation of the linear model.
## [1] "Columns in TRAIN:"
## [1] "ParcelID" "Address" "PropClassC"
## [4] "SaleDate" "LandUse" "ZoningCode"
## [7] "X" "Y" "long"
## [10] "lat" "SalePrice" "ConstType"
## [13] "LotArea" "PropArea" "BuiltYear"
## [16] "Stories" "Units" "Rooms"
## [19] "Beds" "Baths" "SaleYr"
## [22] "holdOut" "PricePerSqFt" "PricePerRoom"
## [25] "Downtown" "USF" "Presidio"
## [28] "FerryTerminals" "OceanBeach" "badcrime_nn5"
## [31] "bookedcrime_nn40" "bookedcrime_nn10" "nzxmas_nn30"
## [34] "smag_nn20" "cherryp_nn40" "CensusTractArea"
## [37] "CensusTract" "TotalPop" "OwnerOccupied"
## [40] "MedianIncome" "graffiti_nn10" "streetdefects_nn20"
## [43] "CensusTractMean" "PropAreaMean" "SPbyCTM"
## [46] "PAbyCTM" "Neighborhood" "geometry"
## [1] "Columns in TEST:"
## [1] "ParcelID" "Address" "PropClassC"
## [4] "SaleDate" "LandUse" "ZoningCode"
## [7] "X" "Y" "long"
## [10] "lat" "SalePrice" "ConstType"
## [13] "LotArea" "PropArea" "BuiltYear"
## [16] "Stories" "Units" "Rooms"
## [19] "Beds" "Baths" "SaleYr"
## [22] "holdOut" "PricePerSqFt" "PricePerRoom"
## [25] "Downtown" "USF" "Presidio"
## [28] "FerryTerminals" "OceanBeach" "badcrime_nn5"
## [31] "bookedcrime_nn40" "bookedcrime_nn10" "nzxmas_nn30"
## [34] "smag_nn20" "cherryp_nn40" "CensusTractArea"
## [37] "CensusTract" "TotalPop" "OwnerOccupied"
## [40] "MedianIncome" "graffiti_nn10" "streetdefects_nn20"
## [43] "CensusTractMean" "PropAreaMean" "SPbyCTM"
## [46] "PAbyCTM" "Neighborhood" "geometry"
## [1] 358567.8
## [1] "36.04 %"
## Linear Regression
##
## 10744 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 10637, 10636, 10638, 10636, 10637, 10636, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 508966 0.4664593 358200.8
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
## RMSE Rsquared MAE Resample
## 1 597718.5 0.5129947 400185.6 Fold001
## 2 479586.4 0.5276294 356532.2 Fold002
## 3 529729.5 0.4821907 378535.4 Fold003
## 4 606055.0 0.4085681 401637.7 Fold004
## 5 484053.3 0.4737479 338838.4 Fold005
## 6 443232.9 0.5214877 320547.5 Fold006
## 7 517003.6 0.4952176 341656.4 Fold007
## 8 475326.2 0.5024867 319383.8 Fold008
## 9 459137.5 0.4662155 337992.3 Fold009
## 10 536572.4 0.4171872 388396.8 Fold010
## [1] 34249.22
## [1] 1149641
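The code that split the data and fit the model is not shown above. The following is a minimal sketch, using caret and lm, of how output like this could be produced; the variable set mirrors our summary table, and the names `modelVars`, `trainSet`, `reg`, `cvFit`, and `holdOutSet` are our own illustrations, not necessarily the original code.

```r
# A minimal sketch (not the authors' exact code) of fitting the model and
# running 100-fold cross-validation, as summarized above.
modelVars <- c("badcrime_nn5", "Baths", "Beds", "bookedcrime_nn40",
               "cherryp_nn40", "Downtown", "FerryTerminals", "graffiti_nn10",
               "MedianIncome", "nzxmas_nn30", "OceanBeach", "PAbyCTM",
               "Presidio", "SaleYr", "smag_nn20", "SPbyCTM",
               "streetdefects_nn20", "USF")

# training set: sales with known prices
trainSet <- bayHousing %>%
  filter(holdOut == 0) %>%
  st_set_geometry(NULL) %>%
  dplyr::select(SalePrice, any_of(modelVars)) %>%
  na.omit()

# ordinary least squares fit on the training set
reg <- lm(SalePrice ~ ., data = trainSet)

# in-sample MAE and MAPE
trainSet$SalePrice.Predict <- predict(reg, trainSet)
mean(abs(trainSet$SalePrice.Predict - trainSet$SalePrice))
paste(round(mean(abs(trainSet$SalePrice.Predict - trainSet$SalePrice) /
                   trainSet$SalePrice) * 100, 2), "%")

# 100-fold cross-validation with caret
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
cvFit <- train(SalePrice ~ .,
               data = trainSet %>% dplyr::select(-SalePrice.Predict),
               method = "lm", trControl = fitControl)
cvFit                     # overall RMSE / R-squared / MAE
head(cvFit$resample, 10)  # fold-by-fold results, as printed above

# predict prices for the hold-out sample (holdOut == 1 has no SalePrice)
holdOutSet <- bayHousing %>% filter(holdOut == 1) %>% st_set_geometry(NULL)
holdOutSet$SalePrice.Predict <- predict(reg, holdOutSet)
```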
## Results

### In-Sample (Training) Results

Below is a polished table of our in-sample (training set) model results.
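One hedged way to render such a table is with the stargazer library loaded earlier; `reg` here refers to the lm fit from the sketch under Procedures, so this is an illustration rather than the original table code.

```r
# Sketch: a polished in-sample results table via stargazer; `reg` is the
# training-set lm fit from the sketch above (an assumed object).
stargazer(reg, type = "text",
          title = "In-Sample (Training Set) Model Results",
          dep.var.labels = "Sale Price")
```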
Below is a table of R^2, mean absolute error (MAE), and mean absolute percentage error (MAPE) for the test set, built with the kable function for markdown; a sketch follows.
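A sketch of such a table, assuming a plain data frame `testTable` holding observed and predicted test-set prices (the name and its SalePrice.Predict column are our assumptions):

```r
# Sketch: test-set goodness-of-fit table with kable; testTable is an
# assumed data frame with SalePrice and SalePrice.Predict columns.
testTable %>%
  summarize(R2   = cor(SalePrice, SalePrice.Predict)^2,
            MAE  = mean(abs(SalePrice.Predict - SalePrice)),
            MAPE = mean(abs(SalePrice.Predict - SalePrice) / SalePrice)) %>%
  knitr::kable(digits = 3, caption = "Test set goodness of fit")
```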
Below are the results of our cross-validation tests on the training set, including the mean and standard deviation of MAE across 100 folds. Plotting the 100 cross-validation MAEs as a histogram helps show whether the model is overfit; a sketch follows.
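A sketch of that histogram, assuming `cvFit` is the caret fit from the sketch under Procedures:

```r
# Sketch: distribution of MAE across the 100 cross-validation folds.
# A tight, symmetric distribution suggests the model is not badly overfit.
as.data.frame(cvFit$resample) %>%
  ggplot(aes(MAE)) +
  geom_histogram(bins = 25, fill = "#25CB10", colour = "black") +
  labs(title = "Cross-validation MAE across 100 folds",
       x = "Mean Absolute Error", y = "Count") +
  plotTheme()
```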
Predicted prices are then plotted as a function of observed prices.
```r
ggplot(trainTable, aes(SalePrice.Predict, SalePrice)) +
  geom_point() +
  stat_smooth(data = trainTable, aes(SalePrice, SalePrice),
              method = "lm", se = FALSE, size = 1, colour = "#FA7800") +
  stat_smooth(data = trainTable, aes(SalePrice, SalePrice.Predict),
              method = "lm", se = FALSE, size = 1, colour = "#25CB10") +
  labs(title = "Predicted sale price as a function of\nobserved price",
       subtitle = "Orange line represents a perfect prediction; green line represents our prediction") +
  theme(plot.title = element_text(size = 18, colour = "black"))
```
Next is a map of sale prices and residuals for our test set, along with a Moran's I test.
```r
grid.arrange(
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = trainTable2, aes(colour = q5(SalePrice)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels = qBr(trainTable, "SalePrice"),
                        name = "Quintile\nBreaks") +
    labs(title = "Test set sale prices") +
    mapTheme(),
  ggplot() +
    geom_sf(data = baseMap, fill = "grey40") +
    geom_sf(data = trainTable2, aes(colour = q5(SalePrice.Error)), show.legend = "point", size = 1) +
    scale_colour_manual(values = palette5,
                        labels = qBr(trainTable, "SalePrice.Error"),
                        name = "Quintile\nBreaks") +
    labs(title = "Sale price errors on the test set") +
    mapTheme(),
  ncol = 2)
```
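The `moranTest` object plotted below is not constructed in this section. A minimal sketch of how it could be computed with spdep on the test-set residuals; the five-nearest-neighbor weights are our assumption, not necessarily the original parameters:

```r
# Sketch: permutation test for Moran's I on test-set price errors using
# spdep; k = 5 neighbors is an assumption.
coords <- st_coordinates(trainTable2)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style = "W")
moranTest <- moran.mc(trainTable2$SalePrice.Error,
                      spatialWeights, nsim = 999)
```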
```r
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800", size = 1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title = "Observed and permuted Moran's I",
       subtitle = "Observed Moran's I in orange",
       x = "Moran's I",
       y = "Count",
       caption = "Public Policy Analytics, Figure 6.8") +
  plotTheme()
```
Next is a map of our predicted values for the entire dataset.
```r
ggplot() +
  geom_sf(data = baseMap, fill = "grey40") +
  geom_sf(data = trainTable3, aes(colour = q5(SalePrice.Predict)), show.legend = "point", size = 1) +
  scale_colour_manual(values = palette5,
                      labels = qBr(trainTable, "SalePrice.Predict"),
                      name = "Quintile\nBreaks") +
  labs(title = "Entire set predicted sale prices") +
  mapTheme()
```
Using the test set predictions, we map mean absolute percentage error (MAPE) by neighborhood; a sketch follows.
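A hedged sketch of that map, assuming `testTable` is an sf object of test-set predictions carrying the Neighborhood column built earlier:

```r
# Sketch: MAPE by neighborhood on the test set; testTable (an sf object
# with SalePrice, SalePrice.Predict, and Neighborhood) is assumed.
nhoodMAPE <- testTable %>%
  st_set_geometry(NULL) %>%
  group_by(Neighborhood) %>%
  summarize(MAPE = mean(abs(SalePrice.Predict - SalePrice) / SalePrice,
                        na.rm = TRUE))

neighborhoodMap.sf %>%
  left_join(nhoodMAPE, by = c("nbrhood" = "Neighborhood")) %>%
  ggplot() +
  geom_sf(aes(fill = MAPE)) +
  scale_fill_viridis() +
  labs(title = "MAPE by neighborhood, test set") +
  mapTheme()
```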
We also plot MAPE by neighborhood as a function of mean price by neighborhood; see the sketch below.
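A sketch of the scatterplot, continuing the same assumptions about `testTable`:

```r
# Sketch: MAPE by neighborhood as a function of mean sale price by
# neighborhood (same assumed testTable as above).
testTable %>%
  st_set_geometry(NULL) %>%
  group_by(Neighborhood) %>%
  summarize(meanPrice = mean(SalePrice, na.rm = TRUE),
            MAPE = mean(abs(SalePrice.Predict - SalePrice) / SalePrice,
                        na.rm = TRUE)) %>%
  ggplot(aes(meanPrice, MAPE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "#25CB10") +
  labs(title = "MAPE by neighborhood as a function of mean price",
       x = "Mean Sale Price", y = "MAPE") +
  plotTheme()
```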
Finally, using tidycensus, we split the city into two groups, by race and by income, to test our model's generalizability.
```r
acsData2 <-
  get_acs(geography = "tract", variables = c("B01001_001E", "B02008_001", "B06011_001"),
          year = 2017, state = "06", county = "075", geometry = TRUE) %>%
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B02008_001,
         MedianIncome = B06011_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedianIncome > 32322, "High Income", "Low Income"))
acsData2 <- acsData2[!(acsData2$GEOID == "06075017902"),]

grid.arrange(
  ggplot() +
    geom_sf(data = na.omit(acsData2), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name = "Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position = "bottom"),
  ggplot() +
    geom_sf(data = na.omit(acsData2), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"),
                      name = "Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position = "bottom"),
  ncol = 2)
```
Is our model generalizable? Comparing test-set errors across these race and income contexts lets us answer that question; a sketch follows.
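As a hedged sketch (not in the original code), test-set errors could be compared across these contexts by spatially joining the predictions to the context tracts; `testTable` continues the assumptions above.

```r
# Sketch: test-set MAPE by race context; testTable (an sf object with
# SalePrice and SalePrice.Predict) is an assumed name. The same pattern
# applies to incomeContext.
testTable %>%
  st_join(st_transform(acsData2, st_crs(testTable))) %>%
  st_set_geometry(NULL) %>%
  group_by(raceContext) %>%
  summarize(MAPE = mean(abs(SalePrice.Predict - SalePrice) / SalePrice,
                        na.rm = TRUE)) %>%
  knitr::kable(digits = 3, caption = "Test set MAPE by race context")
```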