Loading and Cleaning Data
library(dplyr)
library(tidycensus)
library(ggplot2)
library(caret)
library(car)
library(nlme)
library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrr) # another way to plot correlation plot
library(kableExtra)
library(jtools) # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr) # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(RColorBrewer)
library(ggthemes)
library(ggpmisc) # For regression equation annotation
library(ggpubr)
# Read spatial inputs: property sales points, school polygons, park boundaries.
# NOTE(review): absolute Windows paths — consider here::here() for portability.
property_point <- st_read("C:/Users/goyal/Desktop/Spring_2025/PPA_Spring_2025/Mid_term/studentData.geojson")
schools <- st_read("C:/Users/goyal/Desktop/Spring_2025/PPA_Spring_2025/Mid_term/Schools_(polygons).geojson")
parks <- st_read("C:/Users/goyal/Desktop/Spring_2025/PPA_Spring_2025/Mid_term/Park_Boundaries.geojson")
# SECURITY FIX: never commit an API key in source. Put it in ~/.Renviron as
# CENSUS_API_KEY=<your key> (or run census_api_key("<key>", install = TRUE)
# once, interactively) and read it from the environment here.
census_api_key(Sys.getenv("CENSUS_API_KEY"))
# Browse the 2022 ACS 5-year variable catalogue.
acs_variables <- load_variables(year = 2022, dataset = "acs5")
# ACS variable codes: median household income, median age, unemployed count.
variables <- c(median_income = "B19013_001",
               median_age = "B01002_001",
               unemployment = "B23025_005")
# Tract-level ACS estimates for Mecklenburg County, NC (wide format, with
# tract geometry attached for mapping).
census_data <- get_acs(output = "wide",
                       geography = "tract",
                       county = "Mecklenburg",
                       state = "NC",
                       variables = variables,
                       year = 2022,
                       survey = "acs5",
                       geometry = TRUE)
head(census_data)
Feature Engineering
# Filter property data for Charlotte
property <- property_point %>%
  filter(city == "CHARLOTTE")
# Convert parks to centroids
parks <- st_centroid(parks)
# Project everything to NC State Plane (feet, EPSG:2264) so buffer and
# distance units are in feet.
property <- st_transform(property, crs = 2264)
parks <- st_transform(parks, crs = 2264)
census_data <- st_transform(census_data, crs = 2264)
schools <- st_transform(schools, crs = 2264)
# Distance to the nearest park. st_nearest_feature() + element-wise
# st_distance() avoids materializing the full n_property x n_park distance
# matrix that st_distance(., parks) %>% apply(1, min) would build.
nearest_park <- st_nearest_feature(property, parks)
property <- property %>%
  mutate(distance_to_nearest_park =
           as.numeric(st_distance(., parks[nearest_park, ], by_element = TRUE)))
st_crs(property)
st_crs(census_data)
# Spatial join: attach tract-level ACS estimates to each property.
property <- property %>%
  st_join(census_data %>% select(median_ageE, median_incomeE, unemploymentE), left = TRUE)
school_buffer <- schools %>%
  st_buffer(dist = 10560) # schools in a 2 mile radius (10,560 ft)
# NOTE(review): a property inside several overlapping school buffers will be
# duplicated by this join — confirm that fan-out is intended.
property <- property %>%
  st_join(school_buffer %>% select(Ownership, Enrollment), left = TRUE) # school type within two miles
# Dummy variable: 1 if a private school lies within two miles, else 0
# (NA Ownership, i.e. no school nearby, also maps to 0).
property <- property %>%
  mutate(pvt_school = case_when(Ownership == "Private" ~ 1, TRUE ~ 0))
Renaming
glimpse(property)
# Friendlier analysis names, declared once as a lookup (new = old) and applied
# with tidyselect so the mapping is easy to scan and extend.
rename_lookup <- c(Bathroom = "fullbaths",
                   Bedroom = "bedrooms",
                   Price = "price",
                   Length = "shape_Leng",
                   Area = "shape_Area",
                   Park = "distance_to_nearest_park",
                   Age = "median_ageE",
                   Income = "median_incomeE",
                   School = "Ownership",
                   Pvt_School = "pvt_school",
                   Sale_Year = "sale_year")
property <- property %>%
  rename(all_of(rename_lookup))
Scatter Plots
# Drop the geometry column and keep only the continuous variables to plot.
# (The original comment mentioned calculating Building Age, but no such
# calculation happens here.)
property_cleaned <- st_drop_geometry(property) %>%
  dplyr::select(Price, Bedroom, Age, Income, Park, Area, Length)
# Reshape wide -> long for faceting. gather() is superseded; pivot_longer()
# is the current tidyr API.
property_long <- pivot_longer(property_cleaned, -Price,
                              names_to = "Variable", values_to = "Value")
# Scatterplot with a linear regression line for each variable against Price.
ggplot(property_long, aes(Value, Price)) +
  geom_point(size = .5) +
  geom_smooth(method = "lm", se = FALSE, colour = "#bf0603") +
  facet_wrap(~Variable, ncol = 6, scales = "free") +
  labs(title = "Price as a function of continuous variables") +
  theme_minimal()
Age- As properties get older, the price tends to decrease slightly, but the relationship is very weak and noisy.
Area- As the area increases, price increases sharply. Area seems to have one of the strongest positive relationships with price.
Bedroom- As the number of bedrooms increases, price increases. However, fewer unique values (since bedrooms are discrete), so the points are stacked, not fully continuous.
Income- Higher neighborhood income is weakly associated with higher prices. But again, the trend is not very strong, lots of overlap and spread.
Length- As length increases, price slightly increases. But the spread is large; many properties with large lengths still have low prices.
Park- Proximity to park does not show a strong relationship with price. Very little trend visible, almost horizontal.
Correlation Matrix
# Keep only numeric variables and drop rows with missing values.
# select_if() is superseded; select(where(...)) is the current dplyr idiom.
numericVars <- property %>%
  st_drop_geometry() %>%
  select(where(is.numeric)) %>%
  na.omit()
# Pairwise Pearson correlation matrix.
correlation_matrix <- cor(numericVars)
# Lower-triangle correlation plot; statistically insignificant cells blanked.
ggcorrplot(
  correlation_matrix,
  p.mat = cor_pmat(numericVars),
  colors = c("#8d0801", "#f4d58d", "#001427"),
  type = "lower",
  insig = "blank"
) +
  labs(title = "Correlation across numeric variables")
The correlation matrix shows that Price has the strongest positive correlations with Area and Heated Area, meaning larger homes tend to be more expensive. Bedrooms, Bathrooms, and Income also show a positive correlation with Price — homes with more rooms and homes in higher-income areas generally sell for more. Year built has a slight positive correlation too, suggesting newer homes may cost more. On the other hand, Age shows a slight negative correlation with Price — older homes tend to be a little cheaper. Other variables like Park proximity and Enrollment show very weak or no meaningful correlation with Price.
Map 1- Home Pricing
##MAP 1- HOME PRICES MAP
# Assign each sale to a price quintile (1 = cheapest fifth, 5 = priciest).
property <- property %>%
  mutate(quintiles = ntile(Price, 5))
palette5 <- c("#e8c1c5", "#d499b9", "#9055a2", "#2e294e", "#011638")
# Labeller applied by ggplot to the (sorted) factor levels, so "Quintile k"
# always pairs with level k. BUG FIX: the original passed
# custom_labels(unique(property$quintiles)); unique() returns values in
# appearance order, which could scramble the label/colour pairing.
custom_labels <- function(x) {
  paste("Quintile", x)
}
ggplot() +
  # Background Census Tracts
  geom_sf(data = census_data, fill = "grey90", color = "white", size = 0.3) +
  # Home value points
  geom_sf(data = property, aes(color = factor(quintiles)),
          size = 1, alpha = 0.6, show.legend = TRUE) +
  # Color scale: pass the labelling function itself
  scale_color_manual(values = palette5,
                     labels = custom_labels,
                     name = "Prices (Quintiles)") +
  labs(
    title = "Home Prices in Charlotte",
    subtitle = "Quintile-based price distribution",
    caption = "Data Source: Given"
  ) +
  theme_void(base_size = 14) +
  theme(
    legend.position = "bottom",
    legend.title = element_text(size = 12, face = "bold"),
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 13, color = "grey40"),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )
Map 2- Park Density
## MAP-2 : Density of Parks in the City
# 2D kernel-density surface of park centroids over the tract outlines.
# FIX: `size` for polygon outlines is deprecated in modern ggplot2
# (>= 3.4, which this file already targets via after_stat()); use `linewidth`.
ggplot() +
  stat_density_2d(data = data.frame(st_coordinates(parks)),
                  aes(x = X, y = Y, fill = after_stat(level)),
                  geom = "polygon", linewidth = 0.1, bins = 20) +
  geom_sf(data = census_data, fill = "transparent", color = "black") +
  scale_fill_gradient(low = "#FFEDA0", high = "#800026", name = "Density", na.value = "grey40") +
  labs(title = "Density Map", subtitle = "Density of Parks") +
  theme_void()
Map 3- Schools
# Choropleth: school polygons shaded by enrollment on a tract basemap.
enrollment_map <- ggplot() +
  geom_sf(data = census_data, color = "white", fill = "grey90", size = 0.3) +
  geom_sf(data = schools, aes(fill = Enrollment)) +
  scale_fill_viridis_c(name = "Enrollment", option = "D") +
  labs(title = "Choropleth Map of School Enrollment") +
  theme_minimal() +
  theme(legend.position = "bottom")
enrollment_map
Map 4- Bubble Map of Income
## **Map 4- BUBBLE MAP FOR INCOME**
# Tract centroids with bubble size encoding median household income.
census_data <- st_transform(census_data, crs = 2264)
# BUG FIX: the original piped census_data into st_centroid() AND passed it
# again as an argument (census_data %>% st_centroid(census_data)); the data
# belongs in the call exactly once.
centroid_census <- st_centroid(census_data)
# Attach point coordinates as plain columns for geom_point(). This replaces
# the original round-trip of building a second sf object from the coordinates
# and spatially joining it back onto the centroids.
coords <- st_coordinates(centroid_census)
centroid_census <- centroid_census %>%
  mutate(longitude = coords[, 1], latitude = coords[, 2])
ggplot() +
  geom_sf(data = census_data, fill = "#FFEDA0", color = "white", size = 0.3) +
  # NOTE(review): fill has no visible effect on point geometries with the
  # default shape; kept to preserve the original layer stack — confirm intent.
  geom_sf(data = centroid_census, aes(fill = median_incomeE)) +
  geom_point(data = centroid_census,
             aes(x = longitude, y = latitude, size = median_incomeE),
             color = "#800026") + # Bubble size based on Income
  scale_size_continuous(name = "Median Income", range = c(0.5, 5)) +
  labs(title = "Bubble Map of Income") +
  theme_minimal()
Simple Regression Model
# Log-transform skewed predictors. log1p() safely handles zeros in park
# distance and bathroom counts; Age and Area use plain log — assumes both are
# strictly positive (TODO confirm no zero-area parcels or zero median age).
property <- property %>%
mutate( log_Park = log1p(Park), log_bathroom= log1p(Bathroom),
log_Age = log(Age),
log_Area= log(Area)
)
# Baseline hedonic price model on the full dataset (13 predictors).
model <- lm(Price ~ Area + log_bathroom + Income + log_Age + unemploymentE + heatedarea + storyheigh + dateofsale + foundation +numfirepla + log_Park +bldggrade + units , data = property)
summary(model)
##
## Call:
## lm(formula = Price ~ Area + log_bathroom + Income + log_Age +
## unemploymentE + heatedarea + storyheigh + dateofsale + foundation +
## numfirepla + log_Park + bldggrade + units, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2656538 -91738 -2570 78020 18042634
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.834e+06 5.227e+04 -73.349 < 2e-16 ***
## Area 3.662e+00 1.778e-02 205.970 < 2e-16 ***
## log_bathroom 8.359e+04 2.998e+03 27.884 < 2e-16 ***
## Income 1.777e+00 1.401e-02 126.831 < 2e-16 ***
## log_Age -2.827e+05 3.777e+03 -74.826 < 2e-16 ***
## unemploymentE -2.348e+02 5.191e+00 -45.237 < 2e-16 ***
## heatedarea 1.438e+02 9.210e-01 156.133 < 2e-16 ***
## storyheigh1 STORY 1.313e+05 3.617e+04 3.629 0.000285 ***
## storyheigh1.5 STORY 1.353e+05 3.620e+04 3.737 0.000186 ***
## storyheigh2.0 STORY 6.625e+04 3.615e+04 1.833 0.066872 .
## storyheigh2.5 STORY 6.182e+04 3.624e+04 1.706 0.087995 .
## storyheigh3.0 STORY -4.392e+04 3.716e+04 -1.182 0.237251
## storyheighBI-LEVEL 2.085e+04 3.686e+04 0.566 0.571587
## storyheighCAPE COD 9.785e+04 4.630e+04 2.113 0.034579 *
## storyheighRANCH W/BSMT 3.905e+04 3.643e+04 1.072 0.283769
## storyheighSPLIT LEVEL 7.061e+04 3.626e+04 1.947 0.051538 .
## dateofsale 2.639e+02 1.873e+00 140.946 < 2e-16 ***
## foundationCRAWL SPACE -3.556e+04 3.510e+03 -10.131 < 2e-16 ***
## foundationPIER-NO FOUND WALL 5.826e+05 4.109e+04 14.181 < 2e-16 ***
## foundationSLAB-ABV GRD 2.211e+05 1.831e+04 12.076 < 2e-16 ***
## foundationSLAB-COM -8.116e+04 7.363e+04 -1.102 0.270350
## foundationSLAB-RES -8.765e+04 3.663e+03 -23.928 < 2e-16 ***
## numfirepla -3.942e+03 1.112e+03 -3.546 0.000392 ***
## log_Park -4.247e+04 8.060e+02 -52.691 < 2e-16 ***
## bldggradeCUSTOM 1.423e+06 5.630e+03 252.703 < 2e-16 ***
## bldggradeEXCELLENT 6.710e+05 3.369e+03 199.160 < 2e-16 ***
## bldggradeFAIR -1.525e+04 3.731e+03 -4.088 4.36e-05 ***
## bldggradeGOOD 8.833e+04 1.433e+03 61.617 < 2e-16 ***
## bldggradeMINIMUM -1.036e+05 2.104e+04 -4.923 8.55e-07 ***
## bldggradeVERY GOOD 2.555e+05 2.109e+03 121.126 < 2e-16 ***
## units 2.035e+04 2.708e+03 7.514 5.73e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 265000 on 289176 degrees of freedom
## (361 observations deleted due to missingness)
## Multiple R-squared: 0.7125, Adjusted R-squared: 0.7125
## F-statistic: 2.389e+04 on 30 and 289176 DF, p-value: < 2.2e-16
vif(model)
## GVIF Df GVIF^(1/(2*Df))
## Area 1.042391 1 1.020976
## log_bathroom 2.642469 1 1.625567
## Income 1.899243 1 1.378130
## log_Age 1.279919 1 1.131335
## unemploymentE 1.119049 1 1.057851
## heatedarea 4.503684 1 2.122189
## storyheigh 2.349180 9 1.048592
## dateofsale 1.020886 1 1.010389
## foundation 1.533772 5 1.043701
## numfirepla 1.246011 1 1.116249
## log_Park 1.221458 1 1.105196
## bldggrade 3.125932 6 1.099634
## units 1.048771 1 1.024095
Train and Test Data and Machine Learning
# Split dataset into training and test sets using stratified sampling.
# FIX: seed the RNG so the 80/20 partition is reproducible across knits
# (the original omitted this, giving a different split every run).
set.seed(825)
# NOTE(review): pasting 13 variables together yields near-unique strata, so
# this behaves essentially like a simple random split — consider stratifying
# on Price quantiles instead.
inTrain <- createDataPartition(
  y = paste(property$log_bathroom, property$Area,
            property$Income, property$log_Park,
            property$heatedarea, property$unemploymentE, property$storyheigh,
            property$log_Age, property$dateofsale, property$foundation,
            property$numfirepla, property$bldggrade, property$units),
  p = 0.80, list = FALSE
)
# Subset into training and test sets
property.training <- property[inTrain, ]
property.test <- property[-inTrain, ]
# Fit a linear regression model using selected predictors
reg.training <- lm(
  Price ~ Area + log_bathroom + Income + log_Age + unemploymentE + heatedarea + storyheigh + dateofsale + foundation + numfirepla + log_Park + bldggrade + units,
  data = property.training
)
# Check summary and VIF for multicollinearity assessment
summary(reg.training)
##
## Call:
## lm(formula = Price ~ Area + log_bathroom + Income + log_Age +
## unemploymentE + heatedarea + storyheigh + dateofsale + foundation +
## numfirepla + log_Park + bldggrade + units, data = property.training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2639896 -91104 -2412 77063 18041883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.805e+06 5.691e+04 -66.866 < 2e-16 ***
## Area 3.635e+00 1.920e-02 189.345 < 2e-16 ***
## log_bathroom 8.337e+04 3.263e+03 25.551 < 2e-16 ***
## Income 1.747e+00 1.526e-02 114.518 < 2e-16 ***
## log_Age -2.823e+05 4.121e+03 -68.507 < 2e-16 ***
## unemploymentE -2.348e+02 5.641e+00 -41.625 < 2e-16 ***
## heatedarea 1.416e+02 9.986e-01 141.756 < 2e-16 ***
## storyheigh1 STORY 1.289e+05 3.946e+04 3.268 0.001085 **
## storyheigh1.5 STORY 1.330e+05 3.949e+04 3.368 0.000758 ***
## storyheigh2.0 STORY 6.481e+04 3.944e+04 1.643 0.100315
## storyheigh2.5 STORY 5.948e+04 3.953e+04 1.505 0.132439
## storyheigh3.0 STORY -4.513e+04 4.055e+04 -1.113 0.265764
## storyheighBI-LEVEL 2.085e+04 4.021e+04 0.519 0.604082
## storyheighCAPE COD 9.658e+04 5.051e+04 1.912 0.055890 .
## storyheighRANCH W/BSMT 3.749e+04 3.974e+04 0.943 0.345484
## storyheighSPLIT LEVEL 7.021e+04 3.956e+04 1.775 0.075935 .
## dateofsale 2.625e+02 2.033e+00 129.085 < 2e-16 ***
## foundationCRAWL SPACE -3.403e+04 3.820e+03 -8.906 < 2e-16 ***
## foundationPIER-NO FOUND WALL 5.736e+05 4.482e+04 12.798 < 2e-16 ***
## foundationSLAB-ABV GRD 2.155e+05 1.991e+04 10.828 < 2e-16 ***
## foundationSLAB-COM -7.848e+04 7.971e+04 -0.984 0.324890
## foundationSLAB-RES -8.647e+04 3.982e+03 -21.716 < 2e-16 ***
## numfirepla -3.816e+03 1.207e+03 -3.163 0.001561 **
## log_Park -4.229e+04 8.752e+02 -48.327 < 2e-16 ***
## bldggradeCUSTOM 1.433e+06 6.146e+03 233.219 < 2e-16 ***
## bldggradeEXCELLENT 6.792e+05 3.668e+03 185.172 < 2e-16 ***
## bldggradeFAIR -1.704e+04 4.066e+03 -4.191 2.77e-05 ***
## bldggradeGOOD 8.957e+04 1.555e+03 57.609 < 2e-16 ***
## bldggradeMINIMUM -1.055e+05 2.283e+04 -4.621 3.82e-06 ***
## bldggradeVERY GOOD 2.593e+05 2.287e+03 113.350 < 2e-16 ***
## units 2.221e+04 2.932e+03 7.576 3.59e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263900 on 243709 degrees of freedom
## (320 observations deleted due to missingness)
## Multiple R-squared: 0.7106, Adjusted R-squared: 0.7106
## F-statistic: 1.995e+04 on 30 and 243709 DF, p-value: < 2.2e-16
vif_values <- vif(reg.training)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Area 1.042738 1 1.021145
## log_bathroom 2.638029 1 1.624201
## Income 1.897922 1 1.377651
## log_Age 1.283847 1 1.133070
## unemploymentE 1.117350 1 1.057048
## heatedarea 4.475869 1 2.115625
## storyheigh 2.342408 9 1.048424
## dateofsale 1.021850 1 1.010866
## foundation 1.539967 5 1.044122
## numfirepla 1.244700 1 1.115661
## log_Park 1.221337 1 1.105141
## bldggrade 3.078518 6 1.098235
## units 1.049706 1 1.024552
# Make predictions on the test set
property.test <- property.test %>%
mutate(
Regression = "Baseline Regression",
Price.Predict = predict(reg.training, property.test),
Price.Error = Price.Predict - Price,
Price.AbsError = abs(Price.Predict - Price),
Price.APE = (abs(Price.Predict - Price)) / Price
)
This model explains about 71% of the variation in property prices in Charlotte (R-squared = 0.71), meaning most of the price differences can be predicted using the selected factors. However, there’s still some error, as seen in the residuals (differences between predicted and actual prices), which range widely. Some variables, like heated area and building grade, have a strong impact on price, while others, like certain foundation types, seem less significant. Overall, the model captures key trends in housing prices in Charlotte.
Model Diagnostics
# Standard lm diagnostic panel: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage, in a 2x2 grid.
par(mfrow = c(2, 2))
plot(reg.training)
# NOTE(review): car was already attached at the top of the script; this
# re-load (and the repeated VIF printout below) could be dropped.
library(car)
## Warning: package 'car' was built under R version 4.4.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.2
vif_values <- vif(reg.training)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## Area 1.042738 1 1.021145
## log_bathroom 2.638029 1 1.624201
## Income 1.897922 1 1.377651
## log_Age 1.283847 1 1.133070
## unemploymentE 1.117350 1 1.057048
## heatedarea 4.475869 1 2.115625
## storyheigh 2.342408 9 1.048424
## dateofsale 1.021850 1 1.010866
## foundation 1.539967 5 1.044122
## numfirepla 1.244700 1 1.115661
## log_Park 1.221337 1 1.105141
## bldggrade 3.078518 6 1.098235
## units 1.049706 1 1.024552
Residuals vs Fitted: In the Residuals vs Fitted plot, most of the points are clustered near zero for smaller fitted values but spread out slightly as fitted values grow, showing a mild fan shape. This suggests that the model might have heteroskedasticity — the variance of the residuals is not constant. There are also a few clear outliers, meaning some points are behaving differently from the rest.
Normal Q-Q Plot: The Normal Q-Q plot shows that the residuals roughly follow the normal line at the center but deviate strongly at the ends, especially in the upper tail. This means the residuals are not perfectly normally distributed and there are some extreme values (outliers) pulling the distribution upward.
Scale-Location Plot: The Scale-Location plot shows that the spread of residuals slightly increases as fitted values increase. Ideally, the points should form a flat line, but here, there’s a slight upward trend. This again points to heteroskedasticity, meaning the error variance grows with the size of the predictions.
Residuals vs Leverage Plot: The Residuals vs Leverage plot shows that while most points have low leverage and low standardized residuals, a few points stand out far away from the others and are close to the Cook’s distance lines. This indicates the presence of influential points that could heavily affect the model’s results.
MAE
mean(property.test$Price.AbsError, na.rm = T)
## [1] 134927.4
The average error in the property price predictions is about $134,927, which means, on average, the model’s predictions are off by this amount.
Cross Validation
# 100-fold cross-validation of the same linear specification.
# NOTE(review): 100 folds is unusually many (holdouts of ~2,900 rows here);
# 10-fold is the common default and much cheaper — confirm the choice.
fitControl <- trainControl(method = "cv", number = 100)
set.seed(825)
# Train a linear regression model using cross-validation.
# na.action = na.pass defers NA handling to lm's own na.action.
reg.cv <-
train(Price ~ .,
data = st_drop_geometry(property) %>%
dplyr::select(Price, Area, log_bathroom, Income,log_Age, unemploymentE, heatedarea, storyheigh, dateofsale, foundation, numfirepla, log_Park, bldggrade, units),
method = "lm",
trControl = fitControl,
na.action = na.pass)
reg.cv
## Linear Regression
##
## 289568 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 286672, 286672, 286671, 286673, 286673, 286673, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 254742.3 0.7300342 133548.4
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
reg.cv$resample[1:5,]
## RMSE Rsquared MAE Resample
## 1 239989.3 0.7561913 134299.2 Fold001
## 2 395036.3 0.5021389 137144.1 Fold002
## 3 226717.7 0.7501263 130179.8 Fold003
## 4 234913.8 0.7487178 134088.6 Fold004
## 5 234035.8 0.8294117 129848.0 Fold005
This linear regression model estimates property values in Charlotte, NC, using 13 factors and was tested with 100-fold cross-validation to check accuracy. It explains about 73% of the variation in home prices, but the predictions have an average error of $133,548. The RMSE of $254,742 suggests there’s still some variability, likely due to differences in neighborhoods, home features, or market trends. While the model does a solid job capturing overall patterns, there’s room for improvement in refining predictions.
Spatial Lags
# Extract the coordinates from the property dataset
coords_pr <- st_coordinates(property)
# Compute spatial weights using k-nearest neighbors (k = 8)
nb <- knearneigh(as.matrix(coords_pr), k = 8)
listw <- nb2listw(knn2nb(nb), style = "W")
# Compute the spatially lagged price variable (mean price of 8 neighbors)
property$lagged_price <- lag.listw(listw, property$Price)
# BUG FIX: predictors must be joined with "+" inside the formula. The
# original lm(Price ~ lagged_price, Income, unemploymentE, data = property)
# positionally matched Income to `subset` and unemploymentE to `weights`
# (visible in the printed Call), silently fitting a weighted model on a
# subset instead of including those covariates.
model_spatial <- lm(Price ~ lagged_price + Income + unemploymentE, data = property)
summary(model_spatial)
##
## Call:
## lm(formula = Price ~ lagged_price, data = property, subset = Income,
## weights = unemploymentE)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -607780 -4801 -3053 -894 1485403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.468e+02 3.834e+01 16.87 <2e-16 ***
## lagged_price 9.992e-01 8.586e-05 11637.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 126900 on 289352 degrees of freedom
## (174 observations deleted due to missingness)
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 1.354e+08 on 1 and 289352 DF, p-value: < 2.2e-16
The model shows signs of overfitting, as the Rsquared value is very high (0.9979), meaning it fits the data almost perfectly. The lagged price coefficient is nearly 1, and the t-value is extremely large, suggesting the model is too closely tied to the data. The residuals also vary widely, which indicates the model may not perform well on new data.
Residual Analysis
# Compute predicted prices
property.test$Price.Predict <- predict(reg.training, newdata = property.test)
# Plot predicted vs. observed prices. BUG FIX: base-graphics calls are
# statements, not ggplot layers — the original plot(...) + abline(...)
# misapplied "+" and printed "integer(0)". Call abline() on its own line.
plot(property.test$Price, property.test$Price.Predict,
     main = "Predicted vs. Observed Prices",
     xlab = "Observed Prices",
     ylab = "Predicted Prices",
     col = "blue", pch = 19)
abline(0, 1, col = "red")
# Residuals (observed minus predicted)
property.test$residuals <- property.test$Price - property.test$Price.Predict
# Map residuals over the tract basemap; diverging scale centred at zero so
# over- and under-predictions read as opposite colours.
ggplot() +
  geom_sf(data = census_data, fill = "grey90", color = "white", size = 0.3) +
  geom_sf(data = property.test, aes(color = residuals), size = 1.2) +
  scale_color_gradient2(low = "blue", mid = "#ffcdb2", high = "red", midpoint = 0, name = "Residuals") +
  labs(title = "Map of Regression Residuals") +
  theme_minimal()
Spatial Autocorrelation
# Moran's I for spatial autocorrelation of residuals
# NOTE(review): spdep is already attached at the top of the script.
library(spdep)
# Remove missing or non-finite values first
property.test <- property.test %>%
filter(!is.na(Price.Error)) %>% # Remove rows with NA values
mutate(Price.Error = ifelse(!is.finite(Price.Error), 0, Price.Error)) # Replace non-finite values (e.g. Inf from division by a zero Price) with 0
# Extract coordinates **after** filtering so rows align with the weights
coords.test <- st_coordinates(property.test)
# Recompute the spatial neighbors and weights on the cleaned dataset
# (k = 5 nearest neighbors, row-standardised weights)
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
## Warning in knearneigh(coords.test, 5): knearneigh: identical points found
## Warning in knearneigh(coords.test, 5): knearneigh: kd_tree not available for
## identical points
spatialWeights.test <- nb2listw(neighborList.test, style="W")
# Now apply the spatial lag (mean error among each point's 5 neighbors)
property.test <- property.test %>%
mutate(lagPriceError = lag.listw(spatialWeights.test, Price.Error))
# Monte-Carlo Moran's I test (999 permutations) on the prediction errors
moranTest <- moran.mc(property.test$Price.Error, spatialWeights.test, nsim = 999)
# Print Moran's I statistic
moranTest$statistic
## statistic
## 0.5379529
A Moran’s I value of 0.54 indicates moderate positive spatial autocorrelation in property price prediction errors in Charlotte. This means nearby properties tend to have similar pricing errors, suggesting the model systematically over- or under-predicts within particular neighborhoods rather than making spatially random mistakes.
Discussing the Best Model
# Plot predicted vs. observed prices (again for final evaluation);
# the dashed red y = x line marks perfect prediction.
ggplot(property.test, aes(x = Price, y = Price.Predict)) +
geom_point(alpha = 0.6, color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Predicted vs. Observed Prices", x = "Observed Prices", y = "Predicted Prices") +
theme_minimal()
# Final accuracy metrics on the hold-out set.
mean(property.test$Price.AbsError, na.rm = TRUE) #MAE
## [1] 134927.4
sqrt(mean(property.test$Price.Error^2, na.rm = TRUE)) #RMSE
## [1] 270812.1
# Training R-squared (note: in-sample, not the hold-out fit)
summary(reg.training)$r.squared #R Squared
## [1] 0.7106471
The model performs best for properties with lower observed prices, where most of the blue points are densely clustered close to the red dashed line, showing that predicted prices are quite close to the true values. However, the model performs worse for properties with higher observed prices — the spread of points becomes wider, and predictions tend to underestimate the actual prices. At the extreme high end (very expensive properties), the predictions are much lower than the observed prices, indicating that the model struggles to accurately capture very high-value outliers.
Potential for Improvements- To improve the model, it would be important to better capture the behavior of high-priced properties, where predictions are currently underestimating observed prices. This could involve adding interaction terms, using nonlinear models to reduce the impact of extreme values.
Best Model- This model is doing a good job of predicting property prices in Charlotte. On average, the predictions are off by about $134,927 (MAE), which isn’t too far off. The RMSE of about $270,812 — which weights large errors more heavily — shows that a handful of big misses inflate the typical error. The model explains about 71% of the variation in property prices, meaning it captures most of the important trends. Overall, it’s a solid model for predicting property values in the area.