Linear Regression for Urban Data

This project applies linear regression analysis to explore patterns in urban and socioeconomic data. The goal was to move beyond raw observation and use statistical models to identify relationships between variables, test assumptions, and generate interpretable insights.

Objective

The focus was to:

  • Build simple and multiple regression models.

  • Interpret coefficients to understand how predictors influence outcomes.

  • Evaluate model fit using R², adjusted R², and residual diagnostics.

  • Detect issues such as heteroskedasticity, multicollinearity, and outliers.

Methods

Data Preparation → Cleaned and structured raw datasets into usable formats.

Modeling → Fitted Ordinary Least Squares (OLS) regression models.

Diagnostics → Used residual plots, scatterplots, and correlation checks to validate assumptions.

Comparisons → Tested both simple and multiple regression models to see how adding predictors affected explanatory power.


1. Load Packages & Data

#install.packages(c("tidyverse", "broom", "ggplot2", "caret", "car", "MASS"))
library(tidyverse)
library(broom)
library(ggplot2)
library(caret)
library(car)
library(MASS)  # note: loading MASS after tidyverse masks dplyr::select(); qualify it as dplyr::select() when needed

Load the Dataset

I used the built-in Boston dataset from the MASS package, which contains housing and neighborhood data for 506 census tracts in the Boston area.

data("Boston")
boston_data <- as_tibble(Boston)
head(boston_data)

2. Exploratory Data Analysis

2.1 Summary Statistics

summary(boston_data)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Visualizations

Scatterplots with fitted lines → Showed linear trends between variables.

Residual plots → Checked whether errors were randomly distributed.

Coefficient tables → Summarized how strongly predictors influenced the outcome.

2.2 Scatter Plots for Relationship Exploration

ggplot(boston_data, aes(x = rm, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Relationship Between Number of Rooms and Median Home Value")


3. Fit a Linear Regression Model

The response variable is medv (median home value, in thousands of dollars), and the predictors are rm (average number of rooms per dwelling), lstat (percentage of lower-status population), and crim (per-capita crime rate).

model <- lm(medv ~ rm + lstat + crim, data = boston_data)
summary(model)
## 
## Call:
## lm(formula = medv ~ rm + lstat + crim, data = boston_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.925  -3.566  -1.157   1.906  29.024 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.56225    3.16602  -0.809  0.41873    
## rm           5.21695    0.44203  11.802  < 2e-16 ***
## lstat       -0.57849    0.04767 -12.135  < 2e-16 ***
## crim        -0.10294    0.03202  -3.215  0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.49 on 502 degrees of freedom
## Multiple R-squared:  0.6459, Adjusted R-squared:  0.6437 
## F-statistic: 305.2 on 3 and 502 DF,  p-value: < 2.2e-16

Interpretation Questions:

  • What does the coefficient for rm suggest about the impact of more rooms on home prices?
  • Does lstat negatively or positively affect home value? Is this expected?
  • How well does the model fit the data (look at R² and Adjusted R²)?
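These questions can also be answered programmatically by pulling the coefficients and fit statistics into data frames with broom (loaded above); a minimal sketch:

```r
# Coefficient table with 95% confidence intervals, one row per term
tidy(model, conf.int = TRUE)

# One-row summary of fit: R-squared, adjusted R-squared, AIC, etc.
glance(model)
```

Working with tidy data frames instead of the printed summary makes it easy to filter, sort, or plot the coefficients later.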

4. Checking OLS Assumptions

4.1 Residual Diagnostics

par(mfrow = c(2, 2))
plot(model)

  • Residuals vs Fitted: if a pattern exists, non-linearity may be present.
  • Q-Q Plot: checks whether the residuals are normally distributed.
  • Scale-Location: detects heteroscedasticity.
  • Residuals vs Leverage: identifies influential observations.
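The plots can be backed up with formal tests; one option, using the car package loaded earlier (these tests are an addition of mine, not part of the original analysis):

```r
# Breusch-Pagan-style test for non-constant error variance
ncvTest(model)

# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(model))
```

A small p-value in either test points to a violated assumption, which motivates the transformations tried in the next section.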

4.2 Multicollinearity Check

vif(model)
##       rm    lstat     crim 
## 1.616468 1.941883 1.271372
  • VIF values above 5 (or, by a looser convention, 10) indicate problematic multicollinearity; all three predictors here fall well below that threshold.

5. Model Refinement: Log Transformations

If assumptions are violated, for example a right-skewed response or heteroscedastic residuals, log-transforming variables can often improve the model.

boston_data <- boston_data %>% mutate(
  log_medv = log(medv),
  log_lstat = log(lstat + 1)
)

log_model <- lm(log_medv ~ rm + log_lstat + crim, data = boston_data)
summary(log_model)
## 
## Call:
## lm(formula = log_medv ~ rm + log_lstat + crim, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70866 -0.11815 -0.01845  0.11886  0.89289 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.587817   0.155975  23.003  < 2e-16 ***
## rm           0.101366   0.017589   5.763 1.44e-08 ***
## log_lstat   -0.464013   0.024455 -18.974  < 2e-16 ***
## crim        -0.011523   0.001178  -9.780  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2078 on 502 degrees of freedom
## Multiple R-squared:  0.743,  Adjusted R-squared:  0.7415 
## F-statistic: 483.8 on 3 and 502 DF,  p-value: < 2.2e-16
  • Compare the fit of the two models. Does the log transformation improve it? (R² values are not directly comparable here, since the responses are on different scales.)
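To put the two fits side by side, broom::glance() can be combined with dplyr (both loaded above); select() is qualified because MASS masks it. A sketch:

```r
# Fit statistics for both models in one table; note that sigma and AIC,
# like R-squared, are on different response scales across the two rows
dplyr::bind_rows(
  broom::glance(model)     %>% dplyr::mutate(model = "levels"),
  broom::glance(log_model) %>% dplyr::mutate(model = "log")
) %>%
  dplyr::select(model, r.squared, adj.r.squared, sigma, AIC)
```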

6. Model Performance & Cross-Validation

set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(medv ~ rm + lstat + crim, data = boston_data, method = "lm", trControl = train_control)
cv_model
## Linear Regression 
## 
## 506 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 455, 456, 456, 456, 456, 456, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   5.487973  0.6455425  3.921115
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
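caret stores the per-fold results on the fitted object, which is useful for checking the variability behind the averaged numbers above (a sketch):

```r
# Per-fold RMSE / R-squared / MAE across the 10 resamples
cv_model$resample

# Averaged metrics (the same numbers printed in the summary above)
cv_model$results
```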

Key Learnings

  • Adding predictors improves explanatory power but can also introduce multicollinearity if variables overlap.

  • Residual analysis is essential to detect heteroskedasticity and violations of linear assumptions.

  • Outliers can disproportionately influence regression slopes and must be carefully evaluated.

  • Regression is powerful for pattern detection but should not be mistaken for causal proof.
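The point about outliers can be made concrete with Cook's distance; a sketch using the common 4/n rule of thumb (my choice of cutoff, not one used in the original analysis):

```r
# Cook's distance for each observation in the levels model
cooks <- cooks.distance(model)

# Flag observations above the 4/n rule-of-thumb cutoff
influential <- which(cooks > 4 / nrow(boston_data))
length(influential)         # how many points warrant a closer look
boston_data[influential, ]  # inspect the flagged tracts
```

Refitting the model with and without the flagged rows shows how much any single neighborhood moves the slopes.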

Reflection

This project demonstrates how regression modeling provides a bridge between descriptive data analysis and predictive insight. By applying statistical rigor, we can uncover patterns that guide urban planning, policy analysis, and decision-making. However, every model is only as strong as its assumptions—highlighting the importance of diagnostics and careful interpretation.