This project applies linear regression analysis to explore patterns in urban and socioeconomic data. The goal was to move beyond raw observation and use statistical models to identify relationships between variables, test assumptions, and generate interpretable insights.
The focus was to:
Build simple and multiple regression models.
Interpret coefficients to understand how predictors influence outcomes.
Evaluate model fit using R-squared, adjusted R-squared, and residual diagnostics.
Detect issues such as heteroskedasticity, multicollinearity, and outliers.
Data Preparation → Cleaned and structured raw datasets into usable formats.
Modeling → Fitted Ordinary Least Squares (OLS) regression models.
Diagnostics → Used residual plots, scatterplots, and correlation checks to validate assumptions.
Comparisons → Tested both simple and multiple regression models to see how adding predictors affected explanatory power.
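The comparison step can be sketched as follows. This is a minimal example using the Boston dataset introduced below; adjusted R-squared penalizes extra predictors, so it is the fairer yardstick when comparing models of different sizes.

```r
library(MASS)  # provides the Boston dataset

# Simple regression: one predictor
simple_fit <- lm(medv ~ rm, data = Boston)

# Multiple regression: same response, added predictor
multiple_fit <- lm(medv ~ rm + lstat, data = Boston)

# Compare explanatory power on a penalized basis
summary(simple_fit)$adj.r.squared
summary(multiple_fit)$adj.r.squared
```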
#install.packages(c("tidyverse", "broom", "ggplot2", "caret", "car", "MASS"))
library(tidyverse)
library(broom)
library(ggplot2)
library(caret)
library(car)
library(MASS)
I used the built-in Boston dataset from the MASS package, which contains housing data for neighborhoods in Boston.
data("Boston")
boston_data <- as_tibble(Boston)
head(boston_data)
summary(boston_data)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
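Before modeling, a quick correlation check (one of the diagnostics listed in the workflow above) helps flag predictors that overlap. A minimal sketch, assuming `boston_data` from the chunk above:

```r
# Pairwise correlations among the response and the candidate predictors
cor(boston_data[, c("medv", "rm", "lstat", "crim")])
```

Strongly correlated predictors are early warning signs of the multicollinearity checked more formally with VIF later on.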
Scatterplots with fitted lines → Showed linear trends between variables.
Residual plots → Checked whether errors were randomly distributed.
Coefficient tables → Summarized how strongly predictors influenced the outcome.
ggplot(boston_data, aes(x = rm, y = medv)) +
geom_point() +
geom_smooth(method = "lm", color = "red") +
labs(title = "Relationship Between Number of Rooms and Median Home Value")
The response variable is medv (median home value), and the predictors are rm (average number of rooms per dwelling), lstat (percentage of lower-status population), and crim (per-capita crime rate).
model <- lm(medv ~ rm + lstat + crim, data = boston_data)
summary(model)
##
## Call:
## lm(formula = medv ~ rm + lstat + crim, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.925 -3.566 -1.157 1.906 29.024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.56225 3.16602 -0.809 0.41873
## rm 5.21695 0.44203 11.802 < 2e-16 ***
## lstat -0.57849 0.04767 -12.135 < 2e-16 ***
## crim -0.10294 0.03202 -3.215 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.49 on 502 degrees of freedom
## Multiple R-squared: 0.6459, Adjusted R-squared: 0.6437
## F-statistic: 305.2 on 3 and 502 DF, p-value: < 2.2e-16
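Since broom is already loaded, the coefficient table can also be pulled into a tidy data frame, which is convenient for reporting or plotting. A small sketch, reusing the `model` object fitted above:

```r
# Coefficient estimates with 95% confidence intervals as a data frame
tidy(model, conf.int = TRUE, conf.level = 0.95)
```

Intervals that exclude zero correspond to the significant predictors flagged in the summary output.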
What does the coefficient for rm suggest about the impact of more rooms on home prices? Does lstat affect home value negatively or positively? Is this expected?
par(mfrow = c(2, 2))
plot(model)
- Residuals vs Fitted Plot: If a pattern exists, non-linearity may be present.
- Q-Q Plot: Checks if residuals are normally distributed.
- Scale-Location Plot: Detects heteroskedasticity.
- Residuals vs Leverage: Identifies influential observations.
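The visual Scale-Location check can be backed up with a formal test. The car package (already loaded) provides `ncvTest()`, a score test for non-constant error variance; a small p-value suggests heteroskedasticity. A minimal sketch, reusing `model`:

```r
# Score test for non-constant variance of the residuals
ncvTest(model)
```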
vif(model)
## rm lstat crim
## 1.616468 1.941883 1.271372
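Each VIF is defined as 1 / (1 - R²ⱼ), where R²ⱼ comes from regressing predictor j on the remaining predictors. This can be verified by hand; a sketch for rm, reusing `boston_data`:

```r
# Auxiliary regression: rm on the other predictors
aux <- lm(rm ~ lstat + crim, data = boston_data)

# VIF for rm, by the definition above; should match vif(model)["rm"]
1 / (1 - summary(aux)$r.squared)
```

Values near 1 indicate little overlap among predictors; a common rule of thumb treats VIFs above 5 or 10 as problematic. Here all three are well below that.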
If assumptions are violated, log-transforming variables can improve the model.
boston_data <- boston_data %>% mutate(
log_medv = log(medv),
log_lstat = log(lstat + 1)
)
log_model <- lm(log_medv ~ rm + log_lstat + crim, data = boston_data)
summary(log_model)
##
## Call:
## lm(formula = log_medv ~ rm + log_lstat + crim, data = boston_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.70866 -0.11815 -0.01845 0.11886 0.89289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.587817 0.155975 23.003 < 2e-16 ***
## rm 0.101366 0.017589 5.763 1.44e-08 ***
## log_lstat -0.464013 0.024455 -18.974 < 2e-16 ***
## crim -0.011523 0.001178 -9.780 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2078 on 502 degrees of freedom
## Multiple R-squared: 0.743, Adjusted R-squared: 0.7415
## F-statistic: 483.8 on 3 and 502 DF, p-value: < 2.2e-16
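Because the response is log-transformed, the coefficients are no longer in dollar units. A sketch of one common back-transformation, reusing `log_model`:

```r
# For a log response, exp(beta) - 1 approximates the proportional change
# in medv per one-unit increase in the predictor (others held fixed).
# Note: the log_lstat coefficient is an elasticity (percent change in medv
# per percent change in lstat), so this conversion does not apply to it.
(exp(coef(log_model)) - 1) * 100
```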
set.seed(123)
train_control <- trainControl(method = "cv", number = 10)
cv_model <- train(medv ~ rm + lstat + crim, data = boston_data, method = "lm", trControl = train_control)
cv_model
## Linear Regression
##
## 506 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 455, 456, 456, 456, 456, 456, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 5.487973 0.6455425 3.921115
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
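The same 10-fold setup can be reused to see how the log-transformed model generalizes. One caveat: its RMSE and MAE are on the log scale, so they are not directly comparable to the dollar-scale errors above. A sketch reusing `train_control` and the `log_medv`/`log_lstat` columns created earlier:

```r
set.seed(123)
cv_log <- train(log_medv ~ rm + log_lstat + crim, data = boston_data,
                method = "lm", trControl = train_control)
cv_log$results  # RMSE/MAE are on the log scale, not in $1000s
```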
Adding predictors can improve explanatory power but may also introduce multicollinearity when variables overlap.
Residual analysis is essential to detect heteroskedasticity and violations of linear assumptions.
Outliers can disproportionately influence regression slopes and must be carefully evaluated.
Regression is powerful for pattern detection but should not be mistaken for causal proof.
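The influence of individual observations flagged in the Residuals vs Leverage plot can also be quantified with Cook's distance; a common rule of thumb flags values above 4/n. A minimal sketch, reusing `model` and `boston_data`:

```r
# Identify observations whose removal would most change the fitted model
d <- cooks.distance(model)
which(d > 4 / nrow(boston_data))
```

Flagged rows deserve inspection, not automatic deletion; they may be data errors or genuinely unusual neighborhoods.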
This project demonstrates how regression modeling provides a bridge between descriptive data analysis and predictive insight. By applying statistical rigor, we can uncover patterns that guide urban planning, policy analysis, and decision-making. However, every model is only as strong as its assumptions—highlighting the importance of diagnostics and careful interpretation.