nycflights13 and simple linear models

In this lab you will work with the flights table
from the nycflights13 package. The goals are:

- fit a linear model, air_time ~ distance,
- fit a log-log model, log(air_time) ~ log(distance).

Install and load the package (if needed):
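The install-and-load step could look like the sketch below. The `requireNamespace()` guard is an assumption added here so that the package is downloaded only when it is missing.

```r
# Install nycflights13 only if it is not already available, then attach it
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}
library(nycflights13)
```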
The package contains several related tables. We use
flights:
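A minimal sketch for a first look at the data, assuming the package is installed. Besides flights, the package ships the airlines, airports, planes and weather tables.

```r
library(nycflights13)

# Names of the data sets shipped with the package
data(package = "nycflights13")$results[, "Item"]

# flights is lazy-loaded once the package is attached
dim(flights)
head(flights)
```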
Important columns for this lab:

- air_time - time spent in the air (minutes),
- distance - distance between airports (miles),
- dep_delay, arr_delay - departure and arrival delays (minutes),
- month, day, carrier, origin, dest - useful grouping variables.

Summary statistics for flights

Start with quick summaries for selected numeric columns:
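One possible way to get these quick summaries is base R's `summary()`, which reports the quartiles, the mean and the number of missing values per column. The helper name `num_cols` is hypothetical and used only in this sketch.

```r
library(nycflights13)

# Numeric columns of interest (num_cols is a hypothetical helper name)
num_cols <- c("air_time", "distance", "dep_delay", "arr_delay")

# Min, quartiles, mean, max and NA count for each column
summary(flights[, num_cols])
```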
Compute mean, median, and standard deviation (ignoring missing values):
vars <- c("air_time", "distance", "dep_delay", "arr_delay")
means <- sapply(flights[, vars], mean, na.rm = TRUE)
medians <- sapply(flights[, vars], median, na.rm = TRUE)
sds <- sapply(flights[, vars], sd, na.rm = TRUE)
means
medians
sds

Some simple grouped summaries:
# Average departure delay by month
tapply(flights$dep_delay, flights$month, mean, na.rm = TRUE)
# Number of flights by origin airport
table(flights$origin)
# Average air_time by origin
tapply(flights$air_time, flights$origin, mean, na.rm = TRUE)

Task 2.1
- Summarize distance and air_time.
- Examine the distribution of dep_delay and comment briefly on skewness.

Task 2.2
- Determine which origin airport has the longest average air_time.

The full flights table is very large (336,776 rows), so we create a
smaller clean sample.
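The sampling step that follows uses fl_clean, which is not defined in the text shown here. One plausible way to build it is sketched below, assuming "clean" means no missing values in the columns used later. The choice of columns and the name `key_cols` are assumptions.

```r
library(nycflights13)

# Columns used in later plots and models (an assumed choice)
key_cols <- c("air_time", "distance", "dep_delay", "arr_delay")

# Keep only rows with no missing values in those columns
fl_clean <- flights[complete.cases(flights[, key_cols]), ]
dim(fl_clean)
```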
Set a seed for reproducibility and sample rows:
set.seed(123)
n_small <- 3000
idx <- sample(seq_len(nrow(fl_clean)), size = n_small)
fl_small <- fl_clean[idx, ]
dim(fl_small)
head(fl_small)

Note: for large tables, you can use sample.int() for a
small performance improvement:
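A sketch of the `sample.int()` variant. It draws row indices directly from the row count instead of sampling from a `seq_len()` vector. The fallback definition of fl_clean is a stand-in so that the block runs on its own; in the lab, the existing fl_clean is reused.

```r
library(nycflights13)

# Reuse fl_clean from the lab if it exists; otherwise build a stand-in
if (!exists("fl_clean")) {
  fl_clean <- flights[complete.cases(flights[, c("air_time", "distance")]), ]
}

set.seed(123)
n_small <- 3000

# Sample n_small row indices from 1..nrow(fl_clean) without replacement
idx <- sample.int(nrow(fl_clean), size = n_small)
fl_small <- fl_clean[idx, ]
dim(fl_small)
```

With the same seed, `sample.int(n, size)` and `sample(seq_len(n), size)` should return the same indices, so switching between them does not change the sample.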
Optional: focus only on selected months or one origin airport:
fl_small_jja <- fl_small[fl_small$month %in% c(6, 7, 8), ]  # summer flights
fl_small_ewr <- fl_small[fl_small$origin == "EWR", ]        # one airport

If you use the optional subsets, adjust later plots/models accordingly.
Scatter plot of distance vs air_time

Points colored by origin (factor variable):

cols <- as.numeric(factor(fl_small$origin))
plot(fl_small$distance, fl_small$air_time,
     pch = 19, cex = 0.5, col = cols,
     xlab = "Distance (miles)", ylab = "Air time (minutes)",
     main = "air_time vs distance by origin")
legend("topleft", legend = levels(factor(fl_small$origin)),
       col = seq_along(levels(factor(fl_small$origin))),
       pch = 19, bty = "n")

Task 4.1
We will use three continuous variables:

- x = distance,
- y = air_time,
- z = dep_delay.

scatterplot3d

Install and load package:
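A minimal sketch of this step. The `requireNamespace()` guard is an assumption that avoids reinstalling the package on every run.

```r
# Install scatterplot3d only if it is missing, then attach it
if (!requireNamespace("scatterplot3d", quietly = TRUE)) {
  install.packages("scatterplot3d")
}
library(scatterplot3d)
```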
Plot:
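One way the static 3D plot could be drawn, mirroring the colors and axis labels of the rgl version. The plot title and point size are illustrative choices, and the fallback fl_small is a stand-in so that the sketch runs on its own.

```r
library(nycflights13)
library(scatterplot3d)

# Reuse fl_small from the lab if it exists; otherwise build a stand-in
if (!exists("fl_small")) {
  set.seed(123)
  keep <- complete.cases(flights[, c("air_time", "distance", "dep_delay")])
  fl_small <- flights[keep, ][sample.int(sum(keep), 3000), ]
}

# One integer color code per origin airport
cols <- as.numeric(factor(fl_small$origin))

scatterplot3d(
  x = fl_small$distance,
  y = fl_small$air_time,
  z = fl_small$dep_delay,
  color = cols, pch = 19, cex.symbols = 0.5,
  xlab = "distance", ylab = "air_time", zlab = "dep_delay",
  main = "distance, air_time and dep_delay by origin"
)
legend("topright",
       legend = levels(factor(fl_small$origin)),
       col = seq_along(levels(factor(fl_small$origin))),
       pch = 19, bty = "n")
```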
rgl

install.packages("rgl")
library(rgl)
cols <- as.numeric(factor(fl_small$origin))
plot3d(
  x = fl_small$distance,
  y = fl_small$air_time,
  z = fl_small$dep_delay,
  col = cols,
  size = 3,
  xlab = "distance",
  ylab = "air_time",
  zlab = "dep_delay"
)
legend3d("topright",
         legend = levels(factor(fl_small$origin)),
         pch = 16,
         col = seq_along(levels(factor(fl_small$origin))),
         cex = 1)

Task 5.1
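- Create the static 3D plot with scatterplot3d.
- Create the interactive 3D plot with rgl.

Linear models: air_time vs distance

Now fit and compare two models on the same cleaned data.
The log-log model needs positive values, and both models should use the same rows,
so we keep only rows where air_time and distance are positive and not missing: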
model_df <- fl_small[
  complete.cases(fl_small[, c("air_time", "distance")]) &
    fl_small$air_time > 0 &
    fl_small$distance > 0,
  c("air_time", "distance", "origin")
]
dim(model_df)

Plot with fitted line:
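A sketch of Model A, assuming m_lin is an ordinary least-squares fit of air_time on distance (the name m_lin comes from the residual plots later in the lab). The fallback model_df is a stand-in so that the block runs on its own, and the line color is an illustrative choice.

```r
library(nycflights13)

# Reuse model_df from the lab if it exists; otherwise build a stand-in
if (!exists("model_df")) {
  set.seed(123)
  ok <- complete.cases(flights[, c("air_time", "distance")]) &
    flights$air_time > 0 & flights$distance > 0
  model_df <- flights[ok, c("air_time", "distance", "origin")]
  model_df <- model_df[sample.int(nrow(model_df), 3000), ]
}

# Model A: air_time as a linear function of distance
m_lin <- lm(air_time ~ distance, data = model_df)
summary(m_lin)

# Scatter plot with the fitted regression line on top
plot(model_df$distance, model_df$air_time,
     pch = 19, cex = 0.5, col = "gray40",
     xlab = "Distance (miles)", ylab = "Air time (minutes)",
     main = "Model A: air_time ~ distance")
abline(m_lin, col = "red", lwd = 2)
```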
Log-log diagnostic plot:
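A sketch of Model B, assuming m_log is the log-log fit named in the goals. Both axes are plotted on the log scale so that the fitted line is straight. The fallback model_df is a stand-in so that the block runs on its own.

```r
library(nycflights13)

# Reuse model_df from the lab if it exists; otherwise build a stand-in
if (!exists("model_df")) {
  set.seed(123)
  ok <- complete.cases(flights[, c("air_time", "distance")]) &
    flights$air_time > 0 & flights$distance > 0
  model_df <- flights[ok, c("air_time", "distance", "origin")]
  model_df <- model_df[sample.int(nrow(model_df), 3000), ]
}

# Model B: linear relationship between the logs of both variables
m_log <- lm(log(air_time) ~ log(distance), data = model_df)
summary(m_log)

# Points and fitted line on the log-log scale
plot(log(model_df$distance), log(model_df$air_time),
     pch = 19, cex = 0.5, col = "gray40",
     xlab = "log(distance)", ylab = "log(air_time)",
     main = "Model B: log(air_time) ~ log(distance)")
abline(m_log, col = "red", lwd = 2)
```

The slope of Model B can be read as an elasticity: the approximate percentage change in air_time for a one percent change in distance.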
Compare basic indicators:
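One possible set of indicators is sketched below. R-squared is taken from each model's summary, but the two values refer to different response scales (minutes vs log-minutes), so the sketch also computes a root mean squared error in minutes for both models. The `rmse()` helper and the `comparison` table are hypothetical names, and back-transforming with `exp()` ignores retransformation bias.

```r
library(nycflights13)

# Reuse model_df from the lab if it exists; otherwise build a stand-in
if (!exists("model_df")) {
  set.seed(123)
  ok <- complete.cases(flights[, c("air_time", "distance")]) &
    flights$air_time > 0 & flights$distance > 0
  model_df <- flights[ok, c("air_time", "distance", "origin")]
  model_df <- model_df[sample.int(nrow(model_df), 3000), ]
}

m_lin <- lm(air_time ~ distance, data = model_df)
m_log <- lm(log(air_time) ~ log(distance), data = model_df)

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

comparison <- data.frame(
  model     = c("A: air_time ~ distance", "B: log(air_time) ~ log(distance)"),
  r_squared = c(summary(m_lin)$r.squared, summary(m_log)$r.squared),
  # Model B predictions are back-transformed so both errors are in minutes
  rmse_min  = c(rmse(model_df$air_time, fitted(m_lin)),
                rmse(model_df$air_time, exp(fitted(m_log))))
)
comparison
```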
Residual checks:
par(mfrow = c(1, 2))
plot(fitted(m_lin), resid(m_lin),
     pch = 19, cex = 0.5, col = "gray40",
     xlab = "Fitted", ylab = "Residuals",
     main = "Residuals: Model A")
abline(h = 0, lty = 2)
plot(fitted(m_log), resid(m_log),
     pch = 19, cex = 0.5, col = "gray40",
     xlab = "Fitted", ylab = "Residuals",
     main = "Residuals: Model B")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))

Task 7.1
- Inspect the fitted models m_lin and m_log.
- Report their R-squared values and compare them.

Task 7.2

Task 7.3 (optional extension)
Add origin as a factor predictor and check whether it
improves the fit:
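A sketch of this extension for Model A. The name `m_lin_o` is hypothetical, and comparing adjusted R-squared plus an F-test via `anova()` is one reasonable check rather than the only one. The fallback model_df is a stand-in so that the block runs on its own.

```r
library(nycflights13)

# Reuse model_df from the lab if it exists; otherwise build a stand-in
if (!exists("model_df")) {
  set.seed(123)
  ok <- complete.cases(flights[, c("air_time", "distance")]) &
    flights$air_time > 0 & flights$distance > 0
  model_df <- flights[ok, c("air_time", "distance", "origin")]
  model_df <- model_df[sample.int(nrow(model_df), 3000), ]
}

# Baseline model and the same model with origin as a factor predictor
m_lin   <- lm(air_time ~ distance, data = model_df)
m_lin_o <- lm(air_time ~ distance + factor(origin), data = model_df)

# Adjusted R-squared penalizes the extra origin coefficients
c(without_origin = summary(m_lin)$adj.r.squared,
  with_origin    = summary(m_lin_o)$adj.r.squared)

# F-test for the nested models: does origin explain additional variance?
anova(m_lin, m_lin_o)
```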
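In this lab you:

- worked with the flights table from nycflights13,
- created a smaller clean sample,
- visualized the data in 2D and 3D,
- fitted and compared two simple linear models.

The same workflow (cleaning -> subsampling -> visualization -> modeling) is common in practical data analysis.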