Replacing averages with random intercepts from a basic multilevel model.
I will use the tidyverse, packages from the nflverse, and lme4 for the multilevel model.
Yards per play is a basic measure of snap-to-snap team quality. In Pittsburgh, analyst Matt Williamson frequently cites yards/play in his, noting that former Steelers GM Kevin Colbert found great value in the metric. This makes sense because there are many plays in an NFL game: a true per snap advantage is likely to build to a larger advantage over the course of a game. The yard/play idea was also likely inspired by The Success Equation: Untangling Skill and Luck in Business, Sports, and Investing by Michael J. Mauboussin (Chapter 4), which I read earlier in 2022.
Basic yards/play averages do not fully capture a true advantage. Some teams play a much weaker slate of opponents. There is sampling error or luck, good and bad, in different measures. This post hopes to improve on the simple arithmetic mean via a very basic multilevel model.
While penalty assessment is a bit arbitrary in the NFL, penalty yardage is earned by poor technique and related to the tactics employed. Therefore, net penalty yardage is considered just the same as yards gained or lost conventionally, even if the down is replayed. This is a bit like preferring plate appearances to at bats in baseball (since at bats exclude walks, scorer assessed sacrifices, etc.).
Since they are on the field for most of the snaps in a game, analysis is confined to the offensive and defensive units. Plays spikes and kneels, where the purpose of the down is not gain yardage, are discarded.
pbp_df <- nflfastR::load_pbp(2021) %>%
filter(season_type == "REG") %>%
filter(play_type_nfl %in% c("GAME_START",
"KICK_OFF",
"PUNT",
"TIMEOUT",
"FIELD_GOAL",
"XP_KICK",
"END_QUARTER",
"END_GAME",
"PAT2",
"FREE_KICK",
"COMMENT") == F) %>%
filter(is.na(play_type_nfl) == F) %>%
filter(play_type %in% c("qb_kneel",
"qb_spike") == F) %>%
filter(grepl("(Punt formation)", desc) == F) %>%
filter(grepl(", offsetting.", desc) == F) %>%
penalty_yards_gained()
“… multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach.” - McElreath, Statistical Rethinking, pg. 15
Multilevel models have a number of advantages over simpler regression models. Random intercepts models are the simplest multilevel models and are generally used as a baseline for which to compare more complicated multilevel models against. Here, a random intercepts model is proposed as an improvement on averaging.
Importantly, multilevel models make more realistic assumptions about the independence of observations in a data set. Context rightly shapes the meaning we derive from statistics. For example, gaining 5 yards/play against the NFL’s best defense reflects more positively on an offense than gaining 5 yards/play against the league’s worst defense. For some games, weather conditions severely depress offense (think the Bills/Patriots game in the 2021 regular season). Averages and simpler regression models, however, do not recognize any such context; all observations are assumed to be independent.
Clustering is expected in yards/play data, along a number of different dimensions. Consider the following examples.
Multilevel models acknowledge and take advantage of these dependencies in the structure of the data.
The observations in one cluster of our data are also informative for other clusters. This may be less intuitive, but consider the following.
Sharing of information between the results of all plays across the league, regardless of cluster, and the individual clusters in the data provides regularized estimates. The regularization (or shrinkage) performed by multilevel models tempers the conclusions drawn in each cluster of the data by considering all clusters, thus minimizing the adverse affects of overfitting (roughly, taking the noise in the sample too seriously). Regularization can thus improve out-of-sample prediction.
Complete pooling -> predict grand mean for all teams -> maximum underfitting
No pooling -> predict group mean for each team -> maximum overfitting
Partial pooling -> predict random intercept for each team -> a principled balance.
Readers of the Football Outsiders may be familiar with the Plexiglass Principle, which roughly states that a big improvement (or decline) in performance will on average be followed by a relapse (or bounce back). The Plexiglass Principle was coined by the legendary baseball sabermetrician Bill James, who later renamed it the Whirlpool Principle: “All teams are drawn forcefully toward the center. Most of the teams which had winning records in 1982 will decline in 1983; most of the teams which had losing records in 1982 will improve in 1983.” 1
Traditional statisticians call this the Paradox of Regression to the Mean. Exceptionally good performance is likely the result of a combination of high ability and good luck; subsequent performance is likely to be better than average but not as good as the previous exceptional outcome. 2
We will use Empirical Bayes estimates, or shrinkage estimates, to regularize NFL offensive and defensive yards/play averages.3
How much shrinkage is appropriate? The multilevel model will determine the strength of the whirlpool.
We begin by building a league-wide normal distribution centered on the mean yards gained for all observations in the data set. The standard deviation of this distribution is the standard error of the league-wide mean, assuming n = 1125 (a nominal number of plays for a team’s offensive or defensive unit that is representative for the 2021 season) rather than the total number of plays in the regular season. We’ll plot this curve as a league-wide nominal distribution, since its parameters are estimated considering all plays in the 2021 NFL regular season.
Next, we plot the average yards/play calculated for each offense and defense under the league-wide nominal distribution. This allows us to compare the distribution of offensive unit averages and the distribution of the defensive unit averages against all plays.
# Estimate NFL Distribution using mean for all plays...
avg_yards_per_play <- pbp_df %>%
summarize(mean = mean(yards_gained)) %>%
pull(mean)
# and standard error of the mean, which assumes
# 1125 plays in a season for a unit.
sd_yards_per_play <- pbp_df %>%
summarize(sd = sd(yards_gained)) %>%
pull(sd)
se_unit <- sd_yards_per_play/sqrt(1125)
# Build data frame to plot estimated distribution
ypp <- seq(0, 10, by = 0.01)
d_ypp <- dnorm(ypp, mean = avg_yards_per_play, sd = se_unit)
df_ypp <- data.frame(ypp, d_ypp)
rm(ypp, d_ypp)
# Calculate (Raw, Unadjusted) Unit Averages
off_avg_df <- pbp_df %>%
group_by(posteam) %>%
summarize(off_yards_per_play = mean(yards_gained),
off_n = n()) %>%
rename(team = posteam)
def_avg_df <- pbp_df %>%
group_by(defteam) %>%
summarize(def_yards_per_play = mean(yards_gained),
def_n = n()) %>%
rename(team = defteam)
# Plot unit averages against league-wide estimated distribution
# Stack of plots will show much greater spread among offenses
# compared to defenses.
p_top <- df_ypp %>%
ggplot(aes(x = ypp,
y = d_ypp)) +
geom_vline(xintercept = c(avg_yards_per_play,
avg_yards_per_play - 2*se_unit,
avg_yards_per_play + 2*se_unit),
color = "dark gray",
linetype = "dashed") +
geom_line() +
geom_point(data = off_avg_df,
aes(x = off_yards_per_play,
y = 0,
color = team)) +
scale_color_nfl() +
theme_light() +
scale_x_continuous(minor_breaks = NULL) +
scale_y_continuous(minor_breaks = NULL) +
coord_cartesian(xlim = c(4.0, 6.25)) +
labs(x = "Offensive Yards Per Play",
y = "Density",
title = "Team Unit Averages Compared to NFL Distribution",
subtitle = "NFL Standard Error Assumes 1125 Plays for Unit")
p_bottom <- df_ypp %>%
ggplot(aes(x = ypp,
y = d_ypp)) +
geom_vline(xintercept = c(avg_yards_per_play,
avg_yards_per_play - 2*se_unit,
avg_yards_per_play + 2*se_unit),
color = "dark gray",
linetype = "dashed") +
geom_line() +
geom_point(data = def_avg_df,
aes(x = def_yards_per_play,
y = 0,
color = team)) +
scale_color_nfl() +
theme_light() +
scale_x_continuous(minor_breaks = NULL) +
scale_y_continuous(minor_breaks = NULL) +
coord_cartesian(xlim = c(4.0, 6.25)) +
labs(x = "Defensive Yards Per Play",
y = "Density")
library(patchwork)
p_units <- p_top / p_bottom
ggsave("nfl_2021_unit_dist.png",
plot = p_units,
height = 5.25,
width = 5,
units = "in",
dpi = "retina")
A few observations:
Let’s look at the actual shrinkage in the Empirical Bayes estimate.
mod_rand_int <- lmer(yards_gained ~ 1 + (1|posteam) + (1|defteam) + (1|game_id),
data = pbp_df)
To visualize the “whirlpool” effect of the multilevel model, we build a data frame with the unadjusted unit averages and their Empirical Bayes (random intercepts) estimates.
off_rand_int <- coef(mod_rand_int)$posteam %>%
rownames_to_column(var = "team") %>%
rename(off_estimate = `(Intercept)`)
off_avg_df <- off_avg_df %>%
left_join(off_rand_int,
by = "team") %>%
mutate(off_estimate_adjust = off_estimate - off_yards_per_play)
def_rand_int <- coef(mod_rand_int)$defteam %>%
rownames_to_column(var = "team") %>%
rename(def_estimate = `(Intercept)`)
def_avg_df <- def_avg_df %>%
left_join(def_rand_int,
by = "team") %>%
mutate(def_estimate_adjust = def_estimate - def_yards_per_play)
est_df <- left_join(off_avg_df,
def_avg_df,
by = "team") %>%
mutate(combined_estimate = off_estimate - def_estimate) %>%
arrange(desc(combined_estimate)) %>%
mutate(estimate_rank = row_number(),
percentile_estimate = scale(combined_estimate)) %>%
mutate(percentile_estimate = pnorm(percentile_estimate))
rm(off_avg_df,
def_avg_df)
Then, we plot the unadjusted unit averages and their regularized estimate to show the shrinkage.
p_pooling <- est_df %>%
ggplot(aes(x = off_yards_per_play,
y = def_yards_per_play)) +
geom_hline(yintercept = avg_yards_per_play) +
geom_vline(xintercept = avg_yards_per_play) +
geom_point(aes(color = team)) +
geom_segment(aes(xend = off_estimate,
yend = def_estimate,
color = team)) +
geom_nfl_logos(aes(x = off_estimate,
y = def_estimate,
team_abbr = team),
width = 0.05) +
scale_y_reverse(breaks = seq(4.0, 6.0, by = 0.2),
minor_breaks = NULL) +
scale_x_continuous(breaks = seq(4.0, 6.0, by = 0.2),
minor_breaks = NULL) +
scale_color_nfl() +
labs(x = "Offensive Yards Per Play",
y = "Defensive Yards Per Play",
title = "Adjusted Yards Per Play",
subtitle = "The Plexiglass Principle Adaptively Applied Via Partial Pooling",
caption = "Data: nflfastR | Plot: nflplotR | Model: lme4") +
theme_light() +
coord_fixed()
ggsave("nfl_2021_yards_per_play_pooling.png",
plot = p_pooling,
units = "in",
height = 5.25,
width = 5,
dpi = "retina")
To better quantify the average shrinkage applied to deviations from average for offenses, we can fit an ordinary least squares regression to a centered data frame (i.e., the league average is subtracted from both the regularized estimates and the unadjusted averages).
centered_df <- est_df %>%
mutate(off_estimate = off_estimate - avg_yards_per_play,
off_yards_per_play = off_yards_per_play - avg_yards_per_play,
def_estimate = def_estimate - avg_yards_per_play,
def_yards_per_play = def_yards_per_play - avg_yards_per_play)
lm(off_estimate ~ off_yards_per_play,
data = centered_df) %>%
coefficients()
(Intercept) off_yards_per_play
-0.001170907 0.666408276
est_df %>%
ggplot(aes(x = off_yards_per_play - avg_yards_per_play,
y = off_estimate - avg_yards_per_play)) +
geom_smooth(method = "lm",
se = FALSE) +
geom_text(aes(color = team,
label = team)) +
scale_color_nfl() +
scale_x_continuous(breaks = seq(-10, 10, by = 0.2),
minor_breaks = NULL) +
scale_y_continuous(breaks = seq(-10, 10, by = 0.2),
minor_breaks = NULL) +
theme_light() +
coord_fixed(xlim = c(-0.9, 0.9),
ylim = c(-0.8, 0.8))
We repeat for the defensive units. As anticipated, more shrinkage is applied to the defensive units (i.e., the slope of the defensive line is closer to 0 than the offensive line).
lm(def_estimate ~ def_yards_per_play,
data = centered_df) %>%
coefficients()
(Intercept) def_yards_per_play
-0.001126752 0.309755311
est_df %>%
ggplot(aes(x = def_yards_per_play - avg_yards_per_play,
y = def_estimate - avg_yards_per_play)) +
geom_smooth(method = "lm",
se = FALSE) +
geom_text(aes(color = team,
label = team)) +
scale_color_nfl() +
scale_x_continuous(breaks = seq(-10, 10, by = 0.2),
minor_breaks = NULL) +
scale_y_continuous(breaks = seq(-10, 10, by = 0.2),
minor_breaks = NULL) +
theme_light() +
coord_fixed(xlim = c(-0.9, 0.9),
ylim = c(-0.8, 0.8))
As our point estimate of team quality, we consider adjusted yard/play differential (i.e., offense yards/play regularized estimate - defense yards/play regularized estimate). To more easily visualize the relative spread in team quality by this metric, we plot the results in terms of percentiles.
p_2021 <- est_df %>%
ggplot(aes(x = percentile_estimate,
y = estimate_rank)) +
geom_hline(yintercept = c(0.5, 8.5, 16.5, 24.5, 32.5),
color = "dark gray",
linetype = "dashed") +
geom_vline(xintercept = c(0, 0.25, 0.5, 0.75, 1),
color = "dark gray") +
geom_nfl_logos(aes(team_abbr = team),
width = 0.05) +
scale_x_continuous(minor_breaks = NULL,
labels = scales::percent) +
scale_y_reverse(breaks = NULL,
minor_breaks = NULL) +
scale_color_nfl() +
theme_light() +
labs(title = "2021 NFL Regular Season Per Play Rankings",
subtitle = "Excludes Special Teams/Kneels/Spikes; Includes Penalty Yardage",
x = "Adjusted Yards Per Play Differential: Percentile",
y = NULL,
caption = "Data: nflfastR | Plot: nflplotR | Model: lme4")
ggsave("nfl_2021_reg_season_per_play.png",
plot = p_2021,
units = "in",
height = 5.25,
width = 5,
dpi = "retina")
As a quick back test, we compare how predictive regular season adjusted yard/play differential was for the 2021 NFL playoffs. The Vegas spreads below4 show the winning team was favored in 8 of 13 games. In this small sample, the adjusted yard/play differential model preferred the winning team in 12 of 13 games. This is not a rigorous test by any means, but it demonstrates the model likely carries some genuine insight.
Away | Home | Result | Vegas Spread | Model |
---|---|---|---|---|
CIN | LV | LV 19, CIN 26 | CIN -5.5 (Hit) | CIN (Hit) |
NE | BUF | NE 17, BUF 47 | BUF -4 (Hit) | BUF (Hit) |
PHI | TB | PHI 15, TB 31 | TB -8.5 (Hit) | TB (Hit) |
SF | DAL | SF 23, DAL 17 | DAL -3 (Miss) | SF (Hit) |
PIT | KC | PIT 21, KC 42 | KC -12.5 (Hit) | KC (Hit) |
ARI | LAR | ARI 11, LAR 34 | LAR -3.5 (Hit) | LAR (Hit) |
CIN | TEN | CIN 19, TEN 16 | TEN -4 (Miss) | CIN (Hit) |
SF | GB | SF 13, GB 10 | GB -5.5 (Miss) | SF (Hit) |
LAR | TB | LAR 30, TB 27 | TB -3 (Miss) | LAR (Hit) |
BUF | KC | BUF 36, KC 42 | KC -2 (Hit) | BUF (Miss) |
CIN | KC | CIN 27, KC 24 | KC -7 (Miss) | CIN (Hit) |
SF | LAR | SF 17, LAR 20 | LAR -3.5 (Hit) | LAR (Hit) |
LAR | CIN | LAR 23, CIN 20 | LAR -4.5 (Hit) | LAR (Hit) |
Quotation and background repeated from this excellent blog post.↩︎
See Section 6.5 in Regression and Other Stories by Gelman, Hill, and Vehtari, from which this summary of regression to the mean is adapted.↩︎
See Section 8.7.1 of Beyond Multiple Linear Regression by Roback and Legler or this NBER seminar on Empirical Bayes methods.↩︎
Retrieved from Sports Odds History.com.↩︎