Predicting NBA Hall of Fame Inductions Using Machine Learning

In the world of basketball, the criteria for induction into the Hall of Fame have sparked endless discussion across eras. Icons like Bill Russell, Wilt Chamberlain, Michael Jordan, and Kobe Bryant represent just a few of the elite players who have earned a place in this prestigious institution.

Recently, LeBron James shared his thoughts on this topic via a tweet on X:

While players must be retired for three full seasons to be eligible for the Hall of Fame, we can leverage machine learning techniques to predict who might eventually be honored. This piece will illustrate the development of a logistic regression model to assess the likelihood of current and recently retired NBA players receiving this accolade, while also examining the achievements and statistics that contribute to their Hall of Fame candidacy.

For those who are primarily interested in the outcomes, you can check the complete predictions for every NBA player here: https://nba-hof-prediction.streamlit.app/

Preparing the Data

In this analysis, we will utilize two datasets available on Kaggle:

- https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats
- https://www.kaggle.com/datasets/ryanschubertds/all-nba-aba-players-bio-stats-accolades

Career Information

# Load necessary libraries
library(tidyverse)
library(readr)
library(caret)
library(ROCR)

career <- read_csv("Player Career Info.csv")
career <- career %>%
  select(c("player_id", "player", "hof", "first_seas", "last_seas"))

head(career)

Career Average Statistics

To compute each player's average statistics over their careers, we will analyze the "Player Totals" CSV and divide the total points, assists, rebounds, steals, and blocks by the number of games played.

totals <- read_csv("Player Totals.csv")

totals <- totals %>%
  filter(tm != "TOT") # Exclude duplicate data from trades

total_data <- totals %>%
  select(player_id, player, g, pts, trb, ast, stl, blk)

# The NBA began recording steals and blocks in the 1973-74 season and rebounds in the 1950-51 season
# Replace NA values with 0 for now
total_data$stl[is.na(total_data$stl)] <- 0
total_data$blk[is.na(total_data$blk)] <- 0
total_data$trb[is.na(total_data$trb)] <- 0

# Calculate career averages for each player
grouped_data <- total_data %>%
  group_by(player_id)

career_totals <- grouped_data %>%
  summarise(
    sum_g = sum(g),
    sum_pts = sum(pts),
    sum_trb = sum(trb),
    sum_ast = sum(ast),
    sum_stl = sum(stl),
    sum_blk = sum(blk)
  )

career_averages <- career_totals %>%
  mutate(
    ppg = sum_pts / sum_g,
    apg = sum_ast / sum_g,
    rpg = sum_trb / sum_g,
    spg = sum_stl / sum_g,
    bpg = sum_blk / sum_g
  )

career_averages <- career_averages %>%
  select(c(player_id, ppg, apg, rpg, spg, bpg))

# Merge with career data
full_data <- inner_join(career_averages, career, by = "player_id")
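Dividing summed totals by summed games weights every season by the number of games played, which is not the same as averaging season-level per-game figures. The following is a minimal sketch of the difference; the toy data frame and its values are invented for illustration and do not come from the Kaggle files.

```r
# Toy season rows for one hypothetical player (values are invented).
seasons <- data.frame(
  player_id = c(1, 1),
  g   = c(80, 20),     # games played in each season
  pts = c(1600, 200)   # total points in each season
)

# Career average as computed in this section: total points / total games.
career_ppg <- sum(seasons$pts) / sum(seasons$g)

# Naive alternative: average the two season-level per-game figures.
naive_ppg <- mean(seasons$pts / seasons$g)

career_ppg  # 18
naive_ppg   # 15
```

The short 20-game season pulls the naive figure down, while the totals-based figure reflects how the player performed across all 100 games.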

All-Star Selections

To determine how many times each player has been selected as an All-Star, we will analyze the "All-Star Selections" CSV, which does not include a "player_id" column. Thus, we will need to handle duplicate names when merging the data.

all_star <- read_csv("All-Star Selections.csv")

all_star_counts <- table(all_star$player)

# Convert to a data frame and rename columns for clarity
all_star_counts_df <- as.data.frame(all_star_counts)
colnames(all_star_counts_df) <- c("player", "AllStarCount")

# Merge with full_data
full_data <- merge(full_data, all_star_counts_df, by = "player", all.x = TRUE)

# Fill NA values in the "AllStarCount" column with 0 (players with no all-star selections)
full_data$AllStarCount[is.na(full_data$AllStarCount)] <- 0

duplicate_names <- full_data$player[duplicated(full_data$player)]

full_data %>%
  filter(player %in% duplicate_names, AllStarCount > 0) %>%
  select(player, player_id, ppg, first_seas, last_seas, AllStarCount)

When merging by player, the "AllStarCount" value was assigned to both players in cases of duplicate names. We will manually adjust the "AllStarCount" column using Basketball-Reference for verification:
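To make the name-collision problem concrete, here is a small sketch using invented players (the names, IDs, and counts are hypothetical). It shows how a merge keyed only on the name copies one player's count onto a namesake, and how the affected rows can be listed for manual review.

```r
# Hypothetical players: two share a name but have different IDs.
players <- data.frame(
  player_id = c(101, 102, 103),
  player    = c("Sam Example", "Sam Example", "Pat Sample"),
  stringsAsFactors = FALSE
)

# All-Star counts keyed by name only, as in the CSV described above.
all_star_counts <- data.frame(
  player       = "Sam Example",
  AllStarCount = 4,
  stringsAsFactors = FALSE
)

merged <- merge(players, all_star_counts, by = "player", all.x = TRUE)
merged$AllStarCount[is.na(merged$AllStarCount)] <- 0

# Both "Sam Example" rows now carry the count of 4.
# List every row whose name is shared and whose count is nonzero.
shared_name  <- merged$player %in% merged$player[duplicated(merged$player)]
needs_review <- merged[shared_name & merged$AllStarCount > 0, ]
print(needs_review)
```

Only an outside source can say which namesake actually earned the selections, which is why the correction itself remains a manual step.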

full_data[full_data$player_id %in% c("906", "3657", "941", "1993", "240", "620", "1763", "3542", "3967", "1471", "1968"), "AllStarCount"] <- 0

Awards

To incorporate awards into our data, we will utilize the “Player Award Shares” CSV.

awards <- read_csv("Player Award Shares.csv")

# Count the number of times a player has won each award
award_counts <- awards %>%
  group_by(player_id, award) %>%
  summarise(wins = sum(winner)) %>%
  filter(wins > 0)

# Reshape the data frame with a column for each award
reshaped_awards <- award_counts %>%
  pivot_wider(names_from = award, values_from = wins, values_fill = 0)

# Merge with full data
full_data <- merge(full_data, reshaped_awards, by = "player_id", all.x = TRUE)

# Fill NA values with 0 (for players who did not win any awards)
full_data[is.na(full_data)] <- 0

All-NBA, All-Defense, and All-Rookie Teams

To add All-League teams, we will analyze the “End of Season Teams” CSV.

all_nba_teams <- read_csv("End of Season Teams.csv")

# Change formatting
all_nba_teams$team <- paste(all_nba_teams$type, all_nba_teams$number_tm, "Team", sep = " ")

# Count the number of times a player has made each type of All-League team
team_counts <- all_nba_teams %>%
  group_by(player_id, team) %>%
  summarise(count = n())

# Reshape the data frame with a column for each All-League team
reshaped_teams <- team_counts %>%
  pivot_wider(names_from = team, values_from = count, values_fill = 0)

# Merge with full data
full_data <- merge(full_data, reshaped_teams, by = "player_id", all.x = TRUE)

# Fill NA values with 0 (for players who did not make any All-League teams)
full_data[is.na(full_data)] <- 0

Championships and Finals MVP

For championships and Finals MVP data, we refer to the second Kaggle dataset: https://www.kaggle.com/datasets/ryanschubertds/all-nba-aba-players-bio-stats-accolades. This dataset only goes up to June 15, 2022, so we will need to add 1 to the championships column for players on the Nuggets' 2022–23 roster and account for Nikola Jokic winning Finals MVP.

nba_data <- read_csv("NBA_players_clean.csv")

# Some players have an asterisk at the end of their name. We want to remove it.
# The asterisk is a regex metacharacter, so it needs a doubled backslash in an R string.
nba_data$Player <- sub("\\*", "", nba_data$Player)

nba_data <- nba_data %>%
  select(Player, From, "Finals MVP", Championships)

names(nba_data)[names(nba_data) == "Player"] <- "player"
names(nba_data)[names(nba_data) == "From"] <- "first_seas"

# Inner join nba_data with our career data to match player IDs to players
championships <- inner_join(nba_data, career, by = c("player", "first_seas")) %>%
  select(player_id, "Finals MVP", Championships)

# Merge with full data
full_data <- merge(full_data, championships, by = "player_id", all.x = TRUE)

# Replace NA values with 0 (players who have not won any championships or Finals MVPs)
full_data[is.na(full_data)] <- 0

After merging, we notice an increase in observations from 5178 to 5182. This discrepancy arose because there were two players named Bill Bradley and two named Tony Mitchell, leading to duplicate combinations in the joined data frame. One Bill Bradley has two championships, while the other has none. We will rectify this by removing incorrect rows.
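The row increase is the usual fan-out of a join on a non-unique key: each pair of players sharing both name and first season produces four matched rows instead of two. A minimal sketch with invented data (all names and values are hypothetical) shows the effect and one way to spot the affected IDs.

```r
# Hypothetical career table: two players share name and first season.
career_toy <- data.frame(
  player_id  = c(1, 2, 3),
  player     = c("Alex Demo", "Alex Demo", "Jo Test"),
  first_seas = c(1968, 1968, 1990),
  stringsAsFactors = FALSE
)

# Hypothetical accolades table keyed by name and first season only.
accolades_toy <- data.frame(
  player        = c("Alex Demo", "Alex Demo", "Jo Test"),
  first_seas    = c(1968, 1968, 1990),
  Championships = c(2, 0, 1),
  stringsAsFactors = FALSE
)

joined <- merge(accolades_toy, career_toy, by = c("player", "first_seas"))

# A one-to-one join would return 3 rows; the repeated key gives 2 * 2 + 1 = 5.
nrow(joined)

# IDs that now appear more than once and need manual cleanup.
fanned_out <- unique(joined$player_id[duplicated(joined$player_id)])
fanned_out
```

Comparing the row count before and after a join is a cheap check that catches this kind of duplication early.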

full_data <- full_data[!(full_data$player_id == 905 & full_data$Championships == 0), , drop = FALSE]
full_data <- full_data[!(full_data$player_id == 906 & full_data$Championships == 2), , drop = FALSE]
full_data <- full_data[!duplicated(full_data), ]

Next, we will update the data to reflect the Nuggets' championship win in 2023 and Nikola Jokic's Finals MVP award.

nuggets <- totals %>%
  filter(tm == "DEN", season == 2023)

full_data <- full_data %>%
  mutate(Championships = ifelse(player %in% nuggets$player, Championships + 1, Championships))

full_data[full_data$player == "Nikola Jokic", "Finals MVP"] <- 1

Hall of Fame Coaches and Contributors

The Hall of Fame also honors coaches, referees, and significant contributors, not solely players. Our final task is to reclassify former players who have been inducted as coaches or contributors in the "hof" column.

full_data <- full_data %>%
  mutate(hof = ifelse(player %in% c(
    "George Karl", "John Thompson", "Alfred McGuire", "Chuck Cooper",
    "Phil Jackson", "Pat Riley", "Rick Adelman", "Earl Lloyd",
    "Al Attles", "Bob Houbregs", "Tom Sanders", "Slick Leonard",
    "Nat Clifton", "Don Nelson", "Rod Thorn", "Don Barksdale",
    "Larry Brown", "Larry Costello", "Bill Bradley", "Wayne Embry",
    "Jerry Sloan", "Rudy Tomjanovich", "Alex Hannum", "Red Holzman"
  ), FALSE, hof))

HOF vs Non-HOF

Now, we can visualize how exclusive the Hall of Fame is.

# Include only Hall of Fame eligible players
visualization_data <- full_data %>%
  filter(last_seas <= 2019)

ggplot(visualization_data, aes(x = factor(hof), fill = factor(hof))) +
  geom_bar(position = "stack", stat = "count", width = 0.7, show.legend = TRUE) +
  scale_fill_manual(values = c("blue", "red")) +
  labs(title = "Number of Hall of Fame Players", x = "Hall of Fame", y = "Count") +
  theme(legend.position = "none")

Only 3.39% of eligible NBA players gain induction into the Hall of Fame. We can also visualize the statistical differences between Hall of Fame and non-Hall of Fame players:

ggplot(visualization_data, aes(x = ppg, fill = factor(hof))) +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.7, color = "white") +
  scale_fill_manual(values = c("blue", "red")) +
  labs(title = "Distribution of Points Per Game (PPG)", x = "Points Per Game (PPG)", y = "Frequency", fill = "Hall of Fame") +
  theme(legend.position = "top")

As anticipated, Hall of Fame players generally average significantly higher points per game compared to non-HOF players throughout their careers.

# Box plots to compare HOF and non-HOF players in various categories
melted_data <- gather(visualization_data, key = "statistic", value = "value",
                      c(ppg, apg, rpg, AllStarCount, "All-NBA 1st Team", Championships))

statistic_order <- c("ppg", "AllStarCount", "apg", "rpg", "All-NBA 1st Team", "Championships")
melted_data$statistic <- factor(melted_data$statistic, levels = statistic_order)

ggplot(melted_data, aes(x = hof, y = value, fill = hof)) +
  geom_boxplot() +
  facet_wrap(~statistic, scales = "free_y", ncol = 2) +
  labs(title = "Comparison of HOF vs Non-HOF Players by Statistic", x = "Hall of Fame", y = "Value") +
  scale_fill_manual(values = c("red", "blue"),
                    breaks = c(TRUE, FALSE)) +
  scale_x_discrete(labels = c("Non-HOF", "HOF")) +
  labs(fill = "Hall of Fame")

Overall, Hall of Fame players average more points, assists, and rebounds per game than their non-HOF counterparts. They also collect more honors, including All-Star and All-NBA 1st Team selections. The median Hall of Fame player has one championship, one All-NBA 1st Team selection, and seven All-Star selections, underscoring the significance of both individual and team accomplishments.

Building the Logistic Regression Model

Now we will proceed to train a logistic regression model aimed at predicting Hall of Fame probabilities.

For this model, we will exclude steals per game, blocks per game, and the Most Improved Player and Sixth Man of the Year awards, as they show weak correlation with Hall of Fame induction. With the exclusion of steals and blocks, we will focus on players who began their careers in the 1950–51 season or later, when rebounds started being tracked. Additionally, we will only consider players who last played in the 2018–19 season or earlier, as they must be retired for three complete seasons to be eligible for the Hall of Fame.

full_data_model <- full_data %>%
  filter(first_seas >= 1951, last_seas <= 2019) %>%
  select(-c(player_id, player, first_seas, last_seas, mip, smoy, spg, bpg))

set.seed(123)
train_index <- createDataPartition(full_data_model$hof, p = 0.8, list = FALSE)
train_data <- full_data_model[train_index, ]
test_data <- full_data_model[-train_index, ]

# Fit logistic regression model
lr_model <- glm(hof ~ ., data = train_data, family = "binomial")

# Predictions on the test set
predictions_lr_test <- predict(lr_model, newdata = test_data, type = "response")
predicted_classes_lr <- ifelse(predictions_lr_test > 0.5, 1, 0)

# Confusion Matrix
conf_matrix_lr <- table(predicted_classes_lr, test_data$hof)
accuracy_lr <- sum(diag(conf_matrix_lr)) / sum(conf_matrix_lr)

print(conf_matrix_lr)
print(paste("Accuracy:", accuracy_lr))

The model reaches an accuracy of 0.99346 on the test dataset, although with so few Hall of Famers in the data, accuracy alone can overstate how well the model performs.
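Because Hall of Famers are rare, it helps to compare accuracy against a majority-class baseline and to look at precision and recall for the positive class. The sketch below uses invented labels and predictions (not the article's test set) purely to show the arithmetic.

```r
# Hypothetical test-set labels and predicted classes (invented for illustration).
actual    <- c(rep(TRUE, 5), rep(FALSE, 95))
predicted <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, rep(FALSE, 94))

accuracy <- mean(predicted == actual)

# Baseline: always predict the majority class ("not inducted").
baseline_accuracy <- max(mean(actual), 1 - mean(actual))

# Precision and recall for the positive (Hall of Fame) class.
true_pos  <- sum(predicted & actual)
precision <- true_pos / sum(predicted)
recall    <- true_pos / sum(actual)

c(accuracy = accuracy, baseline = baseline_accuracy,
  precision = precision, recall = recall)
```

In this toy case accuracy is 0.97 against a baseline of 0.95, while precision is 0.75 and recall is 0.60, so the class-specific metrics expose errors that accuracy hides.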

Next, we can visualize a precision-recall curve to further evaluate the model:

pr_obj <- prediction(predictions_lr_test, test_data$hof)
pr_perf <- performance(pr_obj, "prec", "rec")

plot(pr_perf, main = "Precision-Recall Curve for Logistic Regression Model",
     col = "green", lwd = 2)

Feature Importance

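From our logistic regression model, we can gauge the importance of different features by examining the coefficient of each variable. Because the features are not standardized, each coefficient also reflects its variable's scale, so the comparison is a rough guide rather than a strict ranking.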

var_importance <- coef(lr_model)
var_importance <- var_importance[!names(var_importance) %in% "(Intercept)"]

plot_data <- data.frame(variable = names(var_importance), importance = var_importance)

# Bar plot for variable importance
ggplot(plot_data, aes(x = reorder(variable, importance), y = importance)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Feature Coefficients From Logistic Regression",
       x = "Variable", y = "Coefficient") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank())

The NBA MVP award stands out as the most significant variable influencing the likelihood of being inducted into the Hall of Fame.
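When reading logistic regression coefficients, two transformations are often helpful: exponentiating a coefficient gives the multiplicative change in the odds per one-unit increase, and multiplying a coefficient by its feature's standard deviation gives a rough per-standard-deviation effect. The sketch below runs on simulated data; the variable names and effect sizes are assumptions chosen for illustration, not values from the fitted model above.

```r
# Simulated data; variable names and true effects are assumptions.
set.seed(42)
n <- 2000
all_star <- rpois(n, lambda = 1)          # count-valued feature
ppg      <- rnorm(n, mean = 10, sd = 5)   # continuous feature
log_odds <- -4 + 0.8 * all_star + 0.1 * ppg
hof      <- rbinom(n, size = 1, prob = plogis(log_odds))

fit <- glm(hof ~ all_star + ppg, family = binomial)

# Drop the intercept and keep the slope coefficients.
slopes <- coef(fit)[-1]

# exp(coefficient): multiplicative change in odds per one-unit increase.
odds_ratios <- exp(slopes)

# Coefficient times feature standard deviation: a rough
# per-standard-deviation effect that is easier to compare across features.
feature_sd <- c(all_star = sd(all_star), ppg = sd(ppg))
std_coefs  <- slopes * feature_sd[names(slopes)]

odds_ratios
std_coefs
```

A feature measured in small units can have a modest raw coefficient yet a sizable standardized one, which is worth keeping in mind when comparing an award count with a per-game average.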

To assess the coefficients of the other variables more clearly, we can recreate the plot without the "NBA MVP" variable.

var_importance2 <- var_importance[!names(var_importance) %in% c("(Intercept)", "nba mvp")]

plot_data2 <- data.frame(variable = names(var_importance2), importance = var_importance2)

ggplot(plot_data2, aes(x = reorder(variable, importance), y = importance)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Feature Coefficients From Logistic Regression",
       x = "Variable", y = "Coefficient") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank())

Among the remaining variables, "Finals MVP" emerges as the next most influential, followed by "Championships," "Defensive Player of the Year," "All-NBA 1st Team," "All-Star Count," "All-Defense 1st Team," and "All-Rookie 1st Team." Interestingly, both "All-Defense 2nd Team" and "All-Rookie 2nd Team" have negative coefficients, suggesting a negative association with Hall of Fame probability once the other variables are held fixed. Another surprising finding is that "All-NBA 3rd Team" has a higher coefficient than "All-NBA 2nd Team."

In terms of Hall of Fame voting, individual accolades heavily outweigh overall career statistics, which aligns with the evolution of the NBA towards faster-paced, higher-scoring games. For instance, Allen Iverson averaged 26.75 points per game to secure the scoring title in 1998–99, but this figure would not place him in the top 10 in today's NBA. Per-game averages may not accurately reflect actual performance across different eras.
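One common way to make per-game figures more comparable across eras is to express each player's average relative to the scoring environment of the time. The sketch below standardizes points per game within an era; the eras, league means, standard deviations, and player values are all invented for illustration, and this adjustment is not part of the model built in this article.

```r
# Invented scoring environments for two hypothetical eras.
league <- data.frame(
  era      = c("Era 1", "Era 2"),
  mean_ppg = c(8, 10),   # assumed league-average player PPG
  sd_ppg   = c(5, 6),    # assumed spread of player PPG
  stringsAsFactors = FALSE
)

# Invented players, one from each era.
players <- data.frame(
  player = c("Player A", "Player B"),
  era    = c("Era 1", "Era 2"),
  ppg    = c(25, 28),
  stringsAsFactors = FALSE
)

adjusted <- merge(players, league, by = "era")

# z-score: how many standard deviations above the era's average.
adjusted$z_ppg <- (adjusted$ppg - adjusted$mean_ppg) / adjusted$sd_ppg

adjusted[, c("player", "ppg", "z_ppg")]
```

Here Player A scores fewer raw points than Player B (25 versus 28) but stands further above the era's average (3.4 versus 3.0 standard deviations).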

Applying the Model

Who Deserves to Be in the Hall of Fame?

Here are all the players (past and present) that the model predicts have a Hall of Fame probability of 50% or greater:

Which Current Players Have the Highest HOF Probability?

Here are the 100 current or recently retired players with the highest probabilities of making the Hall of Fame:

Which Players Did the Model Misclassify?

Here is a table displaying all the players that the model misclassified (among players eligible for the Hall of Fame):

Among the misclassified players are Amar'e Stoudemire, Marques Johnson, Brad Daugherty, Larry Foust, Willie Naulls, Willie Wise, Larry Kenon, and Warren Jabali, who were incorrectly classified as Hall of Famers.

Additionally, there are numerous other instances where the model disagrees with Hall of Fame inductees. However, the Hall of Fame encompasses not only NBA players but also college and international basketball players. Our model focuses solely on NBA statistics and accolades, which explains several discrepancies. For example, Dražen Petrović and Toni Kukoč were outstanding NBA players but were inducted into the Hall of Fame primarily due to their dominance in international play.

You can find the complete model predictions here: https://nba-hof-prediction.streamlit.app/

Conclusion

While no model can fully encapsulate the complexity and subjectivity of Hall of Fame voting, this analysis has provided valuable insights into the significance of various accolades, career achievements, and statistics in determining Hall of Fame probabilities.

In the future, it would be intriguing to enhance the model's ability to predict Hall of Fame probabilities for younger players based on their career trajectories and draft positions. Currently, the model assigns Victor Wembanyama, the most highly touted NBA prospect since LeBron James, a 3.58% chance of making the Hall of Fame based on his existing resume, as he has yet to accumulate accolades like more experienced players.

Another fascinating avenue would be to analyze Hall of Fame predictions by position to discern how criteria vary. For instance, it might be more critical for centers to earn defensive-related awards such as Defensive Player of the Year and All-Defense honors.

Given the ongoing data analytics revolution within the NBA, it will be exciting to see if advanced metrics and machine learning increasingly influence Hall of Fame voting in the future.
