Web Scraping in R: A Hands-On Guide to NBA Player Data
Web scraping is a valuable technique for extracting data from websites, especially when the data isn't readily available in a structured format. Analysts often need to gather information from online sources in order to conduct thorough analyses.
Businesses often leverage web scraping to gain competitive insights by accessing overlooked data. When executed effectively, it allows us to retrieve information from any site and convert it into a format suitable for analysis or reporting.
Because the extraction is automated, web scraping eliminates tedious manual data collection.
Some practical applications of web scraping include collecting product reviews, tracking real-time prices for travel accommodations, or aggregating job listings.
There are numerous libraries in popular programming languages designed for parsing HTML content, such as Beautiful Soup in Python. However, in this tutorial, we will utilize rvest, an R package specifically created for harvesting web data. We will focus on how to scrape details about current NBA players from the ESPN website.
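To make the basic rvest workflow concrete, here is a minimal sketch. It parses an inline HTML snippet (via `minimal_html`, so it runs without a network connection) rather than the live ESPN page; with a real page you would call `read_html()` on the URL instead.

```r
library(rvest)

# A toy stand-in for a roster page, so the example is self-contained.
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>POS</th></tr>
    <tr><td>Jayson Tatum</td><td>SF</td></tr>
  </table>
')

# html_element() finds the first node matching a CSS selector;
# html_table() converts an HTML table into a data frame.
roster <- page %>% html_element("table") %>% html_table()
roster$Name
```

The same three calls — read the page, select nodes, extract their content — make up essentially every rvest script.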
Player Profile
When scraping data, it's crucial to ensure consistency across the web pages you plan to target. The only way to automate the scraping process across multiple pages is if there is a recognizable pattern in the data structure.
For instance, let's examine the roster pages for the Boston Celtics and the New York Knicks.
Both teams present their rosters in a tabular format, listing each player's name, position, age, height, weight, college, and salary. Additionally, the URLs follow a consistent pattern: ../bos/boston-celtics for the Celtics and ../ny/new-york-knicks for the Knicks. This pattern will be useful as we proceed.
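Because the URLs share that abbreviation-plus-slug pattern, each team's roster URL can be assembled programmatically. A sketch, where `base_url` is an assumed stand-in for the actual ESPN roster URL prefix:

```r
# Assumed prefix -- substitute the real ESPN roster URL here.
base_url <- "https://www.espn.com/nba/team/roster/_/name"

teams <- data.frame(
  abbr = c("bos", "ny"),
  slug = c("boston-celtics", "new-york-knicks")
)

# paste0() is vectorised, so this builds one URL per team at once.
urls <- paste0(base_url, "/", teams$abbr, "/", teams$slug)
urls
```

Extending `teams` to all 30 abbreviation/slug pairs gives the full list of pages to scrape.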
Initially, we will concentrate on scraping data for just one team. Once we successfully extract this data, we can easily implement a loop to automate the process for all teams.
To effectively scrape data, it's beneficial to have a basic understanding of HTML and CSS, the foundational technologies for web development. HTML structures the content, while CSS enhances the visual layout of a page.
We will streamline this process using a Chrome extension called SelectorGadget, which allows us to effortlessly generate CSS selectors by highlighting the desired elements on a webpage.
In this instance, we will select the elements corresponding to each player's details for the Boston Celtics and store them in variables. Subsequently, we will compile this information into a data frame.
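The pattern looks roughly like the sketch below: pull each column with a CSS selector, store it in a variable, then bind the variables into a data frame. The table and selectors here are toy stand-ins — on the real page you would paste in the selectors that SelectorGadget generates, and the player details are illustrative values, not real data.

```r
library(rvest)

# Toy roster table standing in for the live ESPN page.
page <- minimal_html('
  <table class="Table">
    <tr><th>Name</th><th>POS</th><th>Age</th></tr>
    <tr><td>Player One</td><td>SF</td><td>26</td></tr>
    <tr><td>Player Two</td><td>PG</td><td>34</td></tr>
  </table>
')

# One selector per column; html_elements() returns every match.
name <- page %>% html_elements("td:nth-child(1)") %>% html_text()
pos  <- page %>% html_elements("td:nth-child(2)") %>% html_text()
age  <- page %>% html_elements("td:nth-child(3)") %>% html_text() %>% as.integer()

# Compile the vectors into one data frame, as described above.
roster <- data.frame(name, pos, age)
roster
```

For a clean table like this, `html_table()` would do the same job in one call; selecting column by column is more flexible when the page mixes table and non-table elements.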
Player Regular Season Statistics
Let's enhance our scraping process further.
Each player’s name features a hyperlink that directs to a separate page detailing their performance during the latest NBA regular season.
For example, consider Jayson Tatum and Derrick Rose.
In this phase, we aim to merge this performance data with our original data frame.
To achieve this, we will create a function that retrieves the seasonal statistics for each player on our roster.
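A hedged sketch of such a helper is below. `get_player_stats()` simply parses a stats table out of whatever HTML it is handed; the `"table"` selector and the toy page are assumptions — on the real site you would pass the player's page (fetched with `read_html()` from the hyperlink scraped earlier) and the selector found with SelectorGadget, and the numbers here are illustrative, not actual statistics.

```r
library(rvest)

# Given a parsed player page, extract the season-stats table.
get_player_stats <- function(html) {
  html %>%
    html_element("table") %>%  # assumed selector for the stats table
    html_table()
}

# Toy page standing in for a player's season page.
player_page <- minimal_html('
  <table>
    <tr><th>GP</th><th>PTS</th></tr>
    <tr><td>74</td><td>26.9</td></tr>
  </table>
')

stats <- get_player_stats(player_page)
stats$PTS
```

Applying this function across every player link on the roster, then joining the results back on the player's name, yields the combined data frame.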
Automate for All NBA Teams
Having grasped the process for the Boston Celtics, we can extend it to encompass all 30 NBA teams.
We simply need to adjust the URL for each team using a loop. The entire operation took my computer approximately 9 minutes. Quite efficient!
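The loop itself can be sketched as follows. `scrape_roster()` is a placeholder for the single-team code developed earlier, and `team_urls` would hold all 30 roster URLs; with only a stub in place, the sketch just demonstrates the iterate-and-stack structure.

```r
# Placeholder: in the real script this would fetch and parse the
# roster table at `url` with rvest, as shown earlier.
scrape_roster <- function(url) {
  data.frame(team = url)
}

team_urls <- c("bos/boston-celtics", "ny/new-york-knicks")  # ... all 30 teams

# Scrape each team, then stack the per-team data frames row-wise.
all_players <- do.call(rbind, lapply(team_urls, scrape_roster))
nrow(all_players)
```

A polite addition in practice is a `Sys.sleep()` call inside the loop, so the requests don't hammer the server.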
After performing additional data cleaning—such as converting player heights to centimeters and weights to kilograms, along with adjusting data types and renaming columns—we will arrive at a final data frame that appears as follows.
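The unit conversions are straightforward string parsing. The sketch below assumes ESPN-style strings such as `6' 8"` for height and `210 lbs` for weight; adjust the regular expressions if the actual format differs.

```r
# Parse a height like "6' 8\"" into centimetres: extract the two
# numbers (feet, inches), then convert at 2.54 cm per inch.
height_to_cm <- function(h) {
  parts <- as.integer(unlist(regmatches(h, gregexpr("[0-9]+", h))))
  (parts[1] * 12 + parts[2]) * 2.54
}

# Parse a weight like "210 lbs" into kilograms (1 lb = 0.453592 kg).
weight_to_kg <- function(w) {
  as.numeric(gsub("[^0-9.]", "", w)) * 0.453592
}

height_to_cm("6' 8\"")   # 203.2 cm
weight_to_kg("210 lbs")  # about 95.3 kg
```

Wrapped in a vectorised call such as `sapply()`, these helpers clean the height and weight columns of the full data frame in one pass.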
Bonus: Exploratory Data Analysis
This segment serves more for my amusement than as a web scraping tutorial. I’ve employed basic data visualizations to derive insights from the compiled data on current NBA players.
While the analysis may not be groundbreaking, I have included my observations in the captions accompanying each chart.
Thank you for reading! I hope you gained valuable insights into the fundamentals of web scraping with R. I encourage you to explore the workbook associated with this exercise on my GitHub, which includes all the code utilized throughout the project.