Mastering Data Wrangling and Scraping Techniques with R
Introduction
In 2020, I received an unexpected invitation from a former lecturer from my time at Gadjah Mada University in Indonesia, where I studied Statistics. He asked me to join as an instructor for a workshop at The Tenth International Conference and Workshop on High-Dimensional Data Analysis (ICW-HDDA-X) 2020. It was a surprise, given that we hadn’t been in touch for eight years. However, I felt honored that he remembered me and believed I was qualified to instruct at the workshop. The focus of my presentation was aligned with the topic of this article. Let’s dive in!
What is Data Wrangling?
Data wrangling encompasses all the preparatory tasks necessary before conducting data analysis. The primary activities involved include:
- Identifying variables and observations
- Creating new variables and observations
- Restructuring the data into an optimal format
- Merging multiple datasets
- Summarizing data by groups
According to a New York Times article, "Data scientists, based on interviews and expert estimates, spend from 50% to 80% of their time in this area before they can delve into exploration." It therefore pays to perform the data-wrangling process efficiently so that more time is left for the analysis itself.
Data Wrangling with R
For this discussion, I will use the R programming language together with the RStudio IDE. You can download R from CRAN and RStudio from the RStudio website. The essential packages for data wrangling used here are tidyr and dplyr, both part of the tidyverse.
# installing the packages for the first time
install.packages(c('tidyr', 'dplyr'))
# load the packages using the library function
library(tidyr)
library(dplyr)
I also used a couple of additional packages: devtools and EDAWR. The latter is not available on CRAN, so it has to be installed from GitHub using devtools.
# installing and loading the devtools package for the first time
install.packages('devtools')
library(devtools)
# installing EDAWR from GitHub using install_github() from devtools
install_github('rstudio/EDAWR')
library(EDAWR)
We can leverage various datasets available in the EDAWR package, such as storms, cases, pollution, and tb.
# open the help page of each dataset to learn about its background
?storms
?cases
?pollution
?tb
Tidy Data with tidyr
What constitutes “tidy data”? Tidy data should possess the following characteristics:
- Each variable is stored in its own column.
- Each observation is stored in its own row.
- Each type of observation is captured in a single table.
The objective is to enhance data accessibility while preserving the observations.
The tidyr package is used to adjust the arrangement of tables. Its main functions are gather() and spread(), along with separate() and unite().
How to Use gather() and spread()
# collapses multiple columns into two columns
gather(cases, 'year', 'count', 2:4)
# generates multiple columns from two columns
spread(pollution, size, amount)
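A quick way to convince yourself that the two functions are inverses of each other is to reshape the same table back and forth. This is a minimal sketch, assuming the cases dataset keeps its year columns in positions 2 to 4 as above:
# reshape cases into a long (tidy) table: one row per country-year pair
cases_long <- gather(cases, 'year', 'count', 2:4)
# reshape it back into a wide table: one column per year, as in the original
cases_wide <- spread(cases_long, year, count)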
How to Use separate() and unite()
# splits the date column into separate columns using '-' as the separator
storms2 <- separate(storms, date, c('year', 'month', 'day'), sep = '-')
# unites the year, month, and day columns back into a single date column
unite(storms2, 'date', year, month, day, sep = '-')
Manipulating Data with dplyr
The dplyr package is designed for transforming tabular data.
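Its core verbs cover most day-to-day transformations: select() picks columns, filter() picks rows, mutate() creates new variables, arrange() reorders rows, and group_by() combined with summarise() aggregates by groups. Below is a minimal sketch using the storms dataset from EDAWR, assuming it has the storm, wind, and pressure columns described in its help page:
# pick only the storm name and wind speed columns
select(storms, storm, wind)
# keep only the storms with a wind speed of at least 50
filter(storms, wind >= 50)
# add the ratio of pressure to wind speed as a new variable
mutate(storms, ratio = pressure / wind)
# compute a grouped summary: the maximum wind speed recorded per storm
summarise(group_by(storms, storm), max_wind = max(wind))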
Using the Pipe Operator
The pipe operator %>% passes the result of the expression on its left as the first argument of the function on its right, which makes it natural to chain multiple operations into a readable sequence.
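For example, a grouped summary like the one above reads more naturally as a left-to-right chain. Here is a sketch under the same assumptions about the storms columns:
# take storms, group it by storm name, summarise each group,
# then sort the result from the strongest storm down
storms %>%
  group_by(storm) %>%
  summarise(max_wind = max(wind)) %>%
  arrange(desc(max_wind))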
Joining Data with dplyr
Some might note that dplyr resembles SQL quite a bit, and that's accurate! Here, we will examine the "join" functions in dplyr, which operate similarly to SQL joins.
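The sketch below uses two small, made-up data frames (songs and artists are purely illustrative and not part of the EDAWR package):
# two small example tables keyed by the 'name' column
songs <- data.frame(song = c('Across the Universe', 'Come Together', 'Hello'),
                    name = c('John', 'John', 'Lionel'),
                    stringsAsFactors = FALSE)
artists <- data.frame(name = c('John', 'Paul'),
                      plays = c('guitar', 'bass'),
                      stringsAsFactors = FALSE)
# keep every row of songs and add matching columns from artists (like a SQL LEFT JOIN)
left_join(songs, artists, by = 'name')
# keep only the rows that have a match in both tables (like a SQL INNER JOIN)
inner_join(songs, artists, by = 'name')
# keep only the rows of songs that have no match in artists
anti_join(songs, artists, by = 'name')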
Data Scraping with rvest
# installing the package for the first time
install.packages('rvest')
library(rvest)
# other packages needed
install.packages(c('selectr', 'xml2', 'jsonlite', 'stringr'))
library(selectr)
library(xml2)
library(jsonlite)
library(stringr)
In this section, we will scrape data from the following page: https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361.
# assign url
url <- 'https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361'
# using read_html function to read the url then assign to webpage
webpage <- read_html(url)
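From here, nodes can be extracted from the parsed document with CSS selectors. The following is a minimal sketch assuming the page contains HTML tables; the actual selectors depend on the page's structure:
# extract every <table> node on the page and parse each into a data frame
tables <- webpage %>%
  html_nodes('table') %>%
  html_table()
# extract the text of matching nodes ('h3' is just an illustrative selector)
headings <- webpage %>%
  html_nodes('h3') %>%
  html_text()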
Conclusion
Data wrangling is essential for analysts, as it equips us to interpret and manipulate the data effectively, enabling the necessary modifications before embarking on deeper data analysis.
If you'd like to access the PDF version of my workshop materials, you can find the file here.
I hope this article proves beneficial in your journey toward mastering data wrangling. Thank you for taking the time to read!
References
- tidyverse: https://www.tidyverse.org/
- dplyr: https://dplyr.tidyverse.org/
- tidyr: https://tidyr.tidyverse.org/
- RStudio Data Wrangling with dplyr and tidyr Cheat Sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf