Mastering Data Wrangling and Scraping Techniques with R
Introduction
In 2020, I received an unexpected invitation from a former lecturer from my time at Gadjah Mada University in Indonesia, where I studied Statistics. He asked me to join as an instructor for a workshop at The Tenth International Conference and Workshop on High-Dimensional Data Analysis (ICW-HDDA-X) 2020. It was a surprise, given that we hadn’t been in touch for eight years. However, I felt honored that he remembered me and believed I was qualified to instruct at the workshop. The focus of my presentation was aligned with the topic of this article. Let’s dive in!
What is Data Wrangling?
Data wrangling encompasses all the preparatory tasks necessary before conducting data analysis. The primary activities involved include:
- Identifying variables and observations
- Creating new variables and observations
- Restructuring the data into an optimal format
- Merging multiple datasets
- Summarizing data by groups
According to a New York Times article, "Data scientists, based on interviews and expert estimates, spend from 50% to 80% of their time in this area before they can delve into exploration." It therefore pays to perform the data-wrangling process efficiently so that more time is left for the analysis itself.
Data Wrangling with R
For this discussion, I will use the R programming language together with the RStudio IDE. You can download R from CRAN and RStudio from the RStudio website. The essential packages for data wrangling used here are tidyr and dplyr, both part of the tidyverse.
# installing the packages for the first time
install.packages(c('tidyr', 'dplyr'))
# load the packages using the library function
library(tidyr)
library(dplyr)
I also used a couple of additional packages: devtools and EDAWR. The latter is not available on CRAN, so it has to be installed from GitHub using devtools.
# installing and loading the devtools package for the first time
install.packages('devtools')
library(devtools)
# installing EDAWR from GitHub using install_github() from devtools
install_github('rstudio/EDAWR')
library(EDAWR)
We can leverage various datasets available in the EDAWR package, such as storms, cases, pollution, and tb.
# open the help page of each dataset to learn about its background
?storms
?cases
?pollution
?tb
Tidy Data with tidyr
What constitutes “tidy data”? Tidy data should possess the following characteristics:
- Each variable is stored in its own column.
- Each observation is stored in its own row.
- Each type of observation is captured in a single table.
The objective is to enhance data accessibility while preserving the observations.
The tidyr package is used to adjust the arrangement of tables. Its main functions are gather() and spread(), along with separate() and unite().
How to Use gather() and spread()
# collapses multiple columns into two columns
gather(cases, 'year', 'count', 2:4)
# generates multiple columns from two columns
spread(pollution, size, amount)
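A quick way to convince yourself that the two functions are inverses of each other is to reshape the same table back and forth. This is a minimal sketch, assuming the cases dataset keeps its year columns in positions 2 to 4 as above:
# reshape cases into a long (tidy) table: one row per country-year pair
cases_long <- gather(cases, 'year', 'count', 2:4)
# reshape it back into a wide table: one column per year, as in the original
cases_wide <- spread(cases_long, year, count)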
How to Use separate() and unite()
# splits the date column into separate columns using '-' as the separator
storms2 <- separate(storms, date, c('year', 'month', 'day'), sep = '-')
# unites the year, month, and day columns back into a single date column
unite(storms2, 'date', year, month, day, sep = '-')
Manipulating Data with dplyr
The dplyr package is designed for transforming tabular data.
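Its core verbs cover most day-to-day transformations: select() picks columns, filter() picks rows, mutate() creates new variables, arrange() reorders rows, and group_by() combined with summarise() aggregates by groups. Below is a minimal sketch using the storms dataset from EDAWR, assuming it has the storm, wind, and pressure columns described in its help page:
# pick only the storm name and wind speed columns
select(storms, storm, wind)
# keep only the storms with a wind speed of at least 50
filter(storms, wind >= 50)
# add the ratio of pressure to wind speed as a new variable
mutate(storms, ratio = pressure / wind)
# compute a grouped summary: the maximum wind speed recorded per storm
summarise(group_by(storms, storm), max_wind = max(wind))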
Using the Pipe Operator
The pipe operator %>% passes the result of the expression on its left as the first argument of the function on its right, which makes it natural to chain multiple operations into a readable sequence.
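For example, a grouped summary like the one above reads more naturally as a left-to-right chain. Here is a sketch under the same assumptions about the storms columns:
# take storms, group it by storm name, summarise each group,
# then sort the result from the strongest storm down
storms %>%
  group_by(storm) %>%
  summarise(max_wind = max(wind)) %>%
  arrange(desc(max_wind))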
Joining Data with dplyr
Some might note that dplyr resembles SQL quite a bit, and that's accurate! Here, we will examine the "join" functions in dplyr, which operate similarly to SQL joins.
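The sketch below uses two small, made-up data frames (songs and artists are purely illustrative and not part of the EDAWR package):
# two small example tables keyed by the 'name' column
songs <- data.frame(song = c('Across the Universe', 'Come Together', 'Hello'),
                    name = c('John', 'John', 'Lionel'),
                    stringsAsFactors = FALSE)
artists <- data.frame(name = c('John', 'Paul'),
                      plays = c('guitar', 'bass'),
                      stringsAsFactors = FALSE)
# keep every row of songs and add matching columns from artists (like a SQL LEFT JOIN)
left_join(songs, artists, by = 'name')
# keep only the rows that have a match in both tables (like a SQL INNER JOIN)
inner_join(songs, artists, by = 'name')
# keep only the rows of songs that have no match in artists
anti_join(songs, artists, by = 'name')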
Data Scraping with rvest
# installing the package for the first time
install.packages('rvest')
library(rvest)
# other packages needed
install.packages(c('selectr', 'xml2', 'jsonlite', 'stringr'))
library(selectr)
library(xml2)
library(jsonlite)
library(stringr)
In this section, we will scrape data from the following page: https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361.
# assign url
url <- 'https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361'
# using read_html function to read the url then assign to webpage
webpage <- read_html(url)
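From here, nodes can be extracted from the parsed document with CSS selectors. The following is a minimal sketch assuming the page contains HTML tables; the actual selectors depend on the page's structure:
# extract every <table> node on the page and parse each into a data frame
tables <- webpage %>%
  html_nodes('table') %>%
  html_table()
# extract the text of matching nodes ('h3' is just an illustrative selector)
headings <- webpage %>%
  html_nodes('h3') %>%
  html_text()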
Conclusion
Data wrangling is essential for analysts, as it equips us to interpret and manipulate the data effectively, enabling the necessary modifications before embarking on deeper data analysis.
If you'd like to access the PDF version of my workshop materials, you can find the file here.
I hope this article proves beneficial in your journey toward mastering data wrangling. Thank you for taking the time to read!
References
- tidyverse: https://www.tidyverse.org/
- dplyr: https://dplyr.tidyverse.org/
- tidyr: https://tidyr.tidyverse.org/
- RStudio Data Wrangling with dplyr and tidyr Cheat Sheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf