Mastering Data Wrangling and Scraping Techniques with R

Introduction

In 2020, I received an unexpected invitation from a former lecturer of mine at Gadjah Mada University in Indonesia, where I studied Statistics. He asked me to join as an instructor for a workshop at The Tenth International Conference and Workshop on High-Dimensional Data Analysis (ICW-HDDA-X) 2020. It was a surprise, given that we hadn't been in touch for eight years, but I felt honored that he remembered me and trusted me to teach at the workshop. My presentation covered the topic of this article. Let's dive in!

What is Data Wrangling?

Data wrangling encompasses all the preparatory tasks necessary before conducting data analysis. The primary activities involved include:

  1. Identifying variables and observations
  2. Creating new variables and observations
  3. Restructuring the data into an optimal format
  4. Merging multiple datasets
  5. Summarizing data by groups

According to a New York Times article, data scientists spend, based on interviews and expert estimates, from 50% to 80% of their time on this kind of preparatory work before they can delve into exploration. It therefore pays to perform the data-wrangling process efficiently, leaving more time for the analysis itself.

Data Wrangling with R

For this discussion, I will use R together with the RStudio IDE. You can download the R language from CRAN and RStudio from its official website. The essential packages for data wrangling here are tidyr and dplyr, both part of the tidyverse.

# installing the packages for the first time
install.packages(c('tidyr', 'dplyr'))

# load the packages using the library function
library(tidyr)
library(dplyr)

I also included two additional packages: devtools and EDAWR. The latter is not available on CRAN, so it has to be installed from GitHub using devtools.

# installing and loading the devtools package for the first time
install.packages('devtools')
library(devtools)

# installing EDAWR with install_github() from devtools
install_github('rstudio/EDAWR')
library(EDAWR)

We can leverage various datasets available in the EDAWR package, such as storms, cases, pollution, and tb.

# open the help page of each dataset to learn about its background
?storms
?cases
?pollution
?tb

Tidy Data with tidyr

What constitutes “tidy data”? Tidy data should possess the following characteristics:

  1. Each variable is stored in its own column.
  2. Each observation is stored in its own row.
  3. Each type of observational unit is stored in its own table.

The objective is to make values easier to access and manipulate while preserving all of the original observations.

The tidyr package is utilized to adjust the arrangement of tables. Its primary functions are gather() and spread().

How to Use gather() and spread()

# gather() collapses multiple columns into two columns
gather(cases, 'year', 'count', 2:4)

# spread() generates multiple columns from two columns
spread(pollution, size, amount)
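
To see the reshape concretely, here is a minimal, self-contained sketch using a small hand-built table in place of EDAWR's cases (the values are illustrative, not the real dataset):

# a small wide-format table: one column per year (illustrative values)
cases_demo <- data.frame(
  country = c('FR', 'DE', 'US'),
  `2011`  = c(7000, 5800, 15000),
  `2012`  = c(6900, 6000, 14250),
  `2013`  = c(7000, 6200, 13000),
  check.names = FALSE
)

# collapse the three year columns into a key column and a value column:
# one row per country-year combination
cases_long <- gather(cases_demo, 'year', 'count', 2:4)
cases_long

# spread() reverses the reshape: the year values become columns again
spread(cases_long, year, count)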

How to Use separate() and unite()

# separate() splits a column by a character string separator
storms2 <- separate(storms, date, c('year', 'month', 'day'), sep = '-')

# unite() combines columns into a single column, reversing the split
unite(storms2, 'date', year, month, day, sep = '-')

Manipulating Data with dplyr

The dplyr package is designed for transforming tabular data.
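
Its core verbs each take a table as the first argument and return a transformed table. Below is a minimal sketch on the storms dataset, assuming the storm, wind, and pressure columns that EDAWR's version provides:

# select() keeps only the named columns
select(storms, storm, pressure)

# filter() keeps rows that satisfy a logical condition
filter(storms, wind >= 50)

# mutate() adds a new variable computed from existing ones
mutate(storms, ratio = pressure / wind)

# arrange() reorders rows, here by descending wind speed
arrange(storms, desc(wind))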

Using the Pipe Operator

The pipe operator %>% chains multiple operations together, passing the output of one function along as the first argument of the next.
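
This lets a sequence of steps read left to right instead of being nested inside one another. A minimal sketch, again assuming the storms columns used above and the country, year, and age-group columns (child, adult, elderly) of EDAWR's tb table:

# filter, transform, and summarise in one readable chain
storms %>%
  filter(wind >= 50) %>%
  mutate(ratio = pressure / wind) %>%
  summarise(mean_ratio = mean(ratio))

# group_by() makes summarise() return one row per group
tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(child, adult, elderly, na.rm = TRUE))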

Joining Data with dplyr

Some might note that dplyr resembles SQL quite a bit, and that's accurate! Here, we will examine the "join" functions in dplyr, which operate similarly to SQL joins.
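
As a minimal, self-contained sketch, consider two small hand-built tables that share a name column (the data are illustrative):

# two toy tables sharing the 'name' key
songs <- data.frame(
  song = c('Across the Universe', 'Come Together', 'Hello, Goodbye'),
  name = c('John', 'John', 'Paul')
)
artists <- data.frame(
  name  = c('John', 'Paul', 'George'),
  plays = c('guitar', 'bass', 'guitar')
)

# left_join() keeps every row of songs and adds matching artist columns
left_join(songs, artists, by = 'name')

# inner_join() keeps only the rows with a match in both tables
inner_join(songs, artists, by = 'name')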

Data Scraping with rvest

# installing the package for the first time
install.packages('rvest')
library(rvest)

# other packages needed
install.packages(c('selectr', 'xml2', 'jsonlite', 'stringr'))
library(selectr)
library(xml2)
library(jsonlite)
library(stringr)

In this section, we will scrape data from the following page: https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361.

# assign the url
url <- 'https://sidata-ptn.ltmpt.ac.id/ptn_sb.php?ptn=361'

# read the url with read_html() and assign the parsed page to webpage
webpage <- read_html(url)
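
From here, the typical next step is to extract the elements you need from the parsed page. The sketch below is a minimal illustration, assuming the admission statistics are rendered as ordinary HTML tables; the selectors and resulting column names would need to be checked against the live page:

# extract every <table> node from the page
tables <- html_nodes(webpage, 'table')

# parse the first table into a data frame
ptn_table <- html_table(tables[[1]])
head(ptn_table)

# text of individual elements can be pulled with a CSS selector
# (the 'h3' selector is illustrative; adjust to the page structure)
html_text(html_nodes(webpage, 'h3'))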

Conclusion

Data wrangling is an essential skill for analysts: it lets us understand, reshape, and prepare data, making the necessary modifications before embarking on deeper analysis.

If you'd like to access the PDF version of my workshop materials, you can find the file here.

I hope this article proves beneficial in your journey toward mastering data wrangling. Thank you for taking the time to read!
