How To Reshape Wide Data to Tidy Data with pivot_longer() in tidyr?

One of the most common activities in data analysis is reshaping data from one form to another. For the human eye, and for data collection, it is often easier to work with data in a wide form. For data analysis, however, it is in most cases more convenient to have the data in a tidy/long form. tidyr, the R package that is part of the tidyverse, provides a set of functions for reshaping datasets between wide and long form.

pivot_longer() and pivot_wider() are the two main functions introduced in tidyr version 1.0.0 for reshaping data. In this article we look at examples of using pivot_longer() to convert wide data into long or tidy data.

According to tidyr's definition, tidy data is data where:

  • Each column is a variable.
  • Each row is an observation.
  • Each cell is a single value.

Would you like to know more about tidy data? Check out Hadley Wickham's original paper on the topic.

Let's start by loading tidyverse, a suite of R packages from RStudio.

library(tidyverse)

We will use the historical mobile phone data from the TidyTuesday project. For this post we need the data in wide form. We used tidyr's pivot_wider() function, as described in this post, to create a wide version of the mobile data and stored the results on the cmdlinetips GitHub page.

Reshaping simple wide data to long/tidy data

First, let's load a simple wide dataset.

wide_df <- readr::read_tsv("https://raw.githubusercontent.com/cmdlinetips/data/master/mobile_subscription_wide_simple_tidytuesday.tsv")

We see that the data is in wide form: the column names are years and the values in the data frame are mobile phone subscriptions.

head(wide_df)
## # A tibble: 6 x 30
##   code continent `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997`
##
## 1 AFG Asia 0 0 0 0
## 2 ALB Europe 0 0 0.0744 0.107
## 3 DZA Africa 0.00181 0.0180 0.0176 0.0172 0.00475 0.0162 0.0398 0.0582
## 4 AOM Oceania 0 0 1.41 1.77 2.32 2.36 2.41 2.4155
## 5 I Europe 0 0 1.31 1.28 1.25 4.42 8.53 13.4
## 6 AGO Africa 0 0.00821 0.0132 0.0140 0.0225 0.0467
## # … with 20 more variables: `1998`, `1999`, `2000`, `2001`, `2002`, `2003`,
## #   `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`, `2012`,
## #   `2013`, `2014`, `2015`, `2016`, `2017`

Let's start with a simple example where we keep one of the columns as it is while reshaping the rest into tidy form. To do this, we drop one column (code) and reshape the remaining columns.

pivot_longer(): reshaping wide data to long/tidy data

We use pivot_longer() and first specify which column we want to keep as it is. Then we use the names_to argument to name the new variable that will hold the old column names, and the values_to argument to name the new variable that will hold the values from the wide data.

In our example we want to keep the continent column of the wide data frame, and we specify names_to = "year" and values_to = "mobile_subs".

wide_df %>%
  select(-code) %>%
  pivot_longer(-continent,
               names_to = "year",
               values_to = "mobile_subs")

And we get a tidy, long data frame with three columns.

## # A tibble: 6,944 x 3
##   continent year mobile_subs
##
## 1 Asia 1990 0
## 2 Asia 1991 0
## 3 Asia 1992 0
## 4 Asia 1993 0
## 5 Asia 1994 0
## 6 Asia 1995 0
## 7 Asia 1996 0
## 8 Asia 1997 0
## 9 Asia 1998 0
## 10 Asia 1999 0
## # … with 6,934 more rows
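The same call can be tried without downloading anything. The following is a minimal sketch on a made-up two-row data frame (toy_wide and its values are hypothetical, not taken from the TidyTuesday data); it shows that the old column names end up in year as character strings and the cell values end up in mobile_subs.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: one identifier column and one column per year.
toy_wide <- tibble(
  continent = c("Asia", "Europe"),
  `1990` = c(0, 0.5),
  `1991` = c(0.1, 0.7)
)

# Keep continent as it is and stack the year columns into two new columns.
toy_long <- toy_wide %>%
  pivot_longer(-continent,
               names_to = "year",
               values_to = "mobile_subs")

toy_long
```

Each input row contributes one output row per pivoted column, so two rows and two year columns give four rows.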

Sometimes you want to keep more than one variable as it is while reshaping the data. To do this, specify the names of the variables to keep as a vector preceded by a negative sign.

wide_df %>%
  pivot_longer(-c(continent, code),
               names_to = "year",
               values_to = "mobile_subs")

This keeps the two specified columns and reshapes the rest of the data. We now have a tidy data frame with four columns.

## # A tibble: 6,944 x 4
##   code continent year mobile_subs
##
## 1 AFG Asia 1990 0
## 2 AFG Asia 1991 0
## 3 AFG Asia 1992 0
## 4 AFG Asia 1993 0
## 5 AFG Asia 1994 0
## 6 AFG Asia 1995 0
## 7 AFG Asia 1996 0
## 8 AFG Asia 1997 0
## 9 AFG Asia 1998 0
## 10 AFG Asia 1999 0
## # … with 6,934 more rows
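The columns to reshape can also be named directly instead of excluding the ones to keep. The sketch below uses a hypothetical toy data frame (the codes "AAA" and "BBB" are invented) to compare both styles. It also uses the names_transform argument, which is not part of the example above and assumes tidyr 1.1.0 or later, to turn the year from a character string into an integer.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data with two identifier columns.
toy_wide <- tibble(
  code = c("AAA", "BBB"),
  continent = c("Asia", "Europe"),
  `1990` = c(0, 0.5),
  `1991` = c(0.1, 0.7)
)

# Negative selection: pivot everything except the two identifier columns.
long_neg <- toy_wide %>%
  pivot_longer(-c(code, continent),
               names_to = "year",
               values_to = "mobile_subs")

# Positive selection: name the year columns directly and
# convert the resulting year column to integer.
long_pos <- toy_wide %>%
  pivot_longer(c(`1990`, `1991`),
               names_to = "year",
               values_to = "mobile_subs",
               names_transform = list(year = as.integer))
```

Both calls produce the same rows; they differ only in the type of the year column.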

pivot_longer() with several variables

Usually, our wide data frame is more complex and contains multiple variables that need to be reshaped. To illustrate how pivot_longer() can be used to reshape a wide data frame with multiple variables, we use the historical mobile data with multiple variables.

Here is a cartoon of a wide data frame with three variables and the resulting tidy data we would like.

pivot_longer() with multiple variables
On the cmdlinetips.com GitHub page we have a data frame with mobile phone subscriptions, GDP per capita and total population in wide format. Let's load this data directly from GitHub.

wide_multi_df <- readr::read_tsv("https://raw.githubusercontent.com/cmdlinetips/data/master/mobile_subscription_wide_multiple_variables_tidytuesday.tsv")

And our wide data looks like this. Note that each column name contains both the variable name and the year, separated by a period.

wide_multi_df
## # A tibble: 248 x 86
##   code continent mobile_subs.1990 mobile_subs.1991 mobile_subs.1992
##
## 1 AFG Asia 0 0
## 2 ALB Europe 0 0
## 3 DZA Africa 0.00181 0.0180 0.0176
## 4 AOM Oceania 0 0 1.41
## 5 I Europe 0 0 0 1.31
## 6 AGO Africa 0 0
## 7 AIA America NA
## 8 ATG America 0 NA
## 9 ARG America 0.0367 0.0753 0.138
## 10 ARM Asia 0 0
## # … with 238 more rows, and 81 more variables: mobile_subs.1993,
## #   mobile_subs.1994, mobile_subs.1995, mobile_subs.1996, …

To reshape wide data with multiple variables using pivot_longer(), we first need to select the columns of interest corresponding to the multiple variables. In the following example, we use the starts_with() function available through dplyr, which selects columns whose names begin with the strings corresponding to the three variables. Next, we use the names_to and values_to arguments of pivot_longer() to create the new variables in tidy/longer form.

tidy_df <- wide_multi_df %>%
  pivot_longer(
    # cols = mobile_subs.1990:total_pop.2017,
    cols = c(starts_with("mobile"),
             starts_with("gdp"),
             starts_with("total")),
    names_to = "var_name",
    values_to = "val"
  )

And we get the data in tidy form with four columns of interest.

tidy_df

## # A tibble: 20,832 x 4
##   code continent var_name val
##
## 1 AFG Asia mobile_subs.1990 0
## 2 AFG Asia mobile_subs.1991 0
## 3 AFG Asia mobile_subs.1992 0
## 4 AFG Asia mobile_subs.1993 0
## 5 AFG Asia mobile_subs.1994 0
## 6 AFG Asia mobile_subs.1995 0
## 7 AFG Asia mobile_subs.1996 0
## 8 AFG Asia mobile_subs.1997 0
## 9 AFG Asia mobile_subs.1998 0
## 10 AFG Asia mobile_subs.1999 0
## # … with 20,822 more rows
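Because the column names combine a variable name and a year, pivot_longer() can also split them while reshaping. The following sketch is an alternative to the call above, not the approach this post takes: it passes two names to names_to together with the names_sep argument. The toy data frame and its values are hypothetical, and only two of the three variables are included to keep it short.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: column names follow the <variable>.<year> pattern.
toy_multi <- tibble(
  code = c("AAA", "BBB"),
  mobile_subs.1990 = c(0, 1.2),
  mobile_subs.1991 = c(0.1, 1.5),
  total_pop.1990 = c(100, 200),
  total_pop.1991 = c(110, 210)
)

# Select columns by prefix and split each name at the period
# into a variable name and a year.
toy_tidy <- toy_multi %>%
  pivot_longer(cols = c(starts_with("mobile"), starts_with("total")),
               names_to = c("var_n", "year"),
               names_sep = "\\.",
               values_to = "val")

toy_tidy
```

The separator is a regular expression, so the period has to be escaped as "\\.".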

We often prefer data in an intermediate wide form. Here is a cartoon illustration of the shape we are interested in.

Intermediate wide data from tidy data
In this case, for example, we want each of the three variables in its own column. We can use pivot_wider() to create a data frame with the three variables in separate columns. First we have to separate the variable name from the year. One way to do this is with tidyr's separate() function.

tidy_df %>%
  separate(var_name, c("var_n", "year"), sep = "\\.") %>%
  pivot_wider(names_from = var_n,
              values_from = val)

And here is the more user-friendly data frame converted from the tidy data frame.

## # A tibble: 6,944 x 6
##   code continent year mobile_subs gdp_per_cap total_pop
##
## 1 AFG Asia 1990 0 NA 13032161
## 2 AFG Asia 1991 0 NA 14069854
## 3 AFG Asia 1992 0 NA 15472076
## 4 AFG Asia 1993 0 NA 17053213
## 5 AFG Asia 1994 0 NA 18553819
## 6 AFG Asia 1995 0 NA 19789880
## 7 AFG Asia 1996 0 NA 20684982
## 8 AFG Asia 1997 0 NA 21299350
## 9 AFG Asia 1998 0 NA 21752257
## 10 AFG Asia 1999 0 NA 22227543
## # … with 6,934 more rows
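The intermediate form can also be reached in a single pivot_longer() call by using the special ".value" name in names_to. This is an alternative sketch rather than the route described in this post. The toy data frame and its values are hypothetical, and the two-step version is included only to check that both routes agree.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy data: column names follow the <variable>.<year> pattern.
toy_multi <- tibble(
  code = c("AAA", "BBB"),
  mobile_subs.1990 = c(0, 1.2),
  mobile_subs.1991 = c(0.1, 1.5),
  total_pop.1990 = c(100, 200),
  total_pop.1991 = c(110, 210)
)

# One step: the part of each name before the period becomes an output
# column (".value"); the part after it goes into the year column.
one_step <- toy_multi %>%
  pivot_longer(cols = -code,
               names_to = c(".value", "year"),
               names_sep = "\\.")

# Two steps, as in the post: lengthen, split the names, then widen.
two_step <- toy_multi %>%
  pivot_longer(-code, names_to = "var_name", values_to = "val") %>%
  separate(var_name, c("var_n", "year"), sep = "\\.") %>%
  pivot_wider(names_from = var_n, values_from = val)

one_step
```

With ".value" there is no values_to argument, because the value columns take their names from the first part of the original column names.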
