In this post, we describe how to use dplyr’s pivot_longer and pivot_wider functions. These are used to reshape data frames from wide to long formats, and long to wide formats, respectively. We will discuss the basic versions of both functions, but then also discuss some of their more complex variants.
Author
Mark Andrews
Published
July 25, 2021
Data frames in R are rectangular data structures. They consist of one or more columns of data, each with the same number of rows, and with each element of each column being of the same data type, e.g. numbers, character strings, Boolean values, etc. Apart from these general constraints, however, the format and shape that the data takes within a data frame can vary. For example, consider two data frames that are read in from .csv files. On the one hand, we have the data frame read from df_wide.csv:
df_wide <-read_csv('data/df_wide.csv')df_wide
# A tibble: 5 × 4
subject condition_A condition_B condition_C
<chr> <dbl> <dbl> <dbl>
1 faye 26 12 42
2 jason 29 8 35
3 jim 32 15 45
4 ron 22 10 38
5 victor 30 13 40
This is factiousness data representing the results of a repeated measures memory recall experiment. For each subject, we obtain a memory recall value in each of three different experimental conditions. So, for example, subject faye as a memory recall score of 26 in experimental condition condition_A. On the other hand, we have the data frame read from df_long.csv:
df_long <-read_csv('data/df_long.csv')df_long
# A tibble: 15 × 3
subject condition recall
<chr> <chr> <dbl>
1 faye A 26
2 faye B 12
3 faye C 42
4 jason A 29
5 jason B 8
6 jason C 35
7 jim A 32
8 jim B 15
9 jim C 45
10 ron A 22
11 ron B 10
12 ron C 38
13 victor A 30
14 victor B 13
15 victor C 40
The information in df_long is identical to df_wide. For example, by reading across the first row, we see that subject faye has a score of 26 in condition A. Obviously, however, df_wide and df_long are structured or shaped differently. We say that df_wide is a wide data format, and df_long is a long data format.
Often, we need to reshape data from a wide to long format, and vice versa. For many analyses in R, a long format is required, but it is not uncommon for the raw data to be in a wide format, and so it is necessary to reshape it from wide to long. Although reshaping from wide to long is the more common operation, we also still need to be able to reshape from to long to wide. Sometimes, for example, tables in a wide format may be easier to read and understand. More generally, when doing relatively complex data manipulation we may need to repeatedly reshape data in different ways, so being able to move efficiently from long to wide formats and back is an essentially general skill in data wrangling.
In this post, we describe how to use dplyr’s pivot_longer and pivot_wider, which reshape data from wide to long formats, and long to wide formats, respectively. We will discuss the basic versions of both functions, but then also discuss some of their more complex variants.
From wide to long format with pivot_longer
To convert df_wide to the long format in df_long, we use pivot_longer. In the basic version of this function, we need to specify just three things. First, we need to specify which columns to “pivot”, which in this case is the columns condition_A, condition_B, condition_C. Next, we need to specify the name of the new column whose values are the names of the to-be-pivotted columns. In this case, we can use the name condition as the names of the to-be-pivotted columns, i.e., condition_A, condition_B, condition_C, are the values of the experimental condition. Finally, we need to specify the name of the new column whose values are the values of the to-be-pivotted columns. In this case, we can use the name recall as the values of the to-be-pivotted columns are the recall memory scores.
# A tibble: 15 × 3
subject condition recall
<chr> <chr> <dbl>
1 faye condition_A 26
2 faye condition_B 12
3 faye condition_C 42
4 jason condition_A 29
5 jason condition_B 8
6 jason condition_C 35
7 jim condition_A 32
8 jim condition_B 15
9 jim condition_C 45
10 ron condition_A 22
11 ron condition_B 10
12 ron condition_C 38
13 victor condition_A 30
14 victor condition_B 13
15 victor condition_C 40
Notice that here we selected the columns, condition_A, condition_B, condition_C with cols = -subject, which selects all columns except subject.
The data frame produced by this pivot_longer is very similar, though not identical, to the df_long above. In particular, the values of condition in this data frame are condition_A, condition_B, etc., while in df_long, they are A, B, etc. In order to produce the same data frame as df_long, we can indicate that in this case, the names of columns being pivoted all have the prefix condition_ that should not be used in the values of the new condition column:
# A tibble: 15 × 3
subject condition recall
<chr> <chr> <dbl>
1 faye A 26
2 faye B 12
3 faye C 42
4 jason A 29
5 jason B 8
6 jason C 35
7 jim A 32
8 jim B 15
9 jim C 45
10 ron A 22
11 ron B 10
12 ron C 38
13 victor A 30
14 victor B 13
15 victor C 40
This fictitious data is also from a repeated measures experiment, but in this case, it is a factorial experiment, whereby data is obtained for each subject for each combination of two conditions. One condition has values cued versus free, and the other has conditions neg, neu, and pos.
We can pivot data like this with pivot_longer just like above:
# A tibble: 30 × 3
subject condition recall
<chr> <chr> <dbl>
1 faye cued_neg 15
2 faye cued_neu 16
3 faye cued_pos 14
4 faye free_neg 13
5 faye free_neu 13
6 faye free_pos 12
7 jason cued_neg 4
8 jason cued_neu 9
9 jason cued_pos 10
10 jason free_neg 6
# ℹ 20 more rows
In this case, the condition column has values like cued_neg, cued_neu, and so on. It would be preferable to split condition into two new columns with names like cue and emotion. We could do this with tidyr’s separate as follows:
pivot_longer(df_wide2,cols =-subject,names_to ='condition',values_to ='recall') %>%separate(condition, into =c('cue', 'emotion'), sep ='_')
# A tibble: 30 × 4
subject cue emotion recall
<chr> <chr> <chr> <dbl>
1 faye cued neg 15
2 faye cued neu 16
3 faye cued pos 14
4 faye free neg 13
5 faye free neu 13
6 faye free pos 12
7 jason cued neg 4
8 jason cued neu 9
9 jason cued pos 10
10 jason free neg 6
# ℹ 20 more rows
However, we achieve the same effect using pivot_longer on its own too. For this, we provide a vector of two names for the names_to argument, and indicate that the names of the to-be-pivoted columns must be split using a certain separator:
We can also use names_pattern to provide a regular expression to indicate how to split the names of the to-be-pivoted columns into its constituent parts. In the case of df_wide2, this is not necessary as names_sep will suffice, but it is useful to illustrate this feature for use in more complex situations. One regular expression we could use is (.*)_(.*). The parentheses enclose the patterns of the two names. The .* pattern means zero or more characters. So, the names to extract are strings to the left and to the right of the _ separator:
# A tibble: 30 × 4
subject cue emotion recall
<chr> <chr> <chr> <dbl>
1 faye cued neg 15
2 faye cued neu 16
3 faye cued pos 14
4 faye free neg 13
5 faye free neu 13
6 faye free pos 12
7 jason cued neg 4
8 jason cued neu 9
9 jason cued pos 10
10 jason free neg 6
# ℹ 20 more rows
In other words, we say that one name consists of cued or free, and the other consists of neg or neu or pos.
From long to wide format with pivot_wider
We can use pivot_wider to perform the inverse of pivot_longer. From example, to go from df_long to df_wide, we use names_from and values_from analogously to how we used names_to and values_to in pivot_longer:
# A tibble: 5 × 4
subject A B C
<chr> <dbl> <dbl> <dbl>
1 faye 26 12 42
2 jason 29 8 35
3 jim 32 15 45
4 ron 22 10 38
5 victor 30 13 40
In this case, the new column names are now A, B, and C. We can produce the names condition_A, condition_B, condition_C, by using the names_prefix just as we did above with pivot_longer:
# A tibble: 5 × 4
subject condition_A condition_B condition_C
<chr> <dbl> <dbl> <dbl>
1 faye 26 12 42
2 jason 29 8 35
3 jim 32 15 45
4 ron 22 10 38
5 victor 30 13 40
We can also use pivot_wider when there are multiple columns to pivot. This can be used to perform the inverse operation that produced df_long2. In this case, we need only provide a vector of column names as the values of names_from. The values of the corresponding columns create the new columns names in the wide format:
By default, the separator between the two parts of the new names will be an underscore, but we can change this to another character, e.g. ., using the names_sep argument:
# A tibble: 10 × 5
subject cue neg neu pos
<chr> <chr> <dbl> <dbl> <dbl>
1 faye cued 15 16 14
2 faye free 13 13 12
3 jason cued 4 9 10
4 jason free 6 7 9
5 jim cued 7 9 10
6 jim free 8 9 5
7 ron cued 17 18 20
8 ron free 12 14 15
9 victor cued 16 13 14
10 victor free 12 13 14
As we can see, this has a long format for one experimental variables (cue), but it is wide for the emotion variable. One way to create this data frame from df_wide2 is to perform a pivot_longer, exactly as we did above, followed by a pivot_wider as follows:
# A tibble: 10 × 5
subject cue neg neu pos
<chr> <chr> <dbl> <dbl> <dbl>
1 faye cued 15 16 14
2 faye free 13 13 12
3 jason cued 4 9 10
4 jason free 6 7 9
5 jim cued 7 9 10
6 jim free 8 9 5
7 ron cued 17 18 20
8 ron free 12 14 15
9 victor cued 16 13 14
10 victor free 12 13 14
We can, however, achieve this same result from pivot_longer alone. To do so, we use a special variable name .value as one of the variables in the names_to vector. In other words, instead of names_to = c('cue', 'emotion'), we do names_to = c('cue', '.value'). The .value effectively tells pivot_longer that the second variable should be pivoted wide. Also, when using this special .value keyword, we don’t use a values_to argument.
# A tibble: 10 × 5
subject cue neg neu pos
<chr> <chr> <dbl> <dbl> <dbl>
1 faye cued 15 16 14
2 faye free 13 13 12
3 jason cued 4 9 10
4 jason free 6 7 9
5 jim cued 7 9 10
6 jim free 8 9 5
7 ron cued 17 18 20
8 ron free 12 14 15
9 victor cued 16 13 14
10 victor free 12 13 14
One common situation where the need for this kind of semi-long/semi-wide formatting arises is when performing multiple summarize functions on multiple variables. Consider the following data:
Let’s say we want to calculate the mean, median, and standard deviation of all three variables. We can accomplish this using the across function and a list of summary statistics functions inside summarize (see this blog post for more information on using across)
This resulting data frame is not in an ideal format. It would be easier if the three variables corresponded to three separate rows, and the the columns were the mean, median, and standard deviation. We can accomplish this with a pivot_longer using the .value keyword in names_to:
@online{andrews2021,
author = {Andrews, Mark},
title = {Reshaping Data with `Pivot\_longer` and `Pivot\_wider`},
date = {2021-07-25},
url = {https://www.mjandrews.org/notes/pivots/},
langid = {en}
}