Reshaping data with pivot_longer and pivot_wider
Mark Andrews. July 25, 2021
Data frames in R are rectangular data structures. They consist of one or more columns of data, each with the same number of rows, and with each element of each column being of the same data type, e.g. numbers, character strings, Boolean values, etc. Apart from these general constraints, however, the format and shape that the data takes within a data frame can vary. For example, consider two data frames that are read in from .csv
files. On the one hand, we have the data frame read from df_wide.csv
:
df_wide <- read_csv('data/df_wide.csv')
df_wide
## # A tibble: 5 × 4
## subject condition_A condition_B condition_C
## <chr> <dbl> <dbl> <dbl>
## 1 faye 26 12 42
## 2 jason 29 8 35
## 3 jim 32 15 45
## 4 ron 22 10 38
## 5 victor 30 13 40
This is factiousness data representing the results of a repeated measures memory recall experiment. For each subject, we obtain a memory recall value in each of three different experimental conditions. So, for example, subject faye
as a memory recall score of 26 in experimental condition condition_A
. On the other hand, we have the data frame read from df_long.csv
:
df_long <- read_csv('data/df_long.csv')
df_long
## # A tibble: 15 × 3
## subject condition recall
## <chr> <chr> <dbl>
## 1 faye A 26
## 2 faye B 12
## 3 faye C 42
## 4 jason A 29
## 5 jason B 8
## 6 jason C 35
## 7 jim A 32
## 8 jim B 15
## 9 jim C 45
## 10 ron A 22
## 11 ron B 10
## 12 ron C 38
## 13 victor A 30
## 14 victor B 13
## 15 victor C 40
The information in df_long
is identical to df_wide
. For example, by reading across the first row, we see that subject faye
has a score of 26 in condition A
. Obviously, however, df_wide
and df_long
are structured or shaped differently. We say that df_wide
is a wide data format, and df_long
is a long data format.
Often, we need to reshape data from a wide to long format, and vice versa. For many analyses in R, a long format is required, but it is not uncommon for the raw data to be in a wide format, and so it is necessary to reshape it from wide to long. Although reshaping from wide to long is the more common operation, we also still need to be able to reshape from to long to wide. Sometimes, for example, tables in a wide format may be easier to read and understand. More generally, when doing relatively complex data manipulation we may need to repeatedly reshape data in different ways, so being able to move efficiently from long to wide formats and back is an essentially general skill in data wrangling.
In this post, we describe how to use dplyr
’s pivot_longer
and pivot_wider
, which reshape data from wide to long formats, and long to wide formats, respectively. We will discuss the basic versions of both functions, but then also discuss some of their more complex variants.
From wide to long format with pivot_longer
To convert df_wide
to the long format in df_long
, we use pivot_longer
. In the basic version of this function, we need to specify just three things. First, we need to specify which columns to “pivot”, which in this case is the columns condition_A
, condition_B
, condition_C
. Next, we need to specify the name of the new column whose values are the names of the to-be-pivotted columns. In this case, we can use the name condition
as the names of the to-be-pivotted columns, i.e., condition_A
, condition_B
, condition_C
, are the values of the experimental condition. Finally, we need to specify the name of the new column whose values are the values of the to-be-pivotted columns. In this case, we can use the name recall
as the values of the to-be-pivotted columns are the recall memory scores.
pivot_longer(df_wide,
cols = -subject,
names_to = 'condition',
values_to = 'recall')
## # A tibble: 15 × 3
## subject condition recall
## <chr> <chr> <dbl>
## 1 faye condition_A 26
## 2 faye condition_B 12
## 3 faye condition_C 42
## 4 jason condition_A 29
## 5 jason condition_B 8
## 6 jason condition_C 35
## 7 jim condition_A 32
## 8 jim condition_B 15
## 9 jim condition_C 45
## 10 ron condition_A 22
## 11 ron condition_B 10
## 12 ron condition_C 38
## 13 victor condition_A 30
## 14 victor condition_B 13
## 15 victor condition_C 40
Notice that here we selected the columns, condition_A
, condition_B
, condition_C
with cols = -subject
, which selects all columns except subject
.
The data frame produced by this pivot_longer
is very similar, though not identical, to the df_long
above. In particular, the values of condition
in this data frame are condition_A
, condition_B
, etc., while in df_long
, they are A
, B
, etc. In order to produce the same data frame as df_long
, we can indicate that in this case, the names of columns being pivoted all have the prefix condition_
that should not be used in the values of the new condition
column:
pivot_longer(df_wide,
cols = -subject,
names_to = 'condition',
names_prefix = 'condition_',
values_to = 'recall')
## # A tibble: 15 × 3
## subject condition recall
## <chr> <chr> <dbl>
## 1 faye A 26
## 2 faye B 12
## 3 faye C 42
## 4 jason A 29
## 5 jason B 8
## 6 jason C 35
## 7 jim A 32
## 8 jim B 15
## 9 jim C 45
## 10 ron A 22
## 11 ron B 10
## 12 ron C 38
## 13 victor A 30
## 14 victor B 13
## 15 victor C 40
Pivotting long into multiple columns
Consider the data in the file df_wide2.csv
:
df_wide2 <- read_csv('data/df_wide2.csv')
df_wide2
## # A tibble: 5 × 7
## subject cued_neg cued_neu cued_pos free_neg free_neu free_pos
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 faye 15 16 14 13 13 12
## 2 jason 4 9 10 6 7 9
## 3 jim 7 9 10 8 9 5
## 4 ron 17 18 20 12 14 15
## 5 victor 16 13 14 12 13 14
This fictitious data is also from a repeated measures experiment, but in this case, it is a factorial experiment, whereby data is obtained for each subject for each combination of two conditions. One condition has values cued
versus free
, and the other has conditions neg
, neu
, and pos
.
We can pivot data like this with pivot_longer
just like above:
pivot_longer(df_wide2,
cols = -subject,
names_to = 'condition',
values_to = 'recall')
## # A tibble: 30 × 3
## subject condition recall
## <chr> <chr> <dbl>
## 1 faye cued_neg 15
## 2 faye cued_neu 16
## 3 faye cued_pos 14
## 4 faye free_neg 13
## 5 faye free_neu 13
## 6 faye free_pos 12
## 7 jason cued_neg 4
## 8 jason cued_neu 9
## 9 jason cued_pos 10
## 10 jason free_neg 6
## # … with 20 more rows
In this case, the condition
column has values like cued_neg
, cued_neu
, and so on. It would be preferable to split condition
into two new columns with names like cue
and emotion
. We could do this with tidyr
’s separate
as follows:
pivot_longer(df_wide2,
cols = -subject,
names_to = 'condition',
values_to = 'recall') %>%
separate(condition, into = c('cue', 'emotion'), sep = '_')
## # A tibble: 30 × 4
## subject cue emotion recall
## <chr> <chr> <chr> <dbl>
## 1 faye cued neg 15
## 2 faye cued neu 16
## 3 faye cued pos 14
## 4 faye free neg 13
## 5 faye free neu 13
## 6 faye free pos 12
## 7 jason cued neg 4
## 8 jason cued neu 9
## 9 jason cued pos 10
## 10 jason free neg 6
## # … with 20 more rows
However, we achieve the same effect using pivot_longer
on its own too. For this, we provide a vector of two names for the names_to
argument, and indicate that the names of the to-be-pivoted columns must be split using a certain separator:
df_long2 <- pivot_longer(df_wide2,
cols = -subject,
names_to = c('cue', 'emotion'),
names_sep = '_',
values_to = 'recall')
We can also use names_pattern
to provide a regular expression to indicate how to split the names of the to-be-pivoted columns into its constituent parts. In the case of df_wide2
, this is not necessary as names_sep
will suffice, but it is useful to illustrate this feature for use in more complex situations. One regular expression we could use is (.*)_(.*)
. The parentheses enclose the patterns of the two names. The .*
pattern means zero or more characters. So, the names to extract are strings to the left and to the right of the _
separator:
pivot_longer(df_wide2,
cols = -subject,
names_to = c('cue', 'emotion'),
names_pattern = '(.*)_(.*)',
values_to = 'recall')
## # A tibble: 30 × 4
## subject cue emotion recall
## <chr> <chr> <chr> <dbl>
## 1 faye cued neg 15
## 2 faye cued neu 16
## 3 faye cued pos 14
## 4 faye free neg 13
## 5 faye free neu 13
## 6 faye free pos 12
## 7 jason cued neg 4
## 8 jason cued neu 9
## 9 jason cued pos 10
## 10 jason free neg 6
## # … with 20 more rows
We could be more explicit with our names_pattern
regular expression as follows:
pivot_longer(df_wide2,
cols = -subject,
names_to = c('cue', 'emotion'),
names_pattern = '(cued|free)_(neg|neu|pos)',
values_to = 'recall')
## # A tibble: 30 × 4
## subject cue emotion recall
## <chr> <chr> <chr> <dbl>
## 1 faye cued neg 15
## 2 faye cued neu 16
## 3 faye cued pos 14
## 4 faye free neg 13
## 5 faye free neu 13
## 6 faye free pos 12
## 7 jason cued neg 4
## 8 jason cued neu 9
## 9 jason cued pos 10
## 10 jason free neg 6
## # … with 20 more rows
In other words, we say that one name consists of cued
or free
, and the other consists of neg
or neu
or pos
.
From long to wide format with pivot_wider
We can use pivot_wider
to perform the inverse of pivot_longer
. From example, to go from df_long
to df_wide
, we use names_from
and values_from
analogously to how we used names_to
and values_to
in pivot_longer
:
pivot_wider(df_long,
names_from = condition,
values_from = recall)
## # A tibble: 5 × 4
## subject A B C
## <chr> <dbl> <dbl> <dbl>
## 1 faye 26 12 42
## 2 jason 29 8 35
## 3 jim 32 15 45
## 4 ron 22 10 38
## 5 victor 30 13 40
In this case, the new column names are now A
, B
, and C
. We can produce the names condition_A
, condition_B
, condition_C
, by using the names_prefix
just as we did above with pivot_longer
:
pivot_wider(df_long,
names_from = condition,
names_prefix = 'condition_',
values_from = recall)
## # A tibble: 5 × 4
## subject condition_A condition_B condition_C
## <chr> <dbl> <dbl> <dbl>
## 1 faye 26 12 42
## 2 jason 29 8 35
## 3 jim 32 15 45
## 4 ron 22 10 38
## 5 victor 30 13 40
We can also use pivot_wider
when there are multiple columns to pivot. This can be used to perform the inverse operation that produced df_long2
. In this case, we need only provide a vector of column names as the values of names_from
. The values of the corresponding columns create the new columns names in the wide format:
pivot_wider(df_long2,
names_from = c(cue, emotion),
values_from = recall)
## # A tibble: 5 × 7
## subject cued_neg cued_neu cued_pos free_neg free_neu free_pos
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 faye 15 16 14 13 13 12
## 2 jason 4 9 10 6 7 9
## 3 jim 7 9 10 8 9 5
## 4 ron 17 18 20 12 14 15
## 5 victor 16 13 14 12 13 14
By default, the separator between the two parts of the new names will be an underscore, but we can change this to another character, e.g. .
, using the names_sep
argument:
pivot_wider(df_long2,
names_from = c(cue, emotion),
values_from = recall,
names_sep = '.')
## # A tibble: 5 × 7
## subject cued.neg cued.neu cued.pos free.neg free.neu free.pos
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 faye 15 16 14 13 13 12
## 2 jason 4 9 10 6 7 9
## 3 jim 7 9 10 8 9 5
## 4 ron 17 18 20 12 14 15
## 5 victor 16 13 14 12 13 14
Combining pivot_longer
and pivot_wider
in one operation
Consider the following data frame, read in from df_long_wide.csv
, which is a reshaped variant of the df_wide2
above:
df_long_wide <- read_csv('data/df_long_wide.csv')
df_long_wide
## # A tibble: 10 × 5
## subject cue neg neu pos
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 faye cued 15 16 14
## 2 faye free 13 13 12
## 3 jason cued 4 9 10
## 4 jason free 6 7 9
## 5 jim cued 7 9 10
## 6 jim free 8 9 5
## 7 ron cued 17 18 20
## 8 ron free 12 14 15
## 9 victor cued 16 13 14
## 10 victor free 12 13 14
As we can see, this has a long format for one experimental variables (cue
), but it is wide for the emotion
variable. One way to create this data frame from df_wide2
is to perform a pivot_longer
, exactly as we did above, followed by a pivot_wider
as follows:
pivot_longer(df_wide2,
cols = -subject,
names_to = c('cue', 'emotion'),
names_sep = '_',
values_to = 'recall') %>%
pivot_wider(names_from = emotion,
values_from = recall)
## # A tibble: 10 × 5
## subject cue neg neu pos
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 faye cued 15 16 14
## 2 faye free 13 13 12
## 3 jason cued 4 9 10
## 4 jason free 6 7 9
## 5 jim cued 7 9 10
## 6 jim free 8 9 5
## 7 ron cued 17 18 20
## 8 ron free 12 14 15
## 9 victor cued 16 13 14
## 10 victor free 12 13 14
We can, however, achieve this same result from pivot_longer
alone. To do so, we use a special variable name .value
as one of the variables in the names_to
vector. In other words, instead of names_to = c('cue', 'emotion')
, we do names_to = c('cue', '.value')
. The .value
effectively tells pivot_longer
that the second variable should be pivoted wide. Also, when using this special .value
keyword, we don’t use a values_to
argument.
pivot_longer(df_wide2,
cols = -subject,
names_to = c('cue', '.value'),
names_sep = '_')
## # A tibble: 10 × 5
## subject cue neg neu pos
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 faye cued 15 16 14
## 2 faye free 13 13 12
## 3 jason cued 4 9 10
## 4 jason free 6 7 9
## 5 jim cued 7 9 10
## 6 jim free 8 9 5
## 7 ron cued 17 18 20
## 8 ron free 12 14 15
## 9 victor cued 16 13 14
## 10 victor free 12 13 14
One common situation where the need for this kind of semi-long/semi-wide formatting arises is when performing multiple summarize
functions on multiple variables. Consider the following data:
data_df <- tibble(x = rnorm(10),
y = rnorm(10),
z = rnorm(10))
Let’s say we want to calculate the mean, median, and standard deviation of all three variables. We can accomplish this using the across
function and a list of summary statistics functions inside summarize
(see this blog post for more information on using across
)
data_summary <- summarize(data_df,
across(everything(),
list(mean = mean,
median = median,
stdev = sd)
)
)
data_summary
## # A tibble: 1 × 9
## x_mean x_median x_stdev y_mean y_median y_stdev z_mean z_median z_stdev
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.391 0.711 1.18 0.0960 -0.164 0.951 0.0665 0.200 1.25
This resulting data frame is not in an ideal format. It would be easier if the three variables corresponded to three separate rows, and the the columns were the mean, median, and standard deviation. We can accomplish this with a pivot_longer
using the .value
keyword in names_to
:
data_summary %>%
pivot_longer(cols = everything(),
names_to = c('variable', '.value'),
names_sep = '_')
## # A tibble: 3 × 4
## variable mean median stdev
## <chr> <dbl> <dbl> <dbl>
## 1 x 0.391 0.711 1.18
## 2 y 0.0960 -0.164 0.951
## 3 z 0.0665 0.200 1.25