Data frames in R are rectangular data structures. They consist of one or more columns of data, each with the same number of rows, and with each element of each column being of the same data type, e.g. numbers, character strings, Boolean values, etc. Apart from these general constraints, however, the format and shape that the data takes within a data frame can vary. For example, consider two data frames that are read in from .csv files. On the one hand, we have the data frame read from df_wide.csv:

df_wide <- read_csv('data/df_wide.csv')
df_wide
## # A tibble: 5 × 4
##   subject condition_A condition_B condition_C
##   <chr>         <dbl>       <dbl>       <dbl>
## 1 faye             26          12          42
## 2 jason            29           8          35
## 3 jim              32          15          45
## 4 ron              22          10          38
## 5 victor           30          13          40

This is factiousness data representing the results of a repeated measures memory recall experiment. For each subject, we obtain a memory recall value in each of three different experimental conditions. So, for example, subject faye as a memory recall score of 26 in experimental condition condition_A. On the other hand, we have the data frame read from df_long.csv:

df_long <- read_csv('data/df_long.csv')
df_long
## # A tibble: 15 × 3
##    subject condition recall
##    <chr>   <chr>      <dbl>
##  1 faye    A             26
##  2 faye    B             12
##  3 faye    C             42
##  4 jason   A             29
##  5 jason   B              8
##  6 jason   C             35
##  7 jim     A             32
##  8 jim     B             15
##  9 jim     C             45
## 10 ron     A             22
## 11 ron     B             10
## 12 ron     C             38
## 13 victor  A             30
## 14 victor  B             13
## 15 victor  C             40

The information in df_long is identical to df_wide. For example, by reading across the first row, we see that subject faye has a score of 26 in condition A. Obviously, however, df_wide and df_long are structured or shaped differently. We say that df_wide is a wide data format, and df_long is a long data format.

Often, we need to reshape data from a wide to long format, and vice versa. For many analyses in R, a long format is required, but it is not uncommon for the raw data to be in a wide format, and so it is necessary to reshape it from wide to long. Although reshaping from wide to long is the more common operation, we also still need to be able to reshape from to long to wide. Sometimes, for example, tables in a wide format may be easier to read and understand. More generally, when doing relatively complex data manipulation we may need to repeatedly reshape data in different ways, so being able to move efficiently from long to wide formats and back is an essentially general skill in data wrangling.

In this post, we describe how to use dplyr’s pivot_longer and pivot_wider, which reshape data from wide to long formats, and long to wide formats, respectively. We will discuss the basic versions of both functions, but then also discuss some of their more complex variants.

From wide to long format with pivot_longer

To convert df_wide to the long format in df_long, we use pivot_longer. In the basic version of this function, we need to specify just three things. First, we need to specify which columns to “pivot”, which in this case is the columns condition_A, condition_B, condition_C. Next, we need to specify the name of the new column whose values are the names of the to-be-pivotted columns. In this case, we can use the name condition as the names of the to-be-pivotted columns, i.e., condition_A, condition_B, condition_C, are the values of the experimental condition. Finally, we need to specify the name of the new column whose values are the values of the to-be-pivotted columns. In this case, we can use the name recall as the values of the to-be-pivotted columns are the recall memory scores.

pivot_longer(df_wide,
             cols = -subject,
             names_to = 'condition',
             values_to = 'recall')
## # A tibble: 15 × 3
##    subject condition   recall
##    <chr>   <chr>        <dbl>
##  1 faye    condition_A     26
##  2 faye    condition_B     12
##  3 faye    condition_C     42
##  4 jason   condition_A     29
##  5 jason   condition_B      8
##  6 jason   condition_C     35
##  7 jim     condition_A     32
##  8 jim     condition_B     15
##  9 jim     condition_C     45
## 10 ron     condition_A     22
## 11 ron     condition_B     10
## 12 ron     condition_C     38
## 13 victor  condition_A     30
## 14 victor  condition_B     13
## 15 victor  condition_C     40

Notice that here we selected the columns, condition_A, condition_B, condition_C with cols = -subject, which selects all columns except subject.

The data frame produced by this pivot_longer is very similar, though not identical, to the df_long above. In particular, the values of condition in this data frame are condition_A, condition_B, etc., while in df_long, they are A, B, etc. In order to produce the same data frame as df_long, we can indicate that in this case, the names of columns being pivoted all have the prefix condition_ that should not be used in the values of the new condition column:

pivot_longer(df_wide,
             cols = -subject,
             names_to = 'condition',
             names_prefix = 'condition_',
             values_to = 'recall')
## # A tibble: 15 × 3
##    subject condition recall
##    <chr>   <chr>      <dbl>
##  1 faye    A             26
##  2 faye    B             12
##  3 faye    C             42
##  4 jason   A             29
##  5 jason   B              8
##  6 jason   C             35
##  7 jim     A             32
##  8 jim     B             15
##  9 jim     C             45
## 10 ron     A             22
## 11 ron     B             10
## 12 ron     C             38
## 13 victor  A             30
## 14 victor  B             13
## 15 victor  C             40

Pivotting long into multiple columns

Consider the data in the file df_wide2.csv:

df_wide2 <- read_csv('data/df_wide2.csv')
df_wide2
## # A tibble: 5 × 7
##   subject cued_neg cued_neu cued_pos free_neg free_neu free_pos
##   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 faye          15       16       14       13       13       12
## 2 jason          4        9       10        6        7        9
## 3 jim            7        9       10        8        9        5
## 4 ron           17       18       20       12       14       15
## 5 victor        16       13       14       12       13       14

This fictitious data is also from a repeated measures experiment, but in this case, it is a factorial experiment, whereby data is obtained for each subject for each combination of two conditions. One condition has values cued versus free, and the other has conditions neg, neu, and pos.

We can pivot data like this with pivot_longer just like above:

pivot_longer(df_wide2,
             cols = -subject,
             names_to = 'condition',
             values_to = 'recall')
## # A tibble: 30 × 3
##    subject condition recall
##    <chr>   <chr>      <dbl>
##  1 faye    cued_neg      15
##  2 faye    cued_neu      16
##  3 faye    cued_pos      14
##  4 faye    free_neg      13
##  5 faye    free_neu      13
##  6 faye    free_pos      12
##  7 jason   cued_neg       4
##  8 jason   cued_neu       9
##  9 jason   cued_pos      10
## 10 jason   free_neg       6
## # … with 20 more rows

In this case, the condition column has values like cued_neg, cued_neu, and so on. It would be preferable to split condition into two new columns with names like cue and emotion. We could do this with tidyr’s separate as follows:

pivot_longer(df_wide2,
             cols = -subject,
             names_to = 'condition',
             values_to = 'recall') %>% 
  separate(condition, into = c('cue', 'emotion'), sep = '_')
## # A tibble: 30 × 4
##    subject cue   emotion recall
##    <chr>   <chr> <chr>    <dbl>
##  1 faye    cued  neg         15
##  2 faye    cued  neu         16
##  3 faye    cued  pos         14
##  4 faye    free  neg         13
##  5 faye    free  neu         13
##  6 faye    free  pos         12
##  7 jason   cued  neg          4
##  8 jason   cued  neu          9
##  9 jason   cued  pos         10
## 10 jason   free  neg          6
## # … with 20 more rows

However, we achieve the same effect using pivot_longer on its own too. For this, we provide a vector of two names for the names_to argument, and indicate that the names of the to-be-pivoted columns must be split using a certain separator:

df_long2 <- pivot_longer(df_wide2,
                         cols = -subject,
                         names_to = c('cue', 'emotion'),
                         names_sep = '_',
                         values_to = 'recall')

We can also use names_pattern to provide a regular expression to indicate how to split the names of the to-be-pivoted columns into its constituent parts. In the case of df_wide2, this is not necessary as names_sep will suffice, but it is useful to illustrate this feature for use in more complex situations. One regular expression we could use is (.*)_(.*). The parentheses enclose the patterns of the two names. The .* pattern means zero or more characters. So, the names to extract are strings to the left and to the right of the _ separator:

pivot_longer(df_wide2,
             cols = -subject,
             names_to = c('cue', 'emotion'),
             names_pattern = '(.*)_(.*)',
             values_to = 'recall')
## # A tibble: 30 × 4
##    subject cue   emotion recall
##    <chr>   <chr> <chr>    <dbl>
##  1 faye    cued  neg         15
##  2 faye    cued  neu         16
##  3 faye    cued  pos         14
##  4 faye    free  neg         13
##  5 faye    free  neu         13
##  6 faye    free  pos         12
##  7 jason   cued  neg          4
##  8 jason   cued  neu          9
##  9 jason   cued  pos         10
## 10 jason   free  neg          6
## # … with 20 more rows

We could be more explicit with our names_pattern regular expression as follows:

pivot_longer(df_wide2,
             cols = -subject,
             names_to = c('cue', 'emotion'),
             names_pattern = '(cued|free)_(neg|neu|pos)',
             values_to = 'recall')
## # A tibble: 30 × 4
##    subject cue   emotion recall
##    <chr>   <chr> <chr>    <dbl>
##  1 faye    cued  neg         15
##  2 faye    cued  neu         16
##  3 faye    cued  pos         14
##  4 faye    free  neg         13
##  5 faye    free  neu         13
##  6 faye    free  pos         12
##  7 jason   cued  neg          4
##  8 jason   cued  neu          9
##  9 jason   cued  pos         10
## 10 jason   free  neg          6
## # … with 20 more rows

In other words, we say that one name consists of cued or free, and the other consists of neg or neu or pos.

From long to wide format with pivot_wider

We can use pivot_wider to perform the inverse of pivot_longer. From example, to go from df_long to df_wide, we use names_from and values_from analogously to how we used names_to and values_to in pivot_longer:

pivot_wider(df_long,
            names_from = condition, 
            values_from = recall)
## # A tibble: 5 × 4
##   subject     A     B     C
##   <chr>   <dbl> <dbl> <dbl>
## 1 faye       26    12    42
## 2 jason      29     8    35
## 3 jim        32    15    45
## 4 ron        22    10    38
## 5 victor     30    13    40

In this case, the new column names are now A, B, and C. We can produce the names condition_A, condition_B, condition_C, by using the names_prefix just as we did above with pivot_longer:

pivot_wider(df_long,
            names_from = condition, 
            names_prefix = 'condition_',
            values_from = recall)
## # A tibble: 5 × 4
##   subject condition_A condition_B condition_C
##   <chr>         <dbl>       <dbl>       <dbl>
## 1 faye             26          12          42
## 2 jason            29           8          35
## 3 jim              32          15          45
## 4 ron              22          10          38
## 5 victor           30          13          40

We can also use pivot_wider when there are multiple columns to pivot. This can be used to perform the inverse operation that produced df_long2. In this case, we need only provide a vector of column names as the values of names_from. The values of the corresponding columns create the new columns names in the wide format:

pivot_wider(df_long2, 
            names_from = c(cue, emotion), 
            values_from = recall)
## # A tibble: 5 × 7
##   subject cued_neg cued_neu cued_pos free_neg free_neu free_pos
##   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 faye          15       16       14       13       13       12
## 2 jason          4        9       10        6        7        9
## 3 jim            7        9       10        8        9        5
## 4 ron           17       18       20       12       14       15
## 5 victor        16       13       14       12       13       14

By default, the separator between the two parts of the new names will be an underscore, but we can change this to another character, e.g. ., using the names_sep argument:

pivot_wider(df_long2, 
            names_from = c(cue, emotion), 
            values_from = recall,
            names_sep = '.')
## # A tibble: 5 × 7
##   subject cued.neg cued.neu cued.pos free.neg free.neu free.pos
##   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 faye          15       16       14       13       13       12
## 2 jason          4        9       10        6        7        9
## 3 jim            7        9       10        8        9        5
## 4 ron           17       18       20       12       14       15
## 5 victor        16       13       14       12       13       14

Combining pivot_longer and pivot_wider in one operation

Consider the following data frame, read in from df_long_wide.csv, which is a reshaped variant of the df_wide2 above:

df_long_wide <- read_csv('data/df_long_wide.csv')
df_long_wide
## # A tibble: 10 × 5
##    subject cue     neg   neu   pos
##    <chr>   <chr> <dbl> <dbl> <dbl>
##  1 faye    cued     15    16    14
##  2 faye    free     13    13    12
##  3 jason   cued      4     9    10
##  4 jason   free      6     7     9
##  5 jim     cued      7     9    10
##  6 jim     free      8     9     5
##  7 ron     cued     17    18    20
##  8 ron     free     12    14    15
##  9 victor  cued     16    13    14
## 10 victor  free     12    13    14

As we can see, this has a long format for one experimental variables (cue), but it is wide for the emotion variable. One way to create this data frame from df_wide2 is to perform a pivot_longer, exactly as we did above, followed by a pivot_wider as follows:

pivot_longer(df_wide2,
             cols = -subject,
             names_to = c('cue', 'emotion'),
             names_sep = '_',
             values_to = 'recall') %>% 
  pivot_wider(names_from = emotion,
              values_from = recall)
## # A tibble: 10 × 5
##    subject cue     neg   neu   pos
##    <chr>   <chr> <dbl> <dbl> <dbl>
##  1 faye    cued     15    16    14
##  2 faye    free     13    13    12
##  3 jason   cued      4     9    10
##  4 jason   free      6     7     9
##  5 jim     cued      7     9    10
##  6 jim     free      8     9     5
##  7 ron     cued     17    18    20
##  8 ron     free     12    14    15
##  9 victor  cued     16    13    14
## 10 victor  free     12    13    14

We can, however, achieve this same result from pivot_longer alone. To do so, we use a special variable name .value as one of the variables in the names_to vector. In other words, instead of names_to = c('cue', 'emotion'), we do names_to = c('cue', '.value'). The .value effectively tells pivot_longer that the second variable should be pivoted wide. Also, when using this special .value keyword, we don’t use a values_to argument.

pivot_longer(df_wide2,
             cols = -subject,
             names_to = c('cue', '.value'),
             names_sep = '_') 
## # A tibble: 10 × 5
##    subject cue     neg   neu   pos
##    <chr>   <chr> <dbl> <dbl> <dbl>
##  1 faye    cued     15    16    14
##  2 faye    free     13    13    12
##  3 jason   cued      4     9    10
##  4 jason   free      6     7     9
##  5 jim     cued      7     9    10
##  6 jim     free      8     9     5
##  7 ron     cued     17    18    20
##  8 ron     free     12    14    15
##  9 victor  cued     16    13    14
## 10 victor  free     12    13    14

One common situation where the need for this kind of semi-long/semi-wide formatting arises is when performing multiple summarize functions on multiple variables. Consider the following data:

data_df <- tibble(x = rnorm(10),
                  y = rnorm(10),
                  z = rnorm(10))

Let’s say we want to calculate the mean, median, and standard deviation of all three variables. We can accomplish this using the across function and a list of summary statistics functions inside summarize (see this blog post for more information on using across)

data_summary <- summarize(data_df,
          across(everything(),
                 list(mean = mean,
                      median = median,
                      stdev = sd)
          )
)
data_summary
## # A tibble: 1 × 9
##   x_mean x_median x_stdev y_mean y_median y_stdev z_mean z_median z_stdev
##    <dbl>    <dbl>   <dbl>  <dbl>    <dbl>   <dbl>  <dbl>    <dbl>   <dbl>
## 1  0.391    0.711    1.18 0.0960   -0.164   0.951 0.0665    0.200    1.25

This resulting data frame is not in an ideal format. It would be easier if the three variables corresponded to three separate rows, and the the columns were the mean, median, and standard deviation. We can accomplish this with a pivot_longer using the .value keyword in names_to:

data_summary %>% 
  pivot_longer(cols = everything(),
               names_to = c('variable', '.value'),
               names_sep = '_')
## # A tibble: 3 × 4
##   variable   mean median stdev
##   <chr>     <dbl>  <dbl> <dbl>
## 1 x        0.391   0.711 1.18 
## 2 y        0.0960 -0.164 0.951
## 3 z        0.0665  0.200 1.25