Read Multiple Files into a Single Data Frame

tidyverse

In R, usually data frames are created by reading data from a single file. Sometimes, however, we may wish to read multiple files into a single data frame. In this post, we will look at some of the main tidyverse ways in which we can read in files into a single data frame.

Author

Mark Andrews

Published

July 23, 2021

For most analyses in R, our data are represented in data frames. These data frames are often created initially by reading in data from files, such as .csv files, Excel files, and so on. Usually, in this case, the data frame is created by reading from a single file. For example, we might have a csv file named foo.csv (available here) with the following contents:

#foo.csv
a,b,c
1,2,3
4,5,6

We can read in its contents as follows:

library(tidyverse)

read_csv('data/foo.csv')

# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

Sometimes, however, we may wish to read multiple files into a single data frame. There are countless situations where this could arise. As a very simple example, we might have conducted a study in multiple different countries, whereby in each country, we collected the gender, height, and weight of a sample of people. From each country, we might then have a single .csv file, with names like england.csv, france.csv, and so on, but we may wish to read the data from all these .csv files into one single data frame. Fictitious .csv data files of this kind are available in a zip file named country_data.zip.

In this post, we will look at some of the main ways in which we can read in multiple .csv files into a single data frame, using the county_data.zip files as an example. In particular, we will look at using purrr::map combined with dplyr::bind_rows, using purrr::map_dfr, and using readr version 2.0.0. We will also look some common problems, and their solutions, that can arise when reading and concatenating multiple csv files in this way. We will use read_csv as the function to read individual files, but all of what we discuss is equally applicable when using other read functions like read_tsv, read_delim, etc., from readr, and the methods involving the map functionals can be used with almost any function for reading files such as readxl::read_excel, haven::read_spss, and so on.

Using `purrr`’s `map` with `bind_rows`

As discussed in this blog post, functionals like base R’s lapply, purrr’s map, and their many variants, effectively apply functions to each element of a list and then collect the results in a new list or some other data structure. We can therefore use functionals to read multiple .csv files into R as a list of data frames, which may be then concatenated into a single data frame. To do this, we first get a vector with the names of .csv files that we want to read in. This can be done using base R’s list.files, or else the fs package’s dir_ls command.

For example, assuming we have unzipped the country_data.zip file into a directory named country_data, we can get a list of all the files in it as follows:

library(fs)
csv_file_list <- dir_ls('country_data')
csv_file_list

country_data/austria.csv     country_data/belgium.csv     
country_data/england.csv     country_data/france.csv      
country_data/germany.csv     country_data/ireland.csv     
country_data/italy.csv       country_data/netherlands.csv 
country_data/norway.csv      country_data/scotland.csv    
country_data/spain.csv       country_data/sweden.csv

We can now use purrr::map to read in each one of these .csv files in a data frame using read_csv, and to collect these data frames in a list. This is done as follows:

df_list <- map(csv_file_list, read_csv)

The object df_list is a list of length 12.

class(df_list)

[1] "list"

length(df_list)

[1] 12

Each element of this list is a data frame, and the name of each element is the file name from which the data was read.

names(df_list)

 [1] "country_data/austria.csv"     "country_data/belgium.csv"    
 [3] "country_data/england.csv"     "country_data/france.csv"     
 [5] "country_data/germany.csv"     "country_data/ireland.csv"    
 [7] "country_data/italy.csv"       "country_data/netherlands.csv"
 [9] "country_data/norway.csv"      "country_data/scotland.csv"   
[11] "country_data/spain.csv"       "country_data/sweden.csv"

We can now use bind_rows to concatenate all the data frames in the list:

bind_rows(df_list)

# A tibble: 1,200 × 3
   gender height weight
   <chr>   <dbl>  <dbl>
 1 male      149     83
 2 male      169     66
 3 male      183    108
 4 male      181    111
 5 female    175     83
 6 male      185     71
 7 female    172     74
 8 male      176     96
 9 female    163     79
10 male      181     93
# ℹ 1,190 more rows

While this does concatenate the data frame, the result does not contain a variable to indicate the country. We can, however, add in this variable by using the .id argument to bind_rows as follows:

bind_rows(df_list, .id = 'country')

# A tibble: 1,200 × 4
   country                  gender height weight
   <chr>                    <chr>   <dbl>  <dbl>
 1 country_data/austria.csv male      149     83
 2 country_data/austria.csv male      169     66
 3 country_data/austria.csv male      183    108
 4 country_data/austria.csv male      181    111
 5 country_data/austria.csv female    175     83
 6 country_data/austria.csv male      185     71
 7 country_data/austria.csv female    172     74
 8 country_data/austria.csv male      176     96
 9 country_data/austria.csv female    163     79
10 country_data/austria.csv male      181     93
# ℹ 1,190 more rows

This adds a new variable, named country, whose values are the list names corresponding to each data frame being concatenated. In this example, we probably would prefer the country name to be austria etc., rather than country_data/austria.csv etc. This can be accomplished with str_match and a regular expression as follows:

bind_rows(df_list, .id = 'country') %>% 
  mutate(country = str_match(country, 'country_data/(.*)\\.csv')[,2])

# A tibble: 1,200 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 1,190 more rows

Or if we are going to re-use the str_match function often, we can do

get_country <- function(filename){
  str_match(filename, 'country_data/(.*)\\.csv')[,2]
}

bind_rows(df_list, .id = 'country') %>% 
  mutate(country = get_country(country))

# A tibble: 1,200 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 1,190 more rows

We can put all these steps together into one pipeline as follows:

dir_ls('country_data') %>% 
  map(read_csv) %>% 
  bind_rows(.id = 'country') %>% 
  mutate(country = get_country(country))

# A tibble: 1,200 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 1,190 more rows

Using `map_dfr`

We also use the map_dfr variant of map to effectively perform the map and bind_rows operations in one step:

dir_ls('country_data') %>% 
  map_dfr(read_csv, .id = 'country') %>% 
  mutate(country = get_country(country))

# A tibble: 1,200 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 1,190 more rows

Given that map_dfr does in one step that which requires two separate steps when using map itself, it may seem always preferable. There are occasions, however, where we might like to have the list of data frames returned by map, and perform additional wrangling and tidying operations on the individual data frames before we combine them into one single data frame.

Additional arguments to `read_csv`

In all of the above example, we used read_csv with no optional arguments. Should we wish to use optional arguments to read_csv, we can use add them as arguments to map or map_dfr. For example, if we want to limit the number of records we read in to no more than 10 rows per file, we would use n_max = 10 in read_csv, which we can accomplish with a functional as follows:

dir_ls('country_data') %>% 
  map_dfr(read_csv, n_max = 10, .id = 'country') %>% 
  mutate(country = get_country(country))

# A tibble: 120 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 110 more rows

Using `readr` 2.0.0

Since readr 2.0.0, released on July 20, 2021, read_csv on its own can read in and concatenate multiple files:

packageVersion('readr')

[1] '2.1.5'

dir_ls('country_data') %>% 
  read_csv(id = 'country') %>% 
  mutate(country = get_country(country))

# A tibble: 1,200 × 4
   country gender height weight
   <chr>   <chr>   <dbl>  <dbl>
 1 austria male      149     83
 2 austria male      169     66
 3 austria male      183    108
 4 austria male      181    111
 5 austria female    175     83
 6 austria male      185     71
 7 austria female    172     74
 8 austria male      176     96
 9 austria female    163     79
10 austria male      181     93
# ℹ 1,190 more rows

Note that here we use id (without the leading .) to accomplish what .id was doing in map above.

While this feature of read_csv is undoubtedly very convenient, the map and map_dfr are more general. For example, if the files being read in have their columns in different orders, then read_csv will not work with the file list, as we see in the following example:

write_csv(tibble(x = 1:10, y = 10:1), 'file1.txt')
write_csv(tibble(y = 1:10, x = 10:1), 'file2.txt')

files <- dir_ls(glob = '*.txt')
read_csv(files, id = 'filename')

Error: Files must have consistent column names:
* File 1 column 1 is: x
* File 2 column 1 is: y

On the other hand, map_dfr, and map ... bind_rows too, will work in this situation too:

map_dfr(files, read_csv, .id = 'filename')

# A tibble: 20 × 3
   filename      x     y
   <chr>     <dbl> <dbl>
 1 file1.txt     1    10
 2 file1.txt     2     9
 3 file1.txt     3     8
 4 file1.txt     4     7
 5 file1.txt     5     6
 6 file1.txt     6     5
 7 file1.txt     7     4
 8 file1.txt     8     3
 9 file1.txt     9     2
10 file1.txt    10     1
11 file2.txt    10     1
12 file2.txt     9     2
13 file2.txt     8     3
14 file2.txt     7     4
15 file2.txt     6     5
16 file2.txt     5     6
17 file2.txt     4     7
18 file2.txt     3     8
19 file2.txt     2     9
20 file2.txt     1    10

Incompatible column types

As a final topic, let us consider a problem that sometimes arises when combining data frames. When reading and combining dozens or hundreds of files, it is not unlikely that some errors in some files will prevent combining the corresponding data frames. As an example, consider the files foo.csv, foo1.csv foo2.csv. Their contents are as follows:

/home/andrews/gitdev/www/mjandrews.org-collection/montgomery/mjandrews.org/notes/readmultifile/data
==> foo.csv <==
a,b,c
1,2,3
4,5,6

==> foo1.csv <==
a,b,c
7,8,9
10,11,12

==> foo2.csv <==
a,b,c
13,14,15
16,X,18

If we try to combine these files into one data frame using the methods above, we encounter an error:

dir_ls('data', regexp = 'foo.*\\.csv') %>% 
  map_dfr(read_csv)

Error in `dplyr::bind_rows()` at purrr/R/superseded-map-df.R:86:3:
! Can't combine `..1$b` <double> and `..3$b` <character>.

What has happened here is that the string X in the b column of foo2.csv forces that column to be read as a character vector, while the b columns in foo.csv and foo1.csv are read, correctly, as numeric vectors. This is confirmed if we just use map with read_csv and view the data frames:

dir_ls('data', regexp = 'foo.*\\.csv') %>% 
  map(read_csv)

$`data/foo.csv`
# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6

$`data/foo1.csv`
# A tibble: 2 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     7     8     9
2    10    11    12

$`data/foo2.csv`
# A tibble: 2 × 3
      a b         c
  <dbl> <chr> <dbl>
1    13 14       15
2    16 X        18

This X string can be see as an example of coding error that could arise in one of the original .csv files. We can deal with this in different ways. One way is to explicitly inform read_csv to treat a value of X as a missing value. We do this with the na = 'X' argument to read_csv, which can be added as additional arguments to map_dfr, etc.

dir_ls('data', regexp = 'foo.*\\.csv') %>% 
  map_dfr(read_csv, na = 'X')

# A tibble: 6 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
3     7     8     9
4    10    11    12
5    13    14    15
6    16    NA    18

Alternatively, we could force read_csv to parse every b column as a numeric column, which forces every non-numeric value to be coded as a NA, and then the data frames can be combined. This can be done as follows:

dir_ls('data', regexp = 'foo.*\\.csv') %>% 
  map_dfr(read_csv, 
          col_types = list(b = col_double())
  )

# A tibble: 6 × 3
      a     b     c
  <dbl> <dbl> <dbl>
1     1     2     3
2     4     5     6
3     7     8     9
4    10    11    12
5    13    14    15
6    16    NA    18

Reuse

CC BY-SA 4.0

Citation

BibTeX citation:

@online{andrews2021,
  author = {Andrews, Mark},
  title = {Read {Multiple} {Files} into a {Single} {Data} {Frame}},
  date = {2021-07-23},
  url = {https://www.mjandrews.org/notes/readmultifile/},
  langid = {en}
}

For attribution, please cite this work as:

Andrews, Mark. 2021. “Read Multiple Files into a Single Data Frame.” July 23, 2021. https://www.mjandrews.org/notes/readmultifile/.

Using purrr’s map with bind_rows

Using map_dfr

Additional arguments to read_csv

Using readr 2.0.0