#foo.csv
a,b,c
1,2,3
4,5,6
Read Multiple Files into a Single Data Frame
In R, usually data frames are created by reading data from a single file. Sometimes, however, we may wish to read multiple files into a single data frame. In this post, we will look at some of the main tidyverse
ways in which we can read in files into a single data frame.
For most analyses in R, our data are represented in data frames. These data frames are often created initially by reading in data from files, such as .csv
files, Excel files, and so on. Usually, in this case, the data frame is created by reading from a single file. For example, we might have a csv file named foo.csv
(available here
) with the following contents:
We can read in its contents as follows:
library(tidyverse)
read_csv('data/foo.csv')
# A tibble: 2 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
Sometimes, however, we may wish to read multiple files into a single data frame. There are countless situations where this could arise. As a very simple example, we might have conducted a study in multiple different countries, whereby in each country, we collected the gender, height, and weight of a sample of people. From each country, we might then have a single .csv
file, with names like england.csv
, france.csv
, and so on, but we may wish to read the data from all these .csv
files into one single data frame. Fictitious .csv
data files of this kind are available in a zip file named country_data.zip
.
In this post, we will look at some of the main ways in which we can read in multiple .csv
files into a single data frame, using the county_data.zip
files as an example. In particular, we will look at using purrr::map
combined with dplyr::bind_rows
, using purrr::map_dfr
, and using readr
version 2.0.0. We will also look some common problems, and their solutions, that can arise when reading and concatenating multiple csv files in this way. We will use read_csv
as the function to read individual files, but all of what we discuss is equally applicable when using other read functions like read_tsv
, read_delim
, etc., from readr
, and the methods involving the map
functionals can be used with almost any function for reading files such as readxl::read_excel
, haven::read_spss
, and so on.
Using purrr
’s map
with bind_rows
As discussed in this blog post, functionals like base R’s lapply
, purrr
’s map
, and their many variants, effectively apply functions to each element of a list and then collect the results in a new list or some other data structure. We can therefore use functionals to read multiple .csv
files into R as a list of data frames, which may be then concatenated into a single data frame. To do this, we first get a vector with the names of .csv
files that we want to read in. This can be done using base R’s list.files
, or else the fs
package’s dir_ls
command.
For example, assuming we have unzipped the country_data.zip
file into a directory named country_data
, we can get a list of all the files in it as follows:
library(fs)
<- dir_ls('country_data')
csv_file_list csv_file_list
country_data/austria.csv country_data/belgium.csv
country_data/england.csv country_data/france.csv
country_data/germany.csv country_data/ireland.csv
country_data/italy.csv country_data/netherlands.csv
country_data/norway.csv country_data/scotland.csv
country_data/spain.csv country_data/sweden.csv
We can now use purrr::map
to read in each one of these .csv
files in a data frame using read_csv
, and to collect these data frames in a list. This is done as follows:
<- map(csv_file_list, read_csv) df_list
The object df_list
is a list of length 12.
class(df_list)
[1] "list"
length(df_list)
[1] 12
Each element of this list is a data frame, and the name of each element is the file name from which the data was read.
names(df_list)
[1] "country_data/austria.csv" "country_data/belgium.csv"
[3] "country_data/england.csv" "country_data/france.csv"
[5] "country_data/germany.csv" "country_data/ireland.csv"
[7] "country_data/italy.csv" "country_data/netherlands.csv"
[9] "country_data/norway.csv" "country_data/scotland.csv"
[11] "country_data/spain.csv" "country_data/sweden.csv"
We can now use bind_rows
to concatenate all the data frames in the list:
bind_rows(df_list)
# A tibble: 1,200 × 3
gender height weight
<chr> <dbl> <dbl>
1 male 149 83
2 male 169 66
3 male 183 108
4 male 181 111
5 female 175 83
6 male 185 71
7 female 172 74
8 male 176 96
9 female 163 79
10 male 181 93
# ℹ 1,190 more rows
While this does concatenate the data frame, the result does not contain a variable to indicate the country. We can, however, add in this variable by using the .id
argument to bind_rows
as follows:
bind_rows(df_list, .id = 'country')
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 country_data/austria.csv male 149 83
2 country_data/austria.csv male 169 66
3 country_data/austria.csv male 183 108
4 country_data/austria.csv male 181 111
5 country_data/austria.csv female 175 83
6 country_data/austria.csv male 185 71
7 country_data/austria.csv female 172 74
8 country_data/austria.csv male 176 96
9 country_data/austria.csv female 163 79
10 country_data/austria.csv male 181 93
# ℹ 1,190 more rows
This adds a new variable, named country
, whose values are the list names corresponding to each data frame being concatenated. In this example, we probably would prefer the country name to be austria
etc., rather than country_data/austria.csv
etc. This can be accomplished with str_match
and a regular expression as follows:
bind_rows(df_list, .id = 'country') %>%
mutate(country = str_match(country, 'country_data/(.*)\\.csv')[,2])
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 1,190 more rows
Or if we are going to re-use the str_match
function often, we can do
<- function(filename){
get_country str_match(filename, 'country_data/(.*)\\.csv')[,2]
}
bind_rows(df_list, .id = 'country') %>%
mutate(country = get_country(country))
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 1,190 more rows
We can put all these steps together into one pipeline as follows:
dir_ls('country_data') %>%
map(read_csv) %>%
bind_rows(.id = 'country') %>%
mutate(country = get_country(country))
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 1,190 more rows
Using map_dfr
We also use the map_dfr
variant of map
to effectively perform the map
and bind_rows
operations in one step:
dir_ls('country_data') %>%
map_dfr(read_csv, .id = 'country') %>%
mutate(country = get_country(country))
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 1,190 more rows
Given that map_dfr
does in one step that which requires two separate steps when using map
itself, it may seem always preferable. There are occasions, however, where we might like to have the list of data frames returned by map
, and perform additional wrangling and tidying operations on the individual data frames before we combine them into one single data frame.
Additional arguments to read_csv
In all of the above example, we used read_csv
with no optional arguments. Should we wish to use optional arguments to read_csv
, we can use add them as arguments to map
or map_dfr
. For example, if we want to limit the number of records we read in to no more than 10 rows per file, we would use n_max = 10
in read_csv
, which we can accomplish with a functional as follows:
dir_ls('country_data') %>%
map_dfr(read_csv, n_max = 10, .id = 'country') %>%
mutate(country = get_country(country))
# A tibble: 120 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 110 more rows
Using readr
2.0.0
Since readr
2.0.0, released on July 20, 2021, read_csv
on its own can read in and concatenate multiple files:
packageVersion('readr')
[1] '2.1.5'
dir_ls('country_data') %>%
read_csv(id = 'country') %>%
mutate(country = get_country(country))
# A tibble: 1,200 × 4
country gender height weight
<chr> <chr> <dbl> <dbl>
1 austria male 149 83
2 austria male 169 66
3 austria male 183 108
4 austria male 181 111
5 austria female 175 83
6 austria male 185 71
7 austria female 172 74
8 austria male 176 96
9 austria female 163 79
10 austria male 181 93
# ℹ 1,190 more rows
Note that here we use id
(without the leading .
) to accomplish what .id
was doing in map
above.
While this feature of read_csv
is undoubtedly very convenient, the map
and map_dfr
are more general. For example, if the files being read in have their columns in different orders, then read_csv
will not work with the file list, as we see in the following example:
write_csv(tibble(x = 1:10, y = 10:1), 'file1.txt')
write_csv(tibble(y = 1:10, x = 10:1), 'file2.txt')
<- dir_ls(glob = '*.txt')
files read_csv(files, id = 'filename')
Error: Files must have consistent column names:
* File 1 column 1 is: x
* File 2 column 1 is: y
On the other hand, map_dfr
, and map ... bind_rows
too, will work in this situation too:
map_dfr(files, read_csv, .id = 'filename')
# A tibble: 20 × 3
filename x y
<chr> <dbl> <dbl>
1 file1.txt 1 10
2 file1.txt 2 9
3 file1.txt 3 8
4 file1.txt 4 7
5 file1.txt 5 6
6 file1.txt 6 5
7 file1.txt 7 4
8 file1.txt 8 3
9 file1.txt 9 2
10 file1.txt 10 1
11 file2.txt 10 1
12 file2.txt 9 2
13 file2.txt 8 3
14 file2.txt 7 4
15 file2.txt 6 5
16 file2.txt 5 6
17 file2.txt 4 7
18 file2.txt 3 8
19 file2.txt 2 9
20 file2.txt 1 10
Incompatible column types
As a final topic, let us consider a problem that sometimes arises when combining data frames. When reading and combining dozens or hundreds of files, it is not unlikely that some errors in some files will prevent combining the corresponding data frames. As an example, consider the files foo.csv
, foo1.csv
foo2.csv
. Their contents are as follows:
/home/andrews/gitdev/www/mjandrews.org-collection/montgomery/mjandrews.org/notes/readmultifile/data
==> foo.csv <==
a,b,c
1,2,3
4,5,6
==> foo1.csv <==
a,b,c
7,8,9
10,11,12
==> foo2.csv <==
a,b,c
13,14,15
16,X,18
If we try to combine these files into one data frame using the methods above, we encounter an error:
dir_ls('data', regexp = 'foo.*\\.csv') %>%
map_dfr(read_csv)
Error in `dplyr::bind_rows()` at purrr/R/superseded-map-df.R:69:3:
! Can't combine `..1$b` <double> and `..3$b` <character>.
What has happened here is that the string X
in the b
column of foo2.csv
forces that column to be read as a character vector, while the b
columns in foo.csv
and foo1.csv
are read, correctly, as numeric vectors. This is confirmed if we just use map
with read_csv
and view the data frames:
dir_ls('data', regexp = 'foo.*\\.csv') %>%
map(read_csv)
$`data/foo.csv`
# A tibble: 2 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
$`data/foo1.csv`
# A tibble: 2 × 3
a b c
<dbl> <dbl> <dbl>
1 7 8 9
2 10 11 12
$`data/foo2.csv`
# A tibble: 2 × 3
a b c
<dbl> <chr> <dbl>
1 13 14 15
2 16 X 18
This X
string can be see as an example of coding error that could arise in one of the original .csv
files. We can deal with this in different ways. One way is to explicitly inform read_csv
to treat a value of X
as a missing value. We do this with the na = 'X'
argument to read_csv
, which can be added as additional arguments to map_dfr
, etc.
dir_ls('data', regexp = 'foo.*\\.csv') %>%
map_dfr(read_csv, na = 'X')
# A tibble: 6 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
5 13 14 15
6 16 NA 18
Alternatively, we could force read_csv
to parse every b
column as a numeric column, which forces every non-numeric value to be coded as a NA
, and then the data frames can be combined. This can be done as follows:
dir_ls('data', regexp = 'foo.*\\.csv') %>%
map_dfr(read_csv,
col_types = list(b = col_double())
)
# A tibble: 6 × 3
a b c
<dbl> <dbl> <dbl>
1 1 2 3
2 4 5 6
3 7 8 9
4 10 11 12
5 13 14 15
6 16 NA 18
Reuse
Citation
@online{andrews2021,
author = {Andrews, Mark},
title = {Read {Multiple} {Files} into a {Single} {Data} {Frame}},
date = {2021-07-23},
url = {https://www.mjandrews.org/notes/readmultifile/},
langid = {en}
}