Web scraping Wikipedia data tables into R data frames
Wikipedia provides us with very large numbers of data tables. All of these tables, though perhaps with some exceptions, provide a detailed reference to the original sources of the data and, by virtue of being on a Wikipedia page, are free to use according tothe Creative Commons BY-SA 3.0 licence. However, if we want to do data analysis using R on data in these tables, we must first do some web scraping to extract the data from the html tables and then use dplyr
and related tools to clean them up and convert them into a form that is easy to work with. Here, we provide a worked example of how to do this.
For this example, the page we will work with is List of English districts by population, and in particular the page version as of 28 January 2019 10:38 UTC.
Here’s a screenshot of that page where we see the main table:
We’ll need some R packages for web scraping, namely rvest, which provides a set of functions that are wrappers around functions in the httr
and xml2
packages. We’ll then need dplyr
for re-formatting and cleaning the data. We’ll use magrittr
for some processing pipelines. We’ll use stringr
for some renaming. We’ll use ggplot2
at the end for some visualizations.
library(tidyverse) # loads dplyr, ggplot2,
library(rvest)
library(magrittr)
library(stringr)
Web scraping the table
The web scraping is relatively painless. In addition to the url of the webpage, we also need to know the xpath node of the table in the page. This can be obtained using inspector tools in modern web-browsers, such as those provided by the DevTools in Chrome, see here. On the webpage used in this example, there’s actually just one html table, and we convert that to a tibble, although that final step is not strictly necessary.
<- 'https://en.wikipedia.org/w/index.php?title=List_of_English_districts_by_population&oldid=879013307'
webpage_url
<-
webpage_tables %>%
webpage_url read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table') %>%
html_table()
<- webpage_tables[[1]] %>% # There's only one table
Df as_tibble()
Data wrangling
Having read in the data into a data-frame, we now must do a series of typical dplyr
data wrangling steps.
First, let’s take a look at the first 10 rows of the Df
that we’ve obtained:
Df
# A tibble: 338 × 6
Rank District Population Type `Ceremonial county` `English Region`
<chr> <chr> <chr> <chr> <chr> <chr>
1 More than 1,0… More th… More than… More… More than 1,000,00… More than 1,000…
2 1 Birming… 1,157,603 Metr… West Midlands West Midlands
3 More than 500… More th… More than… More… More than 500,000 … More than 500,0…
4 2 Leeds 822,483 Metr… West Yorkshire Yorkshire and t…
5 7 Sheffie… 566,242 Metr… South Yorkshire Yorkshire and t…
6 5 Cornwall 575,413 Unit… Cornwall South West
7 6 Manches… 568,996 Metr… Greater Manchester North West
8 9 Bradford 552,644 Metr… West Yorkshire Yorkshire and t…
9 10 County … 528,127 Unit… Durham North East
10 400,000 to 50… 400,000… 400,000 t… 400,… 400,000 to 500,000… 400,000 to 500,…
# ℹ 328 more rows
There are a few things that need to be fixed:
- First, though this is maybe just a personal preference, we’ll convert column names to lower case and with no spaces.
- The original table’s sub-headers like
More than 1,000,000 inhabitants
,More than 500,000 inhabitants
, etc., have turned into rows with repeated values, and these need to be removed. - The population values need to be converted to integers.
- Some variables can be dropped to make a cleaner data frame.
To convert the column names, I’ll create a function using stringr
and apply it using rename_with
(see this blog post for more details on using rename_with
):
<- function(s){
tolower_no_spaces %>%
s str_to_lower() %>%
str_replace_all(' ', '_')
}
%<>% rename_with(tolower_no_spaces) Df
To deal with the inclusion of sub-headers rows, we need to filter out all rows that contain values like 250,000 to 300,000 inhabitants
, Below 50,000
, etc. If we just look at the rank
variable, these cases can be relatively easily identified by whether they contain the words to
, than
or Below
.
%>% filter(str_detect(rank, 'to|than|Below')) Df
# A tibble: 12 × 6
rank district population type ceremonial_county english_region
<chr> <chr> <chr> <chr> <chr> <chr>
1 More than 1,000,0… More th… More than… More… More than 1,000,… More than 1,0…
2 More than 500,000… More th… More than… More… More than 500,00… More than 500…
3 400,000 to 500,00… 400,000… 400,000 t… 400,… 400,000 to 500,0… 400,000 to 50…
4 300,000 to 400,00… 300,000… 300,000 t… 300,… 300,000 to 400,0… 300,000 to 40…
5 250,000 to 300,00… 250,000… 250,000 t… 250,… 250,000 to 300,0… 250,000 to 30…
6 200,000 to 250,000 200,000… 200,000 t… 200,… 200,000 to 250,0… 200,000 to 25…
7 150,000 to 200,000 150,000… 150,000 t… 150,… 150,000 to 200,0… 150,000 to 20…
8 125,000 to 150,000 125,000… 125,000 t… 125,… 125,000 to 150,0… 125,000 to 15…
9 100,000 to 125,000 100,000… 100,000 t… 100,… 100,000 to 125,0… 100,000 to 12…
10 75,000 to 100,000 75,000 … 75,000 to… 75,0… 75,000 to 100,000 75,000 to 100…
11 50,000 to 75,000 50,000 … 50,000 to… 50,0… 50,000 to 75,000 50,000 to 75,…
12 Below 50,000 Below 5… Below 50,… Belo… Below 50,000 Below 50,000
That gives us an easy rule with which to filter them out:
%<>% filter(!str_detect(rank, 'to|than|Below')) Df
We can now verify that the number of rows that we now have is the correct number, namely 326 (which is the current number of official districts that are in England).
nrow(Df)
[1] 326
Now, let’s take a look again at the first 10 rows:
Df
# A tibble: 326 × 6
rank district population type ceremonial_county english_region
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 Birmingham 1,157,603 Metropolitan… West Midlands West Midlands
2 2 Leeds 822,483 Metropolitan… West Yorkshire Yorkshire and…
3 7 Sheffield 566,242 Metropolitan… South Yorkshire Yorkshire and…
4 5 Cornwall 575,413 Unitary auth… Cornwall South West
5 6 Manchester 568,996 Metropolitan… Greater Manchest… North West
6 9 Bradford 552,644 Metropolitan… West Yorkshire Yorkshire and…
7 10 County Durham 528,127 Unitary auth… Durham North East
8 11 Wiltshire 515,885 Unitary auth… Wiltshire South West
9 12 Liverpool 496,770 Metropolitan… Merseyside North West
10 13 Bristol 479,024 Unitary auth… Bristol South West
# ℹ 316 more rows
Next, we’ll convert the population
variable to an integer
variable by first removing the commas in their values. However, before doing that, in the original permanent URL page, we notice that some values in the Population column have align="center"
as values.
%>% filter(str_detect(population, 'align="center"')) Df
# A tibble: 41 × 6
rank district population type ceremonial_county english_region
<chr> <chr> <chr> <chr> <chr> <chr>
1 "align=\"center\"" Northam… "align=\"… "Bor… Northamptonshire East Midlands
2 "align=\"center\"" Aylesbu… "align=\"… "" Buckinghamshire South East
3 "align=\"center\"" Bournem… "align=\"… "Uni… Dorset South West
4 "align=\"center\"" Wycombe "align=\"… "" Buckinghamshire South East
5 "align=\"center\"" South S… "align=\"… "" Somerset South West
6 "align=\"center\"" Harroga… "align=\"… "Bor… North Yorkshire Yorkshire and…
7 "align=\"center\"" Poole "align=\"… "Uni… Dorset South West
8 "align=\"center\"" Suffolk… "align=\"… "" Suffolk East of Engla…
9 "align=\"center\"" Sedgemo… "align=\"… "" Somerset South West
10 "align=\"center\"" Waveney "align=\"… "" Suffolk East of Engla…
# ℹ 31 more rows
This is presumably just an typographic error. We could remove these rows, but when we convert to integers, population
values in these rows will be converted to NA
.
After we convert to integers, we’ll drop rank
entirely because it can be calculated easily from population
.
%<>% mutate(population = str_replace_all(population, ',', '') %>% as.integer()) %>%
Df select(-rank)
Let’s look again at the first 10 rows:
Df
# A tibble: 326 × 5
district population type ceremonial_county english_region
<chr> <int> <chr> <chr> <chr>
1 Birmingham 1157603 Metropolitan borou… West Midlands West Midlands
2 Leeds 822483 Metropolitan borou… West Yorkshire Yorkshire and…
3 Sheffield 566242 Metropolitan borou… South Yorkshire Yorkshire and…
4 Cornwall 575413 Unitary authority Cornwall South West
5 Manchester 568996 Metropolitan borou… Greater Manchest… North West
6 Bradford 552644 Metropolitan borou… West Yorkshire Yorkshire and…
7 County Durham 528127 Unitary authority Durham North East
8 Wiltshire 515885 Unitary authority Wiltshire South West
9 Liverpool 496770 Metropolitan borou… Merseyside North West
10 Bristol 479024 Unitary authority,… Bristol South West
# ℹ 316 more rows
The type
variable is quite inconsistent. The 326 English districts are in fact simply divided into 5 types but in our Df
, the type variable is sometimes missing, or has values like “City (2012)” that do not correspond to one the five English district types, or has two pieces of information that are separated by a comma. Because of this overall inconsistency, we will drop the variable. However, it is possible to recover this information from joining this table with other tables of English districts such as from what is available on the List of English districts by area Wikipedia page.
%<>% select(-type) Df
As a final fix, we can see that one district, namely Stockton-on-Tees, is listed as belonging to two regions of England:
%>% filter(district == "Stockton-on-Tees") Df
# A tibble: 1 × 4
district population ceremonial_county english_region
<chr> <int> <chr> <chr>
1 Stockton-on-Tees 199966 Durham andNorth Yorkshire North East andYorkshire…
Given that this district is listed here as part of North East England, we’ll change its listed region to North East
.
%<>% mutate(english_region = recode(english_region,
Df 'North East andYorkshire and the Humber' = 'North East')
)
The final state of the data frame is now as follows:
Df
# A tibble: 326 × 4
district population ceremonial_county english_region
<chr> <int> <chr> <chr>
1 Birmingham 1157603 West Midlands West Midlands
2 Leeds 822483 West Yorkshire Yorkshire and the Humber
3 Sheffield 566242 South Yorkshire Yorkshire and the Humber
4 Cornwall 575413 Cornwall South West
5 Manchester 568996 Greater Manchester North West
6 Bradford 552644 West Yorkshire Yorkshire and the Humber
7 County Durham 528127 Durham North East
8 Wiltshire 515885 Wiltshire South West
9 Liverpool 496770 Merseyside North West
10 Bristol 479024 Bristol South West
# ℹ 316 more rows
Summarizing and visualizing the data
Having extracted and cleaned the data, we can now obtain some summary statistics:
%>% summarize(number_of_districts = n(),
Df median_population = median(population, na.rm = T),
population_iqr = IQR(population, na.rm = T)
)
# A tibble: 1 × 3
number_of_districts median_population population_iqr
<int> <int> <dbl>
1 326 144593 120022
And we can provide these summaries by English region too.
%>% group_by(english_region) %>%
Df summarize(number_of_districts = n(),
median_population = median(population, na.rm = T),
population_iqr = IQR(population, na.rm = T)
)
# A tibble: 9 × 4
english_region number_of_districts median_population population_iqr
<chr> <int> <dbl> <dbl>
1 East Midlands 40 111133 28106
2 East of England 48 141363 68772.
3 London 33 275887 107740
4 North East 10 205226. 139082.
5 North West 39 184728 162403
6 South East 66 131570. 49108.
7 South West 37 134230. 132337.
8 West Midlands 30 135008. 146617.
9 Yorkshire and the Humber 23 268603 181771.
We can visualize the distributions of the population in each district.
ggplot(Df,
aes(x = population)
+ geom_histogram(bins = 25, col='white') +
) theme_classic() +
scale_x_continuous(name="Population",
breaks = c(10^5, 5*10^5, 10^6),
labels = c('100K', '500K', '1M')
)
And do the same by region.
ggplot(Df,
aes(x = population)
+ geom_histogram(bins = 25, col='white') +
) facet_wrap(~english_region)+ theme_classic() +
theme(strip.background = element_blank()) +
scale_x_continuous(name="Population",
breaks = c(10^5, 5*10^5, 10^6),
labels = c('100K', '500K', '1M')
)
Reuse
Citation
@online{andrews2019,
author = {Andrews, Mark},
title = {Web Scraping {Wikipedia} Data Tables into {R} Data Frames},
date = {2019-01-28},
url = {https://www.mjandrews.org/notes/scraping/},
langid = {en}
}