Web scraping Wikipedia data tables into R data frames
Mark Andrews. January 28, 2019
Wikipedia provides us with very large numbers of data tables. All of these tables, though perhaps with some exceptions, provide a detailed reference to the original sources of the data and, by virtue of being on a Wikipedia page, are free to use according tothe Creative Commons BY-SA 3.0 licence. However, if we want to do data analysis using R on data in these tables, we must first do some web scraping to extract the data from the html tables and then use dplyr
and related tools to clean them up and convert them into a form that is easy to work with. Here, we provide a worked example of how to do this.
For this example, the page we will work with is List of English districts by population, and in particular the page version as of 28 January 2019 10:38 UTC.
Here’s a screenshot of that page where we see the main table:
We’ll need some R packages for web scraping, namely rvest, which provides a set of functions that are wrappers around functions in the httr
and xml2
packages. We’ll then need dplyr
for re-formatting and cleaning the data. We’ll use magrittr
for some processing pipelines. We’ll use stringr
for some renaming. We’ll use ggplot2
at the end for some visualizations.
library(tidyverse) # loads dplyr, ggplot2,
library(rvest)
library(magrittr)
library(stringr)
Web scraping the table
The web scraping is relatively painless. In addition to the url of the webpage, we also need to know the xpath node of the table in the page. This can be obtained using inspector tools in modern web-browsers, such as those provided by the DevTools in Chrome, see here. On the webpage used in this example, there’s actually just one html table, and we convert that to a tibble, although that final step is not strictly necessary.
webpage_url <- 'https://en.wikipedia.org/w/index.php?title=List_of_English_districts_by_population&oldid=879013307'
webpage_tables <-
webpage_url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table') %>%
html_table()
Df <- webpage_tables[[1]] %>% # There's only one table
as_tibble()
Data wrangling
Having read in the data into a data-frame, we now must do a series of typical dplyr
data wrangling steps.
First, let’s take a look at the first 10 rows of the Df
that we’ve obtained:
Df
## # A tibble: 338 x 6
## Rank District Population Type `Ceremonial coun… `English Region`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 More th… More than … More than 1… More t… More than 1,000,… More than 1,000,…
## 2 1 Birmingham 1,141,816 Metrop… West Midlands West Midlands
## 3 More th… More than … More than 5… More t… More than 500,00… More than 500,00…
## 4 2 Leeds 793,139 Metrop… West Yorkshire Yorkshire and th…
## 5 3 Sheffield 584,853 Metrop… South Yorkshire Yorkshire and th…
## 6 4 Cornwall 569,578 Unitar… Cornwall South West
## 7 5 Manchester 552,858 Metrop… Greater Manchest… North West
## 8 7 Bradford 539,776 Metrop… West Yorkshire Yorkshire and th…
## 9 8 County Dur… 530,094 Unitar… Durham North East
## 10 400,000… 400,000 to… 400,000 to … 400,00… 400,000 to 500,0… 400,000 to 500,0…
## # … with 328 more rows
There are a few things that need to be fixed:
- First, though this is maybe just a personal preference, we’ll convert column names to lower case and with no spaces.
- The original table’s sub-headers like
More than 1,000,000 inhabitants
,More than 500,000 inhabitants
, etc., have turned into rows with repeated values, and these need to be removed. - The population values need to be converted to integers.
- Some variables can be dropped to make a cleaner data frame.
To convert the column names, I’ll create a function using stringr
and apply it using rename_with
(see this blog post for more details on using rename_with
):
tolower_no_spaces <- function(s){
s %>%
str_to_lower() %>%
str_replace_all(' ', '_')
}
Df %<>% rename_with(tolower_no_spaces)
To deal with the inclusion of sub-headers rows, we need to filter out all rows that contain values like 250,000 to 300,000 inhabitants
, Below 50,000
, etc. If we just look at the rank
variable, these cases can be relatively easily identified by whether they contain the words to
, than
or Below
.
Df %>% filter(str_detect(rank, 'to|than|Below'))
## # A tibble: 12 x 6
## rank district population type ceremonial_county english_region
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 More th… More than … More than 1,… More th… More than 1,000,… More than 1,00…
## 2 More th… More than … More than 50… More th… More than 500,00… More than 500,…
## 3 400,000… 400,000 to… 400,000 to 5… 400,000… 400,000 to 500,0… 400,000 to 500…
## 4 300,000… 300,000 to… 300,000 to 4… 300,000… 300,000 to 400,0… 300,000 to 400…
## 5 250,000… 250,000 to… 250,000 to 3… 250,000… 250,000 to 300,0… 250,000 to 300…
## 6 200,000… 200,000 to… 200,000 to 2… 200,000… 200,000 to 250,0… 200,000 to 250…
## 7 150,000… 150,000 to… 150,000 to 2… 150,000… 150,000 to 200,0… 150,000 to 200…
## 8 125,000… 125,000 to… 125,000 to 1… 125,000… 125,000 to 150,0… 125,000 to 150…
## 9 100,000… 100,000 to… 100,000 to 1… 100,000… 100,000 to 125,0… 100,000 to 125…
## 10 75,000 … 75,000 to … 75,000 to 10… 75,000 … 75,000 to 100,000 75,000 to 100,…
## 11 50,000 … 50,000 to … 50,000 to 75… 50,000 … 50,000 to 75,000 50,000 to 75,0…
## 12 Below 5… Below 50,0… Below 50,000 Below 5… Below 50,000 Below 50,000
That gives us an easy rule with which to filter them out:
Df %<>% filter(!str_detect(rank, 'to|than|Below'))
We can now verify that the number of rows that we now have is the correct number, namely 326 (which is the current number of official districts that are in England).
nrow(Df)
## [1] 326
Now, let’s take a look again at the first 10 rows:
Df
## # A tibble: 326 x 6
## rank district population type ceremonial_coun… english_region
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Birmingham 1,141,816 Metropolitan … West Midlands West Midlands
## 2 2 Leeds 793,139 Metropolitan … West Yorkshire Yorkshire and th…
## 3 3 Sheffield 584,853 Metropolitan … South Yorkshire Yorkshire and th…
## 4 4 Cornwall 569,578 Unitary autho… Cornwall South West
## 5 5 Manchester 552,858 Metropolitan … Greater Manches… North West
## 6 7 Bradford 539,776 Metropolitan … West Yorkshire Yorkshire and th…
## 7 8 County Du… 530,094 Unitary autho… Durham North East
## 8 9 Wiltshire 500,024 Unitary autho… Wiltshire South West
## 9 10 Liverpool 498,042 Metropolitan … Merseyside North West
## 10 11 Bristol 463,377 Unitary autho… Bristol South West
## # … with 316 more rows
Next, we’ll convert the population
variable to an integer
variable by first removing the commas in their values. However, before doing that, in the original permanent URL page, we notice that some values in the Population column have align="center"
as values.
Df %>% filter(str_detect(population, 'align="center"'))
## # A tibble: 18 x 6
## rank district population type ceremonial_coun… english_region
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "align=\… Aylesbury V… "align=\"c… "" Buckinghamshire South East
## 2 "align=\… Bournemouth "align=\"c… "Unitary … Dorset South West
## 3 "align=\… Wycombe "align=\"c… "" Buckinghamshire South East
## 4 "align=\… Poole "align=\"c… "Unitary … Dorset South West
## 5 "align=\… Suffolk Coa… "align=\"c… "" Suffolk East of Engla…
## 6 "align=\… Waveney "align=\"c… "" Suffolk East of Engla…
## 7 "align=\… Taunton Dea… "align=\"c… "Borough" Somerset South West
## 8 "align=\… St Edmundsb… "align=\"c… "Borough" Suffolk East of Engla…
## 9 "align=\… West Dorset "align=\"c… "" Dorset South West
## 10 "align=\… Chiltern "align=\"c… "" Buckinghamshire South East
## 11 "align=\… East Dorset "align=\"c… "" Dorset South West
## 12 "align=\… North Dorset "align=\"c… "" Dorset South West
## 13 "align=\… South Bucks "align=\"c… "" Buckinghamshire South East
## 14 "align=\… Weymouth an… "align=\"c… "Borough" Dorset South West
## 15 "align=\… Forest Heath "align=\"c… "" Suffolk East of Engla…
## 16 "align=\… Christchurch "align=\"c… "Borough" Dorset South West
## 17 "align=\… Purbeck "align=\"c… "" Dorset South West
## 18 "align=\… West Somers… "align=\"c… "" Somerset South West
This is presumably just an typographic error. We could remove these rows, but when we convert to integers, population
values in these rows will be converted to NA
.
After we convert to integers, we’ll drop rank
entirely because it can be calculated easily from population
.
Df %<>% mutate(population = str_replace_all(population, ',', '') %>% as.integer()) %>%
select(-rank)
Let’s look again at the first 10 rows:
Df
## # A tibble: 326 x 5
## district population type ceremonial_coun… english_region
## <chr> <int> <chr> <chr> <chr>
## 1 Birmingham 1141816 Metropolitan boro… West Midlands West Midlands
## 2 Leeds 793139 Metropolitan boro… West Yorkshire Yorkshire and the…
## 3 Sheffield 584853 Metropolitan boro… South Yorkshire Yorkshire and the…
## 4 Cornwall 569578 Unitary authority Cornwall South West
## 5 Manchester 552858 Metropolitan boro… Greater Manches… North West
## 6 Bradford 539776 Metropolitan boro… West Yorkshire Yorkshire and the…
## 7 County Dur… 530094 Unitary authority Durham North East
## 8 Wiltshire 500024 Unitary authority Wiltshire South West
## 9 Liverpool 498042 Metropolitan boro… Merseyside North West
## 10 Bristol 463377 Unitary authority… Bristol South West
## # … with 316 more rows
The type
variable is quite inconsistent. The 326 English districts are in fact simply divided into 5 types but in our Df
, the type variable is sometimes missing, or has values like “City (2012)” that do not correspond to one the five English district types, or has two pieces of information that are separated by a comma. Because of this overall inconsistency, we will drop the variable. However, it is possible to recover this information from joining this table with other tables of English districts such as from what is available on the List of English districts by area Wikipedia page.
Df %<>% select(-type)
As a final fix, we can see that one district, namely Stockton-on-Tees, is listed as belonging to two regions of England:
Df %>% filter(district == "Stockton-on-Tees")
## # A tibble: 1 x 4
## district population ceremonial_county english_region
## <chr> <int> <chr> <chr>
## 1 Stockton-on-T… 197348 Durham andNorth Yorks… North East andYorkshire and …
Given that this district is listed here as part of North East England, we’ll change its listed region to North East
.
Df %<>% mutate(english_region = recode(english_region,
'North East andYorkshire and the Humber' = 'North East')
)
The final state of the data frame is now as follows:
Df
## # A tibble: 326 x 4
## district population ceremonial_county english_region
## <chr> <int> <chr> <chr>
## 1 Birmingham 1141816 West Midlands West Midlands
## 2 Leeds 793139 West Yorkshire Yorkshire and the Humber
## 3 Sheffield 584853 South Yorkshire Yorkshire and the Humber
## 4 Cornwall 569578 Cornwall South West
## 5 Manchester 552858 Greater Manchester North West
## 6 Bradford 539776 West Yorkshire Yorkshire and the Humber
## 7 County Durham 530094 Durham North East
## 8 Wiltshire 500024 Wiltshire South West
## 9 Liverpool 498042 Merseyside North West
## 10 Bristol 463377 Bristol South West
## # … with 316 more rows
Summarizing and visualizing the data
Having extracted and cleaned the data, we can now obtain some summary statistics:
Df %>% summarize(number_of_districts = n(),
median_population = median(population, na.rm = T),
population_iqr = IQR(population, na.rm = T)
)
## # A tibble: 1 x 3
## number_of_districts median_population population_iqr
## <int> <dbl> <dbl>
## 1 326 137215 113963.
And we can provide these summaries by English region too.
Df %>% group_by(english_region) %>%
summarize(number_of_districts = n(),
median_population = median(population, na.rm = T),
population_iqr = IQR(population, na.rm = T)
)
## # A tibble: 9 x 4
## english_region number_of_districts median_population population_iqr
## <chr> <int> <dbl> <dbl>
## 1 East Midlands 40 104256. 26472.
## 2 East of England 48 135242. 65534
## 3 London 33 276983 83567
## 4 North East 10 204984 133972.
## 5 North West 39 146038 146670.
## 6 South East 66 126274 43046.
## 7 South West 37 129128 111726.
## 8 West Midlands 30 129662 145136.
## 9 Yorkshire and the Humber 23 210618 203578
We can visualize the distributions of the population in each district.
ggplot(Df,
aes(x = population)
) + geom_histogram(bins = 25, col='white') +
theme_classic() +
scale_x_continuous(name="Population",
breaks = c(10^5, 5*10^5, 10^6),
labels = c('100K', '500K', '1M')
)
And do the same by region.
ggplot(Df,
aes(x = population)
) + geom_histogram(bins = 25, col='white') +
facet_wrap(~english_region)+ theme_classic() +
theme(strip.background = element_blank()) +
scale_x_continuous(name="Population",
breaks = c(10^5, 5*10^5, 10^6),
labels = c('100K', '500K', '1M')
)