Web scraping Wikipedia data tables into R data frames

Mark Andrews. January 28, 2019

Wikipedia provides us with very large numbers of data tables. All of these tables, though perhaps with some exceptions, provide a detailed reference to the original sources of the data and, by virtue of being on a Wikipedia page, are free to use according tothe Creative Commons BY-SA 3.0 licence. However, if we want to do data analysis using R on data in these tables, we must first do some web scraping to extract the data from the html tables and then use dplyr and related tools to clean them up and convert them into a form that is easy to work with. Here, we provide a worked example of how to do this.

For this example, the page we will work with is List of English districts by population, and in particular the page version as of 28 January 2019 10:38 UTC.

Here’s a screenshot of that page where we see the main table:

We’ll need some R packages for web scraping, namely rvest, which provides a set of functions that are wrappers around functions in the httr and xml2 packages. We’ll then need dplyr for re-formatting and cleaning the data. We’ll use magrittr for some processing pipelines. We’ll use stringr for some renaming. We’ll use ggplot2 at the end for some visualizations.

library(tidyverse) # loads dplyr, ggplot2, 
library(rvest)
library(magrittr)
library(stringr)

Web scraping the table

The web scraping is relatively painless. In addition to the url of the webpage, we also need to know the xpath node of the table in the page. This can be obtained using inspector tools in modern web-browsers, such as those provided by the DevTools in Chrome, see here. On the webpage used in this example, there’s actually just one html table, and we convert that to a tibble, although that final step is not strictly necessary.

webpage_url <- 'https://en.wikipedia.org/w/index.php?title=List_of_English_districts_by_population&oldid=879013307'

webpage_tables <- 
  webpage_url %>% 
  read_html() %>% 
  html_nodes(xpath='//*[@id="mw-content-text"]/div/table') %>% 
  html_table()

Df <- webpage_tables[[1]] %>%   # There's only one table
  as_tibble()

Data wrangling

Having read in the data into a data-frame, we now must do a series of typical dplyr data wrangling steps.

First, let’s take a look at the first 10 rows of the Df that we’ve obtained:

Df

## # A tibble: 338 x 6
##    Rank     District    Population   Type    `Ceremonial coun… `English Region` 
##    <chr>    <chr>       <chr>        <chr>   <chr>             <chr>            
##  1 More th… More than … More than 1… More t… More than 1,000,… More than 1,000,…
##  2 1        Birmingham  1,141,816    Metrop… West Midlands     West Midlands    
##  3 More th… More than … More than 5… More t… More than 500,00… More than 500,00…
##  4 2        Leeds       793,139      Metrop… West Yorkshire    Yorkshire and th…
##  5 3        Sheffield   584,853      Metrop… South Yorkshire   Yorkshire and th…
##  6 4        Cornwall    569,578      Unitar… Cornwall          South West       
##  7 5        Manchester  552,858      Metrop… Greater Manchest… North West       
##  8 7        Bradford    539,776      Metrop… West Yorkshire    Yorkshire and th…
##  9 8        County Dur… 530,094      Unitar… Durham            North East       
## 10 400,000… 400,000 to… 400,000 to … 400,00… 400,000 to 500,0… 400,000 to 500,0…
## # … with 328 more rows

There are a few things that need to be fixed:

First, though this is maybe just a personal preference, we’ll convert column names to lower case and with no spaces.
The original table’s sub-headers like More than 1,000,000 inhabitants, More than 500,000 inhabitants, etc., have turned into rows with repeated values, and these need to be removed.
The population values need to be converted to integers.
Some variables can be dropped to make a cleaner data frame.

To convert the column names, I’ll create a function using stringr and apply it using rename_with (see this blog post for more details on using rename_with):

tolower_no_spaces <- function(s){
  s %>% 
    str_to_lower() %>% 
    str_replace_all(' ', '_')
}

Df %<>% rename_with(tolower_no_spaces)

To deal with the inclusion of sub-headers rows, we need to filter out all rows that contain values like 250,000 to 300,000 inhabitants, Below 50,000, etc. If we just look at the rank variable, these cases can be relatively easily identified by whether they contain the words to, than or Below.

Df %>% filter(str_detect(rank, 'to|than|Below'))

## # A tibble: 12 x 6
##    rank     district    population    type     ceremonial_county english_region 
##    <chr>    <chr>       <chr>         <chr>    <chr>             <chr>          
##  1 More th… More than … More than 1,… More th… More than 1,000,… More than 1,00…
##  2 More th… More than … More than 50… More th… More than 500,00… More than 500,…
##  3 400,000… 400,000 to… 400,000 to 5… 400,000… 400,000 to 500,0… 400,000 to 500…
##  4 300,000… 300,000 to… 300,000 to 4… 300,000… 300,000 to 400,0… 300,000 to 400…
##  5 250,000… 250,000 to… 250,000 to 3… 250,000… 250,000 to 300,0… 250,000 to 300…
##  6 200,000… 200,000 to… 200,000 to 2… 200,000… 200,000 to 250,0… 200,000 to 250…
##  7 150,000… 150,000 to… 150,000 to 2… 150,000… 150,000 to 200,0… 150,000 to 200…
##  8 125,000… 125,000 to… 125,000 to 1… 125,000… 125,000 to 150,0… 125,000 to 150…
##  9 100,000… 100,000 to… 100,000 to 1… 100,000… 100,000 to 125,0… 100,000 to 125…
## 10 75,000 … 75,000 to … 75,000 to 10… 75,000 … 75,000 to 100,000 75,000 to 100,…
## 11 50,000 … 50,000 to … 50,000 to 75… 50,000 … 50,000 to 75,000  50,000 to 75,0…
## 12 Below 5… Below 50,0… Below 50,000  Below 5… Below 50,000      Below 50,000

That gives us an easy rule with which to filter them out:

Df %<>% filter(!str_detect(rank, 'to|than|Below'))

We can now verify that the number of rows that we now have is the correct number, namely 326 (which is the current number of official districts that are in England).

nrow(Df)

## [1] 326

Now, let’s take a look again at the first 10 rows:

Df

## # A tibble: 326 x 6
##    rank  district   population type           ceremonial_coun… english_region   
##    <chr> <chr>      <chr>      <chr>          <chr>            <chr>            
##  1 1     Birmingham 1,141,816  Metropolitan … West Midlands    West Midlands    
##  2 2     Leeds      793,139    Metropolitan … West Yorkshire   Yorkshire and th…
##  3 3     Sheffield  584,853    Metropolitan … South Yorkshire  Yorkshire and th…
##  4 4     Cornwall   569,578    Unitary autho… Cornwall         South West       
##  5 5     Manchester 552,858    Metropolitan … Greater Manches… North West       
##  6 7     Bradford   539,776    Metropolitan … West Yorkshire   Yorkshire and th…
##  7 8     County Du… 530,094    Unitary autho… Durham           North East       
##  8 9     Wiltshire  500,024    Unitary autho… Wiltshire        South West       
##  9 10    Liverpool  498,042    Metropolitan … Merseyside       North West       
## 10 11    Bristol    463,377    Unitary autho… Bristol          South West       
## # … with 316 more rows

Next, we’ll convert the population variable to an integer variable by first removing the commas in their values. However, before doing that, in the original permanent URL page, we notice that some values in the Population column have align="center" as values.

Df %>% filter(str_detect(population, 'align="center"'))

## # A tibble: 18 x 6
##    rank      district     population  type       ceremonial_coun… english_region
##    <chr>     <chr>        <chr>       <chr>      <chr>            <chr>         
##  1 "align=\… Aylesbury V… "align=\"c… ""         Buckinghamshire  South East    
##  2 "align=\… Bournemouth  "align=\"c… "Unitary … Dorset           South West    
##  3 "align=\… Wycombe      "align=\"c… ""         Buckinghamshire  South East    
##  4 "align=\… Poole        "align=\"c… "Unitary … Dorset           South West    
##  5 "align=\… Suffolk Coa… "align=\"c… ""         Suffolk          East of Engla…
##  6 "align=\… Waveney      "align=\"c… ""         Suffolk          East of Engla…
##  7 "align=\… Taunton Dea… "align=\"c… "Borough"  Somerset         South West    
##  8 "align=\… St Edmundsb… "align=\"c… "Borough"  Suffolk          East of Engla…
##  9 "align=\… West Dorset  "align=\"c… ""         Dorset           South West    
## 10 "align=\… Chiltern     "align=\"c… ""         Buckinghamshire  South East    
## 11 "align=\… East Dorset  "align=\"c… ""         Dorset           South West    
## 12 "align=\… North Dorset "align=\"c… ""         Dorset           South West    
## 13 "align=\… South Bucks  "align=\"c… ""         Buckinghamshire  South East    
## 14 "align=\… Weymouth an… "align=\"c… "Borough"  Dorset           South West    
## 15 "align=\… Forest Heath "align=\"c… ""         Suffolk          East of Engla…
## 16 "align=\… Christchurch "align=\"c… "Borough"  Dorset           South West    
## 17 "align=\… Purbeck      "align=\"c… ""         Dorset           South West    
## 18 "align=\… West Somers… "align=\"c… ""         Somerset         South West

This is presumably just an typographic error. We could remove these rows, but when we convert to integers, population values in these rows will be converted to NA.

After we convert to integers, we’ll drop rank entirely because it can be calculated easily from population.

Df %<>% mutate(population = str_replace_all(population, ',', '') %>% as.integer()) %>% 
  select(-rank)

Let’s look again at the first 10 rows:

Df

## # A tibble: 326 x 5
##    district    population type               ceremonial_coun… english_region    
##    <chr>            <int> <chr>              <chr>            <chr>             
##  1 Birmingham     1141816 Metropolitan boro… West Midlands    West Midlands     
##  2 Leeds           793139 Metropolitan boro… West Yorkshire   Yorkshire and the…
##  3 Sheffield       584853 Metropolitan boro… South Yorkshire  Yorkshire and the…
##  4 Cornwall        569578 Unitary authority  Cornwall         South West        
##  5 Manchester      552858 Metropolitan boro… Greater Manches… North West        
##  6 Bradford        539776 Metropolitan boro… West Yorkshire   Yorkshire and the…
##  7 County Dur…     530094 Unitary authority  Durham           North East        
##  8 Wiltshire       500024 Unitary authority  Wiltshire        South West        
##  9 Liverpool       498042 Metropolitan boro… Merseyside       North West        
## 10 Bristol         463377 Unitary authority… Bristol          South West        
## # … with 316 more rows

The type variable is quite inconsistent. The 326 English districts are in fact simply divided into 5 types but in our Df, the type variable is sometimes missing, or has values like “City (2012)” that do not correspond to one the five English district types, or has two pieces of information that are separated by a comma. Because of this overall inconsistency, we will drop the variable. However, it is possible to recover this information from joining this table with other tables of English districts such as from what is available on the List of English districts by area Wikipedia page.

Df %<>% select(-type)

As a final fix, we can see that one district, namely Stockton-on-Tees, is listed as belonging to two regions of England:

Df %>% filter(district == "Stockton-on-Tees")

## # A tibble: 1 x 4
##   district       population ceremonial_county      english_region               
##   <chr>               <int> <chr>                  <chr>                        
## 1 Stockton-on-T…     197348 Durham andNorth Yorks… North East andYorkshire and …

Given that this district is listed here as part of North East England, we’ll change its listed region to North East.

Df %<>% mutate(english_region = recode(english_region, 
                                       'North East andYorkshire and the Humber' = 'North East')
)

The final state of the data frame is now as follows:

Df

## # A tibble: 326 x 4
##    district      population ceremonial_county  english_region          
##    <chr>              <int> <chr>              <chr>                   
##  1 Birmingham       1141816 West Midlands      West Midlands           
##  2 Leeds             793139 West Yorkshire     Yorkshire and the Humber
##  3 Sheffield         584853 South Yorkshire    Yorkshire and the Humber
##  4 Cornwall          569578 Cornwall           South West              
##  5 Manchester        552858 Greater Manchester North West              
##  6 Bradford          539776 West Yorkshire     Yorkshire and the Humber
##  7 County Durham     530094 Durham             North East              
##  8 Wiltshire         500024 Wiltshire          South West              
##  9 Liverpool         498042 Merseyside         North West              
## 10 Bristol           463377 Bristol            South West              
## # … with 316 more rows

Summarizing and visualizing the data

Having extracted and cleaned the data, we can now obtain some summary statistics:

Df %>% summarize(number_of_districts = n(),
                 median_population = median(population, na.rm = T),
                 population_iqr = IQR(population, na.rm = T)
)

## # A tibble: 1 x 3
##   number_of_districts median_population population_iqr
##                 <int>             <dbl>          <dbl>
## 1                 326            137215        113963.

And we can provide these summaries by English region too.

Df %>% group_by(english_region) %>% 
  summarize(number_of_districts = n(),
            median_population = median(population, na.rm = T),
            population_iqr = IQR(population, na.rm = T)
  )

## # A tibble: 9 x 4
##   english_region           number_of_districts median_population population_iqr
##   <chr>                                  <int>             <dbl>          <dbl>
## 1 East Midlands                             40           104256.         26472.
## 2 East of England                           48           135242.         65534 
## 3 London                                    33           276983          83567 
## 4 North East                                10           204984         133972.
## 5 North West                                39           146038         146670.
## 6 South East                                66           126274          43046.
## 7 South West                                37           129128         111726.
## 8 West Midlands                             30           129662         145136.
## 9 Yorkshire and the Humber                  23           210618         203578

We can visualize the distributions of the population in each district.

ggplot(Df,
       aes(x = population)
) + geom_histogram(bins = 25, col='white') + 
  theme_classic() +
  scale_x_continuous(name="Population", 
                     breaks = c(10^5, 5*10^5, 10^6),
                     labels = c('100K', '500K', '1M')
        )

And do the same by region.

ggplot(Df,
       aes(x = population)
) + geom_histogram(bins = 25, col='white') + 
  facet_wrap(~english_region)+  theme_classic() +
  theme(strip.background = element_blank()) + 
  scale_x_continuous(name="Population", 
                     breaks = c(10^5, 5*10^5, 10^6),
                     labels = c('100K', '500K', '1M')
  )