Politely mapping recommended travel vaccines
Ahead of any trip I am planning, I check whether any travel vaccines are recommended, because the destination may harbour pathogens that are not present in my home country.
There are a number of great authoritative sources for this, often managed by the national health authorities of your home country. This service is for example provided by the CDC in the US, l’Institut Pasteur in France, or TravelHealthPro in the UK.
All the listed websites have a similar structure: one page per country, with the recommended vaccines, prophylaxis treatments, and general recommendations. But none of them provides a visualization to answer my question:
For each vaccine, for which destinations is it recommended?
In other words, can we change the “country-view” to a “vaccine-view”?
Beyond simple scientific curiosity, this can help when deciding whether to get a vaccine that is only marginally useful for the current trip but might be beneficial for future ones.
If you do not care about the technical walkthrough, you can jump straight to the final result.
Since the websites do not provide this view, let’s build it ourselves!
For this, we need to gather the data and then plot it. Since the data is not provided directly as a dataset or via an API, our only resort is web scraping: loading the webpage programmatically and selectively extracting the elements we are interested in by parsing the HTML code. A common tool for web scraping in R is the rvest package, which we will use below.
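As a minimal sketch of the general pattern (the URL and CSS selector below are placeholders, not the pages we will actually scrape): fetch the page once, select the nodes of interest, then extract their text.

page <- rvest::read_html("https://example.com")

page |>
  rvest::html_elements("a") |>  # select nodes with a CSS selector (or XPath)
  rvest::html_text()            # keep only their text content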
Getting the list of covered countries
The list of countries is present on the following page: https://wwwnc.cdc.gov/travel/destinations/list and can be extracted relatively quickly by selecting all the list (<li>) elements in each letter category (<div id='group-a'> for letter A, etc.):

country_pages <- rvest::read_html("https://wwwnc.cdc.gov/travel/destinations/list") |>
  rvest::html_elements(xpath = "//div[starts-with(@id, 'group-')]//li//a/@href") |>
  rvest::html_text() |>
  unique()
If you are wondering how to read the value of the xpath argument on line 2, you can read Maëlle Salmon’s post on this topic: Why I like XPath, XML and HTML.
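For reference, here is how that expression decomposes, annotated as comments (the structure mirrors the page layout described above):

# //div[starts-with(@id, 'group-')]  : every per-letter block, e.g. <div id='group-a'>
# //li//a                            : each country link inside those blocks
# /@href                             : keep the link target (the page path) rather than the link text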
From this, we get a character vector where each element is the path of the page dedicated to a specific country.
head(country_pages)
[1] "/travel/destinations/traveler/none/afghanistan"
[2] "/travel/destinations/traveler/none/albania"
[3] "/travel/destinations/traveler/none/algeria"
[4] "/travel/destinations/traveler/none/american-samoa"
[5] "/travel/destinations/traveler/none/andorra"
[6] "/travel/destinations/traveler/none/british-virgin-islands"
To continue, we will have to visit each country page and extract the recommendations from it. This means we will make 245 requests (the number of elements in the list + 1 for the request we just did).
This raises an alarm bell: is it okay to automatically send so many requests to a third-party website?
Scraping web data politely
Indeed, scraping differs from interacting with APIs in the sense that the webpages we are scraping were not necessarily intended to be visited a large number of times in a short timespan by a script. In some cases, scraping a website without extra care may result in reduced performance for other visitors of the site, or may get you blocked from the website.
To prevent accidents, websites will generally tell you which ways of scraping their data are acceptable (or not!). This is done in a machine-readable format in the robots.txt file.
::GET("https://wwwnc.cdc.gov/robots.txt") |>
httr::content() |>
httrcat()
# Rover is a bad dog
User-agent: Roverbot
Disallow: /
#GoogleBot
User-agent:googlebot
Crawl-delay: 20
Allow: /travel
Disallow: *.js
Disallow: /
User-agent:googlebot
Crawl-delay: 20
Allow: /eid
Disallow: *.js
Disallow: /
# EmailSiphon is a hunter/gatherer which extracts email addresses for spam-mailers to use
User-agent: EmailSiphon
Disallow: /
# Exclude MindSpider since it appears to be ill-behaved
User-agent: MindSpider
Disallow: /
User-agent: *
Crawl-delay: 20
User-agent: gsa-crawler
Disallow: /
# Exclude script for custom EID metrics
User-agent: *
Disallow: /eid/Scripts/EidMetrics.js
# Exclude resource intensive files for EID
User-agent: *
Disallow: /eid/article/*.pptx
Disallow: /eid/article/*-combined.pdf
It can contain information about which pages should not be scraped, who should not scrape the website, and the acceptable rate at which you can make requests.
Since it is machine-readable, reading the robots.txt file and adapting our behaviour accordingly can be fully automated. This is what the polite R package does.
polite is well integrated with rvest and only minimal changes are required to make rvest code polite. The code posted previously thus becomes:
# Initialize scraping session
session <- polite::bow("https://wwwnc.cdc.gov")

country_pages <- polite::nod(session, "/travel/destinations/list") |>
  polite::scrape() |>
  rvest::html_elements(xpath = "//div[starts-with(@id, 'group-')]//li//a/@href") |>
  rvest::html_text() |>
  unique()
With this in mind, we can move ahead with our scraping task and have a look at each country page to extract the recommended vaccines. We define a helper function parse_country_page() for this. Comments are added inline but the logic is the same as for the main page from which we got the list of countries:
- We use polite to handle the rate limiting, etc.
- We extract the specific element we are interested in with XPath
parse_country_page <- function(session, path) {
  # Print progress as we go through the country pages
  message(basename(path))

  # Politely fetch the page for this country
  page <- session |>
    polite::nod(path) |>
    polite::scrape()

  # Extract the disease names listed in the 'Vaccines and Medicines' table
  page |>
    rvest::html_elements(xpath = "//h3[text()='Vaccines and Medicines']/parent::div/following-sibling::div//table//td[@class='clinician-disease']/a") |>
    rvest::html_text()
}
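Before looping over every destination, we can sanity-check the helper on a single page, using one of the paths listed earlier:

# Quick test on a single country before scraping all of them
parse_country_page(session, "/travel/destinations/traveler/none/albania")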
This function can now be run on each country page with a loop [1]:
recos <- country_pages |>
  purrr::map(\(path) parse_country_page(session, path))

names(recos) <- basename(country_pages)
head(recos)
$afghanistan
[1] "Routine vaccines" "COVID-19" "Cholera" "Hepatitis A"
[5] "Hepatitis B" "Malaria" "Measles" "Polio"
[9] "Rabies" "Typhoid"
$albania
[1] "Routine vaccines" "COVID-19" "Hepatitis A" "Hepatitis B"
[5] "Measles" "Rabies" "Yellow Fever"
$algeria
[1] "Routine vaccines" "COVID-19" "Cholera" "Hepatitis A"
[5] "Hepatitis B" "Measles" "Polio" "Rabies"
[9] "Typhoid" "Yellow Fever"
$`american-samoa`
[1] "Routine vaccines" "COVID-19" "Hepatitis A" "Hepatitis B"
[5] "Measles" "Rabies" "Typhoid"
$andorra
[1] "Routine vaccines" "COVID-19" "Hepatitis B" "Measles"
[5] "Rabies"
$`british-virgin-islands`
[1] "Routine vaccines" "COVID-19" "Hepatitis A" "Hepatitis B"
[5] "Measles" "Rabies" "Typhoid"
If you want the data but don’t have time to scrape it yourself, I have saved it as a JSON file (30 KB) while writing this blog post. Keep in mind it may be slightly outdated when you read this post.
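If you go that route, here is a minimal sketch for loading the file back into R; the filename is hypothetical, adjust it to wherever you saved the download:

# "cdc_travel_recos.json" is a hypothetical local filename for the downloaded file
recos <- jsonlite::read_json("cdc_travel_recos.json", simplifyVector = TRUE)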
We’re almost there, but some data wrangling is still necessary to go from the list format to a nice data.frame:
cdc_recommendations <- recos |>
  purrr::list_transpose() |>
  dplyr::bind_rows() |>
  tidyr::pivot_longer(tidyselect::everything(), names_to = "country") |>
  tidyr::drop_na() |>
  tidyr::pivot_wider(names_from = "value") |>
  dplyr::mutate(across(-country, ~ !is.na(.x)))
head(cdc_recommendations)
# A tibble: 6 × 16
country `Routine vaccines` `COVID-19` Cholera `Hepatitis A` `Hepatitis B`
<chr> <lgl> <lgl> <lgl> <lgl> <lgl>
1 afghanistan TRUE TRUE TRUE TRUE TRUE
2 albania TRUE TRUE FALSE TRUE TRUE
3 algeria TRUE TRUE TRUE TRUE TRUE
4 american-sa… TRUE TRUE FALSE TRUE TRUE
5 andorra TRUE TRUE FALSE FALSE TRUE
6 british-vir… TRUE TRUE FALSE TRUE TRUE
# ℹ 10 more variables: Chikungunya <lgl>, Measles <lgl>,
# `Japanese Encephalitis` <lgl>, Rabies <lgl>, Malaria <lgl>,
# `Tick-borne Encephalitis` <lgl>, Polio <lgl>, `Yellow Fever` <lgl>,
# Typhoid <lgl>, `Meningitis (Meningococcal disease)` <lgl>
Using tidyr::pivot_longer() and then tidyr::pivot_wider() is a neat trick you can often use when trying to create a boolean matrix, such as a species/site occurrence matrix, or our vaccine/country matrix here.
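To make the trick more concrete, here is the same sequence of steps on a tiny made-up species/site example (the obs data frame is invented for illustration):

# One column per site listing the species observed there (made-up data)...
obs <- data.frame(
  siteA = c("fox", "deer"),
  siteB = c("fox", NA)
)

# ...turned into a site x species occurrence matrix with the same
# pivot_longer() / pivot_wider() combination as above
obs |>
  tidyr::pivot_longer(tidyselect::everything(), names_to = "site") |>
  tidyr::drop_na() |>
  tidyr::pivot_wider(names_from = "value") |>
  dplyr::mutate(dplyr::across(-site, ~ !is.na(.x)))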
Challenges of web scraping
In the previous section, I directly provided the relevant XPath to extract the piece of information we were interested in, but finding the right expression can sometimes be challenging.
Here are some specific issues I encountered while trying to scrape data from Institut Pasteur instead of the CDC:
Scraping is notoriously sensitive to changes in the page structure or phrasing. For example, if you extract a section by its name and the name changes slightly, the code will stop working. Alternatively, if you extract the section by position and another section is added before it, the code will stop working. As opposed to an API, since the page was not designed to be read by a machine, the developers will usually not make any specific effort to maintain compatibility during updates.
In HTML, the different section headings are not nested, while XPath is great at handling nested structures. In other words, sections are not delimited by opening and closing markers, but by the implicit “current section until a heading of the same level appears”. For example, I wanted to extract the subsection “Systématiquement” (= “Always”) under the “Vaccinations recommandées” (= “Recommended vaccines”) top-level section, but couldn’t find an easy way to do it.
This is what HTML code looks like:
<h1>Top-level section</h1>
<h2>Subsection</h2>
<p>lorem ipsum...</p>
<h1>Another top-level section</h1>
But my (incorrect) mental model was treating headings more like divs:
<level1><title>Top-level section</title>
  <level2><title>Subsection</title>
    <p>lorem ipsum...</p>
  </level2>
</level1>
<level1><title>Another top-level section</title>
  ...
</level1>
Data wrangling and plotting
We now want to plot this data with one map per vaccine and two colours indicating whether a certain vaccine is recommended when travelling to a certain country.
We can use ggplot2 for this, along with the maps package, which includes polygon data for every country.
However, one extra data wrangling challenge lies here: there is a mismatch between the region names used in the map data and on the CDC website, due to differences in capitalization, formatting, or even region definitions.
# List of countries from the maps polygons
maps::map('world', plot = FALSE)$names |>
  strsplit(":", fixed = TRUE) |>
  purrr::map_chr(1) |>
  unique() |>
  sort() |>
  head(10)
[1] "Afghanistan" "Albania" "Algeria" "American Samoa"
[5] "Andorra" "Angola" "Anguilla" "Antarctica"
[9] "Antigua" "Argentina"
# List of countries from the scraped CDC data
cdc_recommendations$country |>
  head(10)
[1] "afghanistan" "albania" "algeria"
[4] "american-samoa" "andorra" "british-virgin-islands"
[7] "angola" "anguilla" "antigua-and-barbuda"
[10] "argentina"
For example, the CDC will often issue separate recommendations for islands and for the mainland part of the same country.
To resolve this issue, we harmonize the names from both data sources against a common list of names provided by the countrycode package. countrycode uses regular expressions to match country names written in different languages and formats with the official name of the country.
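As a quick illustration of what this harmonization does, both the CDC slug and the maps spelling should resolve to the same standardized name:

# Both spellings should map to the same official country name,
# which is what will make the join below possible
countrycode::countrycode(
  c("american-samoa", "American Samoa"),
  origin = "country.name",
  destination = "country.name"
)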
world1 <- sf::st_as_sf(maps::map('world', plot = FALSE, fill = TRUE)) |>
  dplyr::mutate(
    ID = countrycode::countrycode(ID, origin = "country.name", destination = "country.name")
  )
Final plot
cdc_recommendations |>
  dplyr::mutate(
    ID = countrycode::countrycode(country, origin = "country.name", destination = "country.name"),
    .keep = "unused"
  ) |>
  dplyr::full_join(world1, by = "ID", relationship = "many-to-many") |>
  tidyr::pivot_longer(-c(ID, geom), names_to = "vax") |>
  dplyr::mutate(value = ifelse(is.na(value), FALSE, value)) |>
  ggplot2::ggplot(ggplot2::aes(fill = value, geometry = geom)) +
  ggplot2::facet_wrap(ggplot2::vars(vax), ncol = 2, shrink = FALSE) +
  ggplot2::geom_sf() +
  ggplot2::labs(
    title = glue::glue("Traveler prevention map"),
    subtitle = "as per CDC travel guidelines"
  ) +
  ggplot2::scale_fill_manual(
    values = c("TRUE" = "darkgreen", "FALSE" = "grey70"),
    name = "",
    labels = c("Not Recommended", "Recommended")
  ) +
  ggplot2::theme_bw() +
  ggplot2::theme(legend.position = "bottom")
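If you want to keep a copy of the figure, here is a short sketch with ggplot2::ggsave(); the filename and dimensions are arbitrary choices:

# ggsave() defaults to the last plot displayed; filename and size are arbitrary
ggplot2::ggsave("cdc_vaccine_maps.png", width = 10, height = 14, dpi = 300)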
Next steps
This blog post is now quite long but the work is not fully finished. Here are the steps I will need to take in the future:
The travel recommendations are regularly updated, so we would need to re-scrape the data regularly to ensure it is always up to date. This is one extra reason why it made sense to scrape politely. But the format of the code here is not well suited for production and automation. The code will be packaged and regular scraping will be automated via GitHub Actions.
During our scraping, we took the names of all the diseases mentioned on a specific country page. But travel recommendations are usually more granular: you will frequently see “for extended stays”, “if staying in a rural setting”, or restrictions to a specific region of the country. Future versions of this work will take this granularity into account and turn the binary “Recommended” vs “Not Recommended” outcome into something that can take three values: “Recommended for all travelers”, “Recommended for specific cases”, and “Not recommended”.
Footnotes

[1] See the blog post ‘Lesser-known reasons to prefer apply() over for loops’ on why I use purrr::map() here rather than a for loop.↩︎