BOE Scraper
A web scraper written in R for extracting and processing content from the Boletín Oficial del Estado (BOE), the official gazette of Spain, as well as the historical BOE archives (“Gazeta”).
Features
- Extracts metadata (title, date, references, document links) from BOE search result pages
- Downloads linked PDF, XML, and HTML files
- Outputs metadata to structured CSV
- Two scraper modules included:
  - BOE Scraper (1960–Present): covers the official digital BOE
  - Gazette Historical Scraper (Pre-1960): covers the older historical BOE archives
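As a rough illustration of the file-download feature, the sketch below derives a local path under the files folder and downloads a document only if it is not already on disk. The helper names `local_path` and `download_document` and the example.org URL are hypothetical and not taken from the scraper's source; `curl::curl_download` is the only external call used.

```r
# Hypothetical helper: derive a local path under files/ from a document URL.
local_path <- function(url, dir = "files") {
  clean <- sub("[?#].*$", "", url)  # drop any query string or fragment
  file.path(dir, basename(clean))
}

# Hypothetical helper: download one file, skipping it if already present.
download_document <- function(url, dir = "files") {
  dest <- local_path(url, dir)
  if (!file.exists(dest)) {
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
    curl::curl_download(url, dest, quiet = TRUE)
  }
  dest
}

# Placeholder URL, used only to show how the path is built (no download happens here).
example_path <- local_path("https://example.org/docs/sample.pdf?lang=es")
```

Skipping files that already exist lets an interrupted run be resumed without fetching everything again.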
How It Works
- Go to the BOE website
- Use the search tool to filter by keyword, date range, or section
- Copy the resulting URL from the browser’s address bar
- Run the appropriate scraper in R, passing the copied URL as input
- The scraper will:
  - Load all matching publications
  - Extract metadata (title, publication date, section, file links, etc.)
  - Save the data into a structured CSV file
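The extract-and-save steps above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function name `extract_results`, the CSS selector `li.resultado-busqueda`, the output file name `metadata.csv`, and the inline sample HTML are all assumptions, so inspect the live results page and adjust the selectors before relying on it.

```r
library(rvest)

# Hypothetical helper: pull title and link out of a parsed results page.
# The CSS selector is an assumption about the page markup.
extract_results <- function(page, base_url = "https://www.boe.es") {
  items <- html_elements(page, "li.resultado-busqueda")
  links <- html_element(items, "a")  # first link inside each result item
  data.frame(
    title = html_text2(links),
    url   = xml2::url_absolute(html_attr(links, "href"), base_url),
    stringsAsFactors = FALSE
  )
}

# Offline stand-in for a results page (made-up placeholder entries),
# so the sketch runs without network access.
sample_html <- '<ul>
  <li class="resultado-busqueda"><a href="/doc/EXAMPLE-1">First example title</a></li>
  <li class="resultado-busqueda"><a href="/doc/EXAMPLE-2">Second example title</a></li>
</ul>'

results <- extract_results(read_html(sample_html))

# Write the metadata table to the output folder as CSV.
dir.create("output", showWarnings = FALSE)
write.csv(results, file.path("output", "metadata.csv"), row.names = FALSE)
```

With a real search, `read_html(sample_html)` would be replaced by `read_html()` on the URL copied from the browser.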
Scrapers
1. BOE Scraper (1960–Present)
2. Gazette Historical Scraper (Pre-1960)
Requirements
You must create the following folder structure in your working directory:
- output
- files
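The folders can also be created from R instead of by hand. This short sketch assumes both folders sit directly under the current working directory, as the list above suggests:

```r
# Create the expected folders if they do not exist yet.
for (d in c("output", "files")) {
  if (!dir.exists(d)) dir.create(d)
}
```

Running it more than once is harmless, since existing folders are left alone.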
Required R packages:
xml2, httr, purrr, rvest, curl, tidyverse, janitor
Install dependencies with:
install.packages(c("xml2", "httr", "purrr", "rvest", "curl", "tidyverse", "janitor"))
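To avoid reinstalling packages that are already present, one option is to check first and install only what is missing. The helper name `missing_packages` is hypothetical; the install call is left commented out so the snippet has no side effects until you opt in.

```r
# Hypothetical helper: return the subset of pkgs that is not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("xml2", "httr", "purrr", "rvest", "curl", "tidyverse", "janitor")
to_install <- missing_packages(required)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```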