BOE Scraper

2025

A web scraper written in R for extracting and processing content from the Boletín Oficial del Estado (BOE), the official gazette of Spain, as well as the historical BOE archives (“Gazeta”).

Features

- Extracts metadata (title, date, references, document links) from BOE search result pages
- Downloads linked PDF, XML, and HTML files
- Outputs metadata to a structured CSV file
- Includes two scraper modules:
  - BOE Scraper (1960–Present): scrapes the official digital BOE
  - Gazette Historical Scraper (Pre-1960): scrapes the older historical BOE archives ("Gazeta")
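
The file-download feature can be pictured with a minimal sketch like the one below. It is not the scrapers' actual code: the helper names `destination_path()` and `download_one()` and the example URL are hypothetical, and the `files` target folder is assumed from the Requirements section.

```r
# Hypothetical helper: derive a local file path from a document URL,
# dropping any query string or fragment before taking the file name.
destination_path <- function(url, dir = "files") {
  file.path(dir, basename(sub("[?#].*$", "", url)))
}

# Hypothetical helper: download one file with the curl package and return
# the local path, or NA if the download fails (e.g. a dead link).
download_one <- function(url, dir = "files") {
  dest <- destination_path(url, dir)
  tryCatch({
    curl::curl_download(url, dest, quiet = TRUE)
    dest
  }, error = function(e) NA_character_)
}

# Example with a made-up URL; no request is made until download_one() is called.
example_url <- "https://www.boe.es/example/document.pdf?lang=es"
local_path <- destination_path(example_url)
# download_one(example_url)  # uncomment for a live download
```

Returning `NA` instead of stopping on an error lets a long batch of downloads continue when a single link is broken.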

How It Works

1. Go to the BOE website.
2. Use the search tool to filter by keyword, date range, or section.
3. Copy the resulting URL from the browser's address bar.
4. Run the appropriate scraper in R, passing the copied URL as input.
The scraper will:
- Load all matching publications
- Extract metadata (title, publication date, section, file links, etc.)
- Save the data into a structured CSV file
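
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the scrapers' real implementation: the HTML snippet, the CSS selectors (`li.result`, `a.title`, `span.date`), and the function name `extract_results()` are all hypothetical, and the actual BOE result pages use their own markup.

```r
library(rvest)

# Hypothetical markup standing in for one page of search results.
sample_html <- '
<ul>
  <li class="result">
    <a class="title" href="/example/doc-1.html">First example title</a>
    <span class="date">01/01/2000</span>
  </li>
  <li class="result">
    <a class="title" href="/example/doc-2.html">Second example title</a>
    <span class="date">02/01/2000</span>
  </li>
</ul>'

# Hypothetical extractor: one row per result, with relative links made absolute.
extract_results <- function(page, base_url) {
  items <- html_elements(page, "li.result")
  title_nodes <- html_element(items, "a.title")
  data.frame(
    title = html_text2(title_nodes),
    date  = html_text2(html_element(items, "span.date")),
    link  = xml2::url_absolute(html_attr(title_nodes, "href"), base_url),
    stringsAsFactors = FALSE
  )
}

page <- read_html(sample_html)  # for a live run: read_html(search_url)
metadata <- extract_results(page, "https://www.boe.es")

# Written to a temporary folder here; a real run would presumably target "output".
csv_path <- file.path(tempdir(), "boe_metadata.csv")
write.csv(metadata, csv_path, row.names = FALSE)
```

Parsing an inline snippet keeps the sketch runnable offline; swapping in the copied search URL only changes the `read_html()` call.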

Scrapers

1. BOE Scraper (1960–Present)

2. Gazette Historical Scraper (Pre-1960)

Requirements

You must create the following folder structure in your working directory:
- output
- files
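
One way to create both folders from within R, assuming the current working directory is where the scrapers will be run:

```r
# Create the two folders in the working directory if they do not exist yet.
for (folder in c("output", "files")) {
  dir.create(folder, showWarnings = FALSE)
}
```
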
Required R packages:
- xml2
- httr
- purrr
- rvest
- curl
- tidyverse
- janitor

Install dependencies with:

install.packages(c("xml2", "httr", "purrr", "rvest", "curl", "tidyverse", "janitor"))