Electoral Manifestos OCR Pipeline
This repository contains the core Python workflow (“código madre”) for processing historical electoral programs using AI-based OCR, sentence-level text segmentation, and automatic translation into Spanish or English.
The pipeline is designed to be reproducible, scalable, and user-friendly, so that research assistants without deep text-as-data expertise can process new electoral programs as they are incorporated into the project.
Project Overview
The workflow follows these steps:
AI-based OCR of PDF files Electoral programs in PDF format are processed using Mistral OCR, which extracts high-quality text from scanned or native PDFs.
Sentence-level segmentation Extracted text is split into individual sentences, with one sentence per row, enabling downstream text-as-data analysis.
Language detection and translation Sentences not written in Spanish or English (e.g. French) are automatically translated sentence by sentence using ChatGPT.
Structured output The results are stored in structured formats (Markdown and DataFrames) that can later be used for:
- Manual or automated coding
- Machine learning classification (e.g. CAP / Babel models)
- Comparative political text analysis
Input Data
Format: PDF files
Typical content: Historical electoral programs / manifestos
Naming convention:
region_year_party_language.pdfExample:
quebec_1994_pq_fr.pdf
This naming system is important for tracking documents consistently across the project.
Output
For each PDF, the pipeline generates:
A Markdown (.md) file containing the extracted OCR text
A DataFrame with:
- Document identifier (filename)
- Full extracted text
- Sentence-level representations (after segmentation)
- Translated sentences when applicable
Each sentence occupies one row, making the output ready for coding and analysis.
Requirements
Main Python dependencies:
python >= 3.9pandasnumpyspacymistralaiopenai
You will also need:
- Mistral API key (for OCR)
- OpenAI API key (for translation)
Make sure your API keys are available as environment variables, for example:
export MISTRAL_API_KEY="your_key_here"
export OPENAI_API_KEY="your_key_here"
How to Use
Place PDF files in the designated input folder.
Open and run the notebook
individual_files.ipynb.The script will:
- Send PDFs to Mistral OCR
- Extract and clean the text
- Split text into sentences
- Translate non-English/Spanish content
- Save outputs automatically
The notebook is intentionally written in a step-by-step and readable way to facilitate use by non-expert research assistants.
Future Extensions
This repository is intended as a foundation for:
- Automated coding using CAP / Babel models
- ChatGPT-based supervised classification
- Cross-national and longitudinal manifesto analysis