🌍 Automating Africa’s Energy Data Collection Using Python, Playwright, and MongoDB (2000–2024)

This content originally appeared on DEV Community and was authored by John Wakaba

⚡ Introduction

In today’s data-driven world, access to reliable and structured energy data is critical for decision-making, research, and policy planning.

However, most open data platforms in Africa — such as the Africa Energy Portal (AEP) — present information in dashboard views, which makes large-scale analysis tedious.

To address this challenge, I built a fully automated ETL (Extract, Transform, Load) pipeline that:

  • Scrapes energy indicators for all African countries (2000–2024)
  • Formats and validates the data for consistency
  • Stores it in a MongoDB database for easy access and analysis

This project uses Python, Playwright, and MongoDB, with the virtual environment and dependencies managed by uv, a lightweight Python package manager.

🧩 Problem Statement

While the Africa Energy Portal provides valuable country-level datasets, it does not offer a bulk download option.

Researchers, analysts, and energy planners need historical time-series data — such as:

  • Electricity generation and consumption
  • Renewable energy contribution
  • Access to clean cooking
  • Population electrification (rural vs urban)

Manually downloading data for 50+ African countries and 20+ years would take days — not counting inconsistencies in data formats and missing years.

The solution: automate it end-to-end.

🧠 Project Goals

  1. Extract data for all African countries directly from the AEP website.
  2. Transform it into a structured, tabular format for analysis.
  3. Store it efficiently in MongoDB for scalability and retrieval.
  4. Validate data completeness and consistency across countries and indicators.
  5. Export the final cleaned dataset for analysis and sharing.

⚙️ Tools & Technologies

| Purpose | Tool / Library | Role |
|---------|----------------|------|
| Web scraping | Playwright | Automates browser-based data capture |
| Environment & dependency management | uv | Manages virtual environment and packages |
| Data storage | MongoDB | Stores country-wise metrics and year data |
| Data validation & analysis | pandas, pydantic | Cleans and structures data |
| Export | openpyxl | Saves Excel files |
| Scripting | Python | Glue for the entire ETL process |

🔄 ETL Pipeline Overview

The pipeline consists of four modular stages:

Stage 1 – Data Extraction

  • Uses Playwright to navigate to each country’s profile page.
  • Intercepts the /get-country-data XHR response.
  • Extracts JSON payloads containing all available indicators and yearly values.

Each JSON record includes:

{
  "country": "Kenya",
  "metric": "Population with access to electricity - National",
  "sector": "ELECTRICITY ACCESS",
  "yearly": {
    "2015": 19.65,
    "2016": 25.73,
    "2022": 42.62
  }
}
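The extraction step above could be sketched roughly as follows with Playwright's sync API. This is a minimal illustration, not the project's actual script: the `BASE_URL` pattern is a placeholder, and the names `is_country_data_response`, `fetch_country_payload`, and the 60-second timeout are assumptions. Only the `/get-country-data` endpoint name comes from the description above.

```python
from typing import Any

# Placeholder URL pattern (assumption): replace with the real country profile URL.
BASE_URL = "https://example.org/country/{slug}"


def is_country_data_response(url: str) -> bool:
    """Return True for the XHR call that carries the indicator payload."""
    return "/get-country-data" in url


def fetch_country_payload(slug: str, timeout_ms: int = 60_000) -> Any:
    """Open one country page and return the parsed JSON of the intercepted XHR."""
    # Imported lazily so the URL helper stays usable without a browser install.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Start listening before navigating so the response is not missed.
            with page.expect_response(
                lambda r: is_country_data_response(r.url), timeout=timeout_ms
            ) as response_info:
                page.goto(BASE_URL.format(slug=slug), timeout=timeout_ms)
            return response_info.value.json()
        finally:
            browser.close()
```

Calling `fetch_country_payload("kenya")` would then return whatever JSON the page requests from that endpoint; the exact payload shape depends on the site.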

Stage 2 – Data Formatting

  • Converts raw JSON into a tabular schema: ["country", "country_serial", "metric", "unit", "sector", "sub_sector", "sub_sub_sector", "source_link", "source", "2000", ..., "2024"]
  • Ensures each row represents one metric for one country.
  • Fills missing years with null values to maintain consistency.
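As a rough sketch of this formatting step, the function below flattens one raw record into the schema listed above and fills absent years with `None`. The function name `to_row` is hypothetical, and it assumes the raw record keeps its yearly values under a `yearly` key, as in the Stage 1 sample.

```python
from typing import Any

# Column layout taken from the schema described above.
META_FIELDS = [
    "country", "country_serial", "metric", "unit", "sector",
    "sub_sector", "sub_sub_sector", "source_link", "source",
]
YEARS = [str(year) for year in range(2000, 2025)]  # "2000" .. "2024"


def to_row(record: dict[str, Any]) -> dict[str, Any]:
    """Flatten one raw record into one row: metadata first, then one column per year."""
    row = {field: record.get(field) for field in META_FIELDS}
    yearly = record.get("yearly") or {}
    for year in YEARS:
        # Years absent from the payload become None (null in MongoDB and JSON).
        row[year] = yearly.get(year)
    return row


sample = {
    "country": "Kenya",
    "metric": "Population with access to electricity - National",
    "sector": "ELECTRICITY ACCESS",
    "yearly": {"2015": 19.65, "2016": 25.73, "2022": 42.62},
}
row = to_row(sample)
print(row["2015"], row["2000"])
```

Every row ends up with the same 34 keys, so rows from different countries can be stacked into one table without further alignment.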

Stage 3 – Data Storage

  • Inserts formatted records into MongoDB using pymongo.
  • Adds a unique index (country, metric, source) to prevent duplicates.
  • Upserts records — ensuring updates don’t create duplicates.

Each MongoDB document looks like this:

{
  "country": "Kenya",
  "metric": "Access to Clean Cooking%",
  "source": "Tracking SDG7/WBG",
  "2000": null,
  "2015": 11.9,
  "2020": 23.6,
  "2024": null
}
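One way the unique index and upsert could be wired up with pymongo is sketched below. The connection URI, the database and collection names, and the function names are assumptions made for illustration; the `(country, metric, source)` key comes from the description above.

```python
from typing import Any

KEY_FIELDS = ("country", "metric", "source")


def build_upsert(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return the (filter, update) pair used to upsert one formatted record."""
    key = {field: record.get(field) for field in KEY_FIELDS}
    return key, {"$set": record}


def store_records(
    records: list[dict[str, Any]],
    uri: str = "mongodb://localhost:27017",  # assumed local instance
    db_name: str = "energy",                 # hypothetical database name
    coll_name: str = "indicators",           # hypothetical collection name
) -> int:
    """Upsert all records and return how many documents were inserted or changed."""
    # Imported lazily so build_upsert can be used without the MongoDB driver.
    from pymongo import ASCENDING, MongoClient, UpdateOne

    client = MongoClient(uri)
    try:
        coll = client[db_name][coll_name]
        # Compound unique index: one document per (country, metric, source).
        coll.create_index([(field, ASCENDING) for field in KEY_FIELDS], unique=True)
        ops = [
            UpdateOne(flt, update, upsert=True)
            for flt, update in map(build_upsert, records)
        ]
        if not ops:  # bulk_write rejects an empty list of operations
            return 0
        result = coll.bulk_write(ops, ordered=False)
        return result.upserted_count + result.modified_count
    finally:
        client.close()
```

Because the filter matches on the same three fields as the unique index, re-running the pipeline updates existing documents instead of inserting duplicates.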

Stage 4 – Validation

  • Identifies missing years or inconsistent units.
  • Detects countries with incomplete datasets.
  • Exports a detailed validation_report.csv that flags issues automatically.

Sample output:
| issue_type | country | metric | details |
|-------------|----------|--------|----------|
| MISSING_YEARS | Kenya | Access to Clean Cooking% | 2000–2014, 2023–2024 |
| UNIT_INCONSISTENCY | ALL | Electricity Access | %; MW |

🧾 Data Export

Once the ETL pipeline finishes, data is exported to both CSV and Excel formats for analysis.

uv run python export_to_csv.py

Output files:

  • reports/exports/energy_data.csv
  • reports/exports/energy_data.xlsx
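The contents of export_to_csv.py are not shown here, so the following is only a standard-library sketch of the CSV half of the export. The function names and the column list passed in are assumptions; null year values are written as empty cells.

```python
import csv
import io
from pathlib import Path
from typing import Any


def rows_to_csv_text(rows: list[dict[str, Any]], columns: list[str]) -> str:
    """Serialize rows to CSV text in a fixed column order; None becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def export_csv(rows: list[dict[str, Any]], columns: list[str], path: str) -> None:
    """Write the CSV text to disk, creating the output folder if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" keeps the line endings produced by the csv module unchanged.
    with open(target, "w", newline="", encoding="utf-8") as handle:
        handle.write(rows_to_csv_text(rows, columns))
```

For the Excel file, the same rows could be loaded into a pandas DataFrame and written with `DataFrame.to_excel`, which relies on openpyxl for .xlsx output.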

⚠️ Challenges Faced

| Challenge | Description |
|-----------|-------------|
| Cloudflare protection | The AEP website blocked simple HTTP requests (403, 500). Solved by using Playwright’s browser simulation to mimic human behavior. |
| Slow response times | Some pages took >30 seconds to return data. Added retry logic and longer timeouts. |
| Inconsistent URL naming | Country URLs (like cote-d’ivoire vs cote-divoire) required slug normalization logic. |
| Incomplete datasets | Some countries lacked data for certain years, handled via validation. |
| Browser resource use | Playwright’s real browser automation was resource-heavy; introduced throttling to manage load. |
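Two of these fixes, slug normalization and retry logic, can be illustrated with a short standard-library sketch. The function names, the choice of the apostrophe-free slug form, and the backoff settings are assumptions, not the values used in the project.

```python
import re
import time
import unicodedata
from typing import Callable, TypeVar

T = TypeVar("T")


def slugify_country(name: str) -> str:
    """Turn a country name into a lowercase, hyphenated, ASCII-only slug."""
    # Split accented letters into base letter + mark, then drop non-ASCII characters
    # (this also removes typographic apostrophes such as U+2019).
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    ascii_name = ascii_name.lower().replace("'", "")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name).strip("-")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("attempts must be at least 1")


print(slugify_country("Côte d’Ivoire"))
```

A slow page fetch could then be wrapped as `with_retries(lambda: fetch(slug))`, where `fetch` stands for whatever function loads one country page.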

📊 Results

  • ✅ Successfully extracted data for 50 African countries
  • ✅ Collected 500+ indicators covering 2000–2024
  • ✅ All records stored in MongoDB with proper schema
  • ✅ Automated validation caught missing and inconsistent data
  • ✅ Exportable formats ready for visualization and analysis

💡 Key Takeaways

  • Automating data extraction from protected websites is possible using browser-level automation (Playwright).
  • Designing modular ETL stages makes maintenance and debugging easier.
  • Data validation is just as important as extraction — raw data is rarely clean.
  • Storing data in MongoDB offers flexibility for hierarchical (nested) data structures.

🧠 Future Work

  • Extend scraping to additional AEP datasets (energy pricing, CO₂ emissions).
  • Build an interactive dashboard using Streamlit or Power BI.
  • Automate periodic updates (monthly/quarterly).
  • Add country-level time-series visualization modules.


APA

John Wakaba | Sciencx (2025-11-04T12:07:16+00:00) 🌍 Automating Africa’s Energy Data Collection Using Python, Playwright, and MongoDB (2000–2024). Retrieved from https://www.scien.cx/2025/11/04/%f0%9f%8c%8d-automating-africas-energy-data-collection-using-python-playwright-and-mongodb-2000-2024/
