This content originally appeared on DEV Community and was authored by ashraf el houfi
How I built an end-to-end data pipeline from scratch using modern data engineering tools
Introduction
From Student to Data Engineer: My First Pipeline
As a student diving into the world of data engineering, I embarked on building my first complete ELT (Extract, Load, Transform) pipeline. This project taught me the fundamentals of modern data architecture and gave me hands-on experience with industry-standard tools.
What you'll learn from this article:
- How to design and implement an ELT pipeline from scratch
- Docker containerization for data services
- Data transformation with dbt (data build tool)
- Workflow orchestration with Apache Airflow
- Real-world best practices and lessons learned
Tech Stack:
🐳 Docker & Docker Compose
🐘 PostgreSQL (Source & Destination)
🔧 dbt (Data Build Tool)
✈️ Apache Airflow
🐍 Python
Section 1: Architecture Overview
The Architecture
My pipeline follows the modern ELT pattern:
- Extract & Load: A Python script extracts data from the source PostgreSQL database and loads it into the destination
- Transform: dbt handles data transformations, testing, and documentation
- Orchestrate: Airflow manages the entire workflow
Why ELT over ETL?
- Scalability: Transform after loading leverages destination database power
- Flexibility: Raw data available for ad-hoc analysis
- Modern Approach: Aligns with cloud data warehouse patterns
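To make the "transform after loading" point concrete, here is a minimal sketch of what the T step looks like when it runs inside the destination database. It is not the project's actual code: it assumes psycopg2 is installed, and the connection settings and table names are illustrative placeholders.

```python
# The "T" in ELT is just SQL executed by the destination engine after the raw
# data has landed -- exactly the job dbt automates later in this article.
# Connection details and table names are illustrative placeholders, and a
# raw_films table is assumed to have been loaded already.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5434, dbname="destination_db",
                        user="postgres", password="secret")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS film_rating AS
        SELECT film_id,
               title,
               CASE WHEN user_rating >= 4.5 THEN 'Excellent' ELSE 'Good' END AS rating_category
        FROM raw_films
    """)
conn.close()
```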
Section 2: Implementation Deep Dive
Building the Pipeline Step by Step
Step 1: Containerized Database Setup
# docker-compose.yaml excerpt
services:
  source_postgres:
    image: postgres:15
    ports:
      - "5433:5432"
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    volumes:
      - ./source_db_init/init.sql:/docker-entrypoint-initdb.d/init.sql
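With the containers up (docker compose up -d), I found it useful to sanity-check connectivity before writing any pipeline code. Here is a small smoke test, a sketch assuming psycopg2 is installed on the host, using the credentials and host port from the excerpt above; the destination service can be checked the same way.

```python
# Quick smoke test: can we reach the source database exposed on host port 5433?
# Credentials match the docker-compose excerpt above; psycopg2 is assumed installed.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5433,              # host port mapped to the container's 5432
    dbname="source_db",
    user="postgres",
    password="secret",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```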
Step 2: The ELT Script
# elt_script.py - The heart of data movement
import subprocess
import sys


def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    # Connection logic here
    pass


def extract_and_load():
    # pg_dump for extraction
    dump_command = [
        'pg_dump',
        '-h', 'source_postgres',
        '-U', 'postgres',
        '-d', 'source_db',
        '--clean',
        '--if-exists',
    ]
    # psql for loading
    load_command = [
        'psql',
        '-h', 'destination_postgres',
        '-U', 'postgres',
        '-d', 'destination_db',
    ]
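The snippet above leaves out the part that actually runs these commands. Below is one way the pieces can be wired together, a sketch rather than the exact script from my repo: it assumes the PostgreSQL client tools (pg_isready, pg_dump, psql) are available inside the ELT container, passes the password through the PGPASSWORD environment variable so the commands run non-interactively, and uses an illustrative dump file name.

```python
import os
import subprocess
import sys
import time


def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    # Poll with pg_isready until the server accepts connections.
    for _ in range(max_retries):
        if subprocess.run(['pg_isready', '-h', host, '-U', 'postgres']).returncode == 0:
            return
        time.sleep(delay_seconds)
    sys.exit(f"Could not reach {host} after {max_retries} attempts")


def run_elt():
    wait_for_postgres('source_postgres')
    wait_for_postgres('destination_postgres')

    # Password supplied via PGPASSWORD (assumed to match the compose credentials).
    env = dict(os.environ, PGPASSWORD='secret')

    # Extract: dump the source database to a file (file name is illustrative).
    subprocess.run(['pg_dump', '-h', 'source_postgres', '-U', 'postgres',
                    '-d', 'source_db', '--clean', '--if-exists',
                    '-f', 'data_dump.sql'], env=env, check=True)

    # Load: replay the dump against the destination database.
    subprocess.run(['psql', '-h', 'destination_postgres', '-U', 'postgres',
                    '-d', 'destination_db', '-f', 'data_dump.sql'], env=env, check=True)


if __name__ == '__main__':
    run_elt()
```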
Section 3: Data Transformation with dbt
Smart Transformations with dbt
Custom Macro for Rating Classification
-- macros/classify_ratings.sql
{% macro classify_ratings(rating_column) %}
    CASE
        WHEN {{ rating_column }} >= 4.5 THEN 'Excellent'
        WHEN {{ rating_column }} >= 4.0 THEN 'Good'
        WHEN {{ rating_column }} >= 3.0 THEN 'Average'
        WHEN {{ rating_column }} >= 2.0 THEN 'Poor'
        ELSE 'Very Poor'
    END
{% endmacro %}
Transformation Models
-- models/film_classification.sql
SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    release_date
FROM {{ ref('films') }}
Why dbt?
- Version Control: SQL transformations in Git
- Testing: Built-in data quality tests
- Documentation: Auto-generated lineage
- Modularity: Reusable macros and models
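In my pipeline these commands are ultimately triggered from Airflow, but you can exercise the same run-then-test loop from a tiny wrapper script. A minimal sketch, assuming dbt is installed and using illustrative --project-dir/--profiles-dir paths:

```python
# Build the dbt models, then run dbt's data quality tests; a non-zero exit
# code from either command stops the pipeline (check=True raises).
# The directory paths below are illustrative assumptions.
import subprocess

DBT_ARGS = ['--project-dir', '/opt/dbt', '--profiles-dir', '/root/.dbt']

subprocess.run(['dbt', 'run', *DBT_ARGS], check=True)
subprocess.run(['dbt', 'test', *DBT_ARGS], check=True)
```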
Section 4: Orchestration with Airflow
Orchestrating with Apache Airflow
The DAG Structure
# airflow/dags/elt_pipeline.py
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

dag = DAG(
    'elt_pipeline',
    default_args=default_args,
    description='Extract, Load, and Transform pipeline using dbt',
    schedule=timedelta(hours=1),
    catchup=False,
    tags=['elt', 'postgres', 'dbt'],
)

# Task dependencies
elt_task >> dbt_task >> quality_check
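The dependency line assumes default_args and three task objects that the excerpt doesn't show. Below is a sketch of what they could look like; it continues the same DAG file (so it reuses the dag object above), and the task ids, bash commands, and paths are illustrative assumptions rather than the repo's exact values.

```python
from datetime import datetime, timedelta
from airflow.operators.bash import BashOperator

# Illustrative defaults; tune retries and start_date to taste.
default_args = {
    'owner': 'airflow',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    'start_date': datetime(2025, 1, 1),
}

# One BashOperator per pipeline stage; the commands and paths are placeholders.
elt_task = BashOperator(task_id='extract_and_load', bash_command='python /opt/elt/elt_script.py', dag=dag)
dbt_task = BashOperator(task_id='dbt_run', bash_command='dbt run --project-dir /opt/dbt', dag=dag)
quality_check = BashOperator(task_id='dbt_test', bash_command='dbt test --project-dir /opt/dbt', dag=dag)
```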
Pipeline Visualization
Section 5: Results & Monitoring
Results & What I Learned
Pipeline Performance
- Data Volume: 20 films, 20 actors, 39 categories, 14 users
- Execution Time: ~30 seconds end-to-end
- Success Rate: 100% after debugging
- Tests Passed: 20/20 dbt data quality tests
Key Metrics Dashboard
Sample Output
-- Transformed data example
SELECT film_id, title, rating_category, actors
FROM film_rating
LIMIT 3;
| film_id | title | rating_category | actors |
|---------|------------|-----------------|------------------|
| 1 | Inception | Excellent | Leonardo DiCaprio|
| 2 | Shawshank | Excellent | Tim Robbins |
| 3 | Godfather | Excellent | Marlon Brando |
Section 6: Lessons Learned
Student Insights & Lessons Learned
What Went Well
- Containerization: Docker made development environment consistent
- Version Control: Everything in Git from day one
- Incremental Development: Built piece by piece, tested at each step
- Documentation: Commented code saved me hours of debugging
Challenges & Solutions
- PostgreSQL Version Mismatch
  - Problem: pg_dump version didn't match the server
  - Solution: Standardized on PostgreSQL 15 images
- Airflow 3.0.3 Breaking Changes
  - Problem: schedule_interval parameter deprecated
  - Solution: Updated to the schedule parameter (see the snippet below)
- dbt Schema Validation
  - Problem: Column names in tests didn't match the actual model
  - Solution: Regular testing and validation
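For anyone upgrading Airflow and hitting the same error, the fix really is a one-line parameter rename. A minimal sketch (the DAG id here is just a placeholder):

```python
from datetime import timedelta
from airflow import DAG

# Old Airflow 2.x style -- schedule_interval is deprecated in Airflow 3:
#   dag = DAG('elt_pipeline', schedule_interval=timedelta(hours=1))

# Airflow 3.x style -- same behaviour, new parameter name:
dag = DAG('elt_pipeline', schedule=timedelta(hours=1), catchup=False)
```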
If I Started Over
- Use Infrastructure as Code (Terraform) for cloud deployment
- Implement CI/CD pipeline with GitHub Actions
- Add data lineage tracking
- Include more comprehensive logging
Section 7: Next Steps & Future Improvements
Short-term Improvements
- Add data quality alerts
- Implement incremental loading
- Create Slack notifications for failures
- Add more sophisticated dbt tests
Long-term Goals
- Deploy to AWS/GCP with managed services
- Implement streaming with Kafka
- Add ML pipeline integration
- Scale to handle GB+ datasets
For Fellow Students
If you're starting your data engineering journey:
- Start Small: Begin with simple transformations
- Practice Regularly: Build something every week
- Join Communities: dbt Slack, Airflow forums
- Document Everything: Your future self will thank you
- Share Your Work: Teaching others reinforces learning
Conclusion
Building this ELT pipeline taught me that data engineering is equal parts technical skill and problem-solving mindset. Every error message was a learning opportunity, and every successful run was a small victory.
Key Takeaways:
- Modern data tools are powerful but require careful integration
- Container orchestration simplifies complex deployments
- Data quality testing is non-negotiable
- Good documentation saves more time than you think
Resources That Helped Me
- dbt Learn
- Airflow Documentation
- Docker Compose Guide
- Data Engineering communities on Discord/Slack
Want to try this yourself?
Check out the GitHub repository at https://github.com/el-houfi-achraf/elt-pipeline for the full source code and setup instructions.