Building My First Production-Ready ELT Pipeline: A Student’s Journey with Docker, PostgreSQL, dbt, and Airflow

How I built an end-to-end data pipeline from scratch using modern data engineering tools

Introduction

From Student to Data Engineer: My First Pipeline

As a student diving into the world of data engineering, I embarked on building my first complete ELT (Extract, Load, Transform) pipeline. This project taught me the fundamentals of modern data architecture and gave me hands-on experience with industry-standard tools.

What you'll learn from this article:

  • How to design and implement an ELT pipeline from scratch
  • Docker containerization for data services
  • Data transformation with dbt (data build tool)
  • Workflow orchestration with Apache Airflow
  • Real-world best practices and lessons learned

Tech Stack:
🐳 Docker & Docker Compose
🐘 PostgreSQL (Source & Destination)
🔧 dbt (Data Build Tool)
✈️ Apache Airflow
🐍 Python

Section 1: Architecture Overview

The Architecture

My pipeline follows the modern ELT pattern:

  1. Extract & Load: A Python script extracts data from the source PostgreSQL database and loads it into the destination
  2. Transform: dbt handles data transformations, testing, and documentation
  3. Orchestrate: Airflow manages the entire workflow
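
In code, those three stages boil down to a skeleton like the one below. This is a conceptual sketch only: the file name, function bodies, and the bare dbt command are placeholders for what Sections 2-4 flesh out.

# pipeline_sketch.py - conceptual skeleton of the ELT flow (names and commands are illustrative)
import subprocess

def extract_and_load():
    # 1. Extract & Load: dump the source Postgres and replay it into the destination (Section 2)
    pass

def transform():
    # 2. Transform: hand the loaded data to dbt for models, tests, and docs (Section 3)
    subprocess.run(['dbt', 'run'], check=True)

if __name__ == '__main__':
    # 3. Orchestrate: in the real pipeline this ordering lives in an Airflow DAG (Section 4)
    extract_and_load()
    transform()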

Why ELT over ETL?

  • Scalability: Transforming after loading leverages the destination database's compute
  • Flexibility: Raw data stays available for ad-hoc analysis
  • Modern Approach: Aligns with cloud data warehouse patterns

Section 2: Implementation Deep Dive

Building the Pipeline Step by Step

Step 1: Containerized Database Setup

# docker-compose.yaml excerpt
services:
  source_postgres:
    image: postgres:15
    ports:
      - "5433:5432"
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    volumes:
      - ./source_db_init/init.sql:/docker-entrypoint-initdb.d/init.sql
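
Once docker compose is up, it's worth sanity-checking that the source container is reachable from the host on the mapped port. A minimal check, assuming the psycopg2 driver is installed as an extra dependency (credentials mirror the compose excerpt above):

# check_source_db.py - quick connectivity check against the source container (assumes psycopg2 is installed)
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5433,              # host port mapped to the container's 5432
    dbname='source_db',
    user='postgres',
    password='secret',
)
with conn, conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone()[0])
conn.close()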

Step 2: The ELT Script

# elt_script.py - The heart of data movement
import os
import subprocess
import sys
import time

def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    """Block until the Postgres instance at `host` accepts connections."""
    for attempt in range(1, max_retries + 1):
        if subprocess.run(['pg_isready', '-h', host, '-U', 'postgres']).returncode == 0:
            return True
        print(f"{host} not ready (attempt {attempt}/{max_retries}), retrying in {delay_seconds}s...")
        time.sleep(delay_seconds)
    return False

def extract_and_load():
    # pg_dump for extraction
    dump_command = [
        'pg_dump',
        '-h', 'source_postgres',
        '-U', 'postgres',
        '-d', 'source_db',
        '--clean',
        '--if-exists'
    ]

    # psql for loading
    load_command = [
        'psql',
        '-h', 'destination_postgres',
        '-U', 'postgres',
        '-d', 'destination_db'
    ]

    # Pipe the dump straight into the destination; PGPASSWORD avoids an interactive prompt
    # (both containers use 'secret' in this setup; adjust if yours differ)
    env = dict(os.environ, PGPASSWORD='secret')
    dump = subprocess.run(dump_command, env=env, capture_output=True, check=True)
    subprocess.run(load_command, env=env, input=dump.stdout, check=True)

if __name__ == '__main__':
    if not (wait_for_postgres('source_postgres') and wait_for_postgres('destination_postgres')):
        sys.exit('Databases never became available')
    extract_and_load()

Section 3: Data Transformation with dbt

Smart Transformations with dbt

Custom Macro for Rating Classification

-- macros/classify_ratings.sql
{% macro classify_ratings(rating_column) %}
  CASE
    WHEN {{ rating_column }} >= 4.5 THEN 'Excellent'
    WHEN {{ rating_column }} >= 4.0 THEN 'Good'
    WHEN {{ rating_column }} >= 3.0 THEN 'Average'
    WHEN {{ rating_column }} >= 2.0 THEN 'Poor'
    ELSE 'Very Poor'
  END
{% endmacro %}

Transformation Models

-- models/film_classification.sql
SELECT
    film_id,
    title,
    user_rating,
    {{ classify_ratings('user_rating') }} as rating_category,
    release_date
FROM {{ ref('films') }}

Why dbt?

  • Version Control: SQL transformations in Git
  • Testing: Built-in data quality tests
  • Documentation: Auto-generated lineage
  • Modularity: Reusable macros and models
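
Although Airflow drives dbt in the final setup (next section), it's handy to smoke-test the dbt project from plain Python during development. A minimal sketch that shells out to the dbt CLI; the project path is a placeholder:

# run_dbt.py - sketch: invoke the dbt CLI from Python (project path is a placeholder)
import subprocess

def run_dbt(command, project_dir='/app/dbt_project'):
    """Run a dbt subcommand ('run', 'test', ...) and raise if it fails."""
    subprocess.run(
        ['dbt', command, '--project-dir', project_dir, '--profiles-dir', project_dir],
        check=True,
    )

if __name__ == '__main__':
    run_dbt('run')   # build the models, e.g. film_classification
    run_dbt('test')  # execute the data quality tests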

Section 4: Orchestration with Airflow

Orchestrating with Apache Airflow

The DAG Structure

# airflow/dags/elt_pipeline.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {'owner': 'airflow', 'retries': 1}

dag = DAG(
    'elt_pipeline',
    default_args=default_args,
    description='Extract, Load, and Transform pipeline using dbt',
    start_date=datetime(2025, 1, 1),   # required; value here is illustrative
    schedule=timedelta(hours=1),
    catchup=False,
    tags=['elt', 'postgres', 'dbt'],
)

# Tasks: thin bash wrappers around the ELT script and the dbt CLI (paths are placeholders)
elt_task = BashOperator(task_id='run_elt_script', bash_command='python /opt/airflow/elt/elt_script.py', dag=dag)
dbt_task = BashOperator(task_id='run_dbt_models', bash_command='dbt run --project-dir /opt/airflow/dbt', dag=dag)
quality_check = BashOperator(task_id='run_dbt_tests', bash_command='dbt test --project-dir /opt/airflow/dbt', dag=dag)

# Task dependencies
elt_task >> dbt_task >> quality_check

Pipeline Visualization

Section 5: Results & Monitoring

Results & What I Learned

Pipeline Performance

  • Data Volume: 20 films, 20 actors, 39 categories, 14 users
  • Execution Time: ~30 seconds end-to-end
  • Success Rate: 100% after debugging
  • Tests Passed: 20/20 dbt data quality tests

Key Metrics Dashboard

Sample Output

-- Transformed data example
SELECT film_id, title, rating_category, actors 
FROM film_rating 
LIMIT 3;

| film_id | title      | rating_category | actors           |
|---------|------------|-----------------|------------------|
| 1       | Inception  | Excellent       | Leonardo DiCaprio|
| 2       | Shawshank  | Excellent       | Tim Robbins      |
| 3       | Godfather  | Excellent       | Marlon Brando    |

Section 6: Lessons Learned

Student Insights & Lessons Learned

What Went Well

  • Containerization: Docker made development environment consistent
  • Version Control: Everything in Git from day one
  • Incremental Development: Built piece by piece, tested at each step
  • Documentation: Commented code saved me hours of debugging

Challenges & Solutions

  1. PostgreSQL Version Mismatch

    • Problem: The pg_dump client version didn't match the Postgres server version
    • Solution: Standardized on PostgreSQL 15 images
  2. Airflow 3.0.3 Breaking Changes

    • Problem: The schedule_interval parameter is no longer supported in Airflow 3 (see the snippet after this list)
    • Solution: Switched to the schedule parameter
  3. dbt Schema Validation

    • Problem: Column names referenced in the dbt tests didn't match the actual model columns
    • Solution: Ran dbt test after every model change and kept the test definitions in sync
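
For reference, the scheduling change from challenge 2 is a one-line fix in the DAG definition. A stripped-down example (the dag_id and start_date are illustrative):

# schedule_migration_example.py - Airflow 2.x vs 3.x scheduling argument
from datetime import datetime, timedelta
from airflow import DAG

# Airflow 2.x accepted: DAG('elt_pipeline', schedule_interval=timedelta(hours=1), ...)
# Airflow 3.x dropped schedule_interval, so the DAG is declared with schedule instead:
dag = DAG(
    'elt_pipeline_schedule_example',
    start_date=datetime(2025, 1, 1),
    schedule=timedelta(hours=1),
    catchup=False,
)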

If I Started Over

  • Use Infrastructure as Code (Terraform) for cloud deployment
  • Implement CI/CD pipeline with GitHub Actions
  • Add data lineage tracking
  • Include more comprehensive logging

Section 7: Next Steps & Future Improvements

Short-term Improvements

  • Add data quality alerts
  • Implement incremental loading
  • Create Slack notifications for failures
  • Add more sophisticated dbt tests

Long-term Goals

  • Deploy to AWS/GCP with managed services
  • Implement streaming with Kafka
  • Add ML pipeline integration
  • Scale to handle GB+ datasets

For Fellow Students

If you're starting your data engineering journey:

  1. Start Small: Begin with simple transformations
  2. Practice Regularly: Build something every week
  3. Join Communities: dbt Slack, Airflow forums
  4. Document Everything: Your future self will thank you
  5. Share Your Work: Teaching others reinforces learning

Conclusion

Building this ELT pipeline taught me that data engineering is equal parts technical skill and problem-solving mindset. Every error message was a learning opportunity, and every successful run was a small victory.

Key Takeaways:

  • Modern data tools are powerful but require careful integration
  • Container orchestration simplifies complex deployments
  • Data quality testing is non-negotiable
  • Good documentation saves more time than you think

Resources That Helped Me

Want to try this yourself?
Check out the full source code and setup instructions on GitHub: https://github.com/el-houfi-achraf/elt-pipeline

