This content originally appeared on DEV Community and was authored by susan waweru
Data engineering is the discipline of designing, building, and maintaining the systems and workflows that make data accessible, reliable, and ready for analysis. It is the behind-the-scenes backbone of modern analytics, machine learning, and business intelligence.
In this article, we will explore some foundational concepts in data engineering.
1. Batch vs Streaming Ingestion
- These are two methods of getting data into a system (ingestion) for processing, analytics, or storage.
Batch Ingestion
What it is:
Involves collecting data over a period of time, then processing it all at once in a batch.
When used:
Batch ingestion is used when immediate results are not needed and when large volumes of data must be processed at once.
- Data is collected and stored temporarily over a period of time
- At a set time, a scheduled job runs to read and process the data in chunks
Example use case
Data warehouse: loading data into a data warehouse from operational systems daily
Streaming Ingestion
What it is:
Involves collecting data in real time as it arrives.
It's continuously ingested and processed without waiting for a batch window
When used:
Used when real-time insights and fast reactions are needed, e.g., fraud detection or real-time alerts
- data flows continuously
- data is processed in small increments
Example use case
Stock trading - the trade data in financial markets is ingested in real time
Batch vs Streaming Ingestion

| | Batch | Streaming |
| --- | --- | --- |
| Processing of data | Scheduled | Real-time |
| Data delay | Minutes to hours | Seconds to milliseconds |
| Data volume | Large chunks of data at once | Small events, continuously |
| Use cases | Reports, analytics, ETL jobs | Real-time alerts, monitoring, dashboards |
| Examples | Payroll, web logs | Stock prices, IoT sensor data |
| Tools | SSIS | Apache Kafka, Spark Streaming |
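As a rough illustration of the difference, here is a minimal Python sketch (not tied to any particular tool). `source` is assumed to be any iterable of records and `process` is a stand-in for the transformation/loading step; a real system would use a scheduler and a message broker such as Kafka rather than a plain loop.

```python
import time
from datetime import datetime

def process(records):
    """Stand-in for the transformation/loading step."""
    print(f"{datetime.now().isoformat()} processed {len(records)} records")

def batch_ingest(source, batch_window_seconds=60):
    """Batch: accumulate records, then process them all at once on a schedule."""
    buffer, deadline = [], time.time() + batch_window_seconds
    for record in source:
        buffer.append(record)
        if time.time() >= deadline:
            process(buffer)                    # one large chunk
            buffer, deadline = [], time.time() + batch_window_seconds
    if buffer:
        process(buffer)                        # flush the final partial batch

def stream_ingest(source):
    """Streaming: handle each event as soon as it arrives."""
    for record in source:
        process([record])                      # small increments, continuously
```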
2. Change Data Capture (CDC)
Just as the name suggests, change data capture (CDC) is a technique used to track and capture changes in a database so that those changes can be propagated to another system or used for downstream processing.
The changes can be inserts, updates, or deletes in a database.
How CDC works
1. Detect the changes: data sources (usually relational databases) are monitored for changes, identifying rows that have been inserted, updated, or deleted.
2. Capture the changes: the changes are logged in a table or a log, along with metadata (operation type, timestamp).
3. Deliver the changes: the changes are pushed or pulled to a target system (e.g., a data warehouse, an analytics tool, a streaming pipeline).
4. Process the changes: downstream systems (e.g., dashboards, alerts, machine learning models) update or react based on the new data.
How CDC is implemented
Database triggers
Timestamps
Transaction logs
Third-party tools/ services
Why use CDC
Efficiency: avoids full table scans to find out what has changed
Real-time replication: keeps systems in sync
Audit trails: records what changed, when and by whom
Data pipelines: feeds changes into data lakes, warehouses
Use cases
Data warehousing:
Keeps the data warehouse in sync with operational/transactional databases without reloading entire tables.
Source systems log changes via CDC -> ETL/ELT pipelines extract only the changed records (new, updated, or deleted) -> the changes are loaded into the data warehouse.
Microservices communication:
A monolith or microservice writes data to a shared database -> a CDC tool captures the changes and publishes them -> other services consume the change events and update their own state.
ETL optimization:
Reduce processing time and system load in ETL pipelines by extracting only what changed instead of doing full table scans.
ETL jobs query CDC change tables or logs instead of full source tables -> they extract only rows where lastModified or a CDC log entry exists since the last run; a minimal sketch of this pattern follows.
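To make the ETL-optimization idea concrete, here is a minimal sketch of timestamp-based CDC in Python. The `customers` table, its `last_modified` column, and the watermark value are all made up for illustration; production systems more commonly read the database transaction log with a dedicated CDC tool rather than polling timestamps.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for a source database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Amina", "2025-01-05T10:00:00+00:00"),
     (2, "Brian", "2024-12-20T09:00:00+00:00")],
)

def extract_changes(conn, last_run_ts):
    """Timestamp-based CDC: pull only rows changed since the previous pipeline run."""
    cursor = conn.execute(
        "SELECT id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_run_ts,),
    )
    return cursor.fetchall()

# Read the watermark saved by the previous run, extract only the changed rows,
# then advance the watermark for the next run.
last_run_ts = "2025-01-01T00:00:00+00:00"            # would normally come from state storage
changed_rows = extract_changes(conn, last_run_ts)    # -> only customer 1
new_watermark = datetime.now(timezone.utc).isoformat()
print(changed_rows)
```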
3. Idempotency
Idempotency means an operation can be performed multiple times without changing the result beyond the initial application, i.e., doing it once or doing it many times gives the same outcome.
In short: an idempotent operation can be repeated safely; on repeat, it has the same effect as running it once.
How it works
Say you’re calling an API or running a DB operation
Without idempotency, repeating the same operation multiple times could result in duplicate records or errors.
With idempotency, repeating the same operation does not affect the result; the system behaves as if the operation ran only once.
What Idempotency involves
Unique Identifiers:
- Track each operation using a unique key
- The server checks whether a request with the same key has already been processed
State checks:
- Systems check the current state before performing an action, e.g., only update if the status is still “Pending”
Idempotent methods:
In REST APIs, GET, PUT, and DELETE are idempotent; POST is not (but can be made idempotent using a unique key)
Upserts: in databases you can prevent duplicates by using logic like “update if exists, else insert”
Where Idempotency is used mostly
. APIs (POST/PUT) - prevent duplicate submissions
. Databases - prevent duplicate rows or repeated updates
. ETL/ Batch jobs - Rerunning jobs shouldn’t corrupt data
. Distributed Systems - Network retries must not reapply the same action
. Messaging systems - reprocessing messages should not repeat side effects
. Payments - backend uses the same payment_id and returns the existing result
Use Cases
1. Payments: a user clicks “Pay” but the network fails and the request is retried. Without idempotency the user could be charged twice; with idempotency the backend recognizes the same payment_id and returns the existing result.
2. API calls: if the same request is sent again with the same key, the server ignores it or returns the first response.
3. Database upserts: re-running an upsert leaves the table in the same state as running it once (see the sketch below).
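A minimal sketch of an idempotent upsert, using SQLite's `INSERT ... ON CONFLICT` (available in SQLite 3.24+); the `payments` table and `payment_id` key are hypothetical. The unique key acts as the idempotency key, so a retried write has the same effect as the first one.

```python
import sqlite3

# In-memory database for illustration; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount REAL, status TEXT)")

def record_payment(payment_id, amount):
    """Idempotent write: the unique payment_id acts as the idempotency key.
    Running this once or many times leaves exactly one row for the payment."""
    conn.execute(
        """
        INSERT INTO payments (payment_id, amount, status)
        VALUES (?, ?, 'completed')
        ON CONFLICT(payment_id) DO UPDATE SET amount = excluded.amount
        """,
        (payment_id, amount),
    )
    conn.commit()

# A retry with the same key does not create a duplicate charge.
record_payment("pay_123", 50.0)
record_payment("pay_123", 50.0)  # safe: same effect as the first call
print(conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # -> 1
```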
Importance
. Prevent duplicate transactions
. Improve system reliability
. Enable safe retries in unreliable networks
4. OLTP vs OLAP
Online Transaction Processing (OLTP)
Systems optimized for managing day-to-day transactional operations. They are designed for speed, consistency and concurrency.
Use Case
E-commerce systems (shopping carts, orders)
Banking systems
Ticket booking platforms
Example
A customer places an order in an online store; the OLTP system records the purchase, updates inventory, and creates an invoice.
Online Analytical Processing (OLAP)
Designed for analyzing large volumes of historical data, enabling business intelligence, reporting and decision-making
Use Case
Sales trends
Executive dashboards
Profitability analysis by region or product
Example
Analysts want to know “what were the total sales per region in Year1 vs Year2?”
This query runs on the OLAP system, which processes and aggregates the data quickly
OLTP vs OLAP differences

| | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Handles real-time transactions | Supports complex data analysis and reporting |
| Data type | Operational (current) data | Historical (aggregated, summarized) data |
| Operations | Insert, Update, Delete, and Read | Read-heavy (mostly SELECT queries) |
| Users | Frontline employees and operational systems | Analysts and decision-makers |
| Speed focus | High speed, low latency | Fast query performance on large datasets |
| Database design | Normalized (3NF), fewer indexes | Denormalized (star/snowflake schema) |
5. Columnar vs Row-based storage
These are two ways in which databases organize data.
Row-based storage:
Stores data row by row, so all the columns for a single row are stored together.
Efficient for OLTP workloads where whole records are frequently accessed or modified.
Columnar storage:
Stores data column by column
All values for a single column are stored together
Columnar storage excels at analytical (OLAP) queries that focus on specific columns across many rows or over large datasets.
For example, consider the query below:
SELECT AVG(AGE) FROM STUDENTS;
Columnar storage works better here because it only reads the AGE column. Row-based storage, on the other hand, has to read every full row and skip over the non-AGE columns, which is far less efficient.
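A toy illustration of the two layouts using plain Python structures (not a real storage engine): rows as a list of records versus columns as separate arrays, and how the AVG(AGE) query touches them differently.

```python
# Row-based layout: each record keeps all of its columns together.
rows = [
    {"id": 1, "name": "Amina", "age": 21},
    {"id": 2, "name": "Brian", "age": 24},
    {"id": 3, "name": "Cheryl", "age": 27},
]

# Columnar layout: each column is stored as its own contiguous array.
columns = {
    "id": [1, 2, 3],
    "name": ["Amina", "Brian", "Cheryl"],
    "age": [21, 24, 27],
}

# SELECT AVG(AGE) FROM STUDENTS;
# The row layout must touch every record and pick out the age field...
avg_row = sum(r["age"] for r in rows) / len(rows)
# ...while the column layout reads only the age array.
avg_col = sum(columns["age"]) / len(columns["age"])
print(avg_row, avg_col)  # 24.0 24.0
```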
6. Partitioning
Partitioning involves dividing a large dataset into smaller subsets called partitions. The partitions are stored and managed independently.
Data partitioning is done mostly for performance and scalability.
Suppose, for instance, you have a large dataset but only need to query a certain period, say the last month. Without partitioning, the engine would have to scan the entire dataset, which takes a lot of time; with partitioning, the engine skips all the irrelevant partitions and scans only the ones matching the filter, as in the sketch below.
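A minimal sketch of that idea, assuming a hypothetical directory layout where data is partitioned by month (e.g., `events/month=2025-07/`); the query for "last month" reads only the matching partition and skips the rest.

```python
from pathlib import Path
import csv

# Hypothetical layout: one directory per month, e.g. events/month=2025-07/part-0.csv
DATA_DIR = Path("events")

def read_partition(month):
    """Read only the files under the partition that matches the query filter
    (partition pruning), instead of scanning the whole dataset."""
    partition_dir = DATA_DIR / f"month={month}"
    records = []
    for file in partition_dir.glob("*.csv"):
        with file.open() as f:
            records.extend(csv.DictReader(f))
    return records

# Querying "last month" touches a single partition; all other months are skipped.
last_month_events = read_partition("2025-07")
```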
- Improved performance
i.e., faster query execution and data retrieval.
Because data is divided into smaller partitions, a query can target only the relevant subset, reducing the amount of data that needs to be scanned and processed.
- Enhanced scalability
Because partitions can be distributed across multiple nodes/servers, more resources can be added to handle increasing data volumes and demand.
- Increased Availability
Because there are several partitions, if one partition is unavailable the others can still be accessed, which helps ensure continued data availability.
Drawback
One drawback of partitioning is the extra management overhead, particularly when there are too many small partitions.
7. ETL vs ELT
Both are data integration methods, i.e., ways of moving and processing data.
They differ in that ETL (Extract, Transform, Load) transforms data before loading it into the target system, whereas ELT (Extract, Load, Transform) first loads data into the target system and then transforms it within the target.
ETL Steps
- Data is extracted from its source
- It is then transformed in a staging area
- The transformed data is then loaded into a data warehouse or data lake
ETL is better suited to complex transformations or cases where data quality and consistency are paramount
ELT Steps
- Data is extracted from source
- It is loaded directly into a data warehouse or data lake without transformation
- Transformation is applied within the target
ELT is more suitable for cloud-based data warehouses and data lakes.
It is efficient for handling large volumes of data and is well suited for analytics workloads (a sketch of both flows follows).
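A toy sketch contrasting the two flows; `extract`, `transform`, and `load` are stand-in functions, and in a real ELT pipeline the final transform step would typically be SQL executed by the warehouse engine.

```python
def extract():
    return [{"order_id": 1, "amount": "100.5"}, {"order_id": 2, "amount": "80.0"}]

def transform(rows):
    # e.g. type casting / cleaning done before or after loading
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    target.extend(rows)   # stand-in for writing to a warehouse table

# ETL: transform in a staging area, then load the clean data.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load raw data first, then transform inside the target
# (raw copies remain available for later re-transformation).
warehouse_elt_raw = []
load(extract(), warehouse_elt_raw)
warehouse_elt = transform(warehouse_elt_raw)

print(warehouse_etl == warehouse_elt)  # True: same end result, different order of steps
```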
Comparison

| | ETL | ELT |
| --- | --- | --- |
| Transform step | Before loading | After loading |
| Typical target | Legacy/on-prem warehouses | Cloud warehouses/lakes |
| Raw data stored? | No (only transformed) | Yes (raw + transformed) |
| Speed of loading | Slower (pre-transform) | Faster (load first) |
| Flexibility | Lower | Higher |
| Compute location | ETL tool/staging server | Data warehouse/lake |
8. CAP Theorem
The CAP theorem states that in a distributed data system, it is impossible to simultaneously achieve all three of Consistency, Availability, and Partition Tolerance; a distributed system can guarantee at most two of the three.
Consistency (C)
Every read receives the most recent write or an error. After a successful write, any subsequent read should return that updated value. The clients therefore see a single, up-to-date view of the data.
Example
In a banking system, if you deposit money at ATM A and then immediately check your balance at ATM B, the deposit should be reflected in the new balance.
Availability (A)
The system never refuses to answer: every request receives a valid (non-error) response, but it might not contain the latest data.
Example
In the banking system, if there is a network glitch between the data centers when you check your balance after depositing, an available system still responds, but it may show the previous balance instead of the current one.
Partition Tolerance (P)
The system continues to operate despite network failures or delays between nodes.
It keeps working even if messages between parts of the system are lost or delayed.
Example
A distributed database runs across three data centers; if communication to one of them breaks, partition tolerance means the others can still serve reads and writes despite being cut off from it.
If part of the network goes down, the system still runs using the reachable nodes.
Key
In real-world distributed systems, Partition Tolerance is non-negotiable because networks can and will fail, so the real trade-off in CAP is Consistency vs Availability during a partition:
CP systems: prefer correctness (may reject requests or return errors during a partition)
AP systems: prefer uptime (may return stale data)
9. Windowing in Streaming
Windowing in stream processing allows data to be processed in small, manageable chunks over a specified period. It is a technique used to handle large datasets, real-time data processing, and in-memory analytics.
The aim of windowing is to allow stream processing applications break down continuous data streams into manageable chunks for processing and analysis.
Data streams (e.g., sensor readings) never end, so to calculate metrics like count, sum, or average over time, you cannot wait for the stream to finish because it won't. Windowing lets you process data in slices of time or count, producing intermediate results.
Types of Windows
- Tumbling Window
Divides data streams into fixed-size, non-overlapping time intervals (see the sketch after this list)
Each event belongs to exactly one window
Use Case example
Website traffic monitoring: number of visits per minute
- Sliding Window
Creates overlapping intervals, allowing an event to be included in multiple windows
Defined by window length and slide interval.
They capture overlapping patterns and trends in the data stream
Use Case example
Network monitoring: track packet loss or error rate over overlapping periods
- Session Window
Groups events that occur within a specific timeframe, separated by periods of inactivity.
The window size is not fixed but determined by the gaps between events.
Window closes after a period of inactivity (gap)
Use Case example
E-commerce analytics: track user journey from first click to checkout
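As a minimal sketch of the tumbling-window case (assuming a simple in-memory list of timestamped events rather than a real stream processor), counting website visits per 60-second window could look like this:

```python
from collections import defaultdict

# Toy events: (timestamp_in_seconds, event_type)
events = [(3, "visit"), (42, "visit"), (61, "visit"), (65, "visit"), (130, "visit")]
WINDOW_SECONDS = 60

counts = defaultdict(int)
for ts, _ in events:
    # Each event lands in exactly one fixed-size, non-overlapping window.
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    counts[window_start] += 1

for start in sorted(counts):
    print(f"[{start}, {start + WINDOW_SECONDS}): {counts[start]} visits")
# windows: [0,60) -> 2, [60,120) -> 2, [120,180) -> 1
```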
10. DAGs and Workflow Orchestration
DAGs (Directed Acyclic Graphs)
A DAG is a graph structure used to represent tasks and their dependencies in a workflow.
DAGs provide a structured way to represent and manage complex processes.
Directed: each edge has a direction; A -> B means B depends on A
Acyclic: No loops or cycles
Graph: Made up of nodes (tasks) and edges (dependencies)
DAGS:
- ensure tasks execute in the right order
- make it clear which tasks can run in parallel and which must wait
- prevent infinite loops in workflows
Example
Extract -> Transform -> Load
Extract runs first, Transform waits for Extract, Load runs after Transform
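A minimal sketch of this Extract -> Transform -> Load DAG using only the Python standard library; the task functions are placeholders, and orchestrators (described next) add scheduling, retries, and monitoring on top of this core idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL DAG: keys are tasks, values are the tasks they depend on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

tasks = {
    "extract": lambda: print("extracting..."),
    "transform": lambda: print("transforming..."),
    "load": lambda: print("loading..."),
}

# Run tasks in dependency order; a cycle would raise CycleError,
# which is exactly what "acyclic" rules out.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```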
Workflow Orchestration
Process of defining, scheduling and monitoring workflows. Ensures every task happens at the right time.
A workflow orchestrator typically:
- defines workflows
- schedules workflows
- manages dependencies
- handles failures and retries
- monitors runs
11. Retry Logic & Dead Letter Queues
These are essential components for building robust and resilient distributed systems.
Retry Logic
The ability of a system to automatically reattempt an operation when it fails, instead of instantly giving up.
It is crucial for handling transient errors, which are temporary failures that are likely to resolve themselves with time.
Failures are often temporary (e.g., a network glitch or a busy server), and retrying can save you from losing messages.
How it works
When an operation fails, the system attempts to re-execute it after a short delay, and this can be repeated a number of times. Common retry strategies include the following (a sketch follows this list):
Immediate Retries - Try again instantly
Fixed Interval – Wait a fixed time before retrying.
Exponential Backoff – Increase wait time exponentially after each failure.
Jitter - Add randomness to wait times to avoid ‘retry storms’ when many clients retry at once
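A minimal sketch combining these strategies (exponential backoff plus jitter); `operation` is assumed to be any zero-argument callable that raises an exception on failure.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller may route the work to a DLQ
            delay = base_delay * (2 ** (attempt - 1))   # exponential backoff
            delay += random.uniform(0, base_delay)      # jitter avoids retry storms
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)
```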
Benefits
Improves system resilience by making applications more tolerant of temporary disruptions, which reduces the need for manual intervention and can prevent messages from being prematurely moved to a DLQ.
Dead Letter Queues (DLQs)
A special “quarantine” queue where failed messages are sent after they’ve been retried the allowed number of times but still fail
How it works
When a message fails to be processed after the configured retries, or if an unrecoverable error occurs, the message is moved from the main queue to the DLQ (see the sketch below).
- prevents problematic messages from blocking the main queue, ensuring the flow of other messages
- provides a centralized location to inspect failed messages and analyze the root cause of failures
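A minimal in-memory sketch of the retry-then-DLQ flow using Python's standard `queue` module as a stand-in for a real message broker; the message format and `handle` function are made up for illustration.

```python
from queue import Queue

main_queue: Queue = Queue()
dead_letter_queue: Queue = Queue()
MAX_RETRIES = 3

def handle(message):
    """Placeholder consumer; raises when processing fails."""
    if message.get("poison"):
        raise ValueError("cannot process message")

def consume():
    while not main_queue.empty():
        message = main_queue.get()
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                handle(message)
                break
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    # Quarantine the message so it stops blocking the main queue.
                    dead_letter_queue.put({"message": message, "error": str(exc)})

main_queue.put({"id": 1})
main_queue.put({"id": 2, "poison": True})
consume()
print(dead_letter_queue.qsize())  # -> 1 message parked for inspection
```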
12. Backfilling & Reprocessing
Techniques used to ensure data completeness and accuracy in data pipelines
Backfilling
Running data pipelines for past time periods to fill in missing or incomplete data
Purpose
Data consistency: Ensures that all historical data is consistent and complete, avoiding inconsistencies or missing data points
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data
Regulatory compliance: Helps meet regulatory requirements by ensuring that historical data is complete and accurate
Accurate analytics: Provides a complete and accurate historical dataset for analysis and reporting, preventing skewed results
Goal: Fill missing historical data
Trigger: Data gap or late arrival
Data State: No data exists for that time period
Example: Adding data for Jan 1–14 that was never loaded (see the sketch below)
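A minimal sketch of a backfill loop for that example, assuming the daily pipeline can be parameterized by date (`run_pipeline` is a placeholder):

```python
from datetime import date, timedelta

def run_pipeline(day):
    """Stand-in for the normal daily pipeline run, parameterized by date."""
    print(f"processing partition for {day}")

def backfill(start, end):
    """Re-run the pipeline for every day in a past window that was never loaded."""
    current = start
    while current <= end:
        run_pipeline(current)
        current += timedelta(days=1)

# Fill the gap for Jan 1-14 that was never loaded.
backfill(date(2025, 1, 1), date(2025, 1, 14))
```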
Reprocessing
Running your data pipelines again for data that has already been processed, usually to correct or update results
Purpose
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data.
Code updates: Allows for the application of new code or logic to existing data.
Data quality improvements: Ensures that the data is in the desired state by re-running the processing logic.
Goal: Correct or update existing processed data
Trigger: Logic change, bug fix, or source update
Data State: Data already exists but is wrong or outdated
Example: Correcting sales data from Jan 1–14 that was miscalculated
13. Data Governance
Refers to the policies, procedures, and practices that ensure data is managed, secured, and used effectively throughout its lifecycle. It is basically the set of “rules” for how data is collected, stored, transformed, and accessed, so that your data remains accurate, consistent, secure, and compliant.
Key Components
Data Quality - Implement validation checks in ETL/ELT pipelines (a sketch follows this list)
Metadata Management - Store information about data sources, transformations, and lineage
Data Lineage - Track how data moves from source → transformations → destination. Helps in debugging and compliance audits.
Access Control & Security - Use role-based access control (RBAC), encryption, masking, and tokenization to protect sensitive data
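A minimal sketch of the data-quality checks mentioned above; the rules and column names are made up, and real pipelines typically run such checks with dedicated validation frameworks.

```python
# Toy data-quality gate in a pipeline; rules and column names are hypothetical.
rows = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": -5.0},     # violates the "amount must be non-negative" rule
    {"order_id": None, "amount": 20.0},  # violates the "order_id is required" rule
]

def validate(row):
    errors = []
    if row["order_id"] is None:
        errors.append("missing order_id")
    if row["amount"] is not None and row["amount"] < 0:
        errors.append("negative amount")
    return errors

valid, rejected = [], []
for row in rows:
    (rejected if validate(row) else valid).append(row)

print(f"{len(valid)} rows passed, {len(rejected)} rejected for review")
```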
Data Ingestion
Governance rules decide what data can be ingested, from which sources, and at what frequency
Data Transformation
Apply quality checks, enrichment rules, and ensure changes are logged for lineage tracking
Data Storage
Follow partitioning, retention, and archival rules
Apply encryption and role-based access at the storage layer
Data Access
Use a data catalog so users can discover data and understand its meaning before querying
Enforce least-privilege access
Monitoring & Auditing
Automated alerts when governance rules are violated
14. Time Travel and Data Versioning
Techniques that let you query, restore, or compare data from a specific point in the past — almost like a “rewind” button for your datasets
Time Travel
Refers to the ability to access historical versions of data as they existed at previous points in time.
Time travel makes it cheap to compare previous versions of data, analyze performance over time, audit data changes (often required for compliance), and reproduce the results of machine learning models.
How it works
Systems like Delta Lake store changes as immutable snapshots with transaction logs
Each write (insert/update/delete) produces a new version, but older files are retained until cleanup (compaction or vacuum)
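As a hedged sketch of how this looks with Delta Lake (assuming PySpark with the Delta Lake package configured and a table already written at the hypothetical `delta_path`), an older version can be read back by version number or by timestamp:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured
delta_path = "/data/events"                 # hypothetical table location

# Read the table as of an earlier version (each write created a new version).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# Or read it as of a point in time.
as_of_aug1 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2025-08-01 00:00:00")
    .load(delta_path)
)
```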
Data Versioning
Data versioning is the storage of different versions of data that were created or changed at specific points in time
Benefits
Safe experimentation (branch & merge like Git)
Reproducibility for ML models and analytics
Ability to restore to a stable version after a bad pipeline run
15. Distributed Processing Concepts
The concept of dividing a large task into smaller parts that are processed concurrently across multiple interconnected computers or nodes, rather than on a single system.
- Instead of a single powerful computer handling everything, multiple computers (nodes) work together.
- Tasks are broken down and executed simultaneously on different machines, significantly speeding up processing time.
- Nodes communicate and coordinate their actions through a network
This approach enhances performance, scalability, and fault tolerance, making it ideal for handling large-scale data processing and complex computations
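As a single-machine stand-in for the idea, the sketch below splits a computation into chunks and runs them concurrently in worker processes; in a real distributed system the workers would be separate machines coordinated over a network (e.g., by a framework such as Spark).

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """Work done independently by one worker on its slice of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # split the task into 4 parts

    # Each chunk is processed concurrently by a separate worker process,
    # and the partial results are combined at the end.
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))

    print(total)
```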
Importance
- Improved performance
Parallel processing can dramatically reduce the time it takes to complete a task.
- Scalability
Systems can easily be scaled by adding more nodes as needed to handle increasing workloads.
- Fault tolerance
If one node fails, the others can often continue working, preventing complete system failure.
- Cost-effectiveness
Using multiple commodity computers can be more affordable than relying on a single, powerful (and expensive) machine.
- Concurrency
Nodes execute tasks at the same time.
- Independent failure
Nodes can fail without bringing down the entire system.
- Resource sharing
Nodes can share resources like processing power, storage, and data
Examples
Cloud computing
Services like AWS and Google Cloud use distributed processing to offer on-demand computing resources.
Big data processing
Frameworks like Apache Hadoop and Spark leverage distributed processing for analyzing massive datasets.
Database systems
Many database systems are designed with distributed processing to handle large amounts of data and high traffic