This content originally appeared on DEV Community and was authored by susan waweru
Data engineering is the discipline of designing, building, and maintaining the systems and workflows that make data accessible, reliable, and ready for analysis. It is the behind-the-scenes backbone of modern analytics, machine learning, and business intelligence.
In this article, we will explore some foundational concepts in data engineering.
1. Batch vs Streaming Ingestion
- These are two methods of getting data into a system (ingestion) for processing, analytics, or storage.
Batch Ingestion
What it is:
Involves collecting data over a period of time, then processing it all at once in a batch.
When used:
Batch ingestion is used when immediate results are not needed and when large volumes of data must be processed at once.
- Data is collected and stored temporarily over a period of time
- At a set time, a scheduled job runs to read and process the data in chunks
Example use case
Data warehouse: loading data into a data warehouse from operational systems daily
Streaming Ingestion
What it is:
Involves collecting data in real time as it arrives.
It's continuously ingested and processed without waiting for a batch window
When used:
Used when real-time insights and fast reactions are needed, e.g., fraud detection or real-time alerts
- data flows continuously
- data is processed in small increments
Example use case
Stock trading - the trade data in financial markets is ingested in real time
Batch vs Streaming Ingestion

| | Batch | Streaming |
| --- | --- | --- |
| Processing of data | Scheduled | Real-time |
| Data delay | Minutes to hours | Seconds to milliseconds |
| Data volume | Large chunks of data at once | Small events, continuously |
| Use cases | Reports, analytics, ETL jobs | Real-time alerts, monitoring, dashboards |
| Examples | Payroll, web logs | Stock prices, IoT sensor data |
| Tools | SSIS | Apache Kafka, Spark Streaming |
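As a rough illustration of the difference, here is a minimal Python sketch (not tied to any particular tool). `source` is assumed to be any iterable of records and `process` is a stand-in for the transformation/loading step; a real system would use a scheduler and a message broker such as Kafka rather than a plain loop.

```python
import time
from datetime import datetime

def process(records):
    """Stand-in for the transformation/loading step."""
    print(f"{datetime.now().isoformat()} processed {len(records)} records")

def batch_ingest(source, batch_window_seconds=60):
    """Batch: accumulate records, then process them all at once on a schedule."""
    buffer, deadline = [], time.time() + batch_window_seconds
    for record in source:
        buffer.append(record)
        if time.time() >= deadline:
            process(buffer)                    # one large chunk
            buffer, deadline = [], time.time() + batch_window_seconds
    if buffer:
        process(buffer)                        # flush the final partial batch

def stream_ingest(source):
    """Streaming: handle each event as soon as it arrives."""
    for record in source:
        process([record])                      # small increments, continuously
```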
2. Change Data Capture (CDC)
Just as the name suggests, change data capture (CDC) is a technique used to track and capture changes in a database so that those changes can be propagated to another system or used for downstream processing.
The changes can be inserts, updates, or deletes in a database.
How CDC works
1. Detect the changes: data sources (usually relational databases) are monitored for changes, identifying rows that have been inserted, updated, or deleted.
2. Capture the changes: the changes are logged in a table or a log, along with metadata (operation type, timestamp).
3. Deliver the changes: the changes are pushed or pulled to a target system (e.g., a data warehouse, an analytics tool, a streaming pipeline).
4. Process the changes: downstream systems (e.g., dashboards, alerts, machine learning models) update or react based on the new data.
How CDC is implemented
Database triggers
Timestamps
Transaction logs
Third-party tools/ services
Why use CDC
Efficiency: avoids full table scans to find out what has changed
Real-time replication: keeps systems in sync
Audit trails: records what changed, when and by whom
Data pipelines: feeds changes into data lakes, warehouses
Use cases
Data warehousing:
Keeps the data warehouse in sync with operational/transactional databases without reloading entire tables.
Source systems log changes via CDC -> ETL/ELT pipelines extract only the changed records (new, updated, or deleted) -> the changes are loaded into the data warehouse.
Microservices communication:
A monolith or microservice writes data to a shared database -> a CDC tool captures the changes and publishes them -> other services consume the change events and update their own state.
ETL optimization:
Reduce processing time and system load in ETL pipelines by extracting only what changed instead of doing full table scans.
ETL jobs query CDC change tables or logs instead of full source tables -> they extract only rows where lastModified or a CDC log entry exists since the last run; a minimal sketch of this pattern follows.
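To make the ETL-optimization idea concrete, here is a minimal sketch of timestamp-based CDC in Python. The `customers` table, its `last_modified` column, and the watermark value are all made up for illustration; production systems more commonly read the database transaction log with a dedicated CDC tool rather than polling timestamps.

```python
import sqlite3
from datetime import datetime, timezone

# In-memory stand-in for a source database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Amina", "2025-01-05T10:00:00+00:00"),
     (2, "Brian", "2024-12-20T09:00:00+00:00")],
)

def extract_changes(conn, last_run_ts):
    """Timestamp-based CDC: pull only rows changed since the previous pipeline run."""
    cursor = conn.execute(
        "SELECT id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_run_ts,),
    )
    return cursor.fetchall()

# Read the watermark saved by the previous run, extract only the changed rows,
# then advance the watermark for the next run.
last_run_ts = "2025-01-01T00:00:00+00:00"            # would normally come from state storage
changed_rows = extract_changes(conn, last_run_ts)    # -> only customer 1
new_watermark = datetime.now(timezone.utc).isoformat()
print(changed_rows)
```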
3. Idempotency
Idempotency means an operation can be performed multiple times without changing the result beyond the initial application, i.e., doing it once or doing it many times gives the same outcome.
In short: an idempotent operation can be repeated safely; on repeat, it has the same effect as running it once.
How it works
Say you’re calling an API or running a DB operation
Without idempotency, repeating the same operation multiple times could result in duplicate records or errors.
With idempotency, repeating the same operation does not affect the result; the system behaves as if the operation ran only once.
What Idempotency involves
Unique Identifiers:
- Track each operation using a unique key
- The server checks whether a request with the same key has already been processed
State checks:
- Systems check the current state before performing an action, e.g., only update if the status is still “Pending”
Idempotent methods:
In REST APIs, GET, PUT, and DELETE are idempotent; POST is not (but can be made idempotent using a unique key)
Upserts: in databases you can prevent duplicates by using logic like “update if exists, else insert”
Where Idempotency is used mostly
. APIs (POST/PUT) - prevent duplicate submissions
. Databases - prevent duplicate rows or repeated updates
. ETL/ Batch jobs - Rerunning jobs shouldn’t corrupt data
. Distributed Systems - Network retries must not reapply the same action
. Messaging systems - reprocessing messages should not repeat side effects
. Payments - backend uses the same payment_id and returns the existing result
Use Cases
1. Payments: a user clicks “Pay” but the network fails and the request is retried. Without idempotency the user could be charged twice; with idempotency the backend recognizes the same payment_id and returns the existing result.
2. API calls: if the same request is sent again with the same key, the server ignores it or returns the first response.
3. Database upserts: re-running an upsert leaves the table in the same state as running it once (see the sketch below).
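A minimal sketch of an idempotent upsert, using SQLite's `INSERT ... ON CONFLICT` (available in SQLite 3.24+); the `payments` table and `payment_id` key are hypothetical. The unique key acts as the idempotency key, so a retried write has the same effect as the first one.

```python
import sqlite3

# In-memory database for illustration; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, amount REAL, status TEXT)")

def record_payment(payment_id, amount):
    """Idempotent write: the unique payment_id acts as the idempotency key.
    Running this once or many times leaves exactly one row for the payment."""
    conn.execute(
        """
        INSERT INTO payments (payment_id, amount, status)
        VALUES (?, ?, 'completed')
        ON CONFLICT(payment_id) DO UPDATE SET amount = excluded.amount
        """,
        (payment_id, amount),
    )
    conn.commit()

# A retry with the same key does not create a duplicate charge.
record_payment("pay_123", 50.0)
record_payment("pay_123", 50.0)  # safe: same effect as the first call
print(conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # -> 1
```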
Importance
. Prevent duplicate transactions
. Improve system reliability
. Enable safe retries in unreliable networks
4. OLTP vs OLAP
Online Transaction Processing (OLTP)
Systems optimized for managing day-to-day transactional operations. They are designed for speed, consistency and concurrency.
Use Case
E-commerce systems (shopping carts, orders)
Banking systems
Ticket booking platforms
Example
A customer places an order in an online store; the OLTP system records the purchase, updates inventory, and creates an invoice.
Online Analytical Processing (OLAP)
Designed for analyzing large volumes of historical data, enabling business intelligence, reporting and decision-making
Use Case
Sales trends
Executive dashboards
Profitability analysis by region or product
Example
Analysts want to know “what were the total sales per region in Year1 vs Year2?”
This query runs on the OLAP system, which processes and aggregates the data quickly
OLTP vs OLAP differences

| | OLTP | OLAP |
| --- | --- | --- |
| Purpose | Handles real-time transactions | Supports complex data analysis and reporting |
| Data type | Operational (current) data | Historical (aggregated, summarized) data |
| Operations | Insert, Update, Delete, and Read | Read-heavy (mostly SELECT queries) |
| Users | Frontline employees and operational systems | Analysts and decision-makers |
| Speed focus | High speed, low latency | Fast query performance on large datasets |
| Database design | Normalized (3NF), fewer indexes | Denormalized (star/snowflake schema) |
5. Columnar vs Row-based storage
These are two ways in which databases organize data.
Row-based storage:
Stores data row by row, so all the columns for a single row are stored together.
Efficient for OLTP workloads where whole records are frequently accessed or modified.
Columnar storage:
Stores data column by column
All values for a single column are stored together
Columnar storage excels at analytical (OLAP) queries that focus on specific columns across many rows or over large datasets.
For example, consider the query below:
SELECT AVG(AGE) FROM STUDENTS;
Columnar storage works better here because it only reads the AGE column. Row-based storage, on the other hand, has to read every full row and skip over the non-AGE columns, which is far less efficient.
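A toy illustration of the two layouts using plain Python structures (not a real storage engine): rows as a list of records versus columns as separate arrays, and how the AVG(AGE) query touches them differently.

```python
# Row-based layout: each record keeps all of its columns together.
rows = [
    {"id": 1, "name": "Amina", "age": 21},
    {"id": 2, "name": "Brian", "age": 24},
    {"id": 3, "name": "Cheryl", "age": 27},
]

# Columnar layout: each column is stored as its own contiguous array.
columns = {
    "id": [1, 2, 3],
    "name": ["Amina", "Brian", "Cheryl"],
    "age": [21, 24, 27],
}

# SELECT AVG(AGE) FROM STUDENTS;
# The row layout must touch every record and pick out the age field...
avg_row = sum(r["age"] for r in rows) / len(rows)
# ...while the column layout reads only the age array.
avg_col = sum(columns["age"]) / len(columns["age"])
print(avg_row, avg_col)  # 24.0 24.0
```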
6. Partitioning
Partitioning involves dividing a large dataset into smaller subsets called partitions. The partitions are stored and managed independently.
Data partitioning is done mostly for performance and scalability.
Suppose, for instance, you have a large dataset but only need to query a certain period, say the last month. Without partitioning, the engine would have to scan the entire dataset, which takes a lot of time; with partitioning, the engine skips all the irrelevant partitions and scans only the ones matching the filter, as in the sketch below.
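A minimal sketch of that idea, assuming a hypothetical directory layout where data is partitioned by month (e.g., `events/month=2025-07/`); the query for "last month" reads only the matching partition and skips the rest.

```python
from pathlib import Path
import csv

# Hypothetical layout: one directory per month, e.g. events/month=2025-07/part-0.csv
DATA_DIR = Path("events")

def read_partition(month):
    """Read only the files under the partition that matches the query filter
    (partition pruning), instead of scanning the whole dataset."""
    partition_dir = DATA_DIR / f"month={month}"
    records = []
    for file in partition_dir.glob("*.csv"):
        with file.open() as f:
            records.extend(csv.DictReader(f))
    return records

# Querying "last month" touches a single partition; all other months are skipped.
last_month_events = read_partition("2025-07")
```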
- Improved performance
i.e., faster query execution and data retrieval.
Because data is divided into smaller partitions, a query can target only the relevant subset, reducing the amount of data that needs to be scanned and processed.
- Enhanced scalability
Because partitions can be distributed across multiple nodes/servers, more resources can be added to handle increasing data volumes and demand.
- Increased Availability
Because there are several partitions, if one partition is unavailable the others can still be accessed, which helps ensure continued data availability.
Drawback
One drawback of partitioning is the extra management overhead, particularly when there are too many small partitions.
7. ETL vs ELT
Both are data integration methods, i.e., ways of moving and processing data.
They differ in that ETL (Extract, Transform, Load) transforms data before loading it into the target system, whereas ELT (Extract, Load, Transform) first loads data into the target system and then transforms it within the target.
ETL Steps
- Data is extracted from its source
- It is then transformed in a staging area
- The transformed data is then loaded into a data warehouse or data lake
ETL is better suited to complex transformations or cases where data quality and consistency are paramount
ELT Steps
- Data is extracted from source
- It is loaded directly into a data warehouse or data lake without transformation
- Transformation is applied within the target
ELT is more suitable for cloud-based data warehouses and data lakes.
It is efficient for handling large volumes of data and is well suited for analytics workloads (a sketch of both flows follows).
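A toy sketch contrasting the two flows; `extract`, `transform`, and `load` are stand-in functions, and in a real ELT pipeline the final transform step would typically be SQL executed by the warehouse engine.

```python
def extract():
    return [{"order_id": 1, "amount": "100.5"}, {"order_id": 2, "amount": "80.0"}]

def transform(rows):
    # e.g. type casting / cleaning done before or after loading
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    target.extend(rows)   # stand-in for writing to a warehouse table

# ETL: transform in a staging area, then load the clean data.
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load raw data first, then transform inside the target
# (raw copies remain available for later re-transformation).
warehouse_elt_raw = []
load(extract(), warehouse_elt_raw)
warehouse_elt = transform(warehouse_elt_raw)

print(warehouse_etl == warehouse_elt)  # True: same end result, different order of steps
```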
Comparison

| | ETL | ELT |
| --- | --- | --- |
| Transform step | Before loading | After loading |
| Typical target | Legacy/on-prem warehouses | Cloud warehouses/lakes |
| Raw data stored? | No (only transformed) | Yes (raw + transformed) |
| Speed of loading | Slower (pre-transform) | Faster (load first) |
| Flexibility | Lower | Higher |
| Compute location | ETL tool/staging server | Data warehouse/lake |
8. CAP Theorem
The CAP theorem states that in a distributed data system, it is impossible to simultaneously achieve all three of Consistency, Availability, and Partition Tolerance; a distributed system can guarantee at most two of the three.
Consistency (C)
Every read receives the most recent write or an error. After a successful write, any subsequent read should return that updated value. The clients therefore see a single, up-to-date view of the data.
Example
In a banking system, if you deposit money at ATM A and then immediately check your balance at ATM B, the deposit should be reflected in the new balance.
Availability (A)
The system never refuses to answer: every request receives a valid (non-error) response, but it might not contain the latest data.
Example
In the banking system, if there is a network glitch between the data centers when you check your balance after depositing, an available system still responds, but it may show the previous balance instead of the current one.
Partition Tolerance (P)
The system continues to operate despite network failures or delays between nodes.
It keeps working even if messages between parts of the system are lost or delayed.
Example
A distributed database runs across three data centers; if communication to one of them breaks, partition tolerance means the others can still serve reads and writes despite being cut off from it.
If part of the network goes down, the system still runs using the reachable nodes.
Key
In real-world distributed systems, Partition Tolerance is non-negotiable because networks can and will fail, so the real trade-off in CAP is Consistency vs Availability during a partition:
CP systems: prefer correctness (may reject requests or return errors during a partition)
AP systems: prefer uptime (may return stale data)
9. Windowing in Streaming
Windowing in stream processing allows data to be processed in small, manageable chunks over a specified period. It is a technique used to handle large datasets, real-time data processing, and in-memory analytics.
The aim of windowing is to allow stream processing applications break down continuous data streams into manageable chunks for processing and analysis.
Data streams (e.g., sensor readings) never end, so to calculate metrics like count, sum, or average over time, you cannot wait for the stream to finish because it won't. Windowing lets you process data in slices of time or count, producing intermediate results.
Types of Windows
- Tumbling Window
Divides data streams into fixed-size, non-overlapping time intervals (see the sketch after this list)
Each event belongs to exactly one window
Use Case example
Website traffic monitoring: number of visits per minute
- Sliding Window
Creates overlapping intervals, allowing an event to be included in multiple windows
Defined by window length and slide interval.
They capture overlapping patterns and trends in the data stream
Use Case example
Network monitoring: track packet loss or error rate over overlapping periods
- Session Window
Groups events that occur within a specific timeframe, separated by periods of inactivity.
The window size is not fixed but determined by the gaps between events.
Window closes after a period of inactivity (gap)
Use Case example
E-commerce analytics: track user journey from first click to checkout
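As a minimal sketch of the tumbling-window case (assuming a simple in-memory list of timestamped events rather than a real stream processor), counting website visits per 60-second window could look like this:

```python
from collections import defaultdict

# Toy events: (timestamp_in_seconds, event_type)
events = [(3, "visit"), (42, "visit"), (61, "visit"), (65, "visit"), (130, "visit")]
WINDOW_SECONDS = 60

counts = defaultdict(int)
for ts, _ in events:
    # Each event lands in exactly one fixed-size, non-overlapping window.
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    counts[window_start] += 1

for start in sorted(counts):
    print(f"[{start}, {start + WINDOW_SECONDS}): {counts[start]} visits")
# windows: [0,60) -> 2, [60,120) -> 2, [120,180) -> 1
```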
10. DAGs and Workflow Orchestration
DAGs (Directed Acyclic Graphs)
A DAG is a graph structure used to represent tasks and their dependencies in a workflow.
DAGs provide a structured way to represent and manage complex processes.
Directed: each edge has a direction; A -> B means B depends on A
Acyclic: No loops or cycles
Graph: Made up of nodes (tasks) and edges (dependencies)
DAGS:
- ensure tasks execute in the right order
- make it clear which tasks can run in parallel and which must wait
- prevent infinite loops in workflows
Example
Extract -> Transform -> Load
Extract runs first, Transform waits for Extract, Load runs after Transform
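A minimal sketch of this Extract -> Transform -> Load DAG using only the Python standard library; the task functions are placeholders, and orchestrators (described next) add scheduling, retries, and monitoring on top of this core idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL DAG: keys are tasks, values are the tasks they depend on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

tasks = {
    "extract": lambda: print("extracting..."),
    "transform": lambda: print("transforming..."),
    "load": lambda: print("loading..."),
}

# Run tasks in dependency order; a cycle would raise CycleError,
# which is exactly what "acyclic" rules out.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```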
Workflow Orchestration
Process of defining, scheduling and monitoring workflows. Ensures every task happens at the right time.
A workflow orchestrator typically:
- defines workflows
- schedules workflows
- manages dependencies
- handles failures and retries
- monitors runs
11. Retry Logic & Dead Letter Queues
These are essential components for building robust and resilient distributed systems.
Retry Logic
The ability of a system to automatically reattempt an operation when it fails, instead of instantly giving up.
It is crucial for handling transient errors, which are temporary failures that are likely to resolve themselves with time.
Failures are often temporary (e.g., a network glitch or a busy server), and retrying can save you from losing messages.
How it works
When an operation fails, the system attempts to re-execute it after a short delay, and this can be repeated a number of times. Common retry strategies include the following (a sketch follows this list):
Immediate Retries - Try again instantly
Fixed Interval – Wait a fixed time before retrying.
Exponential Backoff – Increase wait time exponentially after each failure.
Jitter - Add randomness to wait times to avoid ‘retry storms’ when many clients retry at once
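A minimal sketch combining these strategies (exponential backoff plus jitter); `operation` is assumed to be any zero-argument callable that raises an exception on failure.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5):
    """Retry a flaky operation with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # retries exhausted; the caller may route the work to a DLQ
            delay = base_delay * (2 ** (attempt - 1))   # exponential backoff
            delay += random.uniform(0, base_delay)      # jitter avoids retry storms
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)
```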
Benefits
Improves system resilience by making applications more tolerant of temporary disruptions, which reduces the need for manual intervention and can prevent messages from being prematurely moved to a DLQ.
Dead Letter Queues (DLQs)
A special “quarantine” queue where failed messages are sent after they’ve been retried the allowed number of times but still fail
How it works
When a message fails to be processed after the configured retries, or if an unrecoverable error occurs, the message is moved from the main queue to the DLQ (see the sketch below).
- prevents problematic messages from blocking the main queue, ensuring the flow of other messages
- provides a centralized location to inspect failed messages and analyze the root cause of failures
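A minimal in-memory sketch of the retry-then-DLQ flow using Python's standard `queue` module as a stand-in for a real message broker; the message format and `handle` function are made up for illustration.

```python
from queue import Queue

main_queue: Queue = Queue()
dead_letter_queue: Queue = Queue()
MAX_RETRIES = 3

def handle(message):
    """Placeholder consumer; raises when processing fails."""
    if message.get("poison"):
        raise ValueError("cannot process message")

def consume():
    while not main_queue.empty():
        message = main_queue.get()
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                handle(message)
                break
            except Exception as exc:
                if attempt == MAX_RETRIES:
                    # Quarantine the message so it stops blocking the main queue.
                    dead_letter_queue.put({"message": message, "error": str(exc)})

main_queue.put({"id": 1})
main_queue.put({"id": 2, "poison": True})
consume()
print(dead_letter_queue.qsize())  # -> 1 message parked for inspection
```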
12. Backfilling & Reprocessing
Techniques used to ensure data completeness and accuracy in data pipelines
Backfilling
Running data pipelines for past time periods to fill in missing or incomplete data
Purpose
Data consistency: Ensures that all historical data is consistent and complete, avoiding inconsistencies or missing data points
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data
Regulatory compliance: Helps meet regulatory requirements by ensuring that historical data is complete and accurate
Accurate analytics: Provides a complete and accurate historical dataset for analysis and reporting, preventing skewed results
Goal: Fill missing historical data
Trigger: Data gap or late arrival
Data State: No data exists for that time period
Example: Adding data for Jan 1–14 that was never loaded (see the sketch below)
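A minimal sketch of a backfill loop for that example, assuming the daily pipeline can be parameterized by date (`run_pipeline` is a placeholder):

```python
from datetime import date, timedelta

def run_pipeline(day):
    """Stand-in for the normal daily pipeline run, parameterized by date."""
    print(f"processing partition for {day}")

def backfill(start, end):
    """Re-run the pipeline for every day in a past window that was never loaded."""
    current = start
    while current <= end:
        run_pipeline(current)
        current += timedelta(days=1)

# Fill the gap for Jan 1-14 that was never loaded.
backfill(date(2025, 1, 1), date(2025, 1, 14))
```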
Reprocessing
Running your data pipelines again for data that has already been processed, usually to correct or update results
Purpose
Error correction: Corrects errors or inconsistencies in historical data that may have been introduced by bugs, system failures, or incorrect data.
Code updates: Allows for the application of new code or logic to existing data.
Data quality improvements: Ensures that the data is in the desired state by re-running the processing logic.
Goal: Correct or update existing processed data
Trigger: Logic change, bug fix, or source update
Data State: Data already exists but is wrong or outdated
Example: Correcting sales data from Jan 1–14 that was miscalculated
13. Data Governance
Refers to the policies, procedures, and practices that ensure data is managed, secured, and used effectively throughout its lifecycle. It is basically the set of “rules” for how data is collected, stored, transformed, and accessed, so that your data remains accurate, consistent, secure, and compliant.
Key Components
Data Quality - Implement validation checks in ETL/ELT pipelines (a sketch follows this list)
Metadata Management - Store information about data sources, transformations, and lineage
Data Lineage - Track how data moves from source → transformations → destination. Helps in debugging and compliance audits.
Access Control & Security - Use role-based access control (RBAC), encryption, masking, and tokenization to protect sensitive data
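A minimal sketch of the data-quality checks mentioned above; the rules and column names are made up, and real pipelines typically run such checks with dedicated validation frameworks.

```python
# Toy data-quality gate in a pipeline; rules and column names are hypothetical.
rows = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": -5.0},     # violates the "amount must be non-negative" rule
    {"order_id": None, "amount": 20.0},  # violates the "order_id is required" rule
]

def validate(row):
    errors = []
    if row["order_id"] is None:
        errors.append("missing order_id")
    if row["amount"] is not None and row["amount"] < 0:
        errors.append("negative amount")
    return errors

valid, rejected = [], []
for row in rows:
    (rejected if validate(row) else valid).append(row)

print(f"{len(valid)} rows passed, {len(rejected)} rejected for review")
```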
Data Ingestion
Governance rules decide what data can be ingested, from which sources, and at what frequency
Data Transformation
Apply quality checks, enrichment rules, and ensure changes are logged for lineage tracking
Data Storage
Follow partitioning, retention, and archival rules
Apply encryption and role-based access at the storage layer
Data Access
Use a data catalog so users can discover data and understand its meaning before querying
Enforce least-privilege access
Monitoring & Auditing
Automated alerts when governance rules are violated
14. Time Travel and Data Versioning
Techniques that let you query, restore, or compare data from a specific point in the past — almost like a “rewind” button for your datasets
Time Travel
Refers to the ability to access historical versions of data as they existed at previous points in time.
Time travel makes it cheap to compare previous versions of data, analyze performance over time, audit data changes (often required for compliance), and reproduce the results of machine learning models.
How it works
Systems like Delta Lake store changes as immutable snapshots with transaction logs
Each write (insert/update/delete) produces a new version, but older files are retained until cleanup (compaction or vacuum)
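As a hedged sketch of how this looks with Delta Lake (assuming PySpark with the Delta Lake package configured and a table already written at the hypothetical `delta_path`), an older version can be read back by version number or by timestamp:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake extensions are configured
delta_path = "/data/events"                 # hypothetical table location

# Read the table as of an earlier version (each write created a new version).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# Or read it as of a point in time.
as_of_aug1 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2025-08-01 00:00:00")
    .load(delta_path)
)
```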
Data Versioning
Data versioning is the storage of different versions of data that were created or changed at specific points in time
Benefits
Safe experimentation (branch & merge like Git)
Reproducibility for ML models and analytics
Ability to restore to a stable version after a bad pipeline run
15. Distributed Processing Concepts
The concept of dividing a large task into smaller parts that are processed concurrently across multiple interconnected computers or nodes, rather than on a single system.
- Instead of a single powerful computer handling everything, multiple computers (nodes) work together.
- Tasks are broken down and executed simultaneously on different machines, significantly speeding up processing time.
- Nodes communicate and coordinate their actions through a network
This approach enhances performance, scalability, and fault tolerance, making it ideal for handling large-scale data processing and complex computations
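As a single-machine stand-in for the idea, the sketch below splits a computation into chunks and runs them concurrently in worker processes; in a real distributed system the workers would be separate machines coordinated over a network (e.g., by a framework such as Spark).

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    """Work done independently by one worker on its slice of the data."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # split the task into 4 parts

    # Each chunk is processed concurrently by a separate worker process,
    # and the partial results are combined at the end.
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(partial_sum, chunks))

    print(total)
```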
Importance
- Improved performance
Parallel processing can dramatically reduce the time it takes to complete a task.
- Scalability
Systems can easily be scaled by adding more nodes as needed to handle increasing workloads.
- Fault tolerance
If one node fails, the others can often continue working, preventing complete system failure.
- Cost-effectiveness
Using multiple commodity computers can be more affordable than relying on a single, powerful (and expensive) machine.
- Concurrency
Nodes execute tasks at the same time.
- Independent failure
Nodes can fail without bringing down the entire system.
- Resource sharing
Nodes can share resources like processing power, storage, and data
Examples
Cloud computing
Services like AWS and Google Cloud use distributed processing to offer on-demand computing resources.
Big data processing
Frameworks like Apache Hadoop and Spark leverage distributed processing for analyzing massive datasets.
Database systems
Many database systems are designed with distributed processing to handle large amounts of data and high traffic