Welcome back to Day 2 of the 60-Day Spark Mastery Series.
Today, we dive into the core of Spark’s execution engine - an essential concept for every Data Engineer who wants to write efficient and scalable ETL pipelines.
Let’s break down Spark architecture in a way that is simple, visual, and interview-friendly.
🧠 Why Learn Spark Architecture?
If you understand how Spark works internally, you can:
- Write faster pipelines
- Debug errors quickly
- Reduce shuffle
- Tune cluster performance
⚙️ Spark Architecture (High-Level)
Spark has 3 major components:
1. Driver Program: This is the "brain" of your Spark application.
The driver:
- Creates SparkSession
- Builds logical plan (DAG)
- Converts transformations into stages/tasks
- Manages metadata
- Talks to cluster manager
If the driver crashes → the entire application stops.
This is why we avoid collect() on huge datasets: it pulls every row into the driver's memory and can overload or crash it.
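A minimal sketch of the safer alternatives (the dataset path below is a hypothetical example, not from this series):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-demo").getOrCreate()
df = spark.read.parquet("events/")  # hypothetical large dataset

# Risky: collect() pulls every row into the driver's memory.
# rows = df.collect()

# Safer: inspect a bounded sample, or let the executors write the output.
df.show(20)                                  # only a few rows reach the driver
preview = df.limit(100).collect()            # bounded collect for a small preview
df.write.mode("overwrite").parquet("out/")   # results written by the executors
```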
2. Executors: These are worker processes distributed across the cluster.
Executors:
- Execute tasks in parallel
- Store data in memory (RDD/DataFrame cache)
- Write shuffle data
- Report progress back to the driver
Executors die when your Spark application ends.
If you allocate:
- 4 executors
- 4 cores per executor
→ you get 16 parallel task slots.
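As a rough sketch, this is how those numbers map to configuration. The values are illustrative only, and on a real cluster they are usually passed via spark-submit (or set by the platform) rather than hard-coded:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("resource-sizing-demo")
    .config("spark.executor.instances", "4")  # 4 executors (YARN/Kubernetes)
    .config("spark.executor.cores", "4")      # 4 cores per executor
    .config("spark.executor.memory", "8g")    # illustrative value, not a recommendation
    .getOrCreate()
)

# Maximum concurrent tasks = executors * cores per executor = 4 * 4 = 16
```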
3. Cluster Manager: This system allocates cluster resources (CPU cores, memory, machines) to your Spark application.
Spark supports:
| Manager | Usage |
| --- | --- |
| Standalone | Spark's built-in cluster manager for simple clusters |
| YARN | Hadoop ecosystem |
| Kubernetes | Cloud-native Spark |
| Databricks | Managed Spark service |
The cluster manager is a resource allocator; it is not responsible for task scheduling.
Task scheduling is handled by the driver.
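A minimal sketch of where that choice shows up in code: the master URL tells Spark which cluster manager to talk to (hosts and ports below are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[4]")                    # local mode: 4 threads, no cluster manager
    # .master("spark://host:7077")         # Spark standalone cluster
    # .master("yarn")                      # Hadoop YARN
    # .master("k8s://https://host:6443")   # Kubernetes
    .getOrCreate()
)
# On Databricks, the platform sets this for you; you simply attach to a cluster.
```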
🔁 Spark Execution Process: Simplified
Example code:

```python
# Assumes sales.csv has a header row with "amount" and "category" columns.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
filtered = df.filter(df.amount > 1000)
result = filtered.groupBy("category").count()
result.show()
```
Step 1: You write code
The driver receives your transformations and actions.
Step 2: Build Logical Plan (DAG)
Spark builds a series of transformations.
Step 3: Optimize Plan
The Catalyst optimizer rewrites the query plan for better performance.
Step 4: Convert to Physical Plan (Stages)
Stages break at Shuffle boundaries.
Step 5: Assign Tasks
Each stage is split into tasks, the smallest unit of work (one task per partition).
Step 6: Executors Run Tasks
Parallel execution across cluster nodes.
Step 7: Results → Driver
.show() sends a small sample of rows back to the driver and displays them in your notebook or terminal.
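To watch these steps yourself, run explain() on the example pipeline; it prints the plans the driver builds before any executor does work (the header/inferSchema options are assumptions about sales.csv):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
result = df.filter(df.amount > 1000).groupBy("category").count()

# extended=True prints the parsed, analyzed, and optimized logical plans
# plus the physical plan. Every "Exchange" node in the physical plan is a
# shuffle, which is exactly where Spark cuts the job into stages.
result.explain(True)
```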
🌉 Understanding Stages & Tasks
🔹 Stage : A group of tasks that can run in parallel without a shuffle.
Examples of wide transformations that trigger a shuffle:
- groupBy
- join
- reduceByKey
🔹 Task : The unit of execution run by each executor.
If a stage has 100 partitions → Spark creates 100 tasks for it.
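A quick sketch for checking this on your own data (again assuming the sales.csv example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-demo").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Number of partitions after the read = number of tasks in the read stage.
print(df.rdd.getNumPartitions())

# After a wide transformation (groupBy, join), the shuffle side uses
# spark.sql.shuffle.partitions tasks, which defaults to 200.
print(spark.conf.get("spark.sql.shuffle.partitions"))
```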
Common Spark Architecture Mistakes by Beginners
- Using .collect() on large datasets
- Repartitioning unnecessarily
- Not broadcasting small lookup tables (see the join sketch after this list)
- Guessing executor memory settings instead of sizing them for the workload
- Running heavy Python UDFs on large data
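For example, the broadcast point above looks like this in practice (the table names and join key are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

sales = spark.read.parquet("sales/")            # assumed large fact table
categories = spark.read.parquet("categories/")  # assumed small lookup table

# broadcast() ships the small table to every executor, so Spark can do a
# BroadcastHashJoin instead of shuffling the large table across the cluster.
joined = sales.join(broadcast(categories), on="category_id", how="left")
joined.explain()   # look for "BroadcastHashJoin" in the physical plan
```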
Follow for more such content, and let me know in the comments if I missed anything. Thank you!