This content originally appeared on DEV Community and was authored by vinicius fagundes
You've got raw data flooding into S3. CSV files, JSON logs, event streams, all sitting there, waiting to be transformed into something useful and loaded into your Snowflake data warehouse. Traditional approaches mean spinning up EMR clusters, maintaining Spark jobs, or running expensive transformation layers 24/7.
There's a better way. Let's build a serverless data pipeline that processes and prepares data for Snowflake at a fraction of the cost.
🏗️ Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  S3 Bucket   │────▶│    Lambda    │────▶│   DuckDB /   │────▶│  S3 Output   │────▶│  Snowflake   │
│  (Raw Data)  │     │ (Triggered)  │     │    Polars    │     │  (Parquet)   │     │  Warehouse   │
└──────────────┘     └──────────────┘     │  Processing  │     └──────────────┘     └──────────────┘
                                          └──────────────┘
💰 Cost Comparison: Why This Matters
Traditional Approach | Serverless Approach | Savings |
---|---|---|
EMR Cluster: $1,200/month (always-on) | Lambda: $150/month (on-demand) | 87% |
Fivetran/Airbyte: $2,000/month | DIY Lambda Pipeline: $0/month | 100% |
EC2 Processing: $800/month | Lambda + DuckDB: $80/month | 90% |
Total: $4,000/month | Total: $230/month | 94% |
Note: Snowflake storage and compute costs remain the same in both scenarios
🔧 Option 1: DuckDB + Lambda (Analytics-First)
When to Use:
- ✅ You need SQL-based transformations
- ✅ Data is primarily read-heavy (analytics queries)
- ✅ Files are in columnar formats (Parquet, CSV)
- ✅ Processing < 50GB per invocation
- ❌ Not ideal for complex multi-step transformations
Implementation:
# lambda_duckdb_processor.py
import duckdb
import os
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    # Get S3 event details (object keys arrive URL-encoded)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Initialize DuckDB with S3 support.
    # Lambda's filesystem is read-only except /tmp, so point DuckDB there
    # before installing the httpfs extension.
    con = duckdb.connect()
    con.execute("SET home_directory='/tmp';")
    con.execute("""
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-east-1';
    """)
    # Pass the execution role's (or your local) AWS credentials through to DuckDB
    con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}';")
    con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}';")
    if 'AWS_SESSION_TOKEN' in os.environ:
        con.execute(f"SET s3_session_token='{os.environ['AWS_SESSION_TOKEN']}';")
# Process directly from S3 - NO DATA MOVEMENT
query = f"""
COPY (
SELECT
date_trunc('day', timestamp) as date,
user_id,
SUM(revenue) as total_revenue,
COUNT(*) as transaction_count
FROM read_parquet('s3://{bucket}/{key}')
WHERE revenue > 0
GROUP BY date, user_id
) TO 's3://{bucket}/processed/{key}'
(FORMAT PARQUET, COMPRESSION ZSTD);
"""
con.execute(query)
return {
'statusCode': 200,
'body': f'Processed {key} successfully'
}
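Before wiring up the S3 trigger, you can exercise the handler locally with a hand-built event; a minimal sketch (the bucket and key are placeholders, and it assumes AWS credentials that can read the file are exported in your shell):
# test_local.py -- invoke the handler with a fake S3 event (placeholder bucket/key)
from lambda_duckdb_processor import lambda_handler

fake_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "raw-data-bucket"},
            "object": {"key": "incoming/sample.parquet"},
        }
    }]
}

print(lambda_handler(fake_event, context=None))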
Performance Characteristics:
File Size | Processing Time | Lambda Cost | Total Cost
-------------|-----------------|-------------|------------
1 GB | 12 seconds | $0.002 | $0.002
10 GB | 98 seconds | $0.016 | $0.016
50 GB | 7.5 minutes | $0.075 | $0.075
Configuration:
# serverless.yml
functions:
duckdbProcessor:
handler: lambda_duckdb_processor.lambda_handler
runtime: python3.11
memorySize: 10240 # 10GB RAM
timeout: 900 # 15 minutes
ephemeralStorageSize: 10240 # 10GB temp storage
    # AWS_REGION is provided automatically by the Lambda runtime;
    # reserved AWS_* environment variables cannot be overridden here
events:
- s3:
bucket: raw-data-bucket
event: s3:ObjectCreated:*
rules:
- prefix: incoming/
- suffix: .parquet
⚡ Option 2: Polars + Lambda (Transform-Heavy)
When to Use:
- ✅ Complex multi-step transformations
- ✅ DataFrame operations (joins, pivots, window functions)
- ✅ Need maximum performance for heavy compute
- ✅ Working with mixed data formats
- ❌ Not ideal for simple SQL-only queries
Implementation:
# lambda_polars_processor.py
import polars as pl
import boto3
from io import BytesIO
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    # S3 event object keys arrive URL-encoded
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
# Read directly from S3 using Polars
obj = s3.get_object(Bucket=bucket, Key=key)
df = pl.read_parquet(BytesIO(obj['Body'].read()))
# Complex transformation pipeline
result = (
df
.filter(pl.col('revenue') > 0)
.with_columns([
pl.col('timestamp').dt.truncate('1d').alias('date'),
(pl.col('revenue') * 0.9).alias('net_revenue') # 10% commission
])
.group_by(['date', 'user_id'])
.agg([
pl.sum('net_revenue').alias('total_revenue'),
pl.count().alias('transaction_count'),
pl.mean('revenue').alias('avg_revenue')
])
.sort('date', descending=True)
)
# Write back to S3 as Parquet
output_buffer = BytesIO()
result.write_parquet(output_buffer, compression='zstd')
output_key = f"processed/{key}"
s3.put_object(
Bucket=bucket,
Key=output_key,
Body=output_buffer.getvalue()
)
return {
'statusCode': 200,
'body': f'Processed {len(result)} records'
}
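The "when to use" list above mentions mixed data formats; here's a minimal sketch of a helper (not part of the handler above, names are illustrative) that picks the right Polars reader based on the object key before running the same transformation:
# format_dispatch.py -- choose a Polars reader from the file extension (illustrative helper)
from io import BytesIO
import polars as pl

def read_any(raw: bytes, key: str) -> pl.DataFrame:
    """Load Parquet, CSV, or newline-delimited JSON bytes into a DataFrame."""
    buf = BytesIO(raw)
    if key.endswith('.parquet'):
        return pl.read_parquet(buf)
    if key.endswith('.csv'):
        return pl.read_csv(buf)
    if key.endswith(('.json', '.ndjson')):
        return pl.read_ndjson(buf)  # expects one JSON object per line
    raise ValueError(f'Unsupported file type: {key}')
Swapping the pl.read_parquet call in the handler for read_any(obj['Body'].read(), key) keeps the rest of the pipeline unchanged.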
Performance Characteristics:
File Size | Processing Time | Lambda Cost | Total Cost
-------------|-----------------|-------------|------------
1 GB | 8 seconds | $0.0013 | $0.0013
10 GB | 65 seconds | $0.011 | $0.011
50 GB | 5.8 minutes | $0.058 | $0.058
📊 Head-to-Head Comparison
Feature | DuckDB + Lambda | Polars + Lambda |
---|---|---|
SQL Support | Native, full SQL | Expressions (SQL-like) |
Performance | Excellent for analytics | 20-30% faster for transforms |
Memory Efficiency | Superior (columnar) | Very good |
Learning Curve | Easy (SQL) | Moderate (API) |
S3 Integration | Native, zero-copy | Requires download |
Complex Joins | Excellent | Excellent |
Best For | Analytics queries | ETL pipelines |
Community | Growing | Rapidly growing |
📈 Real-World Cost Example
Scenario: Processing 5TB of event data monthly and loading into Snowflake
Traditional Approach:
EMR Cluster (m5.xlarge Γ 3): $1,200/month
Fivetran/Airbyte: $2,000/month
S3 Storage: $115/month
Data Transfer: $450/month
Total Pipeline Cost: $3,765/month
Snowflake (separate): ~$1,500/month
Grand Total: $5,265/month
DuckDB Lambda Approach:
Lambda Compute: $95/month
S3 Storage: $115/month
S3 Requests: $25/month
Snowpipe (billed on the Snowflake side): $0/month
Total Pipeline Cost: $235/month
Snowflake (same): ~$1,500/month
Grand Total: $1,735/month
Savings: $3,530/month (67% total, 94% on pipeline)
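To sanity-check these figures against your own volumes, remember that Lambda bills GB-seconds of compute plus a per-request fee; a rough back-of-the-envelope sketch using current us-east-1 x86 list prices (your rates and free-tier usage may differ):
# lambda_cost_estimate.py -- rough monthly Lambda cost, ignoring the free tier
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second (x86, us-east-1)
REQUEST_PRICE = 0.20 / 1_000_000   # USD per invocation

def monthly_lambda_cost(invocations, avg_seconds, memory_gb):
    compute = invocations * avg_seconds * memory_gb * GB_SECOND_PRICE
    requests = invocations * REQUEST_PRICE
    return compute + requests

# e.g. 50,000 invocations averaging 20 s at 4 GB of memory
print(f"${monthly_lambda_cost(50_000, 20, 4):,.2f} per month")  # ~ $66.68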
🛠️ Deployment Steps
1. Install Dependencies:
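# Note: duckdb and polars ship compiled (manylinux) wheels. If you build this
# package on macOS or Windows, pull Linux wheels so the binaries match the
# Lambda runtime, e.g.:
#   pip install duckdb --platform manylinux2014_x86_64 --only-binary=:all: -t ./package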
# For DuckDB
pip install duckdb boto3 -t ./package
cd package && zip -r ../lambda_duckdb.zip . && cd ..
zip -g lambda_duckdb.zip lambda_duckdb_processor.py
# For Polars
pip install polars boto3 -t ./package
cd package && zip -r ../lambda_polars.zip . && cd ..
zip -g lambda_polars.zip lambda_polars_processor.py
2. Create Lambda Layer (Recommended):
# layer/python/requirements.txt
duckdb==0.10.0
# OR
polars==0.20.3
3. Set Up S3 Trigger:
# terraform/main.tf
resource "aws_s3_bucket_notification" "data_pipeline" {
bucket = aws_s3_bucket.raw_data.id
lambda_function {
lambda_function_arn = aws_lambda_function.processor.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = "incoming/"
filter_suffix = ".parquet"
}
}
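# Also grant S3 permission to invoke the function (an aws_lambda_permission
# resource with principal "s3.amazonaws.com"); without it the trigger never fires.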
4. Configure Snowflake Integration:
-- Create Snowflake external stage
CREATE OR REPLACE STAGE s3_processed_stage
URL = 's3://your-bucket/processed/'
CREDENTIALS = (AWS_KEY_ID = 'xxx' AWS_SECRET_KEY = 'xxx');
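-- Tip: in production, prefer a STORAGE INTEGRATION over hard-coded access keys in the stage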
-- Create target table
CREATE OR REPLACE TABLE user_revenue (
date DATE,
user_id VARCHAR,
total_revenue DECIMAL(18,2),
transaction_count INTEGER
);
-- Set up Snowpipe for automatic ingestion
CREATE OR REPLACE PIPE user_revenue_pipe
AUTO_INGEST = TRUE
AS
COPY INTO user_revenue
FROM @s3_processed_stage
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
5. Enable S3 Event Notifications for Snowpipe:
# Add SNS notification for Snowpipe.
# If you fan events out through SNS like this, create the pipe with
# AWS_SNS_TOPIC = '<topic arn>'; otherwise send a queue notification straight
# to the pipe's SQS ARN (shown as notification_channel in DESC PIPE).
resource "aws_s3_bucket_notification" "snowpipe_notification" {
bucket = aws_s3_bucket.processed_data.id
topic {
topic_arn = "arn:aws:sns:us-east-1:xxx:snowpipe-topic"
events = ["s3:ObjectCreated:*"]
filter_prefix = "processed/"
}
}
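After everything is deployed, it's worth confirming that the pipe is actually listening; a minimal sketch using the snowflake-connector-python package (connection parameters are placeholders, keep real credentials in Secrets Manager):
# check_pipe.py -- verify Snowpipe is running and watching for new files
import json
import snowflake.connector

conn = snowflake.connector.connect(
    account='your_account',    # placeholder
    user='your_user',          # placeholder
    password='your_password',  # placeholder -- load from Secrets Manager in practice
    database='your_db',        # placeholder
    schema='public',
)
cur = conn.cursor()
cur.execute("SELECT SYSTEM$PIPE_STATUS('user_revenue_pipe')")
status = json.loads(cur.fetchone()[0])
print(status['executionState'])            # expect 'RUNNING'
print(status.get('pendingFileCount', 0))   # files queued for ingestion
conn.close()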
🚀 Performance Optimization Tips
For DuckDB:
# Enable parallel processing
con.execute("SET threads=8;")
# Use memory limits
con.execute("SET memory_limit='8GB';")
# Optimize Parquet reading
con.execute("SET enable_object_cache=true;")
For Polars:
# Lazy evaluation for large datasets
result = (
pl.scan_parquet('s3://bucket/data/*.parquet')
.filter(pl.col('date') > '2024-01-01')
.group_by('user_id')
.agg(pl.sum('revenue'))
.collect(streaming=True) # Process in batches
)
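For results that are themselves larger than memory, the same lazy pipeline can stream straight to a Parquet file on local disk instead of collecting a DataFrame (a sketch; in Lambda you'd write to /tmp and then upload with boto3):
import polars as pl

(
    pl.scan_parquet('s3://bucket/data/*.parquet')
    .filter(pl.col('date') > pl.date(2024, 1, 1))
    .group_by('user_id')
    .agg(pl.sum('revenue'))
    .sink_parquet('/tmp/aggregated.parquet')  # streaming write, no full materialization
)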
🎯 Key Takeaways
- Start with DuckDB if your pipeline is analytics-heavy and SQL-based
- Choose Polars if you need complex transformations and maximum performance
- Always use Parquet for storage: it's 10x cheaper to query than JSON/CSV
- Lambda's 15-minute limit is sufficient for 95% of data processing tasks
- Monitor cold starts: Keep functions warm with scheduled pings if needed (see the sketch below)
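If you do add a scheduled warm-up rule, have the handler return early before touching S3; a minimal sketch, assuming the ping comes from an EventBridge schedule (those events carry "source": "aws.events"):
# At the top of lambda_handler: ignore EventBridge warm-up pings
def lambda_handler(event, context):
    if event.get('source') == 'aws.events':
        return {'statusCode': 200, 'body': 'warm-up ping, nothing to process'}
    # ...normal S3 event handling continues below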
🚫 When NOT to Use This Architecture
❌ Processing > 100GB per invocation (Lambda limits)
❌ Real-time streaming (use Kinesis + Flink instead)
❌ Complex ML training (use SageMaker)
❌ Transactions requiring ACID guarantees (use RDS)
💡 Production Checklist
- [ ] Enable S3 versioning for data recovery
- [ ] Set up CloudWatch alarms for Lambda errors
- [ ] Implement DLQ (Dead Letter Queue) for failed invocations
- [ ] Add Lambda Insights for detailed monitoring
- [ ] Configure S3 lifecycle policies to archive old data
- [ ] Use AWS Secrets Manager for Snowflake credentials
- [ ] Set up Snowpipe with AUTO_INGEST for automatic loading
- [ ] Monitor Snowpipe with SYSTEM$PIPE_STATUS()
- [ ] Configure Snowflake resource monitors to control costs
- [ ] Enable VPC for private data access (if required)
- [ ] Test Snowflake data quality with dbt tests
📚 Resources
- DuckDB Documentation
- Polars User Guide
- AWS Lambda Pricing Calculator
- Snowflake Snowpipe Documentation
- Snowflake External Stages Guide
Questions? Drop them in the comments; I'd love to hear about your data pipeline challenges!
What's your data processing stack? Are you using Snowflake? Let me know if you'd like a deep dive into Snowpipe optimization or any specific part of this architecture!