This content originally appeared on DEV Community and was authored by Pat Sienkiewicz
In this multi-part series, we'll explore how to build a modern, scalable pipeline for processing telecom Call Detail Records (CDRs) using Databricks Delta Live Tables. Part 1 focuses on the foundation: data generation and the bronze layer implementation.
Introduction
Telecommunications companies process billions of Call Detail Records (CDRs) daily. These records capture every interaction with the network—voice calls, text messages, data sessions, and more. Processing this data efficiently is critical for billing, network optimization, fraud detection, and customer experience management.
In this series, we'll build a complete Telco CDR processing pipeline using Databricks Delta Live Tables (DLT). We'll follow the medallion architecture pattern, with bronze, silver, and gold layers that progressively refine raw data into valuable business insights.
The Challenge
Telecom data presents several unique challenges:
- Volume: Billions of records generated daily
- Variety: Multiple CDR types with different schemas
- Velocity: Real-time processing requirements
- Complexity: Intricate relationships between users, devices, and network elements
- Compliance: Strict regulatory requirements for data retention and privacy
Traditional batch processing approaches struggle with these challenges. We need a modern, streaming-first architecture that can handle the scale and complexity of telecom data.
Our Solution
Git repo: dlt_telco
We're building a solution with two main components:
- Data Generator: A synthetic CDR generator that produces realistic telecom data
- DLT Pipeline: A Delta Live Tables pipeline that processes the data through medallion architecture layers
In Part 1, we'll focus on the data generator and the bronze layer implementation.
Data Generator: Creating Realistic Synthetic Data
For development and testing, we need a way to generate realistic CDR data. Our generator creates:
- User Profiles: Synthetic subscriber data with identifiers (MSISDN, IMSI, IMEI), plan details, and location information
- Multiple CDR Types: Voice, data, SMS, VoIP, and IMS records with appropriate attributes
- Kafka Integration: Direct streaming to Kafka topics for real-time ingestion
The generator ensures referential integrity between users and CDRs, making it possible to perform realistic joins and aggregations in downstream processing.
User Profile Generation
Our user generator creates profiles with realistic telecom attributes:
# Sample user profile structure
{
  "user_id": "user_42",
  "msisdn": "1234567890",
  "imsi": "310150123456789",
  "imei": "490154203237518",
  "plan_name": "Premium Unlimited",
  "data_limit_gb": 50,
  "voice_minutes": 1000,
  "sms_count": 500,
  "registration_date": "2023-05-15",
  "active": true,
  "location": {
    "city": "Seattle",
    "state": "WA"
  }
}
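To produce profiles like the one above, the generator only needs a pool of plan and location options plus randomized identifiers. The sketch below is illustrative rather than the repo's actual code (the generate_user name, plan catalogue, and city list are assumptions) and uses only the Python standard library:

import random
from datetime import date, timedelta

# Illustrative catalogues; the real generator's options may differ
PLANS = [
    ("Basic", 5, 200, 100),
    ("Standard", 20, 500, 250),
    ("Premium Unlimited", 50, 1000, 500),
]
CITIES = [("Seattle", "WA"), ("Austin", "TX"), ("Denver", "CO")]

def generate_user(index: int) -> dict:
    """Build one synthetic subscriber profile (hypothetical helper)."""
    plan_name, data_gb, minutes, sms = random.choice(PLANS)
    city, state = random.choice(CITIES)
    return {
        "user_id": f"user_{index}",
        "msisdn": "".join(random.choices("0123456789", k=10)),
        "imsi": "310150" + "".join(random.choices("0123456789", k=9)),
        "imei": "".join(random.choices("0123456789", k=15)),
        "plan_name": plan_name,
        "data_limit_gb": data_gb,
        "voice_minutes": minutes,
        "sms_count": sms,
        "registration_date": (date.today() - timedelta(days=random.randint(0, 730))).isoformat(),
        "active": random.random() > 0.05,
        "location": {"city": city, "state": state},
    }

users = [generate_user(i) for i in range(1000)]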
CDR Generation
The CDR generator produces five types of records, each with appropriate attributes (a generation sketch follows the list):
- Voice CDRs: Call duration, calling/called numbers, cell tower IDs
- Data CDRs: Session duration, uplink/downlink volumes, APN information
- SMS CDRs: Message size, sender/receiver information
- VoIP CDRs: SIP endpoints, codec information, quality metrics
- IMS CDRs: Service type, session details, network elements
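As a sketch of the per-type generation logic, here is the voice case. It is a simplified stand-in for the repo's generator (generate_voice_cdr and the exact field names are assumptions); the key point is that each CDR is tied back to an existing user profile, which is what gives downstream joins their referential integrity:

import random
import uuid
from datetime import datetime, timezone

def generate_voice_cdr(users: list) -> dict:
    """Build one synthetic voice CDR linked to an existing subscriber (hypothetical helper)."""
    caller = random.choice(users)  # referential integrity: reuse a generated user profile
    return {
        "cdr_id": str(uuid.uuid4()),
        "cdr_type": "voice",
        "user_id": caller["user_id"],
        "calling_number": caller["msisdn"],
        "called_number": "".join(random.choices("0123456789", k=10)),
        "call_duration_seconds": random.randint(1, 3600),
        "cell_tower_id": f"tower_{random.randint(1, 500)}",
        "event_timestamp": datetime.now(timezone.utc).isoformat(),
    }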
Kafka Integration
The generator streams data to dedicated Kafka topics:
- telco-users: User profile data
- telco-voice-cdrs: Voice call records
- telco-data-cdrs: Data usage records
- telco-sms-cdrs: SMS message records
- telco-voip-cdrs: VoIP call records
- telco-ims-cdrs: IMS session records
This streaming approach mimics real-world telecom environments where CDRs flow continuously from network elements to processing systems.
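The producer side of this is straightforward. The snippet below is a hedged sketch, assuming the confluent-kafka Python client and a Confluent-style SASL_SSL configuration; the publish helper and the placeholder credentials are illustrative, not the repo's actual code:

import json
from confluent_kafka import Producer

# Connection details would come from configuration or secrets in practice
producer = Producer({
    "bootstrap.servers": "<bootstrap-server>",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
})

def publish(topic: str, record: dict, key_field: str = "user_id") -> None:
    """Send one JSON-encoded record to its topic, keyed for per-user partition affinity (hypothetical helper)."""
    producer.produce(
        topic,
        key=str(record.get(key_field, "")),
        value=json.dumps(record).encode("utf-8"),
    )

# e.g. publish("telco-users", profile) or publish("telco-voice-cdrs", voice_cdr)
producer.flush()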
Bronze Layer: Raw Data Ingestion with Delta Live Tables
The bronze layer is the foundation of our medallion architecture. It ingests raw data from Kafka with minimal transformation, preserving the original content for compliance and auditability.
Key Features
Our bronze layer implementation provides:
- Streaming Ingestion: Real-time data processing from Kafka
- Schema Preservation: Maintains original message structure
- Metadata Tracking: Captures Kafka metadata (timestamp, topic, key)
- Security: Secure credential management via Databricks secrets
- Scalability: Serverless Delta Live Tables for auto-scaling
Bronze Tables Structure
Our bronze layer includes 7 tables total:
| Table Name | Source Topic | Description |
|---|---|---|
| bronze_users | telco-users | Raw user profile data with parsed JSON |
| bronze_voice_cdrs | telco-voice-cdrs | Voice call detail records |
| bronze_data_cdrs | telco-data-cdrs | Data session records |
| bronze_sms_cdrs | telco-sms-cdrs | SMS message records |
| bronze_voip_cdrs | telco-voip-cdrs | VoIP call records |
| bronze_ims_cdrs | telco-ims-cdrs | IMS session records |
| bronze_all_cdrs | All CDR topics | Multiplexed view of all CDR types |
Each table preserves the original Kafka metadata (key, timestamp, topic) alongside the raw data, enabling reprocessing if needed.
Table Schema
All bronze tables follow a consistent schema:
CREATE TABLE bronze_<type> (
    key STRING,                  -- Kafka message key
    timestamp TIMESTAMP,         -- Kafka message timestamp
    topic STRING,                -- Source Kafka topic
    processing_time TIMESTAMP,   -- DLT processing timestamp
    raw_data STRING,             -- Original JSON payload
    parsed_data STRUCT<...>      -- Parsed JSON (users table only)
)
Code Implementation
Here's how we define our bronze tables using DLT:
import dlt
from pyspark.sql.functions import col, from_json, current_timestamp

# user_schema, read_from_kafka, and the helpers shown below are defined elsewhere in the pipeline

@dlt.table(
    name="bronze_users",
    comment="Raw user data from Kafka",
    table_properties=get_bronze_table_properties()
)
def bronze_users():
    df = get_standard_bronze_columns(read_from_kafka("telco-users"))
    return df.withColumn("parsed_data", from_json(col("raw_data"), user_schema))

def create_bronze_cdr_table(cdr_type, topic_name):
    """Create a Bronze table for a specific CDR type"""
    @dlt.table(
        name=f"bronze_{cdr_type}_cdrs",
        comment=f"Raw {cdr_type} CDR data from Kafka",
        table_properties=get_bronze_table_properties()
    )
    def bronze_cdr_table():
        return get_standard_bronze_columns(read_from_kafka(topic_name))
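Two pieces of wiring are not shown above: registering the factory once per CDR type, and the multiplexed bronze_all_cdrs table listed earlier. A plausible sketch follows, assuming read_from_kafka passes its argument straight through to the Kafka source's subscribe option (which accepts a comma-separated topic list):

# Hypothetical wiring: the topic map mirrors the table list above
CDR_TOPICS = {
    "voice": "telco-voice-cdrs",
    "data": "telco-data-cdrs",
    "sms": "telco-sms-cdrs",
    "voip": "telco-voip-cdrs",
    "ims": "telco-ims-cdrs",
}

# Register one bronze table per CDR type with DLT
for cdr_type, topic_name in CDR_TOPICS.items():
    create_bronze_cdr_table(cdr_type, topic_name)

@dlt.table(
    name="bronze_all_cdrs",
    comment="Multiplexed view of all raw CDR types",
    table_properties=get_bronze_table_properties()
)
def bronze_all_cdrs():
    # Spark's Kafka source accepts a comma-separated list of topics to subscribe to
    return get_standard_bronze_columns(read_from_kafka(",".join(CDR_TOPICS.values())))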
We use helper functions to ensure consistent table properties and column structures across all bronze tables:
def get_bronze_table_properties():
    """Return standard bronze table properties"""
    return {
        "quality": "bronze",
        "pipelines.autoOptimize.managed": "true",
        "pipelines.reset.allowed": "false"
    }

def get_standard_bronze_columns(df):
    """Return standardized bronze layer columns"""
    return df.select(
        col("key"),
        col("timestamp"),
        col("topic"),
        current_timestamp().alias("processing_time"),
        col("value").cast("string").alias("raw_data")
    )
Secure Credential Management
For security, we retrieve Kafka credentials from Databricks secrets:
import json
from pyspark.dbutils import DBUtils

# Get Kafka credentials from a Databricks secret scope
dbutils = DBUtils(spark)
kafka_settings = json.loads(dbutils.secrets.get(scope=env_scope, key="telco-kafka"))

# Extract values from the secret
KAFKA_BOOTSTRAP_SERVERS = kafka_settings["bootstrap_server"]
api_key = kafka_settings["api_key"]
api_secret = kafka_settings["api_secret"]
This approach ensures that sensitive credentials are never hardcoded in our pipeline code.
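The post does not show read_from_kafka itself, but a typical implementation against a SASL-secured cluster such as Confluent Cloud would look roughly like this (a sketch, assuming the credential variables defined above):

def read_from_kafka(topics: str):
    """Streaming read of one or more Kafka topics using the secret-backed credentials (sketch)."""
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
        .option("subscribe", topics)
        .option("startingOffsets", "earliest")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.mechanism", "PLAIN")
        .option(
            "kafka.sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            f'username="{api_key}" password="{api_secret}";',
        )
        .load()
    )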
Deployment Automation
We use Databricks Asset Bundles for deployment automation:
# Deploy to development
cd dlt_telco
databricks bundle deploy --target dev
# Deploy to production
databricks bundle deploy --target prod
Asset Bundles give us consistent, repeatable deployments across environments, while serverless compute scales the pipeline automatically.
Results and Benefits
With our bronze layer implementation:
- Streaming Ingestion: CDRs are available for analysis seconds after generation
- Data Preservation: Original records are preserved for compliance and auditability
- Scalability: Serverless compute handles millions of records per minute
- Security: All credentials managed through Databricks Secret Scopes
- Automation: Asset Bundle deployment simplifies environment management
What's Next?
In Part 2 of this series, we'll build the silver layer of our medallion architecture. We'll focus on:
- Data validation and quality enforcement
- Schema standardization across CDR types
- Enrichment with user and reference data
- Error handling and data recovery patterns
Stay tuned as we continue building our Telco CDR processing pipeline!
This blog post is part of a series on building data processing pipelines for telecommunications using Databricks Delta Live Tables. Follow along as we progress from raw data ingestion to advanced analytics and machine learning.