This content originally appeared on DEV Community and was authored by Sumedh Bala
Reservations, Locking, Availability & Queuing
Seat management ensures users can reliably reserve seats without conflicts, overselling, or double-booking. It handles reserved vs General Admission seats, tracks reservations until payment, manages concurrency, and provides real-time availability updates. Queuing prevents overload during high-demand events.
1. Baseline Functional Solution
Service Responsibilities & API Mapping
| Service | Primary Responsibilities | API Endpoints | Data Sources | 
|---|---|---|---|
| Seat Management Service | Seat reservations, locking, cancellations, expired cleanup | POST /events/{event_id}/reservations | Primary Database, Redis Cache | 
| Queue Management Service | User queuing, position tracking, admission control | POST /events/{event_id}/queue GET /events/{event_id}/queue/status | Redis Queue, Database | 
| Availability Aggregation Service | Section counts, real-time availability tracking | GET /events/{event_id}/sections | Message Bus, Redis Cache | 
| Real-time Notification Service | SSE connections for live updates | SSE /events/{event_id}/sections/updates | Redis Pub/Sub, Message Bus | 
Databases & Caches
- Primary Database (PostgreSQL / MySQL / DynamoDB): Stores canonical seat and reservation state.
- Reservation Store (DynamoDB / Redis / PostgreSQL): Tracks in-progress reservations. Only stores queue_token for anonymous users; logged-in users are identified via JWT.
- Cache (Redis / Memcached): Hot lookups for section counts, seat maps, and user queue positions.
- Message Bus (Kafka / Pulsar / Kinesis / Pub/Sub): Streams reservation events for real-time updates and availability aggregation.
- Queueing Infrastructure: Queued requests for high-demand events. Distributed coordination ensures unique positions and queue ordering.
- Real-Time Updates Infrastructure: In-memory counters maintain section-level seats_remaining. Fanout layer scales push notifications to many clients.
Example Database Schema
Note: The following are simplified schema overviews for understanding the basic structure.
Seats Table
seat_id – string (primary key, e.g., "A-101")
event_id – string (foreign key)
section_id – string (foreign key)
row – string
seat_number – string
status – ENUM(available, reserved, sold)
Reservations Table (Header)
reservation_id – UUID (primary key)
event_id – string (foreign key)
queue_token – string (nullable, only for anonymous users)
user_id – string (foreign key, nullable, for logged-in users)
status – ENUM(pending_payment, confirmed, canceled, expired)
total_amount_minor_units – integer (amount in smallest currency unit to avoid floating point precision issues)
currency – string (e.g., "USD")
payment_intent_id – string (nullable)
created_at – TIMESTAMP
expires_at – TIMESTAMP
confirmed_at – TIMESTAMP (nullable)
Reservation_Seats Table (Reserved Seats)
reservation_seat_id – UUID (primary key)
reservation_id – UUID (foreign key to reservations)
seat_id – string (foreign key to seats)
created_at – TIMESTAMP
Reservation_GA Table (General Admission)
reservation_ga_id – UUID (primary key)
reservation_id – UUID (foreign key to reservations)
section_id – string (foreign key to sections)
quantity – integer (number of GA tickets)
created_at – TIMESTAMP
Sections Table
section_id – string (primary key, e.g., "section_A")
event_id – string (foreign key)
name – string (e.g., "104")
capacity – integer
seats_remaining – integer (updated in real time)
Queue Table
queue_id – UUID (primary key)
event_id – FK
user_id – nullable (logged-in)
queue_token – nullable (anonymous, signed)
status – ENUM(waiting, ready, expired, completed)
created_at, last_updated_at – timestamps
2. Reservation Flow
Enter Queue (High-Demand Events)
- Anonymous users receive a queue_token.
- Logged-in users are identified via JWT; no token issued.
- If the queue is empty, users are immediately ready to reserve seats.
Browse Section Availability
- Anyone can fetch aggregated section counts before joining the queue.
- Individual reserved seat details require ready status (queue_token or JWT).
Reserve Seats / General Admission Slots
- Must be ready in the queue.
- Backend validates queue readiness via queue_token (anonymous) or JWT/account ID (logged-in).
- Reserved seats have TTL until payment is completed.
Real-Time Updates
- Users receive live updates on section availability via Server-Sent Events (SSE).
- Anonymous users subscribe with queue_token.
- Logged-in users subscribe with JWT.
3. APIs (in order of user flow)
Enter Queue
Endpoint: POST /events/{event_id}/queue
Request Body: {}
Response:
{
  "queue_token": "<signed_token>",
  "estimated_wait_time": 0,
  "position_in_queue": 0,
  "status": "ready"
}
Notes: Logged-in users receive no token; identity tracked via JWT.
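The article does not say how the queue_token is signed. As a minimal sketch, assuming an HMAC-SHA256 signature over a small JSON payload, issuing and verifying a token could look like the following. The secret, the payload fields, and the `payload.signature` layout are all hypothetical choices for illustration.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret; a real deployment would load this from a secrets manager.
SECRET_KEY = b"demo-secret-not-for-production"


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(payload: str) -> str:
    """HMAC-SHA256 signature of the encoded payload."""
    return _b64(hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).digest())


def issue_queue_token(event_id: str, queue_id: str) -> str:
    """Return '<payload>.<signature>' (assumed layout, not the article's)."""
    body = json.dumps({"event_id": event_id, "queue_id": queue_id}, sort_keys=True)
    payload = _b64(body.encode())
    return f"{payload}.{_sign(payload)}"


def verify_queue_token(token: str) -> Optional[dict]:
    """Return the decoded payload if the signature checks out, else None."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return None
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Because the token is self-verifying, the API can reject forged or tampered tokens before touching Redis or the database.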
Queue Status
Endpoint: GET /events/{event_id}/queue/status
Headers: Authorization: Bearer <queue_token> (anonymous) OR Authorization: Bearer <JWT> (logged-in)
Response:
{
  "position_in_queue": 123,
  "status": "waiting"
}
Notes: All users must check queue status before reserving seats during high-demand events.
Get Section Availability
Endpoint: GET /events/{event_id}/sections
Headers: Optional
Response:
{
  "sections": [
    { "section_id": "A", "total_seats": 500, "available_seats": 120 },
    { "section_id": "B", "total_seats": 800, "available_seats": 300 },
    { "section_id": "GA1", "type": "General Admission", "available_seats": 2000 }
  ]
}
Notes: Anyone can fetch counts even before joining the queue. Reserved seat details are not returned.
Get Seats in a Section (Reserved Seating Only, Paginated)
Endpoint: GET /events/{event_id}/sections/{section_id}/seats
Headers: Authorization: Bearer <queue_token> (anonymous, ready) OR Authorization: Bearer <JWT> (logged-in, ready)
Query Params: page=1, page_size=50
Response:
{
  "section_id": "A",
  "total_seats": 500,
  "page": 1,
  "page_size": 50,
  "seats": [
    { "seat_id": "A-101", "row": "A", "number": 1, "status": "available" },
    { "seat_id": "A-102", "row": "A", "number": 2, "status": "reserved" },
    { "seat_id": "A-103", "row": "A", "number": 3, "status": "available" }
  ]
}
Reserve Seats / General Admission Slots
Endpoint: POST /events/{event_id}/reservations
Headers: Authorization: Bearer <queue_token> (anonymous) OR Authorization: Bearer <JWT> (logged-in)
Request Body:
{
  "seats": ["A-101","A-103"],
  "ga_section": "GA1",
  "ga_quantity": 2
}
Database Operations:
- Create Reservation Header: Insert into reservations table
- Link Reserved Seats: Insert into reservation_seats table for each seat
- Link GA Quantities: Insert into reservation_ga table for GA sections
- Update Seat Status: Mark seats as 'reserved' in seats table
- Update Section Counts: Decrement seats_remaining in sections table
Response:
{
  "reservation_id": "res_789",
  "reserved_until": "2025-09-02T12:35:00Z",
  "seats": ["A-101","A-103"],
  "ga_section": "GA1",
  "ga_quantity": 2,
  "status": "pending_payment"
}
Notes: Backend validates queue readiness. TTL ensures release if payment isn't completed.
Real-Time Section Updates (Server-Sent Events)
Why Server-Sent Events (SSE) for Seat Notifications:
SSE is the optimal choice for seat availability notifications because:
- One-Way Communication: Seat updates flow server → client only (no client → server needed)
- Simpler Implementation: Native browser support with automatic reconnection
- Better Performance: Lower memory overhead (~2KB vs ~8KB per WebSocket connection)
- Higher Scalability: Can handle 50K+ connections per pod vs 10K for WebSockets
- Built-in Resilience: Automatic reconnection and error handling
- Perfect Use Case Match: Ideal for push notifications like seat availability changes
WebSocket vs SSE: When to Use Each
Why WebSocket Has Lower Latency (1ms vs 8ms):
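- Lightweight Framing: After the one-time upgrade handshake, each WebSocket message carries only a few bytes of frame header. SSE also keeps a single HTTP response open, so neither protocol pays a handshake per message.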
- Binary Protocol: More efficient than HTTP text-based SSE
- Bidirectional: Client can send acknowledgments, reducing server-side queuing
- Direct Connection: No intermediate services (SNS/SQS) adding latency
Use WebSocket When:
- Bidirectional Communication: Chat, real-time collaboration, gaming
- Low Latency Critical: Real-time trading, live sports updates
- Interactive Features: User can send commands/responses
- Custom Protocols: Need binary data or custom message formats
Use SSE When:
- One-Way Communication: Seat availability updates, notifications
- HTTP Infrastructure: Leverage existing HTTP caching, CDNs
- Browser Compatibility: Better support for automatic reconnection
- Simpler Implementation: No need for connection state management
Endpoint: GET /events/{event_id}/sections/updates (SSE stream, served as text/event-stream)
Headers: Authorization: Bearer <queue_token> (anonymous, ready) OR Authorization: Bearer <JWT> (logged-in, ready)
Message:
{
  "section_id": "A",
  "available_seats": 120,
  "timestamp": "2025-09-02T12:30:00Z"
}
Notes: General Admission counts and reserved seat holds are pushed in real time. Updates are throttled to reduce load.
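On the wire, each SSE message is a small text frame: optional `id:` and `event:` lines, a `data:` line, and a blank line to terminate it. Here is a minimal sketch of serializing the message above into that frame. The event name `section_update` and the helper name are assumptions, not part of the original API.

```python
import json
from typing import Optional


def format_sse_event(payload: dict,
                     event: str = "section_update",
                     event_id: Optional[str] = None) -> str:
    """Serialize one update as a text/event-stream frame."""
    lines = []
    if event_id is not None:
        # Browsers send the last seen id back as Last-Event-ID on reconnect.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # Compact JSON keeps the payload on a single data line.
    lines.append(f"data: {json.dumps(payload, separators=(',', ':'))}")
    # A blank line marks the end of the frame.
    return "\n".join(lines) + "\n\n"


frame = format_sse_event(
    {"section_id": "A", "available_seats": 120, "timestamp": "2025-09-02T12:30:00Z"},
    event_id="42",
)
```

Setting an `id` on each frame lets a reconnecting client tell the server which update it saw last, which pairs well with the automatic reconnection mentioned earlier.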
Why Params Are in the Path vs. Request Body
- Event ID, Section ID → Path Params: Represent resources in a hierarchy (/events/{event_id}/sections/{section_id}/seats). REST convention prefers path params when accessing a specific resource.
- Seats, GA Section, GA Quantity → Request Body: Represent actions or state changes (reserving seats). Describe what the client wants to do, not the resource being fetched.
- Queue Token / JWT → Headers: Authentication and authorization tokens conventionally go in the Authorization header rather than in bodies or paths. This keeps credentials out of URLs and access logs and keeps the API clean and stateless.
This separation keeps the API design RESTful, predictable, and consistent with industry practices.
Key Takeaways
- Anonymous users are tracked via queue_token.
- Logged-in users are tracked via JWT/account ID; no token issued.
- Queue status is required for all users during high-demand events before reserving seats.
- Pagination improves performance for large sections.
- Real-time updates keep section availability accurate.
- Unauthorized users (no JWT or queue_token) cannot reserve seats.
The Virtual Waiting Room: A Deep Dive into High-Scale Queueing
This deep dive focuses on the Queue Management Service that gracefully manages demand spikes for seat reservations. We'll cover the full system design: its runtime architecture, the exact Redis data structures and operations, a modern approach to consistency and recovery, and a head-to-head comparison of two primary Redis queue implementations.
The Problem: Surviving a Demand Spike
Imagine a major concert ticket sale. Millions of users hit your site simultaneously. Without an admission control system, your backend services—for seat reservation and payment—would be instantly overwhelmed. This leads to a chaotic user experience: slow, unresponsive pages, frustrating timeouts, and a flood of angry customer service calls. This is a critical business problem, costing millions in lost sales and brand trust. The solution isn't to build a bigger backend overnight; it's to create a virtual waiting room that gracefully manages the flood of users and ensures an orderly process.
Goals
- Protect downstream seat/reservation systems from overload (admission control).
- Maintain predictable, mostly FIFO ordering across millions of waiting users.
- Provide "where am I?" and "how long left?" with low latency.
- Allow anonymous and logged-in users (queue_token vs JWT).
- Recover from cache failures quickly without losing order.
Non-Goals (handled elsewhere)
- Payments and final ticket issuance.
- Full bot mitigation (you will still rate-limit and verify).
High-Level Architecture
The system's core is a hybrid, two-part architecture: a durable database for long-term state and a blazing-fast Redis cache for real-time operations.
Write Path (Join Queue)
The API handles a dual-write process for speed and durability:
- A user joins the queue by hitting an API endpoint. The API authenticates them (logged-in via JWT, anonymous gets a signed queue_token).
- The API performs two writes in parallel: it persists a minimal user entry into the Queue DB (our single source of truth) and enqueues the user into Redis (our hot path for ordering).
- The API instantly returns the user's initial position and ETA, derived from the monotonic sequence number (join_seq) received from the Redis write. We'll discuss the O(1) position calculation algorithm in detail in the "Technical Deep Dive" section below.
Admission (Making People Ready)
A Gatekeeper worker steadily admits users from the head of the queue at a controlled rate (e.g., N users/sec). It marks them "ready" in the DB and emits an event to the message bus for observability and client notification.
Read Path (Position, ETA)
In steady state, all reads are served from Redis, providing low-latency updates. If Redis is degraded, the system falls back to a degraded mode (details in section 5).
State Transitions
waiting → ready → (reserve) → done/expired. A user who doesn't reserve in time can re-queue subject to system policy.
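The transitions above can be enforced with a small lookup table. This is a sketch that uses the status names from the Queue table's ENUM (with `completed` standing in for "done"). It treats re-queuing as creating a fresh `waiting` entry rather than as a transition, which is an assumption about policy.

```python
from enum import Enum


class QueueStatus(Enum):
    WAITING = "waiting"
    READY = "ready"
    COMPLETED = "completed"
    EXPIRED = "expired"


# Allowed moves: waiting -> ready -> completed/expired. Terminal states have none.
ALLOWED = {
    QueueStatus.WAITING: {QueueStatus.READY},
    QueueStatus.READY: {QueueStatus.COMPLETED, QueueStatus.EXPIRED},
    QueueStatus.COMPLETED: set(),
    QueueStatus.EXPIRED: set(),
}


def transition(current: QueueStatus, target: QueueStatus) -> QueueStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place means a user cannot jump from waiting straight to a completed reservation.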
Data Model (DB + Redis)
Relational DB (minimal, durable)
This is our rock-solid, durable layer. We store only the essential, long-lived data.
Redis (Reconstructible Cache)
This is our high-performance, live-ordering layer. We'll explore two primary implementations.
A) Sorted Set (ZSET) per event
- Key: q:{event_id}:z
- Member: user_key (e.g., u:{user_id})
- Score: join_seq (monotonic sequence number)
B) List + Hash (list as deque, hash for metadata)
- Key: q:{event_id}:list (RPUSH/LPOP)
- Hash: q:{event_id}:meta:{user_key} → field: join_seq (monotonic sequence number)
Consistency, Failure & Recovery (Event-Sourced Approach)
To build a truly resilient system, we use an event-sourced model where the database is the single source of truth and Redis acts as a reconstructible, high-speed cache. This approach eliminates race conditions and manual intervention.
- DB-First Writes: All changes, like a user joining or being admitted, are first written to the database by the API.
- Optimized Recovery (The Snapshot + Stream Model): Our primary cache, Redis, is highly available but can still fail. To ensure a fast and accurate rebuild, we combine periodic snapshots with a continuous event stream.
  - Periodic Snapshots: The system periodically takes a snapshot of the Redis queue's live ordering state. This is a quick baseline backup of the essential user data and their order. These snapshots are stored in durable storage.
  - Rapid Rebuild: If Redis crashes, a dedicated recovery worker performs two steps. First, it loads the latest durable snapshot into Redis. Second, it queries the event log table to replay only the events that occurred since that snapshot was taken. This is a much smaller number of events, allowing for a fast and reliable catch-up.
How CDC Event Replay Works: CDC maintains a complete log of all database changes with timestamps. When Redis needs to be rebuilt, the recovery worker queries the CDC log for all changes that occurred after the last snapshot timestamp. This approach provides complete coverage since CDC captures every database change automatically, ensuring no events are missed during recovery.
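To make the snapshot-plus-replay idea concrete, here is a minimal in-memory sketch. It assumes each logged event carries a monotonically increasing log position (`seq`) and that the snapshot records the last position it covers. The article describes a timestamp cutoff; a log position plays the same role here. The event kinds and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QueueEvent:
    seq: int        # position in the event log (hypothetical column)
    kind: str       # "joined" or "admitted"
    user_key: str
    join_seq: int


def rebuild_queue(snapshot: Dict[str, int],
                  snapshot_seq: int,
                  events: List[QueueEvent]) -> Dict[str, int]:
    """Return user_key -> join_seq after replaying events newer than the snapshot."""
    queue = dict(snapshot)  # start from the durable baseline
    for ev in sorted(events, key=lambda e: e.seq):
        if ev.seq <= snapshot_seq:
            continue  # already reflected in the snapshot
        if ev.kind == "joined":
            queue[ev.user_key] = ev.join_seq
        elif ev.kind == "admitted":
            queue.pop(ev.user_key, None)
    return queue
```

Replaying in log order matters: a user who joined and was then admitted after the snapshot must end up absent from the rebuilt queue.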
Redis Design Choices & Time Complexity
A fundamental challenge in high-scale queueing is guaranteeing predictable behavior. A simple timestamp isn't reliable because millions of users could join the queue at the exact same millisecond, leading to a race condition and a random, non-deterministic order. Additionally, system clocks can sometimes go backwards due to clock synchronization issues (NTP adjustments, server reboots, etc.), which would make timestamp-based ordering even more confusing. To solve this, we use a monotonically increasing sequence number (join_seq) generated by Redis's atomic INCR command. This ensures that every single user gets a unique, sequential ticket, making the queue's internal logic perfectly deterministic.
This simple design choice unlocks a massive performance benefit: O(1) position lookup. While native Redis commands for position finding are slow (O(logN) for Sorted Sets and O(N) for Lists), we can bypass them entirely. We simply track the sequence number of the queue's head and subtract it from the user's join_seq to get their exact position in constant time. This is the "Ah-Ha!" moment that makes our system so fast and responsive for millions of users checking their status.
A) Sorted Set (ZSET)
A Redis Sorted Set is a unique data structure that combines a hash table and a skip list to balance performance. A skip list is a probabilistic data structure that offers logarithmic time complexity (O(logN)) for searches, insertions, and deletions, much like a balanced binary search tree, but is generally simpler to implement.
- Enqueue (ZADD): O(logN)
  - Explanation: When you add a new user to the Sorted Set using the ZADD command, Redis needs to place it in the correct, sorted position. A Redis Sorted Set is a hybrid data structure that uses a skip list to maintain order. Finding the correct spot in a skip list to insert an element requires traversing a subset of the elements, which is a logarithmic time operation, hence O(logN).
- Dequeue (Batch ZPOPMIN): O(M⋅logN)
  - Explanation: The ZPOPMIN command removes the element with the lowest score (the head of your queue). The complexity is O(logN) for a single removal. When the operation is done in a batch to remove M users, Redis has to perform this O(logN) removal process M times. This multiplies the complexity, resulting in O(M⋅logN).
- Get User's Position: O(1) (bypasses the native ZRANK command).
Why ZRANK Has O(logN) Complexity: Redis Sorted Sets use skip lists where each node maintains counts - the number of elements between itself and the next node at each level. To find a user's rank:
- Start at the header: Begin at the top level of the skip list
- Traverse levels: For each level, follow forward pointers while the next node's score < target score
- Accumulate counts: Add the count stored at each node to the running rank total
- Level descent: Move down one level and repeat until reaching the bottom level
Count Maintenance: When inserting/deleting elements, Redis must update counts at all affected levels. Each insertion requires updating counts at log(N) levels on average, and each deletion requires similar updates to maintain count accuracy.
Why O(logN): Even with counts, ZRANK must traverse log(N) levels, performing count accumulation at each level. For 1M users: ~20 level traversals × count operations = O(logN) complexity.
B) List + Hash
This design separates the two primary functions of the queue: a List for ordered processing and a Hash for user metadata. A List is a doubly linked list, optimized for fast additions and removals from the head and tail. A Hash provides O(1) access to a user's metadata.
- Enqueue (RPUSH): O(1)
- Dequeue (LPOP): O(1)
- Get User's Position: O(1) (bypasses the native LPOS command).
Why LPOS Has O(N) Complexity: The native LPOS command (available since Redis 6.0.6) returns the index of a matching element in a Redis List. LINDEX does the reverse, returning the element at a given index, and is also O(N). A Redis List behaves like a doubly linked list, so to find the position of a specific user, Redis has no choice but to start from the beginning of the list and traverse each element one by one until it finds the matching user ID. This means the time it takes is directly proportional to the number of elements in the list, making it a linear time operation, or O(N). For a large queue with millions of users, this would be prohibitively slow and would not scale.
How We Achieve O(1)
We achieve a constant time position lookup by bypassing native Redis commands entirely and using simple application-level logic. This is possible because our queue design is based on a monotonically increasing sequence number (join_seq).
- Monotonic Counter: The key to this approach is that every user is assigned a unique, sequential number upon joining the queue. This number (join_seq) serves as their unchanging ticket number.
- Tracking the Head: The system maintains a global counter for the "head" of the queue, representing the join_seq of the user currently being processed or the last user admitted.
- The Simple Math: A user's position is simply the difference between their join_seq and the current head of the queue. This is a single subtraction operation.
Example: If the queue head is at join_seq = 1,000,000, and a user has join_seq = 1,000,150, their position is 1,000,150 − 1,000,000 = 150. This calculation is a single step that takes constant time, regardless of whether there are 10 users or 10 million in the queue. This mathematical approach transforms a potentially slow operation into an instant one.
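The subtraction can be modeled in a few lines. This in-memory sketch uses `itertools.count` as a stand-in for Redis INCR and a plain dict in place of the ZSET or Hash. The class and method names are illustrative only.

```python
import itertools


class SequencedQueue:
    """Toy model of join_seq / head_seq bookkeeping (not backed by Redis)."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)  # stands in for Redis INCR
        self._join_seq = {}                 # user_key -> join_seq
        self._last_seq = 0                  # highest ticket issued so far
        self.head_seq = 0                   # join_seq of the last admitted user

    def join(self, user_key: str) -> int:
        """Issue the next ticket number to a user."""
        seq = next(self._counter)
        self._join_seq[user_key] = seq
        self._last_seq = seq
        return seq

    def admit(self, count: int) -> None:
        """Advance the head by `count` tickets, never past the last one issued."""
        self.head_seq = min(self.head_seq + count, self._last_seq)

    def position(self, user_key: str) -> int:
        """One dict lookup and one subtraction; 0 means already admitted."""
        return max(self._join_seq[user_key] - self.head_seq, 0)
```

One caveat worth keeping in mind: if waiting users can leave the queue before being admitted, their ticket numbers leave gaps, so the subtraction becomes an upper bound on the true position rather than an exact count.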
The Con of O(1) Application-Level Logic
While powerful, relying on application-level logic for O(1) lookups creates a risk of race conditions. A user's position could be calculated incorrectly if the head_seq counter changes mid-request.
How to Prevent Race Conditions: The most reliable method is to use a Redis Lua script. The script atomically fetches the user's join_seq and the head_seq counter. Redis guarantees that the entire script runs as a single, indivisible operation, eliminating any risk of a race condition.
Lua Script for Atomic Position Lookup:
-- Atomic position calculation to prevent race conditions
-- KEYS[1] = queue ZSET, KEYS[2] = head sequence counter, ARGV[1] = user member
local queue_key = KEYS[1]
local head_seq_key = KEYS[2]
local user_key = ARGV[1]
-- Get user's join sequence number (the member's score)
local user_seq = redis.call('ZSCORE', queue_key, user_key)
if not user_seq then
    return redis.error_reply("User not in queue")
end
-- Get current head sequence (GET returns false when the counter is unset)
local head_seq = redis.call('GET', head_seq_key) or "0"
-- Calculate position atomically
local position = tonumber(user_seq) - tonumber(head_seq)
-- Redis drops string-keyed table fields when converting a reply,
-- so return an array: {position, user_seq, head_seq}
return {position, user_seq, head_seq}
Race Condition Prevention: The Lua script ensures atomicity by fetching both user_seq and head_seq in a single Redis operation, preventing the race condition where head_seq changes between the two reads.
Note: The specific implementation of the Lua script will vary slightly based on the chosen Redis data structure. For the List + Hash model, the join_seq is fetched from the Hash, while for the Sorted Set model, it is retrieved as the score of the user's member in the set.
What This Gives Your Product
This design delivers a robust, scalable, and resilient queue that transforms a chaotic demand spike into a predictable user experience. By separating concerns—the DB for durable identity, Redis for real-time ordering, and an event bus for reliable communication—we've built a system that provides a smooth user experience, ensures business continuity, and is easy to maintain.
The Grand Unification of Ticketing: Seat Management Done Right
The art of building a reliable ticketing platform lies in one core principle: preventing chaos. When millions of fans are clamoring for a limited number of tickets, your system must be a bastion of order. This deep dive into the Seat Management Service reveals the crucial mechanisms—database transactions and locking—that ensure every reservation is a clean, atomic operation, preventing the nightmare of overselling and double-booking.
We'll build this system using three key tables: Seats to manage unique tickets, Sections for overall availability, and Reservations to handle temporary holds.
The Database as Your Security Guard 🛡️
At the heart of our strategy is the database, specifically PostgreSQL. Its powerful transactional capabilities allow us to treat a series of operations as a single, indivisible unit. The entire reservation process either succeeds completely, or it fails and is entirely rolled back, leaving no trace behind. This is the ACID principle in action, guaranteeing atomicity and consistency.
The Atomic Reservation Flow
When a user requests seats, the backend initiates a carefully choreographed sequence of database operations.
Step 1: Start the Transaction
The very first action is to begin a transaction. Think of this as putting a "Do Not Disturb" sign on the data you're about to work on.
BEGIN;
Step 2: Check & Lock for a Flawless Hold
This is where we prevent overselling. The process differs slightly depending on the type of seating.
For Reserved Seating: Pessimistic Locking 🔒
When a user selects a unique seat (e.g., "Seat A-101"), the system immediately places a lock on that exact seat row in the database. This is a pessimistic lock, so named because we're being pessimistic and assuming another user might want the same seat. It guarantees that no other transaction can lock or modify that seat's row until our transaction is complete. (Plain reads are not blocked in PostgreSQL, thanks to MVCC; they simply see the last committed status.) A competing reservation attempt is forced to wait, preventing a conflict from ever occurring.
This approach is perfect for reserved seats because they are unique and non-fungible. If the query returns fewer seats than requested, it means some were already taken. The transaction is immediately rolled back, and the user receives an error.
-- Select the requested seat and lock it for update
SELECT seat_id, status FROM seats WHERE seat_id = 'A-101' AND status = 'available' FOR UPDATE;
Why not Optimistic Locking for Reserved Seats? While optimistic locking can offer higher concurrency, it's a poor fit for unique, non-fungible items like reserved seats. The approach involves checking for conflicts at the final moment of the transaction. A second user could start a reservation on the same seat, only to have their entire transaction rejected at the very end when the system detects the conflict. This creates a frustrating and unpredictable user experience, as a "seat not available" message delivered late in the process is far worse than an immediate one.
Comparing the Costs:
- Cost of a Pessimistic Lock: The cost is a single, brief wait at the beginning of the process. The application sends one SELECT...FOR UPDATE query to the database, which handles all locking and waiting internally. This is a predictable, easy-to-manage cost.
- Cost of a Failed Optimistic Lock: This cost is deceptive. While it avoids an up-front wait, it introduces the high cost of a complete re-run when a conflict occurs. The application must perform an entire sequence of operations—fetching data, running business logic, and attempting the final update—only for the update to fail. The application then has to discard all the work and restart the entire process, leading to redundant CPU cycles and wasted database round trips.
The essential difference is that a pessimistic lock's cost is a single, brief, and transparent wait. A failed optimistic lock's cost is the wasteful re-execution of a full and complex reservation attempt.
For General Admission (GA): Optimistic Locking 🟢
For GA, we must avoid a bottleneck. Many users will try to reserve GA slots at the same time, all hitting the same Sections table row. A pessimistic lock would serialize these requests, forcing them to wait in a single line, which defeats the purpose of the queuing system.
Instead, we use optimistic locking, which assumes conflicts are rare. We don't lock the data upfront. We rely on a single, atomic SQL statement to perform a compare-and-swap (CAS) operation, checking and updating the value simultaneously. This approach is highly performant and scalable.
Why Conflicts Are Rare in GA Reservations:
Conflicts occur when seats_remaining drops to 0 or below the requested quantity between the time a user checks availability and attempts to reserve. This is rare because:
- Large Capacity Buffers: GA sections typically have thousands of seats (2,000-10,000+), making it unlikely for the last few seats to be contested simultaneously. 
- Queue-Controlled Access: The queuing system already limits concurrent users. Only users who have been "admitted" from the queue can attempt reservations, reducing simultaneous access. 
- Time Windows: Users have limited time (10-15 minutes) to complete reservations, and most successful reservations happen within the first few minutes of admission. 
- Natural Distribution: User behavior naturally spreads out reservation attempts - some users are faster at selecting, others take time to decide. 
- Batch Processing: The system can process multiple small reservations (1-4 tickets) simultaneously without conflict, as the remaining capacity buffer is usually large enough. 
When Conflicts Do Occur: Conflicts become more likely only when seats_remaining approaches very low numbers (e.g., < 10 seats remaining), but by then most users have already completed their reservations, and the queue system has already controlled the flow.
Why Conflicts Are NOT Rare for Reserved Seats: Unlike GA sections with thousands of seats, reserved seating creates high conflict scenarios. Popular seats (front row, center sections, VIP areas) are unique and non-fungible - only one person can have seat A-101. When thousands of users simultaneously target the same premium seats, conflicts are inevitable. This is why reserved seating requires pessimistic locking to prevent overselling and ensure data consistency, while GA's large capacity buffers make optimistic locking viable.
UPDATE sections
SET seats_remaining = seats_remaining - :ga_quantity
WHERE section_id = 'section_GA1' AND seats_remaining >= :ga_quantity
RETURNING seats_remaining;
This single statement is executed atomically. The WHERE clause acts as our check, ensuring the database only performs the UPDATE if the seats_remaining count is sufficient at that exact moment. If concurrent transactions have already reduced the count below the requested quantity, the WHERE condition will fail, and the UPDATE will affect zero rows (RETURNING yields no row), which the application treats as "not enough seats."
Why this is better than Pessimistic Locking for GA: This approach shortens the time each request holds the hot section row. The check and the write happen in one statement, so there is no separate read-then-lock step and no application round trip while the row is locked. Writers to the same row still queue briefly on the row lock, which is held until the enclosing transaction commits, so keeping that transaction short is what lets the system sustain high throughput for high-demand GA events.
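The conditional update is easy to try locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made purely so the example is self-contained) and checks the affected row count instead of using RETURNING. The `reserve_ga` helper name is hypothetical.

```python
import sqlite3


def reserve_ga(conn: sqlite3.Connection, section_id: str, quantity: int) -> bool:
    """Decrement seats_remaining only if enough seats are left."""
    cur = conn.execute(
        "UPDATE sections SET seats_remaining = seats_remaining - ? "
        "WHERE section_id = ? AND seats_remaining >= ?",
        (quantity, section_id, quantity),
    )
    conn.commit()
    # Zero affected rows means the WHERE check failed: not enough seats.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sections (section_id TEXT PRIMARY KEY, seats_remaining INTEGER NOT NULL)"
)
conn.execute("INSERT INTO sections VALUES ('section_GA1', 5)")
conn.commit()

first = reserve_ga(conn, "section_GA1", 4)   # succeeds, leaving 1 seat
second = reserve_ga(conn, "section_GA1", 2)  # fails, count stays at 1
```

The failed attempt leaves the count untouched, so there is nothing to undo and no retry loop is needed.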
Performance Analysis: Locking Strategies Comparison
Pessimistic Locking Performance (Reserved Seats):
- Row Lookup: O(log N) - B-tree index lookup by seat_id
- Lock Acquisition: O(1) - Single row lock after finding row
- Lock Duration: O(1) - Constant time for transaction
- Contention Impact: O(N) - Linear with concurrent users
- Best Case: 10-20ms (no contention)
- Worst Case: 5-10 seconds (high contention)
- Average Case: 50-100ms (moderate contention)
CAS-Based Performance (General Admission):
- Row Lookup: O(log N) - B-tree index lookup by section_id
- Update Operation: O(1) - Single row update with condition check
- No Retries: O(1) - Either succeeds or fails atomically
- Success Case: 20-30ms (atomic update succeeds)
- Failure Case: 20-30ms (atomic update fails due to insufficient seats)
- High Concurrency: O(log N) per transaction regardless of load
Locking Strategy Decision Matrix:
| Factor | Pessimistic | CAS | 
|---|---|---|
| Consistency | Strong | Strong | 
| Concurrency | Low | High | 
| Latency (Low Load) | Fast | Fast | 
| Latency (High Load) | Slow | Moderate | 
| Memory Usage | High | Low | 
| Implementation | Simple | Complex | 
| Best For | Reserved Seats | General Admission | 
Step 3: Update Counts
With the seats or GA section securely locked, we can now confidently update their availability.
-- For reserved seating, update the seat status to 'reserved' (with availability check)
UPDATE seats SET status = 'reserved' WHERE seat_id = 'A-101' AND status = 'available';
-- Also update the count in the sections table
UPDATE sections SET seats_remaining = seats_remaining - 1 WHERE section_id = 'section_A';
This step ensures the section-level availability information is always accurate for other users browsing the event.
Step 4: Create the Reservation Record
A new record is inserted into the Reservations table. This record is the master holding information, linking the user to their selected seats or GA quantity. Crucially, we set a Time-to-Live (TTL) using the expires_at timestamp. This is an essential safety net. It ensures that if a user doesn't complete the payment, the reservation will automatically expire and release the seats.
-- Create reservation header
INSERT INTO reservations (reservation_id, user_id, expires_at, status, total_amount_minor_units, currency, created_at)
VALUES ('res_789', 'user_123', NOW() + INTERVAL '10 minutes', 'pending_payment', 8950, 'USD', NOW());
-- Link reserved seat to reservation
INSERT INTO reservation_seats (reservation_seat_id, reservation_id, seat_id, created_at)
VALUES ('rs_001', 'res_789', 'A101', NOW());
Step 5: The Final Commitment
If all preceding steps succeed, we execute the COMMIT command. This makes all changes permanent. At this point, the seats are officially marked as reserved (or the GA count is reduced), and the locks are released. If any step failed, the ROLLBACK command is issued, and the database returns to its state before the transaction began.
-- All changes are made permanent and locks are released
COMMIT;
-- If a failure occurs, all changes are undone
-- ROLLBACK;
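Steps 3 to 5 can be sketched end to end as one transaction. This is a simplified illustration rather than the production implementation: it uses Python's built-in `sqlite3` module, a trimmed-down schema (no amounts or currency), and a hypothetical `reserve_seat` helper. SQLite has no row-level `FOR UPDATE`, so the guarded UPDATE plays the role of the availability check.

```python
import sqlite3

# Simplified, assumed schema for the demo (fewer columns than the article's tables)
SCHEMA = """
CREATE TABLE sections (section_id TEXT PRIMARY KEY, seats_remaining INTEGER NOT NULL);
CREATE TABLE seats (seat_id TEXT PRIMARY KEY, section_id TEXT NOT NULL, status TEXT NOT NULL);
CREATE TABLE reservations (reservation_id TEXT PRIMARY KEY, user_id TEXT NOT NULL,
                           expires_at TEXT NOT NULL, status TEXT NOT NULL);
CREATE TABLE reservation_seats (reservation_id TEXT NOT NULL, seat_id TEXT NOT NULL);
"""


def reserve_seat(conn, reservation_id, user_id, seat_id, hold_minutes=10):
    """Run Steps 3-5 as one transaction; True on COMMIT, False on ROLLBACK."""
    try:
        with conn:  # COMMIT if the block finishes, ROLLBACK if it raises
            cur = conn.execute(
                "UPDATE seats SET status = 'reserved' "
                "WHERE seat_id = ? AND status = 'available'",
                (seat_id,),
            )
            if cur.rowcount != 1:
                raise LookupError("seat is not available")
            conn.execute(
                "UPDATE sections SET seats_remaining = seats_remaining - 1 "
                "WHERE section_id = (SELECT section_id FROM seats WHERE seat_id = ?)",
                (seat_id,),
            )
            conn.execute(
                "INSERT INTO reservations "
                "VALUES (?, ?, datetime('now', ?), 'pending_payment')",
                (reservation_id, user_id, f"+{hold_minutes} minutes"),
            )
            conn.execute(
                "INSERT INTO reservation_seats VALUES (?, ?)",
                (reservation_id, seat_id),
            )
        return True
    except (LookupError, sqlite3.Error):
        return False  # every change made in the block has been rolled back
```

If any statement fails, including the final INSERT, the seat status and the section count both return to their previous values, which is the all-or-nothing behaviour the COMMIT/ROLLBACK step relies on.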
Expiring Abandoned Reservations
No matter how robust your system, some users will abandon their carts. To prevent these "seat leaks," a separate background process, or worker, periodically scans the Reservations table for records where expires_at is in the past. When an expired reservation is found, the worker executes a new transaction to return the seats to the available pool. For reserved seats, it updates their status to available. For GA, it adds the ga_quantity back to the seats_remaining in the Sections table. This mechanism ensures that inventory is never held indefinitely.
⚠️ Race Condition Considerations: The cleanup process must handle concurrent operations safely. Multiple cleanup jobs, new reservations being created during cleanup, and concurrent seat releases can all cause data inconsistency if not properly managed.
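-- Cleanup log table for monitoring and debugging
-- This table tracks every cleanup operation so we can monitor the system's health
-- Note: AUTO_INCREMENT and the inline INDEX clause are MySQL syntax; on PostgreSQL,
-- use GENERATED ALWAYS AS IDENTITY for log_id and a separate CREATE INDEX statement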
CREATE TABLE cleanup_log (
    log_id INT AUTO_INCREMENT PRIMARY KEY,           -- Unique identifier for each log entry
    section_id VARCHAR(50),                          -- Which section was cleaned up
    expired_ga_count INT,                            -- How many GA seats were released
    expired_reserved_count INT,                      -- How many reserved seats were released
    affected_reservations INT,                       -- How many reservations were marked as expired
    cleaned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, -- When the cleanup happened
    INDEX idx_cleaned_at (cleaned_at)               -- Index for efficient queries by time
);
-- PostgreSQL version (more robust with better locking)
CREATE OR REPLACE PROCEDURE cleanup_expired_reservations(p_section_id VARCHAR(50))
LANGUAGE plpgsql
AS $$
DECLARE
    v_expired_ga_count INT := 0;
    v_expired_reserved_count INT := 0;
    v_affected_rows INT := 0;
BEGIN
    -- Start a transaction block (automatically handled in procedures in PG >=11)
    -- We still use EXCEPTION handling for rollback on any error.
    BEGIN
        -- Lock the section row to prevent concurrent cleanups
        PERFORM 1 FROM sections WHERE section_id = p_section_id FOR UPDATE;
        -- Lock all affected reservations early to prevent concurrent modifications
        PERFORM 1
        FROM reservations r
        LEFT JOIN reservation_ga rga ON r.reservation_id = rga.reservation_id
        LEFT JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        LEFT JOIN seats s ON rs.seat_id = s.seat_id
        WHERE (rga.section_id = p_section_id OR s.section_id = p_section_id)
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment'
        FOR UPDATE OF r;  -- OF r is required: PostgreSQL rejects FOR UPDATE on the nullable side of an outer join
        -- Count expired GA reservations for this section
        SELECT COALESCE(SUM(rga.quantity), 0)
        INTO v_expired_ga_count
        FROM reservations r
        JOIN reservation_ga rga ON r.reservation_id = rga.reservation_id
        WHERE rga.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';
        -- Count expired reserved seat reservations for this section
        SELECT COUNT(*)
        INTO v_expired_reserved_count
        FROM reservations r
        JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        JOIN seats s ON rs.seat_id = s.seat_id
        WHERE s.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';
        -- Restore seats_remaining count
        IF (v_expired_ga_count > 0 OR v_expired_reserved_count > 0) THEN
            UPDATE sections
            SET seats_remaining = seats_remaining + v_expired_ga_count + v_expired_reserved_count
            WHERE section_id = p_section_id;
        END IF;
        -- Release reserved seats for this section only
        UPDATE seats s
        SET status = 'available'
        FROM reservations r
        JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        WHERE rs.seat_id = s.seat_id
          AND s.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';
        -- Mark expired reservations (both GA and reserved) as expired for this section
        UPDATE reservations r
        SET status = 'expired'
        WHERE r.expires_at < NOW()
          AND r.status = 'pending_payment'
          AND (
              r.reservation_id IN (
                  SELECT rga.reservation_id FROM reservation_ga rga WHERE rga.section_id = p_section_id
              ) OR
              r.reservation_id IN (
                  SELECT rs.reservation_id FROM reservation_seats rs 
                  JOIN seats s ON rs.seat_id = s.seat_id 
                  WHERE s.section_id = p_section_id
              )
          );
        -- Capture how many reservations the UPDATE above just marked as expired.
        -- (Counting rows with status = 'expired' afterwards would also include
        -- reservations expired by earlier cleanup runs.)
        GET DIAGNOSTICS v_affected_rows = ROW_COUNT;
        -- Log cleanup operation
        INSERT INTO cleanup_log (
            section_id, 
            expired_ga_count, 
            expired_reserved_count,
            affected_reservations, 
            cleaned_at
        )
        VALUES (
            p_section_id, 
            v_expired_ga_count, 
            v_expired_reserved_count, 
            v_affected_rows, 
            NOW()
        );
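    EXCEPTION
        WHEN OTHERS THEN
            -- Entering the handler already rolls back every change made inside this block;
            -- PL/pgSQL does not allow an explicit ROLLBACK in a block with an EXCEPTION clause.
            RAISE NOTICE 'Error during cleanup: %', SQLERRM;
            RAISE;
    END;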
END;
$$;
Key Race Condition Protections:
- Row-Level Locking: FOR UPDATE prevents concurrent section modifications
- Atomic Operations: All updates in a single transaction
- Error Handling: Automatic rollback on any failure
- Audit Trail: Preserves expired reservations for compliance
- Monitoring: Cleanup log tracks operations for debugging
The Anatomy of a Hyper-Scale Real-Time System: A Production-Ready Deep Dive
A detailed technical guide to designing a resilient, high-performance, and scalable notification platform.
In the high-stakes world of live event ticketing, a real-time notification system is a critical business tool, not a luxury. It directly drives revenue by preventing lost sales and builds brand trust by providing a transparent user experience.
A static "seats available" number leads to frustration and cart abandonment when users attempt to reserve seats that are already gone. A real-time system provides an accurate, live view of inventory, which eliminates this negative experience. Additionally, seeing seat availability change dynamically creates a sense of urgency that encourages faster purchasing decisions.
Building a real-time system capable of serving millions of concurrent users requires a series of deliberate architectural choices that prioritize decoupling, resilience, and performance. This article provides a deep dive into a production-ready architecture for a large-scale ticketing notification system. We will define each component's role, trace the data flow in detail, and analyze the system's resilience to failure.
Part 1: Foundational Components
This architecture is composed of several specialized components, each responsible for a distinct part of the data lifecycle.
Event Producers & The Durable Log (Amazon MSK)
The system is event-driven. State changes are captured automatically using Change Data Capture (CDC) from the database, ensuring complete decoupling between high-frequency reservation operations and event publishing.
How CDC Works in Practice:
- Database Changes: When the Seat Management Service commits a reservation transaction (e.g., seat status changed from 'available' to 'reserved'), the database records the change in its transaction log
- Automatic Capture: CDC captures all database changes at the transaction log level, including:
  - Table modifications (seats, reservations, sections)
  - Before/after values for each change
  - Transaction metadata and timestamps
  - User context from the application
- Event Transformation: A CDC processor transforms raw database changes into business events:
  - Raw UPDATE → 'seat_reserved' event
  - Raw INSERT → 'reservation_created' event
  - Raw DELETE → 'reservation_expired' event
- MSK Publishing: Transformed events are published to appropriate Kafka topics (e.g., 'seat-events', 'reservation-events')
- Minimal Performance Impact: CDC reads the transaction log asynchronously, so it adds no latency to the reservation request path
- Downstream Processing: Other services (Availability Aggregation, Notifications) consume these events to update their state
MSK serves as the system's durable log and provides several critical functions:
- Decoupling: It decouples producers from consumers, allowing them to operate and scale independently.
- Durability & Replayability: Events are durably stored, allowing them to be re-processed in case of a consumer-side error.
- Backpressure Management: The bus acts as a large buffer, absorbing traffic spikes.
The Processing Engine (Apache Flink)
The Availability Aggregation Service is implemented as a stateful stream processing application using Apache Flink. It consumes the raw event stream from MSK and transforms it into clean, aggregated availability counts.
- Stateful Processing: Flink uses a keyBy operation on the section_id to ensure all events for a single section are processed sequentially by the same task.
- Fault Tolerance: Flink is configured to take periodic checkpoints of its state to a durable store like Amazon S3, so that after a failure it can restore the last checkpoint and resume without losing or double-counting state.
- Windowed Aggregation & Throttling: Instead of emitting an update for every single event, Flink can operate on windows of time. For example, it can collect all events for a specific section over a 2-second window, perform a single aggregation at the end of that window, and then emit a single, consolidated update. This micro-batching is crucial for throttling the update stream, reducing the load on the entire downstream notification system and preventing the end user's UI from flickering with an excessive number of rapid updates.
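The windowing idea can be illustrated without Flink. The sketch below is plain Python, not the Flink API: it groups events into fixed 2-second tumbling windows per section and emits one net change per window. The `SeatEvent` type, its `delta` field, and `aggregate_windows` are hypothetical names introduced for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class SeatEvent(NamedTuple):
    timestamp: float  # event time in seconds
    section_id: str
    delta: int        # -1 for a reservation, +1 for a release


def aggregate_windows(
    events: Iterable[SeatEvent], window_seconds: float = 2.0
) -> Dict[Tuple[float, str], int]:
    """Return {(window_start, section_id): net_delta}, one entry per section per window."""
    totals: Dict[Tuple[float, str], int] = defaultdict(int)
    for ev in events:
        # Floor the timestamp to the start of its tumbling window
        window_start = (ev.timestamp // window_seconds) * window_seconds
        totals[(window_start, ev.section_id)] += ev.delta
    return dict(totals)
```

Four raw events in the example data collapse into two downstream updates, which is the throttling effect described above.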
The Distribution Fleet (Amazon EKS)
The Real-time Notification Service is a fleet of containerized applications deployed on Amazon EKS (Elastic Kubernetes Service). These long-running pods are the "SSE hosts."
- Persistent Connections: Each pod is capable of maintaining thousands of persistent Server-Sent Events (SSE) connections with clients.
- Local State: Each pod maintains two private, in-memory hash tables to manage its local connections, enabling extremely fast lookups.
- Horizontal Scalability: The fleet runs as a Kubernetes Deployment, allowing it to be scaled horizontally.
The Connection Manager (Application Load Balancer)
An Application Load Balancer (ALB) sits in front of the EKS fleet, terminating TLS, performing health checks, and distributing incoming SSE connection requests to the EKS pods using a "Least Connections" algorithm.
The Routing Directory (Amazon ElastiCache for Redis)
A central, low-latency in-memory database acts as our real-time routing table, or "Director."
- Function: Its primary job is to maintain a live map between a logical topic (e.g., section_id) and the physical identifiers of the EKS pods that are currently serving clients interested in that topic.
- Implementation: This is implemented using Amazon ElastiCache for Redis, configured for high availability. The data structure is a simple Redis Set: section:104:subscribers -> { "pod-a-ip", "pod-c-ip" }.
Part 2: The Core Architecture — A Deep Dive into the Data Flow
The architecture is designed for low-latency, targeted message delivery.
Phase 1: Connection & State Registration
This phase establishes the client connection and builds the two-layer routing map.
- A client initiates an SSE connection request to the ALB. The ALB selects a target pod, pod-C, and forwards the request.
- The application inside pod-C establishes the SSE connection and assigns it a unique internal ID.
- The client sends a subscription message (e.g., for section-104).
- pod-C updates its two local, in-memory hash tables:
  - Forward Map (connectionId -> sections): Used for efficient cleanup when the client disconnects.
  - Reverse Map (section -> connectionIds): Used for O(1) lookups during message delivery.
- pod-C then updates the Redis Director, adding its own unique, addressable identifier to the Redis Set for section-104.
   redis.SADD("section:104:subscribers", "pod-c-ip")
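The two-layer routing state from this phase can be sketched as follows. This is an illustrative model only: a plain dictionary of sets stands in for the Redis Director, and the `PodConnectionRegistry` class is a hypothetical name. Removing the pod from the Director once its last local subscriber disconnects is an assumption of this sketch; the steps above only describe registration.

```python
from collections import defaultdict


class PodConnectionRegistry:
    """Per-pod forward/reverse maps plus registration in a shared director."""

    def __init__(self, pod_id, director):
        self.pod_id = pod_id
        self.director = director          # stand-in for Redis: key -> set of pod ids
        self.forward = defaultdict(set)   # connection_id -> sections
        self.reverse = defaultdict(set)   # section -> connection_ids

    def subscribe(self, connection_id, section):
        self.forward[connection_id].add(section)
        self.reverse[section].add(connection_id)
        # Equivalent of: SADD section:<id>:subscribers <pod>
        key = f"section:{section}:subscribers"
        self.director.setdefault(key, set()).add(self.pod_id)

    def disconnect(self, connection_id):
        # The forward map tells us exactly which sections to clean up
        for section in self.forward.pop(connection_id, set()):
            connections = self.reverse[section]
            connections.discard(connection_id)
            if not connections:
                del self.reverse[section]
                # Assumed cleanup, equivalent of: SREM section:<id>:subscribers <pod>
                key = f"section:{section}:subscribers"
                self.director.get(key, set()).discard(self.pod_id)

    def connections_for(self, section):
        return set(self.reverse.get(section, set()))
```

The forward map exists purely so that a disconnect does not have to scan every section in the reverse map.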
Phase 2: Real-Time Message Delivery
This phase traces an event from processing to final delivery.
- The Flink application publishes a processed update for section-104 to an Amazon SNS Topic.
- SNS invokes a subscribed AWS Lambda function (the "Router Lambda").
- The Lambda function queries the Redis Director to get the set of pod identifiers subscribed to section-104.
   redis.SMEMBERS("section:104:subscribers") -> returns { "pod-c-ip", "pod-f-ip" }
- The Lambda performs a direct, service-to-service push to the target pods via an internal API endpoint on each pod.
- The target pod, pod-C, receives this internal API call.
- pod-C performs a highly efficient O(1) lookup in its local in-memory reverse_map to get the list of specific client connectionIds.
- The pod's application iterates through this list and writes the message into the open SSE streams for each connection.
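The delivery path above reduces to two lookups: one in the Director to find pods, and one in each pod's reverse map to find connections. The sketch below models both with in-memory stand-ins. `route_update`, `deliver_on_pod`, and the `push_to_pod` and `write_sse` callbacks are hypothetical names; a real router would call Redis and each pod's internal API instead.

```python
from typing import Callable, Dict, List, Set


def route_update(
    director: Dict[str, Set[str]],
    section: str,
    message: str,
    push_to_pod: Callable[[str, str, str], None],
) -> List[str]:
    """Router side: look up subscribed pods (like SMEMBERS) and push to each."""
    pods = sorted(director.get(f"section:{section}:subscribers", set()))
    for pod_id in pods:
        push_to_pod(pod_id, section, message)
    return pods


def deliver_on_pod(
    reverse_map: Dict[str, Set[str]],
    section: str,
    message: str,
    write_sse: Callable[[str, str], None],
) -> int:
    """Pod side: one reverse-map lookup, then one write per open connection."""
    connection_ids = reverse_map.get(section, set())
    for connection_id in connection_ids:
        write_sse(connection_id, message)
    return len(connection_ids)
```

Pods with no subscribers for a section never receive the push, which is what keeps delivery targeted rather than broadcast.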
Message Delivery Optimization Analysis
Fanout Pattern Performance:
Direct Broadcast (Naive Approach):
1 message → 100K users = 100K individual WebSocket writes
Time Complexity: O(N) where N = number of subscribers
Memory Complexity: O(N) for connection management
SNS Fanout (Optimized Approach):
1 message → SNS → 100K Lambda invocations → 100K WebSocket writes
Time Complexity: O(1) for publish + O(N) for delivery
Memory Complexity: O(1) for SNS + O(N) for Lambda memory
Batching Optimization:
100 messages → Batch → 1K WebSocket writes
Time Complexity: O(N/K) where K = batch size
Memory Complexity: O(N/K) for batched connections
Message Delivery Guarantees:
At-Most-Once Delivery:
- Implementation: Fire-and-forget with no acknowledgments
- Use Case: Non-critical updates (seat availability changes)
- Performance: Highest throughput, lowest latency
- Trade-off: Some messages may be lost
At-Least-Once Delivery:
- Implementation: Retry mechanism with acknowledgments
- Use Case: Critical updates (payment confirmations)
- Performance: Lower throughput, higher latency
- Trade-off: Some messages may be duplicated
Exactly-Once Delivery:
- Implementation: Idempotent processing with deduplication
- Use Case: Financial transactions
- Performance: Lowest throughput, highest latency
- Trade-off: Highest complexity, highest reliability
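The "idempotent processing with deduplication" approach can be sketched as a thin wrapper around an at-least-once consumer. This is a minimal in-memory illustration; `IdempotentConsumer` is a hypothetical name, and it assumes every message carries a unique `message_id`. A real deployment would need a durable, bounded store for seen IDs rather than an unbounded set.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if the transport redelivers it."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()  # in-memory only; assumed durable in a real system

    def receive(self, message_id, payload) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.handler(payload)
        # Record the ID only after the handler succeeds, so a failed attempt can be retried
        self.seen_ids.add(message_id)
        return True
```

Combined with retries on the sending side, this gives the effect of exactly-once processing: duplicates may arrive, but they are not applied twice.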
Part 3: Bulletproofing the System — A Deep Dive into Resilience
A production-ready architecture must be designed explicitly for failure.
Resilience of the EKS Fleet
- Pod Failure: If a pod crashes, Kubernetes's ReplicaSet launches a replacement, and the ALB's health checks stop routing new connections to the failed pod. Clients that were connected to it trigger their automatic reconnect logic and are routed to a healthy pod by the ALB.
- Node Failure: The EKS worker nodes are managed by an Auto Scaling Group. If an EC2 instance fails, the ASG will terminate it and launch a replacement. Kubernetes will then automatically reschedule the affected pods onto the remaining healthy nodes in the cluster.
Resilience of the Director (Redis Cluster)
- Primary Defense (High Availability): The Director is deployed as an Amazon ElastiCache for Redis cluster with Multi-AZ and automatic failover enabled. If the primary Redis node fails, ElastiCache automatically promotes a replica.
- Disaster Recovery (Graceful Degradation): In a total service failure, the Router Lambda can be programmed with a fallback. Upon detecting that Redis is unavailable, it could revert to a less efficient broadcast model or log the failure and wait for recovery.
Performance Benchmarks
Queue, Reservation & Notification Performance Comparison
Maximum QPS & Memory at 10,000 Concurrent Users:
| Category | Method | P95 Latency | Max QPS | Memory (10K users) | Complexity | Notes | 
|---|---|---|---|---|---|---|
| Queue | Redis ZSET | 5ms | 25,000 | 400MB | O(logN) | Skip list implementation | 
| Queue | Redis List+Hash | 2ms | 50,000 | 320MB | O(1) | Linked list + hash table | 
| Queue | Database Queue | 50ms | 2,000 | 1GB | O(N) | PostgreSQL-based | 
| Queue | In-Memory Array | 1ms | 100,000 | 200MB | O(1) | Single-threaded only | 
| Reservation | Pessimistic Lock | 30ms | 2,000 | 800MB | O(1) | Strong consistency | 
| Reservation | Optimistic Lock | 15ms | 8,000 | 600MB | O(1) | High concurrency | 
| Reservation | Database Transaction | 40ms | 1,500 | 500MB | O(1) | ACID guarantees | 
| Notification | Redis Pub/Sub | 2ms | 30,000 | 200MB | O(1) | Fire-and-forget | 
| Notification | WebSocket Direct | 1ms | 5,000 | 800MB | O(1) | Bidirectional, persistent connection | 
| Notification | SSE + SNS | 8ms | 15,000 | 1.2GB | O(1) | AWS managed | 
Memory Usage Calculation Breakdown
Performance at 10,000 Concurrent Users:
Queue Methods:
- Redis ZSET (400MB): 10K users × 32 bytes (user_id + score + skip list pointers) + 25% Redis overhead
- Redis List+Hash (320MB): 10K users × 40 bytes (list + hash entries) + Redis metadata
- Database Queue (1GB): 10K users × 100 bytes (PostgreSQL row) + connection pools + indexes
- In-Memory Array (200MB): 10K users × 8 bytes (integer user_id) + JVM overhead + GC buffers
Reservation Methods:
- Pessimistic Lock (800MB): 10K concurrent locks × 1KB (lock metadata) + database buffers
- Optimistic Lock (600MB): No lock overhead, just version numbers + database buffers
- Database Transaction (500MB): 10K transactions × 50 bytes (transaction state) + database buffers
Notification Methods:
- Redis Pub/Sub (200MB): 10K QPS × 100 bytes (message overhead) + connection buffers
- WebSocket Direct (800MB): 10K connections × 8KB (connection state) + WebSocket server overhead
- SSE + SNS (1.2GB): 10K connections × 2KB (SSE state) + AWS Lambda + SNS message processing
Key Assumptions: 10K concurrent users baseline, Redis overhead 20-25%, database connection pools, network buffers, AWS service overhead. Actual usage varies by implementation details and configuration.
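As a quick sanity check on the breakdown above, the per-user terms can be computed on their own. The helper below is a hypothetical one-liner; the inputs are the per-user sizes quoted in the list, with 1KB taken as 1,024 bytes.

```python
def per_user_megabytes(users: int, bytes_per_user: int) -> float:
    """Memory attributable to the per-user term alone, in MB (1 MB = 1,000,000 bytes)."""
    return users * bytes_per_user / 1_000_000


# Per-user terms quoted above, evaluated at 10,000 users
zset_mb = per_user_megabytes(10_000, 32)             # Redis ZSET entry
lock_mb = per_user_megabytes(10_000, 1024)           # pessimistic lock metadata
websocket_mb = per_user_megabytes(10_000, 8 * 1024)  # WebSocket connection state
```

These work out to roughly 0.32 MB, 10 MB, and 82 MB respectively, well below the 400MB and 800MB totals in the table. Read this way, the totals appear to be dominated by the fixed terms (database buffers, connection pools, runtime and service overhead) rather than by per-user state.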
Production Metrics
High-Demand Event Benchmarks:
- Peak Queue Length: 2M+ users waiting
- Admission Rate: 1,000 users/second (controlled)
- Reservation Success Rate: 95%+ (5% timeout/abandonment)
- Notification Delivery: 99.9% within 2 seconds
- System Recovery: <30 seconds from Redis failure
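The queue length and admission rate above together imply a wait time. A rough estimate, assuming a constant admission rate and nobody leaving the queue, is simply position divided by rate; `estimated_wait_seconds` is a hypothetical helper for this calculation.

```python
def estimated_wait_seconds(position: int, admission_rate_per_second: float) -> float:
    """Time until the user at `position` is admitted, at a constant admission rate."""
    if admission_rate_per_second <= 0:
        raise ValueError("admission rate must be positive")
    return position / admission_rate_per_second


# Last user in a 2M queue at 1,000 admissions per second
wait_minutes = estimated_wait_seconds(2_000_000, 1_000) / 60
```

At the quoted figures, the user at the back of a 2M queue waits about 2,000 seconds, or roughly 33 minutes.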
Resource Utilization:
- Redis Cluster: 3 nodes, 16GB RAM each, 50% CPU utilization
- Database: 8 cores, 32GB RAM, 70% CPU utilization
- Notification Fleet: 20 pods, 4GB RAM each, 60% CPU utilization
Monitoring & Observability
Key Metrics
Queue Management Metrics:
- Queue Length: Current number of users waiting
- Average Wait Time: Mean time from join to admission
- Admission Rate: Users admitted per second
- Queue Position Accuracy: Consistency of position calculations
- Redis Memory Usage: Memory consumption by queue data structures
Reservation System Metrics:
- Reservation Success Rate: Percentage of successful reservations
- Lock Contention: Number of lock conflicts per second
- Transaction Duration: Average time for reservation transactions
- Abandonment Rate: Percentage of users who don't complete payment
- Seat Release Rate: Expired reservations released per minute
Real-Time Notification Metrics:
- Message Delivery Rate: Notifications delivered per second
- Delivery Latency: Time from event to user notification
- Connection Health: Active WebSocket/SSE connections
- Message Loss Rate: Failed deliveries due to connection drops
- Fanout Efficiency: Messages per SNS publish
Alerting Thresholds
Critical Alerts:
- Queue length > 1M users
- Reservation failure rate > 10%
- Notification delivery latency > 5 seconds
- Redis memory usage > 80%
- Database connection pool exhaustion
Warning Alerts:
- Queue length > 500K users
- Reservation failure rate > 5%
- Notification delivery latency > 2 seconds
- Redis memory usage > 60%
- Average wait time > 15 minutes
Channels:
- Critical: PagerDuty (immediate escalation)
- Warning: Slack (team notification)
- Info: Email (daily reports)
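The numeric thresholds above map naturally onto a small lookup table. The sketch below is one possible encoding, not a specific monitoring tool's configuration: the metric names and the `classify` function are hypothetical, and rates and memory usage are expressed as fractions.

```python
from typing import Optional

# metric name -> (warning threshold, critical threshold), taken from the lists above
THRESHOLDS = {
    "queue_length": (500_000, 1_000_000),
    "reservation_failure_rate": (0.05, 0.10),
    "notification_latency_seconds": (2.0, 5.0),
    "redis_memory_fraction": (0.60, 0.80),
}

# severity -> notification channel
CHANNELS = {"critical": "PagerDuty", "warning": "Slack"}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None for a single metric reading."""
    warning, critical = THRESHOLDS[metric]
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None
```

Checking the critical threshold first matters, since any value above it also exceeds the warning threshold.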
Logging Strategy
Queue System Logs:
- User join/leave events with timestamps
- Position changes and ETA updates
- Redis operation performance
- Queue admission decisions
Reservation System Logs:
- Lock acquisition/release events
- Transaction success/failure with reasons
- Seat status changes
- Payment timeout events
Notification System Logs:
- Message publish events
- Delivery confirmations
- Connection establishment/teardown
- Fanout performance metrics
Disaster Recovery
Failure Scenarios
Queue System Failures:
- Redis Cluster Failure: 15-30 second recovery with automatic failover
- Queue Order Corruption: 2-5 minute recovery using database reconstruction
- Position Calculation Errors: Immediate detection via monitoring, <1 minute fix
- Admission Rate Overload: Automatic throttling, 30 second stabilization
Reservation System Failures:
- Database Connection Loss: 10-20 second recovery with connection pooling
- Lock Deadlock: Automatic detection and resolution in <5 seconds
- Transaction Rollback: Immediate cleanup, no data corruption
- Seat Status Inconsistency: Background reconciliation process
Notification System Failures:
- WebSocket Connection Drops: Automatic reconnection in <3 seconds
- SNS Service Outage: Fallback to direct database polling
- Lambda Function Timeout: Automatic retry with exponential backoff
- Message Bus Failure: Graceful degradation to polling mode
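Retry with exponential backoff, mentioned above for Lambda timeouts, follows a simple pattern: double the maximum wait after each failure, up to a cap. The sketch below is a generic illustration rather than AWS Lambda's built-in retry behaviour; `backoff_delays` and `call_with_retry` are hypothetical helpers, and the random "full jitter" is an added assumption that helps avoid synchronized retries.

```python
import random
import time


def backoff_delays(max_attempts, base_seconds=0.1, cap_seconds=5.0):
    """Upper bound of the wait before each retry: base * 2^attempt, capped."""
    return [min(cap_seconds, base_seconds * (2 ** attempt)) for attempt in range(max_attempts)]


def call_with_retry(operation, max_attempts=4, base_seconds=0.1, cap_seconds=5.0,
                    sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    delays = backoff_delays(max_attempts, base_seconds, cap_seconds)
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time between 0 and the current bound
            sleep(random.uniform(0, delays[attempt]))
```

The `sleep` parameter is injectable so that the retry logic can be exercised in tests without actually waiting.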
Recovery Procedures
Queue System Recovery:
- Redis Failover: Automatic promotion of replica to primary
- Queue Reconstruction: Rebuild from database using join_seq ordering
- Position Recalculation: Update all user positions using head_seq tracking
- Admission Resume: Restart controlled admission at safe rate
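Queue reconstruction and position recalculation can be sketched in a few lines. The semantics here are assumptions made for illustration: `join_seq` is taken to be a monotonically increasing number assigned when a user joins, and `head_seq` the `join_seq` of the most recently admitted user. `rebuild_queue` and `position` are hypothetical helpers.

```python
def rebuild_queue(rows):
    """rows: (user_id, join_seq) pairs read from the database; returns user_ids in join order."""
    return [user_id for user_id, _ in sorted(rows, key=lambda row: row[1])]


def position(join_seq: int, head_seq: int) -> int:
    """Number of admissions remaining until this user is admitted (0 if already admitted)."""
    return max(0, join_seq - head_seq)
```

Under these assumptions a user's position needs no scan of the queue, only a subtraction. If users can leave the queue before admission, the result is an upper bound rather than an exact position.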
Reservation System Recovery:
- Connection Pool Reset: Clear failed connections and establish new ones
- Lock Cleanup: Release any orphaned locks from failed transactions
- Seat Status Audit: Verify all seat statuses match reservation records
- Transaction Log Replay: Replay any committed transactions that weren't reflected
Notification System Recovery:
- Connection Re-establishment: Reconnect all dropped WebSocket/SSE connections
- Message Replay: Replay missed notifications from event log
- Fanout Restart: Resume SNS-based message distribution
- Client Notification: Notify users of temporary service interruption
Backup Strategy
Queue State Backup:
- Real-time Replication: Continuous Redis replication to standby cluster
- Event Log Backup: All queue events stored in durable database (primary recovery mechanism)
- Queue Reconstruction: Rebuild queue from database events if Redis fails
- RTO/RPO: 30 second recovery time, 5 second data loss maximum
Reservation Data Backup:
- Database Replication: Real-time replication to standby database
- Transaction Log Backup: Continuous backup of all reservation transactions
- Seat Map Backup: Daily backup of seat configuration and status
- RTO/RPO: 15 second recovery time, 1 second data loss maximum
Notification System Backup:
- Message Queue Backup: SNS topic replication across regions
- Connection State Backup: Periodic backup of active connections
- Event Stream Backup: All notification events stored in durable log
- RTO/RPO: 45 second recovery time, 10 second data loss maximum
			Sumedh Bala | Sciencx (2025-10-23T17:48:52+00:00) Part 3: Seat Management. Retrieved from https://www.scien.cx/2025/10/23/part-3-seat-management-2/
