Fixing Network Congestion in AI Training with UCCL
Smarter, receiver-driven flow control for synchronized GPU communication in large-scale training.

Context
Ever wonder why training models like GPT-4 costs millions? A huge part of the cost isn’t the expensive GPUs themselves; it’s a hidden problem: the challenge of getting thousands of them to work together.
Today’s AI models are gigantic, with billions of parameters that have to be trained across thousands of GPUs at once. Each GPU tackles a piece of the puzzle, but to stay on track, they must constantly share their work with all the others.
This constant need to sync up creates a huge communication overhead. On a big project like GPT-4, this can mean moving terabytes of information across the network every few minutes just to keep all the GPUs aligned. This is often where things start to slow down, not because the GPUs cannot keep up, but because the network connecting them becomes the real bottleneck. Even the fastest chips cannot do much if they are stuck waiting on data.
Figure 1: A typical spine-leaf topology used in AI clusters. Leaf switches (ToRs) connect directly to GPU nodes, while spine switches link all the leaf switches together. This design gives every GPU a predictable, low-latency path to any other node. It’s ideal for large-scale training where fast, all-to-all communication is critical.

To see what this looks like in the real world, let’s imagine training a 10-billion-parameter model on just a three-node cluster, each node with eight H100 GPUs.
- Training setup: 24 H100 GPUs training a 10B-parameter model across 3 nodes
- Node configuration: each node has 8 GPUs (GPU0–GPU7) and 8× 400G RDMA NICs
- Model size: 10 billion parameters × 2 bytes (FP16) = 20 GB
- Communication pattern: only GPU0 from each node participates in inter-node RDMA
AllReduce Flow:
- A0 → B0: Node A’s GPU0 sends aggregated gradients to Node B’s GPU0
- C0 → B0: Node C’s GPU0 sends aggregated gradients to Node B’s GPU0
- B0 → A0: Node B’s GPU0 sends results back to Node A’s GPU0
- B0 → C0: Node B’s GPU0 sends results back to Node C’s GPU0
Even in this relatively small setup, each node needs to send about 6.7 GB (20 GB / 3) of data to other nodes during every training step. When this happens thousands of times during training, we are moving petabytes of data across the network.
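A quick back-of-the-envelope check makes the scale concrete. The sketch below only restates the numbers from the setup above; the step count is a hypothetical placeholder, not a figure from the article.

```python
# Back-of-the-envelope estimate of per-step AllReduce traffic for the setup above.
# Assumes FP16 gradients (2 bytes/parameter) and that each of the 3 nodes handles
# roughly a third of the gradient buffer per direction, as described in the text.

params = 10_000_000_000          # 10B parameters
bytes_per_param = 2              # FP16
nodes = 3                        # one communicating GPU (GPU0) per node

model_bytes = params * bytes_per_param            # 20 GB of gradients per step
per_node_chunk = model_bytes / nodes              # ~6.7 GB handled per node

print(f"Gradient payload per step: {model_bytes / 1e9:.1f} GB")
print(f"Data each node sends per direction: {per_node_chunk / 1e9:.1f} GB")

# Over a long run the totals add up quickly:
steps = 100_000                                   # hypothetical number of training steps
total = per_node_chunk * 2 * nodes * steps        # send + receive, all nodes
print(f"Cluster-wide traffic over {steps} steps: ~{total / 1e15:.1f} PB")
```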
Production systems handle this by being smart about communication patterns. Instead of all the GPUs talking across the network, they send their data to one GPU in the node, which then talks to the others. GPUs within a node use high-speed local connections called NVLink to share data with one designated GPU, which then handles all the network communication to other nodes.
Even with all the performance tuning, moving petabytes of data across GPUs is still a huge challenge. That is where Remote Direct Memory Access (RDMA) helps. With RDMA, GPUs no longer need to route data through the CPU for every transfer. Instead, they can write directly to each other’s memory over the network. It’s a direct connection that skips the CPU entirely, reducing latency and avoiding unnecessary overhead.
In these Ethernet-based clusters, RDMA runs over the RoCEv2 protocol, which works on standard Ethernet infrastructure rather than requiring a specialized fabric. Data Center Quantized Congestion Notification (DCQCN) automatically manages congestion by slowing senders down when the network gets crowded. Priority Flow Control (PFC) helps prevent data loss by signaling the sender to pause whenever a switch’s buffer starts to fill up.
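To give a feel for how DCQCN behaves, here is a heavily simplified sketch of its sender-side rate adjustment: cut the rate when Congestion Notification Packets (CNPs) arrive, recover slowly when they stop. The constants are illustrative defaults, not the values a production NIC would use.

```python
# Simplified sketch of DCQCN-style sender behaviour: the NIC cuts its rate when it
# receives CNPs (triggered by ECN marks) and slowly recovers otherwise.
# The constants below are illustrative, not tuned production values.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps       # current sending rate
        self.target = line_rate_gbps     # rate to recover toward
        self.alpha = 1.0                 # estimate of congestion severity
        self.g = 1 / 16                  # alpha smoothing gain

    def on_cnp(self):
        """Congestion Notification Packet received: multiplicative decrease."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)

    def on_recovery_timer(self):
        """No CNPs for a while: decay alpha and move back toward the target rate."""
        self.alpha *= (1 - self.g)
        self.rate = (self.rate + self.target) / 2

sender = DcqcnSender(line_rate_gbps=400.0)
sender.on_cnp()
print(f"Rate after one CNP: {sender.rate:.0f} Gbps")   # roughly halved at first
```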
Together, RoCEv2, DCQCN, and PFC provide high-speed, low-latency communication with hardware-managed reliability. This seems perfect for AI workloads. The problem is that AI training breaks all the assumptions these systems were built on.
Let’s look at why this could fail for real AI workloads.
Problems with RoCEv2 + PFC + DCQCN
Let’s look at our three-node setup. We have 24 H100 GPUs training a 10B parameter model. Each node has 8 GPUs, with GPU0 responsible for inter-node communication. During each AllReduce step, around 6.7GB of gradient data needs to be exchanged per direction. That’s the point where trouble really begins.
The traditional RDMA stack wasn’t designed for this kind of traffic. RoCEv2 with DCQCN and PFC was mainly built for storage and transactional workloads, where traffic consists of many small flows, spread out over time and targeting different destinations. In that world, congestion builds up slowly, giving ECN enough time to respond before it becomes a real problem.
AI traffic is not distributed over time, it is bursty in nature. When AllReduce starts, multiple GPUs send large chunks of data at the same time, often to the same destination. In our setup, both A0 and C0 begin sending 6.7GB of data to B0 at the same time, creating a sudden 13GB surge that hits the network all at once and quickly overwhelms it.
ECMP Collisions
ECMP hash collision is a common problem in this type of topology. ECMP just looks at packet headers when it makes its forwarding decision. It has no idea which link is idle and which one is overloaded. In our example, both the A0→B0 and C0→B0 flows end up getting hashed to Spine1 while Spine2 is sitting idle.
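To make the collision concrete, here is a toy model of ECMP path selection: hash the flow’s headers, pick an uplink by modulo, and never look at link load. The addresses, ports, and hash function are made up for illustration; real switches use vendor-specific hashes.

```python
# Toy model of ECMP: the switch hashes the flow's headers and picks an uplink by
# modulo, with no knowledge of link load. Addresses, ports, and the hash function
# are made up for illustration; real switches use vendor-specific hashes.
import zlib

UPLINKS = ["Spine1", "Spine2"]

def ecmp_pick(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> str:
    five_tuple = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|UDP".encode()
    return UPLINKS[zlib.crc32(five_tuple) % len(UPLINKS)]

# Both gradient flows target B0; RoCEv2 traffic shares the same UDP destination
# port (4791), so there is little header entropy for the hash to work with.
flows = {
    "A0 -> B0": ("10.0.1.10", "10.0.2.10", 4791, 4791),
    "C0 -> B0": ("10.0.3.10", "10.0.2.10", 4791, 4791),
}

for name, tup in flows.items():
    print(f"{name}: hashed to {ecmp_pick(*tup)}")
# With an unlucky hash, both 6.7 GB flows land on the same spine while the other sits idle.
```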
Forced Pause
Things start to fail at the ToR switch of Node B. When both A0 and C0 push their 6.7GB data at the same time, the egress queue fills up fast. That triggers PFC, which sends pause signals back to both senders. So both A0 and C0 get paused, even though another path through Spine2 is completely free.
Slow to respond
The problem with DCQCN is that it is reactive. Switches mark packets with Explicit Congestion Notification (ECN) when queues cross a certain threshold. When the receiver gets these ECN-marked packets, it sends Congestion Notification Packets (CNPs) back to the senders to slow them down. But with bursty AI traffic, that whole process takes too long. By the time DCQCN sends the CNPs, PFC has already triggered and the senders are halted.
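Some rough numbers show why the feedback loop loses the race. The PFC threshold, RTT, and CNP interval below are assumptions chosen for illustration, not measurements from the paper.

```python
# Rough timing comparison: how fast the congested queue crosses its PFC threshold
# during an AllReduce burst versus how long the ECN -> CNP -> rate-cut loop takes.
# The threshold, RTT, and CNP interval below are illustrative assumptions.

senders = 2                          # A0 and C0 both target B0
nic_rate_bps = 400e9                 # each sender drives a 400G NIC
egress_rate_bps = 400e9              # single 400G egress toward Node B
pfc_threshold_bytes = 1e6            # assume ~1 MB per-queue threshold on a shared-buffer switch

net_fill_bps = senders * nic_rate_bps - egress_rate_bps
time_to_pfc = pfc_threshold_bytes * 8 / net_fill_bps
print(f"Queue hits the PFC threshold after ~{time_to_pfc * 1e6:.0f} us")   # ~20 us

rtt = 10e-6                          # assume ~10 us fabric round trip
cnp_interval = 50e-6                 # assume CNPs are generated at ~50 us granularity
dcqcn_reaction = rtt + cnp_interval
print(f"First DCQCN rate cut lands after ~{dcqcn_reaction * 1e6:.0f} us")  # ~60 us
# PFC fires and pauses the senders long before the congestion-control loop reacts.
```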
Example

Figure 2 shows exactly how this plays out in practice:
- t0: A0 and C0 both start sending to B0 at the same time
- t1: ECMP routes both flows through the same spine uplink
- t2: The ToR2 queue to Node B fills up completely
- t3: PFC is triggered, pausing A0 and C0
- t4: The ToR2 queue drains, the switch lifts the pause, and A0 and C0 can finally resume sending data
The slowdown is significant. Instead of wrapping up quickly, the gradient exchange gets stuck waiting for the network to recover.
Companies like Meta have given up on DCQCN for AI workloads. They disable congestion control entirely and try to manage everything at the application level. But that can cause problems. Without network congestion control, things can quickly spiral into deadlocks and unfair traffic behavior.
So if traditional congestion control doesn’t work, what’s the alternative? That’s where UCCL comes in, offering a new approach designed specifically for AI training traffic.
How UCCL Rethinks GPU Communication for AI Training
UCCL introduces a smarter software-defined transport layer that fits between NCCL, the collective communication library, and the NIC driver. This setup is important because it gives UCCL full visibility into the communication pattern before any traffic reaches the network. With that early insight, it can make smarter decisions about how data should flow.
In our example with 24 H100 GPUs spread across three nodes, the typical RDMA stack kicks in immediately. A0 and C0 each begin sending 6.7 gigabytes of data to B0. These transfers rely on hardware-level Queue Pairs, and when both get mapped to the same uplink due to an ECMP hash collision, congestion builds up fast. The switch queue fills, PFC triggers, and both senders are forced to stop. DCQCN tries to help, but it reacts too late. The queue is already backed up, and everything stalls.
UCCL handles this differently.
Receiver Control
UCCL gives the receiver control over incoming traffic, based on how full its queues are. In our example, B0 decides when A0 and C0 are allowed to send. This kind of control helps spread out the transfers instead of letting everything hit at once. By pacing the traffic at the receiver, UCCL keeps queues from overflowing and avoids network congestion.
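A minimal sketch of the idea, not UCCL’s actual API: the receiver hands out credits sized to its free buffer space, and each sender transmits only the chunks it currently holds credits for. Chunk sizes and counts are illustrative.

```python
# Minimal sketch of receiver-driven pacing (not UCCL's actual API): the receiver
# grants credits sized to its free buffer space, and each sender transmits only
# the chunks it holds credits for. Names and sizes here are illustrative.
from collections import deque

class Receiver:
    def __init__(self, buffer_chunks: int):
        self.free = buffer_chunks            # how many chunks of buffer are free

    def grant(self, requested: int) -> int:
        """Hand out at most as many credits as there is free buffer space."""
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def drain(self, chunks: int):
        """Buffer space freed as received data is copied into GPU memory."""
        self.free += chunks

class Sender:
    def __init__(self, name: str, total_chunks: int):
        self.name = name
        self.pending = deque(range(total_chunks))   # staged in software, not on the NIC

    def send(self, credits: int) -> str:
        sent = [self.pending.popleft() for _ in range(min(credits, len(self.pending)))]
        return f"{self.name} sends chunks {sent[0]}..{sent[-1]}" if sent else f"{self.name} idle"

b0 = Receiver(buffer_chunks=8)
a0, c0 = Sender("A0", 16), Sender("C0", 16)        # a handful of chunks each, for illustration

# One pacing round: B0 splits its free buffer between the two senders.
for s in (a0, c0):
    print(s.send(b0.grant(requested=4)))
b0.drain(8)                                        # data consumed; credits replenished next round
```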
Software Queues
It moves away from assigning dedicated hardware queue pairs to each communication flow. Instead, it maintains virtual queues in software, allowing it to stage transfers until the receiver is ready. This approach gives the system precise control over how and when data is sent, which is particularly useful when multiple GPUs are pushing large messages at the same time.
Shared Queue Pair
Instead of one QP per GPU flow, UCCL uses a single QP per NIC. It keeps track of which GPU owns which message and sends them through that shared channel in an orderly way. This prevents ECMP from making blind decisions about routing and gives more consistent behavior.
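A rough sketch of the multiplexing idea. The chunk header layout here is invented for illustration and is not UCCL’s wire format; the point is that messages from different GPUs are tagged, interleaved over one shared channel per NIC, and reassembled at the receiver.

```python
# Rough sketch of multiplexing many GPU flows over one shared channel per NIC.
# The header layout is invented for illustration; it is not UCCL's wire format.
from dataclasses import dataclass
from itertools import zip_longest

@dataclass
class Chunk:
    flow_id: int       # which GPU/message this chunk belongs to
    seq: int           # position within that message
    payload: bytes

def multiplex(flows: dict[int, list[bytes]]) -> list[Chunk]:
    """Interleave chunks from all flows round-robin onto the single shared channel."""
    per_flow = [[Chunk(fid, i, p) for i, p in enumerate(chunks)]
                for fid, chunks in flows.items()]
    wire = []
    for round_ in zip_longest(*per_flow):
        wire.extend(c for c in round_ if c is not None)
    return wire

def demultiplex(wire: list[Chunk]) -> dict[int, bytes]:
    """Receiver reassembles each GPU's message from the tagged chunks."""
    out: dict[int, list[Chunk]] = {}
    for c in wire:
        out.setdefault(c.flow_id, []).append(c)
    return {fid: b"".join(c.payload for c in sorted(cs, key=lambda c: c.seq))
            for fid, cs in out.items()}

flows = {0: [b"grad-A-0", b"grad-A-1"], 1: [b"grad-C-0"]}   # two GPU flows sharing one channel
wire = multiplex(flows)
print([f"flow{c.flow_id}#{c.seq}" for c in wire])            # interleaved in order
print(demultiplex(wire))                                     # reassembled per flow
```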
Smart Scheduling
UCCL ties everything together by understanding the application’s intent. Because UCCL plugs into NCCL, it knows when AllReduce is starting, how much data is about to be sent, and who is involved. Knowing that in advance lets it plan the data flow so things don’t collide or pile up in the queues.
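As a toy illustration of that planning step, not UCCL’s actual algorithm: given every sender, size, and destination in the upcoming AllReduce, a transport can assign flows to different spine paths and stagger the ones that converge on the same receiver.

```python
# Toy scheduler illustrating planning with full knowledge of the collective
# (not UCCL's actual algorithm). Knowing every flow in the upcoming AllReduce,
# it spreads flows across spine paths and staggers those that share a receiver.

PATHS = ["Spine1", "Spine2"]
LINK_GBPS = 400

def plan(flows):
    """flows: list of (src, dst, gigabytes). Returns (src, dst, path, start_time_s)."""
    path_load = {p: 0.0 for p in PATHS}        # seconds of traffic already on each path
    dst_busy_until = {}                        # when each receiver's NIC frees up
    schedule = []
    for src, dst, gb in sorted(flows, key=lambda f: -f[2]):    # biggest flows first
        path = min(PATHS, key=lambda p: path_load[p])          # least-loaded path
        duration = gb * 8 / LINK_GBPS
        start = dst_busy_until.get(dst, 0.0)                   # stagger flows into one receiver
        schedule.append((src, dst, path, start))
        path_load[path] += duration
        dst_busy_until[dst] = start + duration
    return schedule

allreduce_step = [("A0", "B0", 6.7), ("C0", "B0", 6.7)]
for src, dst, path, start in plan(allreduce_step):
    print(f"{src} -> {dst} via {path}, start at t={start * 1e3:.0f} ms")
```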
Prevents Congestion
Instead of waiting for congestion to happen and then trying to fix it, UCCL steps in early and prevents the problem altogether. That matters a lot for AI training, where large bursts of data hit the network all at once.
Evaluation
UCCL has proven effective in real-world tests on GPU clusters running standard training jobs. AllReduce operations finished more quickly, queues stayed healthy, and network slowdowns caused by congestion dropped noticeably. The key is that UCCL decides who can send data and when, before anything hits the wire, which helps keep the network steady even when traffic spikes. These improvements showed up across different environments, from high-end H100 nodes to more modest T4 setups, and in every case UCCL outperformed standard NCCL.
One of the biggest advantages is that UCCL works with the network gear teams already have. There is no need to rip out existing infrastructure or upgrade switches just to get these benefits. That makes it much easier to adopt in production without major disruption.
Conclusion
Modern AI training is not just about fast GPUs. It is about getting them to work together across a network without stepping on each other’s toes. Traditional congestion control systems like DCQCN and PFC were not built for the kind of synchronized, high-volume bursts deep learning generates.
UCCL takes a more practical approach. Instead of waiting for congestion to happen and then trying to fix it, it puts the receiver in charge of when each GPU can send. It holds messages in software queues and uses its understanding of collective operations to schedule things in a way that avoids pileups. Most importantly, it runs on the network infrastructure teams already have, with no changes needed.
Reference
https://arxiv.org/pdf/2504.17307