Timing Series: Part 5 of 7

Previous: Pipelining Without Breaking Your Protocol


Friday Afternoon#

Friday afternoon. Timing met with 200 ps margin. I merged a teammate’s “small” feature branch: an extra output mux, some debug registers, maybe 2000 LUTs.

Monday morning: 847 failing paths. Six-hour build times. Demo on Thursday.

I hadn’t just added logic. I’d crossed the congestion cliff.

Before merge:                    After merge:
LUT utilization: 89%             LUT utilization: 93%
WNS: +0.200 ns                   WNS: -0.847 ns
Build time: 1.5 hours            Build time: 6.2 hours
Failing paths: 0                 Failing paths: 847

The utilization report said I was fine. 93% fits. But the router disagreed. Every wire fought for routing tracks. Critical paths detoured through congested regions. Placement became a puzzle with no good solutions.

You need to read the reports that matter, recognize the patterns that predict trouble, and know when to optimize: before you hit the cliff, not after.


Resource Smells#

Experienced engineers recognize these patterns in utilization reports. They rarely write them down:

| Pattern | What It Means | Likely Problem |
| --- | --- | --- |
| LUTs » FFs (e.g., 85% vs 25%) | Logic-heavy, under-pipelined | Long combinational paths, timing pressure |
| FFs » LUTs (e.g., 60% vs 20%) | Over-pipelined or shift-register heavy | Probably fine, but check for wasted pipeline stages |
| DSPs at 0% but LUTs high | Multiplies in fabric | LUT explosion; force DSP inference |
| BRAM at 100%, LUTs low | Memory-bound design | Consider URAM, external memory, or algorithmic changes |
| Build time doubles with 5% more logic | Congestion cliff | You’re at the edge; optimize or upsize device |
| Same paths fail with different seeds | Placement-sensitive design | Fragile timing; need headroom, not constraints |

When you see these patterns, don’t wait for timing to fail. Act while you have options.


What You’re Actually Buying#

An FPGA isn’t a homogeneous sea of logic. It’s a grid of heterogeneous resources arranged in columns:

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ CLB │ CLB │BRAM │ CLB │ DSP │ CLB │ CLB │BRAM │ CLB │ CLB │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ CLB │ CLB │BRAM │ CLB │ DSP │ CLB │ CLB │BRAM │ CLB │ CLB │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ CLB │ CLB │BRAM │ CLB │ DSP │ CLB │ CLB │BRAM │ CLB │ CLB │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
         │           │                    │
         └── BRAM    └── DSP              └── BRAM
             Column      Column               Column

CLBs (Configurable Logic Blocks) contain LUTs and registers. BRAM columns hold block RAM. DSP columns hold DSP48 slices. The resources aren’t interchangeable, and their locations are fixed.

This matters because:

  1. DSPs are location-locked. A multiply in your datapath must route to a DSP column. If your logic is on the far side of the device, that’s a long wire.

  2. BRAMs cluster. A memory-heavy design pulls logic toward BRAM columns. If you have two memory-heavy modules, they fight for the same columns.

  3. CLBs fill from the inside out. High utilization means logic gets pushed to the edges, making routes longer.

The utilization percentage is a scalar. The reality is spatial. Two designs at 80% LUT can have completely different timing outcomes depending on how that 80% is distributed.


LUTs: The Universal Soldier#

A LUT (Look-Up Table) is a truth table in silicon. A 6-input LUT (LUT6) stores 64 bits: one for each combination of six inputs. Any combinational function of six or fewer inputs fits in one LUT.

// One LUT6: a single function of six inputs
assign y6 = (a & b) | (c ^ d) | (e & ~f);

// Seven inputs no longer fit in one LUT6:
assign y7 = a ? (b ? c : d) : (e ? f : g);  // 7 inputs = 2 LUTs

Modern FPGAs use fracturable LUTs. One LUT6 can implement a single six-input function with one output, or two LUT5 functions with separate outputs that share the same five inputs. The tools handle this automatically, but it means LUT utilization isn’t always intuitive. Sometimes adding logic costs nothing because it fits in the unused half of an existing LUT.

LUTs as Memory#

LUTs can also function as distributed RAM or shift registers:

Distributed RAM: Small, fast memories implemented in LUTs. A 64×1 RAM uses one LUT. Good for register files, small FIFOs, and CAMs.
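A minimal sketch of distributed RAM inference, assuming Vivado synthesis. The module name `regfile_dist` and its ports are hypothetical, and `ram_style` is a Vivado synthesis attribute (other tools use different names). The combinational read is what steers the memory toward LUTs, since block RAM reads are always registered.

```systemverilog
// Hypothetical 64x8 register file intended to map to LUT RAM.
module regfile_dist #(
    parameter int DEPTH = 64,
    parameter int WIDTH = 8
) (
    input  logic                     clk,
    input  logic                     we,
    input  logic [$clog2(DEPTH)-1:0] waddr,
    input  logic [$clog2(DEPTH)-1:0] raddr,
    input  logic [WIDTH-1:0]         din,
    output logic [WIDTH-1:0]         dout
);
    // Vivado attribute requesting LUT-based RAM instead of block RAM.
    (* ram_style = "distributed" *) logic [WIDTH-1:0] mem [0:DEPTH-1];

    // Synchronous write.
    always_ff @(posedge clk)
        if (we)
            mem[waddr] <= din;

    // Asynchronous read: combinational, no output register.
    assign dout = mem[raddr];
endmodule
```

If timing on the read path is tight, registering `dout` in the surrounding logic costs only flip-flops.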

Shift Registers (SRL): A LUT configured as a 32-deep shift register. SRL32 uses one LUT for 32 bits of delay. No routing between stages. The shift happens inside the LUT.

// Infers SRL32 - one LUT for 32 cycles of delay (1-bit wide)
logic [31:0] delay_line;  // 32-stage shift register, 1-bit wide
always_ff @(posedge clk)
    delay_line <= {delay_line[30:0], data_in};
assign data_out = delay_line[31];

When to use SRL vs. BRAM:

| Depth | Width | Recommendation |
| --- | --- | --- |
| ≤ 32 | Any | SRL (one LUT per bit) |
| 33-64 | Any | Two SRLs cascaded |
| > 64 | > 8 | BRAM |
| > 64 | ≤ 8 | SRL cascade may still win |

The crossover depends on your LUT budget. If you’re LUT-constrained, use BRAM earlier. If you’re BRAM-constrained, cascade SRLs deeper.
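One way to keep that choice adjustable is a parameterized delay line, sketched below under the assumption of Vivado synthesis. The module name `delay_line` and its ports are hypothetical. `shreg_extract` is a Vivado attribute that allows or forbids SRL extraction. The shift stages deliberately have no reset, because a reset would prevent SRL mapping.

```systemverilog
// Hypothetical fixed-depth delay line. No reset on the shift stages,
// because SRL primitives cannot be reset.
module delay_line #(
    parameter int DEPTH = 32,   // cycles of delay, must be >= 1
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             en,
    input  logic [WIDTH-1:0] din,
    output logic [WIDTH-1:0] dout
);
    // Vivado attribute: "yes" allows SRL extraction, "no" forces flip-flops.
    (* shreg_extract = "yes" *) logic [WIDTH-1:0] taps [0:DEPTH-1];

    always_ff @(posedge clk) begin
        if (en) begin
            taps[0] <= din;
            for (int i = 1; i < DEPTH; i++)
                taps[i] <= taps[i-1];
        end
    end

    // Output is the input from DEPTH enabled cycles earlier.
    assign dout = taps[DEPTH-1];
endmodule
```

With `DEPTH = 32` and `WIDTH = 8`, the table above suggests roughly one LUT per bit if the tool maps each bit to an SRL32; check the utilization report to confirm what was actually inferred.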


Registers: The “Free” Resource#

Every LUT in a CLB has flip-flops attached. A Xilinx UltraScale CLB has 8 LUTs and 16 flip-flops. You get roughly 2 FFs per LUT, and they’re often underutilized.

Look at that utilization report again:

| CLB LUTs       | 142847 | 152064 |  93.94% |
| CLB Registers  |  98234 | 304128 |  32.30% |

LUTs at 94%, registers at 32%. This is typical for logic-heavy designs. The registers are “free” in the sense that they’re already there. Using more FFs doesn’t cost LUTs.

This is why pipelining is cheap. Adding a register stage costs FFs, not LUTs. The logic between pipeline stages costs LUTs. When you pipeline a design:

// Unpipelined: long combinational path
assign result = complex_function(a, b, c, d, e, f, g, h);

// Pipelined: same LUTs, more FFs, shorter paths (result arrives two cycles later)
always_ff @(posedge clk) begin
    stage1 <= partial_function(a, b, c, d);  // first register level:
    stage2 <= partial_function(e, f, g, h);  // both halves computed in parallel
    result <= combine(stage1, stage2);       // second register level
end

The LUT count stays roughly the same. The FF count increases. Timing improves because each path is shorter.
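As a concrete instance of the same idea, here is a small sketch that sums four operands in two register levels. The module name `sum4_pipelined` and the 16-bit operand width are illustrative choices, not taken from the design in this article.

```systemverilog
// Hypothetical example: sum four 16-bit operands in two register stages.
module sum4_pipelined (
    input  logic        clk,
    input  logic [15:0] a, b, c, d,
    output logic [17:0] sum      // 2 extra bits hold the full 4-operand sum
);
    logic [16:0] sum_ab, sum_cd; // 17 bits keep the carry of each pair

    always_ff @(posedge clk) begin
        // Stage 1: two independent adds, each registered.
        sum_ab <= a + b;
        sum_cd <= c + d;
        // Stage 2: combine the registered partial sums.
        sum    <= sum_ab + sum_cd;
    end
endmodule
```

The three adders exist in both the pipelined and unpipelined versions. The pipelined one adds 34 flip-flops for the partial sums and cuts the longest path from two adders in series to one.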

Register replication is another “free” trick. When a register drives high fanout, the tools can duplicate it:

// One register driving 500 loads - long routes, high delay
logic broadcast_reg;

// Tools replicate to reduce fanout - more FFs, shorter routes
(* max_fanout = 50 *) logic broadcast_reg;

The duplicate registers cost FFs, which you have in abundance. The reduced fanout improves timing.

The exception: control sets. Each unique combination of clock, enable, and reset creates a control set. Xilinx CLBs have limited control sets per slice. Excessive variety forces suboptimal packing and can make “free” registers expensive.
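One possible mitigation, sketched under the assumption that your Vivado version supports the `extract_enable` synthesis attribute: ask the tool to build a rarely shared enable as logic in front of the D input instead of using the flip-flop's CE pin. The module `status_reg` and the signal `rare_en` are hypothetical.

```systemverilog
// Hypothetical status register whose enable is used by no other register.
module status_reg (
    input  logic       clk,
    input  logic       rare_en,   // enable unique to this register
    input  logic [7:0] d,
    output logic [7:0] q
);
    // Vivado attribute (assumed available): keep the enable off the CE pin
    // and implement the hold behavior as a LUT mux on the D input.
    (* extract_enable = "no" *) logic [7:0] q_r;

    always_ff @(posedge clk)
        if (rare_en)
            q_r <= d;

    assign q = q_r;
endmodule
```

This trades a few LUTs for one fewer control set, so it only pays off when control-set count, not LUT count, is what blocks packing.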


BRAM: The Memory Question#

Block RAM (BRAM) is dedicated memory silicon. A BRAM36 provides 36 Kb of storage, configurable in various aspect ratios:

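| Configuration | Depth | Width | Ports |
| --- | --- | --- | --- |
| 32K × 1 | 32768 | 1 | True dual-port |
| 16K × 2 | 16384 | 2 | True dual-port |
| 4K × 9 | 4096 | 9 (8+parity) | True dual-port |
| 2K × 18 | 2048 | 18 | True dual-port |
| 1K × 36 | 1024 | 36 | True dual-port |
| 512 × 72 | 512 | 72 | Simple dual-port* |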

*Simple dual-port: one dedicated read port and one dedicated write port, versus true dual-port where both ports can read or write independently.

A BRAM36 can also split into two independent BRAM18s.

When BRAM, When Distributed?#

| Memory Size | Recommendation | Reason |
| --- | --- | --- |
| < 64 bits total | Registers | Don’t waste a BRAM |
| < 256 × 8 | Distributed RAM | LUT-based, fast, close to logic |
| ≥ 256 × 8 | BRAM | Dedicated silicon, no LUT cost |
| > 36 Kb | Cascade BRAMs | Or use UltraRAM if available |

UltraRAM (URAM): On UltraScale+ devices, UltraRAM provides 288 Kb blocks, 8× larger than BRAM36. Use URAM for large buffers, deep FIFOs, or lookup tables where BRAM count would otherwise dominate. URAM is column-locked like BRAM, but with fewer columns, so placement matters even more.

The threshold isn’t exact. If you’re LUT-constrained, push more memories into BRAM. If you’re BRAM-constrained, push small memories into distributed RAM.

Common BRAM inference pattern:

// Infers BRAM with read-first behavior
logic [31:0] mem [0:1023];
logic [31:0] dout;

always_ff @(posedge clk) begin
    if (we)
        mem[addr] <= din;
    dout <= mem[addr];  // Read-first: reads old value on write
end

Read-first vs. write-first matters. Read-first reads the old value when writing to the same address. Write-first reads the new value. No-change holds the output. Get this wrong and your simulation won’t match hardware.
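For comparison with the read-first template above, here is a sketch of the other two behaviors, following the commonly used inference coding styles. The module name `spram_modes` and the `WRITE_FIRST` parameter are hypothetical; whether a given tool maps each style to the matching BRAM mode is an assumption to verify in the synthesis log.

```systemverilog
// Hypothetical single-port RAM with a selectable same-address policy.
module spram_modes #(
    parameter int DEPTH       = 1024,
    parameter int WIDTH       = 32,
    parameter bit WRITE_FIRST = 1'b1   // 1: write-first, 0: no-change
) (
    input  logic                     clk,
    input  logic                     we,
    input  logic [$clog2(DEPTH)-1:0] addr,
    input  logic [WIDTH-1:0]         din,
    output logic [WIDTH-1:0]         dout
);
    logic [WIDTH-1:0] mem [0:DEPTH-1];

    if (WRITE_FIRST) begin : g_write_first
        always_ff @(posedge clk) begin
            if (we) begin
                mem[addr] <= din;
                dout      <= din;      // output shows the value being written
            end else begin
                dout <= mem[addr];
            end
        end
    end else begin : g_no_change
        always_ff @(posedge clk) begin
            if (we)
                mem[addr] <= din;      // dout keeps its previous value
            else
                dout <= mem[addr];
        end
    end
endmodule
```

Simulating the same RTL that you synthesize, rather than a separate behavioral model, is the simplest way to keep the two in agreement.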


DSP: The Multiplier Trap#

A DSP48E2 (UltraScale) is a hardened arithmetic unit:

        ┌─────────────────────────────────────┐
A ──────┤                                     │
        │   Pre-adder ─► Multiplier ─► ALU    ├──► P (48-bit)
B ──────┤       (D±A)      (27×18)    (+/−)   │
        │                                     │
C ──────┤────────────────────────────────────►│
D ──────┤                                     │
        └─────────────────────────────────────┘

One DSP48 can do:

  • 27×18 signed multiply
  • 48-bit add/subtract/accumulate
  • Pre-add (D ± A) before multiply
  • Pattern detect (for saturation, convergent rounding)

All in one cycle, with optional pipeline registers at each stage.

The fabric alternative is expensive:

| Operation | DSP Cost | Fabric Cost (LUTs) |
| --- | --- | --- |
| 18×18 multiply | 1 DSP | ~300 LUTs |
| 27×18 multiply | 1 DSP | ~500 LUTs |
| 48-bit add | 0 (use ALU) | ~50 LUTs |
| MAC (multiply-accumulate) | 1 DSP | ~400 LUTs |

If you run out of DSPs and the tools spill multiplies to fabric, your LUT count explodes. A design that “fits” on paper suddenly consumes 40% more LUTs than expected.

DSP inference pattern:

// Infers one DSP48 with internal pipeline
logic signed [26:0] a;
logic signed [17:0] b;
logic signed [47:0] p;

always_ff @(posedge clk) begin
    p <= a * b;  // Registered product; the register can pack into the DSP's output stage
end
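The pre-adder and accumulator listed above can be targeted with a similar pattern. This is a sketch, not a guaranteed mapping: the module name `preadd_mac` and its ports are hypothetical, the pre-adder inputs are kept to 26 bits so their sum fits the 27-bit multiplier input, and whether everything packs into a single DSP48E2 depends on the tool and should be checked in the utilization report.

```systemverilog
// Hypothetical pre-add, multiply, accumulate sized for a DSP48E2.
module preadd_mac (
    input  logic               clk,
    input  logic               clear,  // synchronous clear of the accumulator
    input  logic signed [25:0] a,      // 26-bit inputs so that (d + a)
    input  logic signed [25:0] d,      // fits in 27 bits
    input  logic signed [17:0] b,
    output logic signed [47:0] acc
);
    logic signed [26:0] pre;   // (d + a), registered
    logic signed [17:0] b_q;   // b delayed one cycle to stay aligned with pre
    logic signed [44:0] prod;  // 27 x 18 product, registered

    always_ff @(posedge clk) begin
        pre  <= d + a;
        b_q  <= b;
        prod <= pre * b_q;
        if (clear)
            acc <= '0;
        else
            acc <= acc + prod;  // all operands signed, so prod is sign-extended
    end
endmodule
```

The three register levels mirror the optional pipeline stages inside the slice, which is what makes the pattern easy for synthesis to absorb.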

Fabric multiply (avoid if possible):

// Forces fabric implementation - LUT explosion
(* use_dsp = "no" *) logic signed [47:0] p;
always_ff @(posedge clk) begin
    p <= a * b;
end

DSPs are column-locked. If your multipliers are in one module and the data sources are across the device, routing becomes the bottleneck. Place DSP-heavy logic near DSP columns, or let the tools floorplan for you.


The Utilization Cliff#

The relationship between utilization and timing is non-linear:

Timing Margin
 100%│ ─────────────────────┐
     │                      │
  75%│                      └───────┐
     │                              │
  50%│                              └─────┐
     │                                    │
  25%│                                    └───────┐
     │                                            │
   0%│────────────────────────────────────────────└──
     └────────────────────────────────────────────────
       0%    50%    70%    85%    95%    100%
                    LUT Utilization

| Utilization | Experience |
| --- | --- |
| < 70% | Comfortable. Timing closes easily. Incremental builds work. |
| 70-85% | Watch closely. Timing becomes sensitive to placement. Some builds fail. |
| 85-95% | Pain. Long compile times. Fragile timing. Incremental fails often. |
| > 95% | War. Hours of routing. Many failing paths. Design may not close. |

The cliff isn’t exactly at these numbers. It depends on your device, clock frequency, and design structure. But the pattern holds: timing degrades gradually until you hit a threshold, then collapses.

Why does this happen?

  1. Routing congestion. At high utilization, wires compete for routing tracks. The router takes longer paths, adding delay.

  2. Placement pressure. The placer can’t keep related logic together. Fanout routes get longer.

  3. Feedback loops. Longer routes mean more delay. More delay means the placer tries different arrangements. Different arrangements may have even longer routes.


When Congestion Hits Your Timing Report#

The utilization report won’t warn you. But the timing report will, if you know what to look for.

Route-dominated paths:

  Location             Delay type                Incr(ns)  Path(ns)
  -------------------------------------------------------------------
  SLICE_X47Y120        net (fanout=1, routed)     0.832     4.123
  SLICE_X89Y156        net (fanout=1, routed)     1.247     6.891
                                                  ^^^^^
                                          Route delay >> logic delay

When route delays dwarf logic delays on critical paths, you have congestion. A path shouldn’t need 1.2 ns to route between two slices unless wires are fighting for tracks.

Unexpected clock region crossings:

report_timing -from [get_pins */clk] -to [get_pins */D] -max_paths 10

Path crosses clock regions: X2Y1 → X0Y3 → X2Y2 → X1Y4

If your critical path bounces across clock regions, the placer couldn’t keep logic together. This is a spatial problem, not a constraint problem.

Congestion reports:

# Vivado
report_design_analysis -congestion

Congestion Report
-----------------
Direction  Level  Regions
---------  -----  -------
North      5      X0Y2:X3Y2
East       4      X2Y1:X2Y4
Global     3      X1Y2:X2Y3    ← Critical paths likely route through here

# Quartus
report_routing_utilization
# Look for regions with >80% horizontal or vertical track usage

What congestion looks like in report_timing:

Slack (VIOLATED): -0.847ns
  Source: processing/stage2/data_reg[15]/C
  Destination: processing/stage3/result_reg[15]/D

Data Path Delay:      4.847ns  (logic 1.234ns  route 3.613ns)
                                              ^^^^^^^^^^^^
                                              75% is routing!

When 70%+ of your path delay is routing, you’ve hit the wall. No constraint changes will help. You need fewer LUTs or a bigger device.


Case Study: From 93% to 53%#

Here’s how I recovered from Monday morning’s disaster. Starting point: 93% LUT, 847 failing paths, 6-hour builds.

Step 1: Find the hogs

report_utilization -hierarchical -hierarchical_depth 3
+--------------------------------+--------+-------+
| Instance                       | LUTs   | Util% |
+--------------------------------+--------+-------+
| top                            | 142847 | 93.9% |
|   packet_buffer                | 28400  | 18.7% | ← Distributed RAM
|   processing/correlator        | 31200  | 20.5% | ← Fabric multiplies
|   debug_mux                    | 8900   | 5.9%  | ← Debug logic
|   addr_decode (×4 instances)   | 6200   | 4.1%  | ← Duplicated
+--------------------------------+--------+-------+

Step 2: Apply targeted fixes

| Change | LUTs Before | LUTs After | Saved |
| --- | --- | --- | --- |
| Move packet_buffer from distributed RAM to BRAM | 28,400 | 2,100 | 26,300 |
| Replace fabric multiplies with DSP inference | 31,200 | 8,400 | 22,800 |
| Remove debug MUXes (gate behind DEBUG parameter) | 8,900 | 0 | 8,900 |
| Share address decoder across instances | 6,200 | 1,800 | 4,400 |

Total saved: 62,400 LUTs

Step 3: Results

Before:                          After:
LUT utilization: 93.9%           LUT utilization: 52.9%
WNS: -0.847 ns                   WNS: +0.892 ns
Build time: 6.2 hours            Build time: 38 minutes
Failing paths: 847               Failing paths: 0

I overshot. 53% gives room for the next feature. The packet buffer change was the biggest win: distributed RAM for a 4K×64 buffer was burning LUTs that BRAM handles for free.

The multiplier fix:

// Before: fabric multiply (tools didn't infer DSP due to width mismatch)
logic [31:0] a, b;
logic [63:0] product;
assign product = a * b;  // 32×32 = 4 DSPs, but tools used fabric

// After: explicit DSP-friendly width
logic signed [26:0] a_dsp;
logic signed [17:0] b_dsp;
logic signed [47:0] product;
always_ff @(posedge clk)
    product <= a_dsp * b_dsp;  // Infers 1 DSP

The original code used 32-bit operands. The tools would need 4 DSPs for a full 32×32 multiply, so they fell back to fabric. Reducing to 27×18 (which fit my actual data range) gave me 1 DSP per multiply.
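The debug-mux change from Step 2 can be sketched with a generate block. The module `stage_with_debug`, its ports, and the placeholder datapath are hypothetical stand-ins, not the actual design. The point is only that the debug mux is elaborated when `DEBUG` is set and does not exist otherwise.

```systemverilog
// Hypothetical wrapper: debug readback exists only when DEBUG is set.
module stage_with_debug #(
    parameter bit DEBUG = 1'b0
) (
    input  logic        clk,
    input  logic [31:0] data_in,
    input  logic [1:0]  dbg_sel,
    output logic [31:0] data_out,
    output logic [31:0] dbg_out
);
    logic [31:0] stage_a, stage_b;

    // Placeholder datapath standing in for the real processing logic.
    always_ff @(posedge clk) begin
        stage_a  <= data_in + 32'd1;
        stage_b  <= stage_a ^ {stage_a[15:0], stage_a[31:16]};
        data_out <= stage_b;
    end

    if (DEBUG) begin : g_debug
        // Debug mux and its register are built only in debug builds.
        always_ff @(posedge clk) begin
            unique case (dbg_sel)
                2'd0: dbg_out <= data_in;
                2'd1: dbg_out <= stage_a;
                2'd2: dbg_out <= stage_b;
                2'd3: dbg_out <= data_out;
            endcase
        end
    end else begin : g_no_debug
        // No debug logic is generated; the port is tied off.
        assign dbg_out = '0;
    end
endmodule
```

Keeping the port list identical in both builds means the parent module does not change when debug is switched off.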


Quick Wins Checklist#

Before upsizing the device, try these:

Memory:

  • Any distributed RAM > 256×8? Move to BRAM.
  • Any BRAM < 256×8? Move to distributed RAM.
  • Deep shift registers (> 64 deep and > 8 bits wide)? Consider a BRAM-based delay line.

Arithmetic:

  • DSP utilization low but LUTs high? Check for fabric multiplies.
  • Multiplier operands wider than 27×18? Can you reduce precision?
  • Multiply chains? Restructure for DSP cascade.

Debug/Development:

  • Debug MUXes and ILA still instantiated? Gate behind parameter.
  • Assertion logic synthesized in? Use translate_off.
  • Unused module outputs? Remove dead logic.

Structure:

  • Same logic instantiated multiple times? Share it.
  • Large one-hot state machines? Consider binary encoding.
  • Wide muxes (>16:1)? Add pipeline stage or restructure.

Timing:

  • Route-dominated critical paths? Need fewer LUTs, not constraints.
  • Cross-clock-region paths? Consider floorplanning.
  • High fanout nets? Add max_fanout attribute.
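For the wide-mux item in the checklist above, one restructuring is to split the mux into two registered levels. The sketch below assumes a 64:1 mux of byte-wide inputs; the module name `mux64_pipelined` and its ports are hypothetical, and the extra cycle of latency has to be acceptable to the surrounding protocol.

```systemverilog
// Hypothetical 64:1 mux split into two registered 8:1 levels.
module mux64_pipelined #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic [WIDTH-1:0] din [0:63],
    input  logic [5:0]       sel,
    output logic [WIDTH-1:0] dout      // valid two cycles after din/sel
);
    logic [WIDTH-1:0] grp [0:7];   // outputs of the eight first-level muxes
    logic [2:0]       sel_hi_q;    // upper select bits, delayed one cycle

    always_ff @(posedge clk) begin
        // Level 1: sel[2:0] picks one input from each group of eight.
        for (int g = 0; g < 8; g++)
            grp[g] <= din[g*8 + sel[2:0]];
        sel_hi_q <= sel[5:3];

        // Level 2: the delayed upper bits pick the winning group.
        dout <= grp[sel_hi_q];
    end
endmodule
```

Each level is an 8:1 mux per bit, so no single path has to traverse the full 64:1 selection tree in one cycle.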

Reading the Report#

The utilization report has layers. The summary hides important details.

Hierarchical breakdown:

report_utilization -hierarchical -hierarchical_depth 3
+--------------------------------+--------+-------+--------+
| Instance                       | LUTs   | FFs   | BRAM   |
+--------------------------------+--------+-------+--------+
| top                            | 142847 | 98234 | 289    |
|   eth_rx                       | 12340  | 8920  | 16     |
|   eth_tx                       | 11280  | 7650  | 12     |
|   processing                   | 98000  | 65000 | 240    |
|     stage1                     | 24500  | 16250 | 60     |
|     stage2                     | 24500  | 16250 | 60     |
|     stage3                     | 24500  | 16250 | 60     |
|     stage4                     | 24500  | 16250 | 60     |
+--------------------------------+--------+-------+--------+

Now you know where the resources went. If processing uses 69% of your LUTs, that’s your optimization target.

Primitives vs. logical resources:

The report shows both. “LUTs” might mean:

  • LUT6 (6-input LUT as logic)
  • LUT5 (fractured half of LUT6)
  • LUTRAM (LUT configured as distributed RAM)
  • SRL (LUT configured as shift register)

Quartus equivalents:

# Resource usage by entity
report_resource_usage -hierarchy

# ALM utilization detail
report_fitter_resource_usage -resource alm

Estimating Before Synthesis#

Before you write RTL, estimate whether it fits:

| Resource | Estimation Rule |
| --- | --- |
| DSPs | Count your multipliers. Each 27×18 or smaller = 1 DSP. Larger = 2-4 DSPs. |
| BRAMs | For each memory, bits ÷ 36,000, rounded up = BRAM36 count. Sum the results. |
| LUTs | Everything else. State machines, muxes, comparators, control logic. |
| FFs | Pipeline depth × datapath width + control registers. Usually not the constraint. |

Example: Ethernet processing pipeline

4 × 18×18 multipliers        =  4 DSPs
2 × 4K×32 buffers            =  8 BRAM36 (each 4K×32 = 131,072 bits ÷ 36,864 ≈ 3.6 → 4 BRAM36)
1 × 16K×8 lookup table       =  4 BRAM36
16-stage pipeline, 256-bit   = ~4000 FFs
Control + datapath logic     = ~15000 LUTs (estimate 3× your intuition)

Rule of thumb: estimate LUTs at 3× your intuition. You always forget the muxes, the debug logic, the edge cases, and the tool overhead.
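The memory and register arithmetic in the example can be written down as a small simulation-only helper. This is a sketch: `budget_pkg` and its functions are hypothetical, not a vendor API. It uses the exact 36,864 bits per BRAM36 rather than the rounded 36,000, and it ignores aspect-ratio limits, so treat the result as a first-pass number.

```systemverilog
// Hypothetical helpers for first-pass budgeting; not a vendor API.
package budget_pkg;
    localparam int BRAM36_BITS = 36 * 1024;  // 36 Kb = 36,864 bits

    // BRAM36 blocks for one memory, rounded up.
    function automatic int bram36_count(int depth, int width);
        return (depth * width + BRAM36_BITS - 1) / BRAM36_BITS;
    endfunction

    // Pipeline flip-flops: one register per bit per stage.
    function automatic int pipeline_ffs(int stages, int width);
        return stages * width;
    endfunction
endpackage

module budget_example;
    import budget_pkg::*;
    initial begin
        $display("4K x 32 buffer : %0d BRAM36", bram36_count(4096, 32));
        $display("16K x 8 table  : %0d BRAM36", bram36_count(16384, 8));
        $display("16 x 256 pipe  : %0d FFs",    pipeline_ffs(16, 256));
    end
endmodule
```

Run in a simulator, this prints 4, 4, and 4096, matching the hand estimate above (two buffers at 4 BRAM36 each give the 8 in the example).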


Common Resource Traps#

| Trap | Symptom | Fix |
| --- | --- | --- |
| Inferred latch | Unexpected LUT+FF combo, simulation mismatch | Complete all branches in combinational always blocks |
| Distributed RAM instead of BRAM | LUT explosion, works in sim | Check memory size, ensure proper read/write templates |
| BRAM for tiny memory | BRAM wasted on 16×8 | Use registers or distributed RAM for small memories |
| Fabric multiply | LUT explosion, DSPs unused | Remove use_dsp = "no", check operand widths |
| SRL where you need reset | SRL doesn’t reset, FF does | Use registers if reset required |
| High fanout register | Long routes, timing fail | Add max_fanout attribute, let tools replicate |
| Uninitialized BRAM | Works in sim, fails in hardware | Add explicit initialization or reset sequence |

Inferred latch example:

// BAD: infers latch because else branch missing
always_comb begin
    if (sel)
        out = a;
    // no else: out must hold its old value, so a latch is inferred
end

// GOOD: all branches covered
always_comb begin
    if (sel)
        out = a;
    else
        out = b;
end
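An alternative that scales better to long if/else chains is to assign a default at the top of the block, so every path is covered by construction. The module `mux_default` and the `fallback` input are hypothetical names for illustration.

```systemverilog
// Hypothetical priority mux using a default assignment to avoid latches.
module mux_default (
    input  logic       sel_a, sel_b,
    input  logic [7:0] a, b, fallback,
    output logic [7:0] out
);
    always_comb begin
        out = fallback;      // default: every path now assigns out
        if (sel_a)
            out = a;
        else if (sel_b)
            out = b;
    end
endmodule
```

Adding another branch later cannot reintroduce a latch, because the default assignment always runs first.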

The Audit Checklist#

Before synthesis:

  • All memories sized and type decided (BRAM vs. distributed)
  • Multiplier count known, DSP budget checked
  • Pipeline depth estimated, register budget checked
  • Total LUT estimate < 70% of target device

After synthesis:

  • Check hierarchical utilization. Any surprise hogs?
  • Check DSP and BRAM inference. Expected counts?
  • Check for inferred latches (report_drc in Vivado)
  • Check for fabric multipliers when DSPs expected

After implementation:

  • Utilization under 85%? If not, consider larger device.
  • Congestion report clean? If not, floorplan or refactor.
  • Timing met with margin? If barely, you’re fragile.

When to stop and refactor:

  • LUT utilization > 90% and timing fails
  • Build times exceed 4 hours
  • Incremental builds fail repeatedly
  • Same paths fail with different placement seeds

Quick Reference#

Resource Summary#

| Resource | What It Is | When to Use | Watch Out For |
| --- | --- | --- | --- |
| LUT | 6-input truth table | Logic, small muxes, distributed RAM, SRL | Utilization > 85% kills timing |
| FF | Flip-flop | Pipelining, registers | Usually abundant; use freely |
| BRAM | 36 Kb memory block | Buffers > 256×8, lookup tables | Column-locked; overuse creates placement pressure |
| DSP | Multiply-accumulate | Any multiply ≥ 8×8 | Column-locked; fabric fallback is expensive |

Utilization Thresholds#

| LUT % | Status | Action |
| --- | --- | --- |
| < 70% | Comfortable | Continue normally |
| 70-85% | Caution | Monitor timing closely |
| 85-95% | Warning | Consider refactoring or larger device |
| > 95% | Critical | Design likely won’t close; refactor required |

Resource Smells (Quick Check)#

| Pattern | Meaning |
| --- | --- |
| LUTs » FFs | Under-pipelined, logic-heavy |
| DSPs = 0%, LUTs high | Fabric multiplies |
| BRAM = 100% | Memory-bound |
| Build time spikes | Congestion cliff |

Estimation Formulas#

BRAM count ≈ Σ (memory_bits / 36,000), rounded up per memory
DSP count  ≈ Σ multipliers (width ≤ 27×18 = 1 DSP each)
LUT count  ≈ 3 × your intuition
FF count   ≈ pipeline_depth × datapath_width + control

Key Commands#

Vivado:

report_utilization -hierarchical
report_design_analysis -congestion
report_timing_summary -delay_type min_max
# Check for route-dominated paths:
report_timing -max_paths 10 -nworst 1 -input_pins

Quartus:

report_resource_usage -hierarchy
report_fitter_resource_usage -resource alm
report_routing_utilization
report_ram_utilization

Resource budgeting isn’t about fitting. It’s about leaving room. Room for timing closure. Room for last-minute features. Room for the next engineer who inherits your design. A design at 70% utilization with margin is worth more than a design at 95% that barely closes.

The utilization report gives you one number. Reality is a spatial puzzle of competing resources, congested routes, and non-linear cliffs. Know what you’re buying, know where the cliffs are, and leave room to maneuver.


Timing Series#

  1. Your FPGA Lives a Lifetime While You Blink - Why timing satisfies or breaks
  2. Constraints: The Contract You Forgot to Sign - How to write constraints
  3. Understanding Timing Analysis - How to read timing reports
  4. Pipelining Without Breaking Your Protocol - How to fix violations
  5. Silicon Real Estate: Your Resource Budget - How to manage resources (you are here)
  6. CDC: Two Flip-Flops Are Not Magic - How to cross clock domains
  7. Resets: The Timing Event You Forgot - How to handle resets