Pipelining Without Breaking Your Protocol

Timing Series: Part 3 of 6

Previous: Understanding Timing Analysis

The Register That Broke Everything

You added a pipeline register to fix timing. The path closed. The simulation failed.

// Before: timing violation, functionally correct
assign data_out = transform(data_in);
assign valid_out = valid_in;

// After: timing clean, protocol broken
always_ff @(posedge clk) data_out <= transform(data_in);
assign valid_out = valid_in;  // Still combinational

You pipelined data_out but not valid_out. For one cycle, valid_out asserts while data_out holds stale data. Downstream captures garbage.

If you registered data, register valid in the same always_ff:

// Fixed: valid and data move together
always_ff @(posedge clk) begin
    if (rst) begin
        valid_out <= 1'b0;
    end else begin
        valid_out <= valid_in;
        data_out  <= transform(data_in);
    end
end

One cycle mismatch equals one beat of garbage. The other half of “pipelining broke my design” tickets? Backpressure. If you ignore it, you lose data.

Two Problems, Two Solutions

Pipelining a valid/ready interface requires solving two distinct problems:

Forward path timing: Break the combinational path from input to output (add registers)
Backpressure handling: Don’t lose data when downstream stalls (add storage)

These are complementary, not alternatives. A bypass skid buffer handles backpressure but doesn’t break timing. A naive register breaks timing but loses data under backpressure. A proper register slice does both.

The Handshake Contract

Every streaming interface you use runs on valid/ready. AXI-Stream just put a logo on it.

A transfer occurs when valid AND ready are both high on the same clock edge.

         ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐
clk      │   │   │   │   │   │   │   │   │   │   │   │
      ───┘   └───┘   └───┘   └───┘   └───┘   └───┘   └───
              _______________
valid    ____/               \___________________________

                      _______
ready    ____________/       \___________________________

data     ----<      A       >----------------------------
                          ^
                          └── Transfer occurs here (A captured)

A is presented while ready is low, so the source holds it stable. On the next edge valid and ready are both high, and A transfers. After that, valid drops and nothing is presented.

The producer controls valid and data. The consumer controls ready. Neither knows the other’s next-cycle behavior. This is where pipelining goes to die:

You cannot add latency to valid without adding the same latency to data.

You cannot add latency to ready without storage for the data already in flight.

And the invariant that protocol specs bury in fine print:

When valid is high and ready is low, the source must hold valid and data stable until the transfer completes.

Every pipelining strategy depends on this. Violate it and nothing works. Every slice and FIFO in this article assumes the upstream obeys “hold stable while stalled.” If it doesn’t, your protocol is already broken, and no amount of buffering will save you.

The Happy Path: When Stalls Don’t Exist

Before the complexity, here’s the simplest correct pipeline stage:

// Simplest correct pipeline stage
// ONLY works when downstream NEVER stalls (ready always high)
always_ff @(posedge clk) begin
    if (rst) begin
        valid_out <= 1'b0;
    end else begin
        valid_out <= valid_in;
        data_out  <= data_in;
    end
end

assign ready_out = 1'b1;  // Always accept

This works when ready_in is always high. Internal transform stages where you control both ends. Straight pipelines with no backpressure. Simple and correct.

The moment downstream can stall, everything changes.

When ready_in goes low while valid_out is high, you must hold data_out stable. But this register updates unconditionally every cycle. It overwrites data the downstream hasn’t accepted yet.

That’s where all the complexity comes from: storage for data that’s been sent but not yet received.

The Forward Path: The Bug Everyone Writes

Here’s the broken register everyone writes first:

// BROKEN: updates unconditionally, violates protocol
// Signal naming: *_upstream = toward source, *_downstream = toward sink
always_ff @(posedge clk) begin
    if (rst) begin
        valid_downstream <= 1'b0;
    end else begin
        valid_downstream <= valid_upstream;
        data_downstream  <= data_upstream;
    end
end
assign ready_upstream = ready_downstream;

This updates data_downstream every cycle, even when stalled. When ready_downstream is low and valid_downstream is high, you must hold data_downstream stable. You didn’t. You overwrote it with whatever garbage data_upstream had next.

The fix: gate updates on “output can accept”:

// CORRECT: single-entry register slice
wire accept = ready_downstream || !valid_downstream;

always_ff @(posedge clk) begin
    if (rst) begin
        valid_downstream <= 1'b0;
    end else if (accept) begin
        valid_downstream <= valid_upstream;
        if (valid_upstream) data_downstream <= data_upstream;
    end
end

assign ready_upstream = accept;

This is a single-entry slice. It breaks forward timing and handles backpressure correctly. But ready_upstream is still combinational: chain ten of these and ready becomes your critical path.

The Concrete Example: CRC Calculator

You have a CRC calculator with 4 LUT levels. Timing fails with -0.3 ns of slack on the path from input to CRC output.

You add a pipeline register after LUT level 2:

// Stage 1: first half of CRC calculation
always_ff @(posedge clk) begin
    crc_partial <= crc_stage1(data_in);
end

// Stage 2: second half
always_ff @(posedge clk) begin
    crc_out <= crc_stage2(crc_partial);
end

Timing closes. You ship. Production reports: every other packet has a bad CRC.

The problem: valid_in still reaches valid_out combinationally, while the CRC now takes two cycles. When valid_out asserts for a beat, crc_out still holds the result for the beat from two cycles earlier.

         ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐
clk      │   │   │   │   │   │   │   │   │   │
      ───┘   └───┘   └───┘   └───┘   └───┘   └───

valid_in ─────────┐ A ┌───┐ B ┌───┐ C ┌─────────
                      │       │       │
valid_out ────────────│───────│───────│───────── (still combinational!)
                      │       │       │
crc_partial ──────────┤partial│partial│─────────
                      │   A   │   B   │
crc_out   ────────────┴───────┴───────┴───────── (2-cycle latency)
                          ↑
                          └── valid_out points here, but crc_out still holds an EARLIER beat

The fix: Pipeline valid with the same latency as the data path:

// Valid pipeline matches data pipeline depth
logic valid_d1, valid_d2;

always_ff @(posedge clk) begin
    if (rst) begin
        valid_d1 <= 1'b0;
        valid_d2 <= 1'b0;
    end else begin
        valid_d1 <= valid_in;
        valid_d2 <= valid_d1;
    end
end

assign valid_out = valid_d2;  // Now aligned with crc_out

Every signal must have the same pipeline depth. Valid, data, CRC, and any sidebands. No exceptions.
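
Once the data path grows past a couple of stages, hand-written valid_d1, valid_d2 chains tend to drift out of step with it. One way to keep them aligned is a parameterized delay line, sketched below. The module name stream_delay, the DEPTH and SB_WIDTH parameters, and the packed sideband port are assumptions for illustration, and like the code above it only applies when downstream never stalls.

```systemverilog
// Hypothetical helper: delays valid and sidebands by DEPTH cycles so they
// stay aligned with a DEPTH-stage datapath. Assumes no backpressure.
module stream_delay #(
    parameter int DEPTH    = 2,  // must equal the datapath latency, >= 1
    parameter int SB_WIDTH = 1   // packed sidebands (last, keep, user, ...)
) (
    input  logic                clk,
    input  logic                rst,
    input  logic                valid_in,
    input  logic [SB_WIDTH-1:0] sideband_in,
    output logic                valid_out,
    output logic [SB_WIDTH-1:0] sideband_out
);

    logic [DEPTH-1:0]    valid_sr;
    logic [SB_WIDTH-1:0] sideband_sr [DEPTH];

    // Valid is reset so no stale beat appears after reset
    always_ff @(posedge clk) begin
        if (rst) begin
            valid_sr <= '0;
        end else begin
            valid_sr[0] <= valid_in;
            for (int i = 1; i < DEPTH; i++) valid_sr[i] <= valid_sr[i-1];
        end
    end

    // Sidebands need no reset: they only matter while valid is set
    always_ff @(posedge clk) begin
        sideband_sr[0] <= sideband_in;
        for (int i = 1; i < DEPTH; i++) sideband_sr[i] <= sideband_sr[i-1];
    end

    assign valid_out    = valid_sr[DEPTH-1];
    assign sideband_out = sideband_sr[DEPTH-1];

endmodule
```

For the CRC example, DEPTH = 2 reproduces valid_d2. If the datapath depth changes, the delay changes with one parameter instead of a hand edit in every chain.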

The Backward Path: Why Registering Ready Breaks

You can’t just register ready without storage:

// BROKEN: ready is pipelined without storage
always_ff @(posedge clk) begin
    ready_out <= ready_in;  // One cycle late
end

When downstream deasserts ready, upstream doesn’t see it until one cycle later. During that cycle, upstream sends data that has nowhere to go:

         ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐
clk      │   │   │   │   │   │   │   │   │   │
      ───┘   └───┘   └───┘   └───┘   └───┘   └───
          ___________________________
valid_in /                           \___________
              _______________
ready_in ____/               \___________________ <- Deasserts here
                      _______________
ready_out ___________/               \___________ <- Upstream sees it late
                                 ^
                                 └── Data sent here is lost (no storage)

Registering ready is fine if you add storage for the in-flight beat. That’s what a skid buffer provides.

Which Structure Do I Need?

Before diving into implementations, here’s how to choose:

Does downstream ever stall (ready goes low)?
│
├─ No  → Simple registered stage (no storage needed)
│        Use "The Happy Path" code above
│
└─ Yes → Do you need to break FORWARD timing (valid/data path)?
         │
         ├─ No  → Bypass skid buffer
         │        Handles backpressure only
         │        Forward path is still combinational
         │
         └─ Yes → Do you need to break BACKWARD timing (ready path)?
                  │
                  ├─ No  → Single-entry register slice
                  │        The "accept = ready || !valid" pattern
                  │        Ready is still combinational
                  │
                  └─ Yes → Vendor register slice (or 2-entry FIFO)
                           Vendor slice registers both directions
                           The 2-entry FIFO below keeps ready combinational

Most production designs end up at the bottom: 2-entry FIFO or vendor IP. But knowing the progression helps you understand why.

Bypass Skid Buffer: Backpressure Without Timing Fix

A bypass skid buffer catches the beat in flight when backpressure arrives late. It does NOT break the forward timing path: data passes through combinationally when the buffer is empty.

module bypass_skid #(
    parameter WIDTH = 32
) (
    input  logic             clk,
    input  logic             rst,
    input  logic             valid_in,
    output logic             ready_out,
    input  logic [WIDTH-1:0] data_in,
    output logic             valid_out,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out
);

    logic             buf_valid;
    logic [WIDTH-1:0] buf_data;

    // COMBINATIONAL forward path when buffer empty
    assign valid_out = buf_valid || valid_in;
    assign data_out  = buf_valid ? buf_data : data_in;

    // Ready when buffer is empty
    assign ready_out = !buf_valid;

    always_ff @(posedge clk) begin
        if (rst) begin
            buf_valid <= 1'b0;
        end else if (buf_valid) begin
            // Buffer full: drain when downstream ready
            if (ready_in) buf_valid <= 1'b0;
        end else begin
            // Buffer empty: catch if downstream stalls
            if (valid_in && !ready_in) begin
                buf_valid <= 1'b1;
                buf_data  <= data_in;
            end
        end
    end

endmodule

What this does: Catches one beat when backpressure arrives, then stalls upstream until the buffer drains. This version is intentionally conservative: once it captures one beat, it deasserts ready_out until that beat is fully drained. If the upstream violates “hold stable while stalled,” this skid will faithfully buffer garbage.

What this does NOT do: Break the forward timing path. If data_in has negative slack, data_out still has negative slack.

Throughput: If ready_in pulses every other cycle (1-0-1-0), throughput drops to 50%. If ready_in stays low, throughput is 0% until it returns. Upstream sees a bubble after draining: ready_out is low while the buffer is full, so you cannot accept a new beat in the same cycle you forward the buffered one.

Here’s the behavior under alternating ready_in:

         ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐
clk      │   │   │   │   │   │   │   │   │   │   │   │   │   │
      ───┘   └───┘   └───┘   └───┘   └───┘   └───┘   └───┘   └───
          _________________________________________________
valid_in /                                                 \_
          _______         _______         _______
ready_in /       \_______/       \_______/       \___________
          _______                 _______                 ___
ready_out/       \_______       _/       \_______       _/
                  ^               ^               ^
                  |               |               └── Bubble
                  |               └── Buffer drains
                  └── Buffer catches beat

data     < A  >< B  >< B  >< C  >< C  >< D  >< D  >< E  >
                      held        held        held

Beat B gets caught in the buffer, then delivered. During the drain cycle, ready_out is low (buffer full), so we can’t accept C yet. This repeats, giving 50% effective throughput.

Reset note: With synchronous reset, hold rst high for at least one clock so internal state is known. After that, ready_out is high because the buffer is empty. Reset flushes any buffered beat. If you need lossless reset, add reset fencing or replay above this block.

Two-Entry FIFO: Forward Timing and Backpressure

To break the forward timing path and absorb stalls, use a 2-entry FIFO. This is boring. It is correct. That is the point.

When I say “break the forward timing path,” I mean removing the same-cycle combinational path from input payload signals (data_in, valid_in, sidebands) to output payload signals (data_out, valid_out, sidebands). The ready path is a separate concern.

This 2-entry FIFO is the core logic inside what vendors call a “fully registered slice.”

Warning: ready_out is still combinational; it depends on ready_in via pop. If you chain many of these, ready can still become your critical path. Vendor register slices solve this with internal staging. Never generate ready_in combinationally from ready_out, because that creates a ready loop that can deadlock or oscillate. Same rule for valid: don’t build combinational loops involving valid and ready across blocks.

This implementation uses explicit flops (not array inference) for deterministic behavior:

module stream_fifo2 #(
    parameter WIDTH = 32
) (
    input  logic             clk,
    input  logic             rst,
    input  logic             valid_in,
    output logic             ready_out,
    input  logic [WIDTH-1:0] data_in,
    output logic             valid_out,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out
);

    // Explicit flops, not array (deterministic inference)
    logic [WIDTH-1:0] q0, q1;
    logic [1:0]       count;

    assign valid_out = (count != 2'd0);
    assign data_out  = q0;  // Combinational read from storage flop

    // Pop-enables-push: can accept when not full OR when popping
    wire pop  = valid_out && ready_in;
    assign ready_out = (count != 2'd2) || pop;

    wire push = valid_in && ready_out;

    always_ff @(posedge clk) begin
        if (rst) begin
            count <= 2'd0;
        end else begin
            unique case ({push, pop})
                2'b10: begin  // Push only
                    if (count == 2'd0) q0 <= data_in;
                    else               q1 <= data_in;
                    count <= count + 1;
                end
                2'b01: begin  // Pop only
                    // Shift (q1 don't-care when count==1 because valid_out deasserts)
                    q0 <= q1;
                    count <= count - 1;
                end
                2'b11: begin  // Push and pop simultaneously
                    if (count == 2'd1) begin
                        q0 <= data_in;  // Replace head
                    end else begin  // count == 2
                        q0 <= q1;       // Shift
                        q1 <= data_in;  // Refill
                    end
                    // count unchanged
                end
                default: ;  // Neither
            endcase
        end
    end

endmodule

What this does:

Storage in explicit flops (q0, q1), so there is no RAM inference ambiguity
data_out = q0 comes from a storage flop, not a bypass mux
ready_out includes pop-enables-push logic: can accept a new beat in the same cycle we’re draining one
No drain bubbles when popping and pushing simultaneously

Latency: Minimum 1 cycle. A beat accepted on the input in cycle N can be accepted by the sink no earlier than cycle N+1. Both data and valid are effectively registered (data_out reads from q0, valid_out derives from the registered count).

Reset note: Same as the bypass skid. With synchronous reset, hold rst high for at least one clock so count is known. After that, ready_out is high because the FIFO is empty. Reset flushes any buffered beats.

This FIFO is not fall-through: an input transfer loads q0 on the clock edge, and valid_out asserts in the following cycle.

Cycle-by-cycle under alternating ready (valid_in always high, presenting A, B, C, D, E, F):

         ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐   ┌───┐
clk      │   │   │   │   │   │   │   │   │   │   │   │   │   │
      ───┘   └───┘   └───┘   └───┘   └───┘   └───┘   └───┘   └───
              _________________________________________________
valid_in     /                                                 \

          _______         _______         _______         _______
ready_in /       \_______/       \_______/       \_______/
          _____________________________________________________
valid_out     /                                                 \
                  (invalid)
                      ^           ^           ^           ^
                      │           │           │           │
                      A xfer      B xfer      C xfer      D xfer

data_out ────< ? ><  A  ><  A  ><  B  ><  B  ><  C  ><  C  ><  D  >
                          held        held        held

count    ──< 0 ><  1  ><  2  ><  2  ><  2  ><  2  ><  2  ><  2  >
                              ↑           ↑           ↑
                              │           │           └── push+pop: count stays 2
                              │           └── push+pop: count stays 2
                              └── FIFO full, but pop enables push

Key insight: When count == 2 and ready_in goes high, pop is true, so ready_out = (count != 2) || pop = 0 || 1 = 1. We push and pop simultaneously, avoiding drain bubbles.

With ready_in alternating 1/0, sink throughput is 50% by definition: the sink only accepts on half the cycles. The FIFO’s advantage is that it keeps accepting from upstream on pop cycles even when full, avoiding extra bubbles beyond what the sink forces.

The advantages over bypass skid are:

Output comes from storage flops, not a combinational bypass mux
Two beats absorbed before stalling (vs one)
Cleaner timing characteristics (though not fully isolated without an output register)

Resource cost: Two WIDTH-bit data registers, a 2-bit count, and the muxing in front of q0. LUT usage scales with WIDTH, so measure in your build.

Conceptually, fully-registered AXI register slices behave like a small FIFO. Use the vendor IP for production; use this code to understand what it does.
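
For a sense of what staging on the ready path can look like, here is a minimal sketch of a slice whose ready_out, valid_out, and data_out all come straight from flops. It is one common construction (an output register plus a skid register), not the vendor implementation. The module name reg_slice_sketch and the internal signal names are assumptions; the port names follow the modules above.

```systemverilog
// Sketch only: output register + skid register.
// ready_out, valid_out and data_out are all flop outputs.
module reg_slice_sketch #(
    parameter WIDTH = 32
) (
    input  logic             clk,
    input  logic             rst,
    input  logic             valid_in,
    output logic             ready_out,
    input  logic [WIDTH-1:0] data_in,
    output logic             valid_out,
    input  logic             ready_in,
    output logic [WIDTH-1:0] data_out
);

    logic             out_valid, skid_valid;
    logic [WIDTH-1:0] out_data,  skid_data;

    assign valid_out = out_valid;
    assign data_out  = out_data;
    assign ready_out = !skid_valid;  // registered: no path from ready_in

    wire in_xfer  = valid_in && ready_out;   // beat accepted from upstream
    wire out_free = !out_valid || ready_in;  // output register may load

    always_ff @(posedge clk) begin
        if (rst) begin
            out_valid  <= 1'b0;
            skid_valid <= 1'b0;
        end else if (out_free) begin
            if (skid_valid) begin
                // Drain the parked beat first to preserve ordering
                out_data   <= skid_data;
                out_valid  <= 1'b1;
                skid_valid <= 1'b0;
            end else begin
                // Normal flow: input goes straight to the output register
                out_valid <= in_xfer;
                if (in_xfer) out_data <= data_in;
            end
        end else if (in_xfer) begin
            // Output is stalled but upstream already sent: park the beat
            skid_data  <= data_in;
            skid_valid <= 1'b1;
        end
    end

endmodule
```

The cost is the same two data registers as the 2-entry FIFO. The difference is that ready_in only feeds flop inputs, so chaining these does not build a long combinational ready path. Treat it as a study aid and verify it with the assertions in the next section before relying on it.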

Verifying Your Pipeline: SVA Assertions

Drop these into your testbench to catch protocol violations:

// Data must stay stable when valid && !ready
property data_stable_under_backpressure;
    @(posedge clk) disable iff (rst)
    (valid_out && !ready_in) |=> ($stable(data_out));
endproperty
assert property (data_stable_under_backpressure);

// Valid cannot drop without a transfer
property valid_until_accepted;
    @(posedge clk) disable iff (rst)
    (valid_out && !ready_in) |=> valid_out;
endproperty
assert property (valid_until_accepted);

For signals with sidebands (AXI-Stream’s tlast, tkeep, tuser), either assert each separately or pack them into a struct and assert stability on the struct. Replace valid_out with m_axis_tvalid, ready_in with m_axis_tready.
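
As a sketch of the struct approach: the checker below packs data and sidebands into one struct and asserts stability on the whole thing. The module name axis_payload_checker, the struct type, and every field width are placeholders for illustration; substitute the widths of your own interface.

```systemverilog
// Hypothetical checker: one stability assertion covering data and sidebands.
// Field widths are placeholders, not taken from any specific interface.
module axis_payload_checker (
    input logic        clk,
    input logic        rst,
    input logic        m_axis_tvalid,
    input logic        m_axis_tready,
    input logic [31:0] m_axis_tdata,
    input logic [3:0]  m_axis_tkeep,
    input logic        m_axis_tlast,
    input logic        m_axis_tuser
);

    typedef struct packed {
        logic [31:0] tdata;
        logic [3:0]  tkeep;
        logic        tlast;
        logic        tuser;
    } axis_payload_t;

    axis_payload_t payload;

    // Collect every payload signal in one place so none is forgotten
    assign payload = '{
        tdata: m_axis_tdata,
        tkeep: m_axis_tkeep,
        tlast: m_axis_tlast,
        tuser: m_axis_tuser
    };

    // Any field changing during a stall fails the same assertion
    property payload_stable_under_backpressure;
        @(posedge clk) disable iff (rst)
        (m_axis_tvalid && !m_axis_tready) |=> $stable(payload);
    endproperty
    assert property (payload_stable_under_backpressure)
        else $error("payload changed while stalled");

endmodule
```

Adding a sideband later then means adding one struct field and one line in the assignment, rather than remembering to write another property.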

The Timing Connection

Article 2 showed why a pipeline register can make timing worse. Here’s how that connects:

Logic-dominated path: A register slice’s output register breaks the combinational depth. This is what you want.

Route-dominated path: Adding a register slice doesn’t help if the problem is wire length. Use pblocks (Vivado) or Logic Lock regions (Quartus) to pull logic together.

Bypass skid tradeoff: Solves backpressure but adds a mux on the data path. If your path is already logic-dominated, this can hurt timing. Use a 2-entry FIFO instead: its output reads from a storage flop rather than a bypass mux, giving cleaner timing characteristics.

Register slice tradeoff: Breaks timing paths but can worsen placement if it pulls logic apart across the die. Check placement before and after.

Common Pipelining Mistakes

Mistake	What Happens	Fix
Pipeline `data` but not `valid`	Downstream captures stale data	Always register together
Pipeline `valid` but not `data`	Downstream captures garbage	Always register together
Register `ready` without storage	Data lost during backpressure	Use skid buffer or FIFO
Bypass skid on timing-critical path	Forward path still combinational	Use 2-entry FIFO
Update output unconditionally	Overwrites data during stall	Gate on `ready || !valid`
Pipeline depth mismatch	Control and data desynchronize	Audit pipeline depths
Forget to pipeline sideband	`last`, `keep`, `user` out of sync	Include ALL signals

The Sideband Trap

AXI-Stream has tvalid, tready, and tdata, but also tlast, tkeep, tuser, tid, and tdest. Every signal must pipeline together:

// Single-entry slice with sidebands
wire accept = m_axis_tready || !m_axis_tvalid;

always_ff @(posedge clk) begin
    if (rst) begin
        m_axis_tvalid <= 1'b0;
    end else if (accept) begin
        m_axis_tvalid <= s_axis_tvalid;
        m_axis_tdata  <= s_axis_tdata;
        m_axis_tlast  <= s_axis_tlast;
        m_axis_tkeep  <= s_axis_tkeep;
        m_axis_tuser  <= s_axis_tuser;
    end
end

assign s_axis_tready = accept;

This single-entry slice breaks forward timing. The ready path remains combinational. It’s protocol-correct under sustained backpressure, but provides no extra buffering. If ready is delayed anywhere in your pipeline, you need a 2-entry FIFO or the vendor IP.

Retiming: Let the Tool Move Registers

Before adding manual pipeline stages, try retiming.

Vivado (synthesis option):

# Enable retiming for the synthesis run
set_property -name {STEPS.SYNTH_DESIGN.ARGS.MORE OPTIONS} \
    -value {-retiming} -objects [get_runs synth_1]

Quartus:

set_global_assignment -name ALLOW_REGISTER_RETIMING ON

Retiming moves existing registers across combinational logic to balance path delays. It can split a 4-level logic path into two 2-level paths by moving a downstream register backward.

Limitations:

Won’t move registers across module hierarchy
Won’t move registers with asynchronous resets
Won’t fix route-dominated paths
Confused by feedback loops: retiming algorithms struggle with ready signals because they feed backward. The tool often refuses to move registers across logic involving valid/ready loops.
Makes RTL-to-netlist debug harder (register names change)

Check the synthesis log to confirm retiming occurred. If it didn’t, you’re back to manual pipelining.

Latency vs. Throughput: Know What You’re Trading

Pipelining doesn’t make logic faster. It lets you clock faster by reducing combinational depth per stage. The tradeoff is latency in cycles.

Conceptual example (ideal scaling):

Metric	Before	+1 stage	+3 stages
Combinational delay	8 ns	4 ns	2 ns
Min clock period	8 ns	4 ns	2 ns
Latency (cycles)	1	2	4
Throughput @ max clock	1x	2x	4x

In practice, register overhead and routing don’t scale this cleanly. Profile your design.

If your system has a latency constraint (feedback loop, control path, real-time deadline), you can’t just add stages. Know which problem you have before adding registers.

When NOT to Pipeline

Feedback loops: If you pipeline threshold_reached in a credit counter, the loop reacts one cycle late. This can cause overflow. Don’t pipeline feedback paths; pipeline the fanout instead.
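
A minimal sketch of what pipelining the fanout can look like: the comparison that gates new requests stays unregistered inside the loop, and only a copy for distant consumers gets a register. The module name credit_gate, its ports, and the MAX_CREDITS parameter are hypothetical; threshold_reached is the only name taken from the text.

```systemverilog
// Hypothetical credit gate: the loop signal stays combinational,
// only the status copy is registered.
module credit_gate #(
    parameter int MAX_CREDITS = 8
) (
    input  logic clk,
    input  logic rst,
    input  logic send_req,            // upstream wants to send one beat
    input  logic credit_return,       // receiver freed a slot (assumed only when in_flight > 0)
    output logic send_grant,          // beat may be sent this cycle
    output logic threshold_reached_q  // registered copy for status/debug fanout
);

    logic [$clog2(MAX_CREDITS+1)-1:0] in_flight;

    // Loop signal: must react in the same cycle, so no register here
    wire threshold_reached = (in_flight == MAX_CREDITS);
    assign send_grant = send_req && !threshold_reached;

    always_ff @(posedge clk) begin
        if (rst) begin
            in_flight <= '0;
        end else begin
            case ({send_grant, credit_return})
                2'b10:   in_flight <= in_flight + 1'b1;
                2'b01:   in_flight <= in_flight - 1'b1;
                default: ;  // both or neither: count unchanged
            endcase
        end
    end

    // Fanout copy: one cycle late, so never feed it back into send_grant
    always_ff @(posedge clk) begin
        if (rst) threshold_reached_q <= 1'b0;
        else     threshold_reached_q <= threshold_reached;
    end

endmodule
```

If threshold_reached itself is the failing path, shortening the loop logic (a narrower counter, or an early "almost full" compare with margin) is usually a safer direction than registering it.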

Fixed-latency protocols: PCIe completion timeout, DDR read latency, video blanking. Adding stages without adjusting the protocol breaks things at the system level.

AXI Register Slice IP

For production designs, use the vendor IP.

Vivado IP Catalog: Search for “axis_register_slice”

REG_CONFIG settings (check the current Product Guide; values change between versions):

0 = Bypass (no registers)
1 = Fully registered (2-entry storage, 1-2 cycle latency)
Other values for SLR crossing, lightweight modes

Warning: Lightweight modes may insert bubble cycles. If throughput matters, use fully registered mode and verify behavior.

Instantiation: Use the generated wrapper name from your IP catalog. Hardcoded version strings become stale.

Debug Checklist

When pipelining breaks something:

Did you pipeline all signals together? Valid, data, and all sidebands must have matching latency
Did you gate updates on ready || !valid? Unconditional updates break the protocol
Did you handle backpressure? If ready is registered anywhere, you need storage
Did you add latency to a feedback path? Check credit counters, flow control, state machines
Did placement change? New registers can pull logic apart, so compare placement reports

Quick Reference

The fundamental rule: A transfer happens when valid && ready. Both must see the same data on that edge.

Protocol invariant: When valid && !ready, the source must hold valid and data stable.

Decision tree:

No stalls → Simple registered stage
Stalls, no forward timing issue → Bypass skid buffer
Stalls, forward timing issue, ready can be combinational → Single-entry slice
Stalls, forward timing issue, need full isolation → Vendor register slice (the 2-entry FIFO here still has combinational ready)

Forward path (valid + data): Gate updates on ready || !valid. This breaks timing.

Backward path (ready): Cannot register without storage. Use a skid buffer or FIFO.

Bypass skid buffer: Handles backpressure. Does NOT break forward timing (combinational pass-through). 50% throughput under alternating ready.

2-entry FIFO: Breaks forward timing path, absorbs stalls. Ready is still combinational. Absorbs two beats before stalling. Use this or the vendor IP.

The Protocol Isn’t Optional

Article 2 taught you to read timing reports, to trace the math from requirement to arrival. You can close timing now.

But timing closure means nothing if the design doesn’t work.

A bypass skid buffer handles backpressure but doesn’t break the forward timing path. A naive register breaks timing but loses data. A 2-entry FIFO breaks forward timing and absorbs stalls, but ready is still combinational. Know which problem you’re solving.

The timing tool doesn’t know your protocol. It sees flip-flops and combinational logic. It doesn’t know that valid and data must move together, or that registering ready needs storage, or that your unconditional update just violated the handshake.

You know. That’s why you’re the engineer.

Timing Series

Your FPGA Lives a Lifetime While You Blink - Why timing satisfies or breaks
Constraints: The Contract You Forgot to Sign - How to write constraints
Understanding Timing Analysis - How to read timing reports
Pipelining Without Breaking Your Protocol - How to fix violations (you are here)
Silicon Real Estate: Your Resource Budget - How to manage resources
CDC: Two Flip-Flops Are Not Magic - How to cross clock domains
Resets: The Timing Event You Forgot - How to handle resets