1 of 18

Turbine deep dive + Alpenglow overview

P2P Networking Collaborators

Raúl Kripalani

P2P networking, Ethereum Foundation

2 of 18

Turbine Deep Dive

Goals and priors

High-level concepts

TVU stages

Tree-based topology

Routing rules

Transport and connectivity

FEC

Gulf Stream

CRDS

Takeaways and Ideas

3 of 18

Turbine goals and priors

Context

Chain timing

  • Slot times: ~400ms (vs. Ethereum’s 12s)
  • Epoch duration: 432k slots = ~2 days.
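
Quick sanity check on those numbers: 432,000 slots × 0.4 s/slot = 172,800 s = 48 h ≈ 2 days.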

Consensus

  • PoS + Tower BFT (pBFT family).
  • Leader schedule revealed at the start of an epoch for the current and next epochs (significant “heads up”).
  • Validators vote for a fork on chain (vote tx); each vote is locked for an exponentially increasing “timeout” (see the note after this list).
  • Finality is achieved via a ≥2/3 supermajority of stake (the classic 3f+1 fault model).
  • Output state (“bank”) is recorded as a hash on the PoH stream.
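
A note on the exponential lockout above: each successive vote on the same fork doubles the lockout, so after n consecutive confirmations a vote is locked for on the order of 2^n slots; near the top of the tower the lockout is so large that the vote is effectively irreversible.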

Tx ordering

  • PoH: “a decentralized clock”. Stream of high-frequency VDF runs representing the passage of time. (Not a vector clock; no notion of causality).
  • Transactions are interleaved into the PoH stream when placed in a block.

Mempool

  • There is no mempool in Solana. Txs are forwarded to expected leaders via Gulf Stream (a QUIC-based protocol).

Problem framing. How to quickly disseminate large amounts of block data to thousands of validators globally with minimal latency.

4 of 18

High-level concepts

  • Blocks are groupings of Entries that a leader includes in a particular Slot.
  • Blocks are broken into smaller MTU-sized pieces called Shreds.
  • Shreds are erasure-coded with Reed-Solomon, merkleized, and rapidly distributed to validators via Turbine (a minimal shredding sketch follows this list).
  • Validators are continuously receiving shreds and attempting to reconstruct the original Vec<Entry>.
  • Each Entry includes PoH metadata and the sequence of transactions included.
  • Every block includes DEFAULT_TICKS_PER_SLOT=64 tick Entries (empty Entries that advance PoH), in addition to its transaction Entries.
  • While Entries are necessarily sequential due to PoH chaining, this design allows for multi-threaded staging and packing.
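
A minimal sketch of the Entry-to-Shred pipeline above. The types, field layout, and payload size are simplified stand-ins (not Agave's actual Shred format), chosen only to show the chunking into MTU-sized pieces:

  // Illustrative only: simplified stand-ins for Agave's Entry/Shred types.
  const SHRED_PAYLOAD_BYTES: usize = 1_200; // keep each shred below a typical ~1,500-byte MTU

  struct Entry {
      num_hashes: u64,            // PoH hashes since the previous entry
      hash: [u8; 32],             // PoH hash at this entry
      transactions: Vec<Vec<u8>>, // serialized transactions (opaque here)
  }

  struct DataShred {
      slot: u64,
      index: u32,
      payload: Vec<u8>, // a <= MTU-sized chunk of the serialized entries
  }

  /// Serialize a slot's entries (serialization elided) and cut the byte stream
  /// into MTU-sized data shreds; coding shreds are then derived per FEC set.
  fn shred_entries(slot: u64, serialized_entries: &[u8]) -> Vec<DataShred> {
      serialized_entries
          .chunks(SHRED_PAYLOAD_BYTES)
          .enumerate()
          .map(|(i, chunk)| DataShred { slot, index: i as u32, payload: chunk.to_vec() })
          .collect()
  }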

5 of 18

TVU stages

Anza’s Agave validator pipelines heavily revolve around the notion of stages and phases. TVU (Transaction Validation Unit) stages:

1. Shred Fetch Stage

((TURBINE))

Receives shreds from the network via Turbine. Listens for incoming shreds propagated by other validators (parents in the Turbine tree or directly from the leader). Buffers these shreds for processing.

2. SigVerify Shreds Stage

Verifies shred signatures and prepares them for further processing. Deduplicates incoming shreds. Verifies the original leader's signature on each shred. Verifies the signature of the immediate retransmitter node (if applicable). Re-signs shreds that the validator will retransmit. Forwards verified shreds to the Retransmit Stage and Window Stage.
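
A rough sketch of the dedup-then-verify flow in this stage. The ShredId tuple and the verify_leader_signature stub are hypothetical; the real check is an Ed25519 verification of the leader's signature over the shred's signed region:

  use std::collections::HashSet;

  type ShredId = (u64 /* slot */, u32 /* index */, u8 /* shred type */);

  /// Stand-in for the real signature check against the slot leader's pubkey.
  fn verify_leader_signature(_shred_bytes: &[u8], _leader_pubkey: &[u8; 32]) -> bool {
      // Crypto elided in this sketch.
      true
  }

  /// Drop duplicates first (cheap), then drop shreds whose leader signature
  /// does not verify; only the survivors move on to Retransmit/Window.
  fn filter_shreds<'a>(
      incoming: impl Iterator<Item = (ShredId, &'a [u8])>,
      leader_pubkey: &[u8; 32],
      seen: &mut HashSet<ShredId>,
  ) -> Vec<ShredId> {
      incoming
          .filter(|(id, _)| seen.insert(*id))
          .filter(|(_, bytes)| verify_leader_signature(bytes, leader_pubkey))
          .map(|(id, _)| id)
          .collect()
  }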

3. Retransmit Stage

((TURBINE))

Propagates shreds further down the Turbine tree. Deduplicates shreds again with a more nuanced filter. Determines the next set of child validators using ClusterNodes and AddrCache (stake-weighted, deterministic tree logic). Forwards the verified and re-signed shreds to these children (UDP/XDP/QUIC).

4. Window Stage (Block Reconstruction & Replay)

Reconstructs blocks, verifies PoH, and replays transactions. Inserts verified shreds into the local Blockstore. Initiates repair requests for missing shreds from peers or reconstructs using erasure codes. Reconstructs PoH entries from complete sets of shreds. Verifies the Proof of History sequence. Replays the transactions from the verified entries against its local bank state to ensure the leader's execution was correct.
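
The erasure-recovery step could look roughly like this with the reed-solomon-erasure crate (a sketch, not Agave's actual recovery path; it assumes a 32:32 FEC set and equal-length shard payloads):

  use reed_solomon_erasure::galois_8::ReedSolomon;

  /// Recover the missing shreds of one FEC set. `shards` holds the 32 data +
  /// 32 coding shard payloads in order, with missing ones set to `None`;
  /// recovery succeeds as long as any 32 of the 64 shards are present.
  fn recover_fec_set(shards: &mut [Option<Vec<u8>>]) -> Result<(), reed_solomon_erasure::Error> {
      let rs = ReedSolomon::new(32, 32)?; // 32 data shards, 32 parity shards
      rs.reconstruct(shards)?;            // fills in the `None` entries in place
      Ok(())
  }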

5. Voting Stage

Participates in consensus by voting on validated forks. Once a block (or a fork derived from it) is validated through the Replay phase, the validator casts a vote. Votes are part of Solana's Tower BFT consensus mechanism, contributing to finality. Votes are also transactions and are gossiped to the network.

6 of 18

Tree-based deterministic topology

Permissioned: participation in Turbine is restricted to active validators only.

Every shred S that the leader L intends to transmit at slot N receives its own propagation tree, with a fanout of 200 (DATA_PLANE_FANOUT).

The shred’s topology is deterministically calculated by performing a stake-weighted shuffle, seeded with shred ID (slot, index, type) and the leader’s pubkey as pseudo-randomness.

Every validator derives the same routing table for the shred in question.
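
One way to realize such a deterministic, stake-weighted shuffle is Efraimidis–Spirakis sampling seeded from the shred ID and leader pubkey. Agave has its own weighted-shuffle implementation and seed derivation, so treat the hashing and sampling details below purely as illustrative assumptions:

  use rand::{Rng, SeedableRng};
  use rand_chacha::ChaCha20Rng;

  /// Derive a per-shred seed from (slot, index, shred type, leader pubkey).
  /// Any agreed-upon derivation works for the sketch; the real one differs.
  fn shred_seed(slot: u64, index: u32, shred_type: u8, leader: &[u8; 32]) -> [u8; 32] {
      let mut seed = [0u8; 32];
      seed[..8].copy_from_slice(&slot.to_le_bytes());
      seed[8..12].copy_from_slice(&index.to_le_bytes());
      seed[12] = shred_type;
      for (i, b) in leader.iter().enumerate() {
          seed[i] ^= b; // mix in the leader pubkey
      }
      seed
  }

  /// Every validator draws a key u^(1/stake) from the shared seed and the list
  /// is sorted by key, so higher stake tends to land earlier. All nodes compute
  /// the same order because they start from the same seed and stake list.
  fn weighted_order(seed: [u8; 32], stakes: &[(String, u64)]) -> Vec<String> {
      let mut rng = ChaCha20Rng::from_seed(seed);
      let mut keyed: Vec<(f64, &String)> = stakes
          .iter()
          .map(|(id, stake)| {
              let u: f64 = rng.gen_range(f64::EPSILON..1.0);
              (u.powf(1.0 / *stake as f64), id)
          })
          .collect();
      keyed.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap()); // descending key
      keyed.into_iter().map(|(_, id)| id.clone()).collect()
  }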

7 of 18

Routing rules

Propagation:

  • For each S, the leader only sends to ONE other validator (the Root, or L0, for that S).
  • L0 then retransmits to DATA_PLANE_FANOUT validators, based on its own position in the vector.
  • Same for L1, L2, etc. (I assume the routing table is a vector that gets partitioned in layers).
  • Due to the large fanout and the limited number of Solana validators, every validator can be reached in 2 hops.
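
A simplified sketch of that "vector partitioned in layers" assumption: treat the shuffled vector as a 200-ary tree laid out breadth-first, so each node can compute its children from its own index. This is not necessarily Agave's exact index math:

  const DATA_PLANE_FANOUT: usize = 200;

  /// Given my position in the shuffled vector (0 = the root L0), return the
  /// index range of the nodes I retransmit this shred to: node i feeds nodes
  /// [i*200 + 1, i*200 + 200]. With only a few thousand validators, everyone
  /// sits within two layers below the root, hence the 2-hop observation above.
  fn retransmit_children(my_index: usize, total_nodes: usize) -> std::ops::Range<usize> {
      let first = (my_index * DATA_PLANE_FANOUT + 1).min(total_nodes);
      let last = (my_index * DATA_PLANE_FANOUT + 1 + DATA_PLANE_FANOUT).min(total_nodes);
      first..last
  }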

Processes:

  • Validation: the authenticity of every shred is validated via Merkle proof and signatures (sketched after this list).
  • Passive repair: validators constantly try to complete slots in their Blockstore by “deshredding”.
  • Active repair: if unable to repair, the Window stage of the TVU will request missing shreds from their expected parents via RPC.
  • No recoding happens in transit.
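
For the Merkle-proof validation mentioned above, the check is the usual fold from leaf hash to root. This sketch assumes SHA-256 via the sha2 crate and a simple left/right sibling encoding; Agave's exact hashing and domain separation differ:

  use sha2::{Digest, Sha256};

  /// One step of the Merkle path: the sibling hash and whether it sits to the
  /// left of the node being hashed upward.
  struct ProofStep {
      sibling: [u8; 32],
      sibling_is_left: bool,
  }

  /// Fold the shred's leaf hash up the tree and compare the result against the
  /// root the leader signed for this FEC set.
  fn verify_merkle_proof(leaf: [u8; 32], proof: &[ProofStep], expected_root: [u8; 32]) -> bool {
      let mut node = leaf;
      for step in proof {
          let mut hasher = Sha256::new();
          if step.sibling_is_left {
              hasher.update(step.sibling);
              hasher.update(node);
          } else {
              hasher.update(node);
              hasher.update(step.sibling);
          }
          node = hasher.finalize().into();
      }
      node == expected_root
  }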

8 of 18

Transport and connectivity

Transports:

  • UDP (with XDP support on Linux)
  • QUIC (AFAIK, preferred)

Remarks:

  • Data frames are MTU-sized.
  • Note that any validator can be chosen to relay to up to 200 other validators at any point in time.
  • And multiple times within a slot (~400ms).
  • So bandwidth and latency requirements for validators are extremely intense (as are compute and storage).

Some insightful data from the Alpenglow paper:

With a bandwidth of 1Gb/s, transmitting n = 1,500 shreds takes 18 ms (well below the average network delay of about 80 ms). To get to 80% of the total stake we need to reach n ≈ 150 nodes, which takes only about 2 ms. [suggests high stake concentration]
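
A quick back-of-the-envelope check of that figure, assuming MTU-sized (~1,500-byte) shreds: 1,500 shreds × 1,500 B ≈ 2.25 MB ≈ 18 Mb, which indeed takes about 18 ms to push through a 1 Gb/s link.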

9 of 18

FEC

Distinction between data shreds and coding shreds at the protocol level.

  • Reed-Solomon.
  • 1:1 ratio of data shreds to coding shreds.
  • Data shreds are grouped into FEC sets.
  • Coding shreds are generated for each set independently.
  • The last FEC set in a slot might be smaller if the total number of data shreds isn’t a perfect multiple of the FEC set size.
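
A sketch of generating the coding shreds for one FEC set with the reed-solomon-erasure crate, assuming a 32:32 set (the crate requires all shards of a set to have equal length):

  use reed_solomon_erasure::galois_8::ReedSolomon;

  /// Take 32 equal-length data shred payloads and append the 32 coding shreds
  /// computed over them (the 1:1 data-to-coding ratio noted above).
  fn encode_fec_set(data: Vec<Vec<u8>>) -> Result<Vec<Vec<u8>>, reed_solomon_erasure::Error> {
      assert_eq!(data.len(), 32);
      let shard_len = data[0].len();
      let rs = ReedSolomon::new(32, 32)?;
      let mut shards = data;
      shards.resize(64, vec![0u8; shard_len]); // zeroed placeholders for parity
      rs.encode(&mut shards)?;                 // fills the 32 parity shards in place
      Ok(shards)                               // 32 data shreds followed by 32 coding shreds
  }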

10 of 18

Gulf Stream (tx forwarding)

Key properties:

  • Leader schedule awareness
    • Solana has a deterministic leader schedule, so every node can know the upcoming leader schedule.
  • Transaction ingress and forwarding
    • When a client sends a transaction, it normally sends it to an RPC node, which then forwards it to the current leader or an upcoming leader.
  • Proactive forwarding to future leaders
    • If the validator that received the transaction is not the current leader, it will forward the transaction directly to the validator(s) designated to be the leader(s) in the near future.
    • “Smart clients” can also route at submission time.
  • Leader pre-processing
    • Validators about to become leaders can begin preprocessing and staging transactions before their slot begins.
  • Anticipatory nature
    • Leaders-to-be can hit the ground running because the network helps them via Gulf Stream.

Protocol: QUIC streams.
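
A minimal sketch of the forwarding-target selection implied above. The LeaderSchedule type and the "how far ahead to forward" policy are assumptions; the actual transport would be a QUIC stream to each target leader's TPU address:

  /// Leader schedule for an epoch: index = slot offset within the epoch,
  /// value = that slot's leader identity (pubkey rendered as a string here).
  type LeaderSchedule = Vec<String>;

  /// Pick the leaders of the next `lookahead` slots so a non-leader node (or a
  /// "smart client") knows where to forward a transaction.
  fn upcoming_leaders(
      schedule: &LeaderSchedule,
      current_slot_in_epoch: usize,
      lookahead: usize,
  ) -> Vec<&String> {
      (1..=lookahead)
          .filter_map(|offset| schedule.get(current_slot_in_epoch + offset))
          .collect()
  }

  // Usage sketch: forward the serialized tx to every node returned by
  // upcoming_leaders(&schedule, slot, 2) over QUIC (client code elided).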

11 of 18

CRDS (Cluster Replicated Data Store)

A Gossip mechanism to maintain an eventually-consistent shared view of members of the “Solana cluster” through timestamped, self-certified data entries called CrdsValues.

Key data propagated:

  • ContactInfo: validator IP addresses, ports, public keys.
  • Vote states: signed messages indicating a validator’s vote on specific forks/slots.
  • EpochSlots: information for the active repair mechanism (who has what shreds).
  • Version: validator software versions.
  • Other metadata like LowestSlot, snapshot hashes, etc.

CRDS does not appear to be permissioned.

12 of 18

CRDS (Cluster Replicated Data Store)

Two mechanisms.

  • Push:
    • Validators periodically send new/updated CRDS entries from their table to a small, random set of peers (fanout=9 currently).
    • Prune messages are used to optimize push paths and reduce redundancy.
  • Pull:
    • Validators periodically request missing/newer data from a random peer, sending a Bloom filter of the entries they already hold (so the peer only returns what’s missing).
    • The peer matches the filter against its own table and responds with the CRDS entries the requester is missing.

Data integrity and updates: All CrdsValue entries are signed by the originator. Wallclock timestamps are used to determine the latest version of an entry (ensuring eventual consistency).
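
A sketch of that "newest wallclock wins" upsert rule, with CrdsValue reduced to a toy struct and signature verification elided:

  use std::collections::HashMap;

  /// Toy stand-in for a CrdsValue: who it is about, what kind of entry it is,
  /// when the originator produced it, and an opaque signed payload.
  #[derive(Clone)]
  struct CrdsValue {
      origin: [u8; 32], // originator pubkey
      label: String,    // e.g. "ContactInfo", "Vote", "EpochSlots"
      wallclock: u64,   // originator-reported timestamp (ms)
      payload: Vec<u8>, // signed data (signature check elided in this sketch)
  }

  type CrdsTable = HashMap<([u8; 32], String), CrdsValue>;

  /// Accept `value` only if it is newer than what we already hold for the same
  /// (originator, label) key; this rule is what makes the table converge.
  fn upsert(table: &mut CrdsTable, value: CrdsValue) -> bool {
      let key = (value.origin, value.label.clone());
      match table.get(&key) {
          Some(existing) if existing.wallclock >= value.wallclock => false, // stale, drop
          _ => {
              table.insert(key, value);
              true // new or fresher entry accepted (and eligible for push to peers)
          }
      }
  }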

13 of 18

Interesting takeaways and ideas

  • Heavy reliance on connectionless protocols to enable extreme fanouts and rapidly-rotating peering.
  • Heavy QUIC users (Turbine and Gulf Stream)
  • XDP to squeeze every single bit of raw throughput (when using UDP).
  • Vertically optimized. Shreds do not exceed MTU.
  • Deterministic, highly ephemeral tree routing.
  • Permissioning at the Turbine and Gulf Stream level.
  • The role of stake-weighting in routing.
  • Uses gossip to maintain cluster information (instead of a DHT).
  • Requires tons of bandwidth and good peering (Alpenglow model reveals some network stats).

14 of 18

Alpenglow

15 of 18

High-level notes

Top-level goals:

  • Achieve faster block finality (single round if >80% stake, two rounds if >60%).
  • Relaxes consensus security model to 5f+1 (only tolerates 20% faulty nodes).
  • Stake-proportional bandwidth utilization for block dissemination.

Two key changes at the networking layer:

  • Replace Turbine with Rotor.
  • CRDS is split into on-chain transactions for critical validator info (identity, net addresses, etc.) and direct broadcasting for votes. It appears that gossip is out the door.

16 of 18

Rotor

Replaces Turbine for block dissemination.

  • An optimized and simplified version of Turbine, still based on erasure coding and shreds.

Slices:

  • Introduces “slices”. The leader constructs a block from transactions and breaks it into slices before erasure-coding each slice into shreds.
  • Each slice has a Merkle root signed by the leader, and shreds contain Merkle paths for verification.
  • I assume this slicing technique produces more efficient Merkle proofs (vs. a single ~1,500-node Merkle tree).

2-hop broadcast via relay layer:

  • For each slice, the leader sends each of its constituent shreds directly to a specifically sampled “shred relay” node.
  • “Partition Sampling” method is used to select these relays based on stake, aiming to improve resilience over previous methods.
  • Higher stake => higher bandwidth requirements to alleviate the leader bottleneck. Each node is expected to transmit data proportional to its stake.
  • Each shred relay is then responsible for broadcasting its assigned shred to all other nodes.
  • Relays send to the next leader first as an optimization, then to others in decreasing order of stake.
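
A rough illustration of stake-weighted, bin-based relay sampling in the spirit of the bullets above. This is not the paper's exact Partition Sampling algorithm; the bin construction and edge cases below are simplifications:

  use rand::distributions::{Distribution, WeightedIndex};
  use rand::rngs::StdRng;
  use rand::SeedableRng;

  /// Split validators into roughly equal-stake bins and sample one relay per
  /// bin with probability proportional to stake within that bin, so higher
  /// stake translates into relaying (and transmitting) more often.
  fn sample_relays(stakes: &[(String, u64)], num_relays: usize, seed: u64) -> Vec<String> {
      let total: u64 = stakes.iter().map(|(_, s)| s).sum();
      let target = total / num_relays as u64; // stake per bin, roughly
      let mut rng = StdRng::seed_from_u64(seed);

      let mut relays = Vec::with_capacity(num_relays);
      let mut bin: Vec<(String, u64)> = Vec::new();
      let mut bin_stake = 0u64;
      for (id, stake) in stakes {
          bin.push((id.clone(), *stake));
          bin_stake += *stake;
          if bin_stake >= target && relays.len() < num_relays {
              let dist = WeightedIndex::new(bin.iter().map(|(_, s)| *s)).expect("non-empty bin");
              relays.push(bin[dist.sample(&mut rng)].0.clone());
              bin.clear();
              bin_stake = 0;
          }
      }
      relays
  }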

17 of 18

CRDS

Node information updates:

  • Critical validator information (public key, stake, IP address, port numbers, etc.) is announced and updated by including it in an on-chain transaction.
  • This ensures consistency, guarded by chain consensus.

Vote and certificate dissemination:

  • Instead of relying on traditional CRDS push/pull gossip for votes and certificates, Alpenglow specifies that nodes “broadcast” these messages.
  • This “broadcast” is described as the sender looping over all other nodes and sending the message sequentially, or potentially using a multicast primitive (also mentions DoubleZero).
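
The "loop over all nodes" broadcast is straightforward; here is a sketch over UDP (the transport choice is an assumption, since the paper also leaves multicast/DoubleZero open, and real votes/certificates would be signed messages):

  use std::net::{SocketAddr, UdpSocket};

  /// Send one serialized vote/certificate to every known node, one by one,
  /// in the given order (e.g. the next leader first, then by decreasing stake).
  fn broadcast(msg: &[u8], peers: &[SocketAddr]) -> std::io::Result<()> {
      let socket = UdpSocket::bind("0.0.0.0:0")?; // ephemeral local port
      for peer in peers {
          socket.send_to(msg, peer)?;
      }
      Ok(())
  }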

18 of 18

Thank you!