1 of 10

Binary Tower Efficiency

Ulvetanna

2 of 10

Case Study: Polygon zkEVM

  • Polygon zkEVM is a STARK-based L2 zkEVM using Goldilocks and the Poseidon hash
  • The major computations are
    • Merkle tree construction (hashing)
    • Low-degree extension (NTT)
    • Polynomial composition (ext field muls)
    • FRI batching (ext field muls)
    • FRI
    • Witness generation (???)
  • Working memory is ~512 GiB RAM
  • Many committed values are less than full field elements
    • Starky keccak circuit uses �2,531 columns, 2,328 are 1-bit

3 of 10

Polynomial Commitments

We measure the time to commit a polynomial with 228 coefficients (in seconds).

FRI and Binius are run with Reed-Solomon rate ½. FRI is batched for efficiency.

All benchmarks are on an 13th Gen Intel(R) Core(TM) i9-13900H, single-threaded.

1-bit elements

8-bit elements

32-bit elements

64-bit elements

Hyrax BN254 (lasso)

85.02

118.6

286.9

605.5

FRI GL64, Poseidon (plonky2)

98.718

98.718

98.817

98.817

FRI GL64, Keccak (plonky2)

29.36

29.36

29.36

29.36

Binius

0.5860

5.173

29.36

58.62

4 of 10

CPU Benchmarks: Hashing Throughput

Hash

Throughput (MiB / s)

SHA-256

1,722

Poseidon Goldilocks 64b (plonky2)

53.14

Poseidon2 BabyBear 31b (risc0)

32.57

Grøstl (Binary Tower 8b) (binius)

433.6

5 of 10

CPU Benchmarks: NTT

NTT times are normalized a 64-bit field size (ie. normalized for the bytes of data encoded).

216

220

224

Goldilocks (plonky2)

1.082

18.88

366.7

Baby Bear (risc0)

3.476

68.19

1,297

Binary Tower (binius)

2.453

39.01

747.9

6 of 10

CPU Benchmarks: Multiplications

We measure the time to multiply cryptographically large extension field elements, at least 128 bits.

This is the computational bottleneck of quotient polynomial computation and multivariate sumcheck.

7 of 10

CPU Benchmarks: Multiplications

* For binius we show the timing with an pre-release optimization, called mixed monomial-tower basis.

Field

Time / mul (ns)

Goldilocks^2 (plonky2)

3.5231

BabyBear^4 (risc0)

55.194

Binary Tower 128b (binius*)

31.308* / 71.012

Binary POLYVAL (binius / Rust-Crypto)

4.8002

BN254 Scalar Field (ark-bn254)

58.746

8 of 10

FPGA Multiplication Operations

Area (Logic Gates)

Delay (ns)

Clock (MHz)

Area x Delay

Mersenne 31b

12,306

6.863

146

84,460

Binary Tower 32b

4,894

3.289

304

16,096

Binary Tower 16b

1,582

1.993

502

3,152

Our experiments measure logic utilization (silicon area) on Xilinx Ultrascale+ device family, which is implemented in 16nm TSMC process node, and associated propagation delays, which are the key indicators for achieving maximum operating frequency.

A key fact is that binary multiplication is more efficient and has lower propagation delay because addition and multiplication are carryless.

9 of 10

FPGA NTT Operations

  • Gate count is the estimated gate equivalent of FPGA resources
  • NTT / s is measured at a data input of 512 bits / cycle at 250 MHz
  • We define HW efficiency os the ratio of NTT / s to Gate Count

Gate Count (M)

NTT/s

Efficiency

Goldilocks 64b

12.16

61,035

5,019

BabyBear 31b

13.66

122,070

8,936

Mersenne^2 64b

10.85

61,035

5,625

Binary Tower 16b

1.65

244,141

147,964

Binary Tower 32b

3.54

122,070

34,483

10 of 10

FPGA Hash Operations

We measure the rate of the sponge function for Poseidon, or the size of the compression inputs for Grøstl.

We define Efficiency as the ratio of the rate with the gate count.

Gate Count (M)

Rate (bits)

Efficiency

Poseidon

66.14

512

7.741

Poseidon2

42.28

512

12.11

Grøstl

1.76

256

145.5