2 of 10

Case Study: Polygon zkEVM

Polygon zkEVM is a STARK-based L2 zkEVM using Goldilocks and the Poseidon hash
The major computations are

Merkle tree construction (hashing)
Low-degree extension (NTT)
Polynomial composition (ext field muls)
FRI batching (ext field muls)
FRI
Witness generation (???)

Working memory is ~512 GiB RAM
Many committed values are less than full field elements

Starky keccak circuit uses �2,531 columns, 2,328 are 1-bit

3 of 10

Polynomial Commitments

We measure the time to commit a polynomial with 2²⁸ coefficients (in seconds).

FRI and Binius are run with Reed-Solomon rate ½. FRI is batched for efficiency.

All benchmarks are on an 13th Gen Intel(R) Core(TM) i9-13900H, single-threaded.

	1-bit elements	8-bit elements	32-bit elements	64-bit elements
Hyrax BN254 (lasso)	85.02	118.6	286.9	605.5
FRI GL64, Poseidon (plonky2)	98.718	98.718	98.817	98.817
FRI GL64, Keccak (plonky2)	29.36	29.36	29.36	29.36
Binius	0.5860	5.173	29.36	58.62

4 of 10

CPU Benchmarks: Hashing Throughput

Hash	Throughput (MiB / s)
SHA-256	1,722
Poseidon Goldilocks 64b (plonky2)	53.14
Poseidon2 BabyBear 31b (risc0)	32.57
Grøstl (Binary Tower 8b) (binius)	433.6

5 of 10

CPU Benchmarks: NTT

NTT times are normalized a 64-bit field size (ie. normalized for the bytes of data encoded).

	2¹⁶	2²⁰	2²⁴
Goldilocks (plonky2)	1.082	18.88	366.7
Baby Bear (risc0)	3.476	68.19	1,297
Binary Tower (binius)	2.453	39.01	747.9

6 of 10

CPU Benchmarks: Multiplications

We measure the time to multiply cryptographically large extension field elements, at least 128 bits.

This is the computational bottleneck of quotient polynomial computation and multivariate sumcheck.

7 of 10

CPU Benchmarks: Multiplications

* For binius we show the timing with an pre-release optimization, called mixed monomial-tower basis.

Field	Time / mul (ns)
Goldilocks^2 (plonky2)	3.5231
BabyBear^4 (risc0)	55.194
Binary Tower 128b (binius*)	31.308* / 71.012
Binary POLYVAL (binius / Rust-Crypto)	4.8002
BN254 Scalar Field (ark-bn254)	58.746

8 of 10

FPGA Multiplication Operations

	Area (Logic Gates)	Delay (ns)	Clock (MHz)	Area x Delay
Mersenne 31b	12,306	6.863	146	84,460
Binary Tower 32b	4,894	3.289	304	16,096
Binary Tower 16b	1,582	1.993	502	3,152

Our experiments measure logic utilization (silicon area) on Xilinx Ultrascale+ device family, which is implemented in 16nm TSMC process node, and associated propagation delays, which are the key indicators for achieving maximum operating frequency.

A key fact is that binary multiplication is more efficient and has lower propagation delay because addition and multiplication are carryless.

9 of 10

FPGA NTT Operations

Gate count is the estimated gate equivalent of FPGA resources
NTT / s is measured at a data input of 512 bits / cycle at 250 MHz
We define HW efficiency os the ratio of NTT / s to Gate Count

	Gate Count (M)	NTT/s	Efficiency
Goldilocks 64b	12.16	61,035	5,019
BabyBear 31b	13.66	122,070	8,936
Mersenne^2 64b	10.85	61,035	5,625
Binary Tower 16b	1.65	244,141	147,964
Binary Tower 32b	3.54	122,070	34,483

10 of 10

FPGA Hash Operations

We measure the rate of the sponge function for Poseidon, or the size of the compression inputs for Grøstl.

We define Efficiency as the ratio of the rate with the gate count.

	Gate Count (M)	Rate (bits)	Efficiency
Poseidon	66.14	512	7.741
Poseidon2	42.28	512	12.11
Grøstl	1.76	256	145.5

1 of 10