1 of 34

a Hardware-Optimized SNARK

Ulvetanna | Jim Posen, CTO

2 of 34

Hardware-Software Co-design Examples

NAND Flash SSDs

Flash memory has a limited number of write-erase cycles, forcing storage firmware and operating systems to implement wear-leveling

TPUs for Machine Learning quantize floating point values to 8-bits because of the improved efficiency of 8-bit integer multiplication

Source: https://towardsdatascience.com/

Source: https://www.kingston.com/

3 of 34

Profiling Keccak-256

  • Keccak is the #1 bottleneck for ZK protocols scaling Ethereum
  • Scroll’s halo2 uses KZG elliptic curve commitment scheme (256-bit prime)
  • Plonky3 uses FRI commitment scheme (32-bit prime)
  • Benchmarks are on a GCP c3-highcpu-22 (22 vCPU, 11 core)

# Keccak-f

Proof Size (KiB)

Prove time (s)

Verify time (ms)

Scroll

3493 (464 KiB)

39.5

261

49.4

Plonky3

(Keccak)

4096 (544 KiB)

1,790

16.2

262

Plonky3 (Poseidon2)

4096 (544 KiB)

1,790

20.0

351

4 of 34

Profiling Keccak-256

Four key steps account for majority of computation in both systems

  1. Generating the witness
  2. Committing the witness
    1. Reed–Solomon encoding (NTT) + merkleization
    2. Elliptic curve multiscalar multiplication (MSM)
  3. Vanishing argument (lots of arithmetic circuit evaluations)
  4. Opening proof (crypto magic)

5 of 34

Profiling Keccak-256

Plonky3 w/ Poseidon2

Plonky3 / Keccak

6 of 34

Why are towers of binary fields the right foundation for fast verifiable computing?

7 of 34

Why are towers of binary fields the right foundation for fast verifiable computing?

  1. Efficient Arithmetic
  2. Efficient Arithmetization

8 of 34

  1. Efficient Arithmetic

9 of 34

Finite Field Arithmetic

Prime Fields 𝔽p

Representation is not canonical

Addition is integer addition with a conditional subtraction

Multiplication is integer multiplication with a reduction method

  • Barrett reduction
  • Montgomery reduction
  • Special reduction (Mersenne-31, Goldilocks-64)

Binary Fields 𝔽2k

Representation is canonical

Addition is bitwise XOR

Multiplication is carryless multiplication with a reduction method

  • Special (AES)
  • Montgomery reduction (POLYVAL)
  • Recursive reduction (Tower)

10 of 34

Prime vs Binary Field Multiplication

Carryless multiplication reduces propagation delay and enables higher clock speed

Source: “Exploring the Design Space of Prime Field vs. Binary Field ECC-Hardware Implementations”

11 of 34

Prime vs Binary Field Squaring

Squaring is much cheaper in binary fields because (X + Y)2 = X2 + Y2

Source: “Exploring the Design Space of Prime Field vs. Binary Field ECC-Hardware Implementations”

12 of 34

Binary Fields in Cryptography

Advanced Encryption Standard (AES) is based on 𝔽28

Galois Message Authentication Code (GMAC) is based on 𝔽2128

Binary field elliptic curves are well-studied and very efficient in hardware

QR codes use Reed–Solomon encoding over 𝔽28

Original FRI and zk-STARK protocols

Grøstl SHA-3 finalist hash function is based on 𝔽28

13 of 34

Binary Tower Fields

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

𝔽2128

14 of 34

Binary Tower Fields

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

𝔽2128

𝔽232 𝔽232 𝔽232 𝔽232

15 of 34

Binary Tower Fields

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

𝔽2128

𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28

𝔽232 𝔽232 𝔽232 𝔽232

16 of 34

Binary Tower Fields

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

0 0 1 1 1 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 0 0 1 0 ………… 0 1 0 1 0 0 0 0 1 1 0 0 1 0 1 0 0 0

128 x 𝔽2

𝔽2128

𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28 𝔽28

𝔽232 𝔽232 𝔽232 𝔽232

17 of 34

Binary Tower Fields

Multiplication

Subfield Mult

(m-bit subfield)

Squaring

Inversion

O(n1.58)

O(n m0.58)

O(n log n)

O(n1.58)

18 of 34

CPU Benchmarks: Multiplications

We measure the time to multiply cryptographically large extension field elements, at least 128 bits. Benchmarks run on GCP C3 instance, Intel Sapphire Rapids.

Field

Throughput (106 mul / s)

Goldilocks^2 (plonky2)

216.07

BabyBear^4 (plonky3)

52.87

Binary Tower 128b (binius)

69.87

Binary POLYVAL (binius / Rust-Crypto)

613.2

BN254 Scalar Field (ark-bn254)

47.36

19 of 34

FPGA Estimates: Multiplication

Our experiments measure logic utilization (silicon area) on Xilinx Ultrascale+ device family, which is implemented in 16nm TSMC process node, and associated propagation delays, which are the key indicators for achieving maximum operating frequency.

Area (Logic Gates)

Delay (ns)

Clock (MHz)

Area x Delay

Mersenne 31b

12,306

6.863

146

84,460

BN254 Scalar

2,538,066

3.753

266 (*49 cycles)

9,525,000

Binary Tower 16b

1,582

1.993

502

3,152

Binary Tower 32b

4,894

3.289

304

16,100

Binary Tower 128b

58,096

3.400

294

197,500

20 of 34

Vision Hash

Vision is an arithmetization-optimized hash function over binary fields

S-box uses both squaring and inversion

“Design of Symmetric-Key Primitives for Advanced Cryptographic Protocols”

by Aly, Ashur, Ben-Sasson, Dhooghe, and Szepieniec

21 of 34

Vision Mark-32

Vision Mark-32 is an optimized instance of Vision using the 32-bit tower field

  • S-box optimized with
    • fast recursive inversion in the Wiedemann tower field
    • arithmetic verification uses fast squaring
    • linearized polynomial evaluation as an 𝔽2-affine transformation
  • MDS matrix optimized with
    • 8-bit tower field coefficients and subfield multiplication
    • novel polynomial basis additive NTT techniques

Joint effort with 3MI Labs (Tomer Ashur & Mairon Mahzoun)

22 of 34

Efficient Arithmetization

23 of 34

Plonky3 Keccak Arithmetization

  • Plonky3 uses AIR arithmetization
  • Each Keccak-f permutation is 24 rows
  • One row has 2,328 1-bit columns and 404 16-bit columns
  • Trace has 1,048 non-linear constraints

24 of 34

Plonky3 Keccak Arithmetization

  • Plonky3 uses AIR arithmetization
  • Each Keccak-f permutation is 24 rows
  • One row has 2,328 1-bit columns and 404 16-bit columns
  • Trace has 1,048 non-linear constraints

In order to commit 1-bit or 16-bit columns, STARKs must pad them to full field elements. This is up to a 32x loss in efficiency!

25 of 34

Embedding Overhead

Can we commit small values faster than large values?

  • Using elliptic curve commitment schemes: yes, with caveats
  • With FRI commitments: not below 32 bits

26 of 34

Embedding Overhead

Can we commit small values faster than large values?

  • Using elliptic curve commitment schemes: yes, with caveats
  • With FRI commitments: not below 32 bits
  • With Binius: yes, using block-level encoding!

The Binius polynomial commitment scheme maximizes the information density of the committed data.

27 of 34

Binius: SNARKs over Towers of Binary Fields

Binius combines

  • Arithmetization over towers of binary fields
  • Multilinear poly IOP based on HyperPlonk
  • Generalized Brakedown + FRI commitment that eliminates embedding overhead

28 of 34

Binius: SNARKs over Towers of Binary Fields

Binius combines

  • Arithmetization over towers of binary fields
  • Multilinear poly IOP based on HyperPlonk
  • Generalized Brakedown + FRI commitment that eliminates embedding overhead

…also,

  • A new multilinear shift argument
  • Adaptation of Lasso lookup argument

29 of 34

Eliminating Embedding Overhead

  • Instead of embedding single coefficients into an extension field, we embed chunks of coefficients into the tower algebra.
  • Reed–Solomon encode with the tower algebra as the alphabet.
  • Generalize Brakedown and BaseFold’s multilinear FRI PCS to operate over the tower algebra.

30 of 34

Binius Keccak Arithmetization

  • Each Keccak-f permutation is 24 rows over 𝔽264 / 1,536 rows over 𝔽2
  • One row has 60 committed columns and 80 virtual columns
  • Trace has 50 non-linear constraints and 10 linear constraints

31 of 34

Binius Keccak Arithmetization

In[25]

C[5]

C’[5]

D[5]

B[25]

Out[25]

64b

64b

32 of 34

Profiling Keccak-256

Benchmarks run on GCP c3-highcpu-22 (22 vCPU, 11 core)

# Keccak-f

Proof Size (KiB)

Prove time (s)

Verify time (ms)

halo2

3493 (464 KiB)

39.5

261

49.41

Plonky3

(Keccak)

4096 (544 KiB)

1,790

16.2

262

Plonky3 (Poseidon2)

4096 (544 KiB)

1,790

20.0

351

Binius

4096 (544 KiB)

4,737

8.10

73.5

33 of 34

Profiling: Keccak-f[1600]

Binius w/ Grøstl

Plonky3 w/ Keccak

34 of 34

Stay tuned!