High performance regular expressions using RE2 and WebAssembly, no cgo required
Anuraag (ラグ) Agrawal
Self-introduction
Coraza
WAF
for rule := range rules {
if rule.Evaluate(tx) {
rule.ExecuteActions(tx)
}
}
Normal regex
regexp.MustCompile(`^[0-9]{3}-?[0-9]{4}$`)
WAF regex…
regexp.MustCompile(`_(?:\$\$ND_FUNC\$\$_|_js_function)|(?:\beval|new[\s\v]+Function[\s\v]*)\(|String\.fromCharCode|function\(\)\{|this\.constructor|module\.exports=|\([\s\v]*[^0-9A-Z_a-z]child_process[^0-9A-Z_a-z][\s\v]*\)|process(?:\.(?:(?:a(?:ccess|ppendfile|rgv|vailability)|c(?:aveats|h(?:mod|own)|(?:los|opyfil)e|p|reate(?:read|write)stream)|ex(?:ec(?:file)?|ists)|f(?:ch(?:mod|own)|data(?:sync)?|s(?:tat|ync)|utimes)|inodes|l(?:chmod|ink|stat|utimes)|mkd(?:ir|temp)|open(?:dir)?|r(?:e(?:ad(?:dir|file|link|v)?|name)|m)|s(?:pawn(?:file)?|tat|ymlink)|truncate|u(?:n(?:link|watchfile)|times)|w(?:atchfile|rite(?:file|v)?))(?:sync)?(?:\.call)?\(|binding|constructor|env|global|main(?:Module)?|process|require)|\[[\"'`](?:(?:a(?:ccess|ppendfile|rgv|vailability)|c(?:aveats|h(?:mod|own)|(?:los|opyfil)e|p|reate(?:read|write)stream)|ex(?:ec(?:file)?|ists)|f(?:ch(?:mod|own)|data(?:sync)?|s(?:tat|ync)|utimes)|inodes|l(?:chmod|ink|stat|utimes)|mkd(?:ir|temp)|open(?:dir)?|r(?:e(?:ad(?:dir|file|link|v)?|name)|m)|s(?:pawn(?:file)?|tat|ymlink)|truncate|u(?:n(?:link|watchfile)|times)|w(?:atchfile|rite(?:file|v)?))(?:sync)?|binding|constructor|env|global|main(?:Module)?|process|require)[\"'`]\])|(?:binding|constructor|env|global|main(?:Module)?|process|require)\[|console(?:\.(?:debug|error|info|trace|warn)(?:\.call)?\(|\[[\"'`](?:debug|error|info|trace|warn)[\"'`]\])|require(?:\.(?:resolve(?:\.call)?\(|main|extensions|cache)|\[[\"'`](?:(?:resolv|cach)e|main|extensions)[\"'`]\])`)
WAF
for rule := range rules {
if rule.MatchVeryHugeRegexp(tx) {
rule.ExecuteActions(tx)
}
}
Go regexp is really slow
https://github.com/golang/go/issues/26623
https://github.com/golang/go/issues/11646
Lacks DFA implementation of other regexp libraries
Unicode implementation makes it hard to optimize in a compatible way
C++ re2 is fast
But cgo is not Go
Notably, blocks library from being used in popular projects like k8s, otel
go-re2
WebAssembly
Compile existing binary for loading in web (not as fast as native but maybe enough)
Allow extending apps in a safe way (or only way, e.g. Go)
wazero
Can we compile re2 to Wasm?
Yes!
LLVM is a C++ compiler that can target WebAssembly
wasi-sdk provides libc, libc++ for –target=wasm32-wasi
Compiling re2 to Wasm
git clone https://github.com/google/re2 --branch 2023-03-01
docker run -it --rm -v `pwd`:/workspace --workdir /workspace ghcr.io/webassembly/wasi-sdk:wasi-sdk-20 sh -c 'CXXFLAGS="-DRE2_NO_THREADS $CXXFLAGS" make obj/libre2.a'
Compiling re2 to callable Wasm
RUN mkdir -p /re2 && curl -L https://github.com/google/re2/archive/954656f47fe8fb505d4818da1e128417a79ea500.tar.gz | tar -xz --strip-components 1 -C /re2
WORKDIR /re2
ENV CXXFLAGS -fno-exceptions -O3 ${CXXFLAGS}
ENV RE2_CXXFLAGS -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -I. -DRE2_NO_THREADS
RUN make obj/libre2.a
WORKDIR /cre2
ADD internal/cre2/cre2.cpp /cre2
ADD internal/cre2/cre2.h /cre2
# Just one source file so not worth running make
RUN $CXX -c cre2.cpp -o cre2.o -I. -I/re2 $CXXFLAGS && \
$AR cru libcre2.a cre2.o && \
$RANLIB libcre2.a
# Separate step so exports can be updated without recompiling.
# Number of layers isn't really a concern for this image.
# global-base=1024 same as emcc and allows further wasm-opt optimizations
RUN $CXX -o libcre2-noopt.so -Wl,--global-base=1024 -mexec-model=reactor --rtlib=compiler-rt --target=wasm32-wasi -shared \
-nostdlib /wasi-sysroot/lib/wasm32-wasi/crt1-reactor.o \
/usr/lib/llvm-$LLVM_VERSION/lib/clang/$LLVM_VERSION/lib/wasi/libclang_rt.builtins-wasm32.a \
/re2/obj/libre2.a \
/cre2/libcre2.a \
-L/wasi-sysroot/lib/wasm32-wasi -lc++ -lc++abi -lc \
--sysroot=/wasi-sysroot -Wl,--demangle -Wl,--allow-undefined \
-Wl,--export=malloc \
-Wl,--export=free \
-Wl,--export=cre2_new \
-Wl,--export=cre2_delete \
-Wl,--export=cre2_opt_new \
-Wl,--export=cre2_opt_delete
…
Compiling regex with wazero
//go:embed wasm/libcre2.so
var libre2 []byte
var (
wasmRT wazero.Runtime
wasmCompiled wazero.CompiledModule
)
func init() {
ctx := context.Background()
rt := wazero.NewRuntime(ctx)
wasi_snapshot_preview1.MustInstantiate(ctx, rt)
code, err := rt.CompileModule(ctx, libre2)
if err != nil {
panic(err)
}
wasmCompiled = code
wasmRT = rt
}
Compiling regex with wazero
func newABI() *libre2ABI {
ctx := context.Background()
mod, err := wasmRT.InstantiateModule(ctx, wasmCompiled, wazero.NewModuleConfig().WithName(""))
if err != nil {
panic(err)
}
callStack := make([]uint64, 8) // Needs to be sized to the method with most parameters, which is cre2_match
abi := &libre2ABI{
cre2New: newLazyFunction(mod, "cre2_new", callStack),
cre2Delete: newLazyFunction(mod, "cre2_delete", callStack),
cre2Match: newLazyFunction(mod, "cre2_match", callStack),
cre2NumCapturingGroups: newLazyFunction(mod, "cre2_num_capturing_groups", callStack),
cre2ErrorCode: newLazyFunction(mod, "cre2_error_code", callStack),
cre2ErrorArg: newLazyFunction(mod, "cre2_error_arg", callStack),
cre2NamedGroupsIterNew: newLazyFunction(mod, "cre2_named_groups_iter_new", callStack),
cre2NamedGroupsIterNext: newLazyFunction(mod, "cre2_named_groups_iter_next", callStack),
cre2NamedGroupsIterDelete: newLazyFunction(mod, "cre2_named_groups_iter_delete", callStack),
cre2GlobalReplace: newLazyFunction(mod, "cre2_global_replace_re", callStack),
cre2OptNew: newLazyFunction(mod, "cre2_opt_new", callStack),
cre2OptDelete: newLazyFunction(mod, "cre2_opt_delete", callStack),
cre2OptSetLongestMatch: newLazyFunction(mod, "cre2_opt_set_longest_match", callStack),
cre2OptSetPosixSyntax: newLazyFunction(mod, "cre2_opt_set_posix_syntax", callStack),
cre2OptSetCaseSensitive: newLazyFunction(mod, "cre2_opt_set_case_sensitive", callStack),
cre2OptSetLatin1Encoding: newLazyFunction(mod, "cre2_opt_set_latin1_encoding", callStack),
malloc: mod.ExportedFunction("malloc"),
free: mod.ExportedFunction("free"),
wasmMemory: mod.Memory(),
mod: mod,
callStack: callStack,
}
return abi
}
Compiling regex with wazero
func newRE(abi *libre2ABI, pattern cString, opts CompileOptions) uintptr {
ctx := context.Background()
optPtr := uintptr(0)
if opts != (CompileOptions{}) {
res, err := abi.cre2OptNew.Call0(ctx)
if err != nil {
panic(err)
}
optPtr = uintptr(res)
defer func() {
if _, err := abi.cre2OptDelete.Call1(ctx, uint64(optPtr)); err != nil {
panic(err)
}
}()
if opts.Longest {
_, err = abi.cre2OptSetLongestMatch.Call2(ctx, uint64(optPtr), 1)
if err != nil {
panic(err)
}
}
}
res, err := abi.cre2New.Call3(ctx, uint64(pattern.ptr), uint64(pattern.length), uint64(optPtr))
if err != nil {
panic(err)
}
return uintptr(res)
}
Compiling regex with wazero
https://github.com/wasilibs/go-re2/blob/main/internal/re2_wazero.go#L24
When starting out, look at wazero’s examples!
type cString struct {
ptr uintptr
length int
}
func newCString(abi *libre2ABI, s string) cString {
ptr := abi.memory.writeString(abi, s)
return cString{
ptr: ptr,
length: len(s),
}
}
Real WAF logic
name \ time/op build/wafbench_stdlib.txt build/wafbench.txt build/wafbench_cgo.txt
WAF/POST/1-2 3.359m ± ∞ ¹ 3.550m ± ∞ ¹ +5.70% (p=0.008 n=5) 3.563m ± ∞ ¹ +6.09% (p=0.008 n=5)
WAF/POST/1000-2 20.532m ± ∞ ¹ 6.194m ± ∞ ¹ -69.83% (p=0.008 n=5) 5.211m ± ∞ ¹ -74.62% (p=0.008 n=5)
WAF/POST/10000-2 187.29m ± ∞ ¹ 25.94m ± ∞ ¹ -86.15% (p=0.008 n=5) 17.69m ± ∞ ¹ -90.56% (p=0.008 n=5)
WAF/POST/100000-2 1852.4m ± ∞ ¹ 220.2m ± ∞ ¹ -88.11% (p=0.008 n=5) 143.8m ± ∞ ¹ -92.23% (p=0.008 n=5)
Lack of threads support
RE2 Memory
“unexpected )”
RE2 Memory
“unexpected )”
RE2 Memory
“unexpected )”
RE2 Memory
“unexpected )”
MustCompile(“.*”)
MustCompile(“.*”)
MustCompile(“.*”)
MustCompile(“.*”)
Tradeoffs
WebAssembly threads proposal at phase 3
In the meantime, go-re2 supports cgo too with build tag
Challenges
docker run -it --rm -v `pwd`:/workspace --workdir /workspace ghcr.io/webassembly/wasi-sdk:wasi-sdk-20 sh -c 'CXXFLAGS="-DRE2_NO_THREADS $CXXFLAGS" make'
wasm-ld-16: error: unknown argument: -soname
-pthread.h not found
etc,etc
Before wrapping C files with cgo, consider compiling to Wasm and using wazero
import regexp “github.com/wasilibs/go-re2”