Incremental processing with Watchman
Don't daemonize that build!
About
Talk
Part 1:
Warranting Watchman
Build tasks outside the build system
File syncing: Many companies have built solutions to sync local source code to a remote machine. Example: Stripe's monorepo developer environment (Elhage 2024).
Dynamic dependencies
IDE services: Latency-sensitive; changes to source code must be processed on the order of <100 ms. Example: the Language Server Protocol exposes a facility to subscribe to workspace file changes.
Example repo: nixpkgs
Nixpkgs: repository traversal timing
Remote development sync
Problem: Need to efficiently sync files to remote machine during development
Solution: Can use rsync to sync local source code to remote
Nixpkgs: no-op rsync time
# Sync the entire repository to second directory on local machine.
# rsync will traverse and hash the entire directory contents.
# A real use-case would involve network time.
$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
Time (mean ± σ): 1.438 s ± 0.075 s [User: 0.476 s, System: 1.448 s]
Range (min … max): 1.322 s … 1.521 s 10 runs
rsync woes
Problem: rsync always walks + hashes all files
⚠️ Solution: Maintain persistent index/cache, like git
❌ Solution: Run background process ("daemon"), subscribe to inotify (etc.)
What is daemonization?
Why daemonize?
Why not daemonize?
Don't daemonize that build!
Watchman
Warming up with Watchman
Start watching
$ watchman watch-project .
{
"version": "20240926.093054.0",
"watcher": "fsevents",
"watch": "/Users/waleed/Workspace/nixpkgs"
}
Data
$ watchman --json <<<'["query", ".", {}]' | jq '.files[0]'
{
"size": 480,
"new": true,
"exists": true,
"mode": 16877,
"name": ".git"
}
Metadata
$ watchman --json <<<'["query", ".", {"expression": "false"}]'
{
"version": "20240926.093054.0",
"files": [],
"clock": "c:1727393614:39384:2:7984",
"is_fresh_instance": true,
"debug": {
"cookie_files": [
"/Users/waleed/Workspace/nixpkgs/.git/.watchman-cookie-waleedkhan.local-39384-342"
]
}
}
Clock IDs
# get clock ID
$ watchman --json <<<'["query", ".", {}]' | jq '.clock'
"c:1727486318:13768:1:268"
# no-op
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
| jq '{ clock, files: .files[:1] }'
{
"clock": "c:1727486318:13768:1:271",
"files": []
}
Modified files
# create (or update) a file
$ touch foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
| jq '{ clock, files: .files[:1] }'
{
"clock": "c:1727486318:13768:1:284",
"files": [
{
"size": 0,
"new": true,
"exists": true,
"mode": 33188,
"name": "foo"
}
]
}
Deleted files
# delete a file
$ rm foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:284"}]' \
| jq '{ clock, files: .files[:1] }'
{
"clock": "c:1727486318:13768:1:292",
"files": [
{
"size": 0,
"new": false,
"exists": false,
"mode": 33188,
"name": "foo"
}
]
}
Complex queries
$ watchman --json <<<'["query", ".", {
"fields": ["name", "size", "mode", "content.sha1hex"],
"expression": [
"allof",
["type", "f"],
["not", ["dirname", ".git"]]
]
}]' | jq '.files[0]'
{
"content.sha1hex": "f4466aeb9bf2306565967c66a8f070821969755c",
"mode": 33188,
"size": 110,
"name": "pkgs/development/tools/build-managers/gradle/tests/java-application/src/main/java/Main.java"
}
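The same query can also be issued programmatically. A minimal sketch, assuming the official pywatchman Python client is installed and that the current directory is the project root:

# Issue the content-hash query above via pywatchman (pip install pywatchman).
import os
import pywatchman

client = pywatchman.client()
root = client.query("watch-project", os.getcwd())["watch"]
result = client.query("query", root, {
    "fields": ["name", "size", "mode", "content.sha1hex"],
    "expression": [
        "allof",
        ["type", "f"],
        ["not", ["dirname", ".git"]],
    ],
})
print(result["files"][0])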
Working with Watchman
Incremental rsync with Watchman
#!/bin/bash
set -euo pipefail

# Setup: ensure Watchman is watching the repository.
watchman >/dev/null watch-project .

# Load clock: fall back to the null clock on the first run.
WATCHMAN_CLOCK=$(cat .watchman-clock 2>/dev/null || echo 'c:0:0')

# Query Watchman for files changed since the saved clock.
WATCHMAN_QUERY=$(jq -n --arg clock "$WATCHMAN_CLOCK" '["query", ".", {"fields": ["name"], "since": $clock}]')
WATCHMAN_RESULT=$(watchman <<<"$WATCHMAN_QUERY" --json)

if jq <<<"$WATCHMAN_RESULT" >/dev/null -e '.is_fresh_instance'; then
    # Full sync: the incremental file list cannot be trusted.
    echo 'Full sync'
    RSYNC_ARGS=(--delete-after)
else
    # Incremental sync: only the files Watchman reported.
    jq <<<"$WATCHMAN_RESULT" >.watchman-files -r '.files[]'
    printf 'Incrementally syncing %d files\n' "$(wc <.watchman-files -l)"
    RSYNC_ARGS=(--files-from='.watchman-files' --delete-missing-args)
fi

# Execute the sync.
rsync --archive --compress "${RSYNC_ARGS[@]}" ./ ../nixpkgs-synced

# Commit clock: only after the sync succeeded.
jq <<<"$WATCHMAN_RESULT" >.watchman-clock -r '.clock'
Setup
Load clock
Query Watchman
Full sync
Incremental sync
Execute
Commit clock
Incremental rsync with Watchman
$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced' './sync.sh'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
Time (mean ± σ): 1.443 s ± 0.046 s [User: 0.435 s, System: 1.458 s]
Range (min … max): 1.393 s … 1.522 s 10 runs
Benchmark 2: ./sync.sh
Time (mean ± σ): 317.2 ms ± 17.1 ms [User: 209.8 ms, System: 35.9 ms]
Range (min … max): 298.5 ms … 350.8 ms 10 runs
Summary
'./sync.sh' ran
4.55 ± 0.29 times faster than 'rsync --archive --compress ./ ../nixpkgs-synced'
Why Watchman?
Watchman advantages — reliability
[Stripe] also invested heavily in reliability and self-healing on errors and network problems. One “internal” but significant improvement involved migrating to watchman for file-watching: In our testing, it was by far the most robust file-watcher we found, vastly reducing the frequency of missed updates or a “stuck” file-watcher.
— Nelson Elhage, Stripe's monorepo developer environment (2024)
Watchman advantages — architectural
Wondering about Watchman?
Watchman modes
Main modes of operation:
Watchman data structure
Part 2:
Watchman Wisdom
Warnings about Watchman
[1/3] Handling is_fresh_instance
[2/3] Changing the query
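One way to handle a changed query, sketched with hypothetical helpers: a saved clock is only valid for the query that produced it, so store a fingerprint of the query alongside the clock and fall back to a full (non-since) query when it no longer matches.

# A saved clock is only meaningful for the query it was produced by.
# load_state(), save_state(), and run_query() are hypothetical helpers.
import hashlib
import json

QUERY = {"fields": ["name"], "expression": ["type", "f"]}

def query_fingerprint(query: dict) -> str:
    return hashlib.sha1(json.dumps(query, sort_keys=True).encode()).hexdigest()

state = load_state()  # e.g. {"clock": ..., "fingerprint": ...}, or None on first run
if state is not None and state["fingerprint"] == query_fingerprint(QUERY):
    result = run_query({**QUERY, "since": state["clock"]})   # incremental
else:
    result = run_query(QUERY)   # query changed (or first run): full results

save_state({"clock": result["clock"], "fingerprint": query_fingerprint(QUERY)})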
[3/3] Eventual consistency
Wielding Watchman well
[1/5] Unifying full + incremental processing
from pathlib import Path

watchman_result = query_watchman()
if watchman_result["is_fresh_instance"]:
    # Fresh instance: no reliable delta, so plan to visit everything.
    paths_to_visit = None
else:
    # Incremental: visit only changed paths and their ancestor directories.
    paths_to_visit = set()
    for entry in watchman_result["files"]:
        path = Path(entry["name"])
        paths_to_visit.add(path)
        paths_to_visit.update(path.parents)

def should_visit(path: Path) -> bool:
    return paths_to_visit is None or path in paths_to_visit
[1/5] Unifying full + incremental processing
def visit(path: Path) -> None:
    if not should_visit(path):
        return
    if path.is_file():
        print(f"Visiting file: {path}")
        return  # a file has no children to recurse into
    for child_path in path.iterdir():
        visit(child_path)

visit(Path("."))
[2/5] Persisting state
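A minimal sketch of persisting the Watchman clock between runs, mirroring what the shell script does with .watchman-clock; the state file name and the query_watchman/process helpers are assumptions:

# Persist the Watchman clock across runs so the next invocation is incremental.
# The .watchman-state file name and query_watchman()/process() are hypothetical.
import json
from pathlib import Path

STATE_FILE = Path(".watchman-state")

def load_clock() -> str | None:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["clock"]
    return None  # first run: no saved clock, expect a fresh instance / full scan

def save_clock(clock: str) -> None:
    STATE_FILE.write_text(json.dumps({"clock": clock}))

watchman_result = query_watchman(since=load_clock())
process(watchman_result)                  # full or incremental, as in [1/5]
save_clock(watchman_result["clock"])      # commit the clock only after success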
[3/5] Invalidating derived data
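One way to invalidate derived data, sketched with hypothetical helpers: keep a per-file cache (e.g., parse results) keyed by path and drop the entry for every path Watchman reports as changed or deleted before reusing the cache.

# Invalidate cached per-file derived data for paths Watchman reports as changed.
# load_cache(), compute_derived(), and paths_needing_data() are hypothetical.
derived_cache: dict[str, object] = load_cache()   # path -> derived data

for entry in watchman_result["files"]:
    derived_cache.pop(entry["name"], None)         # changed or deleted: recompute later

for path in paths_needing_data():
    if path not in derived_cache:
        derived_cache[path] = compute_derived(path)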
[4/5] Caching persistent state
[5/5] Distributing persistent state
[Diagram: commit graph with commits A, B, C, D, E and labels main, warm, and HEAD; data sources: source control (changed files), artifact / service (precomputed data), Watchman (changed files)]
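One way the pieces in the diagram might compose, sketched with hypothetical helpers: seed local state from an artifact precomputed at a "warm" commit, invalidate files changed between that commit and HEAD (from source control), then invalidate uncommitted changes reported by Watchman.

# Seed derived data from a precomputed artifact, then layer on changes from
# source control (warm commit..HEAD) and from Watchman (working copy).
# fetch_artifact() and query_watchman() are hypothetical helpers.
import subprocess

warm_commit, derived_cache = fetch_artifact()   # precomputed data for a warm commit

# Files changed between the warm commit and HEAD, according to source control.
committed_changes = subprocess.run(
    ["git", "diff", "--name-only", warm_commit, "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

# Files changed in the working copy, according to Watchman.
local_changes = [f["name"] for f in query_watchman()["files"]]

for path in set(committed_changes) | set(local_changes):
    derived_cache.pop(path, None)               # invalidate; recompute on demand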
Thanks for watching!
Takeaways:
— Waleed Khan <me@waleedkhan.name>
Deleted slides
Who is this talk for?
Crawls a large repository
Executes "build" tasks outside of the build system
Performs fine-grained incremental processing
Watchman advantages — functionality
These two lines establish a watch on a source directory and then set up a trigger named buildme that will run a tool named minify-css whenever a CSS file is changed. The tool will be passed a list of the changed filenames.
$ watchman watch ~/src
# the single quotes around '*.css' are important!
$ watchman -- trigger ~/src buildme '*.css' -- minify-css
The output for buildme will land in the Watchman log file unless you send it somewhere else.
Watchman advantages — functionality
Watchman advantages — reliability
Watchman advantages — efficiency
Watchman advantages — architectural
What is Watchman?
Watchman disadvantages
Watchman concepts
Watchman since