1 of 56

Incremental processing with Watchman

Don't daemonize that build!

2 of 56

About

  • Career in optimizing the developer feedback loop
    • Meta: Hack language + IDE services
    • Twitter: Scala tooling; source control
    • Hudson River Trading: C++/Python build system
  • Watchman is an open-source file watching service

  • Slides:

3 of 56

Talk

  • Part 1: Warranting Watchman
    • Justification
    • Problems + solutions
    • Abstractions
  • Part 2: Watchman Wisdom
    • Stand-alone pieces of advice

  • Slides:

4 of 56

Part 1:

Warranting Watchman

5 of 56

Build tasks outside the build system

File syncing

Many companies have built solutions to sync local source code to a remote machine

Example: Stripe's monorepo development environment (Elhage 2024)

Dynamic dependencies

Generating the build graph itself programmatically

Example: Gazelle can be used to generate BUILD files for Bazel

IDE services

Latency-sensitive; need to process changes to source code on the order of <100 ms

Example: Language Server Protocol exposes facility to subscribe to workspace file changes

6 of 56

Example repo: nixpkgs

  • Testing using https://github.com/NixOS/nixpkgs
    • 685k commits
    • 43k files in working copy
    • (Not that big by monorepo standards)
  • Device:
    • MacBook Pro (Retina, 15-inch, Mid 2015)
    • 2.5 GHz Quad-Core Intel Core i7
    • 16 GB 1600 MHz DDR3

7 of 56

Nixpkgs: repository traversal timing

# Benchmark with hyperfine: traversing all files using ripgrep
# (it has optimized parallel filesystem traversal code).
$ hyperfine --warmup=3 'rg --files'
Benchmark 1: rg --files
  Time (mean ± σ):     752.9 ms ±  26.4 ms    [User: 1295.8 ms, System: 3845.9 ms]
  Range (min … max):   707.2 ms … 794.4 ms    10 runs

8 of 56

Remote development sync

Problem: Need to efficiently sync files to remote machine during development

Solution: Can use rsync to sync local source code to remote

9 of 56

Nixpkgs: no-op rsync time

# Sync the entire repository to a second directory on the local machine.
# rsync will traverse and hash the entire directory contents.
# A real use-case would involve network time.
$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
  Time (mean ± σ):      1.438 s ±  0.075 s    [User: 0.476 s, System: 1.448 s]
  Range (min … max):    1.322 s …  1.521 s    10 runs

10 of 56

rsync woes

Problem: rsync always walks + hashes all files

⚠️ Solution: Maintain persistent index/cache, like git

❌ Solution: Run background process ("daemon"), subscribe to inotify (etc.)

11 of 56

What is daemonization?

  • See: https://en.wikipedia.org/wiki/Daemon_(computing)
  • Convert existing build task into background process
    • For efficiency
  • Oftentimes:
    • Build task split into front-end client and back-end server
    • Program starts daemon if not already running
    • Commands become RPCs to daemon

12 of 56

Why daemonize?

  • Keep persistent O(repo) state in memory
    • Skip loading/saving
    • Efficient queries and updates
  • Subscribe to OS filesystem notifications
    • Example: inotify (Linux)
    • Example: kqueue (BSD)
    • Example: FSEvents (macOS)
  • Reduce startup latency
    • Example: Nailgun (Java)
    • Example: chg (Mercurial/Python)

13 of 56

Why not daemonize?

  • Huge incidental complexity around service management
    • Running exactly 0–1 instances
    • Process groups, signal handling, etc.
    • Forward/backward compatibility; upgrading running process version
    • RPC is fundamentally more complex than in-process function calls
  • Observability is difficult
    • Partially a tooling problem
    • Partially a design problem, when it's easy to persist arbitrary state
  • Long-lived processes make long-lived mistakes
    • Corrupted/erroneous persistent state
    • Deadlock
    • Resource leaks
    • "Runaway" processes

14 of 56

Don't daemonize that build!

  • ...at least until you have no other choice
  • Adds substantial incidental complexity

15 of 56

Watchman

  • Try Watchman, a file watching service
  • High-level primitive for custom incremental builds
    • Higher level of abstraction than filesystem APIs
    • Uses its own daemon, so that we don't have to 🫡
  • Can buy a lot of development runway before needing to daemonize

16 of 56

Warming up with Watchman

17 of 56

Start watching

$ watchman watch-project .
{
  "version": "20240926.093054.0",
  "watcher": "fsevents",
  "watch": "/Users/waleed/Workspace/nixpkgs"
}

18 of 56

Data

$ watchman --json <<<'["query", ".", {}]' | jq '.files[0]'
{
  "size": 480,
  "new": true,
  "exists": true,
  "mode": 16877,
  "name": ".git"
}

19 of 56

Metadata

$ watchman --json <<<'["query", ".", {"expression": "false"}]'
{
  "version": "20240926.093054.0",
  "files": [],
  "clock": "c:1727393614:39384:2:7984",
  "is_fresh_instance": true,
  "debug": {
    "cookie_files": [
      "/Users/waleed/Workspace/nixpkgs/.git/.watchman-cookie-waleedkhan.local-39384-342"
    ]
  }
}

20 of 56

Clock IDs

# get clock ID
$ watchman --json <<<'["query", ".", {}]' | jq '.clock'
"c:1727486318:13768:1:268"

# no-op
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:271",
  "files": []
}

21 of 56

Modified files

# create (or update) a file
$ touch foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:284",
  "files": [
    {
      "size": 0,
      "new": true,
      "exists": true,
      "mode": 33188,
      "name": "foo"
    }
  ]
}

22 of 56

Deleted files

# delete a file
$ rm foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:284"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:292",
  "files": [
    {
      "size": 0,
      "new": false,
      "exists": false,
      "mode": 33188,
      "name": "foo"
    }
  ]
}

23 of 56

Complex queries

$ watchman --json <<<'["query", ".", {
  "fields": ["name", "size", "mode", "content.sha1hex"],
  "expression": [
    "allof",
    ["type", "f"],
    ["not", ["dirname", ".git"]]
  ]
}]' | jq '.files[0]'
{
  "content.sha1hex": "f4466aeb9bf2306565967c66a8f070821969755c",
  "mode": 33188,
  "size": 110,
  "name": "pkgs/development/tools/build-managers/gradle/tests/java-application/src/main/java/Main.java"
}

24 of 56

Working with Watchman

25 of 56

Incremental rsync with Watchman

#!/bin/bash
set -euo pipefail

# Setup
watchman >/dev/null watch-project .

# Load clock
WATCHMAN_CLOCK=$(cat .watchman-clock 2>/dev/null || echo 'c:0:0')

# Query Watchman
WATCHMAN_QUERY=$(jq -n --arg clock "$WATCHMAN_CLOCK" '["query", ".", {"fields": ["name"], "since": $clock}]')
WATCHMAN_RESULT=$(watchman <<<"$WATCHMAN_QUERY" --json)

if jq <<<"$WATCHMAN_RESULT" >/dev/null -e '.is_fresh_instance'; then
  # Full sync
  echo 'Full sync'
  RSYNC_ARGS=(--delete-after)
else
  # Incremental sync
  jq <<<"$WATCHMAN_RESULT" >.watchman-files -r '.files[]'
  printf 'Incrementally syncing %d files\n' "$(wc <.watchman-files -l)"
  RSYNC_ARGS=(--files-from='.watchman-files' --delete-missing-args)
fi

# Execute
rsync --archive --compress "${RSYNC_ARGS[@]}" ./ ../nixpkgs-synced

# Commit clock
jq <<<"$WATCHMAN_RESULT" >.watchman-clock -r '.clock'

26 of 56

Incremental rsync with Watchman

$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced' './sync.sh'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
  Time (mean ± σ):      1.443 s ±  0.046 s    [User: 0.435 s, System: 1.458 s]
  Range (min … max):    1.393 s …  1.522 s    10 runs

Benchmark 2: ./sync.sh
  Time (mean ± σ):     317.2 ms ±  17.1 ms    [User: 209.8 ms, System: 35.9 ms]
  Range (min … max):   298.5 ms … 350.8 ms    10 runs

Summary
  './sync.sh' ran
    4.55 ± 0.29 times faster than 'rsync --archive --compress ./ ../nixpkgs-synced'

27 of 56

Why Watchman?

28 of 56

Watchman advantages — reliability

[Stripe] also invested heavily in reliability and self-healing on errors and network problems. One “internal” but significant improvement involved migrating to watchman for file-watching: In our testing, it was by far the most robust file-watcher we found, vastly reducing the frequency of missed updates or a “stuck” file-watcher.

— Nelson Elhage, Stripe's monorepo developer environment (2024)

29 of 56

Watchman advantages — architectural

  • Your filesystem is a distributed system
    • POSIX filesystems, at least
  • Watchman...
    • Supports efficient incremental queries
    • Encourages designing for eventual consistency
    • Effectively implements snapshotting via clock IDs
  • Abstraction levels:
    • ❌ Poking shared mutable state
    • ❌ Callbacks
    • Event sourcing (almost)

30 of 56

Wondering about Watchman?

31 of 56

Watchman modes

Main modes of operation:

  • since: get changes since last clock ID
    • Use when: you need to do incremental processing in response to user command
  • trigger: run a command when paths change
    • Use when: you need to do incremental processing preemptively, before the user has issued a command
    • At most one instance of the command runs at a time
    • Configurable settlement period
    • Logs go to Watchman logfile
  • subscribe: actively wait for changes in long-running process
    • Use when: you're daemonizing your build and the above approaches don't work

32 of 56

Watchman data structure

  • Essentially an in-memory linked hash map from path to metadata
    • Metadata includes clock ID
    • Not persisted between restarts
    • Watchman manages crawling the filesystem on its own
  • Algorithms:
    • Update on filesystem notification:
      • Access and bump entry to beginning of list in O(1)
      • Read file metadata from disk as appropriate and update entry
    • Read changed paths since client-provided clock ID:
      • Iterate entries from front of list (by recency)
      • When reaching entry with older clock ID, terminate traversal
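
The data structure and its two algorithms can be sketched in Python. This is a toy model for intuition only, not Watchman's actual implementation; all names are made up.

```python
from collections import OrderedDict


class WatchmanIndex:
    """Toy model of Watchman's linked hash map: path -> (clock, metadata),
    kept in recency order (most recently updated entries at the end)."""

    def __init__(self) -> None:
        self._clock = 0
        self._entries: OrderedDict = OrderedDict()

    @property
    def clock(self) -> int:
        return self._clock

    def record_change(self, path: str, metadata: dict) -> None:
        """Filesystem notification: bump the entry to the recent end in O(1)."""
        self._clock += 1
        self._entries[path] = (self._clock, metadata)
        self._entries.move_to_end(path)

    def changed_since(self, since_clock: int) -> list:
        """Walk entries by recency; stop at the first entry that is not
        newer than the client-provided clock."""
        changed = []
        for path in reversed(self._entries):
            entry_clock, _metadata = self._entries[path]
            if entry_clock <= since_clock:
                break
            changed.append(path)
        return changed
```

Note that `changed_since` is O(changes), not O(repo): queries stay cheap even in a huge repository, as long as few files have changed.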

33 of 56

Part 2:

Watchman Wisdom

34 of 56

Warnings about Watchman

  1. Handling is_fresh_instance
  2. Changing the query
  3. Eventual consistency

35 of 56

[1/3] Handling is_fresh_instance

  • See: watchman: precise meaning and handling of "is_fresh_instance" (2018)
  • If provided clock ID not known, returns all paths
  • Result has "is_fresh_instance": true
    • Example: first build
    • Example: Watchman died and restarted (may have missed filesystem updates)
  • Remediation:
    • Clear persistent state
    • Check persistent state (be careful about stale entries for deleted files!)
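
A sketch of the "check persistent state" remediation, assuming a hypothetical persisted path → metadata dict: on a fresh instance, every persisted entry absent from the (complete) result set must be a stale entry for a deleted file.

```python
def apply_watchman_result(result: dict, state: dict) -> dict:
    """Update persisted state from a Watchman query result.
    `state` maps path -> metadata (the shape is an assumption for this sketch)."""
    if result["is_fresh_instance"]:
        # All watched paths were returned: anything we have persisted that is
        # absent from the result was deleted while we weren't looking.
        live = {f["name"] for f in result["files"]}
        for stale_path in set(state) - live:
            del state[stale_path]
    for f in result["files"]:
        if f["exists"]:
            state[f["name"]] = {"size": f["size"]}
        else:
            state.pop(f["name"], None)
    return state
```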

36 of 56

[2/3] Changing the query

  • Your Watchman query is an input to your build!
  • When query changes, persistent state may be invalidated
    • Example: simple static query changes when build task is updated
    • Example: dynamic query generated by .gitignore files may not correctly handle changes to .gitignore files!
  • Solutions:
    • Use simple queries and filter later
    • Store old query alongside old clock; treat as is_fresh_instance when changed
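
The "store the old query alongside the old clock" idea can be sketched like this (the helper and the persisted shape are hypothetical):

```python
import hashlib
import json


def usable_since_clock(saved, query):
    """Return (since_clock, query_hash). `saved` is hypothetical persisted
    state of shape {"clock": ..., "query_hash": ...}; None means no state."""
    query_hash = hashlib.sha1(
        json.dumps(query, sort_keys=True).encode()
    ).hexdigest()
    if saved is None or saved.get("query_hash") != query_hash:
        # The query itself changed (or this is the first run), so previous
        # results are invalid: fall back to a full rebuild, exactly as if
        # Watchman had reported is_fresh_instance.
        return None, query_hash
    return saved["clock"], query_hash
```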

37 of 56

[3/3] Eventual consistency

  • File on disk may change after Watchman result
  • Be prepared for...
    • Content changes
    • Metadata changes
    • Deletion
  • Don't mix Watchman + filesystem data
    • Example: don't combine Watchman mtime with filesystem contents
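
For example, a consumer of Watchman results might read reported files defensively (a minimal sketch; the helper name is made up):

```python
def read_reported_file(path):
    """Watchman said `path` changed, but the file may have changed again
    (or been deleted) since the result was produced. Read defensively and
    trust only what we actually observe on disk."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except (FileNotFoundError, IsADirectoryError, NotADirectoryError):
        return None  # gone (or replaced) in the meantime; treat as deleted
```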

38 of 56

Wielding Watchman well

  1. Unifying full + incremental processing
  2. Persisting state
  3. Invalidating derived data
  4. Caching persistent state
  5. Distributing persistent state

39 of 56

[1/5] Unifying full + incremental processing

watchman_result = query_watchman()
if watchman_result["is_fresh_instance"]:
    paths_to_visit = None
else:
    paths_to_visit = set()
    for entry in watchman_result["files"]:
        path = Path(entry["name"])
        paths_to_visit.add(path)
        paths_to_visit.update(path.parents)

def should_visit(path: Path) -> bool:
    return paths_to_visit is None or path in paths_to_visit

40 of 56

[1/5] Unifying full + incremental processing

def visit(path: Path) -> None:
    if not should_visit(path):
        return
    if path.is_file():
        print(f"Visiting file: {path}")
    else:
        for child_path in path.iterdir():
            visit(child_path)

visit(Path("."))

41 of 56

[2/5] Persisting state

  • Usually will store path → metadata
    • Example: {"foo.py": ["class Bar", "def baz"]}
    • Easy to invalidate
  • Storage ideas:
    • Big JSON blob
    • SQLite
    • RocksDB
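
A minimal "big JSON blob" sketch (hypothetical helpers): the clock and the path → metadata map are persisted together so they can never drift apart, and write-then-rename keeps the blob intact if the process dies mid-write.

```python
import json
from pathlib import Path


def save_state(state_path: Path, clock: str, metadata: dict) -> None:
    """Persist clock + metadata atomically-ish via a temp-file rename."""
    tmp_path = state_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"clock": clock, "metadata": metadata}))
    tmp_path.replace(state_path)


def load_state(state_path: Path):
    try:
        blob = json.loads(state_path.read_text())
    except FileNotFoundError:
        return None, {}  # no previous state: caller should do a full build
    return blob["clock"], blob["metadata"]
```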

42 of 56

[3/5] Invalidating derived data

  • Easy to invalidate when keys are paths
  • Harder to invalidate derived data
  • Reverse index: metadata → path
    • Example: {"class Bar": "foo.py", "def baz": "foo.py"}
  • Reverse index strategies:
    • Maintain bidirectional index
    • Allow stale values; check again before using
    • O(n) update on deletions
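
The bidirectional-index strategy can be sketched as follows (an illustrative class, not from any real codebase): the forward map tells us exactly which reverse-index entries to invalidate when a path changes or is deleted.

```python
class SymbolIndex:
    """Bidirectional index: path -> symbols plus symbol -> paths."""

    def __init__(self) -> None:
        self.by_path = {}    # e.g. "foo.py" -> {"class Bar", "def baz"}
        self.by_symbol = {}  # e.g. "class Bar" -> {"foo.py"}

    def update_path(self, path, symbols) -> None:
        # First remove the path's old symbols from the reverse index.
        for symbol in self.by_path.pop(path, set()):
            paths = self.by_symbol[symbol]
            paths.discard(path)
            if not paths:
                del self.by_symbol[symbol]
        # Then insert the new symbols (an empty set means the file was deleted).
        if symbols:
            self.by_path[path] = set(symbols)
            for symbol in symbols:
                self.by_symbol.setdefault(symbol, set()).add(path)
```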

43 of 56

[4/5] Caching persistent state

  • Add data for validation: path → key + metadata
    • Reusable even when is_fresh_instance
  • Key: inode + mtime + size
  • Key: content hash
    • Watchman offers content.sha1hex
    • Machine-independent
  • Remember: must handle deleted files
    • Above only helps you validate individual entries upon use
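
A sketch of validating a cache entry before use, keyed on inode + mtime + size as above (the helper is hypothetical):

```python
import os


def cached_or_recompute(path, cache, compute):
    """Validate the cache entry against the live file before trusting it.
    content.sha1hex would work as the key too, and is machine-independent."""
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    entry = cache.get(path)
    if entry is not None and entry["key"] == key:
        return entry["value"]  # still valid: reuse
    value = compute(path)
    cache[path] = {"key": key, "value": value}
    return value
```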

44 of 56

[5/5] Distributing persistent state

[Diagram: commit graph A…E with main and HEAD, plus a "warm" commit; an artifact/service provides precomputed data for the warm commit, source control provides the files changed between warm and HEAD, and Watchman provides the files changed since HEAD]

  • Init:
    • Get clock ID
    • Process source control changes
  • Loop:
    • Process Watchman changes
    • Update clock ID

  • Handling deletions:
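
The init step above can be sketched as follows (all names are hypothetical; changed files are modeled as a plain path → data dict, with None marking a deletion):

```python
def init_state(artifact, source_control_changes, watchman_clock):
    """Start from the precomputed data built at the 'warm' commit, then
    overlay the source-control changes between warm and HEAD."""
    state = dict(artifact)
    for path, data in source_control_changes.items():
        if data is None:
            state.pop(path, None)  # deleted between warm and HEAD
        else:
            state[path] = data
    # Loop step: repeatedly query Watchman since `watchman_clock`, apply
    # changes the same way, and persist the new clock.
    return state, watchman_clock
```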

45 of 56

Thanks for watching!

Takeaways:

  • Your filesystem is a distributed system
  • Watchman offers a higher-level abstraction
    • OS APIs < Watchman < Bazel
    • Replayable stream of events, indexed by clock IDs (like database cursors)
    • Can implement efficient specialized systems
  • Higher-level abstractions save engineering resources
    • Reliable design (no missed updates)
    • Minimal design (no daemonization; leverage triggers, settlement, etc.)
    • Maintainable design (common incremental + full code paths)

— Waleed Khan <me@waleedkhan.name>

46 of 56

Deleted slides

47 of 56

Who is this talk for?

  • If you're building a system which...
    • Crawls a large repository
    • Executes "build" tasks outside of the build system
    • Performs fine-grained incremental processing

  • Or if you just think build systems are fun 😊

48 of 56

Watchman advantages — functionality

  • Easy to experiment
  • From homepage:

These two lines establish a watch on a source directory and then set up a trigger named buildme that will run a tool named minify-css whenever a CSS file is changed. The tool will be passed a list of the changed filenames.

$ watchman watch ~/src
# the single quotes around '*.css' are important!
$ watchman -- trigger ~/src buildme '*.css' -- minify-css

The output for buildme will land in the Watchman log file unless you send it somewhere else.

49 of 56

Watchman advantages — functionality

  • Cross-platform
    • To Linux and macOS, anyways
  • Well-designed interface
  • Extensible to custom source control systems + build artifacts
    • Note: Git support is probably buggy (I wrote it)

50 of 56

Watchman advantages — reliability

  • Handles and exposes OS errors uniformly
  • Has its own logging / diagnostic mechanisms
  • Design emphasizes correctness

51 of 56

Watchman advantages — efficiency

  • Incremental queries
    • Since a given clock ID
  • Settlement / batching
  • Shares operating system resources
    • Via relative roots

52 of 56

Watchman advantages — architectural

  • Your local filesystem is a distributed system
    • Filesystem serves requests from many clients concurrently
      • Your users can modify the filesystem while you're trying to build 😫
      • Hard to get a consistent snapshot
    • Some filesystems support efficient snapshotting — but does yours?
      • Example: ZFS
      • Example: btrfs
  • Watchman...
    • Institutes a global clock
    • Supports efficient incremental queries
    • Encourages designing for eventual consistency
    • Effectively implements snapshotting

53 of 56

What is Watchman?

  • Need to make your build tasks efficient and incremental?
  • If so, you might be able to use Watchman, a file watching service
  • You can buy a lot of development runway before needing to daemonize

54 of 56

Watchman disadvantages

  • Non-trivial dependency
    • Difficult to build from source
    • Need to deploy to all developer machines
  • Unsupported features
  • May still need custom filesystem crawling code for efficiency
    • Inter-process communication overhead when communicating many changed paths
    • Natural caching/prefetching effects when traversing filesystem
    • Can use specialized syscalls like fstat/fstatat

55 of 56

Watchman concepts

  • Watch root: directory being watched by Watchman
    • Register with watch-project
      • Idempotent
    • See also: relative roots
  • Clock ID: represents a point in time
    • Analogous to database cursors
    • Clock ID looks like: c:123:234
    • Note: there are other kinds of clocks (mtime, named cursors)
      • Recommended to not use them

56 of 56

Watchman since

  • Can get changed paths since a previous clock ID
  • Result includes new clock ID
    • For use in subsequent queries
    • Ensures you don't miss any updates
    • You should persist this alongside build artifact
  • If provided clock ID not known, returns all paths
    • Result sets "is_fresh_instance": true
    • See: watchman: precise meaning and handling of "is_fresh_instance"
    • Example: previous clock ID is from previous instance of Watchman process, which has since restarted (and may have missed filesystem updates)