1 of 56

Incremental processing with Watchman

Don't daemonize that build!

2 of 56

About

  • Career in optimizing the developer feedback loop
    • Meta: Hack language + IDE services
    • Twitter: Scala tooling; source control
    • Hudson River Trading: C++/Python build system
  • Watchman is an open-source file watching service

  • Slides:

3 of 56

Talk

  • Part 1: Warranting Watchman
    • Justification
    • Problems + solutions
    • Abstractions
  • Part 2: Watchman Wisdom
    • Stand-alone pieces of advice

  • Slides:

4 of 56

Part 1:

Warranting Watchman

5 of 56

Build tasks outside the build system

File syncing

Many companies have built solutions to sync local source code to a remote machine

Example: Stripe's monorepo development environment (Elhage 2024)

Dynamic dependencies

Generating the build graph itself programmatically

Example: Gazelle can be used to generate BUILD files for Bazel

IDE services

Latency-sensitive; need to process changes to source code on the order of <100 ms

Example: Language Server Protocol exposes facility to subscribe to workspace file changes

6 of 56

Example repo: nixpkgs

  • Testing using https://github.com/NixOS/nixpkgs
    • 685k commits
    • 43k files in working copy
    • (Not that big by monorepo standards)
  • Device:
    • MacBook Pro (Retina, 15-inch, Mid 2015)
    • 2.5 GHz Quad-Core Intel Core i7
    • 16 GB 1600 MHz DDR3

7 of 56

Nixpkgs: repository traversal timing

# Benchmark with hyperfine: traversing all files using ripgrep
# (it has optimized parallel filesystem traversal code).
$ hyperfine --warmup=3 'rg --files'
Benchmark 1: rg --files
  Time (mean ± σ):     752.9 ms ±  26.4 ms    [User: 1295.8 ms, System: 3845.9 ms]
  Range (min … max):   707.2 ms … 794.4 ms    10 runs

8 of 56

Remote development sync

Problem: Need to efficiently sync files to remote machine during development

Solution: Can use rsync to sync local source code to remote

9 of 56

Nixpkgs: no-op rsync time

# Sync the entire repository to a second directory on the local machine.
# rsync will traverse and hash the entire directory contents.
# A real use-case would involve network time.
$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
  Time (mean ± σ):      1.438 s ±  0.075 s    [User: 0.476 s, System: 1.448 s]
  Range (min … max):    1.322 s …  1.521 s    10 runs

10 of 56

rsync woes

Problem: rsync always walks + hashes all files

⚠️ Solution: Maintain persistent index/cache, like git

❌ Solution: Run background process ("daemon"), subscribe to inotify (etc.)

11 of 56

What is daemonization?

  • See: https://en.wikipedia.org/wiki/Daemon_(computing)
  • Convert existing build task into background process
    • For efficiency
  • Oftentimes:
    • Build task split into front-end client and back-end server
    • Program starts daemon if not already running
    • Commands become RPCs to daemon

12 of 56

Why daemonize?

  • Keep persistent O(repo) state in memory
    • Skip loading/saving
    • Efficient queries and updates
  • Subscribe to OS filesystem notifications
    • Example: inotify (Linux)
    • Example: kqueue (BSD)
    • Example: FSEvents (macOS)
  • Reduce startup latency
    • Example: Nailgun (Java)
    • Example: chg (Mercurial/Python)

13 of 56

Why not daemonize?

  • Huge incidental complexity around service management
    • Running exactly 0–1 instances
    • Process groups, signal handling, etc.
    • Forward/backward compatibility; upgrading running process version
    • RPC is fundamentally more complex than in-process function calls
  • Observability is difficult
    • Partially a tooling problem
    • Partially a design problem, when it's easy to persist arbitrary state
  • Long-lived processes make long-lived mistakes
    • Corrupted/erroneous persistent state
    • Deadlock
    • Resource leaks
    • "Runaway" processes

14 of 56

Don't daemonize that build!

  • ...at least until you have no other choice
  • Adds substantial incidental complexity

15 of 56

Watchman

  • Try Watchman, a file watching service
  • High-level primitive for custom incremental builds
    • Higher level of abstraction than filesystem APIs
    • Uses its own daemon, so that we don't have to 🫡
  • Can buy a lot of development runway before needing to daemonize

16 of 56

Warming up with Watchman

17 of 56

Start watching

$ watchman watch-project .
{
  "version": "20240926.093054.0",
  "watcher": "fsevents",
  "watch": "/Users/waleed/Workspace/nixpkgs"
}

18 of 56

Data

$ watchman --json <<<'["query", ".", {}]' | jq '.files[0]'
{
  "size": 480,
  "new": true,
  "exists": true,
  "mode": 16877,
  "name": ".git"
}

19 of 56

Metadata

$ watchman --json <<<'["query", ".", {"expression": "false"}]'
{
  "version": "20240926.093054.0",
  "files": [],
  "clock": "c:1727393614:39384:2:7984",
  "is_fresh_instance": true,
  "debug": {
    "cookie_files": [
      "/Users/waleed/Workspace/nixpkgs/.git/.watchman-cookie-waleedkhan.local-39384-342"
    ]
  }
}

20 of 56

Clock IDs

# get clock ID
$ watchman --json <<<'["query", ".", {}]' | jq '.clock'
"c:1727486318:13768:1:268"

# no-op
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:271",
  "files": []
}

21 of 56

Modified files

# create (or update) a file
$ touch foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:268"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:284",
  "files": [
    {
      "size": 0,
      "new": true,
      "exists": true,
      "mode": 33188,
      "name": "foo"
    }
  ]
}

22 of 56

Deleted files

# delete a file
$ rm foo
$ watchman --json <<<'["query", ".", {"since": "c:1727486318:13768:1:284"}]' \
  | jq '{ clock, files: .files[:1] }'
{
  "clock": "c:1727486318:13768:1:292",
  "files": [
    {
      "size": 0,
      "new": false,
      "exists": false,
      "mode": 33188,
      "name": "foo"
    }
  ]
}

23 of 56

Complex queries

$ watchman --json <<<'["query", ".", {
  "fields": ["name", "size", "mode", "content.sha1hex"],
  "expression": [
    "allof",
    ["type", "f"],
    ["not", ["dirname", ".git"]]
  ]
}]' | jq '.files[0]'
{
  "content.sha1hex": "f4466aeb9bf2306565967c66a8f070821969755c",
  "mode": 33188,
  "size": 110,
  "name": "pkgs/development/tools/build-managers/gradle/tests/java-application/src/main/java/Main.java"
}

24 of 56

Working with Watchman

25 of 56

Incremental rsync with Watchman

#!/bin/bash
set -euo pipefail

# Setup
watchman >/dev/null watch-project .

# Load clock
WATCHMAN_CLOCK=$(cat .watchman-clock 2>/dev/null || echo 'c:0:0')

# Query Watchman
WATCHMAN_QUERY=$(jq -n --arg clock "$WATCHMAN_CLOCK" '["query", ".", {"fields": ["name"], "since": $clock}]')
WATCHMAN_RESULT=$(watchman <<<"$WATCHMAN_QUERY" --json)

if jq <<<"$WATCHMAN_RESULT" >/dev/null -e '.is_fresh_instance'; then
  # Full sync
  echo 'Full sync'
  RSYNC_ARGS=(--delete-after)
else
  # Incremental sync
  jq <<<"$WATCHMAN_RESULT" >.watchman-files -r '.files[]'
  printf 'Incrementally syncing %d files\n' "$(wc <.watchman-files -l)"
  RSYNC_ARGS=(--files-from='.watchman-files' --delete-missing-args)
fi

# Execute
rsync --archive --compress "${RSYNC_ARGS[@]}" ./ ../nixpkgs-synced

# Commit clock
jq <<<"$WATCHMAN_RESULT" >.watchman-clock -r '.clock'

26 of 56

Incremental rsync with Watchman

$ hyperfine --warmup=3 'rsync --archive --compress ./ ../nixpkgs-synced' './sync.sh'
Benchmark 1: rsync --archive --compress ./ ../nixpkgs-synced
  Time (mean ± σ):      1.443 s ±  0.046 s    [User: 0.435 s, System: 1.458 s]
  Range (min … max):    1.393 s …  1.522 s    10 runs

Benchmark 2: ./sync.sh
  Time (mean ± σ):     317.2 ms ±  17.1 ms    [User: 209.8 ms, System: 35.9 ms]
  Range (min … max):   298.5 ms … 350.8 ms    10 runs

Summary
  './sync.sh' ran
    4.55 ± 0.29 times faster than 'rsync --archive --compress ./ ../nixpkgs-synced'

27 of 56

Why Watchman?

28 of 56

Watchman advantages — reliability

[Stripe] also invested heavily in reliability and self-healing on errors and network problems. One “internal” but significant improvement involved migrating to watchman for file-watching: In our testing, it was by far the most robust file-watcher we found, vastly reducing the frequency of missed updates or a “stuck” file-watcher.

— Nelson Elhage, Stripe's monorepo developer environment (2024)

29 of 56

Watchman advantages — architectural

  • Your filesystem is a distributed system
    • POSIX filesystems, at least
  • Watchman...
    • Supports efficient incremental queries
    • Encourages designing for eventual consistency
    • Effectively implements snapshotting via clock IDs
  • Abstraction levels:
    • ❌ Poking shared mutable state
    • ❌ Callbacks
    • Event sourcing (almost)

30 of 56

Wondering about Watchman?

31 of 56

Watchman modes

Main modes of operation:

  • since: get changes since last clock ID
    • Use when: you need to do incremental processing in response to user command
  • trigger: run a command when paths change
    • Use when: you need to do incremental processing preemptively, before the user has issued a command
    • At most one instance of the command runs at a time
    • Configurable settlement period
    • Logs go to Watchman logfile
  • subscribe: actively wait for changes in long-running process
    • Use when: you're daemonizing your build and the above approaches don't work

32 of 56

Watchman data structure

  • Essentially an in-memory linked hash map from path to metadata
    • Metadata includes clock ID
    • Not persisted between restarts
    • Watchman manages crawling the filesystem on its own
  • Algorithms:
    • Update on filesystem notification:
      • Access and bump entry to beginning of list in O(1)
      • Read file metadata from disk as appropriate and update entry
    • Read changed paths since client-provided clock ID:
      • Iterate entries from front of list (by recency)
      • When reaching entry with older clock ID, terminate traversal
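
The data structure and its two algorithms can be sketched in Python. This is a toy model for intuition only, not Watchman's actual implementation; all names are made up.

```python
from collections import OrderedDict


class WatchmanIndex:
    """Toy model of Watchman's linked hash map: path -> (clock, metadata),
    kept in recency order (most recently updated entries at the end)."""

    def __init__(self) -> None:
        self._clock = 0
        self._entries: OrderedDict = OrderedDict()

    @property
    def clock(self) -> int:
        return self._clock

    def record_change(self, path: str, metadata: dict) -> None:
        """Filesystem notification: bump the entry to the recent end in O(1)."""
        self._clock += 1
        self._entries[path] = (self._clock, metadata)
        self._entries.move_to_end(path)

    def changed_since(self, since_clock: int) -> list:
        """Walk entries by recency; stop at the first entry that is not
        newer than the client-provided clock."""
        changed = []
        for path in reversed(self._entries):
            entry_clock, _metadata = self._entries[path]
            if entry_clock <= since_clock:
                break
            changed.append(path)
        return changed
```

Note that `changed_since` is O(changes), not O(repo): queries stay cheap even in a huge repository, as long as few files have changed.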

33 of 56

Part 2:

Watchman Wisdom

34 of 56

Warnings about Watchman

  1. Handling is_fresh_instance
  2. Changing the query
  3. Eventual consistency

35 of 56

[1/3] Handling is_fresh_instance

  • See: watchman: precise meaning and handling of "is_fresh_instance" (2018)
  • If provided clock ID not known, returns all paths
  • Result has "is_fresh_instance": true
    • Example: first build
    • Example: Watchman died and restarted (may have missed filesystem updates)
  • Remediation:
    • Clear persistent state
    • Check persistent state (be careful about stale entries for deleted files!)
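
A sketch of the "check persistent state" remediation, assuming a hypothetical persisted path → metadata dict: on a fresh instance, every persisted entry absent from the (complete) result set must be a stale entry for a deleted file.

```python
def apply_watchman_result(result: dict, state: dict) -> dict:
    """Update persisted state from a Watchman query result.
    `state` maps path -> metadata (the shape is an assumption for this sketch)."""
    if result["is_fresh_instance"]:
        # All watched paths were returned: anything we have persisted that is
        # absent from the result was deleted while we weren't looking.
        live = {f["name"] for f in result["files"]}
        for stale_path in set(state) - live:
            del state[stale_path]
    for f in result["files"]:
        if f["exists"]:
            state[f["name"]] = {"size": f["size"]}
        else:
            state.pop(f["name"], None)
    return state
```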

36 of 56

[2/3] Changing the query

  • Your Watchman query is an input to your build!
  • When query changes, persistent state may be invalidated
    • Example: simple static query changes when build task is updated
    • Example: dynamic query generated by .gitignore files may not correctly handle changes to .gitignore files!
  • Solutions:
    • Use simple queries and filter later
    • Store old query alongside old clock; treat as is_fresh_instance when changed
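
The "store the old query alongside the old clock" idea can be sketched like this (the helper and the persisted shape are hypothetical):

```python
import hashlib
import json


def usable_since_clock(saved, query):
    """Return (since_clock, query_hash). `saved` is hypothetical persisted
    state of shape {"clock": ..., "query_hash": ...}; None means no state."""
    query_hash = hashlib.sha1(
        json.dumps(query, sort_keys=True).encode()
    ).hexdigest()
    if saved is None or saved.get("query_hash") != query_hash:
        # The query itself changed (or this is the first run), so previous
        # results are invalid: fall back to a full rebuild, exactly as if
        # Watchman had reported is_fresh_instance.
        return None, query_hash
    return saved["clock"], query_hash
```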

37 of 56

[3/3] Eventual consistency

  • File on disk may change after Watchman result
  • Be prepared for...
    • Content changes
    • Metadata changes
    • Deletion
  • Don't mix Watchman + filesystem data
    • Example: don't combine Watchman mtime with filesystem contents
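
For example, a consumer of Watchman results might read reported files defensively (a minimal sketch; the helper name is made up):

```python
def read_reported_file(path):
    """Watchman said `path` changed, but the file may have changed again
    (or been deleted) since the result was produced. Read defensively and
    trust only what we actually observe on disk."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except (FileNotFoundError, IsADirectoryError, NotADirectoryError):
        return None  # gone (or replaced) in the meantime; treat as deleted
```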

38 of 56

Wielding Watchman well

  1. Unifying full + incremental processing
  2. Persisting state
  3. Invalidating derived data
  4. Caching persistent state
  5. Distributing persistent state

39 of 56

[1/5] Unifying full + incremental processing

watchman_result = query_watchman()
if watchman_result["is_fresh_instance"]:
    paths_to_visit = None
else:
    paths_to_visit = set()
    for entry in watchman_result["files"]:
        path = Path(entry["name"])
        paths_to_visit.add(path)
        paths_to_visit.update(path.parents)

def should_visit(path: Path) -> bool:
    return paths_to_visit is None or path in paths_to_visit

40 of 56

[1/5] Unifying full + incremental processing

def visit(path: Path) -> None:
    if not should_visit(path):
        return
    if path.is_file():
        print(f"Visiting file: {path}")
    else:
        for child_path in path.iterdir():
            visit(child_path)

visit(Path("."))

41 of 56

[2/5] Persisting state

  • Usually will store path → metadata
    • Example: {"foo.py": ["class Bar", "def baz"]}
    • Easy to invalidate
  • Storage ideas:
    • Big JSON blob
    • SQLite
    • RocksDB
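
A minimal "big JSON blob" sketch (hypothetical helpers): the clock and the path → metadata map are persisted together so they can never drift apart, and write-then-rename keeps the blob intact if the process dies mid-write.

```python
import json
from pathlib import Path


def save_state(state_path: Path, clock: str, metadata: dict) -> None:
    """Persist clock + metadata atomically-ish via a temp-file rename."""
    tmp_path = state_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps({"clock": clock, "metadata": metadata}))
    tmp_path.replace(state_path)


def load_state(state_path: Path):
    try:
        blob = json.loads(state_path.read_text())
    except FileNotFoundError:
        return None, {}  # no previous state: caller should do a full build
    return blob["clock"], blob["metadata"]
```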

42 of 56

[3/5] Invalidating derived data

  • Easy to invalidate when keys are paths
  • Harder to invalidate derived data
  • Reverse index: metadata → path
    • Example: {"class Bar": "foo.py", "def baz": "foo.py"}
  • Reverse index strategies:
    • Maintain bidirectional index
    • Allow stale values; check again before using
    • O(n) update on deletions
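
The bidirectional-index strategy can be sketched as follows (an illustrative class, not from any real codebase): the forward map tells us exactly which reverse-index entries to invalidate when a path changes or is deleted.

```python
class SymbolIndex:
    """Bidirectional index: path -> symbols plus symbol -> paths."""

    def __init__(self) -> None:
        self.by_path = {}    # e.g. "foo.py" -> {"class Bar", "def baz"}
        self.by_symbol = {}  # e.g. "class Bar" -> {"foo.py"}

    def update_path(self, path, symbols) -> None:
        # First remove the path's old symbols from the reverse index.
        for symbol in self.by_path.pop(path, set()):
            paths = self.by_symbol[symbol]
            paths.discard(path)
            if not paths:
                del self.by_symbol[symbol]
        # Then insert the new symbols (an empty set means the file was deleted).
        if symbols:
            self.by_path[path] = set(symbols)
            for symbol in symbols:
                self.by_symbol.setdefault(symbol, set()).add(path)
```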

43 of 56

[4/5] Caching persistent state

  • Add data for validation: path → key + metadata
    • Reusable even when is_fresh_instance
  • Key: inode + mtime + size
  • Key: content hash
    • Watchman offers content.sha1hex
    • Machine-independent
  • Remember: must handle deleted files
    • Above only helps you validate individual entries upon use
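
A sketch of validating a cache entry before use, keyed on inode + mtime + size as above (the helper is hypothetical):

```python
import os


def cached_or_recompute(path, cache, compute):
    """Validate the cache entry against the live file before trusting it.
    content.sha1hex would work as the key too, and is machine-independent."""
    st = os.stat(path)
    key = (st.st_ino, st.st_mtime_ns, st.st_size)
    entry = cache.get(path)
    if entry is not None and entry["key"] == key:
        return entry["value"]  # still valid: reuse
    value = compute(path)
    cache[path] = {"key": key, "value": value}
    return value
```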

44 of 56

[5/5] Distributing persistent state

[Diagram: commit graph A…E with main and HEAD, plus a "warm" commit; an artifact/service provides precomputed data for the warm commit, source control provides the files changed between warm and HEAD, and Watchman provides the files changed since HEAD]

  • Init:
    • Get clock ID
    • Process source control changes
  • Loop:
    • Process Watchman changes
    • Update clock ID

  • Handling deletions:
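
The init step above can be sketched as follows (all names are hypothetical; changed files are modeled as a plain path → data dict, with None marking a deletion):

```python
def init_state(artifact, source_control_changes, watchman_clock):
    """Start from the precomputed data built at the 'warm' commit, then
    overlay the source-control changes between warm and HEAD."""
    state = dict(artifact)
    for path, data in source_control_changes.items():
        if data is None:
            state.pop(path, None)  # deleted between warm and HEAD
        else:
            state[path] = data
    # Loop step: repeatedly query Watchman since `watchman_clock`, apply
    # changes the same way, and persist the new clock.
    return state, watchman_clock
```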

45 of 56

Thanks for watching!

Takeaways:

  • Your filesystem is a distributed system
  • Watchman offers a higher-level abstraction
    • OS APIs < Watchman < Bazel
    • Replayable stream of events, indexed by clock IDs (like database cursors)
    • Can implement efficient specialized systems
  • Higher-level abstractions save engineering resources
    • Reliable design (no missed updates)
    • Minimal design (no daemonization; leverage triggers, settlement, etc.)
    • Maintainable design (common incremental + full code paths)

— Waleed Khan <me@waleedkhan.name>

46 of 56

Deleted slides

47 of 56

Who is this talk for?

  • If you're building a system which...
    • Crawls a large repository
    • Executes "build" tasks outside of the build system
    • Performs fine-grained incremental processing

  • Or if you just think build systems are fun 😊

48 of 56

Watchman advantages — functionality

  • Easy to experiment
  • From homepage:

These two lines establish a watch on a source directory and then set up a trigger named buildme that will run a tool named minify-css whenever a CSS file is changed. The tool will be passed a list of the changed filenames.

$ watchman watch ~/src
# the single quotes around '*.css' are important!
$ watchman -- trigger ~/src buildme '*.css' -- minify-css

The output for buildme will land in the Watchman log file unless you send it somewhere else.

49 of 56

Watchman advantages — functionality

  • Cross-platform
    • To Linux and macOS, anyways
  • Well-designed interface
  • Extensible to custom source control systems + build artifacts
    • Note: Git support is probably buggy (I wrote it)

50 of 56

Watchman advantages — reliability

  • Handles and exposes OS errors uniformly
  • Has its own logging / diagnostic mechanisms
  • Design emphasizes correctness

51 of 56

Watchman advantages — efficiency

  • Incremental queries
    • Since a given clock ID
  • Settlement / batching
  • Shares operating system resources
    • Via relative roots

52 of 56

Watchman advantages — architectural

  • Your local filesystem is a distributed system
    • Filesystem serves requests from many clients concurrently
      • Your users can modify the filesystem while you're trying to build 😫
      • Hard to get a consistent snapshot
    • Some filesystems support efficient snapshotting — but does yours?
      • Example: ZFS
      • Example: btrfs
  • Watchman...
    • Institutes a global clock
    • Supports efficient incremental queries
    • Encourages designing for eventual consistency
    • Effectively implements snapshotting

53 of 56

What is Watchman?

  • Need to make your build tasks efficient and incremental?
  • If so, you might be able to use Watchman, a file watching service
  • You can buy a lot of development runway before needing to daemonize

54 of 56

Watchman disadvantages

  • Non-trivial dependency
    • Difficult to build from source
    • Need to deploy to all developer machines
  • Unsupported features
  • May still need custom filesystem crawling code for efficiency
    • Inter-process communication overhead when communicating many changed paths
    • Natural caching/prefetching effects when traversing filesystem
    • Can use specialized syscalls like fstat/fstatat

55 of 56

Watchman concepts

  • Watch root: directory being watched by Watchman
    • Register with watch-project
      • Idempotent
    • See also: relative roots
  • Clock ID: represents a point in time
    • Analogous to database cursors
    • Clock ID looks like: c:123:234
    • Note: there are other kinds of clocks (mtime, named cursors)
      • Recommended to not use them

56 of 56

Watchman since

  • Can get changed paths since a previous clock ID
  • Result includes new clock ID
    • For use in subsequent queries
    • Ensures you don't miss any updates
    • You should persist this alongside build artifact
  • If provided clock ID not known, returns all paths
    • Result sets "is_fresh_instance": true
    • See: watchman: precise meaning and handling of "is_fresh_instance"
    • Example: previous clock ID is from previous instance of Watchman process, which has since restarted (and may have missed filesystem updates)