
File System Layer Design Doc

Author: Ethan Lee

Contributors: Thomas Lively, Alon Zakai

Last updated: 2021-09-30

Objective:

The objective of this project is to implement a production-ready, high-performance, fully multithreaded file system layer for Emscripten that will replace the existing JavaScript version. The requirements below outline the motivations for the proposed design.

Requirements:

The new file system seeks to meet all of the previous file system’s requirements and to address new requirements that would enhance the user experience. At-risk requirements are requirements of the old file system that may need to be modified to support the new implementation.

New Requirements:

  • WebAssembly Implementation: The new file system will be compiled to WebAssembly. The old file system was written in JavaScript.
  • Fully-Multithreaded: The new file system will be fully multithreaded and should outperform the previous single-threaded file system. There will no longer be any requirement to proxy to the main thread, unless the underlying Web API is only available on the main thread.
  • Smaller Code Size: The new file system should have a smaller code size footprint relative to the current implementation.
  • Per-file Persistence: Users will be able to specify which persistence store a particular file has, if any.
  • Memory Residency: Users will be able to specify how file contents will be flushed from in-memory caches to backing stores.
  • Proxying Support: Proxying will be supported when only one thread can own the underlying data, but otherwise will not be necessary.
  • Large File Caching: The new file system will be able to chunk and cache parts of large files. This will allow portions of large files to be loaded on demand.
  • Support Asynchronous (Main Thread) / Synchronous (Worker Threads) Backends: This includes OPFS.
  • Enhanced Crash Tolerance: There is an opportunity to improve crash tolerance compared to the old file system.

Existing Requirements:

  • POSIX Compatibility: The new file system implementation should still support the syscall implementation that is defined in library_syscall.js.
  • Standardized Backend Interface: The current file system supports multiple backends. Similarly, the new file system will provide an interface to plug into multiple backends including IndexedDB, LocalStorage, OPFS, etc.
  • Support Asynchronous Backends: The current file system already supports IndexedDB. The new file system can support asynchronous backends such as IndexedDB by using ASYNCIFY, by providing a synchronous interface that keeps data in memory, or by synchronously proxying to a worker thread. In the future, when stack switching is implemented, we should be able to use that to support IndexedDB as well.
  • Support Synchronous Backends: This includes OPFS AccessHandles.
  • Support Special Device and Pass-through Backends: Some special cases such as stdin or stdout may not benefit from the new API. In this case, they could still use the open file table structure for accessing files. The file system will also serve as a passthrough for operations to backends that directly support file operations such as NodeFS.
  • File Loading: The new file system will enable streaming files directly to a backend without loading the entire file into memory, similar to what WorkerFS does.

At-Risk Requirements:

  • JavaScript API for access from JS Code: The old file system was implemented purely in JavaScript, and it was possible for the user to interact with and extend file system internals. The new file system will provide a JavaScript API, but the scope may be reduced. Users of the current API will need to port their code to the new JavaScript API. This should be a relatively easy process, at least for simple use cases.

Design:

At a high level, the design will depart from the implementation of a traditional file system. A traditional on-disk file system might represent everything as a file, including directories. All data, including the metadata contained within inodes, would persist on disk in fixed-size data blocks. In contrast, this implementation will consist of a tree-based structure in which each node represents a file, directory, or symlink. A file node will contain pointers to blocks of memory that hold its data, but otherwise nodes will not need to have a fixed size. Figure 1 shows a sample root directory node and its child file nodes.

The current file system uses a tree node structure with a hashtable to store files and directories in JavaScript. The new file system will use a tree-based directory structure as well, but it will be written in C++.

C++ was chosen because Emscripten’s build systems are already integrated with it and because it provides higher-level abstractions than C, which will ease development. Other languages were considered as well, but the cost of setting up the infrastructure to integrate them with Emscripten outweighs any potential benefits.

Figure 1: Proposed Chunk-based File System Layout
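As a rough illustration of the layout described above, the following is a minimal sketch of a node tree in which directories hold named children and files hold pointers to separately allocated blocks. All names here (Node, NodeKind, kBlockSize, addChild, appendBytes, byteAt) and the 4096-byte block size are assumptions made for this example and are not part of the proposed design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical block size, chosen only for illustration.
constexpr std::size_t kBlockSize = 4096;

enum class NodeKind { File, Directory, Symlink };

// Hypothetical node type: one struct covers all three kinds to keep the sketch short.
struct Node {
  explicit Node(NodeKind k) : kind(k) {}

  NodeKind kind;
  // File: pointers to separately allocated blocks, plus the logical size in bytes.
  std::vector<std::unique_ptr<std::uint8_t[]>> blocks;
  std::size_t size = 0;
  // Directory: named children, owned by the directory node.
  std::map<std::string, std::unique_ptr<Node>> children;
  // Symlink: the path the link points to.
  std::string target;
};

// Adds a child to a directory. Returns nullptr if the name is already taken.
Node* addChild(Node& dir, const std::string& name, NodeKind kind) {
  assert(dir.kind == NodeKind::Directory);
  auto result = dir.children.emplace(name, std::make_unique<Node>(kind));
  return result.second ? result.first->second.get() : nullptr;
}

// Appends bytes to a file, allocating a new block whenever the last one is full.
void appendBytes(Node& file, const std::uint8_t* src, std::size_t len) {
  assert(file.kind == NodeKind::File);
  while (len > 0) {
    std::size_t offset = file.size % kBlockSize;
    if (offset == 0) {
      file.blocks.push_back(std::make_unique<std::uint8_t[]>(kBlockSize));
    }
    std::size_t n = std::min(len, kBlockSize - offset);
    std::memcpy(file.blocks.back().get() + offset, src, n);
    file.size += n;
    src += n;
    len -= n;
  }
}

// Reads one byte by locating the block that contains the given position.
std::uint8_t byteAt(const Node& file, std::size_t pos) {
  assert(file.kind == NodeKind::File && pos < file.size);
  return file.blocks[pos / kBlockSize][pos % kBlockSize];
}
```

In this sketch only file contents live in fixed-size blocks; the nodes themselves grow and shrink with their children and block lists, which mirrors the contrast with inode-based layouts drawn above.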

Modular Backend API:

A modular backend system will let users combine backends to significantly extend the capabilities and flexibility of the file system. The main benefit of this abstraction is that it makes the core file system simpler. By delegating functionality such as cache control and persistence to backends, each module will be able to focus on its primary file system operations. The internals of each backend will be encapsulated, and backends will communicate with each other using the same common API. Backends will also be individually configured according to a user’s specifications, allowing for fine-grained control. Furthermore, the code for virtual backends will be reusable to add functionality to arbitrary other backends. Modular backends will also meet the requirement of per-file persistence by permitting users to configure each file with its own set of backends, including persistent backends. The backends we plan to implement and their common use cases are detailed below:

Concrete Backends

  • JSHeap: Stores data as JS objects.
  • WasmMemory: Stores data in shared memory.
  • XHR: Reads data from a server using XHR requests.
  • IndexedDB: Stores data outside of memory in IndexedDB.
  • LocalStorage: Stores data outside of memory in LocalStorage.
  • OPFS: Stores data outside of memory in OPFS.

Virtual Backends

  • Cache: Reads data out of one underlying backend, but stores that data along with any modifications in another faster underlying backend. Can optionally flush modifications back to the original underlying backend.
  • Proxy: Forwards calls to a backend on a different thread. Can provide a synchronous interface for async backends.
  • RecordLocking: Layers POSIX record locking capabilities and management logic on top of some other backend.
  • Instrumenting: Adds a logging layer around an existing backend. This can be used for testing crash tolerance.
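To make the idea of a common API shared by concrete and virtual backends more tangible, here is a minimal sketch, assuming a very small whole-file read/write interface. The Backend interface, its method signatures, and the MemoryBackend and InstrumentingBackend classes are hypothetical names invented for this example; the actual backend API is not specified in this document.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical common API that every backend, concrete or virtual, implements.
class Backend {
 public:
  virtual ~Backend() = default;
  // Copies the file contents into `out`. Returns false if the file does not exist.
  virtual bool read(const std::string& path, Bytes& out) = 0;
  // Creates or overwrites the file at `path`.
  virtual void write(const std::string& path, const Bytes& data) = 0;
};

// Concrete backend: keeps whole files in an in-memory map (a stand-in for WasmMemory).
class MemoryBackend : public Backend {
 public:
  bool read(const std::string& path, Bytes& out) override {
    auto it = files_.find(path);
    if (it == files_.end()) return false;
    out = it->second;
    return true;
  }
  void write(const std::string& path, const Bytes& data) override {
    files_[path] = data;
  }

 private:
  std::map<std::string, Bytes> files_;
};

// Virtual backend: records each call, then forwards it to the wrapped backend.
class InstrumentingBackend : public Backend {
 public:
  explicit InstrumentingBackend(Backend& inner) : inner_(inner) {}

  bool read(const std::string& path, Bytes& out) override {
    log_.push_back("read " + path);
    return inner_.read(path, out);
  }
  void write(const std::string& path, const Bytes& data) override {
    log_.push_back("write " + path);
    inner_.write(path, data);
  }
  const std::vector<std::string>& log() const { return log_; }

 private:
  Backend& inner_;  // not owned; must outlive this wrapper
  std::vector<std::string> log_;
};
```

Because the wrapper only depends on the shared interface, the same instrumenting code could in principle be layered over any other backend, which is the reuse property that virtual backends are meant to provide.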

Cache Hierarchies Use Case Examples

  • JavaScript backend used as an in-memory cache for IndexedDB persistent store.
  • LocalStorage backend used as a fast, persistent cache for a network-based backend.
  • JavaScript backend used as an in-memory cache for a LocalStorage backend. This LocalStorage backend could in turn serve as a cache for a network-based backend.
  • Generic cache backend that reads data from a network-based backend but writes dirty data to an IndexedDB persistent backend.
  • Generic cache backend that reads data from IndexedDB but writes dirty data to a temporary IndexedDB backend that will not be persisted. This will store updated data outside of memory but will still be ephemeral.
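The cache hierarchies listed above could be composed roughly as in the sketch below, which follows the Cache backend description: data is read from a slower source backend, kept along with any modifications in a faster store backend, and optionally flushed back. The Backend interface, the MemoryBackend stand-in (used here in place of real network, IndexedDB, or LocalStorage backends), and the CacheBackend class are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical common backend API (assumed for this sketch).
class Backend {
 public:
  virtual ~Backend() = default;
  virtual bool read(const std::string& path, Bytes& out) = 0;
  virtual void write(const std::string& path, const Bytes& data) = 0;
};

// In-memory stand-in for any concrete backend.
class MemoryBackend : public Backend {
 public:
  bool read(const std::string& path, Bytes& out) override {
    auto it = files_.find(path);
    if (it == files_.end()) return false;
    out = it->second;
    return true;
  }
  void write(const std::string& path, const Bytes& data) override {
    files_[path] = data;
  }

 private:
  std::map<std::string, Bytes> files_;
};

// Virtual cache backend: reads fall through to `source` on a miss and are
// then kept in `store`; writes go only to `store` until flush() is called.
class CacheBackend : public Backend {
 public:
  CacheBackend(Backend& source, Backend& store)
      : source_(source), store_(store) {}

  bool read(const std::string& path, Bytes& out) override {
    if (store_.read(path, out)) return true;      // cached or modified copy
    if (!source_.read(path, out)) return false;   // not present anywhere
    store_.write(path, out);                      // populate the cache
    return true;
  }

  void write(const std::string& path, const Bytes& data) override {
    store_.write(path, data);
    dirty_.insert(path);
  }

  // Optional step: copy modified files back to the original backend.
  void flush() {
    Bytes data;
    for (const std::string& path : dirty_) {
      if (store_.read(path, data)) source_.write(path, data);
    }
    dirty_.clear();
  }

 private:
  Backend& source_;  // not owned; slower backing store
  Backend& store_;   // not owned; faster cache and dirty-data store
  std::set<std::string> dirty_;
};
```

Never calling flush() corresponds to the last two use cases, where dirty data stays in the second backend and is not written back to the backend it was read from. Since CacheBackend is itself a Backend, it could also serve as the source or store of another cache to form a multi-level hierarchy.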