Blob storage in IDB, a work in progress

Author: Eric Uhrhane, ericu@chromium.org

Date: Spring 2014

Status: Feature implemented, document obsolete.


CLs in progress:

  https://codereview.chromium.org/18590006/ for Blink,

  https://codereview.chromium.org/18023022/ for Chromium.

Overview:

I'll just say "Blob" below when I mean Blob or File, but we do differentiate between them: they have different internal metadata, we store different info for each, and we actually copy the underlying data differently as well.

The basic idea for Blob storage is that while we store simple values in LevelDB, that doesn't work for big data.  So we store a reference to the blob data in a separate table that parallels the value table, then store the actual data for each Blob in a real file on disk.

In order to prevent browser crashes from leaving lots of leftover files or fragments lying around, we have to use a journal [two, in fact, but more on that later], also implemented in LevelDB.  The journal is stored in the same LevelDB database as everything else, so there's one per origin per profile.  The journal assumes that the IDB backend is single-threaded, so if we ever switch to e.g. one thread or task runner per IDBDatabase, we'll need to update the journal code.

When you want to read a Blob value out of the database, we create a new BlobData that points to the file in place in the database directory; no copies occur.  Whether you're reading a Blob or a File, it works basically the same way, but we'll construct the appropriate type of object in the renderer.

Implementation details:

The secondary ["live blob"] journal:

Let's say you read a Blob from the database, and keep it live in the renderer, but meanwhile you delete the IDB value that refers to it.  We can't delete the underlying file yet, since it's in use, but we want to clear it from the database.  So we use the secondary journal to track this file.  It's marked as in-use-but-removed-from-the-database, and as soon as the Blob gets GCed, it should go away.

We do this using a ShareableFileReference and the IndexedDBActiveBlobRegistry to track all in-use backing files and all relevant deletions.  When a live Blob has been removed from the database and the last reference gets GCed, its secondary journal entry gets moved to the primary journal for cleanup.  The secondary journal gets cleaned up the first time we open the database after restarting Chrome, when we can be sure that it contains only stale entries.  The primary journal gets cleaned up whenever convenient, but generally before adding anything new to it.
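
For illustration, here's a minimal standalone C++ sketch of that final-release transition; the struct and helper names are hypothetical stand-ins, not the actual Chromium classes:

  #include <cstdint>

  struct BlobRef { int64_t database_id = 0; int64_t blob_key = 0; };

  // Stand-ins for LevelDB operations on the two journals.
  void LiveJournalRemove(const BlobRef&) { /* LevelDB delete in the real code */ }
  void PrimaryJournalAdd(const BlobRef&) { /* LevelDB put in the real code */ }

  // Runs when the last Blob referring to a backing file is GCed after its
  // database entry was already deleted.
  void OnFinalBlobRelease(const BlobRef& blob) {
    LiveJournalRemove(blob);  // no renderer holds this blob any longer
    PrimaryJournalAdd(blob);  // routine primary-journal cleanup deletes the file
  }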

The IndexedDBActiveBlobRegistry:

This class is conceptually simple, but gets complicated due to the fact that Blobs live on the IO thread and all the other IDB stuff lives on the IDB task runner.  The idea is that when you create a Blob that refers to one of our backing files, the IndexedDBActiveBlobRegistry [ABR] keeps track of that.  If you try to delete the IDB value that owns the backing file, the ABR checks to see if there are any active blobs using it.  If so, it just notes that in the secondary journal, instead of telling you it's OK to delete it.  When the Blob later gets garbage-collected, it triggers the actual deletion.  However, in order to deal with Blobs on the right thread, this class mostly just does a lot of thread-hopping and forwarding.  Note that if you read the same Blob value out multiple times, you'll create multiple separate Blobs, but they'll all refer to the same backing file, and ShareableFileReference will take care of the many-to-one mapping.

GetAddBlobRefCallback gives you a callback that can be called on any thread, which will mark a backing file as in use by a Blob.  The callback will get called on the IO thread when the Blob is created, to remove race conditions in which we might only get halfway through Blob creation.  See indexed_db_backing_store.cc: GetBlobInfoForRecord for where we get the callbacks for a Blob.  See indexed_db_callbacks.cc: RegisterBlobsAndSend for where we call this callback [as mark_used_callback].

GetFinalReleaseCallback returns the callback that the ShareableFileReference will call when the last Blob that refers to a particular backing file is garbage-collected.  See indexed_db_callbacks.cc: CreateBlobData for where we hook this callback up to the SFR.

MarkDeletedCheckIfUsed is what you call to tell the registry you've deleted a value that owned a particular blob [you'll call it multiple times if a given value contained multiple Blobs].  It returns whether or not the Blob is in use, so you know whether you can delete its backing file.
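
Putting those three entry points together, here's a rough sketch of the registry's shape in plain C++ [std::function standing in for the real callback types, and all the thread-hopping elided]:

  #include <cstdint>
  #include <functional>

  class ActiveBlobRegistrySketch {
   public:
    // The returned callback marks {database_id, blob_key} as referenced
    // by a live Blob; the real one runs on the IO thread at Blob
    // creation time.
    std::function<void()> GetAddBlobRefCallback(int64_t database_id,
                                                int64_t blob_key);

    // The returned callback is handed to the ShareableFileReference and
    // runs when the last Blob for the backing file is garbage-collected.
    std::function<void()> GetFinalReleaseCallback(int64_t database_id,
                                                  int64_t blob_key);

    // Called when the value owning this blob is deleted; returns true if
    // a live Blob still uses the backing file, meaning deletion must be
    // deferred via the live-blob journal.
    bool MarkDeletedCheckIfUsed(int64_t database_id, int64_t blob_key);
  };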

Disk layout:

Under the same parent directory as e.g. https_origin.com_0.indexeddb.leveldb, we'll create https_origin.com_0.indexeddb.blob.  Under that we'll have a directory for each database, with the directory_id in unpadded hex as the directory name.  Below that we'll shard Blob files into numbered subdirectories.  The blob files themselves will be keyed using a simple incrementing counter, rendered into unpadded hex for the filename.  The blob subdirectory sharding uses the second-to-least-significant byte of the key, zero-padded to 2 hex characters, as the directory name.  E.g. the Blob with id 0x15A in database 0x2 is stored at https_origin.com_0.indexeddb.blob/2/01/15A.  This keeps us down to 256 files per subdirectory for a very long time, and keeps the number of subdirectories to a minimum.  It also lets us do fast database deletions; we just rm -r the directory to clean up the blobs.  We could shard by object store too if fast object store deletions were desired, but it would add to path length and add coding complexity, so I chose not to do that initially.
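
As a concrete illustration, here's a small standalone function computing that relative path [a hypothetical helper, not the actual Chromium code, which also prepends the profile and origin directories]:

  #include <cstdint>
  #include <cstdio>
  #include <string>

  // Returns e.g. "2/01/15A" for database 0x2, blob key 0x15A.
  std::string BlobFileRelativePath(int64_t database_id, int64_t blob_key) {
    // Shard directory: second-to-least-significant byte of the key,
    // zero-padded to two hex characters.
    unsigned shard = static_cast<unsigned>((blob_key >> 8) & 0xFF);
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%llX/%02X/%llX",
                  static_cast<unsigned long long>(database_id), shard,
                  static_cast<unsigned long long>(blob_key));
    return std::string(buf);
  }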

LevelDB schema/layout:

Blob Key Generator Current Number [DatabaseMetadataKey::BLOB_KEY_GENERATOR_CURRENT_NUMBER]:

The incrementing counter that's used to generate blob keys and filenames.  Files are usually identified by the numeric key rather than the filename for efficiency.

Primary Blob journal [BlobJournalKey]:

Where we store the primary blob journal.

Live blob journal [LiveBlobJournalKey]:

Where we store the live-blob journal.

Blob journal data:

A list of {database_id, blob_key} pairs, where blob_key may be kAllBlobsKey to indicate that a whole database needs to be deleted.  See EncodeBlobJournalData in indexed_db_backing_store.cc.  Each of the two journals has the same layout.
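
A simplified standalone sketch of that payload shape [fixed-width integers for brevity; the real EncodeBlobJournalData uses the coding helpers in indexed_db_leveldb_coding, and the sentinel value here is made up]:

  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>

  constexpr int64_t kAllBlobsKeySketch = -1;  // illustrative sentinel only

  struct JournalEntry { int64_t database_id; int64_t blob_key; };

  std::string EncodeJournalSketch(const std::vector<JournalEntry>& entries) {
    std::string out;
    for (const auto& e : entries) {
      char buf[16];
      std::memcpy(buf, &e.database_id, 8);
      std::memcpy(buf + 8, &e.blob_key, 8);  // may be kAllBlobsKeySketch
      out.append(buf, 16);
    }
    return out;
  }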

Basic blob data [BlobEntryKey]:

This is documented in indexed_db_leveldb_coding.{cc,h}.

We add a BlobEntry table whose keys can be computed directly from those in the ObjectStoreData table [and vice-versa].  Any value in ObjectStoreData that refers to a Blob will also have an entry in BlobEntry.  To determine if a range contains any Blobs, you can just quickly scan through BlobEntry rather than checking each ObjectStoreData value.  This should allow for efficient DeleteRange, although that's not yet implemented.  Any value can contain any number of Blobs, so in BlobEntry we store arrays of encoded IndexedDBBlobInfo; see EncodeBlobData in indexed_db_backing_store.cc.
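
The key correspondence can be sketched like this [illustrative encoding, not the real one, which is in indexed_db_leveldb_coding.{cc,h}]: both tables address records by the same {database_id, object_store_id, user key} triple, and only a table tag differs, which is what makes the direct key-to-key computation possible.

  #include <cstdint>
  #include <cstring>
  #include <string>

  // Illustrative table tags; the real values live in the coding scheme.
  constexpr uint8_t kObjectStoreDataTag = 1;
  constexpr uint8_t kBlobEntryTag = 2;

  std::string EncodeRecordKeySketch(uint8_t table_tag, int64_t database_id,
                                    int64_t object_store_id,
                                    const std::string& encoded_user_key) {
    std::string out;
    out.push_back(static_cast<char>(table_tag));
    char ids[16];
    std::memcpy(ids, &database_id, 8);
    std::memcpy(ids + 8, &object_store_id, 8);
    out.append(ids, 16);
    out.append(encoded_user_key);  // identical bytes in both tables
    return out;
  }

  // Swapping kObjectStoreDataTag for kBlobEntryTag [or back] converts one
  // table's key into the other's without touching the user key bytes.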

The serialized value data:

We've been storing SerializedScriptValues as bits, in whatever format they chose to serialize themselves.  Currently that involves spitting out a Blob URL, which isn't useful after the blob has been GCed.  I altered SSV to give me a vector of the useful metadata for each blob as sideband data, and to put into its bitstream only an index into that vector instead of the blob URL.  This code path is only taken for IDB values; all others still use the Blob URLs.  There are other GC-related bugs there, but those will have different fixes.
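
Conceptually, the change looks like this [hypothetical types; the real work is inside SerializedScriptValue and IndexedDBBlobInfo]:

  #include <cstddef>
  #include <cstdint>
  #include <string>
  #include <utility>
  #include <vector>

  struct BlobInfoSketch {  // the useful per-blob metadata
    bool is_file = false;
    std::string type;      // MIME type
    uint64_t size = 0;
  };

  struct SerializeResultSketch {
    std::string bitstream;                  // blobs appear only as indices
    std::vector<BlobInfoSketch> blob_info;  // sideband metadata vector
  };

  // Each blob the serializer encounters lands in the sideband vector, and
  // the bitstream records just the returned index [not a Blob URL].
  size_t RegisterBlobSketch(SerializeResultSketch& result,
                            BlobInfoSketch info) {
    result.blob_info.push_back(std::move(info));
    return result.blob_info.size() - 1;
  }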

In order to pass Blobs back from the browser to the renderer, we need to make sure they don't get garbage-collected while they're in flight.  There is not yet a generic mechanism to pass a Blob's refcount across IPC, so I implemented a new handshake for that.  We hold the refcount via a BlobDataHandle in a map in IndexedDBDispatcherHost::blob_data_handle_map_ and use the IndexedDBHostMsg_AckReceivedBlobs message to signal that the blob's made it across the IPC successfully.
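
A stripped-down sketch of that handshake [hypothetical types; the real map is IndexedDBDispatcherHost::blob_data_handle_map_ and the ack arrives as IndexedDBHostMsg_AckReceivedBlobs]:

  #include <map>
  #include <memory>
  #include <string>
  #include <utility>
  #include <vector>

  struct BlobDataHandleSketch { /* keeps the blob alive while held */ };

  class DispatcherHostSketch {
   public:
    // Before sending the value: pin each blob so it can't be GCed while
    // the IPC is in flight.
    void HoldBlob(const std::string& uuid,
                  std::unique_ptr<BlobDataHandleSketch> handle) {
      blob_data_handle_map_[uuid] = std::move(handle);
    }

    // On the renderer's ack: it now holds its own reference, so the
    // browser-side pin can be dropped.
    void OnAckReceivedBlobs(const std::vector<std::string>& uuids) {
      for (const auto& uuid : uuids)
        blob_data_handle_map_.erase(uuid);
    }

   private:
    std::map<std::string, std::unique_ptr<BlobDataHandleSketch>>
        blob_data_handle_map_;
  };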

How we write the blob data:

Reading a Blob's data [in order to write it to a backing file] is currently rather unpleasantly tied into the network code.  FileWriter actually used BlobURLRequestJob to get its data out, and we just reused that code.  BlobURLRequestJob should really have a BlobStreamReader refactored out of it for external use, but I haven't taken the time to do that yet.  Michael has something like that in an unreviewed CL already, which would be worth checking in.  Reading Files is a lot easier; we've got the underlying path, so we just use file_util::CopyFile, which is a heck of a lot more efficient.
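
In sketch form, the two paths look like this [hypothetical types; the real streaming path goes through BlobURLRequestJob and the file path uses file_util::CopyFile]:

  #include <filesystem>
  #include <system_error>

  struct BlobSourceSketch {
    std::filesystem::path file_path;  // non-empty only for File-backed blobs
    bool has_file_path() const { return !file_path.empty(); }
  };

  // Stand-in for streaming a blob's bytes out [BlobURLRequestJob today].
  bool StreamBlobBytesTo(const BlobSourceSketch&,
                         const std::filesystem::path&) { return true; }

  bool WriteBlobToBackingFile(const BlobSourceSketch& blob,
                              const std::filesystem::path& dest) {
    if (blob.has_file_path()) {
      // File case: we know the source path, so a direct copy suffices.
      std::error_code ec;
      std::filesystem::copy_file(blob.file_path, dest, ec);
      return !ec;
    }
    // Blob case: no path to copy from; stream the bytes instead.
    return StreamBlobBytesTo(blob, dest);
  }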

When does this blob data storage happen?

See IndexedDBTransaction::Commit() and CommitPhaseTwo().

See IndexedDBBackingStore::Transaction::CommitPhaseOne() and CommitPhaseTwo().

We don't actually do anything with files until you try to commit stuff.  Before that, we store info about the blobs to be affected in a map in the IndexedDBBackingStore::Transaction.  Then we do a phased commit [sketched after this list] that:

  * Adds to the journal the blob files we're about to create, and commits that;

  * Writes those files [via the IO and FILE threads];

  * Adds to the journal the blob files we're about to remove [from overwritten or removed keys], removes the entries for the new files, and commits that along with the rest of the transaction;

  * Removes the dead blob files [unless they're in use, in which case they're added to the secondary journal instead];

  * Removes the dead blob file entries from the journal and commits that.
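
Here is the whole sequence as a standalone sketch [stub helpers standing in for the real LevelDB and file operations in IndexedDBBackingStore::Transaction]:

  #include <cstdint>
  #include <vector>

  struct BlobRef { int64_t database_id = 0; int64_t blob_key = 0; };

  // Stubs for the real journal, LevelDB, and file-system operations.
  void PrimaryJournalAdd(const std::vector<BlobRef>&) {}
  void PrimaryJournalRemove(const std::vector<BlobRef>&) {}
  void LiveJournalAdd(const BlobRef&) {}
  void CommitLevelDB() {}  // persists journal changes [and, in phase two,
                           // the rest of the transaction]
  void WriteBackingFile(const BlobRef&) {}
  void DeleteBackingFile(const BlobRef&) {}
  bool InUseByLiveBlob(const BlobRef&) { return false; }

  void CommitWithBlobsSketch(const std::vector<BlobRef>& new_blobs,
                             const std::vector<BlobRef>& dead_blobs) {
    PrimaryJournalAdd(new_blobs);     // record intent to create...
    CommitLevelDB();                  // ...before any file exists
    for (const auto& b : new_blobs)
      WriteBackingFile(b);            // via the IO and FILE threads
    PrimaryJournalAdd(dead_blobs);    // record intent to delete,
    PrimaryJournalRemove(new_blobs);  // new files now belong to the database
    CommitLevelDB();                  // with the rest of the transaction
    for (const auto& b : dead_blobs) {
      if (InUseByLiveBlob(b))
        LiveJournalAdd(b);            // defer: a renderer still holds it
      else
        DeleteBackingFile(b);
    }
    PrimaryJournalRemove(dead_blobs); // entries are now stale
    CommitLevelDB();
  }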

When does this all readback and blob creation happen?

We're constrained by what can be done on the IO thread [stuff involving Blobs] vs. the Indexed DB TaskRunner [stuff involving LevelDB and our in-memory data structures].  So we do some jumping back and forth.  We get all the IDB TaskRunner stuff done first, including registering the not-yet-created blobs in the IndexedDBActiveBlobRegistry, then jump across to the IO thread to create the BlobData objects and the ShareableFileReferences used for cleanup, then send the read response to the renderer.  We have to be sure to avoid races, or we'll end up trying to create Blobs for files that have already been cleaned up.  This requires us to do a bit of manual refcount management in IndexedDBActiveBlobRegistry, since it lives on the IDB TaskRunner and BlobDatas and ShareableFileReferences live on the IO thread.
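
The shape of that read path, with the task-posting reduced to stand-ins [the real code posts across the IDB TaskRunner and the IO thread]:

  #include <functional>

  // Stand-ins that just run the task inline; real code posts across threads.
  void RunOnIDBTaskRunner(std::function<void()> task) { task(); }
  void RunOnIOThread(std::function<void()> task) { task(); }

  void ReadValueWithBlobsSketch() {
    RunOnIDBTaskRunner([] {
      // 1. Read the value and blob metadata out of LevelDB, and register
      //    the soon-to-exist Blobs with the IndexedDBActiveBlobRegistry
      //    so cleanup can't race with Blob creation.
      RunOnIOThread([] {
        // 2. Build the BlobData objects and the ShareableFileReferences
        //    used for cleanup [these live on the IO thread], then send
        //    the read response to the renderer.
      });
    });
  }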

Status:

Things that appear to work, but aren't extensively tested:

Things that are explicitly not implemented yet:

The CLs have a number of TODOs, but those are either for small improvements or for longer-term issues which can be dealt with after the initial implementation is checked in.