Proposal: Introduce deletion vector file to reduce write amplification
A deletion vector is a commonly used technique to reduce write amplification in columnar storage. The idea is simple: we associate each data file with a deletion vector file, and when updating a data file we don’t need to rewrite the whole data file, only its deletion vector file.
Let’s use an UPDATE SQL statement as an example. In the following case, we need to rewrite whole data files, which introduces 249 MB of write IO.
After introducing deletion vector files, the write IO can be reduced to about 1 MB for each rewrite.
Essentially, the deletion vector is a set of row indices encoded as a bitmap, used to mark deleted rows in a data file. It needs to provide two operations for efficient reading/writing:
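For illustration, here is a minimal sketch of such a deletion vector backed by the 32-bit Java RoaringBitmap library, assuming the two operations are marking a row position as deleted and testing whether a position is deleted (real positions are long values; the 64-bit case is discussed in the comments below):

```java
import org.roaringbitmap.RoaringBitmap;

// Illustrative sketch only: a deletion vector backed by a 32-bit roaring bitmap.
// Row positions are assumed to fit in an int here; real positions are longs.
public class DeletionVectorSketch {
  private final RoaringBitmap deletedPositions = new RoaringBitmap();

  // Mark the row at `position` in the associated data file as deleted.
  public void delete(int position) {
    deletedPositions.add(position);
  }

  // Test whether the row at `position` has been deleted.
  public boolean isDeleted(int position) {
    return deletedPositions.contains(position);
  }

  // Number of deleted rows recorded in the bitmap.
  public long deletedRowCount() {
    return deletedPositions.getLongCardinality();
  }
}
```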
With the required operations above, we have several choices of file format for deletion vectors:
Since each deletion vector is associated with one data file, we propose to extend the DataFile struct with an optional deletion_vector_file field:
| Field name | Type | Description |
|------------|------|-------------|
| deletion_vector_file (optional) | DeletionVectorFile | Deletion vector file associated with the data file. |

And the definition of the DeletionVectorFile type:

| Field name | Type | Description |
|------------|------|-------------|
| file_path (required) | String | Absolute path of the deletion vector file. |
| record_count (required) | long | Number of deleted rows recorded in this deletion vector file. |
| file_size_in_bytes (required) | long | File size of this deletion vector file. |
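As a hypothetical illustration (not the actual Iceberg API), these metadata fields could be represented roughly as follows:

```java
// Hypothetical representation of the proposed metadata, mirroring the tables above;
// field names follow the table, but this is not the actual Iceberg DataFile API.
public class DeletionVectorFile {
  private final String filePath;       // absolute path of the deletion vector file
  private final long recordCount;      // number of deleted rows recorded in the file
  private final long fileSizeInBytes;  // size of the deletion vector file in bytes

  public DeletionVectorFile(String filePath, long recordCount, long fileSizeInBytes) {
    this.filePath = filePath;
    this.recordCount = recordCount;
    this.fileSizeInBytes = fileSizeInBytes;
  }

  public String filePath() { return filePath; }
  public long recordCount() { return recordCount; }
  public long fileSizeInBytes() { return fileSizeInBytes; }
}
```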
As a basic implementation, when reading a data file we also open its deletion vector file and read the bitmap. This is similar to zipping two iterators to filter out deleted records. Compute engines are free to optimize this; for example, a vectorized query engine could read a batch of bits and use it as a selection vector for the record batch.
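A minimal sketch of this basic read path, assuming the rows of the data file are exposed as an iterator in file order and the deletion vector has already been loaded into a RoaringBitmap:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import org.roaringbitmap.RoaringBitmap;

// Sketch: zip a data-file row iterator with the deletion vector, skipping rows
// whose position is marked deleted. Positions are ints here for brevity.
public class DeleteFilterIterator<T> implements Iterator<T> {
  private final Iterator<T> rows;       // rows of the data file, in file order
  private final RoaringBitmap deleted;  // bitmap read from the deletion vector file
  private int position = 0;             // current row position in the data file
  private T next = null;

  public DeleteFilterIterator(Iterator<T> rows, RoaringBitmap deleted) {
    this.rows = rows;
    this.deleted = deleted;
  }

  @Override
  public boolean hasNext() {
    while (next == null && rows.hasNext()) {
      T candidate = rows.next();
      if (!deleted.contains(position)) {  // keep only live rows
        next = candidate;
      }
      position++;
    }
    return next != null;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = next;
    next = null;
    return result;
  }
}
```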
The example in the Motivation section shows the process of performing an update when there is no deletion vector file associated with the data file. When a deletion vector file is already associated, we merge the two deletion vector files. The following diagram shows the write process[l][m][n][o]:
Let’s say we have a table with the following schema:
```sql
CREATE TABLE IF NOT EXISTS s1.t1
(
id long,
v_int int,
v_double double,
v_varchar string
) USING iceberg TBLPROPERTIES (
'format-version'='2',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
```
And we execute sql:
```sql
DELETE FROM s1.t1 WHERE id > 1
```
We can see that the physical plan is quite similar to deletion with position deletes. The Sort is optional and depends on the number of affected data files per writer task. Without the sort, we need to buffer all deletes in memory before writing.
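Where a data file already has a deletion vector, the deletes buffered for that file are merged with the existing bitmap before the new deletion vector file is written. A rough, minimal sketch of that merge step (serialization and file IO omitted; not the proposed implementation):

```java
import org.roaringbitmap.RoaringBitmap;

// Sketch of the write-path merge: the existing deletion vector of a data file is
// OR-ed with the newly produced deletes, and the union becomes the content of the
// rewritten deletion vector file.
public class DeletionVectorMerge {
  public static RoaringBitmap merge(RoaringBitmap existing, RoaringBitmap newDeletes) {
    // Union of all positions deleted by past operations and by this operation.
    return RoaringBitmap.or(existing, newDeletes);
  }
}
```

The record_count of the resulting DeletionVectorFile would then be the cardinality of the merged bitmap, and its file_size_in_bytes the size of the serialized union.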
Let’s say we have a table with the following schema:
```sql
CREATE TABLE IF NOT EXISTS s1.t1
(
id long,
v_int int,
v_double double,
v_varchar string
) USING iceberg TBLPROPERTIES (
'format-version'='2',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
```
And we execute sql:
```sql
UPDATE s1.t1 SET v_int = v_int + 1 WHERE id > 1
```
This proposal doesn’t break any existing spec; it only adds a new kind of row-level delete file.
Once deletion vector files are added, we should deprecate the use of position delete files. Equality delete files are still useful for streaming writes such as Flink, but we should introduce more frequent minor compaction to rewrite equality delete files as much as possible for better read performance.
Though a deletion vector file looks quite similar to an existing position delete file, there are some differences:
These differences bring some advantages for deletion_vector_file:
For example, we can use a roaring bitmap for better compression.
Since a deletion_vector_file is always associated with one data_file, we can ignore deletion vector files during the planning phase, since the planning result is always the same as for the data_file.[t][u]
The following diagram shows the read process with position delete files:
With a deletion vector file, we don’t need to merge several files before reading; we just build a selection vector[2] during the scan:
This benefits from the 1:1 relationship between data_file and deletion_vector_file.
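As an illustration of the selection-vector construction, a minimal sketch assuming each record batch starts at a known row position within the data file (names here are hypothetical):

```java
import java.util.Arrays;
import org.roaringbitmap.RoaringBitmap;

// Sketch: build a selection vector (indices of live rows within a record batch)
// directly from the deletion vector, without merging any delete files.
public class SelectionVectorBuilder {
  public static int[] build(RoaringBitmap deleted, int batchStartPosition, int batchSize) {
    int[] selection = new int[batchSize];
    int count = 0;
    for (int i = 0; i < batchSize; i++) {
      // batchStartPosition + i is the absolute row position within the data file.
      if (!deleted.contains(batchStartPosition + i)) {
        selection[count++] = i;  // keep the in-batch index of a live row
      }
    }
    return Arrays.copyOf(selection, count);
  }
}
```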
To avoid confusion, by statistics[ab][ac][ad][ae] I mean things such as null_count and nan_count in the manifest entry. To maintain statistics for a pos_deletion_file, we need to store deleted row data in the pos_deletion_file, which introduces write amplification. With deletion_vector_file, we no longer need to maintain statistics for deletion files; we only need to update the associated data_file’s statistics when necessary.
[a]Why leave out starting position?
[b]Because we rewrite the whole deletion vector file every time, we don't need the starting position when writing.
[c]The Java lib for Roaring bitmaps used by Spark and Iceberg has no well-defined serialization format for 64-bit integers. However, positions are represented as long values and must be persisted. We can solve it; just noting it here.
[d]I think we need this.
[e]I'd be interested to learn how well this compresses compared to the existing position delete files. We do store the data file path but it is dictionary encoded in Parquet and costs almost nothing.
[f]If I am not mistaken, the bitmap needed around 10 bits per value when I tested it. It does depend on density, however.
[g]Yes, the roaring bitmap docs say it takes at most 2 bytes per integer, but the actual compression ratio depends on the dataset.
[h]What are our expectations on performance here?
[i]I did a simple test using the Java 32-bit implementation:
1. It takes 40 ms to merge two bitmaps with 0.25 billion rows.
2. The write throughput is almost 3.5M/s.
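For context, a hypothetical sketch of this kind of merge micro-benchmark (not the commenter's actual test code; the timings above are theirs):

```java
import org.roaringbitmap.RoaringBitmap;

// Hypothetical reproduction of the merge micro-benchmark described above:
// build two bitmaps with ~0.25 billion positions each and time their union.
public class MergeBenchmarkSketch {
  public static void main(String[] args) {
    RoaringBitmap a = new RoaringBitmap();
    RoaringBitmap b = new RoaringBitmap();
    a.add(0L, 250_000_000L);            // positions [0, 250M)
    b.add(125_000_000L, 375_000_000L);  // positions [125M, 375M), overlapping half of `a`

    long start = System.nanoTime();
    RoaringBitmap merged = RoaringBitmap.or(a, b);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("merged cardinality = " + merged.getLongCardinality()
        + ", merge took " + elapsedMs + " ms");
  }
}
```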
[j]I was considering the entire write process here. This would mean on write we must re-read all of the delete files to create our merged delete vectors, correct?
[k]Yes, it needs to re-read the delete files.
[l]I think the process for a task-level rewrite is straightforward.
The bigger question is how engines would operate to meet the requirements here. Engines have separate read and write operators. The read/scan operator knows about the delete vector for a data file, but how does the write task know that a delete vector is present to rewrite it? Would you propose passing the delete vector path through with every record? Would you broadcast the state? I'm assuming that we want to avoid shuffling all of the previously deleted row information.
[m]I echo what Russell and Ryan said here. We need a bit more clarity on how that would be produced. How can we quickly locate an existing delete vector for a file?
[n]+1, I'll add sketches for each DML operation.
[o]Hi, I've added the section with the physical plans for delete and update. Merge into would be similar, so I omit it. The core idea here is to add a `_dv_file` field in each record and pass it through the whole pipeline so that we can process it in the `WriteDelta` executor.
cc @Ryan Blue @Anton
[p]I don't think this is well specified above, but why wouldn't we want to store multiple bitmaps in the same delete file? Then a data file could just refer to offsets within that file, see Puffin Spec for example
[q]Good suggestion. But there are two concerns if we want to store multiple bitmaps in the same delete file:
1. Write amplification. We may need to rewrite more deletion vector data when updating.
2. Concurrent updates. It makes concurrently updating data files difficult.
[r]Why couldn’t we do this with existing positional deletes?
[s]A position delete file is more complex than a deletion vector file since it requires an extra column for the file path, so we can't compress it with a bitmap.
[t]This is inaccurate and creates a big concern for this proposal. The library must always produce tasks that specify what deletes need to be applied. How are those delete files tracked and how are they associated with data files during planning? If you're saying that planning can ignore them, how do readers know to apply them?
[u]You are right. I've updated the proposal to embed `DeletionVectorFile` in the `DataFile` struct, so now this holds.
[v]Is the primary benefit of deletion vectors simply that you only have one to apply?
[w]It's one benefit. Another is that we can produce a selection vector for vectorized execution engines rather than filtering records one by one. The following graph shows how it works.
[x]The exact same approach is already implemented. Take a look at ColumnVectorWithFilter and ColumnarBatchReader.
[y]I took a look at the code, and it's similar to what I mean. But for position deletes, it still requires opening several files and merging all of them before reading. The actual performance degradation depends on the number of position delete files.
[z]It would be good to have a more in-depth performance analysis of this for different scenarios. Is having a delete vector per data file always going to be a win in terms of read IO reduction vs. the current positional delete representation? Is there enough of a difference in performance to justify the cost & complexity of introducing a new way to represent row-level deletes? Even if the plan is to deprecate the current positional deletes there would be some time window over which engines need to continue support for the old format.
[aa]After considering the functionality of time travel, it seems impossible to maintain only one deletion vector for each snapshot generated, given that we need to account for deletes.
[ab]I’m confused by this paragraph: how is this different, and why do we need to store deleted row data?
[ac]The spec (https://iceberg.apache.org/spec/#delete-formats) says that a position delete file has an extra column called `row`, which stores deleted row data.
[ad]That is optional, and as far as I know no writers actually use it.
[ae]Are they used in the planning phase?