Proposal: Introduce deletion vector file to reduce write amplification
A deletion vector is a commonly used technique to reduce write amplification in columnar storage. The idea is simple: we associate each data file with a deletion vector file, and when updating a data file we don’t need to rewrite the whole data file, only its deletion vector file.
Let’s use an UPDATE SQL statement as an example. In the following case, we need to rewrite whole data files, which introduces 249 MB of write IO.
After introducing deletion vector files, the write IO can be reduced to about 1 MB for each rewrite.
Essentially, the deletion vector is a set of row indices encoded as a bitmap, used to mark deleted rows in a data file. It needs to provide two operations for efficient reading/writing:
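For illustration, here is a minimal sketch of such a deletion vector backed by the 32-bit Java RoaringBitmap library, assuming the two operations are marking a row position as deleted and testing whether a position is deleted (real positions are long values; the 64-bit case is discussed in the comments below):

```java
import org.roaringbitmap.RoaringBitmap;

// Illustrative sketch only: a deletion vector backed by a 32-bit roaring bitmap.
// Row positions are assumed to fit in an int here; real positions are longs.
public class DeletionVectorSketch {
  private final RoaringBitmap deletedPositions = new RoaringBitmap();

  // Mark the row at `position` in the associated data file as deleted.
  public void delete(int position) {
    deletedPositions.add(position);
  }

  // Test whether the row at `position` has been deleted.
  public boolean isDeleted(int position) {
    return deletedPositions.contains(position);
  }

  // Number of deleted rows recorded in the bitmap.
  public long deletedRowCount() {
    return deletedPositions.getLongCardinality();
  }
}
```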
With the required operations above, we have several choices of file format for deletion vectors:
Since each deletion vector is associated with one data file, we propose to extend the DataFile struct with an optional deletion_vector_file field:
| Field name | Type | Description |
|------------|------|-------------|
| deletion_vector_file (optional) | DeletionVectorFile | Deletion vector file associated with the data file. |

And the definition of the DeletionVectorFile type:

| Field name | Type | Description |
|------------|------|-------------|
| file_path (required) | String | Absolute path of the deletion vector file. |
| record_count (required) | long | Number of deleted rows recorded in this deletion vector file. |
| file_size_in_bytes (required) | long | File size of this deletion vector file. |
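As a hypothetical illustration (not the actual Iceberg API), these metadata fields could be represented roughly as follows:

```java
// Hypothetical representation of the proposed metadata, mirroring the tables above;
// field names follow the table, but this is not the actual Iceberg DataFile API.
public class DeletionVectorFile {
  private final String filePath;       // absolute path of the deletion vector file
  private final long recordCount;      // number of deleted rows recorded in the file
  private final long fileSizeInBytes;  // size of the deletion vector file in bytes

  public DeletionVectorFile(String filePath, long recordCount, long fileSizeInBytes) {
    this.filePath = filePath;
    this.recordCount = recordCount;
    this.fileSizeInBytes = fileSizeInBytes;
  }

  public String filePath() { return filePath; }
  public long recordCount() { return recordCount; }
  public long fileSizeInBytes() { return fileSizeInBytes; }
}
```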
As a basic implementation, when reading a data file we also open its deletion vector file and read the bitmap. This is similar to zipping two iterators to filter out deleted records. Compute engines are free to optimize this; for example, a vectorized query engine could read a batch of bits and use it as a selection vector for the record batch.
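A minimal sketch of this basic read path, assuming the rows of the data file are exposed as an iterator in file order and the deletion vector has already been loaded into a RoaringBitmap:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import org.roaringbitmap.RoaringBitmap;

// Sketch: zip a data-file row iterator with the deletion vector, skipping rows
// whose position is marked deleted. Positions are ints here for brevity.
public class DeleteFilterIterator<T> implements Iterator<T> {
  private final Iterator<T> rows;       // rows of the data file, in file order
  private final RoaringBitmap deleted;  // bitmap read from the deletion vector file
  private int position = 0;             // current row position in the data file
  private T next = null;

  public DeleteFilterIterator(Iterator<T> rows, RoaringBitmap deleted) {
    this.rows = rows;
    this.deleted = deleted;
  }

  @Override
  public boolean hasNext() {
    while (next == null && rows.hasNext()) {
      T candidate = rows.next();
      if (!deleted.contains(position)) {  // keep only live rows
        next = candidate;
      }
      position++;
    }
    return next != null;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = next;
    next = null;
    return result;
  }
}
```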
The example in the Motivation section shows the process of performing an update when there is no deletion vector file associated with the data file. When a deletion vector file is already associated, we merge the two deletion vector files. The following diagram shows the write process[l][m][n][o]:
Let’s say we have a table with the following schema:
```sql
CREATE TABLE IF NOT EXISTS s1.t1
(
id long,
v_int int,
v_double double,
v_varchar string
) USING iceberg TBLPROPERTIES (
'format-version'='2',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
```
And we execute sql:
```sql
DELETE FROM s1.t1 WHERE id > 1
```
We can see that the physical plan is quite similar to deletion with position deletes. The Sort is optional and depends on the number of affected data files per writer task. Without the sort, we need to buffer all deletes in memory before writing.
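Where a data file already has a deletion vector, the deletes buffered for that file are merged with the existing bitmap before the new deletion vector file is written. A rough, minimal sketch of that merge step (serialization and file IO omitted; not the proposed implementation):

```java
import org.roaringbitmap.RoaringBitmap;

// Sketch of the write-path merge: the existing deletion vector of a data file is
// OR-ed with the newly produced deletes, and the union becomes the content of the
// rewritten deletion vector file.
public class DeletionVectorMerge {
  public static RoaringBitmap merge(RoaringBitmap existing, RoaringBitmap newDeletes) {
    // Union of all positions deleted by past operations and by this operation.
    return RoaringBitmap.or(existing, newDeletes);
  }
}
```

The record_count of the resulting DeletionVectorFile would then be the cardinality of the merged bitmap, and its file_size_in_bytes the size of the serialized union.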
Let’s say we have a table with the following schema:
```sql
CREATE TABLE IF NOT EXISTS s1.t1
(
id long,
v_int int,
v_double double,
v_varchar string
) USING iceberg TBLPROPERTIES (
'format-version'='2',
'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read'
);
```
And we execute sql:
```sql
UPDATE s1.t1 SET v_int = v_int + 1 WHERE id > 1
```
This proposal doesn’t break any existing spec; it only adds a new kind of row-level delete file.
Once deletion vector files are added, we should deprecate the use of position delete files. Equality delete files are still useful for streaming writes such as Flink, but we should introduce more frequent minor compaction to rewrite equality delete files as much as possible for better read performance.
Though a deletion vector file looks quite similar to an existing position delete file, there are some differences:
These differences bring some advantages for deletion_vector_file:
For example, we can use a roaring bitmap for better compression.
Since a deletion_vector_file is always associated with one data_file, we can ignore deletion vector files during the planning phase, since the planning result is always the same as for the data_file.[t][u]
The following diagram shows the read process with position delete files:
With a deletion vector file, we don’t need to merge several files before reading; we just build a selection vector[2] during the scan:
This benefits from the 1:1 relationship between data_file and deletion_vector_file.
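As an illustration of the selection-vector construction, a minimal sketch assuming each record batch starts at a known row position within the data file (names here are hypothetical):

```java
import java.util.Arrays;
import org.roaringbitmap.RoaringBitmap;

// Sketch: build a selection vector (indices of live rows within a record batch)
// directly from the deletion vector, without merging any delete files.
public class SelectionVectorBuilder {
  public static int[] build(RoaringBitmap deleted, int batchStartPosition, int batchSize) {
    int[] selection = new int[batchSize];
    int count = 0;
    for (int i = 0; i < batchSize; i++) {
      // batchStartPosition + i is the absolute row position within the data file.
      if (!deleted.contains(batchStartPosition + i)) {
        selection[count++] = i;  // keep the in-batch index of a live row
      }
    }
    return Arrays.copyOf(selection, count);
  }
}
```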
To avoid confusion, by statistics[ab][ac][ad][ae] I mean things such as null_count and nan_count in the manifest entry. To maintain statistics for a pos_deletion_file, we need to store deleted row data in the pos_deletion_file, which introduces write amplification. With deletion_vector_file, we no longer need to maintain statistics for deletion files; we only need to update the associated data_file’s statistics when necessary.
[a]Why leave out starting position?
[b]Because we rewrite the whole deletion vector file every time, we don't need the starting position when writing.
[c]The Java lib for Roaring bitmaps used by Spark and Iceberg has no well-defined serialization format for 64-bit integers. However, positions are represented as long values and must be persisted. We can solve it; just noting it here.
[d]I think we need this.
[e]I'd be interested to learn how well this compresses compared to the existing position delete files. We do store the data file path but it is dictionary encoded in Parquet and costs almost nothing.
[f]If I am not mistaken, the bitmap needed around 10 bits per value when I tested it. It does depend on density, however.
[g]Yes, the roaring bitmap docs say it takes at most 2 bytes per integer, but the actual compression ratio depends on the dataset.
[h]What are our expectations on performance here?
[i]I did a simple test using the Java 32-bit implementation:
1. It takes 40 ms to merge two bitmaps with 0.25 billion rows.
2. The write throughput is almost 3.5M/s.
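For context, a hypothetical sketch of this kind of merge micro-benchmark (not the commenter's actual test code; the timings above are theirs):

```java
import org.roaringbitmap.RoaringBitmap;

// Hypothetical reproduction of the merge micro-benchmark described above:
// build two bitmaps with ~0.25 billion positions each and time their union.
public class MergeBenchmarkSketch {
  public static void main(String[] args) {
    RoaringBitmap a = new RoaringBitmap();
    RoaringBitmap b = new RoaringBitmap();
    a.add(0L, 250_000_000L);            // positions [0, 250M)
    b.add(125_000_000L, 375_000_000L);  // positions [125M, 375M), overlapping half of `a`

    long start = System.nanoTime();
    RoaringBitmap merged = RoaringBitmap.or(a, b);
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;

    System.out.println("merged cardinality = " + merged.getLongCardinality()
        + ", merge took " + elapsedMs + " ms");
  }
}
```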
[j]I was considering the entire write process here. This would mean on write we must re-read all of the delete files to create our merged delete vectors, correct?
[k]Yes, it needs to re-read the delete files.
[l]I think the process for a task-level rewrite is straightforward.
The bigger question is how engines would operate to meet the requirements here. Engines have separate read and write operators. The read/scan operator knows about the delete vector for a data file, but how does the write task know that a delete vector is present to rewrite it? Would you propose passing the delete vector path through with every record? Would you broadcast the state? I'm assuming that we want to avoid shuffling all of the previously deleted row information.
[m]I echo what Russell and Ryan said here. We need a bit more clarity on how that would be produced. How can we quickly locate an existing delete vector for a file?
[n]+1, I'll add sketches for each DML operation.
[o]Hi, I've added the section with the physical plans for delete and update. Merge into would be similar, so I omit it. The core idea here is to add a `_dv_file` field in each record and pass it through the whole pipeline so that we can process it in the `WriteDelta` executor.
cc @Ryan Blue @Anton
[p]I don't think this is well specified above, but why wouldn't we want to store multiple bitmaps in the same delete file? Then a data file could just refer to offsets within that file, see Puffin Spec for example
[q]Good suggestion. But there are two concerns if we want to store multiple bitmaps in the same delete file:
1. Write amplification. We may need to rewrite more deletion vector data when updating.
2. Concurrent updates. It makes concurrently updating data files difficult.
[r]Why couldn’t we do this with existing positional deletes?
[s]A position delete file is more complex than a deletion vector file since it requires an extra column for the file path, so we can't compress it with a bitmap.
[t]This is inaccurate and creates a big concern for this proposal. The library must always produce tasks that specify what deletes need to be applied. How are those delete files tracked and how are they associated with data files during planning? If you're saying that planning can ignore them, how do readers know to apply them?
[u]You are right. I've updated the proposal to embed `DeletionVectorFile` in the `DataFile` struct, so now this holds.
[v]Is the primary benefit of deletion vectors simply that you only have one to apply?
[w]It's one benefit. Another is that we can produce a selection vector for vectorized execution engines rather than filtering records one by one. The following graph shows how it works.
[x]The exact same approach is already implemented. Take a look at ColumnVectorWithFilter and ColumnarBatchReader.
[y]I took a look at the code, and it's similar to what I mean. But for position deletes, it still requires opening several files and merging all of them before reading. The actual performance degradation depends on the number of position delete files.
[z]It would be good to have a more in-depth performance analysis of this for different scenarios. Is having a delete vector per data file always going to be a win in terms of read IO reduction vs. the current positional delete representation? Is there enough of a difference in performance to justify the cost & complexity of introducing a new way to represent row-level deletes? Even if the plan is to deprecate the current positional deletes there would be some time window over which engines need to continue support for the old format.
[aa]After considering the functionality of time travel, it seems impossible to maintain only one deletion vector for each snapshot generated, given that we need to account for deletes.
[ab]I’m confused by this paragraph: how is this different, and why do we need to store deleted row data?
[ac]The spec (https://iceberg.apache.org/spec/#delete-formats) says that a position delete file has an extra column called `row`, which stores deleted row data.
[ad]That is optional, and as far as I know no writers actually use it.
[ae]Are they used in the planning phase?