1 of 14

SQRT Decomposition

2 of 14

Recap: Segment Tree

Last week, we discussed how the almighty Segment Tree solves range queries, such as SUM and MINIMUM, while also supporting UPDATES to the array.

It accomplishes this by storing nodes in a tree, where each node stores the answer for its interval.

To solve queries, we traverse the tree from the top down. Since the tree has height log2(N) by construction, each range query takes O(log2(N)) time.

For similar reasons, it takes O(log2(N)) time to update an element.

3 of 14

Drawbacks?

One large issue with ST is the space requirement. As we learned last week, it can have ~4*N nodes in its tree (recursive implementation), which, although O(N), can sometimes be too memory-hungry to pass.

However, one way to resolve this is to use an iterative implementation, which halves the memory requirement, bringing it down to ~2*N nodes.

Unfortunately, even iterative STs are sometimes too memory intensive to get the coveted AC verdict.

With ST, we can perform updates and queries in O(log2(N)) time with O(N) space. Is there a way that we could perhaps take longer on queries, but require less memory?

4 of 14

Solution: SQRT Decomposition

It turns out that we can!

Consider an array of N integers. Instead of building a tree based on the array, we can simply group elements together into “blocks.” To create a “block,” we pick some integer K and group elements together based on the following algorithm:

  • If the index satisfies 0 <= index < K, it is in the first block (block 0).
  • Otherwise, if the index satisfies K <= index < 2*K, it is in the second block (block 1).
  • …
  • In general, if the index satisfies i*K <= index < (i+1)*K, it is in block i.

A decomposition of an array into blocks of size 3.
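The bucketing rule above can be sketched in Python; the array values and K = 3 here are a hypothetical example, not from the slides:

```python
# Decompose an array into blocks of size K = 3 (hypothetical example values).
arr = [5, 2, 7, 1, 9, 3, 4, 8]
K = 3

# Per the rule above, the element at index i belongs to block i // K (0-indexed).
blocks = {}
for i, value in enumerate(arr):
    blocks.setdefault(i // K, []).append(value)

print(blocks)  # {0: [5, 2, 7], 1: [1, 9, 3], 2: [4, 8]}
```

Note that the last block may be smaller than K when K does not divide N evenly.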

5 of 14

Why is this useful?

Now that the array has been “bucketed” into blocks, we can perform queries on this much smaller array of block answers instead!

For each “block,” we can store the answer to a query for the block. For example, if we were solving SUM queries, the block would store the sum of all the values in the block. Similarly, if we were trying to solve MINIMUM queries, we would store the minimum of all the values in the block.

In this new array, there are ceil(N / K) elements, which can be a lot less than N if K is chosen properly.
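For SUM queries, building this smaller array of block answers can be sketched as follows (the array values are a hypothetical example):

```python
import math

arr = [5, 2, 7, 1, 9, 3, 4, 8]  # hypothetical example, N = 8
K = 3

# One entry per block: ceil(N / K) entries in total.
num_blocks = math.ceil(len(arr) / K)
block_sum = [0] * num_blocks
for i, value in enumerate(arr):
    block_sum[i // K] += value  # block i // K answers for this element

print(block_sum)  # [14, 13, 12] -- 3 entries instead of 8
```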

6 of 14

Solving Range Queries

Now that we have the answers to each “bucket,” how can we use this to solve range queries?

Consider a query with endpoints L and R with L <= R. If L and R align exactly with block boundaries (L is the left endpoint of some block and R is the right endpoint of some block), we can simply iterate over all the blocks that make up the range and combine their stored answers.

To determine which block an element belongs to, we can simply take its index, call it i, then divide it by K. In other words, block_index = i / K (note integer division).
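For a boundary-aligned SUM query, this looks roughly like the sketch below; `query_aligned` and the array values are made up for illustration:

```python
arr = [5, 2, 7, 1, 9, 3, 4, 8, 6]  # hypothetical example
K = 3
block_sum = [sum(arr[b * K:(b + 1) * K]) for b in range(-(-len(arr) // K))]

def query_aligned(l, r):
    """Sum over [l, r], assuming l starts a block and r ends a block."""
    total = 0
    for b in range(l // K, r // K + 1):  # block_index = i // K
        total += block_sum[b]
    return total

print(query_aligned(3, 8))  # blocks 1 and 2: (1+9+3) + (4+8+6) = 31
```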

7 of 14

Incomplete ranges?

When L and R aren’t on block boundaries, problems arise. We cannot simply take a partially covered block’s stored value, since we’d be using values that lie outside the queried range.

To resolve this, we can simply “shift” the query endpoints so they lie on block boundaries: move L to the right (increase the index) to the nearest block start, call it L', and move R to the left (decrease the index) to the nearest block end, call it R'. We can then calculate the value for the shrunken range [L', R'] using the method outlined in the previous slide. Since, in the worst case, we’re going over each of the ceil(N/K) blocks, this part takes O(ceil(N/K)) time.

To compute the values for the leftover pieces [L, L') and (R', R], we can simply iterate through those elements naively. Each piece lies within a single block, so its size is less than K, meaning we need at most 2*K iterations in total. This yields O(2*K) runtime.

Combining the two steps, we can get a total runtime of O(ceil(N/K) + 2*K).
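Putting both steps together for SUM queries, a general-range query might be sketched like this (the names and array values are illustrative, not from the slides):

```python
arr = [5, 2, 7, 1, 9, 3, 4, 8, 6]  # hypothetical example
K = 3
block_sum = [sum(arr[b * K:(b + 1) * K]) for b in range(-(-len(arr) // K))]

def query(l, r):
    """Sum over an arbitrary inclusive range [l, r]."""
    total = 0
    # Left partial piece: walk right until we reach a block boundary (O(K)).
    while l <= r and l % K != 0:
        total += arr[l]
        l += 1
    # Fully covered blocks: use the precomputed sums (O(ceil(N/K))).
    while l + K - 1 <= r:
        total += block_sum[l // K]
        l += K
    # Right partial piece (O(K)).
    while l <= r:
        total += arr[l]
        l += 1
    return total

print(query(1, 7))  # 2 + 7 + (1+9+3) + 4 + 8 = 34
```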

8 of 14

Point Updates (single element)

Fortunately, unlike STs, point updates are made very easy with SQRT decomposition.

By construction, each element belongs to exactly one block. In particular, an element at index i belongs to block_index = i / K, where K is the block size. This means that, to process an update, we only need to refresh the value for one block!

Thus, to perform updates, we can simply update the value in the array, then recalculate the value for the block. Since the block is at most of size K, it will take O(K) time.
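For SUM queries, a point update can be sketched as follows (hypothetical example array):

```python
arr = [5, 2, 7, 1, 9, 3, 4, 8]  # hypothetical example
K = 3
block_sum = [sum(arr[b * K:(b + 1) * K]) for b in range(-(-len(arr) // K))]

def update(i, value):
    """Write the new value, then rebuild the one affected block in O(K)."""
    arr[i] = value
    b = i // K  # the only block containing index i
    block_sum[b] = sum(arr[b * K:(b + 1) * K])

update(4, 0)      # index 4 lives in block 4 // 3 = 1
print(block_sum)  # [14, 4, 12] -- only block 1 was recomputed
```

As an aside, for SUM specifically the block value could be adjusted in O(1) by adding the delta, but the O(K) rebuild shown here also works for queries like MINIMUM.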

On the surface, this looks much worse than ST, since it’s technically linear, but we have the freedom of choosing K.

9 of 14

Choosing a value for K

As mentioned, the value we choose for K is quite significant. Queries take O(ceil(N/K) + 2*K) time, whereas updates take O(K) time.

An optimal choice of K makes these two operations run in roughly the same time. With a little simplification and math, we can find the (usually) optimal value of K.

10 of 14

Optimal Value of K

First, assume that the factor of 2 and the ceil function make a negligible difference. Since we want updates and queries to take the same time, we have:

O(N/K) = O(K)

We can simply rewrite this as:

N/K = K

Multiplying through by K, we get:

N = K^2

Since we want to solve for K, we take the SQRT (!!) of both sides, which yields:

K = sqrt(N)

Hence, K = sqrt(N) provides the optimal block size. Note that this is where the “SQRT” in “SQRT decomposition” comes from.

11 of 14

Python Implementation
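The slide’s code is not reproduced here, but a self-contained sketch tying everything together might look like this (the class and method names are made up for illustration):

```python
import math

class SqrtDecomposition:
    """Range-SUM queries and point updates in O(sqrt(N)) each.
    A sketch: for MINIMUM queries, store a per-block min instead."""

    def __init__(self, arr):
        self.arr = list(arr)
        self.k = max(1, math.isqrt(len(self.arr)))  # block size K = sqrt(N)
        self.block_sum = [0] * math.ceil(len(self.arr) / self.k)
        for i, v in enumerate(self.arr):
            self.block_sum[i // self.k] += v

    def update(self, i, value):
        # For SUM, adjusting by the delta avoids rescanning the block.
        self.block_sum[i // self.k] += value - self.arr[i]
        self.arr[i] = value

    def query(self, l, r):
        """Sum over the inclusive range [l, r]."""
        k, total = self.k, 0
        while l <= r and l % k != 0:   # left partial block
            total += self.arr[l]
            l += 1
        while l + k - 1 <= r:          # fully covered blocks
            total += self.block_sum[l // k]
            l += k
        while l <= r:                  # right partial block
            total += self.arr[l]
            l += 1
        return total
```

For example, SqrtDecomposition([5, 2, 7, 1, 9]).query(1, 3) returns 2 + 7 + 1 = 10.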

12 of 14

Deviations on K

In practice, K = sqrt(N) isn’t always the optimal choice of K. Sometimes there are more updates than queries, in which case a smaller value of K can be better (updates are linear in K [O(K)]).

For these reasons, the “optimal” value of K varies from problem to problem. Usually, K = sqrt(N) is the most consistent, but sometimes isn’t enough. To AC some problems, you may need to try smaller and larger values of K to try and optimize your code.

13 of 14

Why use it?

Since SQRT decomposition buckets the array into N / K = N / sqrt(N) = sqrt(N) blocks, it only needs O(sqrt(N)) extra space, not including the O(N) array.

However, both its range-query and update operations run in a beefy O(sqrt(N)), whereas ST runs in O(log2(N)), which is much faster (for N = 10^18, log2(N) ≈ 60, whereas sqrt(N) = 10^9), so why use it?

The redeeming factor of SQRT decomposition is that, when using C++, the compiler knows how to optimize its simple, tight loops. This optimization is called vectorization (using SIMD instructions), and it can make your code significantly faster.

In particular, if your code runs in O(f(N)) and is vectorized, it may in practice behave closer to f(N)/64 operations. Though this is only a constant factor, it can make a substantial difference in runtime, allowing some TLE solutions to AC. This is one of the reasons most competitive programmers code in C++ instead of Python.

14 of 14

Problems

https://dmoj.ca/problem/segtree (try using SQRT decomposition as practice)

https://dmoj.ca/problem/ccc05s5 (like the above, it can be tackled without SQRT, but try it with SQRT for practice)

Since SQRT decomposition answers range queries, most query problems can be approached with it. Feel free to revisit the problems listed in the previous slideshows, such as Fenwick Tree and ST, and solve them with SQRT.