1 of 25

Scalable Package Queries in Relational Database Systems

Authors: Matteo Brucato, Juan Felipe Beltran, Azza Abouzied, Alexandra Meliou [Link]

Presented By: Swarad Ganesh Gat (114833402)

2 of 25

Problems with Traditional Databases

Traditional database queries define constraints that each tuple in the result must satisfy.
However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually.

3 of 25

Examples of practical problems require a collection of result tuples to satisfy

constraints collectively, rather than individually:

EXAMPLE 1 (MEAL PLANNER): A dietitian needs to design a daily meal plan for a patient. She wants a set of three gluten-free meals, between 2,000 and 2,500 calories in total, and with a low total intake of saturated fats.

EXAMPLE 2 (NIGHT SKY): An astrophysicist is looking for rectangular regions of the night sky that may potentially contain previously unseen quasars. Regions are explored if their overall redshift is within some specified parameters, and ranked according to their likelihood of containing a quasar.

4 of 25

Example: Meal Planning

→ A dietician wants to build a Meal Plan for a patient

Meal Plan:

3 gluten free recipes
At Least 2 Cal in total
Lowest total fat intake

SQL Query:

5 of 25

Self Joins are too expensive

→ This query is efficient only for constructing packages with very small cardinality: larger cardinality requires a larger number of self-joins, quickly rendering evaluation time prohibitive (Figure)

6 of 25

PaQL: The Package Query Language

A declarative query language that introduces simple extensions to SQL to define package semantics and package-level constraints.

→ We can express the query in the previous example with PaQL in the following way:

7 of 25

PaQL: The Package Query Language

Basic Semantics: The result of a package query is a set of packages. Each package resembles a relational table following the schema of relation R
Repetition Constraint: The REPEAT 0 statement in query Q specifies that no tuple from the input relation can appear multiple times in a package result
Base Predicates: : Any tuple in the package needs to individually satisfy the base predicate. They appear in the WHERE clause of the query
Global Predicates: Global predicates are the core of package queries, and they appear in the new SUCH THAT clause of the query
Objective Clause: The objective clause specifies a ranking among the candidate package results and appears with either the MINIMIZE or MAXIMIZE keywords.

8 of 25

ILP Formulation:

Terms:

Let R indicate the input relation
n = |R| the number of tuples in R
R.attr an attribute of R
P is a package
f a linear aggregate function (such as COUNT and SUM)
For each tuple ti from R, 1 ≤ i ≤ n, the ILP problem includes a nonnegative integer variable xi (xi ≥ 0), indicating the number of times ti is included in an answer package.

9 of 25

Translation Rules:

Repetition Constraint: The REPEAT keyword which is present in the FROM clause of the query implies 0 ≤ xi ≤ K +1.
Base Predicate: Let β be the base predicate. If R.gluten = “free” and Rβ the relation containing tuples from R satisfying β. We encode β by setting x_i=0 for every tuple t_i Rβ
Global Predicate: Each global predicate in the SUCH THAT cause takes the form f(P) v. For each predicate, we derive a linear function f’(x). For example, f(P) = COUNT(P.*) is linearly translated into f’(x) = 𝚺_ix_i. A summation constraint f(P) = SUM(P.attr) is translated into f’(x) = 𝚺_i(t_i.attr)x_i
Objective Clause: Encode MAXIMIZE f(P) as max f’(x̅)

10 of 25

Example of Translation:

Query:

11 of 25

Query Evaluation with DIRECT

Direct is a basic evaluation method for package queries. We employ 3 simple steps in the DIRECT algorithm:

ILP Formulation: Transform a PaQL query to an ILP problem using the translation rules
Base Relations: Compute the base relations such as Rβ, R_C and R_P. After this, all variables x_i such that x_i = 0 can be eliminated from the ILP problem since the corresponding tuple t_i cannot appear in any package solution. This reduced the size of the problem significantly.
ILP Execution: An off-the-shelf ILP solver is used as a black box to get the solution x̅∗ for all integer variables x_i. Each x̅∗ informs the number of times t_i should be included in the answer package.

12 of 25

Drawbacks of DIRECT

Only applicable if the input relation is small enough to fit into main memory (RAM). IBM’s CPLEX require the entire problem to be loaded into the memory before execution.
For problems that fit into the main memory, this approach might still fail due to the complexity of the problem. Modern ILP solvers use algorithms such as “branch-and-cut” that often perform well in practice but can choke even on small problem sizes due to their exponential worst-case complexity.

13 of 25

Scalable Package Evaluation:

SKETCHREFINE is an approximate divide-and-conquer evaluation technique for efficiently answering package queries on large datasets. SKETCHREFINE smartly decomposes a query into smaller queries, formulated them as ILP problems and employs an ILP solver as a “black-box” evaluation method to answer each query.

The algorithm avoids the drawbacks of the DIRECT algorithm. It is based on 2 important observations:

Similar tuples are likely to be interchangeable within packages
A group of similar tuples can therefore be “compressed” to a single representative tuple for the entire group.

14 of 25

Scalable Package Evaluation:

There are 3 main steps in the SKETCHREFINE algorithm:

Offline partitioning: The algorithm assumes a partitioning of the data into groups of similar tuples. The data is partitioned using k-dimensional quad trees.
Sketch: SKETCHREFINE sketches an initial package by evaluating the package query only over the set of representative tuples.
Refine: SKETCHREFINE then transforms the initial package into a complete package by replacing each representative tuple with some of the original tuples from the same group, one group at a time.

15 of 25

The SKETCHREFINE Algorithm:

16 of 25

The SKETCHREFINE Algorithm:

17 of 25

Offline Partitioning:

SKETCHREFINE relies on an offline partitioning of the input relation R. It is based on a set of k numerical partitioning attributes. It partitions the data using 2 parameters: a size threshold and a radius limit.

Size threshold: The size threshold τ restricts the size of each partitioning group to a maximum of τ original tuples.
Radius Limit: The radius r_j of a group is the greatest absolute distance between the representative tuple of the group G_j and every other original tuple of the group, across all partitioning attributes. ω is the radius limit.

18 of 25

Partitioning method

→ First, relation R is augmented with a group ID column gid, where ti.gid = j if tuple ti is in group Gj. Initially, all tuples are assigned to a single group G1 by setting ti.gid = 1.

Then, the procedure recursively:

1. Computes the size and radii of current groups by grouping tuples by gid.

2. Calculates the centroid tuple of each group Gj using the A partitioning attributes.

3. If group Gj exceeds the size threshold or radius limit, its tuples are partitioned into 2k subgroups (k = number of partitioning attributes).

20 of 25

REFINE:

→ Using the sketched solution over the representative tuples, the REFINE procedure iteratively replaces the representative tuples with tuples from the original relation R.

The algorithm derives package p̅_j from p_s by eliminating all instances of t̃_j.
It then replaces the eliminated representatives with the actual tuples.

21 of 25

Experiments:

Software:

The package implementation system is implemented on top of a traditional relational DBMS. The data resides in the database and the system interact with the DBMS via SQL (PostgreSQL). The core components of the evaluation are implemented in Python 2.7.
IBM’s CPLEX is used as the black-box ILP solver. When the algorithm needs to solve an ILP, the corresponding data is retrieved from the DBMS and passed to CPLEX.

22 of 25

Datasets and queries:

→ The performance of the query evaluation methods is tested using both real-world and benchmark data.

→ The real world data consists of 5.5 million tuples extracted from the Galaxy view of the Sloan Digital Sky Survey (link). For the benchmark dataset, the authors have used TPC-H (link)

→ 7 packaged queries were tried over the datasets.

→ Both SKETCHREFINE and DIRECT were compared.

→ The methods were judged on metrics like response time and approximation ratio.

→ Approximation ratio: As SKETCHREFINE always guarantees to return an approximate answer with respect to DIRECT. The objective values of both algorithms are divided to get the approximation ratio

23 of 25

Query performance as data set size increases:

24 of 25

Effect of varying partition size threshold

25 of 25

References:

Brucato, Matteo, Juan Felipe Beltran, Azza Abouzied, and Alexandra Meliou. "Scalable Package Queries in Relational Database Systems." Proceedings of the VLDB Endowment 9, no. 7 (2016).
Galaxy view of the Sloan Digital Sky Survey (link)
TPC-H (link)