Scalable Package Queries in Relational Database Systems
Problems with Traditional Databases
Examples of practical problems require a collection of result tuples to satisfy
constraints collectively, rather than individually:
EXAMPLE 1 (MEAL PLANNER): A dietitian needs to design a daily meal plan for a patient. She wants a set of three gluten-free meals, between 2,000 and 2,500 calories in total, and with a low total intake of saturated fats.
EXAMPLE 2 (NIGHT SKY): An astrophysicist is looking for rectangular regions of the night sky that may potentially contain previously unseen quasars. Regions are explored if their overall redshift is within some specified parameters, and ranked according to their likelihood of containing a quasar.
Example: Meal Planning
→ A dietician wants to build a Meal Plan for a patient
Meal Plan:
SQL Query:
Self Joins are too expensive
→ This query is efficient only for constructing packages with very small cardinality: larger cardinality requires a larger number of self-joins, quickly rendering evaluation time prohibitive (Figure)
PaQL: The Package Query Language
→ We can express the query in the previous example with PaQL in the following way:
PaQL: The Package Query Language
ILP Formulation:
Terms:
Translation Rules:
Example of Translation:
Query:
Query Evaluation with DIRECT
Direct is a basic evaluation method for package queries. We employ 3 simple steps in the DIRECT algorithm:
Drawbacks of DIRECT
Scalable Package Evaluation:
SKETCHREFINE is an approximate divide-and-conquer evaluation technique for efficiently answering package queries on large datasets. SKETCHREFINE smartly decomposes a query into smaller queries, formulated them as ILP problems and employs an ILP solver as a “black-box” evaluation method to answer each query.
The algorithm avoids the drawbacks of the DIRECT algorithm. It is based on 2 important observations:
Scalable Package Evaluation:
There are 3 main steps in the SKETCHREFINE algorithm:
The SKETCHREFINE Algorithm:
The SKETCHREFINE Algorithm:
Offline Partitioning:
SKETCHREFINE relies on an offline partitioning of the input relation R. It is based on a set of k numerical partitioning attributes. It partitions the data using 2 parameters: a size threshold and a radius limit.
Partitioning method
→ First, relation R is augmented with a group ID column gid, where ti.gid = j if tuple ti is in group Gj. Initially, all tuples are assigned to a single group G1 by setting ti.gid = 1.
Then, the procedure recursively:
1. Computes the size and radii of current groups by grouping tuples by gid.
2. Calculates the centroid tuple of each group Gj using the A partitioning attributes.
3. If group Gj exceeds the size threshold or radius limit, its tuples are partitioned into 2k subgroups (k = number of partitioning attributes).
SKETCH:
REFINE:
→ Using the sketched solution over the representative tuples, the REFINE procedure iteratively replaces the representative tuples with tuples from the original relation R.
Experiments:
Software:
Datasets and queries:
→ The performance of the query evaluation methods is tested using both real-world and benchmark data.
→ The real world data consists of 5.5 million tuples extracted from the Galaxy view of the Sloan Digital Sky Survey (link). For the benchmark dataset, the authors have used TPC-H (link)
→ 7 packaged queries were tried over the datasets.
→ Both SKETCHREFINE and DIRECT were compared.
→ The methods were judged on metrics like response time and approximation ratio.
→ Approximation ratio: As SKETCHREFINE always guarantees to return an approximate answer with respect to DIRECT. The objective values of both algorithms are divided to get the approximation ratio
Query performance as data set size increases:
Effect of varying partition size threshold
References: