1 of 19

Data-Driven Decision Making: Managing Risk, Diversity & Model Auditing

Riddho R. Haque

Supervised by Peter J. Haas & Alexandra Meliou

VLDB PhD Workshop 2025

2 of 19

Collaborators

Anh L. Mai

Matteo Brucato

Azza Abouzied

Peter J. Haas

Alexandra Meliou

Marco Serafini

Funded By:

3 of 19

Data-Driven Decisions Are Important

Data-Driven Decision Making

Finance

Transportation

Manufacturing

Healthcare

Journalism

4 of 19

Current decision-making workflows are not enough

Data:

  • Uncertain
  • Multimodal
  • Big

Solvers / Optimizers:

  • Main Memory Bound
  • Can’t handle many dimensions.
  • Managing risks, diversity, etc. scalably

Predictive AI Models:

  • Might be stealing your data
  • Uninterpretable
  • Unpoliced

5 of 19

Why use databases to make your decisions?

  • Keep your data in the DB. [SIGMOD 08, SIGMOD 13]
  • Application-independent, declarative SQL extensions for decision-making. [VLDB 16, SIGMOD 20, VLDB 25]
  • Scale beyond main-memory limitations. [VLDB 16, SIGMOD 20, VLDB 24, VLDB 25]

[Figure labels: DB, Main Memory]

6 of 19

What’s next for In-DB Decision-Making?

Scaling to Large Probabilistic Relations

Decisions based on multimodal data.

Interpretability, transparency, and explainability.

7 of 19

My PhD Dissertation Work

Large-Scale Stochastic Optimization [VLDB 25]

Multistage Decision-Making

Explainability

Diversity-Aware Decision-Making (from multimodal data)

Auditing ML-Driven Decisions

8 of 19

Part I: Large-Scale Stochastic Optimization

9 of 19

Large-Scale Stochastic Optimization

Which stocks should I invest in?

I want high returns, but I fear taking risks.

In-DB Solver [VLDB 25]

SQL extensions that allow specifying risk constraints and objectives. [SIGMOD 20, VLDB 25]

Created Portfolio

Company   Sell In (Days)   How many shares?
GOOG      278              1
MSFT      648              2
STBZ      341              3
EQS       614              1
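
To make the idea of a risk constraint concrete, here is a minimal Python sketch that evaluates a candidate portfolio against sampled scenarios. It is not the in-DB solver or the SQL extension from the cited papers. The function names, the scenario values, and the "loss of at most 5 with probability at least 0.9" threshold are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List

Scenario = Dict[str, float]  # hypothetical per-share gain of each stock in one sampled future


def total_gain(shares: Dict[str, int], scenario: Scenario) -> float:
    """Gain of the whole portfolio in one scenario."""
    return sum(count * scenario[ticker] for ticker, count in shares.items())


def expected_gain(shares: Dict[str, int], scenarios: List[Scenario]) -> float:
    """Objective: average gain over all scenarios."""
    return sum(total_gain(shares, s) for s in scenarios) / len(scenarios)


def satisfies_risk_constraint(shares: Dict[str, int], scenarios: List[Scenario],
                              max_loss: float, min_prob: float) -> bool:
    """True if the loss stays within max_loss in at least min_prob of the scenarios."""
    ok = sum(1 for s in scenarios if total_gain(shares, s) >= -max_loss)
    return ok / len(scenarios) >= min_prob


# Made-up scenarios; real ones would be sampled from a probabilistic relation.
SCENARIOS = [
    {"GOOG": 4.0, "MSFT": 1.0},
    {"GOOG": -6.0, "MSFT": 0.5},
    {"GOOG": 2.0, "MSFT": -1.0},
    {"GOOG": 3.0, "MSFT": 1.5},
]
RISKY = {"GOOG": 3, "MSFT": 0}
BALANCED = {"GOOG": 1, "MSFT": 2}

for name, portfolio in (("risky", RISKY), ("balanced", BALANCED)):
    print(name, expected_gain(portfolio, SCENARIOS),
          satisfies_risk_constraint(portfolio, SCENARIOS, max_loss=5.0, min_prob=0.9))
```

In this toy data the riskier portfolio has the higher expected gain but violates the constraint, which is the trade-off a risk-constrained solver has to navigate.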

10 of 19

In-DB Solvers for Large-Scale Stochastic Optimization [VLDB 25]

  • RCLSolve finds linear approximations to non-convex risk constraints.
  • Stochastic SketchRefine uses Divide and Conquer (with RCLSolve solving its subproblems) to scale beyond in-memory ILP solvers.

Simpler feasible set → faster to solve

More details tomorrow (Sep 2, 2025), 3:45 – 5:15 pm, 4F Wordsworth, Research-21

11 of 19

Part II: Diversity-Aware Optimization

12 of 19

The Need For Diversity: Sampling Tweets

“How the Internet Reacted to the Arab Spring”

“Get me 50 tweets that fit into the front page, with as many likes as possible”

In-DB Solver

Sample of Tweets

The retrieved tweets may be too similar.

The retrieved tweets may not cover all opinions.

13 of 19

Creating diverse and representative samples

Balancing optimality with diversity and coverage.

Diversity: No two sampled tweets are ‘too close’.

Coverage: No unsampled point is ‘too far’ from sampled points.

Text to embedding

Measuring similarity between tweets.

High-dimensional embeddings
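
The two conditions above can be written as simple checks over embedding vectors. The following Python sketch assumes Euclidean distance and user-chosen thresholds `min_dist` and `max_dist`. Those names, the distance metric, and the toy 2-D points standing in for high-dimensional tweet embeddings are illustrative assumptions, not details taken from the talk.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def is_diverse(sample: List[Point], min_dist: float) -> bool:
    """Diversity: no two sampled points are closer than min_dist."""
    return all(
        math.dist(a, b) >= min_dist
        for i, a in enumerate(sample)
        for b in sample[i + 1:]
    )


def covers(sample: List[Point], population: List[Point], max_dist: float) -> bool:
    """Coverage: every point in the population is within max_dist of some sampled point."""
    return all(
        any(math.dist(p, s) <= max_dist for s in sample)
        for p in population
    )


# Toy 2-D "embeddings" forming three clusters.
POPULATION = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.2, 0.0), (10.0, 0.0)]
SAMPLE_A = [(0.0, 0.0), (0.1, 0.0)]               # two near-duplicates from one cluster
SAMPLE_B = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]  # one point per cluster

print(is_diverse(SAMPLE_A, 1.0), covers(SAMPLE_A, POPULATION, 1.0))
print(is_diverse(SAMPLE_B, 1.0), covers(SAMPLE_B, POPULATION, 1.0))
```

`SAMPLE_A` fails both checks, while `SAMPLE_B` passes both. Both checks use brute-force pairwise comparisons, so this only illustrates the definitions and is not a scalable method.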

14 of 19

Creating diverse and representative samples

Diversity Constraint

Coverage Constraint

Orders of magnitude runtime improvement over past techniques. [ICDT 22, SDM 23]

Minimum Set Cover

Diverse Sampling

Scalable Optimization
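
As a point of reference for how diversity and coverage can be satisfied together, here is a sketch of a classic greedy heuristic. It is not the scalable method referred to on this slide. It visits points in descending score order and keeps a point only if it is at least `radius` away from every point kept so far. The function name, the use of like counts as scores, and the toy data are hypothetical, and the sketch ignores any limit on sample size.

```python
import math
from typing import List, Sequence


def greedy_diverse_cover(points: List[Sequence[float]], scores: List[float],
                         radius: float) -> List[int]:
    """Return indices of a sample whose members are pairwise >= radius apart.

    Every skipped point was skipped because it lies within radius of a kept
    point, so the result also covers the whole input at distance radius.
    """
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= radius for j in chosen):
            chosen.append(i)
    return chosen


# Toy 2-D "embeddings" and made-up like counts.
POINTS = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.4, 0.0), (8.0, 0.0)]
LIKES = [10, 50, 5, 7, 1]

print(greedy_diverse_cover(POINTS, LIKES, radius=1.0))  # → [1, 3, 4]
```

The heuristic makes a quadratic number of distance computations in the worst case, which illustrates why approaches of this kind need help to scale to large, high-dimensional data.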

15 of 19

Part III: Auditing ML-Driven Decisions

16 of 19

ML Models Are Increasingly Used in In-DB Decision-Making

  • Predicting Future Stock Prices
  • Filtering Toxic Content from Social Media: Hate Speech, Racism, Spam, Misinformation

17 of 19

Are ML Models Violating Copyrights/Privacy?

Trained on Copyrighted Art?

Targeted Ads Based On Private Conversations?

The New Yorker, 2025

18 of 19

Auditing Binary Classifiers

[Figure: scatter plots of training points labeled + and -, with oracle query points marked ?]

Different Possible Decision Boundaries given a training set.

Oracle queries tell us where the decision boundary passes through.

Unexplained perturbations → Model could have been trained on ‘hidden’ data points.
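
One simple way an oracle query can reveal where the decision boundary passes is a binary search along the segment between a point labeled negative and a point labeled positive. The Python sketch below is a generic illustration, not the auditing procedure from this work. The `oracle` callable, the toy linear classifier, and the assumption that the label flips exactly once along the segment are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def locate_boundary(oracle: Callable[[Point], bool],
                    negative: Sequence[float], positive: Sequence[float],
                    tol: float = 1e-6) -> Point:
    """Binary-search the segment negative -> positive for the label flip.

    Assumes oracle(negative) is False, oracle(positive) is True, and the
    label changes exactly once along the segment.
    """
    def at(t: float) -> Point:
        # Point at fraction t of the way from negative to positive.
        return tuple(n + t * (p - n) for n, p in zip(negative, positive))

    lo, hi = 0.0, 1.0  # at(lo) is labeled negative, at(hi) positive
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(at(mid)):
            hi = mid
        else:
            lo = mid
    return at((lo + hi) / 2)


def toy_oracle(x: Point) -> bool:
    """Stand-in for a black-box classifier: positive when x0 + x1 > 1."""
    return x[0] + x[1] > 1.0


crossing = locate_boundary(toy_oracle, (0.0, 0.0), (2.0, 2.0))
print(crossing)  # close to (0.5, 0.5)
```

Each call uses about log2(1/tol) oracle queries. Repeating the search for many negative/positive pairs traces out the boundary, which can then be compared with the boundaries that the known training set would explain.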

19 of 19

Summary

  • Our in-database algorithms enable decision-making that is:

    • Scalable
    • Risk-Constrained
    • Diversity-Aware

  • They can work with different data types, such as:

    • Probabilistic
    • Multimodal

  • Tools are needed to audit the provenance of ML-derived data used in decision-making.
