1 of 39

CSE 344: Section 7

HW6 & Indexing

2 of 39

HW6

3 of 39

What is Flights App?

A client-server application

4 of 39

SQL Injection

Vulnerable web applications can expose capabilities to the user that are undesired

Dropping tables, inserting values, deleting values, gaining access when it should be rejected, etc.

Using PreparedStatement objects will help prevent SQL Injection

A PreparedStatement pre-compiles the SQL query such that inputs that would result in SQL Injection when parsed can no longer do so

SQL Injection Demo

5 of 39

PreparedStatements

How they work:

Your SQL Connection Object parses the String passed to prepareStatement() and pre-compiles it to a parameterized SQL query
The “?” in the query are interpreted as parameters that can be inserted via calls on the PreparedStatement Object

String rawQuery = “SELECT * FROM Flights WHERE origin_city = ? AND day_of_month = ?”;

PreparedStatement ps = conn.prepareStatement(rawQuery); // Pre-compiles the query into a PreparedStatement Object

ps.clearParameters(); // Clears parameters from previous use

ps.setString(1, originCity); // Sets the first parameter (the first “?”) to the value of the variable “originCity”

ps.setInt(2, dayOfMonth); // Sets the second parameter (the second “?”) to the value of the variable “dayOfMonth”

ResultSet rs = ps.executeQuery(); // Executes the query and stores the ResultSet in the variable “rs”

6 of 39

ResultSet

A ResultSet represents a table of data returned from executing a query
It maintains an internal cursor that points to the current row of data, initially that cursor is positioned before the first row so the first call to `next` moves the cursor to the first row

Makes it ideal for a while loop:

// ... continued from previous slide

ResultSet rs = ps.executeQuery();

while (rs.next()) { // check if there is another row and move to the next row

String destCity = rs.getString(“dest_city”); // Gets the value of the attribute “dest_city” for the current row

...

} // When `next` finally returns false, we know we’ve seen every row

rs.close(); // Remember to close the ResultSet

7 of 39

HW6 Notes

We will be using Azure again: you’ll need your server name, database name, username, and password to put in the dbconn.properties file

This file should not be submitted

Add test cases of your own, the ones that we provided are not exhaustive and do not test for the full set of capabilities we expect

There are private test cases - you can run them on Gradescope and see whether they pass, but won’t see tested commands or expected output
Private test case file names are a hint!

8 of 39

HW6 Notes part 2

Remember the tables you create is for global data stored for all users

Do not store local data for one user which might get deleted after the user quits the session

The Flights, Weekdays, Months, and Carriers tables should be read-only - you shouldn’t have to modify these.
Reminder to be careful about the order in which you create/delete tables!
Make sure to append your uw netid to your custom tables

9 of 39

HW6 Notes part 3

The `Flight` class is there for you to use if you like, feel free to modify it, add methods, etc.

You can also add internal classes into Query.java as you see fit

All your code must be in Query.java or PasswordUtils.java

It might be helpful to create a class that represents an Itinerary

If it implements the `Comparable` interface then you can use `Collections.sort` to sort it (this would be helpful for sorting one-hop and direct flights with one another)

10 of 39

HW6 Demo

11 of 39

Indexing

12 of 39

13 of 39

Definitions

Index - data structure that organizes data records on disk to optimize selections on the search key fields for the index
An index contains a collection of data entries, and supports efficient retrieval of all data entries with the given search key value k
A data entry is of the format (key = k, value = pointer to records)

14
15
16

14	tuple1

15	tuple2

16	tuple3

Data file sorted on age

Index File

Index on age

14 of 39

Definitions

An index is either clustered or unclustered, depending on how the actual data is sorted on disk
Clustered index - one that has the same key ordering as what is on disk (one per table)
Unclustered index - may exist without any ordering on disk (any number per table)

15 of 39

Definitions

An index is either clustered or unclustered, depending on how the actual data is sorted on disk
Clustered index - one that has the same key ordering as what is on disk (one per table)
Unclustered index - may exist without any ordering on disk (any number per table)

16 of 39

Why Indexing?

Consider you have a deck of 52 cards. I asked you to pick out the 8 of hearts. Calculate the average number of flips required if:

The deck is randomly shuffled.
The deck is separated into four piles, each representing a suit (spades, diamonds, clubs, hearts). Each pile is kept randomly face down.

17 of 39

Problem 1

To find the 8 of hearts from a randomly shuffled cards, on average I would require 26 flips through the deck, about half the deck.

18 of 39

Problem 2

There are 4 piles, one for each suit.
Disregard the other 3 piles, i.e. clubs, spades, diamonds.
In doing so we would require 2 flips, on average, to find the heart pile.
We're left with 13 cards in the heart pile to scan, on average we would require 7 flips to find the 8 of hearts.

19 of 39

Answer

Problem 1:

Scan 52 Cards (+26)

Average Flips = 26

Problem 2:

Pick suit (+2)
Scan 13 cards (+7)

Average Flips = 9

20 of 39

Further Thinking…

Can you think of an better index?
Would your index be clustered or unclustered?

21 of 39

22 of 39

Cost Estimation

23 of 39

Cost Estimation: Factors

B(R) = # blocks for relation R

T(R) = # tuples for relation R

V(R, a) = # of unique values of attribute a in relation R

M = # of available memory pages

24 of 39

Cost Estimation: Nested Loop Join (⋈)

Naive: B(R) + T(R)B(S)

for each tuple t1 in R do

for each tuple t2 in S do

if t1 and t2 join then output (t1,t2)

25 of 39

Cost Estimation: Nested Loop Join (⋈)

Block-at-a-time: B(R) + B(S)B(R)

for each block bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

26 of 39

Cost Estimation: Nested Loop Join (⋈)

Block-nested-loop: B(R) + (B(R)/(M-1))*B(S) ≅ B(R) + (B(R)/(M)) * B(S)

for each group of M blocks bR in R:

for each block bS in S:

for each tuple tR in bR:

for each tuple tS in bS:

if tR and tS can join:

output (tR,tS)

27 of 39

Cost Estimation: Hash Join (⋈)

R joined with S (assume R is smaller in size)

B(R) + B(S)

Assuming B(R) < M for one pass (look at each table once) efficiency, read all of R into a hash table and join with all of S

28 of 39

Cost Estimation: Sort-Merge Join (⋈)

B(R) + B(S)

One pass (look at each table once); Both must be small (B(R) + B(S) < M)

Why would we use this over Hash Join?

Tables are sorted on join attributes (No need to hash)
Range join instead of equijoin

29 of 39

Selectivity Formulas

Selectivity Factor (X) → Proportion of total data needed

Assuming uniform distribution of data values on numeric attribute a in table R, if the condition is:

a = c → X ≅1 / V(R, a)
a < c→ X ≅ (c - min(R, a))/ (max(R, a) - min(R, a))
c1 < a < c2→ X ≅ (c2 - c1)/ (max(R, a) - min(R, a))
cond1 and cond2 → X ≅ X1 * X2

30 of 39

Cardinality Estimation Example

31 of 39

Cardinality Estimation (Point Selection)

Assume a uniform distribution.

σ_{color = “orange”}

X ≅ 1 / V(color)

≅ 1/4

32 of 39

Cardinality Estimation (Union)

Assume conditions are disjoint.

σ_{color = “orange” OR color = “green”}

X ≅ 1 / V(color) + 1 / V(color)

≅ 1/4 + 1/4 = 1/2

33 of 39

Cardinality Estimation (Intersection)

Assume independence.

σ_{c1 = “red” AND c2 = “yellow”}

X ≅ 1 / V(c1) · 1 / V(c2)

≅ 1/2 · 1/2 = 1/4

c1

c2

34 of 39

Cardinality Estimation (Ranges)

Schema: Weather(date, pressure). Assume a uniform distribution.

σ_{pressure < p}

σ_{pressure > p}

σ_{p < pressure < q}

X ≅ (p − min) / (max - min)

X ≅ (max − p) / (max - min)

X ≅ (q − p) / (max - min)

min(pressure)

max(pressure)

p

q

35 of 39

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ (1 / (V(c1) · V(c2))) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4

c1

c2

Equijoins are tricky. The first guess we might have for the selectivity factor is X = 1/V(c1) * 1/V(c2). This isn’t quite right though, as 1/V(c1) * 1/V(c2) = 1/3 * 1/4 = 1/12. The correct answer we want is X = 3/12 = 1/4. What went wrong?

Let’s think of what 1/V(c1) * 1/V(c2) really means. From the “Point Selection” slide, we know that 1/V(c1) is the selection factor for choosing one specific color from c1. For example, the selectivity factor for σ (c1 = ‘red’) is 1/V(c1) = 1/3. The same goes for 1/V(c2). For example, σ (c2 = ‘red’) has a selectivity factor of 1/V(c2) = 1/4.

From the “Intersection” slide, we know that multiplying selectivity conditions (when they are independent, and in this case they are) is the same as the AND of them. So σ (c1 = ‘red’ AND c2 = ‘red’) has a selectivity factor of 1/3 * 1/4 = 1/12. If we look at the table, that’s exactly right: there is just one circle that is red for c1 and c2.

We don’t just want one circle here, though. The equijoin is like multiple selections at once: σ (c1 = ‘red’ AND c2 = ‘red’) + σ (c1 = ‘orange’ AND c2 = ‘orange’) + σ (c1 = ‘yellow’ AND c2 = ‘yellow’) = 1/12 + 1/12 + 1/12 = 1/12 * 3 = 3/12.

Is that why we multiply by min(V(c1), V(c2))? Yes! Think about this: the selectivity factor for each σ is always the same: 1/V(c1) * 1/V(c2). The only question is how many we need to add together. When the tables are different sizes, the smaller table is the one that constrains the join. You could add 100 new colors to c2 and it would not increase the cardinality of the join as c1 would not have any of those new colors. The cardinality only increases when V(c1) increases to match V(c2).

In conclusion, we get the formula 1/V(c1) * 1/V(c2) * min(V(c1), V(c2)), which we can simplify to 1/max(V(c1), V(c2)). (When we divide the smaller of V(c1) and V(c2) by both V(c1) and V(c2), we’re left with the larger of the two--max(V(c1), V(c2)).)

36 of 39

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/4 (didn’t change)

c1

c2

37 of 39

Cardinality Estimation (Equijoin)

Assume uniform distribution and full overlap.

⋈_{c1 = c2}

X ≅ 1 / (V(c1) · V(c2)) · min(V(c1), V(c2))

≅ 1 / max(V(c1), V(c2))

≅ 1/5 (decreased)

c1

c2

38 of 39

Summary

39 of 39

Worksheet