3 of 72

[Review]�Determining Serializability:

Conflicting Actions

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

4 of 72

Conflicting Actions

Def: Two actions conflict if:

They are two different, concurrent transactions.
They reference the same object.
At least one is a write.

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

#olap

5 of 72

How do we know if a schedule is serializable?

We like serializable schedules.

Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:

the txns appear to have followed a serial schedule.
that txns can be “rolled back” one-by-one.

Conflicting actions between transactions will determine if a schedule is serializable.

What does it mean???

Let’s dive in!

#olap

6 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

🤔

#olap

7 of 72

Which of the following are conflicting actions?

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

R₂(P)

W₁(P)

T2 reads �Parth salary �as ????

T1 writes�Parth salary �as ????

W₂(G)

R₁(G)

T2 writes Gabi salary as ????

T1 reads Gabi salary as ????

W₂(J)

W₁(J)

T2 writes Jonah salary as ????

T1 writes Jonah salary as ????

R₂(G)

R₁(G)

T2 reads Gabi salary as ????

T1 reads Gabi salary as ????

W₂(P)

W₁(J)

T2 writes Parth salary as ????

T1 writes Jonah salary as ????

Suppose T1 → T2 in a schedule. For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

hypothetically

#olap

8 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

#olap

9 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

W₁(P)

R₂(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R₁(G)

W₂(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W₁(J)

W₂(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R₁(G)

R₂(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W₁(J)

W₂(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

can be flipped! no conflicts!

cannot be flipped! conflicting actions!

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

#olap

10 of 72

How do we know if a schedule is serializable?

We like serializable schedules.

Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:

the txns appear to have followed a serial schedule.
that txns can be “rolled back” one-by-one.

A schedule is serializable if all conflicting actions dictate a specific ordering of the transactions (with no cycles)

A topological sort on the graph of conflicts between transactions.

Conflicting actions between transactions determine if a schedule is serializable.

#olap

11 of 72

From the board

#olap

12 of 72

The previous slide, in table form

T₁:

R₁(J)
W₁(P)

R₂(P)
W₂(P)

T₂:

T₁	T₂
	R₂(P)
R₁(J)
	W₂(P)
W₁(P)

serializable

T₁	T₂
	R₂(P)
R₁(J)
W₁(P)
	W₂(P)

unserializable

T₁	T₂
R₁(J)
	R₂(P)
	W₂(P)
W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

R₂(P)

W₁(P)

W₂(P)

W₁(P)

⚠️

serializable

Schedule 4: The two pairs of conflicting actions imply two different orders to T1 and T2.

#olap

13 of 72

Performance Tradeoffs: Snapshot Isolation

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

14 of 72

Final thoughts: Why Should Data Engineers Know Transactions?

You may think that transactions (and serializability) are very much in the weeds of DBMS design, which we don’t particularly implement in this course. However…

Inevitably you will update a database and manage data from transactional databases!

This means you should have a sense of its characteristics.

If your DB is slow for transactional reasons:

You should understand why
And how you can trade-off speed and “correctness,” i.e., redefine your transactions.

Finally, transaction concepts are also quite useful outside of databases.

Examples: Queueing systems, e.g., RabbitMQ or Kafka.

#olap

15 of 72

Serialized Transactions: A summary

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

Strict 2PL is a common implementation of serialization, though it is not the only one. [Knowing the implementation is optional, see previous lecture]

Life is good?…Except…

SELECT avg(gpa) FROM students;

Locks all students!
But we likely don’t need this to be 100% correct!

Sometimes we prefer to trade correctness for a little more performance.

#olap

16 of 72

Approximating Serialized Transactions with Weak Isolation

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

Strict 2PL is a common implementation of serialization, though it is not the only one.

Enter: Weak Isolation.

Each isolation can choose to be a “bit sloppy”...
…as long as it doesn’t mess up other transaction’s choices to do so.
The most common weak isolation implementation is snapshot isolation.
This is a much weaker property of isolation than serialized transactions,�but it’s good enough when we prefer more concurrency/higher performance.

Life is good?…Except…

SELECT avg(gpa) FROM students;

Locks all students!
But we likely don’t need this to be 100% correct!

Sometimes we prefer to trade correctness for a little more performance.

#olap

17 of 72

Snapshot Isolation

Snapshot isolation is a weaker form of isolation than serialization, but it’s good enough when we prefer more concurrency/higher performance.

Database system requirements: Keep multiple versions of tuples.
Postgres’ approach: Multiversion Concurrency Control

At transaction start: Take a “snapshot” of the database, off which to do reads/writes.

snapshot reads: All reads of this transaction are from this snapshot.
write validation: This transaction can commit if none of its writes conflict with other transactions since the snapshot was taken.

If write-write conflicts, then abort this transaction.

#olap

18 of 72

Snapshot Isolation is Actually Popular

Isolation levels (both default and maximum) vary in support across different database engines.

Marketing also varies!

When Oracle says “Serializable,” they actually are giving you Snapshot Isolation!!

The maximum levels of many cloud DBMSs is not always the theoretical maximum,�which is “serializable” transactions.

Serializable: Google Cloud Spanner, CockroachDB, Azure SQL Server
Read Commit: Snowflake, AWS Aurora

For more about Read Commit and others, check out the bonus slides.

#olap

19 of 72

BI: OLAP and OLTP

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

20 of 72

Recall: Data warehouse produced as the result of an ELT process.

Data Integration

Load

Transform

Extract

Data Warehouse

Raw Data

Transactions

Sensors

Log Files

Experiments

Source of Truth

Governed

Secure

Audited�Managed

Usually, the purpose of such a data system targets large scale analytics on “clean” data.

#olap

21 of 72

Recall: Data warehouse produced as the result of an ELT process

Data Lake

Extract

Load

Transform

Data Preparation /

Integration

Data Warehouse

Raw Data

Transactions

Sensors

Log Files

Experiments

Source of Truth

Governed

Secure

Audited�Managed

Use-Case-Specific

Fit for purpose

Self-Service

Usually, the purpose of such a data system targets large scale analytics on “clean” data.

#olap

22 of 72

Data Warehouses, OLAP, and BI

Business Intelligence (BI): enterprise technologies for data analysis and management.

Aka Decision Support
Generally leverages data warehouses!

OLTP, or OnLine Transaction Processing is another common technology.

Read, insert, update, and delete data. Support most things we’ve seen in the course.
Read-write queries on a smaller amount of data, say, gigabytes.
Example: append a new car sale into the sales table.

OLAP, or OnLine Analytical Processing is one huge technology under the BI umbrella.

Read/summarize large volumes of multidimensional data to understand trends and patterns.
Read-only queries on, say, petabytes of multidimensional data.
Example: total sales of each type of Honda car over time for each county.

OLAP and OLTP are generally two separate database management systems that store and process volumes of data in different ways across the same ecosystem.

#olap

23 of 72

BI Systems (and Visualization)

2023 MAD (ML/AI/Data) Landscape: blog, interactive

#olap

24 of 72

Example from Amazon AWS

https://aws.amazon.com/compare/the-difference-between-olap-and-oltp/

We’ll disambiguate some of this terminology today.

typical business architecture
common OLAP operations
multidimensional data model for OLAP: data cubes, facts and dims
Implementation: star/ snowflake schema

#olap

25 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Post-hoc large-scale analysis (OLAP) happens separately from updates (OLTP).

OLAP is performed in a data warehouse
OLTP happens in the per-region database
OLAP separate from critical path of OLTP.

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

OLAP support

periodic updates

#olap

26 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Then, consolidation of data updates into the warehouse happens periodically via ETL.

Extract useful business info to be summarized
Transform it (i.e., canonicalize/clean up)
Load it into the warehouse

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

periodic updates

OLAP support

#olap

27 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Note: OLAP will suffer some staleness, but OK in practice.

Updates like sales are crucial.
Better to have sales never be prevented than for a manager’s report to generate promptly.
After all, large-scale OLAP read queries would otherwise delay updates.
Analysis may not be useful in real-time (e.g. looking at “yesterday’s sales”, we don’t want to see only part of “today”)

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

OLAP support

periodic updates

#olap

28 of 72

CalAnswers: BI Demo

https://calanswers.berkeley.edu/home

https://calanswers-bi.berkeley.edu/analytics/saw.dll?Portal&PortalPath=/shared/Maintenance/Home

Summary

So Far

29 of 72

Data Cubes and OLAP Queries

[Review] Serializability�—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

30 of 72

Multidimensional Data: Data Cubes

OLAP systems fundamentally provide comprehensive, summarized views of data in data warehouses.

This summarization can and will happen across multiple dimensions of data. One such representation is a data cube, aka OLAP cube.

#olap

31 of 72

Multidimensional Data: Data Cubes

Our running example today is a 3-dimensional cube that computes sales:

(item, store, customer, number)

where number is the measure �(i.e., sales count) per customer� per store per item.

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

NYC_2

SF_3

LA_1

This cube generalizes to the >3D case! Think about it at home.

#olap

32 of 72

OLAP Queries

(item, store, customer, number)

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

1. Cross-tab

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

#olap

33 of 72

[Exercise] OLAP Queries

🤔

1. Cross-tab, i.e., pivot table

2(i). Slice to get equality condition � on one or more dimensions.

2(ii).Dice to get range partition � on one or more dimensions.

3(i). Roll-up and�3(ii).drill-down� adjust granularity to � summarize at different � levels of a dimension hierarchy.

Match the OLAP operation to the diagram. Format: 1A

#olap

34 of 72

Match the OLAP operation to the diagram. Format: 1A or 2(i)-A, etc.

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

35 of 72

(1/3) OLAP Queries: Cross-tab

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

1. Cross-tab, i.e., pivot table

Example: number of sales �per item per person

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

#olap

36 of 72

(1/3) OLAP Queries: Cross-tab

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

Remy

Jas

Tracy

Yuan

1. Cross-tab, i.e., pivot table

Example: number of sales �per item per person
Example: number of sales �per item per store

shirt

jacket

jeans

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

#olap

37 of 72

(2/3) OLAP Queries: Slice and Dice

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

1. Cross-tab, i.e., pivot table

Example: number of sales �per item per person
Example: number of sales �per item per store

3. Roll-up and drill-down

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan

dice

slice

2. Slice and dice i.e.,� get equality and range partitions� (respectively) on one or more � dimensions

Dicing is particularly useful for combining date dimensions (e.g., by month, day of week, years, weeks, etc.)

#olap

38 of 72

(2/3) OLAP Queries: Slice and Dice

(item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension � hierarchy

roll up to stores by state

drill down to item by item name

#olap

39 of 72

(3/3) OLAP Queries: Roll-up, Drill-down

Q: Can the cube be stored at this granularity?

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension

hierarchy

drill down to item by item name

#olap

40 of 72

(3/3) OLAP Queries: Roll-up, Drill-down

Q: Can the cube be stored at this granularity?

drill down to item by item name

Data cube elements need to be stored at the appropriate granularity to support the expected OLAP queries.

A: ⚠️ No, because it’s impossible to drill down from item category to the finer granularity item name!

shirt category: polo, blouse, turtleneck,�etc.

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension

hierarchy

#olap

41 of 72

Hierarchies for Setting the Partition Granularity

The partitioning granularity can be set based on user needs…

Different partitioning may be useful for different applications.

Examples:

Q: I want the aggregates per month.

What if we computed a data cube based on Full Date?

We can roll up to access any part of the hierarchy.

What about if we had computed a data cube based on Year?

No longer possible to drill down to month!

Datetime

Hour

Full Date

Day of week

Month

Specific Month

Specific Quarter

Year

Quarter

City

State

Country

Full Date

Year

#olap

42 of 72

Summary: Data Cubes and OLAP Queries

OLAP systems fundamentally provide comprehensive, summarized views of data in data warehouses.

Amazon OLAP vs. OLTP [link]

One such representation is a data cube, aka OLAP cube.

Each element in the cube represent a measure with respect to multiple dimensions of data.
The data cube is a multi-dimensional array of measures (e.g., tensor of sales numbers).
Generalizes to >3D (why?)

OLAP queries are summarizations, i.e., various aggregate operations of the data cube.

Cross-tabs
Slicing and dicing
Roll-ups, Drill-downs

Cubes need to be stored at the appropriate granularity to support the expected OLAP queries.

#olap

43 of 72

Star Schema, Snowflake Schema

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

44 of 72

Data Cubes and physical layout

How do we actually store our data cube?

Recall from our tensors lecture:

Dense tensors are not efficient storage.

Many elements are likely nulls, i.e., not measured!

❌

#olap

45 of 72

Data Cubes and physical layout

How do we actually store our data cube?

Recall from our tensors lecture:

Dense tensors are not efficient storage.

Many elements are likely nulls, i.e., not measured!

However, tensors have sparse representations �in the form of relations!

OLAP in a relational database:

mental model: data cube, a multidimensional tensor
physical model: star schema or snowflake schema of relations

Note: Not all OLAP DBMSes are relational! (more options later)

❌

✅

#olap

46 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: �(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

itemID1

itemID2

itemID3

storeID3

storeID2

storeID1

custID1

custID2

custID3

custID4

A data cube in the abstract (for some defined granularity of item, customer, and store)

#olap

47 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

itemID	storeID	custID	number
11	1	1	25
11	2	1	8
12	1	1	30
20	1	1	8
11	1	2	35
…	…	…	…

Sales

Fact table that maps dimensions�to measure of interest

#olap

48 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

storeID	city	state	country
1	SF	CA	USA
2	SF	CA	USA
5	NYC	NY	USA
…	…	…	…

Store

Dimension table that stores the store dimension’s attributes

itemID	storeID	custID	number
11	1	1	25
11	2	1	8
12	1	1	30
20	1	1	8
11	1	2	35
…	…	…	…

Sales

Fact table that maps dimensions �to measure of interest

Star schema:

One fact table
Many dimension tables

A dimension table stores the attributes of that dimension and functions as an explicit hierarchy.

#olap

49 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

Star schema:

One fact table
Many dimension tables

A dimension table stores the attributes of that dimension and functions as an explicit hierarchy.

#olap

50 of 72

[Exercise] True or False?

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

🤔

1. Sales is a junction table.

2. Sales and Customer have a many-to-one relationship.

3. There is a foreign key constraint between Store.storeID and Sales.storeID.

True or False?

#olap

51 of 72

True or False?

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

52 of 72

[Exercise] True or False?

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

1. Sales is a junction table.

2. Sales and Customer have a many-to-one relationship.

3. There is a foreign key constraint between Store.storeID and Sales.storeID.

True

#olap

53 of 72

Example OLAP Query, but as a SQL query

SELECT category, country, SUM(number)

FROM Sales� NATURAL JOIN Item� NATURAL JOIN Store

GROUP BY category, country

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

cross-tab

OLAP report queries GROUP BY some dimension attributes and aggregate measure attributes.

Example cross-tab:

#olap

54 of 72

From a Star Schema to a Snowflake Schema

A snowflake schema extends a star schema:

Still one fact table
Many dimension tables
Dimension tables are also connected to other dimension tables

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category, manuf_name)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

Manufacturer(name, address, owner)

foreign key

#olap

55 of 72

MOLAP, ROLAP, and HOLAP

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

56 of 72

Types of OLAP databases

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

Optimized multidimensional array (i.e., not dense)
Pre-compute materialized slices, cross-tabs, different cube granularities

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
No pre-computation, other than the materialized views specified by the user

#olap

57 of 72

[Exercise] MOLAP vs. ROLAP

MOLAP (Multidimensional OLAP)�Data cube stored as a multidimensional array storage

Relational OLAP (ROLAP)�Data cube is stored as a star (or snowflake) schema in relational DBMS

1. (+) Fast response to queries (due to precomputation)

2. (+) Supports generalized SQL tools

3. (+) Supports data that does not fit into the data cube model

4. (–) Expensive ETL time (due to precomputation)

5. (–) No OLAP-specific optimizations (other than what is provided already in SQL)

6. (–) Needs SQL knowledge to specify new materialized views

7. (–) Potential data redundancy

A. MOLAP

B. ROLAP

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1A

🤔

#olap

58 of 72

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1A

ⓘ

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

59 of 72

A third type of OLAP database

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

Optimized multidimensional array (i.e., not dense)
Pre-compute materialized slices, cross-tabs, different cube granularities

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
No pre-computation, other than the materialized views specified by the user

Hybrid OLAP (HOLAP) The best of both worlds.

Some data stored in relational storage; other data stored in specialized storage.
The exact proportions are left to the specific OLAP system + the designer.

#olap

60 of 72

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

(new) data cubes, star/snowflake schema, facts, measures, dimensions
(new) slicing/dicing
(review) hierarchies, roll-up, drill-down, cross-tabs
MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

#olap

61 of 72

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

(new) data cubes, star/snowflake schema, facts, measures, dimensions
(new) slicing/dicing
(review) hierarchies, roll-up, drill-down, cross-tabs
MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

One step closer to understanding jargon!!

Microsoft Learn / Power Platform / Power BI [link]

Amazon What is OLAP [link]

#olap

62 of 72

[Extra] OLAP in SQL

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

63 of 72

OLAP in SQL: CUBE

“NULL” is used to indicate “ALL”

SELECT item, color, SUM(number)

FROM Sales

GROUP BY CUBE (item, color)

Any attr. may be replaced with NULL(ALL)

May result in too many combinations

Item	Color	Number
Jacket	Blue	23
…	…	…
Jacket	Green	34
Jeans	Blue	28
…	…	…
Jeans	Green	17
Jacket	NULL	185
Jeans	NULL	200
…	…	…
NULL	Blue	94
NULL	Red	74
…	…	…
NULL	NULL	984

#olap

64 of 72

ROLLUP is an alternative to CUBE

ROLLUP targets a smaller # of combinations

SELECT item, color, size, SUM(number)

FROM Sales

GROUP BY ROLLUP (item), ROLLUP (color, size)

Every combination of:

specific item or ALL for the first rollup
for the second rollup

specific color and specific size
specific color and ALL sizes
ALL colors and ALL sizes

Thus, combinations include

{(item, color, size), (item, color), (item), (color, size), (color), ()}

#olap

65 of 72

Picking the right CUBE/ROLLUP query

First, why does this matter?

If this query is being run once on a PB- sized warehouse, it is important to get it right!
Results usually materialized and used in dashboards, presentations, spreadsheets, ….

Approach:

Think about all the ways you want to slice and dice your data
Pick granularity to recreate all aggregates you want, without blowing up the query result

Result size grows exponentially in the attrs; can be quite bad in large snowflake schemas
Known as the curse of dimensionality

#olap

66 of 72

[Extra] Implicit Dimensions

[Review] Serializability

Snapshot Isolation

—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

Lecture 22, Data 101, Fall 2024

67 of 72

Dates in OLAP

Example Schema:

Fact table: Sales (itemid, storeid, customerid, date, number, price)
Dim table: Iteminfo (itemid, itemname, color, size, category)
Dim table: Store (storeid, city, state, country)
Dim table: Customer (customerid, name, street, city, state)

Not very useful to “aggregate” date; better to treat date as an implicit dimension!

Allows us to group by and see trends across dates (e.g., sales by year)
Fact table: Sales (itemid, storeid, customerid, date, number, price)
Implicit Dim table: Dateinfo (date, month, quarter, year)

Store

Iteminfo

Customer

Sales

#olap

68 of 72

Normalization and Star Schemas

Example Schema:

Fact table: Sales (itemID, storeID, customerID, date, number, price)
Dim table: Item (itemID, itemname, color, size, category)
Dim table: Store (storeID, city, state, country)
Dim table: Customer (customerID, name, street, city, state)

Not very useful to “aggregate” date; better to treat date as an implicit dimension!

Allows us to group by and see trends across dates (e.g., sales by year)
From the “true Fact” table: Sales (itemID, storeID, customerID, date, number, price)
Implicit dimension table: DateInfo (date, month, quarter, year)
And perhaps another implicit dimension table on price (by date and item)

Store

Iteminfo

Customer

Sales

#olap

69 of 72

Implicit Dimension Tables and SQL queries

Example Schema:

Fact table: Sales (itemID, storeID, customerID, date, number, price)
Dim table: Item (itemID, itemname, color, size, category)
Dim table: Store (storeID, city, state, country)
Dim table: Customer (customerID, name, street, city, state)
Implicit dimension table: DateInfo (date, month, quarter, year)
etc.

Store

Iteminfo

Customer

Sales

Date

#olap

70 of 72

Implicit Dimension Tables and SQL queries

Example Schema:

Fact table: Sales (itemID, storeID, customerID, date, number, price)
Dim table: Item (itemID, itemname, color, size, category)
Dim table: Store (storeID, city, state, country)
Dim table: Customer (customerID, name, street, city, state)
Implicit dimension table: DateInfo (date, month, quarter, year)
etc.

What we might want:

SELECT category, country,� month, COUNT(number)

FROM Sales

NATURAL JOIN Iteminfo NATURAL JOIN � Store NATURAL JOIN Dateinfo

GROUP BY category, country, month

Store

Iteminfo

Customer

Sales

Date

Actual query might be (in PostgreSQL)

SELECT category, country, datepart(‘month’, date) as month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country, month

#olap

71 of 72

Extra graphics

…

tops

jackets

bottoms

USA

Cape Town

India

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

NYC_2

SF_3

LA_1

#olap

72 of 72

Extra graphics

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan

#olap