1 of 72

BI, OLAP, and Summarization

November 14, 2024

Data 101, Fall 2024 @ UC Berkeley

Lisa Yan, Michael Ball https://data101.org/fa24/

1

LECTURE 22

2 of 72

Join at slido.com�#OLAP

Click Present with Slido or install our Chrome extension to display joining instructions for participants while presenting.

3 of 72

[Review]�Determining Serializability:

Conflicting Actions

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

3

Lecture 22, Data 101, Fall 2024

4 of 72

Conflicting Actions

Def: Two actions conflict if:

  • They are two different, concurrent transactions.
  • They reference the same object.
  • At least one is a write.

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

4

5 of 72

How do we know if a schedule is serializable?

We like serializable schedules.

  • Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:
    • the txns appear to have followed a serial schedule.
    • that txns can be “rolled back” one-by-one.

5

Conflicting actions between transactions will determine if a schedule is serializable.

What does it mean???

Let’s dive in!

6 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

6

W1(P)

R2(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R1(G)

W2(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W1(J)

W2(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R1(G)

R2(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W1(J)

W2(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

A.

B.

C.

D.

E.

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

🤔

7 of 72

Which of the following are conflicting actions?

7

W1(P)

R2(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R1(G)

W2(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W1(J)

W2(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R1(G)

R2(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W1(J)

W2(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

R2(P)

W1(P)

T2 reads �Parth salary �as ????

T1 writes�Parth salary �as ????

W2(G)

R1(G)

T2 writes Gabi salary as ????

T1 reads Gabi salary as ????

W2(J)

W1(J)

T2 writes Jonah salary as ????

T1 writes Jonah salary as ????

R2(G)

R1(G)

T2 reads Gabi salary as ????

T1 reads Gabi salary as ????

W2(P)

W1(J)

T2 writes Parth salary as ????

T1 writes Jonah salary as ????

Suppose T1 → T2 in a schedule. For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

hypothetically

hypothetically

hypothetically

hypothetically

hypothetically

8 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

8

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

W1(P)

R2(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R1(G)

W2(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W1(J)

W2(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R1(G)

R2(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W1(J)

W2(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

A.

B.

C.

D.

E.

9 of 72

Which of the following are conflicting actions?

Alt Def: If T1 and T2 have conflicting actions, then every equivalent serial schedule (i.e., with the same database outcome) must have T1 and T2 in some specific order.

Suppose T1 → T2 in a schedule,�i.e., T1 comes before T2.

9

W1(P)

R2(P)

T1 writes �Parth salary �as 110000

T2 reads �Parth salary �as 110000

R1(G)

W2(G)

T1 reads Gabi salary as 100000

T2 writes Gabi salary as 121000

W1(J)

W2(J)

T1 writes Jonah salary as 300000

T2 writes Jonah salary as 0

R1(G)

R2(G)

T1 reads Gabi salary as 121000

T2 reads Gabi salary as 121000

W1(J)

W2(P)

T1 writes Jonah salary as 0

T2 writes Parth salary as 200000

can be flipped! no conflicts!

cannot be flipped! conflicting actions!

For which of the following would the resulting flip of actions mean that this transaction order would change, i.e., that now T2 → T1? Select all.

10 of 72

How do we know if a schedule is serializable?

We like serializable schedules.

  • Isolation, again: After the dust settles, transactions appear to have happened in some order (which may seem “arbitrary”). However, the order means that:
    • the txns appear to have followed a serial schedule.
    • that txns can be “rolled back” one-by-one.

A schedule is serializable if all conflicting actions dictate a specific ordering of the transactions (with no cycles)

  • A topological sort on the graph of conflicts between transactions.

10

Conflicting actions between transactions determine if a schedule is serializable.

11 of 72

From the board

11

12 of 72

The previous slide, in table form

12

T1:

R1(J)

W1(P)

R2(P)

W2(P)

T2:

3.

T1

T2

R2(P)

R1(J)

W2(P)

W1(P)

serializable

4.

T1

T2

R2(P)

R1(J)

W1(P)

W2(P)

unserializable

5.

T1

T2

R1(J)

R2(P)

W2(P)

W1(P)

R2(P)

W1(P)

W2(P)

W1(P)

R2(P)

W1(P)

W2(P)

W1(P)

R2(P)

W1(P)

W2(P)

W1(P)

⚠️

⚠️

serializable

Schedule 4: The two pairs of conflicting actions imply two different orders to T1 and T2.

13 of 72

Performance Tradeoffs: Snapshot Isolation

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

13

Lecture 22, Data 101, Fall 2024

14 of 72

Final thoughts: Why Should Data Engineers Know Transactions?

You may think that transactions (and serializability) are very much in the weeds of DBMS design, which we don’t particularly implement in this course. However…

14

Inevitably you will update a database and manage data from transactional databases!

  • This means you should have a sense of its characteristics.

If your DB is slow for transactional reasons:

  • You should understand why
  • And how you can trade-off speed and “correctness,” i.e., redefine your transactions.

Finally, transaction concepts are also quite useful outside of databases.

  • Examples: Queueing systems, e.g., RabbitMQ or Kafka.

15 of 72

Serialized Transactions: A summary

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

  • Strict 2PL is a common implementation of serialization, though it is not the only one. [Knowing the implementation is optional, see previous lecture]

15

Life is good?…Except…

  • SELECT avg(gpa) FROM students;
    • Locks all students!
    • But we likely don’t need this to be 100% correct!
  • Sometimes we prefer to trade correctness for a little more performance.

16 of 72

Approximating Serialized Transactions with Weak Isolation

Serialized transactions ensure ACID properties of shared access, particularly Isolation.

  • Strict 2PL is a common implementation of serialization, though it is not the only one.

16

Enter: Weak Isolation.

  • Each isolation can choose to be a “bit sloppy”...
  • …as long as it doesn’t mess up other transaction’s choices to do so.
  • The most common weak isolation implementation is snapshot isolation.
  • This is a much weaker property of isolation than serialized transactions,�but it’s good enough when we prefer more concurrency/higher performance.

Life is good?…Except…

  • SELECT avg(gpa) FROM students;
    • Locks all students!
    • But we likely don’t need this to be 100% correct!
  • Sometimes we prefer to trade correctness for a little more performance.

17 of 72

Snapshot Isolation

Snapshot isolation is a weaker form of isolation than serialization, but it’s good enough when we prefer more concurrency/higher performance.

  • Database system requirements: Keep multiple versions of tuples.
  • Postgres’ approach: Multiversion Concurrency Control

17

At transaction start: Take a “snapshot” of the database, off which to do reads/writes.

  • snapshot reads: All reads of this transaction are from this snapshot.
  • write validation: This transaction can commit if none of its writes conflict with other transactions since the snapshot was taken.
    • If write-write conflicts, then abort this transaction.

18 of 72

Snapshot Isolation is Actually Popular

Isolation levels (both default and maximum) vary in support across different database engines.

Marketing also varies!

  • When Oracle says “Serializable,” they actually are giving you Snapshot Isolation!!

18

The maximum levels of many cloud DBMSs is not always the theoretical maximum,�which is “serializable” transactions.

  • Serializable: Google Cloud Spanner, CockroachDB, Azure SQL Server
  • Read Commit: Snowflake, AWS Aurora
    • For more about Read Commit and others, check out the bonus slides.

19 of 72

BI: OLAP and OLTP

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

19

Lecture 22, Data 101, Fall 2024

20 of 72

Recall: Data warehouse produced as the result of an ELT process.

20

Data Integration

Load

Transform

Extract

Data Warehouse

Raw Data

Transactions

Sensors

Log Files

Experiments

Source of Truth

Governed

Secure

Audited�Managed

Usually, the purpose of such a data system targets large scale analytics oncleandata.

21 of 72

Recall: Data warehouse produced as the result of an ELT process

21

Data Lake

Extract

Load

Transform

Transform

Transform

Data Preparation /

Integration

Data Warehouse

Raw Data

Transactions

Sensors

Log Files

Experiments

Source of Truth

Governed

Secure

Audited�Managed

Use-Case-Specific

Fit for purpose

Self-Service

Usually, the purpose of such a data system targets large scale analytics oncleandata.

22 of 72

Data Warehouses, OLAP, and BI

Business Intelligence (BI): enterprise technologies for data analysis and management.

  • Aka Decision Support
  • Generally leverages data warehouses!

22

OLTP, or OnLine Transaction Processing is another common technology.

  • Read, insert, update, and delete data. Support most things we’ve seen in the course.
  • Read-write queries on a smaller amount of data, say, gigabytes.
  • Example: append a new car sale into the sales table.

OLAP, or OnLine Analytical Processing is one huge technology under the BI umbrella.

  • Read/summarize large volumes of multidimensional data to understand trends and patterns.
  • Read-only queries on, say, petabytes of multidimensional data.
  • Example: total sales of each type of Honda car over time for each county.

OLAP and OLTP are generally two separate database management systems that store and process volumes of data in different ways across the same ecosystem.

23 of 72

BI Systems (and Visualization)

23

2023 MAD (ML/AI/Data) Landscape: blog, interactive

24 of 72

Example from Amazon AWS

24

We’ll disambiguate some of this terminology today.

  • typical business architecture
  • common OLAP operations
  • multidimensional data model for OLAP: data cubes, facts and dims
  • Implementation: star/ snowflake schema

25 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Post-hoc large-scale analysis (OLAP) happens separately from updates (OLTP).

  • OLAP is performed in a data warehouse
  • OLTP happens in the per-region database
  • OLAP separate from critical path of OLTP.

25

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

OLAP support

periodic updates

26 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Then, consolidation of data updates into the warehouse happens periodically via ETL.

  • Extract useful business info to be summarized
  • Transform it (i.e., canonicalize/clean up)
  • Load it into the warehouse

26

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

periodic updates

OLAP support

27 of 72

Typical Business Architecture

Imagine Honda USA, which has many sales regions.

Note: OLAP will suffer some staleness, but OK in practice.

  • Updates like sales are crucial.
  • Better to have sales never be prevented than for a manager’s report to generate promptly.
  • After all, large-scale OLAP read queries would otherwise delay updates.
  • Analysis may not be useful in real-time (e.g. looking at “yesterday’s sales”, we don’t want to see only part of “today”)

27

Region SW

Region NW

Region MW

Region NE

Region S

e.g, regional sales data consolidated daily EOD

Centralized

Data Warehouse

Regional OLTP

support (live or transactional)

OLAP support

periodic updates

28 of 72

CalAnswers: BI Demo

28

Summary

So Far

29 of 72

Data Cubes and OLAP Queries

[Review] Serializability�—

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

29

Lecture 22, Data 101, Fall 2024

30 of 72

Multidimensional Data: Data Cubes

OLAP systems fundamentally provide comprehensive, summarized views of data in data warehouses.

This summarization can and will happen across multiple dimensions of data. One such representation is a data cube, aka OLAP cube.

30

31 of 72

Multidimensional Data: Data Cubes

Our running example today is a 3-dimensional cube that computes sales:

(item, store, customer, number)

where number is the measure �(i.e., sales count) per customer� per store per item.

31

shirt

jacket

jeans

1

2

4

2

3

2

3

Remy

Jas

Tracy

Yuan

NYC_2

SF_3

LA_1

This cube generalizes to the >3D case! Think about it at home.

32 of 72

OLAP Queries

(item, store, customer, number)

32

shirt

jacket

jeans

1

2

4

2

3

2

3

Remy

Jas

Tracy

Yuan

1. Cross-tab

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

33 of 72

[Exercise] OLAP Queries

33

🤔

1. Cross-tab, i.e., pivot table

2(i). Slice to get equality condition � on one or more dimensions.

2(ii).Dice to get range partition � on one or more dimensions.

3(i). Roll-up and�3(ii).drill-down� adjust granularity to � summarize at different � levels of a dimension hierarchy.

Match the OLAP operation to the diagram. Format: 1A

A.

B.

C.

D.

E.

34 of 72

Match the OLAP operation to the diagram. Format: 1A or 2(i)-A, etc.

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

35 of 72

(1/3) OLAP Queries: Cross-tab

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

35

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

1

2

7

2

3

2

4

2

3

3

2

5

7

4

1

1. Cross-tab, i.e., pivot table

    • Example: number of sales �per item per person

4

4

9

17

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

36 of 72

(1/3) OLAP Queries: Cross-tab

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

36

Remy

Jas

Tracy

Yuan

1

2

1. Cross-tab, i.e., pivot table

    • Example: number of sales �per item per person
    • Example: number of sales �per item per store

shirt

jacket

jeans

4

3

2

4

4

9

17

2. Slice and dice

3. Roll-up and drill-down

NYC_2

SF_3

LA_1

37 of 72

(2/3) OLAP Queries: Slice and Dice

(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

37

1. Cross-tab, i.e., pivot table

    • Example: number of sales �per item per person
    • Example: number of sales �per item per store

3. Roll-up and drill-down

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan

dice

slice

2. Slice and dice i.e.,� get equality and range partitions� (respectively) on one or more � dimensions

Dicing is particularly useful for combining date dimensions (e.g., by month, day of week, years, weeks, etc.)

38 of 72

(2/3) OLAP Queries: Slice and Dice

(item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

38

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension � hierarchy

roll up to stores by state

drill down to item by item name

39 of 72

(3/3) OLAP Queries: Roll-up, Drill-down

39

Q: Can the cube be stored at this granularity?

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension

hierarchy

drill down to item by item name

40 of 72

(3/3) OLAP Queries: Roll-up, Drill-down

40

Q: Can the cube be stored at this granularity?

drill down to item by item name

Data cube elements need to be stored at the appropriate granularity to support the expected OLAP queries.

A: ⚠️ No, because it’s impossible to drill down from item category to the finer granularity item name!

shirt category: polo, blouse, turtleneck,�etc.

3. Roll-up and drill-down, i.e.,� adjust granularity to summarize at � different levels of a dimension

hierarchy

41 of 72

Hierarchies for Setting the Partition Granularity

The partitioning granularity can be set based on user needs…

Different partitioning may be useful for different applications.

Examples:

Q: I want the aggregates per month.

  • What if we computed a data cube based on Full Date?
    • We can roll up to access any part of the hierarchy.
  • What about if we had computed a data cube based on Year?
    • No longer possible to drill down to month!

41

Datetime

Hour

Full Date

Day of week

Month

Specific Month

Specific Quarter

Year

Quarter

City

State

Country

Full Date

Year

42 of 72

Summary: Data Cubes and OLAP Queries

OLAP systems fundamentally provide comprehensive, summarized views of data in data warehouses.

42

Amazon OLAP vs. OLTP [link]

One such representation is a data cube, aka OLAP cube.

  • Each element in the cube represent a measure with respect to multiple dimensions of data.
  • The data cube is a multi-dimensional array of measures (e.g., tensor of sales numbers).
  • Generalizes to >3D (why?)

OLAP queries are summarizations, i.e., various aggregate operations of the data cube.

  1. Cross-tabs
  2. Slicing and dicing
  3. Roll-ups, Drill-downs

Cubes need to be stored at the appropriate granularity to support the expected OLAP queries.

43 of 72

Star Schema, Snowflake Schema

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

43

Lecture 22, Data 101, Fall 2024

44 of 72

Data Cubes and physical layout

How do we actually store our data cube?

Recall from our tensors lecture:

  • Dense tensors are not efficient storage.
    • Many elements are likely nulls, i.e., not measured!

44

45 of 72

Data Cubes and physical layout

How do we actually store our data cube?

Recall from our tensors lecture:

  • Dense tensors are not efficient storage.
    • Many elements are likely nulls, i.e., not measured!
  • However, tensors have sparse representations �in the form of relations!

45

OLAP in a relational database:

  • mental model: data cube, a multidimensional tensor
  • physical model: star schema or snowflake schema of relations

Note: Not all OLAP DBMSes are relational! (more options later)

46 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: �(item, store, customer, number)

where number is the measure (i.e., sales count) per customer per store per item.

46

itemID1

itemID2

itemID3

storeID3

storeID2

storeID1

custID1

custID2

custID3

custID4

A data cube in the abstract (for some defined granularity of item, customer, and store)

47 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

47

itemID

storeID

custID

number

11

1

1

25

11

2

1

8

12

1

1

30

20

1

1

8

11

1

2

35

Sales

Fact table that maps dimensions�to measure of interest

48 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

48

storeID

city

state

country

1

SF

CA

USA

2

SF

CA

USA

5

NYC

NY

USA

Store

Dimension table that stores the store dimension’s attributes

itemID

storeID

custID

number

11

1

1

25

11

2

1

8

12

1

1

30

20

1

1

8

11

1

2

35

Sales

Fact table that maps dimensions �to measure of interest

Star schema:

  • One fact table
  • Many dimension tables

A dimension table stores the attributes of that dimension and functions as an explicit hierarchy.

49 of 72

Data Cube → Star Schema

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

49

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

Star schema:

  • One fact table
  • Many dimension tables

A dimension table stores the attributes of that dimension and functions as an explicit hierarchy.

50 of 72

[Exercise] True or False?

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

50

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

🤔

1. Sales is a junction table.

2. Sales and Customer have a many-to-one relationship.

3. There is a foreign key constraint between Store.storeID and Sales.storeID.

True or False?

51 of 72

True or False?

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

52 of 72

[Exercise] True or False?

3-dimensional cube that computes sales: (item,store,customer,number)

where number is the measure (i.e., sales count) per customer per store per item.

52

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

1. Sales is a junction table.

2. Sales and Customer have a many-to-one relationship.

3. There is a foreign key constraint between Store.storeID and Sales.storeID.

True

True

True

53 of 72

Example OLAP Query, but as a SQL query

SELECT category, country, SUM(number)

FROM Sales� NATURAL JOIN Item� NATURAL JOIN Store

GROUP BY category, country

53

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

cross-tab

OLAP report queries GROUP BY some dimension attributes and aggregate measure attributes.

Example cross-tab:

54 of 72

From a Star Schema to a Snowflake Schema

A snowflake schema extends a star schema:

  • Still one fact table
  • Many dimension tables
  • Dimension tables are also connected to other dimension tables

54

Sales�(itemID,�storeID,�customerID,�number)

Item (itemID, itemname, color, size, category, manuf_name)

Store (storeID, city, state, country)

Customer (customerID, name, street, city, state)

Manufacturer(name, address, owner)

foreign key

foreign key

foreign key

foreign key

55 of 72

MOLAP, ROLAP, and HOLAP

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

55

Lecture 22, Data 101, Fall 2024

56 of 72

Types of OLAP databases

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

  • Optimized multidimensional array (i.e., not dense)
  • Pre-compute materialized slices, cross-tabs, different cube granularities

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

  • Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
  • No pre-computation, other than the materialized views specified by the user

56

57 of 72

[Exercise] MOLAP vs. ROLAP

MOLAP (Multidimensional OLAP)Data cube stored as a multidimensional array storage

Relational OLAP (ROLAP)�Data cube is stored as a star (or snowflake) schema in relational DBMS

1. (+) Fast response to queries (due to precomputation)

2. (+) Supports generalized SQL tools

3. (+) Supports data that does not fit into the data cube model

4. (–) Expensive ETL time (due to precomputation)

5. (–) No OLAP-specific optimizations (other than what is provided already in SQL)

6. (–) Needs SQL knowledge to specify new materialized views

7. (–) Potential data redundancy

57

A. MOLAP

B. ROLAP

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1A

🤔

58 of 72

MOLAP or ROLAP: Which OLAP system do each of the following pros/cons refer to? Format: 1A

Click Present with Slido or install our Chrome extension to activate this poll while presenting.

59 of 72

A third type of OLAP database

MOLAP (Multidimensional OLAP) Data cube stored as a multidimensional array storage

  • Optimized multidimensional array (i.e., not dense)
  • Pre-compute materialized slices, cross-tabs, different cube granularities

Relational OLAP (ROLAP) Data cube is stored as a star (or snowflake) schema in relational DBMS

  • Can be implemented in SQL (WHERE, GROUP BY, plus CUBE/ROLL UP (see extra slides))
  • No pre-computation, other than the materialized views specified by the user

Hybrid OLAP (HOLAP) The best of both worlds.

  • Some data stored in relational storage; other data stored in specialized storage.
  • The exact proportions are left to the specific OLAP system + the designer.

59

60 of 72

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

60

  • (new) data cubes, star/snowflake schema, facts, measures, dimensions
  • (new) slicing/dicing
  • (review) hierarchies, roll-up, drill-down, cross-tabs
  • MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

61 of 72

So, Why Did We Learn OLAP?

OLAP (OnLine Analytical Processing) is a special BI (Business Intelligence) system that supports analytical processing and report generation queries, typically done in large “batch” operations.

61

  • (new) data cubes, star/snowflake schema, facts, measures, dimensions
  • (new) slicing/dicing
  • (review) hierarchies, roll-up, drill-down, cross-tabs
  • MOLAP, ROLAP, HOLAP

[see extra] Conveniences that come in SQL: ROLLUP and CUBE operators

One step closer to understanding jargon!!

Microsoft Learn / Power Platform / Power BI [link]

Amazon What is OLAP [link]

62 of 72

[Extra] OLAP in SQL

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

62

Lecture 22, Data 101, Fall 2024

63 of 72

OLAP in SQL: CUBE

  • NULL” is used to indicate “ALL”

SELECT item, color, SUM(number)

FROM Sales

GROUP BY CUBE (item, color)

  • Any attr. may be replaced with NULL(ALL)

  • May result in too many combinations

63

Item

Color

Number

Jacket

Blue

23

Jacket

Green

34

Jeans

Blue

28

Jeans

Green

17

Jacket

NULL

185

Jeans

NULL

200

NULL

Blue

94

NULL

Red

74

NULL

NULL

984

64 of 72

ROLLUP is an alternative to CUBE

ROLLUP targets a smaller # of combinations

SELECT item, color, size, SUM(number)

FROM Sales

GROUP BY ROLLUP (item), ROLLUP (color, size)

  • Every combination of:
    • specific item or ALL for the first rollup
    • for the second rollup
      • specific color and specific size
      • specific color and ALL sizes
      • ALL colors and ALL sizes
  • Thus, combinations include
    • {(item, color, size), (item, color), (item), (color, size), (color), ()}

64

65 of 72

Picking the right CUBE/ROLLUP query

  • First, why does this matter?
    • If this query is being run once on a PB- sized warehouse, it is important to get it right!
    • Results usually materialized and used in dashboards, presentations, spreadsheets, ….

  • Approach:
    • Think about all the ways you want to slice and dice your data
    • Pick granularity to recreate all aggregates you want, without blowing up the query result
      • Result size grows exponentially in the attrs; can be quite bad in large snowflake schemas
      • Known as the curse of dimensionality

65

66 of 72

[Extra] Implicit Dimensions

[Review] Serializability

Snapshot Isolation

BI: OLAP and OLTP

Data Cubes and OLAP Queries

Star Schema, Snowflake Schema

MOLAP, ROLAP, and HOLAP

[Extra] OLAP in SQL

[Extra] Implicit Dimensions

66

Lecture 22, Data 101, Fall 2024

67 of 72

Dates in OLAP

  • Example Schema:
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Dim table: Iteminfo (itemid, itemname, color, size, category)
    • Dim table: Store (storeid, city, state, country)
    • Dim table: Customer (customerid, name, street, city, state)

  • Not very useful to “aggregate” date; better to treat date as an implicit dimension!
    • Allows us to group by and see trends across dates (e.g., sales by year)
    • Fact table: Sales (itemid, storeid, customerid, date, number, price)
    • Implicit Dim table: Dateinfo (date, month, quarter, year)

67

Store

Iteminfo

Customer

Sales

68 of 72

Normalization and Star Schemas

Example Schema:

  • Fact table: Sales (itemID, storeID, customerID, date, number, price)
  • Dim table: Item (itemID, itemname, color, size, category)
  • Dim table: Store (storeID, city, state, country)
  • Dim table: Customer (customerID, name, street, city, state)

Not very useful to “aggregate” date; better to treat date as an implicit dimension!

  • Allows us to group by and see trends across dates (e.g., sales by year)
  • From the “true Fact” table: Sales (itemID, storeID, customerID, date, number, price)
  • Implicit dimension table: DateInfo (date, month, quarter, year)
  • And perhaps another implicit dimension table on price (by date and item)

68

Store

Iteminfo

Customer

Sales

69 of 72

Implicit Dimension Tables and SQL queries

Example Schema:

  • Fact table: Sales (itemID, storeID, customerID, date, number, price)
  • Dim table: Item (itemID, itemname, color, size, category)
  • Dim table: Store (storeID, city, state, country)
  • Dim table: Customer (customerID, name, street, city, state)
  • Implicit dimension table: DateInfo (date, month, quarter, year)
  • etc.

69

Store

Iteminfo

Customer

Sales

Date

70 of 72

Implicit Dimension Tables and SQL queries

Example Schema:

  • Fact table: Sales (itemID, storeID, customerID, date, number, price)
  • Dim table: Item (itemID, itemname, color, size, category)
  • Dim table: Store (storeID, city, state, country)
  • Dim table: Customer (customerID, name, street, city, state)
  • Implicit dimension table: DateInfo (date, month, quarter, year)
  • etc.

70

What we might want:

SELECT category, country,� month, COUNT(number)

FROM Sales

NATURAL JOIN Iteminfo NATURAL JOIN � Store NATURAL JOIN Dateinfo

GROUP BY category, country, month

Store

Iteminfo

Customer

Sales

Date

Actual query might be (in PostgreSQL)

SELECT category, country, datepart(‘month’, date) as month, COUNT(number)

FROM Sales NATURAL JOIN Iteminfo NATURAL JOIN Store

GROUP BY category, country, month

71 of 72

Extra graphics

71

tops

jackets

bottoms

USA

Cape Town

India

shirt

jacket

jeans

Remy

Jas

Tracy

Yuan

NYC_2

SF_3

LA_1

72 of 72

Extra graphics

72

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan

shirt

jacket

jeans

NYC_2

SF_3

LA_1

Remy

Jas

Tracy

Yuan