Course Name :
DATA WAREHOUSING & DATA MINING
Course Code : 20CS10
Course Instructor : Michael Sadgun Rao.K
Semester : IV
Regulation : R20
1
20CS10 - DATA WAREHOUSING & DATA MINING
2
20CS10 - DATA WAREHOUSING & DATA MINING
3
Course Outcomes:
At the end of this course, the student will be able to
CO1: Summarize the architecture of data warehouse. (Understand- L2)
CO2: Apply different preprocessing methods, Similarity, Dissimilarity measures for any given raw data.(Apply – L3)
CO3: Construct a decision tree and resolve the problem of model overfitting.(Analyze– L4)
CO4: Compare Apriori and FP-growth association rule mining algorithms for frequent itemset generation.(Apply - L3)
CO5: Apply suitable clustering algorithm for the given data set. (Apply - L3)
4
�DATA WAREHOUSING AND DATA MINING�
UNIT I
DATA WAREHOUSING
5
6
DATA WAREHOUSING (Unit - I)
What is Data Warehouse?
7
Data Warehouse—Subject-Oriented
8
Data Warehouse—Integrated
9
Data Warehouse—Time Variant
10
Data Warehouse—Nonvolatile
11
OLTP vs OLAP
12
| OLTP | OLAP |
User & System Orientation | Customer Oriented ( transaction & query processing) | market Oriented ( Data Analysis by managers, executives & Analysts) |
Data Contents | Current Data (too detailed) | Large amount of data (summarization & aggregation) |
Database design | ER data model ( Application oriented database design) | Star or Snowflake model (subject Oriented Database design) |
View | focus on current Data within an enterprise or department | multiple versions of database schema(evolutionary process), data from diff. org. & many data stores |
Access Patterns | short, atomic transactions (requires concurrency control & recovery) | read-only operations ( Complex queries) |
Data Warehouse vs. Operational DBMS
13
OLTP vs. OLAP
14
Why Separate Data Warehouse?
15
16
Data Warehousing: A Multitiered Architecture
17
Data Warehousing: A Multitiered Architecture
18
Data Warehousing: A Multitiered Architecture
19
Data Warehousing: A Multitiered Architecture
or
20
Data Warehousing: A Multitiered Architecture
21
Data Warehousing: A Multitiered Architecture
22
Data Warehousing: A Multitiered Architecture
Metadata Repository:
metadata are the data that define warehouse objects
It consists of:
23
Data Warehousing: A Multitiered Architecture
Metadata Repository:
24
Data Warehousing: A Multitiered Architecture
Metadata Repository:
25
Data Warehousing: A Multitiered Architecture
Metadata Repository:
26
Data Warehousing: A Multitiered Architecture
Metadata Repository:
27
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
28
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
29
Data Cube: A Multidimensional Data Model
30
Data Cube: A Multidimensional Data Model
31
Data Cube: A Multidimensional Data Model
32
2-D View of Sales data
Data Cube: A Multidimensional Data Model
33
3-D View of a Sales data
The 3-D data in the table are represented as a series of 2-D tables
Data Cube: A Multidimensional Data Model
34
3D Data Cube Representation of Sales data
we may also represent the same data in the form of a 3D data cube
Data Cube: A Multidimensional Data Model
35
4-D Data Cube Representation of Sales Data
we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”
Cube: A Lattice of Cuboids
36
time,item
time,item,location
time, item, location, supplier
all
time
item
location
supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
37
Schemas for Multidimensional Data Models
38
Schemas for Multidimensional Data Models
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables
(dimension tables), one for each dimension.
39
Star schema
40
Snow flake
41
Snowflake schema
42
Fact Constellation
43
Fact Constellation
44
This schema specifies two fact tables, sales and shipping
the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.
Cube Definition Syntax in DMQL
define cube <cube_name> [<dimension_list>]: <measure_list>
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
45
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
46
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
47
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
48
Dimensions: The Role of Concept Hierarchies
49
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman
Dimensions: The Role of Concept Hierarchies
50
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman
Dimensions: The Role of Concept Hierarchies
51
Measures of Data Cube: Three Categories
52
Typical OLAP Operations
53
54
Fig. 3.10 Typical OLAP Operations
Typical OLAP Operations:Roll Up/Drill Up
55
Source & Courtesy: https://www.javatpoint.com/olap-operations
or
Typical OLAP Operations:Roll Down
56
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Slicing
57
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Typical OLAP Operations:Slicing
58
Slicing:
It selects a single dimension from the OLAP cube which results in a new sub-cube creation.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Dice
59
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Typical OLAP Operations:Dicing
60
Dice:
It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Pivot
61
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter.
A Star-Net Query Model
62
A Star-Net Query Model
63
A Star-Net Query Model
or
64
A Star-Net Query Model
65
Shipping Method
AIR-EXPRESS
TRUCK
ORDER
Customer Orders
CONTRACTS
Customer
Product
PRODUCT GROUP
PRODUCT LINE
PRODUCT ITEM
SALES PERSON
DISTRICT
DIVISION
Organization
Promotion
CITY
COUNTRY
REGION
Location
DAILY
QTRLY
ANNUALY
Time
Each circle is called a footprint
Data Warehouse Design and Usage
A Business Analysis Framework for Data Warehouse Design:
66
Data Warehouse Design and Usage
67
Data Warehouse Design and Usage
68
Data Warehouse Design and Usage
69
Data Warehouse Design and Usage
Data Warehouse Design Process
70
structured and systematic analysis at each step, one step to the next
Data Warehouse Design and Usage
Data Warehouse Design Process
71
Data Warehouse Design and Usage
72
Data Warehouse Design and Usage
Data Warehouse Usage for Information Processing
73
Data Warehouse Design and Usage
74
Data Warehouse Design and Usage
75
Data Warehouse Design and Usage
76
Data Warehouse Design and Usage
From Online Analytical Processing to Multidimensional Data Mining
or
or
77
Data Warehouse Design and Usage
78
Data warehouse implementation�
79
Data warehouse implementation:
Efficient Data Cube Computation
80
Data warehouse implementation:
Efficient Data Cube Computation
Example 4.6
81
Data warehouse implementation:
Efficient Data Cube Computation
82
Data warehouse implementation:
Efficient Data Cube Computation
83
Data warehouse implementation:
Efficient Data Cube Computation
84
Data warehouse implementation:
Efficient Data Cube Computation
85
Data warehouse implementation:
Efficient Data Cube Computation
86
Data warehouse implementation:
Efficient Data Cube Computation
87
Data warehouse implementation:
Efficient Data Cube Computation
There are three choices for data cube materialization given a base cuboid:
88
Data warehouse implementation:
Efficient Data Cube Computation
89
Data warehouse implementation:
Efficient Data Cube Computation
90
Data warehouse implementation:
Efficient Data Cube Computation
91
Data warehouse implementation:
Indexing OLAP Data: Bitmap Index
92
Data warehouse implementation:
Indexing OLAP Data: Bitmap Index
93
Exercise: Bitmap index on CITY
94
Data warehouse implementation:
Indexing OLAP Data: Join Index
95
Data warehouse implementation:
Indexing OLAP Data: Join Index
96
Data warehouse implementation:
Indexing OLAP Data: Join Index
97
To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices.
Data warehouse implementation:
Efficient processing of OLAP queries
Given materialized views, query processing should proceed as follows:
into
corresponding SQL and/or OLAP operations.
98
Data warehouse implementation:
Efficient processing of OLAP queries
“dominance” relationships among the cuboids,
99
Data warehouse implementation:
Efficient processing of OLAP queries
Example:-
100
Data warehouse implementation:
Efficient processing of OLAP queries
101
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
102
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
103
A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non-zeros.
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
104