Course Name :
DATA WAREHOUSING & DATA MINING
Course Code : 20CS10
Course Instructor : Dr.S.Naganjaneyulu
Semester : IV
Regulation : R20
1
20CS10 - DATA WAREHOUSING & DATA MINING
2
20CS10 - DATA WAREHOUSING & DATA MINING
3
Course Outcomes:
At the end of this course, the student will be able to
CO1: Summarize the architecture of data warehouse. (Understand- L2)
CO2: Apply different preprocessing methods, Similarity, Dissimilarity measures for any given raw data.(Apply – L3)
CO3: Construct a decision tree and resolve the problem of model overfitting.(Analyze– L4)
CO4: Compare Apriori and FP-growth association rule mining algorithms for frequent itemset generation.(Apply - L3)
CO5: Apply suitable clustering algorithm for the given data set. (Apply - L3)
4
�DATA WAREHOUSING AND DATA MINING�
UNIT I
DATA WAREHOUSING
5
6
DATA WAREHOUSING (Unit - I)
7
1.1 Data Warehouse Overview
What is Data Warehouse?
8
Data Warehouse—Subject-Oriented
9
Data Warehouse—Integrated
10
Data Warehouse—Time Variant
11
Data Warehouse—Nonvolatile
12
OLTP vs OLAP
13
| OLTP | OLAP |
User & System Orientation | Customer Oriented ( transaction & query processing) | market Oriented ( Data Analysis by managers, executives & Analysts) |
Data Contents | Current Data (too detailed) | Large amount of data (summarization & aggregation) |
Database design | ER data model ( Application oriented database design) | Star or Snowflake model (subject Oriented Database design) |
View | focus on current Data within an enterprise or department | multiple versions of database schema(evolutionary process), data from diff. org. & many data stores |
Access Patterns | short, atomic transactions (requires concurrency control & recovery) | read-only operations ( Complex queries) |
OLTP vs. OLAP
14
15
1.2 Data Warehouse Architecture
Data Warehouse vs. Operational DBMS
16
Why Separate Data Warehouse?
17
18
Data Warehousing: A Multitiered Architecture
19
Data Warehousing: A Multitiered Architecture
20
Data Warehousing: A Multitiered Architecture
21
Data Warehousing: A Multitiered Architecture
or
22
Data Warehousing: A Multitiered Architecture
23
Data Warehousing: A Multitiered Architecture
24
Data Warehousing: A Multitiered Architecture
Metadata Repository:
metadata are the data that define warehouse objects
It consists of:
25
Data Warehousing: A Multitiered Architecture
Metadata Repository:
26
Data Warehousing: A Multitiered Architecture
Metadata Repository:
27
Data Warehousing: A Multitiered Architecture
Metadata Repository:
28
Data Warehousing: A Multitiered Architecture
Metadata Repository:
29
30
1.3 A Multidimensional Data Model
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
31
Data Warehouse Modeling: Data Cube :
A Multidimensional Data Model
32
Data Cube: A Multidimensional Data Model
33
Data Cube: A Multidimensional Data Model
34
Data Cube: A Multidimensional Data Model
35
2-D View of Sales data
Data Cube: A Multidimensional Data Model
36
3-D View of a Sales data
The 3-D data in the table are represented as a series of 2-D tables
Data Cube: A Multidimensional Data Model
37
3D Data Cube Representation of Sales data
we may also represent the same data in the form of a 3D data cube
Data Cube: A Multidimensional Data Model
38
4-D Data Cube Representation of Sales Data
we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”
Cube: A Lattice of Cuboids
39
time,item
time,item,location
time, item, location, supplier
all
time
item
location
supplier
time,location
time,supplier
item,location
item,supplier
location,supplier
time,item,supplier
time,location,supplier
item,location,supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
40
Schemas for Multidimensional Data Models
41
Schemas for Multidimensional Data Models
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables
(dimension tables), one for each dimension.
42
Star schema
43
Snow flake schema
44
Snowflake schema
45
Fact Constellation
46
Fact Constellation
47
This schema specifies two fact tables, sales and shipping
the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.
Examples for Defining Star, Snowflake,�and Fact Constellation Schemas
48
Syntax for Cube and Dimension Definition in DMQL
define cube <cube_name> [<dimension_list>]: <measure_list>
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
49
Defining Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
50
Defining Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, province_or_state, country))
51
Defining Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
52
Concept Hierarchies
53
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman
Concept Hierarchies
54
courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman
Concept Hierarchies
55
Measures of Data Cube: Three Categories
for example, 〈time = “Q1”, location = “Vancouver”,�item = “computer”〉.
56
Measures of Data Cube: Three Categories
57
Typical OLAP Operations
58
59
Fig. 3.10 Typical OLAP Operations
Typical OLAP Operations:Roll Up/Drill Up
60
Source & Courtesy: https://www.javatpoint.com/olap-operations
or
Typical OLAP Operations:Roll Down
61
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Slicing
62
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Typical OLAP Operations:Slicing
63
Slicing:
It selects a single dimension from the OLAP cube which results in a new sub-cube creation.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Dice
64
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Typical OLAP Operations:Dicing
65
Dice:
It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
Source & Courtesy: https://www.javatpoint.com/olap-operations
Typical OLAP Operations:Pivot
66
Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube
Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter.
A Star-Net Query Model
67
A Star-Net Query Model
68
A Star-Net Query Model
or
69
Data Warehouse Design and Usage
A Business Analysis Framework for Data Warehouse Design:
70
Data Warehouse Design and Usage
71
Data Warehouse Design and Usage
72
Data Warehouse Design and Usage
73
Data Warehouse Design and Usage
Data Warehouse Design Process
74
structured and systematic analysis at each step, one step to the next
Data Warehouse Design and Usage
Data Warehouse Design Process
75
Data Warehouse Design and Usage
76
Data Warehouse Design and Usage
Data Warehouse Usage for Information Processing
77
78
1.4 Data Warehouse Implementation
Data warehouse implementation�
79
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
80
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
Example 4.6
81
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
82
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
83
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
84
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
85
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
86
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
87
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
There are three choices for data cube materialization for a given base cuboid:
88
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
89
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
90
Data warehouse implementation:
1.4.1 Efficient Data Cube Computation
91
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index
Index structures - To facilitate efficient data accessing
92
Data warehouse implementation:
1.3.2 Indexing OLAP Data: Bitmap Index
93
Exercise: Bitmap index on CITY
94
Data warehouse implementation:
Indexing OLAP Data: Join Index
95
Data warehouse implementation:
Indexing OLAP Data: Join Index
96
Data warehouse implementation:
Indexing OLAP Data: Join Index
97
To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices.
Data warehouse implementation:
Efficient processing of OLAP queries
Given materialized views, query processing should proceed as follows:
into
corresponding SQL and/or OLAP operations.
98
Data warehouse implementation:
Efficient processing of OLAP queries
“dominance” relationships among the cuboids,
99
Data warehouse implementation:
Efficient processing of OLAP queries
Example:-
100
Data warehouse implementation:
Efficient processing of OLAP queries
101
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
102
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
103
A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non-zeros.
Data warehouse implementation:
OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP
104
105
1.5 From Data Warehousing to Data Mining
From DataWarehousing to Data Mining
DataWarehouse Usage
106
DataWarehouse Usage
107
DataWarehouse Usage
108
DataWarehouse Usage
109
From Online Analytical Processing to Multidimensional Data Mining
110
Architecture for On-Line Analytical Mining
111
112