1 of 104

Course Name :

DATA WAREHOUSING & DATA MINING

Course Code : 20CS10

Course Instructor : Michael Sadgun Rao.K

Semester : IV

Regulation : R20

2 of 104

20CS10 - DATA WAREHOUSING & DATA MINING

Theory - 4 hours per week

TEXTBOOKS:

Introduction to Data Mining : Pang-Ning Tan & Michael Steinbach, Vipin Kumar, Fifth Impression, Pearson, 2015.
Data Mining concepts and Techniques, 3rd Edition, Jiawei Han, Michael Kamber, Elsevier, 2011

Practical - 3 hours per week

Data Mining using Python Lab

3 of 104

20CS10 - DATA WAREHOUSING & DATA MINING

Pre-requisite :

DBMS and
Probability and Statistics

Course Educational Objective:

To introduce the concepts of data warehouse and data mining, which gives a complete description about the principles used, architectures, applications, design and implementation of data mining and data warehousing concepts.

4 of 104

Course Outcomes:

At the end of this course, the student will be able to

CO1: Summarize the architecture of data warehouse. (Understand- L2)

CO2: Apply different preprocessing methods, Similarity, Dissimilarity measures for any given raw data.(Apply – L3)

CO3: Construct a decision tree and resolve the problem of model overfitting.(Analyze– L4)

CO4: Compare Apriori and FP-growth association rule mining algorithms for frequent itemset generation.(Apply - L3)

CO5: Apply suitable clustering algorithm for the given data set. (Apply - L3)

5 of 104

�DATA WAREHOUSING AND DATA MINING�

UNIT I

DATA WAREHOUSING

6 of 104

DATA WAREHOUSING (Unit - I)

Data Warehouse and OLAP Technology:

An Overview: Data Warehouse
A Multidimensional Data Model
Data Warehouse Architecture
Data Warehouse Implementation
From Data Warehousing to Data Mining. (Han & Kamber)

7 of 104

What is Data Warehouse?

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
Data warehouse refers to a data repository that is maintained separately from an organization’s operational databases.
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
Data warehousing: The process of constructing and using data warehouses

8 of 104

Data Warehouse—Subject-Oriented

Organized around major subjects, such as customer, product, sales
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

9 of 104

Data Warehouse—Integrated

Constructed by integrating multiple, heterogeneous data sources

relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.

10 of 104

Data Warehouse—Time Variant

The time horizon for the data warehouse is significantly longer than that of operational systems

Operational database: current value data
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse

Contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain “time element”

11 of 104

Data Warehouse—Nonvolatile

A physically separate store of data transformed from the operational environment
Operational update of data does not occur in the data warehouse environment

Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing:

initial loading of data and access of data

12 of 104

OLTP vs OLAP

OLTP - Online Transaction Processing
OLAP - online Analytical Processing

	OLTP	OLAP
User & System Orientation	Customer Oriented ( transaction & query processing)	market Oriented ( Data Analysis by managers, executives & Analysts)
Data Contents	Current Data (too detailed)	Large amount of data (summarization & aggregation)
Database design	ER data model ( Application oriented database design)	Star or Snowflake model (subject Oriented Database design)
View	focus on current Data within an enterprise or department	multiple versions of database schema(evolutionary process), data from diff. org. & many data stores
Access Patterns	short, atomic transactions (requires concurrency control & recovery)	read-only operations ( Complex queries)

13 of 104

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)

Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

Major task of data warehouse system
Data analysis and decision making

Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries

14 of 104

OLTP vs. OLAP

15 of 104

Why Separate Data Warehouse?

High performance for both systems

DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

Different functions and different data:

missing data: Decision support requires historical data which operational DBs do not typically maintain
data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Note: There are more and more systems which perform OLAP analysis directly on relational databases

17 of 104

Data Warehousing: A Multitiered Architecture

Bottom Tier:

Warehouse Database server
a relational database system
Back-end tools and utilities

data extraction

by using API gateways(ODBC, JDBC & OLEDB)

cleaning
transformation
load & refresh

18 of 104

Data Warehousing: A Multitiered Architecture

Middle Tier (OLAP server)

ROLAP - Relational OLAP

extended RDBMS that maps operations on multidimensional data to standard relational operations.

MOLAP - Multidimensional OLAP

Special-purpose server that directly implements multidimensional data and operations.

Top Tier

Front-end Client Layer
Query and reporting tools, analysis tools and data mining tools.

19 of 104

Data Warehousing: A Multitiered Architecture

Data Warehouse Models:

Enterprise warehouse:

collects all of the information about subjects spanning the entire organization.

corporate-wide data integration

can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

implemented on mainframes, computer superservers, or parallel architecture platforms

20 of 104

Data Warehousing: A Multitiered Architecture

Data Warehouse Models:
Data mart:a subset of corporate-wide data that is of value to a specific group of users
confined to specific selected subjects.
Example - marketing data mart may confine its subjects to customer, item, and sales.
implemented on low-cost departmental servers
Independent Data mart - data captured from

one or more operational systems or external information providers,

from data generated locally within a particular department or geographic area.

Dependent Data mart - sourced directly from enterprise data warehouses.

21 of 104

Data Warehousing: A Multitiered Architecture

Data Warehouse Models:
Virtual warehouse:

A virtual warehouse is a set of views over operational databases.
easy to build but requires excess capacity on operational database servers.

22 of 104

Data Warehousing: A Multitiered Architecture

Data extraction: gathers data from multiple, heterogeneous, and external sources.
Data Cleaning: detects errors in the data and rectifies them when possible
Data transformation: converts data from legacy or host format to warehouse format.
Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh: propagates the updates from the data sources to the warehouse.

23 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

metadata are the data that define warehouse objects

It consists of:

Data warehouse structure
Operational metadata
algorithms used for summarization
Mapping from the operational environment to the data warehouse
Data related to system performance
Business metadata

24 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

data warehouse structure

warehouse schema,
view, dimensions,
hierarchies, and
derived data definitions,
data mart locations and contents.

Operational metadata

data lineage (history of migrated data and the sequence of transformations applied to it),
currency of data (active, archived, or purged),
monitoring information (warehouse usage statistics, error reports, and audit trails).

25 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

The algorithms used for summarization,

measure and dimension definition algorithms,
data on granularity,
partitions,
subject areas,
aggregation,
summarization, and
predefined queries and reports.

26 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

Mapping from the operational environment to the data warehouse

source databases and their contents,
gateway descriptions,
data partitions,
data extraction, cleaning, transformation rules and defaults
data refresh and purging rules, and
security (user authorization and access control).

27 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

Data related to system performance

indices and profiles that improve data access and retrieval performance,
rules for the timing and scheduling of refresh, update, and replication cycles.

Business metadata,

business terms and definitions,
data ownership information, and
charging policies

28 of 104

Data Warehouse Modeling: Data Cube :

A Multidimensional Data Model

A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
Example:-

AllElectronics may create a sales data warehouse
time, item, branch, and location - These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.

29 of 104

Data Warehouse Modeling: Data Cube :

A Multidimensional Data Model

Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.
For example - a dimension table for item may contain the attributes item name, brand, and type.
A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table.
Facts are numeric measures.
The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

30 of 104

Data Cube: A Multidimensional Data Model

A data warehouse is based on a multidimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

31 of 104

Data Cube: A Multidimensional Data Model

A data cube is a lattice of cuboids
A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which

each dimension corresponds to an attribute or a set of attributes in the schema, and
each cell stores the value of some aggregate measure such as count or sum(sales_amount).

A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

32 of 104

Data Cube: A Multidimensional Data Model

2-D View of Sales data

AllElectronics sales data for items sold per quarter in the city of Vancouver.
a simple 2-D data cube that is a table or spreadsheet for sales data from AllElectronics

33 of 104

Data Cube: A Multidimensional Data Model

3-D View of a Sales data

The 3-D data in the table are represented as a series of 2-D tables

34 of 104

Data Cube: A Multidimensional Data Model

3D Data Cube Representation of Sales data

we may also represent the same data in the form of a 3D data cube

35 of 104

Data Cube: A Multidimensional Data Model

4-D Data Cube Representation of Sales Data

we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”

36 of 104

Cube: A Lattice of Cuboids

time,item

time,item,location

time, item, location, supplier

all

time

item

location

supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

37 of 104

In data warehousing literature, an n-D base cube is called a base cuboid.

The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid.

In our example, this is the total sales, or dollars sold, summarized over all four dimensions.
The apex cuboid is typically denoted by all.

The lattice of cuboids forms a data cube.

38 of 104

Schemas for Multidimensional Data Models

Modeling data warehouses: dimensions & measures

Star schema: A fact table in the middle connected to a set of dimension tables
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

39 of 104

Schemas for Multidimensional Data Models

Star schema: In this, a data warehouse contains

(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and

(2) a set of smaller attendant tables

(dimension tables), one for each dimension.

Each dimension is represented by only one table.
Each table contains a set of attributes
Problem: redundancy in dimension tables.
ex:- location dimension table will create redundancy among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).

40 of 104

Star schema

41 of 104

Snow flake

Variant of the star schema model
Dimension tables are normalized ( to remove redundancy)
Dimension table is splitted into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
Problem

more joins will be needed to execute a query ( affects system performance)
so this is not as popular as the star schema in data warehouse design.

42 of 104

Snowflake schema

43 of 104

Fact Constellation

A fact constellation schema allows dimension tables to be shared between fact tables
A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
For data warehouses, the fact constellation schema is commonly used.
For data marts, the star or snowflake schema is commonly used

44 of 104

Fact Constellation

This schema specifies two fact tables, sales and shipping

the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.

45 of 104

Cube Definition Syntax in DMQL

Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

First time as “cube definition”
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

46 of 104

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

47 of 104

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))

48 of 104

Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

49 of 104

Dimensions: The Role of Concept Hierarchies

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level.
concept hierarchy for the dimension location

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman

50 of 104

Dimensions: The Role of Concept Hierarchies

A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman

51 of 104

Dimensions: The Role of Concept Hierarchies

Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.
A total or partial order can be defined among groups of values.

52 of 104

Measures of Data Cube: Three Categories

Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

E.g., count(), sum(), min(), max()

Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function

E.g., avg()=sum()/count(), min_N(), standard_deviation()

Holistic: there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation.

E.g., median(), mode(), rank()

53 of 104

Typical OLAP Operations

Roll up (drill-up):
Drill down (roll down):
Slice and dice: project and select
Pivot (rotate):

reorient the cube, visualization, 3D to series of 2D planes

Other operations

drill across: involving (across) more than one fact table
drill through: Allows users to analyze the same data through different reports, analyze it with different features and even display it through different visualization methods

54 of 104

Fig. 3.10 Typical OLAP Operations

55 of 104

Typical OLAP Operations:Roll Up/Drill Up

Source & Courtesy: https://www.javatpoint.com/olap-operations

summarize data

by climbing up hierarchy

by dimension reduction

56 of 104

Typical OLAP Operations:Roll Down

reverse of roll-up

from higher level summary to lower level summary or detailed data, or introducing new dimensions

Source & Courtesy: https://www.javatpoint.com/olap-operations

57 of 104

Typical OLAP Operations:Slicing

Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.
Example: The sales figures of all sales regions and all product categories of the company in the year 2005 and 2006 are "sliced" out of the data cube.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

58 of 104

Typical OLAP Operations:Slicing

Slicing:

It selects a single dimension from the OLAP cube which results in a new sub-cube creation.

Source & Courtesy: https://www.javatpoint.com/olap-operations

59 of 104

Typical OLAP Operations:Dice

Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions
The picture shows a dicing operation: The new cube shows the sales figures of a limited number of product categories, the time and region dimensions cover the same range as before.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

60 of 104

Typical OLAP Operations:Dicing

Dice:

It selects a sub-cube from the OLAP cube by selecting two or more dimensions.

Source & Courtesy: https://www.javatpoint.com/olap-operations

61 of 104

Typical OLAP Operations:Pivot

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter.

62 of 104

A Star-Net Query Model

The querying of multidimensional databases can be based on a starnet model.
It consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
Each abstraction level in the hierarchy is called a footprint.

These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

63 of 104

A Star-Net Query Model

64 of 104

A Star-Net Query Model

Four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively
footprints representing abstraction levels of the dimension - time line has four footprints: “day,” “month,” “quarter,” and “year.”
Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as “year”)

to specialize data by replacing higher-level abstractions with lower-level values.

65 of 104

A Star-Net Query Model

Shipping Method

AIR-EXPRESS

TRUCK

ORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

Organization

Promotion

CITY

COUNTRY

REGION

Location

DAILY

QTRLY

ANNUALY

Time

Each circle is called a footprint

66 of 104

Data Warehouse Design and Usage

A Business Analysis Framework for Data Warehouse Design:

To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework.

Different views are combined to form a complex framework.

Top-down, business-driven, or owner’s perspective

Bottom-up, builder-driven, or implementor’s view

67 of 104

Data Warehouse Design and Usage

Four different views regarding a data warehouse design must be considered:
Top-down view

allows the selection of the relevant information necessary for the data warehouse (matches current and future business needs).

Data source view

exposes the information being captured, stored, and managed by operational systems.
Documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.
Modeled in ER model or CASE (computer-aided software engineering).

68 of 104

Data Warehouse Design and Usage

Data warehouse view

includes fact tables and dimension tables.
It represents the information that is stored inside the data warehouse, including

precalculated totals and counts,
information regarding the source, date, and time of origin, added to provide historical context.

Business query view

is the data perspective in the data warehouse from the end-user’s viewpoint.

69 of 104

Data Warehouse Design and Usage

Skills required to build & use a Data warehouse

Business Skills

how systems store and manage their data,
how to build extractors (operational DBMS to DW)
how to build warehouse refresh software(update)

Technology skills

the ability to discover patterns and trends,
to extrapolate trends based on history and look for anomalies or paradigm shifts, and
to present coherent managerial recommendations based on such analysis.

Program management skills

Interface with many technologies, vendors, and end-users in order to deliver results in a timely and cost effective manner

70 of 104

Data Warehouse Design and Usage

Data Warehouse Design Process

A data warehouse can be built using

Top-down approach (overall design and planning)

It is useful in cases where the technology is mature and well known

Bottom-up approach(starts with experiments & prototypes)
a combination of both.

In SE point of view ( Waterfall model or Spiral model)

planning,
requirements study,
problem analysis,
warehouse design,
data integration and testing, and
finally deployment of the data warehouse

structured and systematic analysis at each step, one step to the next

rapid generation, short intervals between successive releases, good choice for data warehouse development
turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner

71 of 104

Data Warehouse Design and Usage

Data Warehouse Design Process

Steps involved in Warehouse design are:
1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger).

Data warehouse model - If the business process is organizational and involves multiple complex object collections,
Data mart model - if the process is departmental and focuses on the analysis of one kind of business process

72 of 104

Data Warehouse Design and Usage

2. Choose the business process grain

Fundamental, atomic level of data to be represented in the fact table
(e.g., individual transactions, individual daily snapshots, and so on).

3. Choose the dimensions that will apply to each fact table record.

Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.

4. Choose the measures that will populate each fact table record.

Typical measures are numeric additive quantities like dollars sold and units sold.

73 of 104

Data Warehouse Design and Usage

Data Warehouse Usage for Information Processing

Evolution of DW takes place throughout a number of phases.
Initial Phase - DW is used for generating reports and answering predefined queries.
Progressively - to analyze summarized and detailed data, (results are in the form of reports and charts)
Later - for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations.
Finally - for knowledge discovery and strategic decision making using data mining tools.

74 of 104

Data Warehouse Design and Usage

Tools for data warehousing can be categorized into

access and retrieval tools,
database reporting tools,
data analysis tools, and
data mining tools.

There are three kinds of data warehouse applications:

information processing,
analytical processing, and
data mining

75 of 104

Data Warehouse Design and Usage

Information processing supports

querying,
basic statistical analysis, and
reporting using crosstabs, tables, charts, or graphs.

Analytical processing supports

basic OLAP operations,

slice-and-dice, drill-down, roll-up, and pivoting.

It generally operates on historic data in both summarized and detailed forms.
multidimensional data analysis

76 of 104

Data Warehouse Design and Usage

Data mining supports

knowledge discovery by finding hidden patterns and associations,
constructing analytical models,
performing classification and prediction, and
presenting the mining results using visualization tools.

Note:-

Data Mining is different with Information Processing and Analytical processing

77 of 104

Data Warehouse Design and Usage

From Online Analytical Processing to Multidimensional Data Mining

Multidimensional data mining

Exploratory multidimensional data mining,

Online analytical mining - OLAM

integrates OLAP with data mining to uncover knowledge in multidimensional databases.

78 of 104

Data Warehouse Design and Usage

Multidimensional data mining is particularly important for the following reasons:

High quality of data in data warehouses:
Available information processing infrastructure surrounding data warehouses
OLAP-based exploration of multidimensional data
Online selection of data mining functions

79 of 104

Data warehouse implementation�

OLAP servers demand that decision support queries be answered in the order of seconds.

Methods for the efficient implementation of data warehouse systems.

1. Efficient data cube computation.
2. OLAP data indexing (bitmap or join indices )
3. OLAP query processing
4. Various types of warehouse servers for OLAP processing.

80 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Requires efficient computation of aggregations across many sets of dimensions.
In SQL terms:

Aggregations are referred to as group-by’s.

Each group-by can be represented by a cuboid,

set of group-by’s forms a lattice of cuboids defining a data cube.

Compute cube Operator - computes aggregates over all subsets of the dimensions specified in the operation.

require excessive storage space for large number of dimensions.

81 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Example 4.6

create a data cube for AllElectronics sales that contains the following:

city, item, year, and sales in dollars.

82 of 104

Data warehouse implementation:

Efficient Data Cube Computation

What is the total number of cuboids, or group-by’s, that can be computed for this data cube?

3 attributes - city, item & year -3 dimensions
sales in dollars - measure,
the total number of cuboids, or group by’s,

2 POWER 3 = 8.

The possible group-by’s are the following:

{(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}

() - group-by is empty (i.e., the dimensions are not grouped) - all.
group-by’s form a lattice of cuboids for the data cube

83 of 104

Data warehouse implementation:

Efficient Data Cube Computation

84 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Base cuboid contains all three dimensions(city, item, year)

returns - total sales for any combination of the three dimensions.
This is least generalized (most specific) of the cuboids.

Apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty (contains total sum of all sales)

This is most generalized (least specific) of the cuboids

Drill Down equivalent

start at the apex cuboid and explore downward in the lattice

akin to rolling up

start at the base cuboid and explore upward

85 of 104

Data warehouse implementation:

Efficient Data Cube Computation

zero-dimensional operation:

An SQL query containing no group-by
Example - “compute the sum of total sales”

one-dimensional operation:

An SQL query containing one group-by
Example - “compute the sum of sales group-by city”

A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.

86 of 104

Data warehouse implementation:

Efficient Data Cube Computation

data cube could be defined as:

“define cube sales_cube [city, item, year]: sum(sales_in_dollars)”

2 power n cuboids - For a cube with n dimensions
“compute cube sales_cube” - statement

computes the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.

In OLAP, for diff. queries diff. cuboids need to be accessed.
Precomputation - compute in advance all or at least some of the cuboids in a data cube
curse of dimensionality - required storage space may explode if all the cuboids in a data cube are precomputed ( for more dimensions)

87 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Data cube can be viewed as a lattice of cuboids

2 power n - when no concept hierarchy

How many cuboids in an n-dimensional cube with L levels?

where L_i is the number of levels associated with dimension i ( +1 for all )

If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 5¹⁰≈ 9.8 × 10⁶.

88 of 104

Data warehouse implementation:

Efficient Data Cube Computation

There are three choices for data cube materialization given a base cuboid:

1. No materialization: Do not precompute - expensive multidimensional aggregates - extremely slow.

2. Full materialization: Precompute all of the cuboids - full cube - requires huge amounts of memory space in order to store all of the precomputed cuboids.

89 of 104

Data warehouse implementation:

Efficient Data Cube Computation

3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion - subcube

3 factors to consider:

(1) identify the subset of cuboids or subcubes to materialize;
(2) exploit the materialized cuboids or subcubes during query processing; and
(3) efficiently update the materialized cuboids or subcubes during load and refresh.

90 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Partial Materialization: Selected Computation of Cuboids

Following should take into account during selection of the subset of cuboids or subcubes

the queries in the workload, their frequencies, and their accessing costs

workload characteristics, the cost for incremental updates, and the total storage requirements.

physical database design such as the generation and selection of indices.

91 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Heuristic approaches for cuboid and subcube selection

Iceberg cube:

data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold.

shell cube:

precomputing the cuboids for only a small number of dimensions

92 of 104

Data warehouse implementation:

Indexing OLAP Data: Bitmap Index

Index structures - To facilitate efficient data accessing
Bitmap indexing method - it allows quick searching in data cubes.
In the bitmap index for a given attribute, there is a distinct bit vector, B_v, for each value v in the attribute’s domain.
If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors).
If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.

93 of 104

Data warehouse implementation:

Indexing OLAP Data: Bitmap Index

Example:- AllElectronics data warehouse
dim(item)={H,C,P,S} - 4 values - 4 bit vectors
dim(city)= {V,T} - 2 values - 2 bit vectors
Better than Hash & Tree Indices but good for low cardinality only (cardinality:number of unique items in the database column)

94 of 104

Exercise: Bitmap index on CITY

95 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

Traditional indexing maps the value in a given column to a list of rows having that value.
Join indexing registers the joinable rows of two relations from a relational database.
Join indexing registers the joinable rows of two relations from a relational database.
For example,

two relations - R(RID, A) and S(B, SID)
join on the attributes A and B,
join index record contains the pair (RID, SID),
where RID and SID are record identifiers from the R and S relations, respectively

96 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

Advantage:-

Identification of joinable tuples without performing costly join operations.

Useful:-

To maintain the relationship between a foreign key(fact table) and its matching primary keys(dimension table), from the joinable relation.

Indexing maintains relationships between attribute values of a dimension (e.g., within a dimension table) and the corresponding rows in the fact table.
Composite join indices: Join indices with multiple dimensions.

97 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

Example:-Star Schema

“sales_star [time, item, branch, location]: dollars_sold = sum (sales_in_dollars).”
join index is relationship between

Sales fact table and
the location, item dimension tables

To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices.

98 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

Given materialized views, query processing should proceed as follows:

1. Determine which operations should be performed on the available cuboids:

This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query

into

corresponding SQL and/or OLAP operations.

Example:

slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.

99 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

Determine to which materialized cuboid(s) the relevant operations should be applied:

pruning the set using knowledge of

“dominance” relationships among the cuboids,

estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.

100 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

Example:-

define a data cube for AllElectronics of the form “sales cube [time, item, location]: sum(sales in dollars).”
dimension hierarchies

“day < month < quarter < year” for time;
“item_name < brand < type” for item
“street < city < province or state < country” for location

Query:

{brand, province or state}, with the selection constant “year = 2010.”

100

101 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

suppose that there are four materialized cuboids available, as follows:

“Which of these four cuboids should be selected to process the query?” - 1,3,4
How would the costs of each cuboid compare if used to process the query?” - 4

101

102 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

Relational OLAP (ROLAP) servers:

ROLAP uses relational tables to store data for online analytical processing
Intermediate servers that stand in between a relational back-end server and client front-end tools.
Operation:

use a relational or extended-relational DBMS to store and manage warehouse data
OLAP middleware to support missing pieces

ROLAP has greater scalability than MOLAP.
Example:-

DSS server of Microstrategy

102

103 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

Multidimensional OLAP (MOLAP) servers:

support multidimensional data views through array-based multidimensional storage engines
maps multidimensional views directly to data cube array structures.
Advantage:

fast indexing to precomputed summarized data.

adopt a two-level storage representation

Denser subcubes are stored as array structures
Sparse subcubes employ compression technology

103

A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non-zeros.

104 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

Hybrid OLAP (HOLAP) servers:

Combines ROLAP and MOLAP technology
benefits

greater scalability from ROLAP and
faster computation of MOLAP.

HOLAP server may allow

large volumes of detailed data to be stored in a relational database,
while aggregations are kept in a separate MOLAP store.

Example:- Microsoft SQL Server 2000 (supports)
Specialized SQL servers:

provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

104

1 of 104

2 of 104

3 of 104

4 of 104

5 of 104

6 of 104

7 of 104

8 of 104

9 of 104

10 of 104

11 of 104

12 of 104

13 of 104

14 of 104

15 of 104

16 of 104

17 of 104

18 of 104

19 of 104

20 of 104

21 of 104

22 of 104

23 of 104

24 of 104

25 of 104

26 of 104

27 of 104

28 of 104

29 of 104

30 of 104

31 of 104

32 of 104

33 of 104

34 of 104

35 of 104

36 of 104

37 of 104

38 of 104

39 of 104

40 of 104

41 of 104

42 of 104

43 of 104

44 of 104

45 of 104

46 of 104

47 of 104

48 of 104

49 of 104

50 of 104

51 of 104

52 of 104

53 of 104

54 of 104

55 of 104

56 of 104

57 of 104

58 of 104

59 of 104

60 of 104

61 of 104

62 of 104

63 of 104

64 of 104

65 of 104

66 of 104

67 of 104

68 of 104

69 of 104

70 of 104

71 of 104

72 of 104

73 of 104

74 of 104

75 of 104

76 of 104

77 of 104

78 of 104

79 of 104

80 of 104