1 of 104

Course Name :

DATA WAREHOUSING & DATA MINING

Course Code : 20CS10

Course Instructor : Michael Sadgun Rao.K

Semester : IV

Regulation : R20

1

2 of 104

20CS10 - DATA WAREHOUSING & DATA MINING

  • Theory - 4 hours per week
    • TEXTBOOKS:
      • Introduction to Data Mining : Pang-Ning Tan & Michael Steinbach, Vipin Kumar, Fifth Impression, Pearson, 2015.
      • Data Mining concepts and Techniques, 3rd Edition, Jiawei Han, Michael Kamber, Elsevier, 2011
  • Practical - 3 hours per week
    • Data Mining using Python Lab

2

3 of 104

20CS10 - DATA WAREHOUSING & DATA MINING

  • Pre-requisite :
    • DBMS and
    • Probability and Statistics
  • Course Educational Objective:
    • To introduce the concepts of data warehouse and data mining, which gives a complete description about the principles used, architectures, applications, design and implementation of data mining and data warehousing concepts.

3

4 of 104

Course Outcomes:

At the end of this course, the student will be able to

CO1: Summarize the architecture of data warehouse. (Understand- L2)

CO2: Apply different preprocessing methods, Similarity, Dissimilarity measures for any given raw data.(Apply – L3)

CO3: Construct a decision tree and resolve the problem of model overfitting.(Analyze– L4)

CO4: Compare Apriori and FP-growth association rule mining algorithms for frequent itemset generation.(Apply - L3)

CO5: Apply suitable clustering algorithm for the given data set. (Apply - L3)

4

5 of 104

DATA WAREHOUSING AND DATA MINING

UNIT I

DATA WAREHOUSING

5

6 of 104

6

DATA WAREHOUSING (Unit - I)

  • Data Warehouse and OLAP Technology:
    • An Overview: Data Warehouse
    • A Multidimensional Data Model
    • Data Warehouse Architecture
    • Data Warehouse Implementation
    • From Data Warehousing to Data Mining. (Han & Kamber)

7 of 104

What is Data Warehouse?

  • Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
  • Data warehouse refers to a data repository that is maintained separately from an organization’s operational databases.
  • A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”—W. H. Inmon
  • Data warehousing: The process of constructing and using data warehouses

7

8 of 104

Data Warehouse—Subject-Oriented

  • Organized around major subjects, such as customer, product, sales
  • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
  • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

8

9 of 104

Data Warehouse—Integrated

  • Constructed by integrating multiple, heterogeneous data sources
    • relational databases, flat files, on-line transaction records
  • Data cleaning and data integration techniques are applied.
    • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
      • E.g., Hotel price: currency, tax, breakfast covered, etc.
    • When data is moved to the warehouse, it is converted.

9

10 of 104

Data Warehouse—Time Variant

  • The time horizon for the data warehouse is significantly longer than that of operational systems
    • Operational database: current value data
    • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
  • Every key structure in the data warehouse
    • Contains an element of time, explicitly or implicitly
    • But the key of operational data may or may not contain “time element”

10

11 of 104

Data Warehouse—Nonvolatile

  • A physically separate store of data transformed from the operational environment
  • Operational update of data does not occur in the data warehouse environment
    • Does not require transaction processing, recovery, and concurrency control mechanisms
    • Requires only two operations in data accessing:
      • initial loading of data and access of data

11

12 of 104

OLTP vs OLAP

  • OLTP - Online Transaction Processing
  • OLAP - online Analytical Processing

12

OLTP

OLAP

User & System Orientation

Customer Oriented

( transaction & query processing)

market Oriented ( Data Analysis by managers, executives & Analysts)

Data Contents

Current Data (too detailed)

Large amount of data

(summarization & aggregation)

Database design

ER data model ( Application oriented database design)

Star or Snowflake model (subject Oriented Database design)

View

focus on current Data within an enterprise or department

multiple versions of database schema(evolutionary process),

data from diff. org. & many data stores

Access Patterns

short, atomic transactions (requires concurrency control & recovery)

read-only operations ( Complex queries)

13 of 104

Data Warehouse vs. Operational DBMS

  • OLTP (on-line transaction processing)
    • Major task of traditional relational DBMS
    • Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
  • OLAP (on-line analytical processing)
    • Major task of data warehouse system
    • Data analysis and decision making
  • Distinct features (OLTP vs. OLAP):
    • User and system orientation: customer vs. market
    • Data contents: current, detailed vs. historical, consolidated
    • Database design: ER + application vs. star + subject
    • View: current, local vs. evolutionary, integrated
    • Access patterns: update vs. read-only but complex queries

13

14 of 104

OLTP vs. OLAP

14

15 of 104

Why Separate Data Warehouse?

  • High performance for both systems
    • DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery
    • Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
  • Different functions and different data:
    • missing data: Decision support requires historical data which operational DBs do not typically maintain
    • data consolidation: Decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
    • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
  • Note: There are more and more systems which perform OLAP analysis directly on relational databases

15

16 of 104

16

17 of 104

Data Warehousing: A Multitiered Architecture

  • Bottom Tier:
    • Warehouse Database server
    • a relational database system
    • Back-end tools and utilities
      • data extraction
        • by using API gateways(ODBC, JDBC & OLEDB)
      • cleaning
      • transformation
      • load & refresh

17

18 of 104

Data Warehousing: A Multitiered Architecture

  • Middle Tier (OLAP server)
    • ROLAP - Relational OLAP
      • extended RDBMS that maps operations on multidimensional data to standard relational operations.
    • MOLAP - Multidimensional OLAP
      • Special-purpose server that directly implements multidimensional data and operations.
  • Top Tier
    • Front-end Client Layer
    • Query and reporting tools, analysis tools and data mining tools.

18

19 of 104

Data Warehousing: A Multitiered Architecture

  • Data Warehouse Models:
    • Enterprise warehouse:
      • collects all of the information about subjects spanning the entire organization.

      • corporate-wide data integration

      • can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

      • implemented on mainframes, computer superservers, or parallel architecture platforms

19

20 of 104

Data Warehousing: A Multitiered Architecture

  • Data Warehouse Models:
  • Data mart:a subset of corporate-wide data that is of value to a specific group of users
  • confined to specific selected subjects.
  • Example - marketing data mart may confine its subjects to customer, item, and sales.
  • implemented on low-cost departmental servers
  • Independent Data mart - data captured from
    • one or more operational systems or external information providers,

or

    • from data generated locally within a particular department or geographic area.
  • Dependent Data mart - sourced directly from enterprise data warehouses.

20

21 of 104

Data Warehousing: A Multitiered Architecture

  • Data Warehouse Models:
  • Virtual warehouse:
    • A virtual warehouse is a set of views over operational databases.
    • easy to build but requires excess capacity on operational database servers.

21

22 of 104

Data Warehousing: A Multitiered Architecture

  • Data extraction: gathers data from multiple, heterogeneous, and external sources.
  • Data Cleaning: detects errors in the data and rectifies them when possible
  • Data transformation: converts data from legacy or host format to warehouse format.
  • Load: sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
  • Refresh: propagates the updates from the data sources to the warehouse.

22

23 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

metadata are the data that define warehouse objects

It consists of:

    • Data warehouse structure
    • Operational metadata
    • algorithms used for summarization
    • Mapping from the operational environment to the data warehouse
    • Data related to system performance
    • Business metadata

23

24 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

    • data warehouse structure
      1. warehouse schema,
      2. view, dimensions,
      3. hierarchies, and
      4. derived data definitions,
      5. data mart locations and contents.
    • Operational metadata
      • data lineage (history of migrated data and the sequence of transformations applied to it),
      • currency of data (active, archived, or purged),
      • monitoring information (warehouse usage statistics, error reports, and audit trails).

24

25 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

    • The algorithms used for summarization,
      • measure and dimension definition algorithms,
      • data on granularity,
      • partitions,
      • subject areas,
      • aggregation,
      • summarization, and
      • predefined queries and reports.

25

26 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

    • Mapping from the operational environment to the data warehouse
      • source databases and their contents,
      • gateway descriptions,
      • data partitions,
      • data extraction, cleaning, transformation rules and defaults
      • data refresh and purging rules, and
      • security (user authorization and access control).

26

27 of 104

Data Warehousing: A Multitiered Architecture

Metadata Repository:

    • Data related to system performance
      • indices and profiles that improve data access and retrieval performance,
      • rules for the timing and scheduling of refresh, update, and replication cycles.
    • Business metadata,
      • business terms and definitions,
      • data ownership information, and
      • charging policies

27

28 of 104

Data Warehouse Modeling: Data Cube :

A Multidimensional Data Model

  • A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
  • Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
  • Example:-
    • AllElectronics may create a sales data warehouse
    • time, item, branch, and location - These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.

28

29 of 104

Data Warehouse Modeling: Data Cube :

A Multidimensional Data Model

  • Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.
  • For example - a dimension table for item may contain the attributes item name, brand, and type.
  • A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table.
  • Facts are numeric measures.
  • The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.

29

30 of 104

Data Cube: A Multidimensional Data Model

  • A data warehouse is based on a multidimensional data model which views data in the form of a data cube

  • A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

    • Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)

    • Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

30

31 of 104

Data Cube: A Multidimensional Data Model

  • A data cube is a lattice of cuboids
  • A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in which
    • each dimension corresponds to an attribute or a set of attributes in the schema, and
    • each cell stores the value of some aggregate measure such as count or sum(sales_amount).
  • A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.

31

32 of 104

Data Cube: A Multidimensional Data Model

32

2-D View of Sales data

  • AllElectronics sales data for items sold per quarter in the city of Vancouver.
  • a simple 2-D data cube that is a table or spreadsheet for sales data from AllElectronics

33 of 104

Data Cube: A Multidimensional Data Model

33

3-D View of a Sales data

The 3-D data in the table are represented as a series of 2-D tables

34 of 104

Data Cube: A Multidimensional Data Model

34

3D Data Cube Representation of Sales data

we may also represent the same data in the form of a 3D data cube

35 of 104

Data Cube: A Multidimensional Data Model

35

4-D Data Cube Representation of Sales Data

we may display any n-dimensional data as a series of (n − 1)-dimensional “cubes.”

36 of 104

Cube: A Lattice of Cuboids

36

time,item

time,item,location

time, item, location, supplier

all

time

item

location

supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

37 of 104

  • In data warehousing literature, an n-D base cube is called a base cuboid.

  • The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid.
    • In our example, this is the total sales, or dollars sold, summarized over all four dimensions.
    • The apex cuboid is typically denoted by all.

  • The lattice of cuboids forms a data cube.

37

38 of 104

Schemas for Multidimensional Data Models

  • Modeling data warehouses: dimensions & measures
    • Star schema: A fact table in the middle connected to a set of dimension tables
    • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
    • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

38

39 of 104

Schemas for Multidimensional Data Models

  • Star schema: In this, a data warehouse contains

(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and

(2) a set of smaller attendant tables

(dimension tables), one for each dimension.

  • Each dimension is represented by only one table.
  • Each table contains a set of attributes
  • Problem: redundancy in dimension tables.
  • ex:- location dimension table will create redundancy among the attributes province or state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA).

39

40 of 104

Star schema

40

41 of 104

Snow flake

  • Variant of the star schema model
  • Dimension tables are normalized ( to remove redundancy)
  • Dimension table is splitted into additional tables.
  • The resulting schema graph forms a shape similar to a snowflake.
  • Problem
    • more joins will be needed to execute a query ( affects system performance)
    • so this is not as popular as the star schema in data warehouse design.

41

42 of 104

Snowflake schema

42

43 of 104

Fact Constellation

43

  • A fact constellation schema allows dimension tables to be shared between fact tables
  • A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
  • For data warehouses, the fact constellation schema is commonly used.
  • For data marts, the star or snowflake schema is commonly used

44 of 104

Fact Constellation

44

This schema specifies two fact tables, sales and shipping

the dimensions tables for time, item, and location are shared between the sales and shipping fact tables.

45 of 104

Cube Definition Syntax in DMQL

  • Cube Definition (Fact Table)

define cube <cube_name> [<dimension_list>]: <measure_list>

  • Dimension Definition (Dimension Table)

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

  • Special Case (Shared Dimension Tables)
    • First time as “cube definition”
    • define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

45

46 of 104

Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

46

47 of 104

Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))

47

48 of 104

Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

48

49 of 104

Dimensions: The Role of Concept Hierarchies

  • A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level.
  • concept hierarchy for the dimension location

49

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman

50 of 104

Dimensions: The Role of Concept Hierarchies

  • A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.

50

courtesy: Data Mining. Concepts and Techniques, 3rd Edition (The Morgan Kaufman

51 of 104

Dimensions: The Role of Concept Hierarchies

  • Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or attribute, resulting in a set-grouping hierarchy.
  • A total or partial order can be defined among groups of values.

51

52 of 104

Measures of Data Cube: Three Categories

  • Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
      • E.g., count(), sum(), min(), max()
  • Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
      • E.g., avg()=sum()/count(), min_N(), standard_deviation()
  • Holistic: there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation.
      • E.g., median(), mode(), rank()

52

53 of 104

Typical OLAP Operations

  • Roll up (drill-up):
  • Drill down (roll down):
  • Slice and dice: project and select
  • Pivot (rotate):
    • reorient the cube, visualization, 3D to series of 2D planes
  • Other operations
    • drill across: involving (across) more than one fact table
    • drill through: Allows users to analyze the same data through different reports, analyze it with different features and even display it through different visualization methods

53

54 of 104

54

Fig. 3.10 Typical OLAP Operations

55 of 104

Typical OLAP Operations:Roll Up/Drill Up

55

Source & Courtesy: https://www.javatpoint.com/olap-operations

  • summarize data
    • by climbing up hierarchy

or

    • by dimension reduction

56 of 104

Typical OLAP Operations:Roll Down

56

  • reverse of roll-up
    • from higher level summary to lower level summary or detailed data, or introducing new dimensions

Source & Courtesy: https://www.javatpoint.com/olap-operations

57 of 104

Typical OLAP Operations:Slicing

57

  • Slice is the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions, creating a new cube with one fewer dimension.
  • Example: The sales figures of all sales regions and all product categories of the company in the year 2005 and 2006 are "sliced" out of the data cube.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

58 of 104

Typical OLAP Operations:Slicing

58

Slicing:

It selects a single dimension from the OLAP cube which results in a new sub-cube creation.

Source & Courtesy: https://www.javatpoint.com/olap-operations

59 of 104

Typical OLAP Operations:Dice

59

  • Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions
  • The picture shows a dicing operation: The new cube shows the sales figures of a limited number of product categories, the time and region dimensions cover the same range as before.

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

60 of 104

Typical OLAP Operations:Dicing

60

Dice:

It selects a sub-cube from the OLAP cube by selecting two or more dimensions.

Source & Courtesy: https://www.javatpoint.com/olap-operations

61 of 104

Typical OLAP Operations:Pivot

61

Source & Courtesy: https://en.wikipedia.org/wiki/OLAP_cube

Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter.

62 of 104

A Star-Net Query Model

62

  • The querying of multidimensional databases can be based on a starnet model.
  • It consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension.
  • Each abstraction level in the hierarchy is called a footprint.

  • These represent the granularities available for use by OLAP operations such as drill-down and roll-up.

63 of 104

A Star-Net Query Model

63

64 of 104

A Star-Net Query Model

  • Four radial lines, representing concept hierarchies for the dimensions location, customer, item, and time, respectively
  • footprints representing abstraction levels of the dimension - time line has four footprints: “day,” “month,” “quarter,” and “year.”
  • Concept hierarchies can be used to generalize data by replacing low-level values (such as “day” for the time dimension) by higher-level abstractions (such as “year”)

or

  • to specialize data by replacing higher-level abstractions with lower-level values.

64

65 of 104

A Star-Net Query Model

65

Shipping Method

AIR-EXPRESS

TRUCK

ORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

Organization

Promotion

CITY

COUNTRY

REGION

Location

DAILY

QTRLY

ANNUALY

Time

Each circle is called a footprint

66 of 104

Data Warehouse Design and Usage

A Business Analysis Framework for Data Warehouse Design:

  • To design an effective data warehouse we need to understand and analyze business needs and construct a business analysis framework.

  • Different views are combined to form a complex framework.

    • Top-down, business-driven, or owner’s perspective

    • Bottom-up, builder-driven, or implementor’s view

66

67 of 104

Data Warehouse Design and Usage

  • Four different views regarding a data warehouse design must be considered:
  • Top-down view
    • allows the selection of the relevant information necessary for the data warehouse (matches current and future business needs).
  • Data source view
    • exposes the information being captured, stored, and managed by operational systems.
    • Documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.
    • Modeled in ER model or CASE (computer-aided software engineering).

67

68 of 104

Data Warehouse Design and Usage

  • Data warehouse view
    • includes fact tables and dimension tables.
    • It represents the information that is stored inside the data warehouse, including
      • precalculated totals and counts,
      • information regarding the source, date, and time of origin, added to provide historical context.
  • Business query view
    • is the data perspective in the data warehouse from the end-user’s viewpoint.

68

69 of 104

Data Warehouse Design and Usage

  • Skills required to build & use a Data warehouse
    • Business Skills
      • how systems store and manage their data,
      • how to build extractors (operational DBMS to DW)
      • how to build warehouse refresh software(update)
    • Technology skills
      • the ability to discover patterns and trends,
      • to extrapolate trends based on history and look for anomalies or paradigm shifts, and
      • to present coherent managerial recommendations based on such analysis.
    • Program management skills
      • Interface with many technologies, vendors, and end-users in order to deliver results in a timely and cost effective manner

69

70 of 104

Data Warehouse Design and Usage

Data Warehouse Design Process

  • A data warehouse can be built using
    • Top-down approach (overall design and planning)
      • It is useful in cases where the technology is mature and well known
    • Bottom-up approach(starts with experiments & prototypes)
    • a combination of both.
  • In SE point of view ( Waterfall model or Spiral model)
        • planning,
        • requirements study,
        • problem analysis,
        • warehouse design,
        • data integration and testing, and
        • finally deployment of the data warehouse

70

structured and systematic analysis at each step, one step to the next

  • rapid generation, short intervals between successive releases, good choice for data warehouse development
  • turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner

71 of 104

Data Warehouse Design and Usage

Data Warehouse Design Process

  • Steps involved in Warehouse design are:
  • 1. Choose a business process to model (e.g., orders, invoices, shipments, inventory, account administration, sales, or the general ledger).
    • Data warehouse model - If the business process is organizational and involves multiple complex object collections,
    • Data mart model - if the process is departmental and focuses on the analysis of one kind of business process

71

72 of 104

Data Warehouse Design and Usage

  • 2. Choose the business process grain
    • Fundamental, atomic level of data to be represented in the fact table
    • (e.g., individual transactions, individual daily snapshots, and so on).
  • 3. Choose the dimensions that will apply to each fact table record.
    • Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.
  • 4. Choose the measures that will populate each fact table record.
    • Typical measures are numeric additive quantities like dollars sold and units sold.

72

73 of 104

Data Warehouse Design and Usage

Data Warehouse Usage for Information Processing

  • Evolution of DW takes place throughout a number of phases.
  • Initial Phase - DW is used for generating reports and answering predefined queries.
  • Progressively - to analyze summarized and detailed data, (results are in the form of reports and charts)
  • Later - for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations.
  • Finally - for knowledge discovery and strategic decision making using data mining tools.

73

74 of 104

Data Warehouse Design and Usage

  • Tools for data warehousing can be categorized into
    • access and retrieval tools,
    • database reporting tools,
    • data analysis tools, and
    • data mining tools.
  • There are three kinds of data warehouse applications:
    • information processing,
    • analytical processing, and
    • data mining

74

75 of 104

Data Warehouse Design and Usage

  • Information processing supports
    • querying,
    • basic statistical analysis, and
    • reporting using crosstabs, tables, charts, or graphs.
  • Analytical processing supports
    • basic OLAP operations,
      • slice-and-dice, drill-down, roll-up, and pivoting.
    • It generally operates on historic data in both summarized and detailed forms.
    • multidimensional data analysis

75

76 of 104

Data Warehouse Design and Usage

  • Data mining supports
    • knowledge discovery by finding hidden patterns and associations,
    • constructing analytical models,
    • performing classification and prediction, and
    • presenting the mining results using visualization tools.
  • Note:-
    • Data Mining is different with Information Processing and Analytical processing

76

77 of 104

Data Warehouse Design and Usage

From Online Analytical Processing to Multidimensional Data Mining

  • Multidimensional data mining

or

  • Exploratory multidimensional data mining,

or

  • Online analytical mining - OLAM

  • integrates OLAP with data mining to uncover knowledge in multidimensional databases.

77

78 of 104

Data Warehouse Design and Usage

  • Multidimensional data mining is particularly important for the following reasons:
    • High quality of data in data warehouses:
    • Available information processing infrastructure surrounding data warehouses
    • OLAP-based exploration of multidimensional data
    • Online selection of data mining functions

78

79 of 104

Data warehouse implementation�

  • OLAP servers demand that decision support queries be answered in the order of seconds.

  • Methods for the efficient implementation of data warehouse systems.
    • 1. Efficient data cube computation.
    • 2. OLAP data indexing (bitmap or join indices )
    • 3. OLAP query processing
    • 4. Various types of warehouse servers for OLAP processing.

79

80 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • Requires efficient computation of aggregations across many sets of dimensions.
  • In SQL terms:
    • Aggregations are referred to as group-by’s.
  • Each group-by can be represented by a cuboid,
    • set of group-by’s forms a lattice of cuboids defining a data cube.
  • Compute cube Operator - computes aggregates over all subsets of the dimensions specified in the operation.
    • require excessive storage space for large number of dimensions.

80

81 of 104

Data warehouse implementation:

Efficient Data Cube Computation

Example 4.6

  • create a data cube for AllElectronics sales that contains the following:
    • city, item, year, and sales in dollars.

81

82 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • What is the total number of cuboids, or group-by’s, that can be computed for this data cube?
    • 3 attributes - city, item & year -3 dimensions
    • sales in dollars - measure,
    • the total number of cuboids, or group by’s,
      • 2 POWER 3 = 8.
    • The possible group-by’s are the following:
      • {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}
    • () - group-by is empty (i.e., the dimensions are not grouped) - all.
    • group-by’s form a lattice of cuboids for the data cube

82

83 of 104

Data warehouse implementation:

Efficient Data Cube Computation

83

84 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • Base cuboid contains all three dimensions(city, item, year)
    • returns - total sales for any combination of the three dimensions.
    • This is least generalized (most specific) of the cuboids.
  • Apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty (contains total sum of all sales)
    • This is most generalized (least specific) of the cuboids
  • Drill Down equivalent
    • start at the apex cuboid and explore downward in the lattice
  • akin to rolling up
    • start at the base cuboid and explore upward

84

85 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • zero-dimensional operation:
    • An SQL query containing no group-by
    • Example - “compute the sum of total sales
  • one-dimensional operation:
    • An SQL query containing one group-by
    • Example - “compute the sum of sales group-by city

  • A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each subset of the n dimensions.

85

86 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • data cube could be defined as:
    • “define cube sales_cube [city, item, year]: sum(sales_in_dollars)”
  • 2 power n cuboids - For a cube with n dimensions
  • “compute cube sales_cube” - statement
    • computes the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.
  • In OLAP, for diff. queries diff. cuboids need to be accessed.
  • Precomputation - compute in advance all or at least some of the cuboids in a data cube
  • curse of dimensionality - required storage space may explode if all the cuboids in a data cube are precomputed ( for more dimensions)

86

87 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • Data cube can be viewed as a lattice of cuboids
    • 2 power n - when no concept hierarchy
  • How many cuboids in an n-dimensional cube with L levels?
    • where Li is the number of levels associated with dimension i ( +1 for all )

    • If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 510 ≈ 9.8 × 106 .

87

88 of 104

Data warehouse implementation:

Efficient Data Cube Computation

There are three choices for data cube materialization given a base cuboid:

  • 1. No materialization: Do not precompute - expensive multidimensional aggregates - extremely slow.

  • 2. Full materialization: Precompute all of the cuboids - full cube - requires huge amounts of memory space in order to store all of the precomputed cuboids.

88

89 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • 3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
  • compute a subset of the cube, which contains only those cells that satisfy some user-specified criterion - subcube

  • 3 factors to consider:
    • (1) identify the subset of cuboids or subcubes to materialize;
    • (2) exploit the materialized cuboids or subcubes during query processing; and
    • (3) efficiently update the materialized cuboids or subcubes during load and refresh.

89

90 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • Partial Materialization: Selected Computation of Cuboids

  • Following should take into account during selection of the subset of cuboids or subcubes

    • the queries in the workload, their frequencies, and their accessing costs

    • workload characteristics, the cost for incremental updates, and the total storage requirements.

    • physical database design such as the generation and selection of indices.

90

91 of 104

Data warehouse implementation:

Efficient Data Cube Computation

  • Heuristic approaches for cuboid and subcube selection
    • Iceberg cube:
      • data cube that stores only those cube cells with an aggregate value (e.g., count) that is above some minimum support threshold.
    • shell cube:
      • precomputing the cuboids for only a small number of dimensions

91

92 of 104

Data warehouse implementation:

Indexing OLAP Data: Bitmap Index

  • Index structures - To facilitate efficient data accessing
  • Bitmap indexing method - it allows quick searching in data cubes.
  • In the bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s domain.
  • If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap index (i.e., there are n bit vectors).
  • If the attribute has the value v for a given row in the data table, then the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits for that row are set to 0.

92

93 of 104

Data warehouse implementation:

Indexing OLAP Data: Bitmap Index

93

  • Example:- AllElectronics data warehouse
  • dim(item)={H,C,P,S} - 4 values - 4 bit vectors
  • dim(city)= {V,T} - 2 values - 2 bit vectors
  • Better than Hash & Tree Indices but good for low cardinality only (cardinality:number of unique items in the database column)

94 of 104

Exercise: Bitmap index on CITY

94

95 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

  • Traditional indexing maps the value in a given column to a list of rows having that value.
  • Join indexing registers the joinable rows of two relations from a relational database.
  • Join indexing registers the joinable rows of two relations from a relational database.
  • For example,
    • two relations - R(RID, A) and S(B, SID)
    • join on the attributes A and B,
    • join index record contains the pair (RID, SID),
    • where RID and SID are record identifiers from the R and S relations, respectively

95

96 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

  • Advantage:-
    • Identification of joinable tuples without performing costly join operations.
  • Useful:-
    • To maintain the relationship between a foreign key(fact table) and its matching primary keys(dimension table), from the joinable relation.
  • Indexing maintains relationships between attribute values of a dimension (e.g., within a dimension table) and the corresponding rows in the fact table.
  • Composite join indices: Join indices with multiple dimensions.

96

97 of 104

Data warehouse implementation:

Indexing OLAP Data: Join Index

  • Example:-Star Schema
    • sales_star [time, item, branch, location]: dollars_sold = sum (sales_in_dollars).”
    • join index is relationship between
      • Sales fact table and
      • the location, item dimension tables

97

To speed up query processing - join indexing & bitmap indexing methods can be integrated to form bitmapped join indices.

98 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

Given materialized views, query processing should proceed as follows:

  • 1. Determine which operations should be performed on the available cuboids:
    • This involves transforming any selection, projection, roll-up (group-by), and drill-down operations specified in the query

into

corresponding SQL and/or OLAP operations.

    • Example:
      • slicing and dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.

98

99 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

  • Determine to which materialized cuboid(s) the relevant operations should be applied:
    • pruning the set using knowledge of

“dominance” relationships among the cuboids,

    • estimating the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.

99

100 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

Example:-

  • define a data cube for AllElectronics of the form “sales cube [time, item, location]: sum(sales in dollars).
  • dimension hierarchies
    • “day < month < quarter < year” for time;
    • “item_name < brand < type” for item
    • “street < city < province or state < country” for location
  • Query:
    • {brand, province or state}, with the selection constant “year = 2010.”

100

101 of 104

Data warehouse implementation:

Efficient processing of OLAP queries

  • suppose that there are four materialized cuboids available, as follows:

  • “Which of these four cuboids should be selected to process the query?” - 1,3,4
  • How would the costs of each cuboid compare if used to process the query?” - 4

101

102 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

  • Relational OLAP (ROLAP) servers:
    • ROLAP uses relational tables to store data for online analytical processing
    • Intermediate servers that stand in between a relational back-end server and client front-end tools.
    • Operation:
      • use a relational or extended-relational DBMS to store and manage warehouse data
      • OLAP middleware to support missing pieces
    • ROLAP has greater scalability than MOLAP.
    • Example:-
      • DSS server of Microstrategy

102

103 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

  • Multidimensional OLAP (MOLAP) servers:
    • support multidimensional data views through array-based multidimensional storage engines
    • maps multidimensional views directly to data cube array structures.
    • Advantage:
      • fast indexing to precomputed summarized data.
    • adopt a two-level storage representation
      • Denser subcubes are stored as array structures
      • Sparse subcubes employ compression technology

103

A sparse array is one that contains mostly zeros and few non-zero entries. A dense array contains mostly non-zeros.

104 of 104

Data warehouse implementation:

OLAP Server Architectures:ROLAP vs MOLAP vs HOLAP

  • Hybrid OLAP (HOLAP) servers:
    • Combines ROLAP and MOLAP technology
    • benefits
      • greater scalability from ROLAP and
      • faster computation of MOLAP.
    • HOLAP server may allow
        • large volumes of detailed data to be stored in a relational database,
        • while aggregations are kept in a separate MOLAP store.
    • Example:- Microsoft SQL Server 2000 (supports)
    • Specialized SQL servers:
      • provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

104