GREENPLUM DATABASE DBA

HANDBOOK

Sachchida N. Ojha

First Edition


Covers

DCA V1 and V2

Greenplum Database 4.2.x

Data Domain and Greenplum Chorus

Chapter                                                              Page No.

Chapter 1: Introduction ............................................ 3-5
Chapter 2: Greenplum DCA Architecture .............................. 6-11
Chapter 3: DCA Utilities ........................................... 12-16
Chapter 4: Greenplum Database System ............................... 17-58
Chapter 5: Installing Greenplum Database Single Node Edition (SNE) . 59-65
Chapter 6: Greenplum Database Configuration Parameters ............. 66-80
Chapter 7: Data Distribution Policy and Partitioning in Greenplum .. 81-91
Chapter 8: GP Toolkit Schema ....................................... 92
Chapter 9: PG Catalog Tables and Views ............................. 93-95
Chapter 10: Greenplum DBA FAQ ...................................... 96-110
Chapter 11: Greenplum Database SQL Commands ........................ 111-134
Chapter 12: Greenplum Oracle Compatibility Packages ................ 135-147
Chapter 13: Greenplum Database Management Utilities ................ 148
Chapter 14: Greenplum Database Client Utilities .................... 149
Chapter 15: Greenplum Database System Catalog Tables and Views ..... 150-152
Chapter 16: Greenplum Database Release Notes ....................... 153-160
Chapter 17: OMREPORT ............................................... 161-175
Chapter 18: Greenplum DCA Performance .............................. 176-182
Chapter 19: Greenplum Database Activity Monitoring ................. 183
Chapter 20: Greenplum Workload Management .......................... 184-191
Chapter 21: Greenplum Database Performance Tuning .................. 192-196
Chapter 22: Greenplum Command Center ............................... 197-206
Chapter 23: GPPERFMON Database ..................................... 207-240
Chapter 24: Greenplum Database Backup .............................. 241-266
Chapter 25: Greenplum Database Restore ............................. 267-280
Chapter 26: Data Integration Accelerator (DIA) ..................... 281-282
Chapter 27: Greenplum Connectivity Tools ........................... 283-289
Chapter 28: Greenplum Load Tools (gpload, gpfdist) ................. 290-300
Chapter 29: Greenplum Client Tools (psql, gpmapreduce) ............. 301-309
Chapter 30: Greenplum External Tables .............................. 310-325

CHAPTER 1 - INTRODUCTION

About the author

The author of this book has 23 years of experience in database architecture, administration and troubleshooting.

Greenplum Database Fundamental Concepts

Greenplum Database is built to support the next generation of data warehousing and large-scale analytics processing. In recent years it has become a preferred choice for deep analytics and business intelligence.

What is Big Data?

Big Data refers to the tools, processes, and procedures used to create, manipulate, and manage very large data sets.

What is Data Warehouse?

A data warehouse is a collection of information gathered about the enterprise. The information can consist of a variety of data, including inventory, HR, sales, and operational data, used to support decision making.

In a data warehouse, data must be:

  1. Centralized
  2. Agreed upon
  3. Easily accessible
  4. Timely and relevant

Basic Data flow of the Data Warehouse

The basic elements of a data warehouse include:

  1. Source System
  2. Data Staging Area
  3. Operational Data Store
  4. Metadata
  5. ETL
  6. Data Mart
  7. Business Intelligence

Data Warehouse Methodologies

Two commonly used methods to model a data warehouse are:

  1. Dimensional model
  2. Normalized model

The dimensional model divides transactional data into facts and dimensions; this is the most common data warehouse model. The normalized model stores data in normalized tables, though the level of normalization in this approach is not as high as in typical OLTP data models.

Dimensional Design

Information systems fall into two major categories:

  1. Those that support the execution of business processes
  2. Those that support the analysis of business processes

Operational System

An operational system supports the execution of a business process by capturing details about events and transactions, for example, information about orders, shipments, and returns.

The activities recorded by these systems are called transactions. The systems themselves are sometimes called online transaction processing (OLTP) systems.

Because of this focus on process execution, an operational system updates data as things change and purges or archives data once its operational usefulness has ended.

The widely accepted optimal schema design for an operational system is third normal form (3NF).

Analytic System

An analytic system supports the evaluation of a business process. For example: How are sales orders trending this month versus last month? Who are our best customers? How do actual sales compare with sales goals? Is a particular promotion working?

Interaction with an analytic system takes place exclusively through queries that retrieve data about business processes; information is neither created nor modified.

The differences between operational and analytical systems are highlighted below.

                            Operational Systems                Analytical Systems

Purpose                     Execution of a business process    Measurement of a business process
Primary Interaction Style   Insert, update, delete, query      Query
Scope of Interaction        Individual transaction             Aggregated transactions
Query Patterns              Predictable and stable             Unpredictable and changing
Temporal Focus              Current                            Historic and current
Design Optimization         Update concurrency                 High-performance query
Design Principle            ER design in 3NF                   Dimensional design (star schema, cube)
Also Known As               OLTP, source system,               Data warehouse, data mart
                            transaction system

Table 1: Operational System vs. Analytic System

Star Schema

A star schema consists of fact and dimension tables. The fact table holds the measurements of the business process; the dimension tables describe the facts presented in the fact table. Most queries against a star schema follow a consistent pattern: one or more facts are requested, along with the dimensional attributes that provide the desired context. The focus is on summarized data.

The star schema gets its name from its appearance; when drawn with the fact table in the center, it looks like a star or asterisk.

Fact Table

At the core of the star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. Each row in the fact table stores facts at a specific level of detail; this level of detail is known as the fact table's grain.

Dimension Tables

In a star schema, a dimension table contains columns representing dimensions. These columns provide context for facts. A well-developed set of dimension tables provides powerful and diverse analytic capabilities. Dimension tables with a large number of attributes maximize analytic value.
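To make the pattern concrete, here is a minimal star schema sketch in standard SQL, run through Python's sqlite3 module for portability. The table and column names are illustrative, and Greenplum-specific clauses such as DISTRIBUTED BY are omitted.

```python
import sqlite3

# In-memory database for illustration only; a real Greenplum deployment
# would add storage and distribution clauses omitted here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables provide context for the facts.
CREATE TABLE dim_date (
    date_key    INTEGER PRIMARY KEY,   -- surrogate key
    full_date   TEXT,
    month_name  TEXT
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    name        TEXT,
    category    TEXT
);
-- Fact table: the grain is one row per product per day.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    sales_amount REAL
);
""")

cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "2013-01-05", "January"), (2, "2013-02-10", "February")])
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(10, "Widget", "Hardware"), (11, "Gadget", "Hardware")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 10, 5, 50.0), (1, 11, 2, 40.0), (2, 10, 3, 30.0)])

# Typical star-schema query: facts summarized by dimensional attributes.
cur.execute("""
SELECT d.month_name, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date    d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.month_name, p.category
ORDER BY total_sales DESC
""")
for row in cur.fetchall():
    print(row)
# ('January', 'Hardware', 90.0)
# ('February', 'Hardware', 30.0)
```

Note that the query never touches the fact table's measures without going through the dimensions for context, which is the consistent pattern described above.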

Snowflake Schema

In the snowflake schema, the dimension tables are normalized into multiple related tables, so reaching a dimension's attributes may require additional joins. Multiple fact tables can connect to shared dimension tables, and a single dimension table may have multiple parent tables.
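One way to sketch the snowflake pattern is to split a product dimension's category attribute out into its own table; the names are illustrative, and the SQL is again run through Python's sqlite3 module for portability.

```python
import sqlite3

# Snowflake sketch: dim_product no longer stores the category name,
# it references a separate dim_category table instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY,
    category     TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);
""")
cur.executemany("INSERT INTO dim_category VALUES (?, ?)", [(1, "Hardware")])
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)", [(10, "Widget", 1)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(10, 50.0), (10, 25.0)])

# Queries now need an extra join to reach the category attribute.
cur.execute("""
SELECT c.category, SUM(f.sales_amount)
FROM fact_sales f
JOIN dim_product  p ON f.product_key = p.product_key
JOIN dim_category c ON p.category_key = c.category_key
GROUP BY c.category
""")
print(cur.fetchall())   # [('Hardware', 75.0)]
```

The extra join is the trade-off of snowflaking: less redundancy in the dimension data, at the cost of more complex queries than a plain star schema.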

ETL

ETL stands for extract, transform, and load: you extract the data, perform the transformation on the ETL server, and then load the result into the target database. Throughput is therefore tied to the capacity of the ETL hardware. Many ETL tools are available in the market.

ELT

ELT stands for extract, load, and then transform: raw data is loaded into the database first, and the transformation runs inside the database itself. This strategy is strongly suggested for large data volumes because the transformation can exploit the database's own processing power. While this is powerful, it can be difficult to integrate with current tools.
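A minimal ELT sketch, again using Python's sqlite3 module with illustrative table names: the raw rows land untransformed in a staging table, and the cleansing and typing then run as SQL inside the database.

```python
import sqlite3

# ELT sketch: load first, transform second. In Greenplum the transform
# step would run in parallel across all segments; names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: raw data lands untransformed in a staging table.
cur.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, region TEXT)")
raw_rows = [("1001", "250.00", " east "), ("1002", "100.50", "WEST"),
            ("1003", "bad", "east")]        # one malformed row
cur.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

# Transform: cleansing and typing happen in-database, after the load.
cur.executescript("""
CREATE TABLE orders AS
SELECT CAST(order_id AS INTEGER)  AS order_id,
       CAST(amount   AS REAL)     AS amount,
       LOWER(TRIM(region))        AS region
FROM stg_orders
WHERE amount GLOB '[0-9]*.[0-9][0-9]';   -- discard malformed amounts
""")

cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region")
print(cur.fetchall())   # [('east', 250.0), ('west', 100.5)]
```

The same division of labor applies at scale: the extract and load steps stay cheap and dumb, while the heavy transformation work is expressed as SQL where the database engine can execute it.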