GREENPLUM DATABASE DBA
Sachchida N. Ojha
DCA V1 and V2
Greenplum Database 4.2.x
Data Domain and Greenplum Chorus
Chapter 1: Introduction
Chapter 2: Greenplum DCA Architecture
Chapter 3: DCA Utilities
Chapter 4: Greenplum Database System
Chapter 5: Installing Greenplum Database Single Node Edition (SNE)
Chapter 6: Greenplum Database Configuration Parameters
Chapter 7: Data Distribution Policy and Partitioning in Greenplum
Chapter 8: GP Toolkit Schema
Chapter 9: PG Catalog tables and views
Chapter 10: Greenplum DBA FAQ
Chapter 11: Greenplum Database SQL Commands
Chapter 12: Greenplum Oracle Compatibility Packages
Chapter 13: Greenplum Database Management Utilities
Chapter 14: Greenplum Database Client Utilities
Chapter 15: Greenplum Database System Catalog Tables and Views
Chapter 16: Greenplum Database Release Notes
Chapter 17: OMREPORT
Chapter 18: Greenplum DCA Performance
Chapter 19: Greenplum Database Activity Monitoring
Chapter 20: Greenplum Workload Management
Chapter 21: Greenplum Database Performance Tuning
Chapter 22: Greenplum Command Center
Chapter 23: GPPERFMON Database
Chapter 24: Greenplum Database Backup
Chapter 25: Greenplum Database Restore
Chapter 26: Data Integration Accelerator (DIA)
Chapter 27: Greenplum Connectivity Tools
Chapter 28: Greenplum Load Tools (gpload, gpfdist)
Chapter 29: Greenplum Client Tools (psql, gpmapreduce)
Chapter 30: Greenplum External Tables
CHAPTER 1 - INTRODUCTION
About the author
The author of this book has 23 years of experience in database architecture, administration and troubleshooting.
Greenplum Database Fundamental Concepts
Greenplum Database is a solution built to support the next generation of data warehousing and large-scale analytics processing.
In recent years Greenplum Database has become a preferred choice for deep analytics and business intelligence.
What is Big Data?
Big Data refers to the tools, processes, and procedures used to create, manipulate, and manage very large data sets.
What is a Data Warehouse?
A data warehouse is a culmination of information gathered about the enterprise. The information can consist of a variety of data, including inventory, HR, sales, and operational data, used to support decision making.
In a data warehouse, data must be subject-oriented, integrated, time-variant, and non-volatile.
Basic Data Flow of the Data Warehouse
The basic elements of a data warehouse include the source systems, an extract-transform-load layer, the data warehouse itself, and the business intelligence tools that query it.
Data Warehouse Methodologies
Two commonly used methods to model a data warehouse are the dimensional model and the normalized model.
The dimensional model divides transactional data into facts and dimensions and is the most common data warehouse model. The normalized model stores data in normalized tables, although the level of normalization for this approach is not as high as in typical OLTP data models.
Information systems fall into two major categories: operational systems and analytical systems.
An operational system supports the execution of a business process by capturing details about the events and transactions involved, for example, information about orders, shipments, and returns.
The activities recorded by these systems are called transactions, and the systems themselves are often called online transaction processing (OLTP) systems.
Because of this focus on process execution, an operational system updates data as things change and purges or archives data once its operational usefulness has ended.
The widely accepted optimal schema design for an operational system is third normal form (3NF).
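As an illustration only, here is a minimal sketch of what a third-normal-form design for order capture might look like. The table and column names are hypothetical, and the DISTRIBUTED BY clause is Greenplum-specific (distribution policies are covered in Chapter 7).

    -- Each entity lives in its own table; repeating groups and
    -- derived values are factored out.
    CREATE TABLE customer (
        customer_id   integer PRIMARY KEY,
        customer_name varchar(100),
        customer_city varchar(50)
    );

    CREATE TABLE orders (
        order_id    integer PRIMARY KEY,
        customer_id integer,          -- relates each order to one customer
        order_date  date
    );

    CREATE TABLE order_line (
        order_id   integer,           -- relates each line to its order
        line_no    integer,
        product_id integer,
        quantity   integer,
        unit_price numeric(10,2)
    ) DISTRIBUTED BY (order_id);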
An analytic system supports the evaluation of a business process. For example: How are sales orders trending this month versus last month? Who are our best customers? How do actual sales compare to the sales goal? Is a particular promotion working?
The interaction with an analytic system takes place exclusively through queries that retrieve data about business processes. Here information is neither created nor modified.
Differences between operational systems and analytical systems are highlighted below.

                              Operational Systems                Analytical Systems
    Purpose                   Execution of a business process    Measurement of a business process
    Primary Interaction Style Insert, Update, Delete, Query      Query
    Scope of Interaction      Predictable and stable             Unpredictable and changing
    Temporal Focus            Current                            Historic and Current
    Design                    ER design in 3NF                   Dimensional Design (Star Schema, Cube)
    Also Known As             OLTP system                        Data warehouse system

Table 1: Operational System vs. Analytic System
A star schema consists of fact and dimension tables. The fact table holds the main data of interest, the facts, and the dimension tables describe the facts presented in the fact table. Most queries against a star schema follow a consistent pattern: one or more facts are requested, along with the dimensional attributes that provide the desired context. The focus is on summarized data.
The star schema gets its name from its appearance; when drawn with the fact table in the center, it looks like a star or asterisk.
At the core of the star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. Each row in the fact table stores facts at a specific level of detail, known as the fact table's grain.
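As a minimal sketch, assuming hypothetical tables named sales_fact, date_dim, product_dim, and store_dim, such a structure might look like this in SQL (the DISTRIBUTED BY clauses are Greenplum-specific and are discussed in Chapter 7):

    -- Dimension tables: a surrogate key plus descriptive attributes.
    CREATE TABLE date_dim (
        date_key      integer,        -- surrogate key
        calendar_date date,
        month_name    varchar(20),
        year          integer
    ) DISTRIBUTED BY (date_key);

    CREATE TABLE product_dim (
        product_key  integer,         -- surrogate key
        product_name varchar(100),
        brand_name   varchar(50),
        category     varchar(50)
    ) DISTRIBUTED BY (product_key);

    CREATE TABLE store_dim (
        store_key  integer,           -- surrogate key
        store_name varchar(100),
        region     varchar(50)
    ) DISTRIBUTED BY (store_key);

    -- Fact table: surrogate keys pointing to each dimension plus the facts.
    -- Grain: one row per product, per store, per day.
    CREATE TABLE sales_fact (
        date_key      integer,
        product_key   integer,
        store_key     integer,
        quantity_sold integer,
        sales_amount  numeric(12,2)
    ) DISTRIBUTED BY (date_key, product_key, store_key);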
In a star schema, a dimension table contains columns representing dimensions. These columns provide context for facts. A well-developed set of dimension tables provides powerful and diverse analytic capabilities. Dimension tables with a large number of attributes maximize analytic value.
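Continuing the hypothetical schema sketched above, a typical star-schema query requests summarized facts together with the dimension attributes that provide context, for example:

    SELECT d.year,
           d.month_name,
           p.brand_name,
           s.region,
           SUM(f.sales_amount)  AS total_sales,
           SUM(f.quantity_sold) AS total_units
    FROM   sales_fact f
           JOIN date_dim    d ON f.date_key    = d.date_key
           JOIN product_dim p ON f.product_key = p.product_key
           JOIN store_dim   s ON f.store_key   = s.store_key
    WHERE  d.year = 2012
    GROUP  BY d.year, d.month_name, p.brand_name, s.region
    ORDER  BY total_sales DESC;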
In a snowflake schema, fact tables connect to dimension tables just as in a star schema, but the dimensions themselves are normalized into multiple related tables, so a single dimension table may have one or more parent tables.
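A rough sketch of the same hypothetical product dimension in snowflake form, with the brand attributes normalized into their own table that product_dim references:

    -- The brand attributes move out of product_dim into a parent table;
    -- product_dim keeps only a key pointing to it, and queries resolve
    -- the extra join at run time.
    CREATE TABLE brand_dim (
        brand_key  integer,
        brand_name varchar(50),
        category   varchar(50)
    ) DISTRIBUTED BY (brand_key);

    CREATE TABLE product_dim (
        product_key  integer,
        product_name varchar(100),
        brand_key    integer          -- reference to brand_dim
    ) DISTRIBUTED BY (product_key);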
ETL stands for extract, transform, and load: data is extracted from the source systems, transformed on a dedicated ETL server, and then loaded into the warehouse. Because the transformation is performed on the ETL server, throughput is tied to that machine's hardware. Many ETL tools are available on the market.
ELT stands for extract, load, and then transform: raw data is loaded into the database first and transformed there. This strategy is strongly recommended for large data volumes. While powerful, it can be difficult to integrate with existing ETL tools.
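As a rough sketch only, assuming a gpfdist server is already running on a host named etl1 and serving a pipe-delimited file sales.txt, an ELT flow in Greenplum can load raw rows through a readable external table and then transform them with SQL inside the database (external tables and gpfdist are covered in later chapters):

    -- Load: expose the raw file through a readable external table.
    CREATE EXTERNAL TABLE sales_stage (
        sale_date    text,
        product_code text,
        quantity     text,
        amount       text
    )
    LOCATION ('gpfdist://etl1:8081/sales.txt')
    FORMAT 'TEXT' (DELIMITER '|');

    -- Target table for the cleansed data.
    CREATE TABLE daily_sales (
        sale_date    date,
        product_code varchar(20),
        quantity     integer,
        amount       numeric(12,2)
    ) DISTRIBUTED BY (product_code);

    -- Transform: cast, cleanse, and filter inside the database,
    -- in parallel across all segments.
    INSERT INTO daily_sales (sale_date, product_code, quantity, amount)
    SELECT sale_date::date,
           product_code,
           quantity::integer,
           amount::numeric(12,2)
    FROM   sales_stage
    WHERE  amount IS NOT NULL;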