
An Introduction to Data Collection, Storage, and Analysis Systems

SHIHAO 

NEZUMI

XIAODONG YANG


Introduction

Topic 1: OLTP and OLAP – analytic systems and the data warehouse concept

Topic 2: MPP vs Hadoop – infrastructure and component perspective

Topic 3: Data platforms and ETL in batch systems – application perspective

Topic 4: Batch processing vs stream processing

Topic 5: The future


What is the difference between OLAP and OLTP?

OLTP - Online Transaction Processing

    • To-C applications that provide functionality to customers
    • For a single business use case
    • Input: a single user request
    • Output: a response related to that request

OLAP - Online Analytical Processing

    • Reporting systems that provide reports to senior management
    • For department- or company-level analysis
    • Input: OLTP data
    • Output: business reports


Examples

  • Suppose we have some employee salary data
  • Transaction: raise employee Mike's salary to 15000
  • Analytic: what is the total salary cost of the `IT` department?

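The original slide showed these two operations as code screenshots. A minimal sketch of both query styles, with a hypothetical `employee` table and the SQL embedded in Python strings:

```python
import sqlite3

# Hypothetical employee table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("mike", "IT", 12000), ("anna", "IT", 13000)])

# Transaction (OLTP): touch one specific row.
conn.execute("UPDATE employee SET salary = 15000 WHERE name = 'mike'")

# Analytic (OLAP): aggregate one column over many rows.
total = conn.execute(
    "SELECT SUM(salary) FROM employee WHERE dept = 'IT'").fetchone()[0]
print(total)  # 28000
```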


Different query patterns

Transaction - RDBMS

    • CRUD queries for specific rows
    • OLTP DBs: MySQL, PostgreSQL, Oracle

Analytic – Data warehouse

    • Aggregation queries over specific columns
    • OLAP DBs: BigQuery, Redshift, Snowflake, Vertica, Greenplum


Why use different database?

Row-based storage - OLTP

  Advantages:
    • Fast queries by row
    • Fast updates and deletes
    • Better index support

Column-based storage - OLAP

  Advantages:
    • Fast queries by column
    • Fast aggregate functions and joins
    • Easy to compress data
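A toy sketch of the two layouts (made-up records), showing why a column store scans a single column cheaply:

```python
# The same three records laid out two ways (illustrative only).
rows = [                      # row-based: one record per unit of storage
    ("mike", "IT", 15000),
    ("anna", "IT", 13000),
    ("bob",  "HR", 9000),
]

columns = {                   # column-based: one column per unit of storage
    "name":   ["mike", "anna", "bob"],
    "dept":   ["IT", "IT", "HR"],
    "salary": [15000, 13000, 9000],
}

# A point lookup favors rows: one contiguous record.
print(next(r for r in rows if r[0] == "mike"))

# An aggregation favors columns: scan just the salary column, which also
# compresses well because similar values are stored together.
print(sum(columns["salary"]))
```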


Design - Denormalization and dimension modeling

What is denormalization?

    • Add redundant data to avoid expensive joins
    • This deliberately departs from 3NF, trading storage for query speed

What is dimensional modeling?

    • Also called star schema or snowflake schema
    • Separates business process data into facts and dimensions
    • The fact table holds the measurable, quantitative data about a business process
    • Dimension tables hold descriptive attributes related to the fact data


Examples

  • Suppose we have some employee salary data
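The original slide showed the tables as an image; a hypothetical star-schema version of the salary data (all values made up):

```python
# Fact table: measurable, quantitative events, keyed by dimension ids.
fact_salary = [
    {"emp_key": 1, "dept_key": 10, "month": "2023-06", "salary": 15000},
    {"emp_key": 2, "dept_key": 10, "month": "2023-06", "salary": 13000},
]

# Dimension tables: descriptive attributes about the facts.
dim_employee = {1: {"name": "Mike", "title": "Engineer"},
                2: {"name": "Anna", "title": "Analyst"}}
dim_department = {10: {"dept_name": "IT", "location": "NYC"}}

# "Total cost for IT" touches one fact column plus one small dimension.
total = sum(f["salary"] for f in fact_salary
            if dim_department[f["dept_key"]]["dept_name"] == "IT")
print(total)  # 28000
```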


How to track the history changes?

  • Slowly changing dimensions (SCD)

    • Type 1: overwrite the existing data, like OLTP does
      • Obviously loses some information
      • With SCD Type 1 on employee data, we can never know an employee's previous name

    • Type 2: add new rows plus history-tracking columns
      • Does not lose any information
      • Avoids update and delete operations


Examples

  • Suppose we have the dim_employee data

SCD – Type 1 - before:

SCD – Type 1 - after:

SCD – Type 2:
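The tables themselves were images in the original deck; a hypothetical reconstruction of the three states (values made up):

```python
# Type 1 - before: one row per employee, current values only.
dim_employee = [
    {"emp_id": 1, "name": "Mike", "salary": 12000},
]

# Type 1 - after raising Mike's salary: the old value is overwritten and lost.
dim_employee = [
    {"emp_id": 1, "name": "Mike", "salary": 15000},
]

# Type 2: instead of updating, close the old row and append a new one.
# Extra columns (valid_from, valid_to, is_current) track the history.
dim_employee_scd2 = [
    {"emp_id": 1, "name": "Mike", "salary": 12000,
     "valid_from": "2020-01-01", "valid_to": "2023-06-01", "is_current": False},
    {"emp_id": 1, "name": "Mike", "salary": 15000,
     "valid_from": "2023-06-01", "valid_to": "9999-12-31", "is_current": True},
]
```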


Traditional data warehouse architecture

New architecture with some fancy keywords and new users

Traditional architecture


End of topic 1

  • We covered some basic concepts about data warehouses
  • We also talked about some design philosophy

  • Any questions?


MPP and Hadoop – storage and computation engine

  • What is a Massively Parallel Processing (MPP) database?
    • Originated in 1983 with Teradata
    • Originally designed as a distributed database (ACID, UDFs, stored procedures)
    • Very expensive software
    • Enterprise-level service from giant companies
  • What is the Hadoop ecosystem?
    • Originated from Google's papers, 2003-2005
    • Originally designed as a distributed file system + computation engine
    • Open source platform, zero license cost
    • No commercial support, or support only from startups


Before 2006

  • Hadoop:
    • Not implemented yet

  • MPP:
    • First MPP database: Teradata DBC/1012 in 1984, supporting 10^12 bytes = 1 TB of data
    • Competing with traditional SMP (shared-everything) systems, e.g. IBM DB2, Oracle
    • First production TB-level data warehouse at Walmart, 1992
    • Teradata claimed the largest data warehouse, with 130 TB of data, in 1999


Traditional MPP architecture

  • Teradata architecture:
    • Each node is a physical database computer

    • BYNET is a hardware-level message-passing layer that also provides fault tolerance in the computation layer

    • RAID provides fault tolerance in the storage layer

    • AMPs and disks are virtual CPUs and virtual disks

    • Very similar to some OLTP sharding mechanisms


2006 - 2010

  • Hadoop and MapReduce:
    • Implemented by Yahoo's engineering team
    • Passed a 1000-node test at Yahoo in 2007
    • Sorted 1 PB in 17 hours in 2009
    • Works for both structured and unstructured data

  • MPP:
    • Teradata announced that five of their customers had more than 1 PB of data (2008)
    • "MapReduce: A major step backwards", David DeWitt and Michael Stonebraker
    • Other MPP systems (Vertica, Greenplum) could be deployed on commodity hardware
    • Only works for structured data, with little flexibility


What is the difference between MPP and Hadoop?


MPP

  • Pros:
    • Most MPP systems have full SQL support with ACID transactions
    • Queries are very fast (40 years of query-optimizer work)
    • Symmetric architecture with data locality; an execution node does not need to know about other nodes

  • Cons:
    • Obvious barrel effect: the slowest node limits overall performance
    • Data needs to be balanced very well, which means poor elasticity
    • Relatively low QPS; most MPP systems cannot execute more than ~100 queries at the same time
    • Usually paired with an external transformation engine like Informatica



Hadoop

  • Pros:
    • Not just SQL; you can use general-purpose programming to do things SQL can't
    • Data does not need to be balanced very well, which gives better elasticity
    • Open source, no license cost
    • Good for heavy data transformation rather than interactive analysis
  • Cons:
    • Very, very slow: the MapReduce pipeline writes intermediate results to disk
    • Data is not balanced initially, so most MapReduce jobs need to shuffle data
    • Low-level computation framework; optimization must be handled by the user
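A minimal sketch of the MapReduce programming model in plain Python (no cluster; it just mimics the map → shuffle → reduce stages described above):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs; here: word counts.
    for word in line.split():
        yield word, 1

def reduce_phase(key, values):
    yield key, sum(values)

lines = ["big data", "big compute"]

# Shuffle: group all values by key. On a real cluster this step writes
# intermediate results to disk and moves them over the network, which is
# the main reason MapReduce jobs are slow.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

for key in groups:
    print(list(reduce_phase(key, groups[key])))  # [('big', 2)], ...
```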


MapReduce: A major step backwards

    • 1. MapReduce is a step backwards in database access
      • Imperative programming vs declarative programming (see the sketch below)
    • 2. MapReduce is a poor implementation
      • No indexes to accelerate query performance
    • 3. MapReduce is not novel
      • Oracle RAC and IBM DB2 have similar computation models
    • 4. MapReduce is missing features
      • No indexes, bulk loader, transactions, views, etc.
    • 5. MapReduce is incompatible with DBMS tools
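To make point 1 concrete, here is the same aggregation written both ways (hypothetical data; the SQL is shown as a string only for comparison):

```python
employees = [("mike", "IT", 15000), ("anna", "IT", 13000), ("bob", "HR", 9000)]

# Imperative (MapReduce style): you spell out *how* to compute the answer.
totals = {}
for name, dept, salary in employees:
    totals[dept] = totals.get(dept, 0) + salary
print(totals)  # {'IT': 28000, 'HR': 9000}

# Declarative (SQL): you state *what* you want; the optimizer picks the plan.
query = "SELECT dept, SUM(salary) FROM employee GROUP BY dept"
```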


2010 - 2020

  • Hadoop:
    • SQL on Hadoop
      • Hive compiles Hive SQL into MR jobs (transformation engine)
      • Impala uses the MPP model; does not guarantee a query will finish on node failure (query engine)
    • Separating storage and computation
      • Spark = MPP – query optimizer – transactions + speculative execution
        • Solved the data-loading issue of MPP
      • Better support for ML applications
  • MPP:
    • ELT instead of ETL: drop Informatica; also gained similar ways to avoid the barrel effect
    • Separating storage and computation
      • Vertica Eon: uses a cloud storage engine as the persistence layer
      • Snowflake: pure cloud-based MPP database
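A minimal PySpark sketch of "SQL on Hadoop" with storage separated from computation (the parquet path is hypothetical):

```python
from pyspark.sql import SparkSession

# Spark reads directly from a (possibly remote) storage layer and runs
# declarative SQL on top - no upfront data-loading step as in classic MPP.
spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

df = spark.read.parquet("s3a://warehouse/employee/")  # hypothetical path
df.createOrReplaceTempView("employee")

spark.sql("""
    SELECT dept, SUM(salary) AS total
    FROM employee
    GROUP BY dept
""").show()
```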


Today's Hadoop eco-system

    • Hadoop and MPP have become similar

    • 1. MapReduce is a step backwards in database access
      • Declarative programming won – Hive, Impala
    • 2. MapReduce is a poor implementation
      • Hive and HBase both have indexes to optimize queries
    • 3. MapReduce is not novel
      • IBM DB2 only reached 512 nodes, but many Hadoop clusters exceed 10000 nodes
    • 4. MapReduce is missing features
    • 5. MapReduce is incompatible with DBMS tools
      • The open source community implemented all of that


End of topic 2

  • We covered two popular big data solutions, MPP and Hadoop
  • We covered some history of the MPP database and the Hadoop ecosystem
  • We also talked about the differences between the programming paradigms

  • Any questions?


Extract, transform, load and the data platform

ETL – Extract Transform Load

    • Extract: get data out of OLTP; can be JDBC/REST/file
      • Example: select * from <table>
    • Transform: use SQL or another language to manipulate the data into the correct format
      • Example: unifying time zones
    • Load: put data into OLAP
      • Example: the COPY command in SQL

Data platform – solution framework

    • Extract: what kind of application holds those JDBC/REST calls?
    • Transform: what application executes that heavy computation?
    • Load: what kind of application handles the load logic?


World's smallest ETL framework and job
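The original slide showed this as a code screenshot; a minimal sketch of what such a framework and job could look like (all names hypothetical):

```python
def run_etl(extract, transform, load):
    """World's smallest ETL framework: three pluggable callables."""
    load(transform(extract()))

# A job built on the framework:
def extract():
    # In practice: a JDBC/REST/file read, e.g. SELECT * FROM employee.
    return [{"name": "mike", "salary": "15000"}]

def transform(rows):
    # Light cleanup: normalize names and types.
    return [{"name": r["name"].title(), "salary": int(r["salary"])} for r in rows]

def load(rows):
    # In practice: a COPY / bulk insert into the warehouse.
    print(f"loaded {len(rows)} rows: {rows}")

run_etl(extract, transform, load)
```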


Roles: project-dominated or product-dominated


Pain point for data platform

  • Orchestration

    • Most ETL pipelines are not services, so we need some way to schedule them
    • Common scheduling tools: Airflow, Jenkins, Azkaban
    • DAG: see the Airflow sketch below


Airflow
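The original slide showed an Airflow screenshot; a minimal DAG sketch for the ETL job above (task names and schedule are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# A nightly extract -> transform -> load pipeline expressed as a DAG.
with DAG(
    dag_id="employee_salary_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract",
                             python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform",
                               python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load",
                          python_callable=lambda: print("load"))

    extract >> transform >> load  # the DAG edges
```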


Pain point for data platform

  • Data discovery and metadata management

    • In a large organization, nobody knows all the data they have
    • We need some mechanism to help users find the data they need
    • Basically, a search engine for data

    • Nice to have:
      • A description for every table and column, like a data dictionary
      • Data lineage, providing an end-to-end view of the data
      • Data profiling, to know more about data quality


Amundsen


Pain point for data platform

  • Data cleansing, quality, and real-time profiling
    • Not all data from the source is decent
    • Even if the source is good, some of the data may not follow the data standard
    • We need some mechanism to execute light transformations and check data quality
    • Basically, unit tests for the data set

    • The easiest way is to validate with regular expressions, as sketched below
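A minimal sketch of rule-based validation (the "unit test of the data set"); columns and patterns are hypothetical:

```python
import re

# Reject rows whose fields do not match the expected patterns.
RULES = {
    "emp_id": re.compile(r"^\d+$"),                   # numeric id
    "hire_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # ISO date
}

def validate(row):
    """Return the list of columns that fail their rule."""
    return [col for col, pattern in RULES.items()
            if not pattern.fullmatch(str(row.get(col, "")))]

bad = validate({"emp_id": "42", "hire_date": "2023/06/01"})
print(bad)  # ['hire_date'] - fails the ISO date rule
```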


Amazon deequ


End of topic 3

  • We covered the basic definitions of ETL and data platforms
  • We covered some pain points and related tools for data platforms

  • Any questions?


Batch and streaming

  • What is stream processing?
    • A different philosophy from batch processing
      • Every application has its own data stream; batch processing just picks a snapshot of it to process
      • In other words, streaming can be considered real-time tiny batch processing

  • Why do we need stream processing?
    • Real-time / near-real-time analytics and ML
    • We need to do some real-time visualization on a big screen
    • Processing can happen before the data comes into the DW layer


Example

Streaming data

Snapshot data / batch


Extract in streaming process

  • How do we track changes in another system?
    • Push vs pull
      • Push: when the source system has an update, it sends a message to the extract application; needs source-system support
      • Pull: the extract application polls the source system in a loop
    • CDC – change data capture (see the polling sketch below)
      • REST API: most REST APIs have no event system, so the pull model works better here
      • Database – since Oracle 9i (2003):
        • Timestamp column: add a timestamp column and have the extract application run a query like `select * from table where ts > current_timestamp() - interval '1 minute'`
        • Snapshot: keep a snapshot table, query the diff at an interval with something like `(select * from table) minus (select * from snapshot)`, and refresh the snapshot each time
        • SQL trigger: use a SQL trigger with an operations table, then still use the pull model to query the operations table
        • Binlog: if the database documents its binlog format and protocol well, the extract application can act like a database replica and subscribe to the binlog
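A minimal pull-based CDC sketch using the timestamp-column approach. It assumes a DB-API connection (e.g. psycopg2) and a hypothetical `employee` table with a `last_updated` column:

```python
import time

POLL_SQL = "SELECT * FROM employee WHERE last_updated > %(since)s"

def poll_changes(conn, interval_sec=60):
    since = "1970-01-01 00:00:00"
    while True:
        # Take the new watermark *before* querying so rows updated while
        # the query runs are still picked up on the next iteration.
        new_watermark = time.strftime("%Y-%m-%d %H:%M:%S")
        with conn.cursor() as cur:
            cur.execute(POLL_SQL, {"since": since})
            for row in cur.fetchall():
                print("changed row:", row)  # hand off to the load step
        since = new_watermark
        time.sleep(interval_sec)
```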


Example

SQL trigger

Binlog


Transform in streaming process

  • Real stream processing (one event at a time):
    • Flink:
      • At-least-once delivery
      • Heavyweight transformation framework; can handle very high traffic (~350k events/s on a regular computer)
      • Also has a batch mode (a snapshot between two points in time)
    • Kafka:
      • Exactly-once delivery
      • Kafka Streams is pretty new; not sure whether it can handle high traffic
      • Before Flink: the source of the data stream
      • After Flink: peak shaving and valley filling (smoothing traffic bursts)
  • Micro-batch processing:
    • Spark Streaming:
      • Exactly-once delivery
      • Need to set a processing interval
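A toy sketch contrasting the two computation models (plain Python, no framework; names are illustrative):

```python
import time

def one_at_a_time(stream, process):
    # Flink / Kafka Streams style: handle each event as it arrives.
    for event in stream:
        process([event])

def micro_batch(stream, process, interval_sec=1.0):
    # Spark Streaming style: buffer events, flush every interval.
    batch, deadline = [], time.monotonic() + interval_sec
    for event in stream:
        batch.append(event)
        if time.monotonic() >= deadline:
            process(batch)
            batch, deadline = [], time.monotonic() + interval_sec
    if batch:
        process(batch)  # flush the final partial batch
```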


Real time analytic applications

    • Nowadays most systems need real-time monitoring, especially IoT

    • Message-queue-based processing only does computation; these systems are not databases

    • Traditional data warehouses are designed for bulk imports rather than many small queries

    • TSDBs:
      • Prometheus, InfluxDB
      • Special timestamp-based indexes give good performance when querying time-series data
      • Both have great visualization layers for creating dashboards
      • Usually have a data retention cycle, because the value of hot data >> cold data


Lambda

    • Lambda: streaming and batch together
    • Why compute everything twice?
      • Streaming may produce different results
      • The streaming layer gives real-time analytics that are not 100% accurate
      • The batch layer gives accurate analytics that are not real time


Kappa

    • Kappa: stream processing only
    • Pros:
      • No duplicate implementation
      • Low cost, all open source solutions

    • Cons:
      • Requires all data pipelines to be event based
      • Little technical and political support
      • More suited to TP-based applications than AP


End of topic 4

  • We covered the differences between batch and stream processing
  • We covered two different streaming computation models
  • We covered two different streaming platform architectures

  • Any questions?


Future

    • All-in-one solutions, covering TP, AP, and ETL
      • Google Spanner/F1
      • TiDB
        • OLTP: distributed K-V database (TiKV) + SQL layer (TiDB server)
        • ETL: TiSpark integrates the Spark engine with the persistence layer
        • OLAP: Raft-based replication to TiFlash, on nodes separate from TP