Introduction to Data Collection, Storage, and Analysis Systems
SHIHAO
NEZUMI
XIAODONG YANG
Introduction
Topic 1: OLTP and OLAP – Analytic systems and data warehouse concepts
Topic 2: MPP vs Hadoop – Infrastructure and component perspective
Topic 3: Data platforms and ETL in batch systems – Application perspective
Topic 4: Batch processing vs stream processing
Topic 5: Future
What is the difference between OLAP and OLTP?
OLTP - Online Transaction Processing
OLAP - Online Analytical Processing
Examples
Transaction
Analytic
Python
SQL
Different query patterns
Transaction - RDBMS
Analytic – Data warehouse
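A minimal sketch of the two query patterns, using Python's built-in sqlite3 module as a stand-in for both kinds of database. The orders table and its columns are hypothetical and are not taken from the slides.

```python
import sqlite3

# Hypothetical orders table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "east", 30.0),
        (2, "bob", "west", 20.0),
        (3, "alice", "east", 50.0),
    ],
)

# Transaction pattern (OLTP): read or write a single row, located by its key.
conn.execute("UPDATE orders SET amount = 25.0 WHERE order_id = 2")
one_order = conn.execute(
    "SELECT * FROM orders WHERE order_id = 2"
).fetchone()

# Analytic pattern (OLAP): scan many rows but only a few columns, then aggregate.
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(one_order)
print(revenue_by_region)
```

The first query touches one row and all of its columns. The second touches every row but only two columns, which is the access pattern a data warehouse is tuned for.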
Why use different databases?
Row-based storage - OLTP
Advantages:
Fast query by row
Fast update and delete
Better index support
Column-based storage - OLAP
Advantages:
Fast query by column
Fast aggregate functions and joins
Easy to compress data
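The trade-off between the two layouts can be sketched with plain Python structures. This is an illustration of the idea only, not how any real storage engine is implemented. The sample records and the run_length_encode helper are hypothetical.

```python
from itertools import groupby

# Row layout: each record is stored together (illustrative data).
rows = [
    {"id": 1, "city": "Tokyo", "amount": 30},
    {"id": 2, "city": "Osaka", "amount": 20},
    {"id": 3, "city": "Tokyo", "amount": 50},
]

# Column layout: each column is stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row layout makes it cheap to fetch or change one whole record.
record = rows[1]

# Column layout makes it cheap to aggregate one column without
# reading the others.
total = sum(columns["amount"])


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Values within one column share a type and often repeat, so simple
# schemes such as run-length encoding compress them well.
encoded = run_length_encode(sorted(columns["city"]))

print(record, total, encoded)
```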
Design - Denormalization and dimensional modeling
What is denormalization?
What is dimensional modeling?
Examples
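One possible illustration, assuming a hypothetical sales fact table and product dimension (neither is from the slides): the fact table holds measures and foreign keys, the dimension holds descriptive attributes, and the denormalized table copies those attributes next to the measures so that analytic queries need no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes (hypothetical schema).
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT
    );
    -- Fact table: measures plus keys pointing at dimensions.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY, product_id INTEGER, qty INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'pen', 'stationery'), (2, 'apple', 'food');
    INSERT INTO fact_sales VALUES (10, 1, 3), (11, 2, 5), (12, 1, 2);

    -- Denormalized wide table: attributes are duplicated per sale.
    CREATE TABLE sales_wide AS
    SELECT f.sale_id, p.name, p.category, f.qty
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id;
    """
)

# The analytic query now reads a single table.
qty_by_category = conn.execute(
    "SELECT category, SUM(qty) FROM sales_wide "
    "GROUP BY category ORDER BY category"
).fetchall()
print(qty_by_category)
```

The cost of the wide table is redundancy: the category text is stored once per sale instead of once per product.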
How to track historical changes?
Examples
SCD – Type 2:
SCD – Type 1 - before:
SCD – Type 1 - after:
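The two slowly changing dimension (SCD) strategies can be sketched as follows. The customer dimension, its column names (valid_from, valid_to, is_current) and the helper functions are assumptions made for illustration. Real warehouses differ in how they name and store these fields.

```python
from datetime import date


def apply_scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite the attribute in place, so history is lost."""
    return [
        dict(row, city=new_city) if row["customer_id"] == customer_id else row
        for row in rows
    ]


def apply_scd_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new version."""
    result = []
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            # Keep the old version, marked as no longer current.
            result.append(dict(row, valid_to=change_date, is_current=False))
            # Add the new version, valid from the change date onward.
            result.append(
                dict(row, city=new_city, valid_from=change_date,
                     valid_to=None, is_current=True)
            )
        else:
            result.append(row)
    return result


# Hypothetical dimension with one customer who moves city.
dim_customer = [
    {"customer_id": 1, "city": "Beijing",
     "valid_from": date(2020, 1, 1), "valid_to": None, "is_current": True},
]

type1 = apply_scd_type1(dim_customer, 1, "Shanghai")
type2 = apply_scd_type2(dim_customer, 1, "Shanghai", date(2021, 6, 1))

print(len(type1), len(type2))
```

Type 1 still has one row after the change, while Type 2 has two, so a query against Type 2 can answer what the city was on any past date.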
Traditional data warehouse architecture
New architecture with some fancy keywords and new users
Traditional architecture
End of topic 1
MPP and Hadoop – storage and computation engine
Before 2006
Traditional MPP architecture
2006 - 2010
What is the difference between MPP and Hadoop?
MPP
Hadoop
MapReduce: A major step backwards
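For readers who have not seen the programming model under discussion, here is a single-process sketch of its three phases using word count. It only imitates the shape of the model. It does not use Hadoop, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (key, value) pair for every word."""
    for word in document.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


documents = ["Hadoop stores data", "Hadoop processes data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(
    reduce_phase(key, values) for key, values in shuffle(mapped).items()
)
print(word_counts)
```

In SQL the same job is a single GROUP BY with COUNT, which helps explain why database researchers saw the hand-written model as a step backwards.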
2010 - 2020
Today's Hadoop ecosystem
End of topic 2
Extract, transform, load and data platform
ETL – Extract Transform Load
Data platform – Solution framework
World's smallest ETL framework and job
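A sketch of what such a minimal framework and job might look like, assuming a CSV text source and a SQLite target. The function names, the payments table and the sample data are hypothetical and are not the code shown on the slide.

```python
import csv
import io
import sqlite3


def extract(csv_text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: drop rows with no amount, clean names, cast types."""
    return [
        (row["name"].strip().title(), float(row["amount"]))
        for row in rows
        if row["amount"]
    ]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()


def run_job(csv_text, conn):
    """The 'framework': run the three steps in order."""
    load(conn, transform(extract(csv_text)))


source = "name,amount\n alice ,10.5\nbob,\ncarol,4\n"
conn = sqlite3.connect(":memory:")
run_job(source, conn)

loaded = conn.execute(
    "SELECT name, amount FROM payments ORDER BY name"
).fetchall()
print(loaded)
```

Everything a data platform adds on top (scheduling, retries, metadata, quality checks) is absent here.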
Roles: project-dominated or product-dominated
Pain point for data platform
Airflow
Pain point for data platform
Amundsen
Pain point for data platform
Amazon Deequ
End of topic 3
Batch and streaming
Example
Streaming data
Snapshot data / batch
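The relationship between the two forms of data can be sketched as follows: a stream is an ordered sequence of change events, and a snapshot is the state obtained by applying those events up to some point in time. The event format and the apply_event helper are hypothetical.

```python
# Streaming data: an ordered sequence of change events (illustrative format).
events = [
    {"op": "insert", "id": 1, "balance": 100},
    {"op": "insert", "id": 2, "balance": 50},
    {"op": "update", "id": 1, "balance": 70},
    {"op": "delete", "id": 2},
]


def apply_event(snapshot, event):
    """Apply one change event to the current state."""
    if event["op"] == "delete":
        snapshot.pop(event["id"], None)
    else:
        # Inserts and updates both set the latest value for the key.
        snapshot[event["id"]] = event["balance"]
    return snapshot


# Snapshot data: the state after all events so far have been applied.
snapshot = {}
for event in events:
    apply_event(snapshot, event)

print(snapshot)
```

A batch job typically reads only the final snapshot, while a streaming job sees every intermediate change.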
Extract in streaming process
Example
SQL trigger
Binlog
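A minimal sketch of the trigger approach, using SQLite through Python's sqlite3 module. The accounts table, the accounts_changelog table and the trigger names are hypothetical. Binlog-based extraction is not emulated here: it reads the database's own replication log instead of adding triggers, which avoids extra write load on the source tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER);

    -- Change log that a downstream streaming job could read from.
    CREATE TABLE accounts_changelog (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, balance INTEGER
    );

    -- Triggers copy every change into the log as it happens.
    CREATE TRIGGER accounts_ins AFTER INSERT ON accounts
    BEGIN
        INSERT INTO accounts_changelog (op, id, balance)
        VALUES ('insert', NEW.id, NEW.balance);
    END;

    CREATE TRIGGER accounts_upd AFTER UPDATE ON accounts
    BEGIN
        INSERT INTO accounts_changelog (op, id, balance)
        VALUES ('update', NEW.id, NEW.balance);
    END;
    """
)

conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.execute("UPDATE accounts SET balance = 70 WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, balance FROM accounts_changelog ORDER BY seq"
).fetchall()
print(changes)
```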
Transform in streaming process
Real-time analytics applications
Lambda
Kappa
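The difference between the two architectures can be sketched with a toy page-view counter. In this simplified model, Lambda keeps a batch view and a speed view and merges them at query time, while Kappa keeps a single streaming code path and recomputes by replaying the log. The event log, the batch_cutoff value and the count_views helper are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical append-only event log of (page, views) records.
event_log = [("page_a", 1), ("page_b", 1), ("page_a", 1), ("page_a", 1)]

# Assumption: the last batch run covered the first two events only.
batch_cutoff = 2


def count_views(events):
    """Aggregate views per page over any sequence of events."""
    counts = Counter()
    for page, views in events:
        counts[page] += views
    return counts


# Lambda: two layers, merged when a query arrives.
batch_view = count_views(event_log[:batch_cutoff])   # batch layer
speed_view = count_views(event_log[batch_cutoff:])   # speed layer
lambda_result = batch_view + speed_view

# Kappa: one streaming path; reprocessing means replaying the whole log.
kappa_result = count_views(event_log)

print(lambda_result == kappa_result)
```

Both produce the same answer here. The practical difference is that Lambda maintains two pipelines that must be kept consistent, whereas Kappa maintains one and relies on a replayable log.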
End of topic 4
Future