
An Introduction to Data Collection, Storage, and Analysis Systems

SHIHAO 

NEZUMI

XIAODONG YANG


Introduction

Topic 1: OLTP and OLAP – analytic systems and the data warehouse concept

Topic 2: MPP vs Hadoop – infrastructure and component perspective

Topic 3: Data platforms and ETL in batch systems – application perspective

Topic 4: Batch processing vs stream processing

Topic 5: The future


What is the difference between OLAP and OLTP?

OLTP - Online Transaction Processing

    • To-C applications that provide functionality to customers
    • For a single business use case
    • Input: a single user request
    • Output: a response related to that request

OLAP - Online Analytical Processing

    • Reporting systems that provide reports to senior management
    • For department- or company-level analysis
    • Input: OLTP data
    • Output: business reports


Examples

  • Suppose we have some employee salary data
  • Transaction: raise employee Mike's salary to 15000
  • Analytic: what is the total salary cost of the `IT` department?

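The original slide showed these two operations as code screenshots. A minimal sketch of both query styles, with a hypothetical `employee` table and the SQL embedded in Python strings:

```python
import sqlite3

# Hypothetical employee table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [("mike", "IT", 12000), ("anna", "IT", 13000)])

# Transaction (OLTP): touch one specific row.
conn.execute("UPDATE employee SET salary = 15000 WHERE name = 'mike'")

# Analytic (OLAP): aggregate one column over many rows.
total = conn.execute(
    "SELECT SUM(salary) FROM employee WHERE dept = 'IT'").fetchone()[0]
print(total)  # 28000
```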


Different query patterns

Transaction - RDBMS

    • CRUD queries for specific rows
    • OLTP DBs: MySQL, PostgreSQL, Oracle

Analytic – Data warehouse

    • Aggregation queries over specific columns
    • OLAP DBs: BigQuery, Redshift, Snowflake, Vertica, Greenplum


Why use different database?

Row-based storage - OLTP

  Advantages:
    • Fast queries by row
    • Fast updates and deletes
    • Better index support

Column-based storage - OLAP

  Advantages:
    • Fast queries by column
    • Fast aggregate functions and joins
    • Easy to compress data
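A toy sketch of the two layouts (made-up records), showing why a column store scans a single column cheaply:

```python
# The same three records laid out two ways (illustrative only).
rows = [                      # row-based: one record per unit of storage
    ("mike", "IT", 15000),
    ("anna", "IT", 13000),
    ("bob",  "HR", 9000),
]

columns = {                   # column-based: one column per unit of storage
    "name":   ["mike", "anna", "bob"],
    "dept":   ["IT", "IT", "HR"],
    "salary": [15000, 13000, 9000],
}

# A point lookup favors rows: one contiguous record.
print(next(r for r in rows if r[0] == "mike"))

# An aggregation favors columns: scan just the salary column, which also
# compresses well because similar values are stored together.
print(sum(columns["salary"]))
```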


Design - Denormalization and dimension modeling

What is denormalization?

    • Add redundant data to avoid expensive joins
    • This deliberately departs from 3NF, trading storage for query speed

What is dimensional modeling?

    • Also called star schema or snowflake schema
    • Separates business process data into facts and dimensions
    • The fact table holds the measurable, quantitative data about a business process
    • Dimension tables hold descriptive attributes related to the fact data


Examples

  • Suppose we have some employee salary data
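The original slide showed the tables as an image; a hypothetical star-schema version of the salary data (all values made up):

```python
# Fact table: measurable, quantitative events, keyed by dimension ids.
fact_salary = [
    {"emp_key": 1, "dept_key": 10, "month": "2023-06", "salary": 15000},
    {"emp_key": 2, "dept_key": 10, "month": "2023-06", "salary": 13000},
]

# Dimension tables: descriptive attributes about the facts.
dim_employee = {1: {"name": "Mike", "title": "Engineer"},
                2: {"name": "Anna", "title": "Analyst"}}
dim_department = {10: {"dept_name": "IT", "location": "NYC"}}

# "Total cost for IT" touches one fact column plus one small dimension.
total = sum(f["salary"] for f in fact_salary
            if dim_department[f["dept_key"]]["dept_name"] == "IT")
print(total)  # 28000
```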


How to track the history changes?

  • Slowly changing dimensions (SCD)

    • Type 1: overwrite the existing data, like OLTP does
      • Obviously loses some information
      • With SCD Type 1 on employee data, we can never know an employee's previous name

    • Type 2: add new rows plus history-tracking columns
      • Does not lose any information
      • Avoids update and delete operations


Examples

  • Suppose we have the dim_employee data

SCD – Type 1 - before:

SCD – Type 1 - after:

SCD – Type 2:
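The tables themselves were images in the original deck; a hypothetical reconstruction of the three states (values made up):

```python
# Type 1 - before: one row per employee, current values only.
dim_employee = [
    {"emp_id": 1, "name": "Mike", "salary": 12000},
]

# Type 1 - after raising Mike's salary: the old value is overwritten and lost.
dim_employee = [
    {"emp_id": 1, "name": "Mike", "salary": 15000},
]

# Type 2: instead of updating, close the old row and append a new one.
# Extra columns (valid_from, valid_to, is_current) track the history.
dim_employee_scd2 = [
    {"emp_id": 1, "name": "Mike", "salary": 12000,
     "valid_from": "2020-01-01", "valid_to": "2023-06-01", "is_current": False},
    {"emp_id": 1, "name": "Mike", "salary": 15000,
     "valid_from": "2023-06-01", "valid_to": "9999-12-31", "is_current": True},
]
```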


Traditional data warehouse architecture

New architecture with some fancy keywords and new users

Traditional architecture


End of topic 1

  • We covered some basic concepts about data warehouses
  • We also talked about some design philosophy

  • Any questions?


MPP and Hadoop – storage and computation engine

  • What is a Massively Parallel Processing (MPP) database?
    • Originated in 1983 with Teradata
    • Originally designed as a distributed database (ACID, UDFs, stored procedures)
    • Very expensive software
    • Enterprise-level service from giant companies
  • What is the Hadoop ecosystem?
    • Originated from Google's papers, 2003-2005
    • Originally designed as a distributed file system + computation engine
    • Open source platform, zero license cost
    • No commercial support, or support only from startups


Before 2006

  • Hadoop:
    • Not implemented yet

  • MPP:
    • First MPP database: Teradata DBC/1012 in 1984, supporting 10^12 bytes = 1 TB of data
    • Competing with traditional SMP (shared-everything) systems, e.g. IBM DB2, Oracle
    • First production TB-level data warehouse at Walmart, 1992
    • Teradata claimed the largest data warehouse, with 130 TB of data, in 1999


Traditional MPP architecture

  • Teradata architecture:
    • Each node is a physical database computer

    • BYNET is a hardware-level message-passing layer that also provides fault tolerance in the computation layer

    • RAID provides fault tolerance in the storage layer

    • AMPs and disks are virtual CPUs and virtual disks

    • Very similar to some OLTP sharding mechanisms


2006 - 2010

  • Hadoop and MapReduce:
    • Implemented by Yahoo's engineering team
    • Passed a 1000-node test at Yahoo in 2007
    • Sorted 1 PB in 17 hours in 2009
    • Works for both structured and unstructured data

  • MPP:
    • Teradata announced that five of their customers had more than 1 PB of data (2008)
    • "MapReduce: A major step backwards", David DeWitt and Michael Stonebraker
    • Other MPP systems (Vertica, Greenplum) could be deployed on commodity hardware
    • Only works for structured data, with little flexibility


What is the difference between MPP and Hadoop?


MPP

  • Pros:
    • Most MPP systems have full SQL support with ACID transactions
    • Queries are very fast (40 years of query-optimizer work)
    • Symmetric architecture with data locality; an execution node does not need to know about other nodes

  • Cons:
    • Obvious barrel effect: the slowest node limits overall performance
    • Data needs to be balanced very well, which means poor elasticity
    • Relatively low QPS; most MPP systems cannot execute more than ~100 queries at the same time
    • Usually paired with an external transformation engine like Informatica



Hadoop

  • Pros:
    • Not just SQL; you can use general-purpose programming to do things SQL can't
    • Data does not need to be balanced very well, which gives better elasticity
    • Open source, no license cost
    • Good for heavy data transformation rather than interactive analysis
  • Cons:
    • Very, very slow: the MapReduce pipeline writes intermediate results to disk
    • Data is not balanced initially, so most MapReduce jobs need to shuffle data
    • Low-level computation framework; optimization must be handled by the user
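A minimal sketch of the MapReduce programming model in plain Python (no cluster; it just mimics the map → shuffle → reduce stages described above):

```python
from collections import defaultdict

def map_phase(line):
    # Emit (key, value) pairs; here: word counts.
    for word in line.split():
        yield word, 1

def reduce_phase(key, values):
    yield key, sum(values)

lines = ["big data", "big compute"]

# Shuffle: group all values by key. On a real cluster this step writes
# intermediate results to disk and moves them over the network, which is
# the main reason MapReduce jobs are slow.
groups = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        groups[key].append(value)

for key in groups:
    print(list(reduce_phase(key, groups[key])))  # [('big', 2)], ...
```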


MapReduce: A major step backwards

    • 1. MapReduce is a step backwards in database access
      • Imperative programming vs declarative programming (see the sketch below)
    • 2. MapReduce is a poor implementation
      • No indexes to accelerate query performance
    • 3. MapReduce is not novel
      • Oracle RAC and IBM DB2 have similar computation models
    • 4. MapReduce is missing features
      • No indexes, bulk loader, transactions, views, etc.
    • 5. MapReduce is incompatible with DBMS tools
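To make point 1 concrete, here is the same aggregation written both ways (hypothetical data; the SQL is shown as a string only for comparison):

```python
employees = [("mike", "IT", 15000), ("anna", "IT", 13000), ("bob", "HR", 9000)]

# Imperative (MapReduce style): you spell out *how* to compute the answer.
totals = {}
for name, dept, salary in employees:
    totals[dept] = totals.get(dept, 0) + salary
print(totals)  # {'IT': 28000, 'HR': 9000}

# Declarative (SQL): you state *what* you want; the optimizer picks the plan.
query = "SELECT dept, SUM(salary) FROM employee GROUP BY dept"
```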


2010 - 2020

  • Hadoop:
    • SQL on Hadoop
      • Hive compiles Hive SQL into MR jobs (transformation engine)
      • Impala uses the MPP model; does not guarantee a query will finish on node failure (query engine)
    • Separating storage and computation
      • Spark = MPP – query optimizer – transactions + speculative execution
        • Solved the data-loading issue of MPP
      • Better support for ML applications
  • MPP:
    • ELT instead of ETL: drop Informatica; also gained similar ways to avoid the barrel effect
    • Separating storage and computation
      • Vertica Eon: uses a cloud storage engine as the persistence layer
      • Snowflake: pure cloud-based MPP database
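A minimal PySpark sketch of "SQL on Hadoop" with storage separated from computation (the parquet path is hypothetical):

```python
from pyspark.sql import SparkSession

# Spark reads directly from a (possibly remote) storage layer and runs
# declarative SQL on top - no upfront data-loading step as in classic MPP.
spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

df = spark.read.parquet("s3a://warehouse/employee/")  # hypothetical path
df.createOrReplaceTempView("employee")

spark.sql("""
    SELECT dept, SUM(salary) AS total
    FROM employee
    GROUP BY dept
""").show()
```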


Today's Hadoop eco-system

    • Hadoop and MPP have become similar

    • 1. MapReduce is a step backwards in database access
      • Declarative programming won – Hive, Impala
    • 2. MapReduce is a poor implementation
      • Hive and HBase both have indexes to optimize queries
    • 3. MapReduce is not novel
      • IBM DB2 only reached 512 nodes, but many Hadoop clusters exceed 10000 nodes
    • 4. MapReduce is missing features
    • 5. MapReduce is incompatible with DBMS tools
      • The open source community implemented all of that


End of topic 2

  • We covered two popular big data solutions, MPP and Hadoop
  • We covered some history of the MPP database and the Hadoop ecosystem
  • We also talked about the differences between the programming paradigms

  • Any questions?


Extract, transform, load and the data platform

ETL – Extract Transform Load

    • Extract: get data out of OLTP; can be JDBC/REST/file
      • Example: select * from <table>
    • Transform: use SQL or another language to manipulate the data into the correct format
      • Example: unifying time zones
    • Load: put data into OLAP
      • Example: the COPY command in SQL

Data platform – solution framework

    • Extract: what kind of application holds those JDBC/REST calls?
    • Transform: what application executes that heavy computation?
    • Load: what kind of application handles the load logic?


World's smallest ETL framework and job
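The original slide showed this as a code screenshot; a minimal sketch of what such a framework and job could look like (all names hypothetical):

```python
def run_etl(extract, transform, load):
    """World's smallest ETL framework: three pluggable callables."""
    load(transform(extract()))

# A job built on the framework:
def extract():
    # In practice: a JDBC/REST/file read, e.g. SELECT * FROM employee.
    return [{"name": "mike", "salary": "15000"}]

def transform(rows):
    # Light cleanup: normalize names and types.
    return [{"name": r["name"].title(), "salary": int(r["salary"])} for r in rows]

def load(rows):
    # In practice: a COPY / bulk insert into the warehouse.
    print(f"loaded {len(rows)} rows: {rows}")

run_etl(extract, transform, load)
```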


Roles: project-dominated or product-dominated


Pain point for data platform

  • Orchestration

    • Most ETL pipelines are not services, so we need some way to schedule them
    • Common scheduling tools: Airflow, Jenkins, Azkaban
    • DAG: see the Airflow sketch below


Airflow
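The original slide showed an Airflow screenshot; a minimal DAG sketch for the ETL job above (task names and schedule are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# A nightly extract -> transform -> load pipeline expressed as a DAG.
with DAG(
    dag_id="employee_salary_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract",
                             python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform",
                               python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load",
                          python_callable=lambda: print("load"))

    extract >> transform >> load  # the DAG edges
```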


Pain point for data platform

  • Data discovery and metadata management

    • In a large organization, nobody knows all the data they have
    • We need some mechanism to help users find the data they need
    • Basically, a search engine for data

    • Nice to have:
      • A description for every table and column, like a data dictionary
      • Data lineage, providing an end-to-end view of the data
      • Data profiling, to know more about data quality


Amundsen


Pain point for data platform

  • Data cleansing, quality, and real-time profiling
    • Not all data from the source is decent
    • Even if the source is good, some of the data may not follow the data standard
    • We need some mechanism to execute light transformations and check data quality
    • Basically, unit tests for the data set

    • The easiest way is to validate with regular expressions, as sketched below
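A minimal sketch of rule-based validation (the "unit test of the data set"); columns and patterns are hypothetical:

```python
import re

# Reject rows whose fields do not match the expected patterns.
RULES = {
    "emp_id": re.compile(r"^\d+$"),                   # numeric id
    "hire_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),  # ISO date
}

def validate(row):
    """Return the list of columns that fail their rule."""
    return [col for col, pattern in RULES.items()
            if not pattern.fullmatch(str(row.get(col, "")))]

bad = validate({"emp_id": "42", "hire_date": "2023/06/01"})
print(bad)  # ['hire_date'] - fails the ISO date rule
```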


Amazon deequ


End of topic 3

  • We covered the basic definitions of ETL and data platforms
  • We covered some pain points and related tools for data platforms

  • Any questions?


Batch and streaming

  • What is stream processing?
    • A different philosophy from batch processing
      • Every application has its own data stream; batch processing just picks a snapshot of it to process
      • In other words, streaming can be considered real-time tiny batch processing

  • Why do we need stream processing?
    • Real-time / near-real-time analytics and ML
    • We need to do some real-time visualization on a big screen
    • Processing can happen before the data comes into the DW layer


Example

Streaming data

Snapshot data / batch


Extract in streaming process

  • How do we track changes in another system?
    • Push vs pull
      • Push: when the source system has an update, it sends a message to the extract application; needs source-system support
      • Pull: the extract application polls the source system in a loop
    • CDC – change data capture (see the polling sketch below)
      • REST API: most REST APIs have no event system, so the pull model works better here
      • Database – since Oracle 9i (2003):
        • Timestamp column: add a timestamp column and have the extract application run a query like `select * from table where ts > current_timestamp() - interval '1 minute'`
        • Snapshot: keep a snapshot table, query the diff at an interval with something like `(select * from table) minus (select * from snapshot)`, and refresh the snapshot each time
        • SQL trigger: use a SQL trigger with an operations table, then still use the pull model to query the operations table
        • Binlog: if the database documents its binlog format and protocol well, the extract application can act like a database replica and subscribe to the binlog
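A minimal pull-based CDC sketch using the timestamp-column approach. It assumes a DB-API connection (e.g. psycopg2) and a hypothetical `employee` table with a `last_updated` column:

```python
import time

POLL_SQL = "SELECT * FROM employee WHERE last_updated > %(since)s"

def poll_changes(conn, interval_sec=60):
    since = "1970-01-01 00:00:00"
    while True:
        # Take the new watermark *before* querying so rows updated while
        # the query runs are still picked up on the next iteration.
        new_watermark = time.strftime("%Y-%m-%d %H:%M:%S")
        with conn.cursor() as cur:
            cur.execute(POLL_SQL, {"since": since})
            for row in cur.fetchall():
                print("changed row:", row)  # hand off to the load step
        since = new_watermark
        time.sleep(interval_sec)
```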


Example

SQL trigger

Binlog


Transform in streaming process

  • Real stream processing (one event at a time):
    • Flink:
      • At-least-once delivery
      • Heavyweight transformation framework; can handle very high traffic (~350k events/s on a regular computer)
      • Also has a batch mode (a snapshot between two points in time)
    • Kafka:
      • Exactly-once delivery
      • Kafka Streams is pretty new; not sure whether it can handle high traffic
      • Before Flink: the source of the data stream
      • After Flink: peak shaving and valley filling (smoothing traffic bursts)
  • Micro-batch processing:
    • Spark Streaming:
      • Exactly-once delivery
      • Need to set a processing interval
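A toy sketch contrasting the two computation models (plain Python, no framework; names are illustrative):

```python
import time

def one_at_a_time(stream, process):
    # Flink / Kafka Streams style: handle each event as it arrives.
    for event in stream:
        process([event])

def micro_batch(stream, process, interval_sec=1.0):
    # Spark Streaming style: buffer events, flush every interval.
    batch, deadline = [], time.monotonic() + interval_sec
    for event in stream:
        batch.append(event)
        if time.monotonic() >= deadline:
            process(batch)
            batch, deadline = [], time.monotonic() + interval_sec
    if batch:
        process(batch)  # flush the final partial batch
```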


Real time analytic applications

    • Nowadays most systems need real-time monitoring, especially IoT

    • Message-queue-based processing only does computation; these systems are not databases

    • Traditional data warehouses are designed for bulk imports rather than many small queries

    • TSDBs:
      • Prometheus, InfluxDB
      • Special timestamp-based indexes give good performance when querying time-series data
      • Both have great visualization layers for creating dashboards
      • Usually have a data retention cycle, because the value of hot data >> cold data


Lambda

    • Lambda: streaming and batch together
    • Why compute everything twice?
      • Streaming may produce different results
      • The streaming layer gives real-time analytics that are not 100% accurate
      • The batch layer gives accurate analytics that are not real time


Kappa

    • Kappa: stream processing only
    • Pros:
      • No duplicate implementation
      • Low cost, all open source solutions

    • Cons:
      • Requires all data pipelines to be event based
      • Little technical and political support
      • More suited to TP-based applications than AP


End of topic 4

  • We covered the differences between batch and stream processing
  • We covered two different streaming computation models
  • We covered two different streaming platform architectures

  • Any questions?


Future

    • All-in-one solutions, covering TP, AP, and ETL
      • Google Spanner/F1
      • TiDB
        • OLTP: distributed K-V database (TiKV) + SQL layer (TiDB server)
        • ETL: TiSpark integrates the Spark engine with the persistence layer
        • OLAP: Raft-based replication to TiFlash, on nodes separate from TP