1 of 60

Distributed Systems 101

Insight Data

2 of 60

Outline

  • Seminar 1
    • Why not use a single machine?
    • Why distributed systems?
    • Failure likelihoods
    • Exercise
    • Data Pipeline Overview
    • Ingestion
  • Seminar 2
    • Processing
    • Datastores
    • Frontend
    • Exercise
  • Seminar 3: DevOps

3 of 60

Seminar 1

4 of 60

Why not use a single machine?

Machine A

Local Applications

Local OS

Network

  • What happens if failure in
    • Machine
    • OS
    • Local application(s)
  • Performance
    • Too much data for a single machine
    • Too many network requests for a single machine

5 of 60

Why not use a single machine?

Machine A

Network

  • System no longer works 🙁

6 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Local OS

Local OS

Network

Machine A

Local Applications

Local OS

Network

7 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Local OS

Local OS

Network

  • Multiple machines operating as a single system
    • We will discuss more on how this works

  • What happens if a failure occurs?

8 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Local OS

Network

  • Applications persist 😃

9 of 60

Why distributed systems?

10 of 60

Failure Likelihood ← link to experiment

Avail %

1 Component

2 Components

3 Components

4 Components

5 Components

Web Server

(eg Flask)

85.000%

97.750%

99.663%

99.949%

99.992%

Application Service 1 �(Ingestion, eg Kafka)

95.000%

99.750%

99.988%

99.999%

100.000%

Application Service 2 �(Processing, eg Spark)

95.000%

99.750%

99.988%

99.999%

100.000%

Application Service 3 �(Datastore, eg Cassandra)

95.000%

99.750%

99.988%

99.999%

100.000%

System Avail %

72.877%

97.019%

99.625%

99.948%

99.992%

Downtime per year (mins)

142559.1

15669.7

1970.3

275.9

40.4

11 of 60

Failure Likelihood ← link to experiment

Avail %

1 Component

2 Components

3 Components

4 Components

5 Components

Web Server

(eg Flask)

85.000%

97.750%

99.663%

99.949%

99.992%

Application Service 1 �(Ingestion, eg Kafka)

95.000%

99.750%

99.988%

99.999%

100.000%

Application Service 2 �(Processing, eg Spark)

95.000%

99.750%

99.988%

99.999%

100.000%

Application Service 3 �(Datastore, eg Cassandra)

95.000%

99.750%

99.988%

99.999%

100.000%

Database

99.900%

100.000%

100.000%

100.000%

100.000%

DNS

98.000%

99.960%

99.999%

100.000%

100.000%

Firewall

85.000%

97.750%

99.663%

99.949%

99.992%

Switch

99.000%

99.990%

100.000%

100.000%

100.000%

Data Center

99.990%

100.000%

100.000%

100.000%

100.000%

ISP

95.000%

99.750%

99.988%

99.999%

100.000%

System Avail %

57.032%

94.551%

99.276%

99.896%

99.985%

Downtime per year (mins)

225841.9

28638.3

3807.5

545.3

80.5

12 of 60

  • Make a copy of this sheet
  • Fill out everything in yellow, and calculate:
    • System availability
    • Downtime per time period

13 of 60

Data Pipeline

Ingestion Platform or Datastore

Processing

Ingestion Platform / Datastore

14 of 60

Data Stores

  • Distributed file systems
    • Hadoop File System (HDFS)
    • Amazon S3
    • GCP Storage
  • Ingestion Platform
    • Kafka
    • Pulsar
    • RabbitMQ

15 of 60

Ingestion Platform - Basic Concepts

16 of 60

Kafka cluster

17 of 60

Key Concepts

  • Key features:
    • Fault Tolerant - consensus!
    • Scalable
    • High throughput
    • Low latency
  • Exactly once / At least once / At most once
  • Topics and Partitions
  • Subscription: Exclusive, Fail-over or Shared

18 of 60

Distributed Processing

  • MapReduce!

19 of 60

Distributed Processing Cluster - Spark

20 of 60

Distributed Processing Frameworks

  • Batch/Unified
    • Hadoop
    • MapR
    • Spark
    • Flink
  • Distributed SQL Engine
    • Presto
  • Streaming
    • Spark Streaming
    • Flink
    • KStreams
    • Pulsar Functions

21 of 60

Databases

  • Database vs file system?
  • Fault tolerance?
  • SQL or NoSQL?
  • Columnar or Row?
  • Consistency, Availability, Partitioning?
  • In-memory or non-disk?
  • And so much more!

22 of 60

Database vs File System

  • A database stores data in a structured way, and allows you to query the data
  • A Query engine takes a query, and converts it into file operations
  • Trade-offs?
    • The more structure, the more space is required
    • The more structure, the faster complicated queries run
    • The more structure, the slower basic operations (add, delete, update) run
  • In a way, you can think of a NoSQL database as a compromise between a file system and a SQL Database
  • NoSQL is not always faster than SQL!

23 of 60

Data Modeling (The DE version)

24 of 60

Columnar or Row storage?

  • Which is better for….
    • Inserts/updates?
    • Filters?
    • Compression?
    • Aggregations?

25 of 60

Consistency, Availability, Partitioning

  • Next time!

26 of 60

Examples?

  • SQL:
    • MySQL, MariaDB, PostgreSQL, CockroachDB
  • Document-based / NoSQL
    • MongoDB, Redis, Memcache
  • Columnar:
    • Redshift, Cassandra,

27 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

or

Processing Options

Database Options

Frontend

Monitoring

28 of 60

29 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

30 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

File format

31 of 60

Ingestion

32 of 60

File System

33 of 60

File format

34 of 60

Distributed Systems 101

Seminar 2

35 of 60

Data Pipeline

Ingestion

Processing

Datastore

36 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

File format

37 of 60

Data Pipeline

Batch

Datastore

Unified

Streaming

Ingestion

File System

or

File format

Scheduling

or

or

Processing Options

38 of 60

Batch

39 of 60

Streaming

40 of 60

Unified

41 of 60

Yarn

Oozie

Zookeeper

Airflow

Luigi

Data engineering workflows can become incredibly complex in a production environment, with regularly scheduled jobs, dependencies, and allocation of shared resources. In order to manage these systems, scheduling and monitoring tools are critical.

Scheduling

Scheduling

42 of 60

Data Pipeline

Datastore

Ingestion

File System

or

File format

Batch

Unified

Streaming

Scheduling

or

or

Processing Options

43 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

or

Processing Options

Database Options

44 of 60

Picking Database: CAP Theorem

45 of 60

In other words….

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

46 of 60

CAP Theorem (Pick 2!)

mySQL

47 of 60

ACID Database

48 of 60

Relational

  • MySQL
  • PostgresSQL
  • AWS RDS

49 of 60

Key-Value

  • Redis
  • DynamoDB
  • Riak

50 of 60

Columnar

  • Cassandra

51 of 60

Invert Index

  • Lucene
    • ElasticSearch
    • Solr
  • CouchDB

52 of 60

Graph

  • Neo4J
  • OrientDB
  • Neptune

53 of 60

Data Pipeline

Ingestion

Processing

Datastore

Frontend

54 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

or

Processing Options

Database Options

Frontend

55 of 60

Frontend

  • Flask
  • Django

Database

Model

App Logic

View Logic

Frontend Template

Server Side

Client Side

Frontend Framework

56 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

or

Processing Options

Database Options

Frontend

Monitoring

57 of 60

  • Work on your version of this sheet
  • Update your future Insight project architecture with
    • Data ingestion
    • File system
    • Processing
      • Batch, streaming or unified
      • Scheduling
    • Datastore
      • Relational, key-value, columnar, inverted-index, and/or graph
    • Frontend

58 of 60

Appendix

59 of 60

60 of 60