1 of 60

Distributed Systems 101

Insight Data

2 of 60

Outline

Seminar 1

Why not use a single machine?
Why distributed systems?
Failure likelihoods
Exercise
Data Pipeline Overview
Ingestion

Seminar 2

Processing
Datastores
Frontend
Exercise

Seminar 3: DevOps

3 of 60

Seminar 1

4 of 60

Why not use a single machine?

Machine A

Local Applications

Local OS

Network

What happens if failure in

Machine
OS
Local application(s)

Performance

Too much data for a single machine
Too many network requests for a single machine

5 of 60

Why not use a single machine?

Machine A

Network

System no longer works 🙁

6 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Network

Machine A

Local Applications

Local OS

Network

7 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Network

Multiple machines operating as a single system

We will discuss more on how this works

What happens if a failure occurs?

8 of 60

Why distributed systems?

Machine A

Machine B

Machine C

Distributed Applications

Local OS

Network

Applications persist 😃

9 of 60

Why distributed systems?

https://www.cs.helsinki.fi/webfm_send/1262

10 of 60

Failure Likelihood ← link to experiment

Avail %	1 Component	2 Components	3 Components	4 Components	5 Components
Web Server (eg Flask)	85.000%	97.750%	99.663%	99.949%	99.992%
Application Service 1 �(Ingestion, eg Kafka)	95.000%	99.750%	99.988%	99.999%	100.000%
Application Service 2 �(Processing, eg Spark)	95.000%	99.750%	99.988%	99.999%	100.000%
Application Service 3 �(Datastore, eg Cassandra)	95.000%	99.750%	99.988%	99.999%	100.000%
System Avail %	72.877%	97.019%	99.625%	99.948%	99.992%

Downtime per year (mins)	142559.1	15669.7	1970.3	275.9	40.4

11 of 60

Failure Likelihood ← link to experiment

Avail %	1 Component	2 Components	3 Components	4 Components	5 Components
Web Server (eg Flask)	85.000%	97.750%	99.663%	99.949%	99.992%
Application Service 1 �(Ingestion, eg Kafka)	95.000%	99.750%	99.988%	99.999%	100.000%
Application Service 2 �(Processing, eg Spark)	95.000%	99.750%	99.988%	99.999%	100.000%
Application Service 3 �(Datastore, eg Cassandra)	95.000%	99.750%	99.988%	99.999%	100.000%
Database	99.900%	100.000%	100.000%	100.000%	100.000%
DNS	98.000%	99.960%	99.999%	100.000%	100.000%
Firewall	85.000%	97.750%	99.663%	99.949%	99.992%
Switch	99.000%	99.990%	100.000%	100.000%	100.000%
Data Center	99.990%	100.000%	100.000%	100.000%	100.000%
ISP	95.000%	99.750%	99.988%	99.999%	100.000%
System Avail %	57.032%	94.551%	99.276%	99.896%	99.985%

Downtime per year (mins)	225841.9	28638.3	3807.5	545.3	80.5

12 of 60

Distributed Systems 101 - Exercise 1

Make a copy of this sheet
Fill out everything in yellow, and calculate:

System availability
Downtime per time period

13 of 60

Data Pipeline

Ingestion Platform or Datastore

Processing

Ingestion Platform / Datastore

http://xyz.insightdataengineering.com/blog/pipeline_map/

14 of 60

Data Stores

Distributed file systems

Hadoop File System (HDFS)
Amazon S3
GCP Storage

Ingestion Platform

Kafka
Pulsar
RabbitMQ

15 of 60

Ingestion Platform - Basic Concepts

Source Article

16 of 60

Kafka cluster

17 of 60

Key Concepts

Key features:

Fault Tolerant - consensus!
Scalable
High throughput
Low latency

Exactly once / At least once / At most once
Topics and Partitions
Subscription: Exclusive, Fail-over or Shared

18 of 60

Distributed Processing

MapReduce!

Source

19 of 60

Distributed Processing Cluster - Spark

20 of 60

Distributed Processing Frameworks

Batch/Unified

Hadoop
MapR
Spark
Flink

Distributed SQL Engine

Presto

Streaming

Spark Streaming
Flink
KStreams
Pulsar Functions

21 of 60

Databases

Database vs file system?
Fault tolerance?
SQL or NoSQL?
Columnar or Row?
Consistency, Availability, Partitioning?
In-memory or non-disk?
And so much more!

22 of 60

Database vs File System

A database stores data in a structured way, and allows you to query the data
A Query engine takes a query, and converts it into file operations
Trade-offs?

The more structure, the more space is required
The more structure, the faster complicated queries run
The more structure, the slower basic operations (add, delete, update) run

In a way, you can think of a NoSQL database as a compromise between a file system and a SQL Database
NoSQL is not always faster than SQL!

23 of 60

Data Modeling (The DE version)

24 of 60

Columnar or Row storage?

Which is better for….

Inserts/updates?
Filters?
Compression?
Aggregations?

25 of 60

Consistency, Availability, Partitioning

Next time!

26 of 60

Examples?

SQL:

MySQL, MariaDB, PostgreSQL, CockroachDB

Document-based / NoSQL

MongoDB, Redis, Memcache

Columnar:

Redshift, Cassandra,

27 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

Processing Options

Database Options

Frontend

Monitoring

28 of 60

29 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

30 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

File format

31 of 60

Ingestion

32 of 60

File System

33 of 60

File format

https://www.nexla.com/resource/introduction-big-data-formats-understanding-avro-parquet-orc/

34 of 60

Distributed Systems 101

Seminar 2

35 of 60

Data Pipeline

Ingestion

Processing

Datastore

http://xyz.insightdataengineering.com/blog/pipeline_map/

36 of 60

Data Pipeline

Ingestion

Processing

Datastore

File System

or

File format

37 of 60

Data Pipeline

Batch

Datastore

Unified

Streaming

Ingestion

File System

or

File format

Scheduling

or

Processing Options

38 of 60

Batch

39 of 60

Streaming

40 of 60

Unified

41 of 60

Yarn

Oozie

Zookeeper

Airflow

Luigi

Data engineering workflows can become incredibly complex in a production environment, with regularly scheduled jobs, dependencies, and allocation of shared resources. In order to manage these systems, scheduling and monitoring tools are critical.

Scheduling

42 of 60

Data Pipeline

Datastore

Ingestion

File System

or

File format

Batch

Unified

Streaming

Scheduling

or

Processing Options

43 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

Processing Options

Database Options

44 of 60

Picking Database: CAP Theorem

45 of 60

In other words….

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

46 of 60

CAP Theorem (Pick 2!)

mySQL

47 of 60

ACID Database

48 of 60

Relational

MySQL
PostgresSQL
AWS RDS

source

49 of 60

Key-Value

Redis
DynamoDB
Riak

50 of 60

Columnar

Cassandra

51 of 60

Invert Index

source

Lucene

ElasticSearch
Solr

CouchDB

52 of 60

Graph

Neo4J
OrientDB
Neptune

53 of 60

Data Pipeline

Ingestion

Processing

Datastore

http://xyz.insightdataengineering.com/blog/pipeline_map/

Frontend

54 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

Processing Options

Database Options

Frontend

55 of 60

Frontend

Flask
Django

Database

Model

App Logic

View Logic

Frontend Template

Server Side

Client Side

Frontend Framework

56 of 60

Data Pipeline

Relational

Ingestion

File System

or

File format

Key-Value

Columnar

Invert Index

Graph

Batch

Unified

Streaming

Scheduling

or

Processing Options

Database Options

Frontend

Monitoring

57 of 60

Distributed Systems 101 - Exercise 1

Work on your version of this sheet
Update your future Insight project architecture with

Data ingestion
File system
Processing

Batch, streaming or unified
Scheduling

Datastore

Relational, key-value, columnar, inverted-index, and/or graph

Frontend

58 of 60

Appendix

1 of 60

2 of 60

3 of 60

4 of 60

5 of 60

6 of 60

7 of 60

8 of 60

9 of 60

10 of 60

11 of 60

12 of 60

13 of 60

14 of 60

15 of 60

16 of 60

17 of 60

18 of 60

19 of 60

20 of 60

21 of 60

22 of 60

23 of 60

24 of 60

25 of 60

26 of 60

27 of 60

28 of 60

29 of 60

30 of 60

31 of 60

32 of 60

33 of 60

34 of 60

35 of 60

36 of 60

37 of 60

38 of 60

39 of 60

40 of 60

41 of 60

42 of 60

43 of 60

44 of 60

45 of 60

46 of 60

47 of 60

48 of 60

49 of 60

50 of 60

51 of 60

52 of 60

53 of 60

54 of 60

55 of 60

56 of 60

57 of 60

58 of 60

59 of 60

60 of 60