Distributed Systems 101
Insight Data
Outline
Seminar 1
Why not use a single machine?
Machine A
Local Applications
Local OS
Network
Why not use a single machine?
Machine A
Network
Why distributed systems?
Machine A
Machine B
Machine C
Distributed Applications
Local OS
Local OS
Local OS
Network
Machine A
Local Applications
Local OS
Network
Why distributed systems?
Machine A
Machine B
Machine C
Distributed Applications
Local OS
Local OS
Network
Why distributed systems?
Failure Likelihood ← link to experiment
Availability % | 1 Component | 2 Components | 3 Components | 4 Components | 5 Components |
Web Server (e.g. Flask) | 85.000% | 97.750% | 99.663% | 99.949% | 99.992% |
Application Service 1 (Ingestion, e.g. Kafka) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
Application Service 2 (Processing, e.g. Spark) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
Application Service 3 (Datastore, e.g. Cassandra) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
System Availability % | 72.877% | 97.019% | 99.625% | 99.948% | 99.992% |
Downtime per year (mins) | 142559.1 | 15669.7 | 1970.3 | 275.9 | 40.4 |
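The figures in this table follow from two standard rules, sketched below under the assumption that failures are independent: n redundant copies of a component with availability a are up with probability 1 - (1 - a)^n, and a system that needs every component is up with the product of those values. The function names are illustrative and do not come from the original slides.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def redundant_availability(a: float, n: int) -> float:
    """Availability of n redundant copies: down only if all n copies are down."""
    return 1.0 - (1.0 - a) ** n


def system_availability(components: list[float], n: int) -> float:
    """Availability of a chain of components, each replicated n times.

    Assumes independent failures and that every component is required.
    """
    return math.prod(redundant_availability(a, n) for a in components)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Single-instance availabilities from the table:
# web server, ingestion, processing, datastore.
components = [0.85, 0.95, 0.95, 0.95]

for n in range(1, 6):
    avail = system_availability(components, n)
    print(f"{n} instance(s): {avail:.3%} available, "
          f"{downtime_minutes_per_year(avail):.1f} min/year down")
```

Run as written, the loop should reproduce the System Availability and Downtime rows. The independence assumption is the weak point in practice: replicas that share a rack, a power supply, or a bad deploy tend to fail together.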
Failure Likelihood ← link to experiment
Availability % | 1 Component | 2 Components | 3 Components | 4 Components | 5 Components |
Web Server (e.g. Flask) | 85.000% | 97.750% | 99.663% | 99.949% | 99.992% |
Application Service 1 (Ingestion, e.g. Kafka) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
Application Service 2 (Processing, e.g. Spark) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
Application Service 3 (Datastore, e.g. Cassandra) | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
Database | 99.900% | 100.000% | 100.000% | 100.000% | 100.000% |
DNS | 98.000% | 99.960% | 99.999% | 100.000% | 100.000% |
Firewall | 85.000% | 97.750% | 99.663% | 99.949% | 99.992% |
Switch | 99.000% | 99.990% | 100.000% | 100.000% | 100.000% |
Data Center | 99.990% | 100.000% | 100.000% | 100.000% | 100.000% |
ISP | 95.000% | 99.750% | 99.988% | 99.999% | 100.000% |
System Availability % | 57.032% | 94.551% | 99.276% | 99.896% | 99.985% |
Downtime per year (mins) | 225841.9 | 28638.3 | 3807.5 | 545.3 | 80.5 |
Data Pipeline
Ingestion Platform or Datastore
Processing
Ingestion Platform / Datastore
Data Stores
Ingestion Platform - Basic Concepts
Kafka cluster
Key Concepts
Distributed Processing
Distributed Processing Cluster - Spark
Distributed Processing Frameworks
Databases
Database vs File System
Data Modeling (The DE version)
Columnar or Row storage?
Consistency, Availability, Partition Tolerance
Examples?
Data Pipeline
Relational
Ingestion
File System
or
File format
Key-Value
Columnar
Inverted Index
Graph
Batch
Unified
Streaming
Scheduling
or
or
Processing Options
Database Options
Frontend
Monitoring
Data Pipeline
Ingestion
Processing
Datastore
File System
or
Data Pipeline
Ingestion
Processing
Datastore
File System
or
File format
Ingestion
File System
File format
Distributed Systems 101
Seminar 2
Data Pipeline
Ingestion
Processing
Datastore
Data Pipeline
Ingestion
Processing
Datastore
File System
or
File format
Data Pipeline
Batch
Datastore
Unified
Streaming
Ingestion
File System
or
File format
Scheduling
or
or
Processing Options
Batch
Streaming
Unified
YARN
Oozie
ZooKeeper
Airflow
Luigi
In production, data engineering workflows can become very complex: jobs run on regular schedules, depend on one another, and compete for shared resources. Scheduling and monitoring tools are critical for managing these systems.
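To make the dependency problem concrete, here is a minimal sketch of what a scheduler has to work out before it runs anything: an order in which every job starts only after the jobs it depends on have finished. It uses Python's standard `graphlib` module (Python 3.9+) rather than any of the tools named above, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "train_model": {"clean"},
    "publish_report": {"aggregate", "train_model"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Real schedulers add what this sketch leaves out: time-based triggers, retries, resource allocation, and running independent jobs (here `aggregate` and `train_model`) in parallel.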
Scheduling
Data Pipeline
Datastore
Ingestion
File System
or
File format
Batch
Unified
Streaming
Scheduling
or
or
Processing Options
Data Pipeline
Relational
Ingestion
File System
or
File format
Key-Value
Columnar
Inverted Index
Graph
Batch
Unified
Streaming
Scheduling
or
or
Processing Options
Database Options
Picking Database: CAP Theorem
In other words…
CAP Theorem (Pick 2!)
MySQL
ACID Database
Relational
Key-Value
Columnar
Inverted Index
Graph
Data Pipeline
Ingestion
Processing
Datastore
Frontend
Data Pipeline
Relational
Ingestion
File System
or
File format
Key-Value
Columnar
Inverted Index
Graph
Batch
Unified
Streaming
Scheduling
or
or
Processing Options
Database Options
Frontend
Frontend
Database
Model
App Logic
View Logic
Frontend Template
Server Side
Client Side
Frontend Framework
Data Pipeline
Relational
Ingestion
File System
or
File format
Key-Value
Columnar
Inverted Index
Graph
Batch
Unified
Streaming
Scheduling
or
or
Processing Options
Database Options
Frontend
Monitoring
Appendix