1 of 26

PyCon 2019

Scraping a Million Pokemon Battles

Distributed Systems By Example

Author: Duy Nguyen (pugs-rule)

2 of 26

Duy Nguyen - Scraping a Million Pokemon Battles (bit.ly/pycon-poke)

“Dweeeeeeeeeeeeeee”

3 of 26

Introduction

  • Currently at Google in Core Systems

  • Particularly care about gender equality and mentorship in tech

  • Previously at Ellevation Education then left to wander the world

5 of 26

Objective and Scope

  • I love Pokemon...for the most part.

  • Distributed Systems “102”
    • We’ll walk through this passion project of mine and use specific examples from it to show how to get started with distributed systems or microservice architectures in the cloud.

  • Scalability and 3 “Pillars”
    • Concurrency of Resources
    • Asserting for Correctness
    • Resilience against Failures

Correctness

Resilience

Concurrency

7 of 26

Architecture Overview

(1) Room List Watcher pushes new URLs onto the SQS queue

(2) Download Bots pull new URLs from SQS and scrape the battle logs

(3) Battle logs are stored in S3

(4) Lambda Function watches the S3 Bucket for new objects

(5) New objects are indexed in ElastiCache (Redis)
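
The five numbered steps can be sketched in a single process. This is a minimal sketch under stated assumptions, not the project's code: a queue.Queue stands in for SQS, plain dicts stand in for the S3 bucket and the Redis index, and the function names and example URLs are hypothetical.

```python
import queue

url_queue = queue.Queue()   # stand-in for SQS
bucket = {}                 # stand-in for the S3 bucket: key -> battle log
index = {}                  # stand-in for the Redis index: key -> metadata


def room_list_watcher(urls):
    """(1) Push newly seen battle URLs onto the queue."""
    for url in urls:
        url_queue.put(url)


def on_new_object(key):
    """(4) + (5) Runs for each new object and indexes it."""
    index[key] = {"size": len(bucket[key])}


def download_bot():
    """(2) + (3) Pull one URL, scrape it, and store the log."""
    url = url_queue.get()
    key = url.rsplit("/", 1)[-1]
    bucket[key] = f"log for {url}"   # placeholder for the scraped battle log
    on_new_object(key)               # in AWS, the bucket event would trigger this


room_list_watcher(["https://example.com/battle-1", "https://example.com/battle-2"])
while not url_queue.empty():
    download_bot()

print(sorted(index))  # ['battle-1', 'battle-2']
```

Each stage only touches the stage next to it, which is what lets the real system swap each stand-in for a managed service.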

8 of 26

Concurrency

“Our application cannot handle increases in traffic.”

    urls = room_list_watcher.scrape()

    for url in urls:
        download_bot.scrape(url=url)

11 of 26

Concurrency - System Characteristics

    urls = room_list_watcher.scrape()

    for url in urls:
        download_bot.scrape(url=url)

  • Business Logic: room_list_watcher.scrape(), t ~= 300 milliseconds
  • Business Logic: download_bot.scrape(url=url), 2 seconds <= t <= 45 minutes
  • State: urls
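
A back-of-the-envelope calculation shows why the sequential loop cannot keep up. The sketch assumes the 300 millisecond figure is the room list scrape and the 2 second to 45 minute range is a single battle download. The batch of 100 URLs and the 50 workers are made-up numbers for illustration.

```python
def sequential_seconds(n_urls, list_s, download_s):
    """Wall-clock time when every download waits for the previous one."""
    return list_s + n_urls * download_s


def partitioned_seconds(n_urls, list_s, download_s, workers):
    """Ideal wall-clock time when downloads are spread over independent workers."""
    batches = -(-n_urls // workers)  # ceiling division
    return list_s + batches * download_s


LIST_S = 0.3                  # t ~= 300 milliseconds
WORST_DOWNLOAD_S = 45 * 60    # upper bound of 45 minutes

worst_sequential = sequential_seconds(100, LIST_S, WORST_DOWNLOAD_S)
worst_partitioned = partitioned_seconds(100, LIST_S, WORST_DOWNLOAD_S, workers=50)

print(f"sequential: {worst_sequential / 3600:.1f} hours")   # sequential: 75.0 hours
print(f"50 workers: {worst_partitioned / 3600:.1f} hours")  # 50 workers: 1.5 hours
```

The listing step is negligible either way. The download step dominates, so that is the part worth partitioning.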

12 of 26

Concurrency - Partitioning

    urls = room_list_watcher.scrape()

    for url in urls:
        download_bot.scrape(url=url)

  • Business Logic: room_list_watcher.scrape(), t ~= 300 milliseconds
  • Business Logic: download_bot.scrape(url=url), 2 seconds <= t <= 45 minutes
  • State: urls

13 of 26

“Unless you're implementing an operating system, use higher-level primitives such as atomic message queues.”

Concurrency - Model of Computation (Python)

    import queue

    # Thread-safe queue shared by the producer and the consumers.
    url_queue = queue.Queue()

    def produce():
        while True:
            urls = room_list_watcher.scrape()
            for url in urls:
                url_queue.put(url)

    def consume():
        while True:
            url = url_queue.get()  # blocks until a URL is available
            download_bot.scrape(url=url)
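
The producer and consumer above can be run end to end with threads. This is a minimal sketch: fake_room_list and fake_download are hypothetical stand-ins for the two scrapers, and, unlike the endless loops on the slide, the producer runs once and sends a sentinel so that the program terminates.

```python
import queue
import threading

url_queue = queue.Queue()
results = []
results_lock = threading.Lock()
STOP = object()  # sentinel telling a consumer to exit


def fake_room_list():
    """Hypothetical stand-in for room_list_watcher.scrape()."""
    return [f"battle-{i}" for i in range(10)]


def fake_download(url):
    """Hypothetical stand-in for download_bot.scrape(url=url)."""
    return f"log:{url}"


def produce(n_consumers):
    for url in fake_room_list():
        url_queue.put(url)
    for _ in range(n_consumers):
        url_queue.put(STOP)  # one sentinel per consumer


def consume():
    while True:
        url = url_queue.get()
        if url is STOP:
            break
        log = fake_download(url)
        with results_lock:
            results.append(log)


N_CONSUMERS = 3
consumers = [threading.Thread(target=consume) for _ in range(N_CONSUMERS)]
for thread in consumers:
    thread.start()
produce(N_CONSUMERS)
for thread in consumers:
    thread.join()

print(len(results))  # 10
```

The only shared state is the queue, so adding consumers is a matter of changing N_CONSUMERS.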

14 of 26

Concurrency - Model of Computation (AWS)

    url_queue = SQSQueue(
        name=self._properties['queue']['name'])

    def produce():
        while True:
            urls = room_list_watcher.scrape()
            for url in urls:
                url_queue.put(url)

    def consume():
        while True:
            url = url_queue.get()
            download_bot.scrape(url=url)
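
The slides do not show how SQSQueue is implemented. One way such a put/get wrapper could look is sketched below, assuming a client object that behaves like boto3.client("sqs"). The class body, the InMemorySQSClient fake, and the queue name are all hypothetical; only the four client method names and their keyword arguments follow the boto3 SQS client.

```python
class SQSQueue:
    """Hypothetical put/get wrapper; not the speaker's actual class."""

    def __init__(self, name, client):
        # client is expected to behave like boto3.client("sqs").
        self._client = client
        self._url = client.get_queue_url(QueueName=name)["QueueUrl"]

    def put(self, item):
        self._client.send_message(QueueUrl=self._url, MessageBody=item)

    def get(self):
        """Block until a message arrives, then delete it and return its body."""
        while True:
            response = self._client.receive_message(
                QueueUrl=self._url,
                MaxNumberOfMessages=1,
                WaitTimeSeconds=20,  # long polling
            )
            for message in response.get("Messages", []):
                self._client.delete_message(
                    QueueUrl=self._url,
                    ReceiptHandle=message["ReceiptHandle"],
                )
                return message["Body"]


class InMemorySQSClient:
    """Tiny fake of the four client calls used above, for local runs."""

    def __init__(self):
        self._messages = []
        self._next_handle = 0

    def get_queue_url(self, QueueName):
        return {"QueueUrl": f"memory://{QueueName}"}

    def send_message(self, QueueUrl, MessageBody):
        self._next_handle += 1
        self._messages.append(
            {"Body": MessageBody, "ReceiptHandle": str(self._next_handle)})

    def receive_message(self, QueueUrl, MaxNumberOfMessages, WaitTimeSeconds):
        if not self._messages:
            return {}
        return {"Messages": self._messages[:MaxNumberOfMessages]}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self._messages = [
            m for m in self._messages if m["ReceiptHandle"] != ReceiptHandle]


url_queue = SQSQueue(name="battle-urls", client=InMemorySQSClient())
url_queue.put("https://example.com/battle-1")
print(url_queue.get())  # https://example.com/battle-1
```

This sketch deletes the message as soon as it is received, which keeps get() simple but loses the URL if the download later fails. A sturdier consumer would delete only after the scrape succeeds.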

15 of 26

Concurrency - Model of Computation (Go)

    urlQueue := make(chan string)

    // Producer: push every scraped URL onto the channel.
    go func() {
        for {
            urls := roomListWatcher.Scrape()
            for _, url := range urls {
                urlQueue <- url
            }
        }
    }()

    // Consumer: receive URLs from the channel and scrape each one.
    go func() {
        for {
            url := <-urlQueue
            downloadBot.Scrape(url)
        }
    }()

“Concurrency is the composition of independently executing components.”

16 of 26

Correctness

“Our application is more difficult to reason about.”

  • New Problems
    • Loss of determinism
    • Long startup times
    • Increased flakiness

  • Glossary
    • Double: A generic term for any object that replaces or stands in for a production object during testing.
    • Mock: Has expectations about the calls or interactions.
    • Fake: Has a working implementation but takes some shortcut (e.g. InMemoryDatabase).
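
The difference between a mock and a fake is easiest to see side by side. A minimal sketch using the standard library's unittest.mock; FakeQueue and enqueue_all are hypothetical names made up for this example.

```python
from unittest import mock


class FakeQueue:
    """Fake: a working implementation that takes a shortcut (a list in memory)."""

    def __init__(self):
        self._items = []

    def put(self, item):
        self._items.append(item)

    def get(self):
        return self._items.pop(0)


def enqueue_all(url_queue, urls):
    """Code under test: push each URL onto whatever queue it is given."""
    for url in urls:
        url_queue.put(url)


# With a mock, the test asserts on the interactions.
mock_queue = mock.Mock()
enqueue_all(mock_queue, ["a", "b"])
mock_queue.put.assert_has_calls([mock.call("a"), mock.call("b")])

# With a fake, the test asserts on the resulting behavior.
fake_queue = FakeQueue()
enqueue_all(fake_queue, ["a", "b"])
assert fake_queue.get() == "a"
assert fake_queue.get() == "b"
```

Both objects are doubles: each one stands in for the production queue during the test.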

20 of 26

Correctness - Testing External Dependencies

Implementation / Requirements / Trade-Offs

Fake

  • Somewhat closer to production environment
  • Somewhat more engineering time for development and maintenance

Real

  • Closest to production environment
  • Least engineering time for development and maintenance
  • Longest test execution time
  • Most computationally expensive

Mock

  • The original Instagram engineering philosophy is “Do the simple thing first.”
  • Furthest from production environment
  • Most susceptible to false positives (interface drift)

Pokemon / Google / Community
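
The "false positives (interface drift)" risk of mocks can be shown in a few lines. In this minimal sketch, DownloadBot and run are hypothetical: the production method has been renamed, but the caller still uses the old name. A plain Mock lets the test pass anyway, while unittest.mock.create_autospec catches the mismatch.

```python
from unittest import mock


class DownloadBot:
    """Hypothetical production class whose method was renamed from scrape to download."""

    def download(self, url):
        return f"log for {url}"


def run(bot, url):
    """Caller that still uses the old method name."""
    return bot.scrape(url=url)


# A plain Mock accepts any attribute, so the stale call passes silently.
loose = mock.Mock()
run(loose, "battle-1")
loose.scrape.assert_called_once_with(url="battle-1")  # passes: a false positive

# An autospecced mock only exposes what the real class has.
strict = mock.create_autospec(DownloadBot, instance=True)
try:
    run(strict, "battle-1")
    drift_detected = False
except AttributeError:
    drift_detected = True

print(drift_detected)  # True
```

Autospeccing narrows the gap between the mock and production, but it only checks names and signatures, not behavior.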

21 of 26

Resilience

“Our application will inevitably fail.”

  • Timeout, AKA "Don't wait forever"
    • Premise: Beyond a certain wait, a success result is unlikely.

  • Retry, AKA "Maybe it's just a blip"
    • Premise: Many faults are transient and may self-correct after a short delay.

  • Fallback, AKA "Degrade gracefully"
    • Premise: Things will still fail - plan what you will do when that happens.

22 of 26

Resilience - Timeout

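    import selenium.common.exceptions
    from selenium.webdriver.support import expected_conditions

    def find_button(self, locator):
        # Pass the locator positionally; the parameter name is not the
        # same in every Selenium release.
        condition = expected_conditions.element_to_be_clickable(locator)
        try:
            # until() raises TimeoutException once the wait is exhausted.
            button = self._wait_context.until(condition)
        except selenium.common.exceptions.TimeoutException:
            button = None

        result = lookup.results.Find(value=button, zero_value=None)
        return result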
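
The same "don't wait forever" idea works without a browser. This is a minimal sketch of a polling wait in plain Python, similar in spirit to the wait used above but not Selenium's implementation; wait_until and find_or_none are hypothetical names.

```python
import time


def wait_until(condition, timeout_s, poll_s=0.01):
    """Poll condition() until it returns something truthy or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} s")
        time.sleep(poll_s)


def find_or_none(condition, timeout_s):
    """Turn the timeout into a 'not found' value instead of an exception."""
    try:
        return wait_until(condition, timeout_s)
    except TimeoutError:
        return None


attempts = iter([None, None, "button"])
print(find_or_none(lambda: next(attempts), timeout_s=1.0))  # button
print(find_or_none(lambda: None, timeout_s=0.05))           # None
```

Converting the exception to None at one boundary keeps every caller from having to write its own try/except.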

23 of 26

Resilience - Retry

    def build_policy(...):
        stop_strategy = retry.stop_strategies.AfterDuration(
            maximum_duration=self._properties['policy']['stop_strategy']['maximum_duration'])
        wait_strategy = retry.wait_strategies.Fixed(
            wait_time=self._properties['policy']['wait_strategy']['wait_time'])

        retry_policy = retry.PolicyBuilder() \
            .with_stop_strategy(stop_strategy) \
            .with_wait_strategy(wait_strategy) \
            .continue_on_exception(automation.exceptions.ConnectionLost) \
            .continue_on_exception(automation.exceptions.WebDriverError) \
            .continue_on_exception(exceptions.BattleNotCompleted) \
            .build()
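
The retry package used above is the speaker's own, and its internals are not shown. The behavior the policy describes (a fixed wait between attempts, stop after a maximum duration, retry only on listed exceptions) could be sketched as follows. All names here are hypothetical and do not come from that package.

```python
import time


class RetryExhausted(Exception):
    """Raised when the stop strategy ends the retry loop."""


def execute_with_retry(func, retry_on, maximum_duration, wait_time):
    """Call func until it succeeds, waiting a fixed time between attempts.

    Only exceptions listed in retry_on are retried; anything else propagates.
    """
    deadline = time.monotonic() + maximum_duration
    while True:
        try:
            return func()
        except retry_on as error:
            if time.monotonic() + wait_time > deadline:
                raise RetryExhausted("gave up after maximum_duration") from error
            time.sleep(wait_time)


class ConnectionLost(Exception):
    """Hypothetical transient fault."""


calls = {"count": 0}


def flaky_scrape():
    """Fails twice, then succeeds, to imitate a blip."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionLost()
    return ["turn 1", "turn 2"]


result = execute_with_retry(
    flaky_scrape, retry_on=(ConnectionLost,), maximum_duration=1.0, wait_time=0.01)
print(result, calls["count"])  # ['turn 1', 'turn 2'] 3
```

Listing the retryable exceptions explicitly matters: an unexpected error should surface immediately instead of being retried until the deadline.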

24 of 26

Resilience - Fallback

    def scrape(self, url):
        elements = None
        while elements is None:
            try:
                elements = self._policy.execute(self._scraper.scrape, url=url)
            except retry.exceptions.MaximumRetry as e:
                # The expected errors have persisted. Defer to the
                # fallback.
                elements = list()
            except selenium.common.exceptions.StaleElementReferenceException as e:
                # An expected error has occurred that cannot be handled
                # by alternative measures. Reload the existing scraper.
                self._reload()

        return elements
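
The two branches of the fallback can be exercised without Selenium. In this minimal sketch, Scraper, MaximumRetry, and StaleSession are hypothetical stand-ins, and the scripted outcomes replace real page loads.

```python
class MaximumRetry(Exception):
    """Hypothetical signal that retries were exhausted."""


class StaleSession(Exception):
    """Hypothetical error that retrying cannot fix without a reload."""


class Scraper:
    def __init__(self, outcomes):
        self._outcomes = iter(outcomes)  # scripted results for the demo
        self.reloads = 0

    def _attempt(self, url):
        outcome = next(self._outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    def _reload(self):
        self.reloads += 1

    def scrape(self, url):
        elements = None
        while elements is None:
            try:
                elements = self._attempt(url)
            except MaximumRetry:
                elements = []      # fallback: degrade to "nothing scraped"
            except StaleSession:
                self._reload()     # repair, then loop to try again
        return elements


recovered = Scraper([StaleSession(), ["turn 1"]])
print(recovered.scrape("battle-1"), recovered.reloads)  # ['turn 1'] 1

degraded = Scraper([MaximumRetry()])
print(degraded.scrape("battle-2"), degraded.reloads)    # [] 0
```

Returning an empty list keeps the caller's loop running: one battle is lost, but the bot moves on to the next URL.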

25 of 26

“We don’t ship code; we ship features.”

“We don’t solve problems for computers; we solve problems for people.”

26 of 26

HAH, you think there’s still time for questions?