PyCon 2019
Scraping a Million Pokemon Battles
Distributed Systems By Example
Author: Duy Nguyen (pugs-rule)
Duy Nguyen - Scraping a Million Pokemon Battles (bit.ly/pycon-poke)
“Dweeeeeeeeeeeeeee”
Introduction
Objective and Scope
Correctness
Resilience
Concurrency
Architecture Overview
(1) Room List Watcher pushes new URLs onto the SQS queue
(2) Download Bots pull new URLs from SQS and scrape the battle logs
(3) Battle logs are stored in S3
(4) Lambda Function watches the S3 Bucket for new objects
(5) New objects are indexed in ElastiCache (Redis)
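Steps (4) and (5) of the architecture could look roughly like the sketch below. The event layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) is the standard S3 notification format. Everything else is an assumption for illustration: the `index` argument is a dict-like stand-in for the Redis client, and the key-to-bucket mapping is not the talk's actual schema.

```python
from urllib.parse import unquote_plus


def extract_objects(event):
    """Return (bucket, key) pairs from an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3["object"]["key"])
        pairs.append((s3["bucket"]["name"], key))
    return pairs


def handler(event, context, index=None):
    """Lambda-style entry point: index every new object.

    `index` is a hypothetical dict-like stand-in for the Redis client.
    """
    index = {} if index is None else index
    for bucket, key in extract_objects(event):
        index[key] = bucket
    return index
```

Injecting `index` keeps the handler testable without a running Redis instance.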
Concurrency
“Our application cannot handle increases in traffic.”
    urls = room_list_watcher.scrape()

    for url in urls:
        download_bot.scrape(url=url)
Concurrency - System Characteristics
    urls = room_list_watcher.scrape()

    for url in urls:
        download_bot.scrape(url=url)
Business Logic
Business Logic
State
t ~= 300 milliseconds
2 seconds <= t <= 45 minutes
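A back-of-the-envelope sketch of why the sequential loop cannot keep up. It reads the two durations above as the room-list call (about 300 ms) and a single battle download (2 seconds to 45 minutes), which is an interpretation of the slide annotations. The function names, batch size, and worker count are made up for illustration.

```python
import math

WATCHER_SECONDS = 0.3            # t ~= 300 milliseconds
DOWNLOAD_MIN_SECONDS = 2         # lower bound for one battle
DOWNLOAD_MAX_SECONDS = 45 * 60   # upper bound for one battle


def sequential_seconds(n_urls, per_download):
    """One watcher call followed by n_urls downloads, one at a time."""
    return WATCHER_SECONDS + n_urls * per_download


def partitioned_seconds(n_urls, per_download, workers):
    """Same work, with downloads spread evenly over `workers` consumers."""
    rounds = math.ceil(n_urls / workers)
    return WATCHER_SECONDS + rounds * per_download
```

With 100 URLs at the 45-minute upper bound, `sequential_seconds` gives roughly 75 hours. Spreading the same batch over 50 consumers brings `partitioned_seconds` down to roughly 1.5 hours.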
Concurrency - Partitioning
“Unless you're implementing an operating system, use higher-level primitives such as atomic message queues.”
Concurrency - Model of Computation (Python)
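    import queue

    # Thread-safe FIFO queue shared by the producer and the consumers.
    url_queue = queue.Queue()

    def produce():
        # Poll the room list and enqueue every battle URL it reports.
        while True:
            urls = room_list_watcher.scrape()
            for url in urls:
                url_queue.put(url)

    def consume():
        # Block until a URL is available, then scrape that battle.
        while True:
            url = url_queue.get()
            download_bot.scrape(url=url)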
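A runnable sketch of the same producer/consumer model, using threads from the standard library. The `batches` and `scrape` arguments are hypothetical stand-ins for `room_list_watcher.scrape()` and `download_bot.scrape()`. The sentinel-based shutdown is an addition so the example terminates; the slide's loops run forever.

```python
import queue
import threading

STOP = object()  # sentinel telling a consumer to exit


def run_pipeline(batches, scrape, n_consumers=3):
    """Push URLs from `batches` through a queue to `n_consumers` threads."""
    url_queue = queue.Queue()
    results = []
    lock = threading.Lock()

    def produce():
        for urls in batches:          # stand-in for room_list_watcher.scrape()
            for url in urls:
                url_queue.put(url)
        for _ in range(n_consumers):  # one sentinel per consumer
            url_queue.put(STOP)

    def consume():
        while True:
            url = url_queue.get()
            if url is STOP:
                break
            log = scrape(url)         # stand-in for download_bot.scrape()
            with lock:
                results.append(log)

    threads = [threading.Thread(target=produce)]
    threads += [threading.Thread(target=consume) for _ in range(n_consumers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because consumers finish in no fixed order, callers should not rely on the ordering of `results`.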
Concurrency - Model of Computation (AWS)
    url_queue = SQSQueue(
        name=self._properties['queue']['name'])

    def produce():
        while True:
            urls = room_list_watcher.scrape()
            for url in urls:
                url_queue.put(url)

    def consume():
        while True:
            url = url_queue.get()
            download_bot.scrape(url=url)
Concurrency - Model of Computation (Go)
    urlQueue := make(chan string)

    go func() {
        for {
            urls := roomListWatcher.Scrape()
            for _, url := range urls {
                urlQueue <- url
            }
        }
    }()

    go func() {
        for {
            url := <-urlQueue
            downloadBot.Scrape(url)
        }
    }()
“Concurrency is the composition of independently executing components.”
Correctness
“Our application is more difficult to reason about.”
Term | Definition |
Double | A generic term for any object that replaces or stands in for a production object during testing. |
Mock | Has expectations about the calls or interactions it receives. |
Fake | Has a working implementation but takes a shortcut (e.g. InMemoryDatabase). |
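A small sketch contrasting a fake with a mock. `FakeQueue` and `forward_urls` are hypothetical names invented for this example. `unittest.mock.Mock` is the standard-library mock.

```python
from collections import deque
from unittest import mock


class FakeQueue:
    """Fake: a working queue that takes a shortcut (in-memory, no network)."""

    def __init__(self):
        self._items = deque()

    def put(self, item):
        self._items.append(item)

    def get(self):
        return self._items.popleft()


def forward_urls(urls, url_queue):
    """Code under test: enqueue every URL it is given."""
    for url in urls:
        url_queue.put(url)


def check_with_fake():
    fake = FakeQueue()
    forward_urls(["battle-1", "battle-2"], fake)
    # State-based check: read back what the fake actually stored.
    return [fake.get(), fake.get()]


def check_with_mock():
    double = mock.Mock()
    forward_urls(["battle-1"], double)
    # Interaction-based check: assert on the call that was made.
    double.put.assert_called_once_with("battle-1")
    return double.put.call_count
```

The fake verifies resulting state, while the mock verifies the interaction itself.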
Correctness - Testing External Dependencies
Implementation | Requirements | Trade-Offs |
Fake | | |
Real | | |
Mock | “Do the simple thing first.” | |
Pokemon
Community
Resilience
“Our application will inevitably fail.”
Policy | Premise | AKA |
Timeout | Beyond a certain wait, a success result is unlikely. | "Don't wait forever" |
Retry | Many faults are transient and may self-correct after a short delay. | "Maybe it's just a blip" |
Fallback | Things will still fail - plan what you will do when that happens. | "Degrade gracefully" |
Resilience - Timeout
    def find_button(self, locator):
        condition = expected_conditions.element_to_be_clickable(
            locator=locator)
        try:
            button = self._wait_context.until(condition)
        except selenium.common.exceptions.TimeoutException:
            button = None

        result = lookup.results.Find(value=button, zero_value=None)
        return result
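The pattern in `find_button` is a bounded wait whose timeout becomes an explicit "not found" value. It can be sketched without Selenium. All names below (`wait_until`, `find_or_none`, `WaitTimeout`) are hypothetical. The sketch mirrors the idea of an explicit wait but is not Selenium's implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy in time."""


def wait_until(condition, timeout, poll_interval=0.01):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


def find_or_none(condition, timeout):
    """Turn a timeout into an explicit 'not found' result."""
    try:
        return wait_until(condition, timeout)
    except WaitTimeout:
        return None
```

Returning `None` instead of propagating the exception lets the caller decide how to handle a missing element.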
Resilience - Retry
    def build_policy(...):
        stop_strategy = retry.stop_strategies.AfterDuration(
            maximum_duration=self._properties['policy']['stop_strategy']['maximum_duration'])
        wait_strategy = retry.wait_strategies.Fixed(
            wait_time=self._properties['policy']['wait_strategy']['wait_time'])

        retry_policy = retry.PolicyBuilder() \
            .with_stop_strategy(stop_strategy) \
            .with_wait_strategy(wait_strategy) \
            .continue_on_exception(automation.exceptions.ConnectionLost) \
            .continue_on_exception(automation.exceptions.WebDriverError) \
            .continue_on_exception(exceptions.BattleNotCompleted) \
            .build()
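A minimal sketch of what such a policy might do when executed: a fixed wait between attempts, stopping after a maximum duration, and retrying only on listed exceptions. This is not the `retry` package used above. `execute_with_retry`, `make_flaky`, and this `MaximumRetry` class are hypothetical names written for illustration.

```python
import time


class MaximumRetry(Exception):
    """Raised when the stop condition is reached without a success."""


def execute_with_retry(func, retry_on, maximum_duration, wait_time,
                       sleep=time.sleep, clock=time.monotonic):
    """Call `func` until it succeeds or `maximum_duration` seconds pass.

    Only exceptions listed in the `retry_on` tuple trigger another attempt;
    anything else propagates immediately.
    """
    started = clock()
    while True:
        try:
            return func()
        except retry_on as error:
            if clock() - started >= maximum_duration:
                raise MaximumRetry("gave up retrying") from error
            sleep(wait_time)  # fixed wait between attempts


def make_flaky(failures, error, result):
    """Return a callable that raises `error` `failures` times, then succeeds."""
    remaining = [failures]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise error("transient fault")
        return result

    return call
```

Passing `sleep` and `clock` as parameters lets tests run without real delays.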
Resilience - Fallback
    def scrape(self, url):
        elements = None
        while elements is None:
            try:
                elements = self._policy.execute(self._scraper.scrape, url=url)
            except retry.exceptions.MaximumRetry as e:
                # The expected errors have persisted. Defer to the
                # fallback.
                elements = list()
            except selenium.common.exceptions.StaleElementReferenceException as e:
                # An expected error has occurred that cannot be handled
                # by alternative measures. Reload the existing scraper.
                self._reload()

        return elements
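The same fallback shape, sketched as a standalone function. `PersistentFailure` and `StaleSession` are hypothetical stand-ins for the retry and Selenium exceptions above. The `max_reloads` cap is an addition not present in the slide; it keeps a permanently stale scraper from looping forever.

```python
class PersistentFailure(Exception):
    """Hypothetical: the retry policy gave up."""


class StaleSession(Exception):
    """Hypothetical: the scraper must be reloaded before trying again."""


def scrape_with_fallback(scrape, reload, url, max_reloads=3):
    """Return scrape(url), degrading to an empty list when failures persist."""
    for _ in range(max_reloads + 1):
        try:
            return scrape(url)
        except PersistentFailure:
            # Expected errors persisted: degrade gracefully.
            return []
        except StaleSession:
            # Not fixable by retrying the call: rebuild state, then try again.
            reload()
    # Reload budget exhausted: fall back as well.
    return []
```

An empty list lets downstream code treat "nothing scraped" the same way as "battle had no elements", which is the graceful degradation the Fallback policy describes.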
“We don’t ship code; we ship features.”
“We don’t solve problems for computers; we solve problems for people.”
HAH, you think there’s still time for questions?