1 of 31

Cutting the price of Scraping Cloud Costs

Ed Crewe, Developer, EDB Postgres AI
PyCon LT 2025 Data Day

©EDB 2024 — ALL RIGHTS RESERVED.

2 of 31

Introduction

Ed Crewe - developer on EDB's Postgres AI product, from Bristol, UK

EDB Postgres AI and Data pipeline ingredients

3 of 31

Part 1: Starting at the End

What our pipeline's data is used for


4 of 31

How does cloud pricing work?

It is complicated! … and the price lists are not small.
The full price lists for the three main CSPs, Amazon, Google and Azure, together contain almost 5 million prices, known as SKUs (Stock Keeping Unit prices).

csp x service x type x tier x region
3 x 200 x 50 x 3 x 50 = 4.5 million (see the sketch after this list)

For example, hard disk storage should be simple, right? Yet its price depends on:

  • hardware type
  • capacity
  • throughput
  • region (intra/cross), e.g. us-east2, ap-south1
  • backup costs:
    • type
    • retention size/time
    • schedule etc.
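A minimal back-of-the-envelope sketch of where the 4.5 million figure comes from; the per-dimension counts are the rough ones quoted above, not exact catalogue sizes.

# Rough combinatorial estimate of how many SKUs the three CSP price lists hold.
# The per-dimension counts are the approximate figures from the slide, not exact values.
dimensions = {
    "csp": 3,        # AWS, Azure, GCP
    "service": 200,  # compute, storage, networking, ...
    "type": 50,      # instance families, disk classes, ...
    "tier": 3,       # usage tiers
    "region": 50,    # us-east2, ap-south1, ...
}

total_skus = 1
for count in dimensions.values():
    total_skus *= count

print(f"~{total_skus:,} SKUs")  # ~4,500,000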

5 of 31

The pipeline is used for EDB Postgres AI on CSPs (cloud service providers)

CSPs' SKUs:
  • object storage
  • data transfer
  • networking

CSPs' SKUs:
  • compute nodes
  • volume storage, snapshot backup
  • networking
  • load balancer
  • dedicated fee

AWS, Azure or Google

6 of 31

… as I said, it is complicated - so it is calculated by the microservice

Total of SKU costs as a price estimate
The single SKU's cost calculation is in pkg/cost, which returns the cost based on the service parameters for the given CSP and region.

The factors are stored in a formula_factor table which can be dynamically tweaked to meet the cost estimation calibration goal.

Discounts per customer
Another microservice returns all the products and potentially includes per-customer discounts from our customer private offer service. The clusters are checked, and if the related customer has any discounts they are applied to the price estimate.
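The cost calculation itself lives in a Go package (pkg/cost); below is a hedged Python sketch of the shape described above, summing per-SKU costs scaled by formula factors and then applying any per-customer discount. All names (FORMULA_FACTORS, sku_cost, price_estimate) and numbers are illustrative assumptions, not the product's actual API.

# Hedged sketch of the price estimate shape described above; the real logic is in
# the Go pkg/cost package. Names and factor values here are illustrative only.
FORMULA_FACTORS = {"compute": 1.05, "storage": 1.0, "network": 1.1}  # cf. formula_factor table

def sku_cost(sku: dict) -> float:
    """Cost of one SKU = unit price x quantity x calibration factor for its service."""
    factor = FORMULA_FACTORS.get(sku["service"], 1.0)
    return sku["unit_price"] * sku["quantity"] * factor

def price_estimate(skus: list[dict], customer_discount: float = 0.0) -> float:
    """Total of SKU costs, with any per-customer private offer discount applied."""
    total = sum(sku_cost(s) for s in skus)
    return total * (1.0 - customer_discount)

# e.g. a month of compute plus 500 GB of storage, with a 15% private offer discount
print(price_estimate(
    [{"service": "compute", "unit_price": 0.12, "quantity": 730},
     {"service": "storage", "unit_price": 0.10, "quantity": 500}],
    customer_discount=0.15,
))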

7 of 31

The consumer of the data pipeline artifact

[Architecture diagram: CSP SKU sources -> scrapers (run via Airflow on monthly/weekly/daily schedules) -> temp DBs -> JSON/SQL dumps in the CSP blob store; a synch cron service pulls the dumps into the price microservice DB, which is served over a gRPC API and an Istio REST API to the discount microservice and the cloud Web UI in the customer browser (the application pipeline).]

8 of 31

What we use our data pipeline artifact for

9 of 31

Part 2: Pipeline Architecture

Reasons, rules and redesigns


10 of 31

The tools and data sources for cloud prices

[Diagram: Cloud Service Providers publish SKU price data that changes monthly, either as a bulk API of JSON static dump files or as a CSV of ~5m rows, consumed by your application.]

11 of 31

Assumptions behind the old live ELT architecture

  • Extract = use existing aggregate extracted source data - don't reinvent the wheel
  • Load = EDB Cloud Postgres cluster - eat our own dog food
  • Transform = Bash / SQL scripts - DBA standard
  • Workflow = GitHub workflow - already in use for CI/CD of microservices
  • Artefacts = pg_dumps in blob store - Postgres dump standard in cloud file store

[Diagram: long-lived deployment of a 6 GB Cloud Postgres cluster; transform runs in the cluster, then pg_dump.]

12 of 31

Why did we decide to rewrite?

Infracost.io stopped updating our core source data
.. it moved to paid, customer-only access

Other reasons

  1. We had uniquely built a data pipeline in our CI/CD system
  2. Remote-only execution of Bash / SQL is not ideal for dev, test and visibility
  3. We now had a data science team and a standard platform, Airflow on Astronomer SaaS
  4. One of the Gopher & Rustacean account management team was also a Pythonista
  5. We had never fully finished the pipeline to automate the update step to the microservice

13 of 31

Changes for the new pipeline - ETL per step

  • Extract = scrape only the data we need from the original source bulk APIs
  • Load = temporary embedded Postgres with a step-duration lifespan
  • Transform = Python transform as extracted
  • Workflow = Airflow
  • Artefacts = pg_dumps in blob store - Postgres dump standard in cloud file store

[Diagram: scrapers -> temp DBs in a step-lifespan, local-socket embedded Postgres -> transform -> pg_dump.]
14 of 31

Old ELT -> New ETL per step architecture

[Diagram, old vs new:
Old - infracost.io CSV (5m rows) -> GitHub workflow -> Cloud Postgres source SKUs -> transform (2 db schemas) -> Cloud Postgres prices -> manual push -> price microservice DB.
New - CSP SKU sources (JSON) -> Airflow scrapers -> temp DBs -> transform -> dump db -> SQL dumps in CSP blob store -> synch cron service (cron pull) -> price microservice DB -> Web UI.]

15 of 31

What design principles can we derive from our case?

1. Data currency governs architecture. Price changes are monthly - very low frequency is cheap.

2. Choose the best data sources - research all that are available: how homogeneous are they, who owns them, what alternatives exist?

3. Extract the minimum - downloading everything and filtering afterwards is costly. Filter on extraction if possible.

4. Persistence. How are we persisting the ETL output data? How suited is it to the data consumers?

5. Cheap long-term storage - snapshots don't need remote live cloud db cluster dependencies.

6. Avoid step dependencies or shared data, to increase parallelism and robustness. Allow separate component build and test outside of the pipeline if possible.

7. Push or pull the result. Determine this based on the nature of the data consumers. Pull is more maintainable in most cases.

8. Retain the minimum. Extensive backup and retention is not required for 3rd party data. Wipe dumps older than treble the frequency with every fresh one created (see the sketch below).
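A minimal sketch of principle 8, along the lines of the pipeline's delold command shown later: delete artefact dumps older than three times the refresh frequency. It assumes the azure-storage-blob SDK and a connection string; the container name and frequency are illustrative.

# Sketch of principle 8: wipe dumps older than treble the refresh frequency.
# Assumes the azure-storage-blob SDK; container name and frequency are illustrative.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import ContainerClient

def delete_old_dumps(conn_str: str, container: str = "price-dumps", frequency_days: int = 30) -> None:
    """Delete blobs last modified more than 3 x frequency_days ago."""
    client = ContainerClient.from_connection_string(conn_str, container)
    cutoff = datetime.now(timezone.utc) - timedelta(days=3 * frequency_days)
    for blob in client.list_blobs():
        if blob.last_modified < cutoff:
            client.delete_blob(blob.name)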

16 of 31

Part 3: Python at last!

The new pipeline code


17 of 31

Python section overview

  1. The Python base class for the scrapers of each CSP
  2. Use of Golang to wrap up Embedded Postgres
  3. Creation of a Click CLI Python package to wrap pipeline step functionality
  4. An example of the use of the Click package within an Airflow step
  5. Use of Soda to verify the data we are scraping


18 of 31

Python scraper base class - price-scraper pkg

from abc import ABC


class BaseScraper(ABC):
    """Abstract base class for Scrapers"""

    batch = 500
    conn = None
    unit_map = {"FAIL": ""}
    root_url = ""

    def map_units(self, entry, key):
        """To standardize naming of units between CSPs"""
        return self.unit_map.get(entry.get(key, "FAIL"), entry[key])

    def scrape_sku(self):
        """Scrapes prices from CSP bulk JSON API - uses CSP specific methods"""
        pass

    def bulk_insert_rows(self, rows):
        """Bulk insert batches of rows - Note that Psycopg >= 3.1 uses pipeline mode"""
        query = """INSERT INTO api_price.infra_price VALUES
            (%(sku_id)s, %(cloud_provider)s, %(region)s, %(sku_name)s, %(end_usage_amount)s)"""
        with self.conn.cursor() as cur:
            cur.executemany(query, rows)
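A hedged sketch of how a CSP-specific scraper might subclass BaseScraper. The class, unit map and field mapping below are illustrative assumptions rather than the actual price-scraper code, though the Azure retail prices endpoint is the public one.

# Illustrative subclass (not the real price-scraper code): fetch one page of Azure
# retail prices and batch-insert them via the inherited bulk_insert_rows().
import requests

class AzureScraper(BaseScraper):
    root_url = "https://prices.azure.com/api/retail/prices"  # public Azure retail prices API
    unit_map = {"1 Hour": "hour", "1 GB/Month": "gb-month", "FAIL": ""}

    def scrape_sku(self):
        rows = []
        items = requests.get(self.root_url, timeout=60).json().get("Items", [])
        for entry in items:
            rows.append({
                "sku_id": entry["skuId"],
                "cloud_provider": "azure",
                "region": entry["armRegionName"],
                "sku_name": entry["skuName"],
                "end_usage_amount": "Inf",
            })
            if len(rows) >= self.batch:
                self.bulk_insert_rows(rows)
                rows = []
        if rows:
            self.bulk_insert_rows(rows)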

19 of 31

Embedded Postgres Golang wrapper

package main

import (
	"fmt"
	"os"

	epg "github.com/fergusstrange/embedded-postgres"
)

// Binary callable by Python: creates an embedded Postgres DB, starts it and loads the schema
func main() {
	args := os.Args
	if len(args) != 3 {
		fmt.Println("Please run as > pgembed port directory")
		fmt.Println("> pgembed 5432 /tmp/epg-run-dir")
		os.Exit(1)
	}
	// config (elided on the slide) is built from the port and runtime directory args
	database := epg.NewDatabase(config)
	err := database.Start()
	...
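The Go binary is then callable from Python. Below is a minimal sketch of how the Python package might launch it for a step-lifespan database; the helper name and the use of subprocess/tempfile are assumptions, only the pgembed CLI shape comes from the usage message above.

# Hedged sketch: launching the pgembed binary from Python on a free port with a
# throwaway runtime directory. Helper name and approach are assumptions.
import socket
import subprocess
import tempfile

def start_embedded_pg(binary: str = "pgembed") -> tuple[subprocess.Popen, int, str]:
    """Run pgembed on a free port; return the process, port and runtime directory."""
    with socket.socket() as s:   # grab a free port
        s.bind(("", 0))
        port = s.getsockname()[1]
    rundir = tempfile.mkdtemp(prefix="epg-")
    proc = subprocess.Popen([binary, str(port), rundir])  # > pgembed 5432 /tmp/epg-run-dir
    return proc, port, rundir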

20 of 31

Separate price-scraper Click-based Python package

> pscraper -h
Usage: pscraper [OPTIONS] COMMAND [ARGS]...

  price-scraper: python web scraping of CSP prices for api-price

Options:
  -h, --help  Show this message and exit.

Commands:
  awsscrape    Scrape prices from AWS
  azurescrape  Scrape prices from Azure
  delold       Delete old Azure blob storage files. All files more than 12 weeks old are deleted
  gcpscrape    Scrape prices from GCP - set env GCP_BILLING_KEY
  pgdump       Dump postgres file and upload to Azure blob storage - set env AZURE_STORAGE_KEY
               > pscraper pgdump --port 5377 --file price.sql
  pgembed      Run up local embedded Postgres on a random port for tests
               > pscraper pgembed
  pgload       Load schema to local embedded postgres for testing
               > pscraper pgload --port 5377 --file price.sql
  version      Display the price-scraper version.
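A hedged sketch of how the pscraper Click group behind that help output might be declared; the actual price-scraper module layout and options are not shown on the slides, so only pgdump is sketched here.

# Illustrative Click group matching the help output above; not the real package code.
import click

@click.group(context_settings={"help_option_names": ["-h", "--help"]})
def cli():
    """price-scraper: python web scraping of CSP prices for api-price"""

@cli.command()
@click.option("--port", default=5432, help="Port of the embedded Postgres to dump")
@click.option("--file", "filename", default="price.sql", help="Dump file to create and upload")
def pgdump(port, filename):
    """Dump postgres file and upload to Azure blob storage - set env AZURE_STORAGE_KEY"""
    click.echo(f"dumping embedded Postgres on port {port} to {filename}")

if __name__ == "__main__":
    cli()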

21 of 31

Scrape Azure DAG

def scrape_azure():
    """Scrape Azure via API public json web pages"""
    from price_scraper.commands import azurescrape, pgembed

    folder, port = setup_pg_db(PORT + 3)
    error = azurescrape.run_azure_scrape(port, HOST)
    if not error:
        error = csp_dump(port, "azure")
    if error:
        pgembed.teardown_pg_embed(folder)
        notify_slack("azure", error)
        raise AirflowFailException(error)

    data_test = SodaScanOperator(
        dag=dag,
        task_id="data_test",
        data_sources=[
            {
                "data_source_name": "embedpg",
                "soda_config_path": "price-scraper/soda/configuration_azure.yml",
            }
        ],
        soda_cl_path="price-scraper/soda/price_azure_checks.yml",
    )
    data_test.execute(dict())
    pgembed.teardown_pg_embed(folder)
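The slide only shows the task callable. Here is a minimal sketch, under assumed names and schedule, of how scrape_azure might be wired into an Airflow DAG with a PythonOperator.

# Hedged sketch of the DAG wiring (dag_id and schedule are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="price_scrape_azure",
    start_date=datetime(2025, 1, 1),
    schedule="@weekly",   # prices only change monthly, so weekly is ample
    catchup=False,
) as dag:
    scrape_azure_task = PythonOperator(
        task_id="scrape_azure",
        python_callable=scrape_azure,
    )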

22 of 31

soda_cl_path="price-scraper/soda/price_azure_checks.yml"

# Prices
- azure_prices_vm between 150 and 600:
    azure_prices_vm query: |
      select max(price) from infra_price
      where cloud_provider = 'azure' and service = 'Virtual Machines'

- azure_prices_storage between 5000 and 20000:
    azure_prices_storage query: |
      select max(price) from infra_price
      where cloud_provider = 'azure' and service = 'Storage'

- azure_prices_network between 0.05 and 0.5:
    azure_prices_network query: |
      select max(price) from infra_price
      where cloud_provider = 'azure' and service = 'Bandwidth'

# Check tiered rates
- azure_rates >= 10:
    azure_rates query: |
      select count(*) from (
        select distinct start_usage_amount, end_usage_amount from infra_price
        where cloud_provider = 'azure'
        and not (start_usage_amount = '0' and end_usage_amount = 'Inf')) as v

- azure_rate_rows >= 3000:
    azure_rate_rows query: |
      select sum(v.count) from (
        select distinct start_usage_amount, end_usage_amount, count(*) from infra_price
        where cloud_provider = 'azure'
        and not (start_usage_amount = '0' and end_usage_amount = 'Inf')
        group by start_usage_amount, end_usage_amount) v
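For developing and testing the checks above outside Airflow, soda-core also offers a programmatic Scan API; a minimal sketch follows, assuming the same data source name and file paths as the SodaScanOperator call on the previous slide.

# Hedged sketch: running the SodaCL checks above locally with soda-core's Scan API,
# e.g. against an embedded Postgres during development.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("embedpg")
scan.add_configuration_yaml_file("price-scraper/soda/configuration_azure.yml")
scan.add_sodacl_yaml_file("price-scraper/soda/price_azure_checks.yml")
scan.execute()
print(scan.get_logs_text())
scan.assert_no_checks_fail()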

23 of 31

Part 4: Pipeline Deployment & Cost

The many different costs of cloud


24 of 31

Vendor pricing strategy $$$

Free trial - time or resource limited, to push you towards getting a T-shirt (eg our infracost.io case)
T-shirt sizing - push larger T-shirts for lower prices per unit
Private Offers - after XL, vendors use these; haggle a deal - they encourage unlimited usage

Gotchas - the devil is in the details - hence the need for usage cost telemetry

SaaS GitHub - free for 200 public open source repositories
  - half a million $/yr for 1000 private git repos including full git workflow cloud usage.
  T-shirt git workflow Azure usage may be charged at 3 x the SKU cost of direct Azure usage.

SaaS Astronomer - may charge $50k/yr for 20 modest DAGs if they run infrequently, eg. daily
  - could charge more than that for one DAG, if it is very large scale and high frequency.
  Like most SaaS == cloud cost plus a percentage markup

AWS RDS for Postgres 12 - the cost of a medium instance recently trebled as standard support expired ($0.1 vCPU/hr plus $0.2 vCPU/hr for extended support)

25 of 31

Dev/Test/Staging pipeline cloud cost

[Figure: Production cloud / Dev/Test/Failover cloud / All other metered electricity (image: vecteezy.com)]

IaC == 'just spin something up'

Ireland is a main location for the CSPs' European cloud data centres, thanks to tax breaks and perhaps to being English speaking. Cloud consumes a quarter of Eire's metered electricity.

26 of 31

What is the cost of our different pipeline architectures?

The Infracost data, washed, has the same final size as the new scrape = 70 MB = only 200k rows.

The artefact store is the same, so same cost (Azure blobs), except the new pipeline's upload auto-deletes old dumps.

Charges for use of astronomer.io vs github.com?
Both are private offers, possibly even running on the same CSP's temporary cloud compute runners and storage.

It is difficult to say if our much bigger deal with GitHub is cheaper - possibly, given how much use we put it to - but if under-utilized it becomes more expensive. If there are no cost gotchas, assume the more featured Airflow is slightly pricier.

Extra costs for the old ELT pipeline

  1. Running a constantly current database of the 5m (x20) rows, at twice the row length, for loading the Infracost.io CSV dump. So 40 x 70 MB -> 3 GB, x2 for a cluster. For hosted Postgres ~ $150/month.
  2. Loading the CSV file takes 30 minutes even with indexes disabled, which is an extra pipeline execution cost. ~$20/month.
  3. Execution time was about 30% longer, with more whole-pipeline reruns rather than in-step retries.

27 of 31

Cost control strategy

Rule of 40% profit margin for a SaaS product. A core element is cost control:

  1. Price transparency for anybody (sales, engineering) who can spin up infrastructure
  2. Automated short lifespans for non-prod envs and resources, e.g. days as the default
  3. Regular infrastructure removal based on usage cost telemetry
  4. Service removal based on environment, e.g. turn off certain tools on non-prod infra (e.g. MS Sentinel, $2k a day)
  5. Use on-demand spin-up where possible
  6. Quarterly check that SaaS services are required - for all significant-cost ones, and when private offers expire. Consider available alternatives, open source / in-house provision, e.g. LaunchDarkly -> Flagd

28 of 31

Conclusion and summing up

  • Outlined the requirements for this data pipeline use case and the product that consumes it.
  • Detailed the old and new pipeline architectures, why we moved, and what principles can be derived from our redesign.
  • Some Python: a few points about the new pipeline code.
  • Pricing strategies and the cost of cloud in practice, and cost control to handle its odd rules and gotchas.


29 of 31

The pipeline running in Astronomer

30 of 31

Ed Crewe

EDB Bristol, UK

ed.crewe@enterprisedb.com

https://edcrewe.blogspot.com/ has a post about this talk with some more details on testing

https://enterprisedb.com/

Thanks to Karthik Dulam & Guang Yi Xu, EDB Principal Engineers, for their input

31 of 31

The pipeline running in Astronomer