Cutting the Price of Scraping Cloud Costs
Ed Crewe, Developer, EDB Postgres AI - PyCon LT 2025 Data Day
©EDB 2024 — ALL RIGHTS RESERVED.
Introduction
Ed Crewe - developer on EDB's Postgres AI product, from Bristol, UK
EDB Postgres AI and Data pipeline ingredients
Part 1: Starting at the End
What our pipeline's data is used for
How does cloud pricing work?
It is complicated! ... and the price lists are not small. The full price lists for the 3 main CSPs - Amazon, Google and Azure - together contain almost 5 million prices, known as SKUs (Stock Keeping Unit prices).
csp x service x type x tier x region: 3 x 200 x 50 x 3 x 50 = 4.5 million
For example, hard disk storage should be simple, right?
The pipeline is used for EDB Postgres AI on the CSPs (cloud service providers).
CSPs' SKUs: object storage, data transfer, networking
CSPs' SKUs: compute nodes, volume storage, snapshot backup, networking, load balancer, dedicated fee
AWS, Azure or Google
... as I said, it is complicated - the calculation is done by the microservice.
Total of SKU costs as a price estimate. The single SKU's cost calculation is in pkg/cost, which returns the cost based on the service parameters in the given CSP and region.
The factors are stored in a formula_factor table which can be dynamically tweaked to meet the cost estimation calibration goal.
Discounts per customer. Another microservice returns all the products and potentially includes per-customer discounts from our customer private offer service. The clusters are checked and, if the related customer has any discounts, they are applied to the price estimate.
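As a rough illustration only (the function, table and SKU names below are hypothetical, not the real pkg/cost or discount service code), the estimate is essentially a factor-weighted sum of SKU costs with any per-customer discount applied afterwards:

# Hypothetical sketch of the price estimate calculation described above.
# sku_prices: list of (sku_id, unit_price, quantity); factors: formula_factor values per SKU.
def price_estimate(sku_prices, factors, discount_pct=0.0):
    """Sum factor-weighted SKU costs, then apply any per-customer discount."""
    total = sum(price * qty * factors.get(sku_id, 1.0)
                for sku_id, price, qty in sku_prices)
    return total * (1 - discount_pct / 100)

# e.g. two SKUs, a calibration factor on compute, and a 10% private-offer discount
estimate = price_estimate(
    [("compute-node", 0.25, 730), ("volume-storage", 0.08, 100)],
    {"compute-node": 1.1},
    discount_pct=10,
)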
The consumer of the data pipeline artifact
[Architecture diagram - components: CSP SKU sources, web/JSON scrapers, temp DBs, Airflow application pipeline, SQL dumps to the CSP blob store on day / week / month schedules, synch cron service, DB, price microservice, discount microservice, DB API, gRPC API, Istio REST API, Web UI, cloud UI, customer browser.]
What we use our data pipeline artifact for
Part 2: Pipeline Architecture
Reasons, rules and redesigns
The tools and data sources for cloud prices
[Diagram: the Cloud Service Providers publish SKU price data that changes monthly, either via bulk APIs (JSON static dump files) or as CSV (5 m rows), for use by your application.]
Assumptions behind the old live ELT architecture
Extract = Use existing aggregate extracted source data - don't reinvent the wheel
Load = EDB Cloud Postgres cluster - eat our own dog food
Transform = Bash / SQL scripts - the DBA standard
Workflow = GitHub workflow - already in use for CI/CD of microservices
Artefacts = pg_dumps in blob store - the Postgres dump standard in cloud file store
[Diagram: pg_dump and transform against a long-lived deployment of a 6 GB Cloud Postgres cluster.]
Why did we decide to rewrite?
Infracost.io stopped updating our core source data - it moved to paid, customer-only access. There were other reasons too.
Changes for the new pipeline - ETL per step
Extract = Scrape only the data we need from the original source bulk APIs
Load = Temporary embedded Postgres with a step-duration lifespan
Transform = Python, transforming as extracted
Workflow = Airflow
Artefacts = pg_dumps in blob store - the Postgres dump standard in cloud file store
[Diagram: scrapers -> temp DBs (step-lifespan, local-socket embedded Postgres) -> transform -> pg_dump. A sketch of one such step follows.]
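A minimal sketch of the ETL-per-step shape, assuming hypothetical helper names (start_embedded_pg, scrape_csp, dump_to_blob_store and stop_embedded_pg are placeholders, not the real package functions):

import shutil
import tempfile

# Hypothetical outline of one pipeline step: each step owns a throwaway embedded Postgres.
def etl_step(csp: str, port: int) -> None:
    run_dir = tempfile.mkdtemp(prefix=f"epg-{csp}-")
    start_embedded_pg(port, run_dir)          # spin up step-lifespan embedded Postgres
    try:
        scrape_csp(csp, port)                 # extract + transform rows as they are scraped
        dump_to_blob_store(csp, port)         # pg_dump the result and upload the artefact
    finally:
        stop_embedded_pg(run_dir)             # teardown - nothing long-lived remains
        shutil.rmtree(run_dir, ignore_errors=True)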
Old ELT -> New ETL per step architecture
[Diagram, old -> new. Old: a GitHub workflow pulls the infracost.io CSV (5 m rows) into a Cloud Postgres source SKUs database, transforms across 2 db schemas into a Cloud Postgres prices database, and dumps are manually pushed to the CSP blob store. New: Airflow scrapers pull JSON from the CSP SKU sources into temp DBs, transform and dump the db as SQL to the CSP blob store, from which a synch cron service cron-pulls into the DB behind the Web UI.]
What design principles can we derive from our case?
1. Data currency governs architecture. Price changes are monthly - very low frequency is cheap.
2. Choose the best data sources - research all that are available: how homogeneous are they, who owns them, are there alternatives?
3. Extract the minimum - downloading everything and filtering afterwards is costly. Filter on extraction if possible.
4. Persistence. How are we persisting the ETL output data? How suited is it to the data consumers?
5. Cheap long-term storage - snapshots don't need remote live cloud db cluster dependencies.
6. Avoid step dependencies or shared data, to increase parallelism and robustness. Allow separate component build and test outside of the pipeline if possible.
7. Push or pull the result? Determine this based on the nature of the data consumers. Pull is more maintainable in most cases.
8. Retain the minimum. Extensive backup and retention is not required for 3rd party data. Wipe dumps older than treble the frequency with every fresh one created (see the sketch below).
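A minimal sketch of principle 8, assuming the dumps live in an Azure blob container and a roughly monthly cadence (so a 12-week retention window); the connection string and container name are placeholders:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import ContainerClient

def wipe_old_dumps(conn_str: str, container: str, weeks: int = 12) -> None:
    """Delete price dump blobs older than treble the refresh frequency."""
    cutoff = datetime.now(timezone.utc) - timedelta(weeks=weeks)
    client = ContainerClient.from_connection_string(conn_str, container)
    for blob in client.list_blobs():
        if blob.last_modified < cutoff:
            client.delete_blob(blob.name)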
Part 3: Python at last!
The new pipeline code
Python section overview
Python scraper base class - price-scraper pkg
from abc import ABC


class BaseScraper(ABC):
    """Abstract base class for Scrapers"""

    batch = 500
    conn = None
    unit_map = {"FAIL": ""}
    root_url = ""

    def map_units(self, entry, key):
        """To standardize naming of units between CSPs"""
        return self.unit_map.get(entry.get(key, "FAIL"), entry[key])

    def scrape_sku(self):
        """Scrapes prices from CSP bulk JSON API - uses CSP specific methods"""
        pass

    def bulk_insert_rows(self, rows):
        """Bulk insert batches of rows - note that Psycopg >= 3.1 uses pipeline mode"""
        query = """INSERT INTO api_price.infra_price VALUES
        (%(sku_id)s, %(cloud_provider)s, %(region)s, … %(sku_name)s, %(end_usage_amount)s)"""
        with self.conn.cursor() as cur:
            cur.executemany(query, rows)
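As a hedged illustration of how the base class is used (the subclass below is a sketch, not the real price-scraper code - the endpoint, field names and unit_map values are assumptions):

import requests


class AzureScraper(BaseScraper):
    """Hypothetical CSP-specific scraper built on BaseScraper."""

    root_url = "https://prices.azure.com/api/retail/prices"  # assumed bulk JSON endpoint
    unit_map = {"1 Hour": "hour", "FAIL": ""}  # illustrative unit standardisation

    def scrape_sku(self):
        """Page through the bulk JSON API, batching rows for insert."""
        url, rows = self.root_url, []
        while url:
            payload = requests.get(url, timeout=60).json()
            for entry in payload.get("Items", []):
                rows.append({
                    "sku_id": entry.get("skuId"),
                    "cloud_provider": "azure",
                    "region": entry.get("armRegionName"),
                    "sku_name": entry.get("skuName"),
                    "end_usage_amount": "Inf",
                    # ... remaining columns elided, matching the INSERT above
                })
                if len(rows) >= self.batch:
                    self.bulk_insert_rows(rows)
                    rows = []
            url = payload.get("NextPageLink")
        if rows:
            self.bulk_insert_rows(rows)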
Embedded Postgres Golang wrapper
package main

import (
	"fmt"
	"os"
	"strconv"

	epg "github.com/fergusstrange/embedded-postgres"
)

// Binary callable by Python: creates an embedded Postgres DB, starts it and loads the schema
func main() {
	args := os.Args
	if len(args) != 3 {
		fmt.Println("Please run as > pgembed port directory")
		fmt.Println("> pgembed 5432 /tmp/epg-run-dir")
		os.Exit(1)
	}
	// Build the config from the CLI args (the full config is elided in this snippet)
	port, _ := strconv.ParseUint(args[1], 10, 32)
	config := epg.DefaultConfig().Port(uint32(port)).RuntimePath(args[2])
	database := epg.NewDatabase(config)
	err := database.Start()
	...
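The wrapper is callable from Python; a minimal sketch of launching it (the binary path, port and run directory follow the usage message above, and run_scrape_and_dump is a hypothetical placeholder):

import subprocess

# Launch the pgembed wrapper for the lifespan of one pipeline step
proc = subprocess.Popen(["pgembed", "5432", "/tmp/epg-run-dir"])
try:
    run_scrape_and_dump()   # hypothetical placeholder for the step's work
finally:
    proc.terminate()        # tear the embedded Postgres down again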
Separate price-scraper Click-based Python package
> pscraper -h
Usage: pscraper [OPTIONS] COMMAND [ARGS]...

  price-scraper: python web scraping of CSP prices for api-price

Options:
  -h, --help  Show this message and exit.

Commands:
  awsscrape    Scrape prices from AWS
  azurescrape  Scrape prices from Azure
  delold       Delete old Azure blob storage files - all files more than 12 weeks old are deleted
  gcpscrape    Scrape prices from GCP - set env GCP_BILLING_KEY
  pgdump       Dump postgres file and upload to Azure blob storage - set env AZURE_STORAGE_KEY
               > pscraper pgdump --port 5377 --file price.sql
  pgembed      Run up local embedded Postgres on a random port for tests
               > pscraper pgembed
  pgload       Load schema to local embedded postgres for testing
               > pscraper pgload --port 5377 --file price.sql
  version      Display the price-scraper version.
https://edcrewe.blogspot.com/ - see the blog post for more details on testing
Scrape Azure DAG
from airflow.exceptions import AirflowFailException
# SodaScanOperator and the helpers (setup_pg_db, csp_dump, notify_slack, PORT, HOST, dag)
# are imported / defined elsewhere in the DAG module.


def scrape_azure():
    """Scrape Azure via its public JSON API web pages"""
    from price_scraper.commands import azurescrape, pgembed

    # Spin up a step-lifespan embedded Postgres and scrape into it
    folder, port = setup_pg_db(PORT + 3)
    error = azurescrape.run_azure_scrape(port, HOST)
    if not error:
        error = csp_dump(port, "azure")
    if error:
        pgembed.teardown_pg_embed(folder)
        notify_slack("azure", error)
        raise AirflowFailException(error)

    # Data quality check of the scraped prices with Soda before teardown
    data_test = SodaScanOperator(
        dag=dag,
        task_id="data_test",
        data_sources=[
            {
                "data_source_name": "embedpg",
                "soda_config_path": "price-scraper/soda/configuration_azure.yml",
            }
        ],
        soda_cl_path="price-scraper/soda/price_azure_checks.yml",
    )
    data_test.execute(dict())
    pgembed.teardown_pg_embed(folder)
soda_cl_path="price-scraper/soda/price_azure_checks.yml"

checks for infra_price:
  # Prices
  - azure_prices_vm between 150 and 600:
      azure_prices_vm query: |
        select max(price) from infra_price
        where cloud_provider = 'azure' and service = 'Virtual Machines'
  - azure_prices_storage between 5000 and 20000:
      azure_prices_storage query: |
        select max(price) from infra_price
        where cloud_provider = 'azure' and service = 'Storage'
  - azure_prices_network between 0.05 and 0.5:
      azure_prices_network query: |
        select max(price) from infra_price
        where cloud_provider = 'azure' and service = 'Bandwidth'

  # Check tiered rates
  - azure_rates >= 10:
      azure_rates query: |
        select count(*) from (
          select distinct start_usage_amount, end_usage_amount
          from infra_price
          where cloud_provider = 'azure'
            and not (start_usage_amount = '0' and end_usage_amount = 'Inf')) as v
  - azure_rate_rows >= 3000:
      azure_rate_rows query: |
        select sum(v.count) from (
          select distinct start_usage_amount, end_usage_amount, count(*)
          from infra_price
          where cloud_provider = 'azure'
            and not (start_usage_amount = '0' and end_usage_amount = 'Inf')
          group by start_usage_amount, end_usage_amount) v
Part 4: Pipeline Deployment & Cost
The many different costs of cloud
Vendor pricing strategy $$$
Free trial - time or resource limited, to push you into getting a T-shirt (e.g. our infracost.io case)
T-shirt sizing - larger T-shirts are pushed with lower prices per unit
Private offers - after XL, vendors use these; haggle a deal - they encourage unlimited usage
Gotchas - the devil is in the details - hence the need for usage cost telemetry
SaaS GitHub - free for 200 public open source repositories; half a million $/yr for 1000 private git repos including full git workflow cloud usage. T-shirt git workflow Azure usage may be charged at 3x the SKU cost of direct Azure usage.
SaaS Astronomer - may charge $50k/yr for 20 modest DAGs if they run infrequently, e.g. daily, and could charge more than that for a single DAG if it is very large scale and high frequency. Like most SaaS == cloud cost plus a percentage markup.
AWS RDS for Postgres 12 - the cost of a medium instance recently trebled as standard support expired ($0.1 vCPU/hr plus $0.2 vCPU/hr extended support).
Dev/Test/Staging pipeline cloud cost
[Chart: production cloud, dev/test/failover cloud, all other metered electricity. Image: https://www.vecteezy.com/]
IaC == 'just spin something up'
Ireland is a main location for the CSPs' European cloud data centres, thanks to tax breaks and perhaps being English speaking. Cloud consumes a quarter of Eire's metered electricity.
What is the cost of our different pipeline architectures?
The Infracost data, washed down to its final size, is the same as the new scrape: 70 MB, only 200k rows.
The artefact store is the same, so the same cost - Azure blobs - except the new pipeline's upload auto-deletes old dumps.
Charges for use of astronomer.io vs github.com? Both are private offers for running, possibly on the same CSP, temporary cloud compute runners and storage.
It is difficult to say if our much bigger deal with GitHub is cheaper - possibly, given how much use we put it to - but if under-utilized it becomes more expensive. If there are no cost gotchas, assume the more featured Airflow is slightly pricier.
Extra costs for the old ELT pipeline
Cost control strategy
The Rule of 40% profit margin for a SaaS product - a core element of meeting it is cost control.
Conclusion and summing up
The pipeline running in Astronomer
Ed Crewe
EDB Bristol, UK
ed.crewe@enterprisedb.com
https://edcrewe.blogspot.com/ has a post about this talk with some more details on testing
Thanks to Karthik Dulam & Guang Yi Xu, EDB Principal Engineers, for their input.
The pipeline running in Astronomer