CSV Processing at Scale
Suraj Nath
Infrastructure Engineer, Clarisights
About Me
About Clarisights
Ad-Hoc Data Analytics, Enrichment and Insights Platform for Marketing Teams.
Why CSV?
CSVs @ Clarisights
CSVs @ Clarisights
CSVs @ Clarisights
Credits: https://xkcd.com/927/
RFC 4180...
And obviously there is an XKCD for it...
Aggregation on CSV
Also known as sum/agg. stats problem(internally)
Raw Data
Date | City | Orders | Revenue |
2019-12-12 | Bengaluru | 10 | 100 INR |
2019-12-12 | Bengaluru | 20 | 500 INR |
2019-12-13 | Bengaluru | 30 | 800 INR |
Raw Data
Date | City | Orders | Revenue |
2019-12-12 | Bengaluru | 10 | 100 INR |
2019-12-12 | Bengaluru | 20 | 500 INR |
2019-12-13 | Bengaluru | 30 | 800 INR |
Raw Data
Date | City | Orders | Revenue |
2019-12-12 | Bengaluru | 10 | 100 INR |
2019-12-12 | Bengaluru | 20 | 500 INR |
2019-12-13 | Bengaluru | 30 | 800 INR |
Aggregated Data
Date | City | Orders | Revenue |
2019-12-12 | Bengaluru | 30 | 600 INR |
2019-12-13 | Bengaluru | 30 | 800 INR |
Run an Aggregation Function on Metrics with GROUP BY on Dimensions
Now letβs try doing it in Millions of Rows...
Problems...
Problems...
Problems...
Problems...
Pretty Much this...
What Ifβ¦.
We can run SQL on CSV files?
Nothing New, this has been done before
Nothing New, this has been done before
Nothing New, this has been done before
Downsides...
Downsides...
Downsides...
So we started searchingβ¦.
Introducing...
Apache Drill
What is Apache Drill?
What is Apache Drill?
Why Drill?
Can query complex, semi-structured data on-site
(no transformation required)
Why Drill?
Full SQL Support
Drill supports the standard SQL:2003 syntax.
Why Drill?
Support for multiple data sources
File System, CSV, parquet, Hbase, Hive etc.
Why Drill?
Support for User-Defined Functions (UDFs)
Why Drill?
Distributed and High performance,
Used at scale by MapR and others...
Why Drill?
An Apache Project
Maintained and has community, so itβs not going anywhere soon.
How Apache Drill Works...
π€
Source: https://drill.apache.org/docs/drill-query-execution/
Source: https://drill.apache.org/docs/drill-query-execution/
Source: https://drill.apache.org/docs/drill-query-execution/
Some Downsides...
Introducing...
Rig
What is Rig?
Proxy to abstract out Apache Drill and expose fast and easy to use Interface to Load and Query CSV files
How Rig Works?
π€
Client
LoadData RPC (With CSV)
Rig
Apache Drill
Client
LoadData RPC (With CSV)
CSV to Parquet Table
Rig
Apache Drill
Client
LoadData RPC (With CSV)
CSV to Parquet Table
Parquet Table Path
Rig
Apache Drill
Client
LoadData RPC (With CSV)
CSV to Parquet Table
Parquet Table Path
Dataset ID (UUID)
Rig
Apache Drill
LoadData RPC (With CSV)
CSV to Parquet Table
Parquet Table Path
Dataset ID (UUID)
Query RPC (With Dataset ID)
Client
Rig
Apache Drill
LoadData RPC (With CSV)
CSV to Parquet Table
Parquet Table Path
Dataset ID (UUID)
Query RPC (With Dataset ID)
Query Parquet Table
Client
Rig
Apache Drill
LoadData RPC (With CSV)
CSV to Parquet Table
Parquet Table Path
Dataset ID (UUID)
Query RPC (With Dataset ID)
Query Parquet Table
Return Results
Return Results
Client
Rig
Apache Drill
Results...
Before Rig
After Rig
time taken to process, before and after Rig
Problems/Issues faced
Learnings
Questions?
We are Hiring...
Come talk to me (suraj.dev), or
Visit clarisights.com/careers