1 of 63

CSV Processing at Scale

Suraj Nath

Infrastructure Engineer, Clarisights

2 of 63

About Me

Web: suraj.dev

@electron0zero on Twitter and the rest of the Interwebs

3 of 63

About Clarisights

Ad-Hoc Data Analytics, Enrichment and Insights Platform for Marketing Teams.

4 of 63

Why CSV?

5 of 63

8 of 63

CSVs @ Clarisights

  • Time series tabular data
  • KBs to GBs (120+ million CSV rows per day)
  • Multiple dialects of CSV files
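Handling multiple dialects is a solved problem in small: Python's stdlib `csv.Sniffer` can guess the delimiter and quoting from a sample of the file. A minimal sketch (not what Clarisights runs, just an illustration of dialect detection):

```python
import csv
import io

def read_any_dialect(text: str):
    """Detect the CSV dialect from a sample, then parse the whole text with it."""
    # Sniffer inspects a sample to guess the delimiter and quoting convention.
    dialect = csv.Sniffer().sniff(text[:1024])
    return list(csv.reader(io.StringIO(text), dialect))

# A semicolon-delimited file, one of the many dialects in the wild.
rows = read_any_dialect("date;city;orders\n2019-12-12;Bengaluru;10\n")
```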

9 of 63

RFC 4180...

And obviously there is an XKCD for it...

Credits: https://xkcd.com/927/

10 of 63

Aggregation on CSV

Also known internally as the sum/agg stats problem

11 of 63

Raw Data

Date        City       Orders  Revenue
2019-12-12  Bengaluru  10      100 INR
2019-12-12  Bengaluru  20      500 INR
2019-12-13  Bengaluru  30      800 INR

14 of 63

Aggregated Data

Date        City       Orders  Revenue
2019-12-12  Bengaluru  30      600 INR
2019-12-13  Bengaluru  30      800 INR

15 of 63

Run an Aggregation Function on Metrics with GROUP BY on Dimensions
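The operation above, sketched in plain Python over the example rows (a toy sketch; the real pipeline does this across millions of rows):

```python
import csv
import io
from collections import defaultdict

RAW = """date,city,orders,revenue
2019-12-12,Bengaluru,10,100
2019-12-12,Bengaluru,20,500
2019-12-13,Bengaluru,30,800
"""

# GROUP BY the dimensions (date, city), SUM the metrics (orders, revenue).
totals = defaultdict(lambda: [0, 0])
for row in csv.DictReader(io.StringIO(RAW)):
    key = (row["date"], row["city"])
    totals[key][0] += int(row["orders"])
    totals[key][1] += int(row["revenue"])

for (date, city), (orders, revenue) in sorted(totals.items()):
    print(date, city, orders, f"{revenue} INR")
```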

16 of 63

Now let’s try doing it on millions of rows...

17 of 63

21 of 63

Problems...

  • Containers getting OOMKilled
  • Unpredictable memory footprint
  • Unpredictable time to process data
  • Need to provision for peak memory usage

22 of 63

Pretty Much this...

23 of 63

What if...

We could run SQL directly on CSV files?
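One way to picture the idea, using only Python's stdlib: load the CSV into an in-memory SQLite table and query it with SQL (a toy illustration of the concept, not what any of the tools below do internally):

```python
import csv
import io
import sqlite3

RAW = """date,city,orders,revenue
2019-12-12,Bengaluru,10,100
2019-12-12,Bengaluru,20,500
2019-12-13,Bengaluru,30,800
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (date TEXT, city TEXT, orders INT, revenue INT)")
reader = csv.reader(io.StringIO(RAW))
next(reader)  # skip the header row
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", reader)

# The aggregation from the earlier slides, now expressed as SQL.
rows = conn.execute(
    "SELECT date, city, SUM(orders), SUM(revenue) FROM raw "
    "GROUP BY date, city ORDER BY date"
).fetchall()
```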

26 of 63

Nothing new: this has been done before

  • PostgreSQL fdw
  • github.com/harelba/q
  • github.com/dinedal/textql

29 of 63

Downsides...

  • Not horizontally scalable
  • The full SQL standard is not supported
  • Not actively maintained or used at scale

30 of 63

So we started searching...

31 of 63

Introducing...

Apache Drill

32 of 63

What is Apache Drill?

  • Schema-Free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

33 of 63

What is Apache Drill?

  • Schema-Free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
  • Based on Google’s Dremel Paper

34 of 63

Why Drill?

Can query complex, semi-structured data in place

(no transformation required)

35 of 63

Why Drill?

Full SQL Support

Drill supports the standard SQL:2003 syntax.

36 of 63

Why Drill?

Support for multiple data sources

File system, CSV, Parquet, HBase, Hive, etc.

37 of 63

Why Drill?

Support for User-Defined Functions (UDFs)

38 of 63

Why Drill?

Distributed and high-performance,

used at scale by MapR and others...

39 of 63

Why Drill?

An Apache Project

Actively maintained, with a community behind it, so it’s not going anywhere soon.

40 of 63

How Apache Drill Works...

πŸ€”

41 of 63

Source: https://drill.apache.org/docs/drill-query-execution/


44 of 63

Some Downsides...

  • Direct SQL on big CSV files is slow

  • Developers have to learn and stay aware of Drill’s quirks

45 of 63

Introducing...

Rig

46 of 63

What is Rig?

A proxy that abstracts away Apache Drill and exposes a fast, easy-to-use interface to load and query CSV files

47 of 63

How Rig Works...

πŸ€”

48 of 63

55 of 63

1. Client sends a LoadData RPC (with CSV) to Rig
2. Rig converts the CSV to a Parquet table
3. Rig gets back the Parquet table path
4. Rig returns a Dataset ID (UUID) to the Client
5. Client sends a Query RPC (with the Dataset ID) to Rig
6. Rig queries the Parquet table via Apache Drill
7. Apache Drill returns results to Rig
8. Rig returns results to the Client
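The flow above, as an in-process sketch. All names here (`load_data`, `query`, `DATASETS`) are illustrative only, not Rig's actual RPC surface; the Parquet conversion and Drill itself are stubbed out with a plain dict:

```python
import uuid

# Stand-in for the Parquet tables Rig writes and Drill later queries.
DATASETS = {}

def load_data(csv_rows):
    """LoadData RPC: 'convert' the CSV, register it, hand back a Dataset ID."""
    dataset_id = str(uuid.uuid4())
    DATASETS[dataset_id] = csv_rows  # real Rig writes a Parquet table instead
    return dataset_id

def query(dataset_id, city):
    """Query RPC: run a query against the registered dataset (Drill, stubbed)."""
    return [r for r in DATASETS[dataset_id] if r["city"] == city]

ds = load_data([{"city": "Bengaluru", "orders": 10}])
results = query(ds, "Bengaluru")
```

The key design point the sketch shows: the client never touches the CSV again after loading. It holds only an opaque Dataset ID and sends queries against that.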

56 of 63

Results...

57 of 63

Before Rig

  • We failed to load data from big CSV files, even after giving jobs 32 GB of memory.
  • Processing used to fail often due to out-of-memory errors.
  • 20-28 GB of memory was the norm for CSV processing jobs.

58 of 63

After Rig

  • Out-of-memory failures went down to zero
  • CSV processing time came down from 22-24 hours to 2-3 hours
  • Memory usage came down from the 20-30 GB range to the 2-3 GB range

59 of 63

Time taken to process data, before and after Rig

60 of 63

Problems/Issues faced

  • Apache Drill on Kubernetes
  • gRPC Load Balancing on Kubernetes
  • Tuning Apache Drill
  • Apache Drill on Preemptible Hosts
  • Downtime and Query Failure during Deploys

61 of 63

Learnings

  • gRPC and HTTP/2 Load Balancing
  • Running Stateful workloads on Kubernetes
  • GCS as Distributed FS and gcsfuse

62 of 63

Questions?

63 of 63

We are Hiring...

Come talk to me (suraj.dev), or

Visit clarisights.com/careers