1 of 63

CSV Processing at Scale

Suraj Nath

Infrastructure Engineer, Clarisights

2 of 63

About Me

Web: suraj.dev

@electron0zero on Twitter and the rest of the Interwebs

3 of 63

About Clarisights

Ad-Hoc Data Analytics, Enrichment and Insights Platform for Marketing Teams.

4 of 63

Why CSV?

5 of 63

8 of 63

CSVs @ Clarisights

  • Time series tabular data
  • KBs to GBs (120+ million CSV rows per day)
  • Multiple dialects of CSV files
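Handling multiple dialects is a solved problem in small: Python's stdlib `csv.Sniffer` can guess the delimiter and quoting from a sample of the file. A minimal sketch (not what Clarisights runs, just an illustration of dialect detection):

```python
import csv
import io

def read_any_dialect(text: str):
    """Detect the CSV dialect from a sample, then parse the whole text with it."""
    # Sniffer inspects a sample to guess the delimiter and quoting convention.
    dialect = csv.Sniffer().sniff(text[:1024])
    return list(csv.reader(io.StringIO(text), dialect))

# A semicolon-delimited file, one of the many dialects in the wild.
rows = read_any_dialect("date;city;orders\n2019-12-12;Bengaluru;10\n")
```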

9 of 63

RFC 4180...

And obviously there is an XKCD for it...

Credits: https://xkcd.com/927/

10 of 63

Aggregation on CSV

Also known internally as the sum/agg stats problem

11 of 63

Raw Data

Date        City       Orders  Revenue
2019-12-12  Bengaluru  10      100 INR
2019-12-12  Bengaluru  20      500 INR
2019-12-13  Bengaluru  30      800 INR

14 of 63

Aggregated Data

Date        City       Orders  Revenue
2019-12-12  Bengaluru  30      600 INR
2019-12-13  Bengaluru  30      800 INR

15 of 63

Run an Aggregation Function on Metrics with GROUP BY on Dimensions
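The operation above, sketched in plain Python over the example rows (a toy sketch; the real pipeline does this across millions of rows):

```python
import csv
import io
from collections import defaultdict

RAW = """date,city,orders,revenue
2019-12-12,Bengaluru,10,100
2019-12-12,Bengaluru,20,500
2019-12-13,Bengaluru,30,800
"""

# GROUP BY the dimensions (date, city), SUM the metrics (orders, revenue).
totals = defaultdict(lambda: [0, 0])
for row in csv.DictReader(io.StringIO(RAW)):
    key = (row["date"], row["city"])
    totals[key][0] += int(row["orders"])
    totals[key][1] += int(row["revenue"])

for (date, city), (orders, revenue) in sorted(totals.items()):
    print(date, city, orders, f"{revenue} INR")
```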

16 of 63

Now let’s try doing it on millions of rows...

17 of 63

21 of 63

Problems...

  • Containers getting OOMKilled
  • Unpredictable memory footprint
  • Unpredictable time to process data
  • Need to provision for peak memory usage

22 of 63

Pretty Much this...

23 of 63

What if...

We could run SQL directly on CSV files?
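One way to picture the idea, using only Python's stdlib: load the CSV into an in-memory SQLite table and query it with SQL (a toy illustration of the concept, not what any of the tools below do internally):

```python
import csv
import io
import sqlite3

RAW = """date,city,orders,revenue
2019-12-12,Bengaluru,10,100
2019-12-12,Bengaluru,20,500
2019-12-13,Bengaluru,30,800
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw (date TEXT, city TEXT, orders INT, revenue INT)")
reader = csv.reader(io.StringIO(RAW))
next(reader)  # skip the header row
conn.executemany("INSERT INTO raw VALUES (?, ?, ?, ?)", reader)

# The aggregation from the earlier slides, now expressed as SQL.
rows = conn.execute(
    "SELECT date, city, SUM(orders), SUM(revenue) FROM raw "
    "GROUP BY date, city ORDER BY date"
).fetchall()
```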

26 of 63

Nothing new: this has been done before

  • PostgreSQL fdw
  • github.com/harelba/q
  • github.com/dinedal/textql

29 of 63

Downsides...

  • Not horizontally scalable
  • The full SQL standard is not supported
  • Not actively maintained or used at scale

30 of 63

So we started searching...

31 of 63

Introducing...

Apache Drill

32 of 63

What is Apache Drill?

  • Schema-Free SQL Query Engine for Hadoop, NoSQL and Cloud Storage

33 of 63

What is Apache Drill?

  • Schema-Free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
  • Based on Google’s Dremel Paper

34 of 63

Why Drill?

Can query complex, semi-structured data in place

(no transformation required)

35 of 63

Why Drill?

Full SQL Support

Drill supports the standard SQL:2003 syntax.

36 of 63

Why Drill?

Support for multiple data sources

File system, CSV, Parquet, HBase, Hive, etc.

37 of 63

Why Drill?

Support for User-Defined Functions (UDFs)

38 of 63

Why Drill?

Distributed and high-performance,

used at scale by MapR and others...

39 of 63

Why Drill?

An Apache Project

Actively maintained, with a community behind it, so it’s not going anywhere soon.

40 of 63

How Apache Drill Works...

πŸ€”

41 of 63

Source: https://drill.apache.org/docs/drill-query-execution/


44 of 63

Some Downsides...

  • Direct SQL on big CSV files is slow

  • Developers have to learn and stay aware of Drill’s quirks

45 of 63

Introducing...

Rig

46 of 63

What is Rig?

A proxy that abstracts away Apache Drill and exposes a fast, easy-to-use interface to load and query CSV files

47 of 63

How Rig Works...

πŸ€”

48 of 63

55 of 63

1. Client sends a LoadData RPC (with CSV) to Rig
2. Rig converts the CSV to a Parquet table
3. Rig gets back the Parquet table path
4. Rig returns a Dataset ID (UUID) to the Client
5. Client sends a Query RPC (with the Dataset ID) to Rig
6. Rig queries the Parquet table via Apache Drill
7. Apache Drill returns results to Rig
8. Rig returns results to the Client
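The flow above, as an in-process sketch. All names here (`load_data`, `query`, `DATASETS`) are illustrative only, not Rig's actual RPC surface; the Parquet conversion and Drill itself are stubbed out with a plain dict:

```python
import uuid

# Stand-in for the Parquet tables Rig writes and Drill later queries.
DATASETS = {}

def load_data(csv_rows):
    """LoadData RPC: 'convert' the CSV, register it, hand back a Dataset ID."""
    dataset_id = str(uuid.uuid4())
    DATASETS[dataset_id] = csv_rows  # real Rig writes a Parquet table instead
    return dataset_id

def query(dataset_id, city):
    """Query RPC: run a query against the registered dataset (Drill, stubbed)."""
    return [r for r in DATASETS[dataset_id] if r["city"] == city]

ds = load_data([{"city": "Bengaluru", "orders": 10}])
results = query(ds, "Bengaluru")
```

The key design point the sketch shows: the client never touches the CSV again after loading. It holds only an opaque Dataset ID and sends queries against that.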

56 of 63

Results...

57 of 63

Before Rig

  • We failed to load data from big CSV files, even after giving jobs 32 GB of memory.
  • Processing used to fail often due to out-of-memory errors.
  • 20-28 GB of memory was the norm for CSV processing jobs.

58 of 63

After Rig

  • Out-of-memory failures went down to zero
  • CSV processing time came down from 22-24 hours to 2-3 hours
  • Memory usage came down from the 20-30 GB range to the 2-3 GB range

59 of 63

Time taken to process data, before and after Rig

60 of 63

Problems/Issues faced

  • Apache Drill on Kubernetes
  • gRPC Load Balancing on Kubernetes
  • Tuning Apache Drill
  • Apache Drill on Preemptible Hosts
  • Downtime and Query Failure during Deploys

61 of 63

Learnings

  • gRPC and HTTP/2 Load Balancing
  • Running Stateful workloads on Kubernetes
  • GCS as Distributed FS and gcsfuse

62 of 63

Questions?

63 of 63

We are Hiring...

Come talk to me (suraj.dev), or

Visit clarisights.com/careers