1 of 22

DATA ENGINEERING

(from the trenches)

2 of 22

ABOUT ME

  • PhD in Computer Science (FaMAF - UNC).
    • Research: Modal Logics and Automated Reasoning.
  • Head of Data @ Winclap
    • Previously:
      • Lead Data Engineer @ Jampp.
      • Sr. Data Engineer @ Olapic.

@ezequiel_orbe

ezequiel-orbe-3b83502

Ezequiel Orbe

3 of 22

I’m a data scientist,

Why should I care about this?

4 of 22

The AI Hierarchy of Needs

Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure).

MASLOW’S

PYRAMID OF NEEDS

5 of 22

Being a Data Scientist: Expectations

6 of 22

Being a Data Scientist: Reality

7 of 22

The truth is that...

Given

  • the need to differentiate/survive,
  • the pressure to hit the market first, and,
  • the lack of knowledge,

companies tend to start from the top of the pyramid, forcing you as a data scientist to be a one-man band.

Business doesn’t wait

8 of 22

So, keep this in mind...

  • As a data scientist, your capability to extract value from data is tightly coupled with the maturity of the data infrastructure of the company.
  • If the company doesn’t have a mature data infrastructure in place, you’ll find yourself in need of some degree of skills in data engineering.

“The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit. I find this to be true for both evaluating project or job opportunities and scaling one’s work on the job.”

9 of 22

Data Engineering Defined

10 of 22

What’s Data Engineering all about?

“Data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.”

“Data engineering primarily focus on the following areas: Build and maintain the organization’s data pipeline systems ... Clean and wrangle data into a usable state”

11 of 22

Data Scientists vs Data Engineers

  • Well, Data Scientists are more glamorous...

  • A Data Engineer:
    • Has a strong Software Engineering background.
    • Builds tools, infrastructure, frameworks, and services.
    • Operates an organization’s (big) data infrastructure.

12 of 22

Wait...there is more, the ….

  • An ML Engineer:
    • Proficient in both Data Science and Data Engineering.
    • Responsible for productizing ML models.
    • Creates the last mile of the data science pipeline.

13 of 22

A few basic concepts

14 of 22

Data Infrastructure

Data infrastructure refers to the systems, processes and infrastructure required to collect, move, store and transform an organization’s data.

  • You could think of your DI as made of a set of layers and stages:

  • Some requirements:
    • It must make information easily accessible.
    • It must present information consistently.
    • It must adapt to change.
    • It must present information in a timely way.
    • It must scale with the business.
    • It must be secure.

15 of 22

The data ecosystem (a small part of it)

16 of 22

Data Pipelines

A data pipeline is any set of processing elements that move data from one system to another, possibly transforming the data along the way.

An ETL pipeline is a data pipeline that extracts data from one system, transforms it, and loads it into some database or data warehouse. It usually implies that the pipeline works in batches.
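The extract, transform, and load steps described above can be sketched as a tiny batch job. This is only an illustration using Python's standard library: the CSV content, the `purchases` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target database or data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an export from an upstream system.
RAW_CSV = """user_id,country,amount
1,ar,10.50
2,US,3.00
3,ar,not-a-number
4,br,7.25
"""


def extract(text):
    """Extract: read rows from the source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize country codes, parse amounts, drop bad rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["user_id"]), row["country"].upper(), amount))
    return clean


def load(conn, rows):
    """Load: write the cleaned batch into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases "
        "(user_id INTEGER, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()


def run_batch(conn, text):
    """Run one batch end to end and return the number of rows loaded."""
    rows = transform(extract(text))
    load(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
loaded = run_batch(conn, RAW_CSV)
totals = dict(
    conn.execute("SELECT country, SUM(amount) FROM purchases GROUP BY country")
)
```

Keeping each stage as its own function makes it easy to test the transformation separately from the source and target systems.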

17 of 22

Data Warehouse vs Data Lake

A data warehouse is a centralized repository of integrated data from different sources specifically structured for query and analysis.

  • In a DW:
    • Schema on Write.
    • High quality data.
    • Used mostly by business analysts.
    • Used for reporting and BI.
  • In a DL:
    • Schema on Read.
    • Raw data.
    • Used mostly by Data Scientists.
    • Used for machine learning.

A data lake is a centralized repository of structured and unstructured data, stored in its raw form.
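The difference between schema on write and schema on read can be shown with a small sketch. Everything here is hypothetical (the event records, the `events` table, the function names); SQLite stands in for a warehouse and a list of raw JSON lines stands in for a lake.

```python
import json
import sqlite3

# Hypothetical events: the second has an extra field, the third a wrong type.
EVENTS = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view", "campaign": "spring"},
    {"user_id": "three", "event": "click"},
]


def write_to_warehouse(conn, events):
    """Schema on write: validate against a fixed schema before storing."""
    conn.execute(
        "CREATE TABLE events (user_id INTEGER NOT NULL, event TEXT NOT NULL)"
    )
    rejected = []
    for e in events:
        if isinstance(e.get("user_id"), int) and isinstance(e.get("event"), str):
            # Only the columns known to the schema are kept.
            conn.execute(
                "INSERT INTO events VALUES (?, ?)", (e["user_id"], e["event"])
            )
        else:
            rejected.append(e)
    return rejected


def write_to_lake(events):
    """Schema on read: store every record raw, with no validation."""
    return [json.dumps(e) for e in events]


def read_campaigns(raw_lines):
    """Apply a schema at query time: pick out a field only some records have."""
    campaigns = []
    for line in raw_lines:
        record = json.loads(line)
        if "campaign" in record:
            campaigns.append(record["campaign"])
    return campaigns


conn = sqlite3.connect(":memory:")
rejected = write_to_warehouse(conn, EVENTS)
stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
lake = write_to_lake(EVENTS)
campaigns = read_campaigns(lake)
```

The warehouse ends up with fewer but cleaner rows and loses the unplanned `campaign` field, while the lake keeps everything and pushes the interpretation work onto whoever reads it.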

18 of 22

Big Data: don’t fool yourself...

  • Companies at different stages produce data at different velocity, variety, and volume.
    • A new start-up probably doesn’t need “big data” because there isn’t much data.
    • As the start-up grows, it will become more data intensive but might do just fine using PostgreSQL.
    • At a certain data scale, your only choice would be to resort to the Hadoop ecosystem.
  • Don't use Hadoop - your data isn't that big

19 of 22

If you need to...

  • The best advice I can give you: Understand your use case.
    • The match between use case and technology is critical.
    • Choosing the wrong technology stack will lower your quality of life...

  • To scale your SQL-based ETLs:
  • To scale your code-based ETLs and ML models:
  • To scale your SQL analysis capabilities:

  • Just remember: There is no silver bullet.
  • This is Useless (Without Use Cases)

20 of 22

Transient Clusters are your friends

  • Compute and storage separation
    • All data is stored in external storage.
  • Use shared metastores.
    • EMRFS (on AWS).
    • Hive metastore.
  • Don’t mix YARN and non-YARN workloads.
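As a rough sketch of what a transient cluster can look like on AWS, the function below builds the request one might pass to boto3's EMR `run_job_flow` call. It only builds a dictionary and calls nothing. The release label, instance types, role names, bucket paths, and metastore host are all assumptions for illustration, not values from this talk.

```python
def transient_cluster_request(name, script_uri, log_uri, metastore_uri):
    """Build a request for a cluster that runs one step and then terminates."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release label
        "LogUri": log_uri,  # logs live outside the cluster
        "Applications": [{"Name": "Spark"}],
        "Configurations": [
            {
                # Point Spark at a shared, external Hive metastore so table
                # definitions survive after the cluster is gone.
                "Classification": "spark-hive-site",
                "Properties": {"hive.metastore.uris": metastore_uri},
            }
        ],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # This flag makes the cluster transient: it shuts down
            # once there are no more steps to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "nightly-etl",  # hypothetical step name
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    # The job script is read from external storage.
                    "Args": ["spark-submit", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


request = transient_cluster_request(
    name="transient-etl",
    script_uri="s3://example-bucket/jobs/etl.py",  # hypothetical path
    log_uri="s3://example-bucket/logs/",  # hypothetical path
    metastore_uri="thrift://metastore.example.internal:9083",  # hypothetical
)
# With AWS credentials in place, this could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```

Because the script, the data, the logs, and the table metadata all live outside the cluster, nothing is lost when the cluster terminates after the step.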

21 of 22

Reading Material

  • Recommended Books:

22 of 22