DATA ENGINEERING
(from the trenches)
ABOUT ME
@ezequiel_orbe
ezequiel-orbe-3b83502
Ezequiel Orbe
I’m a data scientist,
Why should I care about this?
The AI Hierarchy of Needs
“Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure)”
Maslow’s Pyramid of Needs
Being a Data Scientist: Expectations
Being a Data Scientist: Reality
The truth is that...
Given that business doesn’t wait, companies tend to start from the top of the pyramid, forcing you, as a data scientist, to be a one-man band.
So, keep this in mind...
“The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit. I find this to be true for both evaluating project or job opportunities and scaling one’s work on the job.”
Data Engineering Defined
What’s Data Engineering all about?
“Data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and in computation at scale.”
“Data engineering primarily focus on the following areas: Build and maintain the organization’s data pipeline systems ...Clean and wrangle data into a usable state”
Data Scientists vs Data Engineers
Wait...there is more, the ….
A few basic concepts
Data Infrastructure
Data infrastructure refers to the systems and processes required to collect, move, store and transform an organization’s data.
The data ecosystem (a small part of it)
Data Pipelines
A data pipeline is any set of processing elements that move data from one system to another, possibly transforming the data along the way.
An ETL pipeline is a data pipeline that extracts data from one system, transforms it, and loads it into some database or data warehouse. It usually implies that the pipeline works in batches.
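To make the extract, transform, load steps concrete, here is a minimal batch ETL sketch. It is only an illustration: the CSV content, the `orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (stands in for a file or an API).
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,alice,4.25
"""


def extract(raw_text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: cast types and drop rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            clean.append(
                (int(row["order_id"]), row["customer"], float(row["amount"]))
            )
        except ValueError:
            continue  # skip malformed rows
    return clean


def load(records, conn):
    """Load: write the whole batch into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_batch(raw_text, conn):
    """Run one batch through the three stages in order."""
    load(transform(extract(raw_text)), conn)


# In-memory SQLite is an assumption made to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
run_batch(RAW_CSV, conn)
total_by_customer = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

Keeping each stage as a separate function is one common way to test the stages independently and to swap the source or the target later.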
Data Warehouse vs Data Lake
A data warehouse is a centralized repository of integrated data from different sources specifically structured for query and analysis.
A data lake is a centralized repository of structured and unstructured data, typically stored in its raw form.
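The difference can be sketched in a few lines. This is a toy illustration under stated assumptions: a temporary directory of JSON files plays the role of the lake, an in-memory SQLite table plays the role of the warehouse, and the event fields and the `purchases` table are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical events; note that the records do not all share the same fields.
events = [
    {"customer": "alice", "action": "click", "ts": 1},
    {"customer": "bob", "action": "purchase", "ts": 2, "amount": 9.99},
]

# Lake-style: land the raw records unchanged as files.
# Any structure is applied later, when the files are read.
lake_dir = Path(tempfile.mkdtemp())
for i, event in enumerate(events):
    (lake_dir / f"event_{i}.json").write_text(json.dumps(event))

# Warehouse-style: integrate the data into a table with a fixed schema
# that was designed up front for query and analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)
for path in sorted(lake_dir.glob("event_*.json")):
    event = json.loads(path.read_text())
    if event["action"] == "purchase":
        warehouse.execute(
            "INSERT INTO purchases VALUES (?, ?)",
            (event["customer"], event["amount"]),
        )

purchase_rows = warehouse.execute(
    "SELECT customer, amount FROM purchases"
).fetchall()
raw_file_count = len(list(lake_dir.glob("*.json")))
```

The lake keeps every record, including fields nobody has modeled yet, while the warehouse holds only what fits the schema chosen for analysis.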
Big Data: don’t fool yourself...
If you need to...
Transient Clusters are your friends
Reading Material