1 of 45

Lecture 14: A Few Tips for Data Science in the “Real” World

Applied Data Science Fall 2025

Amir Hesam Salavati

saloot@gmail.com

Hamed Shah-Mansouri

hamedsh@sharif.edu

2 of 45

Last Session We Covered...

Explainable AI

Global Explanation

Local Explanation

Imbalanced Data

3 of 45

In this session

  • Real world challenges and tools
    • Pipelines
    • Automated Machine Learning (AutoML)
    • Some examples and their solutions
  • Necessary/helpful skills for a successful data scientist
    • Hard skills
    • Soft skills
  • Podcasts and videos to keep your edge sharp

4 of 45

Data Pipelines

https://russiabusinesstoday.com/infrastructure/russia-to-build-20000-km-of-oil-gas-pipelines-by-2022-report/

5 of 45

Pipeline Purpose

Image: https://www.geeksforgeeks.org/whats-data-science-pipeline/

  • Data goes in, insight comes out
  • Like a physical pipeline, it takes data (the liquid) from one place to another

6 of 45

Pipeline Tools

  • A data pipeline is a tool that automates a series of operations performed on data, in consecutive steps (or in parallel)
  • The set of operations varies, but the main goals are:
    • Saving time on repeated tasks
    • Splitting large tasks into smaller ones (possibly with a different framework for each part)
    • Reproducibility

7 of 45

Spectrum of Pipelines

  • Extract and load data
    • ETL: Extract, Transform and Load
    • ELT: Extract, Load and Transform
  • Analytics-based: not only load the data, but also analyze it
  • Infrastructure
    • Across servers and platforms, with parallel processing
    • Language/library-specific, mainly for having cleaner, reproducible code
  • The final architecture depends on the project and our purpose
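
To make the ETL flavor concrete, here is a minimal sketch in Python; the file names and columns are hypothetical, and a real pipeline would add logging and error handling:

    import pandas as pd

    def extract(path):
        # Extract: read raw data (could equally come from an API or a database)
        return pd.read_csv(path)

    def transform(df):
        # Transform: clean and enrich the raw records
        df = df.dropna()
        return df.assign(revenue=df["price"] * df["quantity"])

    def load(df, path):
        # Load: persist the result for downstream consumers
        df.to_parquet(path, index=False)

    load(transform(extract("raw_sales.csv")), "clean_sales.parquet")

Swapping the order of transform() and load() turns this ETL sketch into ELT.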

8 of 45

Apache Spark

  • A lightning-fast cluster computing technology
  • A unified analytics engine for large-scale data processing
  • Pros:
    • Very very fast!
    • Supports multiple languages (Python, R, etc.)
  • Cons:
    • High RAM consumption
    • File management issues
  • More: https://trustradius.com/products/apache-spark/reviews?qs=pros-and-cons

Source: https://databricks.com/spark/about
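
As a small taste of the API, a minimal PySpark sketch (assumes pyspark is installed; the file and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Read a CSV into a distributed DataFrame
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # The aggregation is executed in parallel across the cluster
    df.groupBy("user_id").agg(F.count("*").alias("n_events")).show()

    spark.stop()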

9 of 45

Kafka

  • Open source
  • Very popular
  • Especially useful for analyzing real-time data feeds
  • Also useful in handling logs
  • Disadvantages:
    • Queuing mechanism is not the best
    • Integrated tools are far from ideal

Source: https://axual.com/what-is-apache-kafka/
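
For illustration, a rough sketch with the third-party kafka-python client (the broker address and topic name are hypothetical):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish an event to a topic
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("user-events", b'{"user": 42, "action": "click"}')
    producer.flush()

    # Consumer (normally a separate process): read the event stream back
    consumer = KafkaConsumer("user-events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)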

10 of 45

Airflow

  • Originally built at Airbnb
  • Open source
  • Can distribute tasks for parallel processing
  • Integrates nicely with Python
  • Disadvantage: it has a steep learning curve.
    • To work well with Airflow you need DevOps knowledge
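
To give a feel for it, a minimal sketch of a two-task DAG (Airflow 2.x style; the task contents are hypothetical):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("fetching data")

    def transform():
        print("cleaning data")

    # A DAG that runs daily; extract must finish before transform starts
    with DAG(dag_id="daily_etl",
             start_date=datetime(2025, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t1 >> t2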

11 of 45

Scikit-learn Pipelines + Transformers

  • Helpful in writing code that is:
    • Cleaner
    • Easier to expand/debug
    • Reproducible
  • Not necessarily useful in parallel processing or task distribution
  • We will see how to use them in the lab session

Image: https://youtube.com/watch?v=jzKSAeJpC6s
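
As a preview of the lab session, a minimal sketch chaining an imputer, a scaler and a model into one reproducible object:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Every step is applied in order; train and test data are guaranteed
    # to go through identical preprocessing
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))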

12 of 45

Pandas Pipes

  • Similar to Scikit-learn pipelines in spirit
    • They help in writing cleaner code,
    • But not in parallel processing or task scheduling
  • Basically, we can wrap a lot of what we did in our Colab notebooks in a Pandas pipe (see the sketch below)
  • We will see how to use them in the lab session

https://hackersandslackers.com/merge-dataframes-with-pandas/
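
A minimal sketch of the idea (the columns and steps are hypothetical):

    import pandas as pd

    def drop_missing(df):
        return df.dropna()

    def add_total(df, price_col, qty_col):
        # Extra arguments are passed straight through by .pipe
        return df.assign(total=df[price_col] * df[qty_col])

    df = pd.DataFrame({"price": [10.0, 20.0, None], "qty": [1, 2, 3]})

    # Steps read top-to-bottom instead of being nested inside-out
    clean = (df
             .pipe(drop_missing)
             .pipe(add_total, "price", "qty"))
    print(clean)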

13 of 45

No-code Pipelines

  • A rather new trend to let less tech-savvy users benefit from pipelines
  • They integrate with data sources (e.g. Google, Facebook, Amazon, etc.), load the data and analyze it
    • Hevo Data
    • Datapipelines.com
    • Datado
    • Many more, and many to come!

Image: https://siliconangle.com/2021/12/17/hevo-data-closes-30m-round-provide-data-pipelines-cloud/

14 of 45

Cloud-based Pipeline

  • Pipelines can be defined in a cloud-based system
    • Your data is fetched from the source you specify,
    • The result is shown to you on a dashboard
    • Less effort on infrastructure
    • Usually not free
  • Examples
    • Keboola
    • Stitch
    • Etleap
    • Integrate.io
    • Many many more!

15 of 45

Some Examples of Real Data Pipelines

Pinterest: over 100M MAU doing 10B+ pageviews per month

Airbnb: over 100M users browsing over 2M listings

16 of 45

Automated Machine Learning (AutoML)

Image: https://datafloq.com/read/why-automated-machine-learning-important/

17 of 45

AutoML, Its Roots and Applications

  • AutoML tools were created to simplify the process of hyperparameter tuning
  • Currently, they can help in data cleaning/preprocessing, model selection, hyperparameter tuning, prediction and visualization!
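
To see what got automated first, here is the kind of manual hyperparameter search (a scikit-learn grid search over a toy dataset) that early AutoML was built to replace:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Try every combination with cross-validation; AutoML tools replace
    # this hand-written grid with a smarter, broader search
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [50, 100],
                                    "max_depth": [3, None]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)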

18 of 45

AutoML Range of Applications

  • Hyperparameter Tuning
  • Data Exploration and Understanding
  • Feature Engineering
  • Model Selection
  • Time Efficiency

19 of 45

AutoML Tools

  • Many available options:
    • Auto-WEKA: simultaneous selection of the algorithm and its hyperparameters
    • Auto-sklearn: an extension of the Auto-WEKA idea to scikit-learn
    • AutoKeras/Auto-PyTorch: mainly for deep learning methods
    • TPOT: a data-science assistant that optimizes ML pipelines using genetic programming
    • Many many more!
  • We will explore some of them in the lab session
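
As one example, TPOT exposes a scikit-learn-style interface; a rough sketch (the exact arguments may vary between TPOT versions):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Genetic programming searches over whole pipelines, not just models
    tpot = TPOTClassifier(generations=5, population_size=20, random_state=0)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))

    # Export the best pipeline it found as plain scikit-learn code
    tpot.export("best_pipeline.py")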

20 of 45

Cloud AutoML

  • Targets individuals and companies with limited coding resources
  • Designed to perform ML in the cloud, with a few clicks
  • Multiple options available
  • Could be used to benchmark algorithms developed locally on larger datasets
    • But these services are usually not free

21 of 45

Challenges of Data Science in “Real” Life Applications

Image: https://summitphotography.ca/

Real-life environments usually come with constraints (e.g. cost, resources, speed), which means we need to solve the challenge within those boundaries.

22 of 45

Accuracy vs. Speed vs. Cost Tradeoffs

  • A lot of the time, we want algorithms that are very accurate
  • But before/after deployment, they might:
    • Take too long to design! (engineering mindset)
    • Take too long to return the results
    • Consume a lot of resources (CPU/RAM/Storage)
  • We need to step back a little and probably sacrifice accuracy a bit
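
One way to keep this trade-off honest is to measure both sides; a small sketch comparing a light and a heavy model on the same data:

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for model in (LogisticRegression(max_iter=5000),
                  RandomForestClassifier(n_estimators=500, random_state=0)):
        model.fit(X_train, y_train)
        start = time.perf_counter()
        acc = model.score(X_test, y_test)
        elapsed = time.perf_counter() - start
        # The "best" model depends on how much latency/compute we can afford
        print(f"{type(model).__name__}: accuracy={acc:.3f}, predict time={elapsed:.4f}s")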

23 of 45

Accuracy vs. Explainability (vs. Manipulation) Trade Offs

  • Explainability might become an issue in certain cases (e.g. loan approval algorithms)
  • In some cases, we not only need to predict something, but may also need to manipulate user behavior in order to improve that prediction.
    • airline pricing
    • order fulfillment probability in Ubaar (MLP instead of RF)

24 of 45

Cold Start Problem: What to do When We Have No Data?!

25 of 45

Cold Start Problem

  • When you start from scratch, you do not have data to work with on day 1
    • Recommendation systems
    • Pricing algorithms
  • Possible “remedies”:
    • Get data from other resources (e.g. other datasets, crawling, social media, etc.)
    • Rule-based/intuition-based algorithms to get started
    • Deploy first, gather data later
    • More: https://expressanalytics.com/blog/cold-start-problem/
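
For instance, a pricing service could launch with a hand-written rule and only switch to a trained model once enough data has been logged; the rule and its constants below are purely hypothetical:

    def estimate_price(distance_km, weight_kg, model=None):
        # Day-1 fallback: an intuition-based linear rule (hypothetical constants)
        if model is None:
            base_fare = 50.0
            return base_fare + 1.2 * distance_km + 0.3 * weight_kg
        # Once enough orders are logged, swap in a trained model
        return model.predict([[distance_km, weight_kg]])[0]

    print(estimate_price(120, 800))  # rule-based until real data arrives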

26 of 45

A/B Testing: Why Do Limited Releases at First?

27 of 45

A/B Testing, Even When You are Pretty Sure!

  • You have designed an algorithm: accuracy is through the roof, resource consumption is low, the model is explainable. Sounds perfect!
  • But make sure to test it on a subset of users first!
    • There might be a feature we didn’t have access to
    • Our model might be based on “assumptions” that are not valid in practice
    • ….
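
A limited release also yields the numbers for a significance check before full rollout; a minimal sketch with a chi-squared test on hypothetical conversion counts:

    from scipy.stats import chi2_contingency

    # Rows: control (old algorithm) vs. treatment (new algorithm)
    # Columns: converted vs. did not convert (hypothetical counts)
    table = [[120, 880],   # control:   12.0% conversion
             [150, 850]]   # treatment: 15.0% conversion

    chi2, p_value, dof, expected = chi2_contingency(table)
    # Roll out to everyone only if the lift is unlikely to be chance
    print(f"p-value = {p_value:.4f}")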

28 of 45

Necessary Skills for an Applied Data Scientist

Image: http://www.leansolutions.gr/blog/what-are-benefits-multi-skilling-in-production/

29 of 45

Another Data Analysis Pipeline/Workflow

Problem → Gathering Data → Data Cleaning/Preprocess → Exploratory Data Analysis → Analysis → Knowledge & Insight

30 of 45

Which Parts Did We Cover in This Course?

31 of 45

What We Have Covered in This Course

Problem → Gathering Data → Data Cleaning/Preprocess → Exploratory Data Analysis → Analysis → Knowledge & Insight

32 of 45

What Remains: Problem Formulation

  • Probably the most important skill of a data analyst/scientist (and, to some extent, of an ML engineer)
  • Translate the business/technical/academic problem into a question that can be answered using data and ML algorithms

Image: https://techgrabyte.com/10-machine-learning-algorithms-application/

33 of 45

What Remains: Problem Formulation

  • What is the value our algorithm is going to create?
  • How do we implement it (given the constraints)?
  • How can it be measured?
  • Requires:
    • Domain knowledge or
    • Curiosity/Trial and error

Image: https://techgrabyte.com/10-machine-learning-algorithms-application/

34 of 45

What Remains: Data Gathering

  • We might need to gather data from data sources ourselves
  • Requires
    • Database management: MySQL, PostgreSQL, MongoDB
    • Working with 3rd party APIs (e.g. Twitter, Facebook, etc.)
    • Crawling existing sources: text/image/video from web
    • Distributed storage/processing: Hadoop, Apache Spark/Flink

Image: https://docs.eazybi.com/eazybi/data-import/external-data-sources
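
As a small taste of API-based gathering, a sketch with the requests library; the endpoint and its fields are hypothetical:

    import pandas as pd
    import requests

    # Fetch JSON records from a (hypothetical) REST endpoint
    response = requests.get("https://api.example.com/v1/posts",
                            params={"limit": 100},
                            timeout=10)
    response.raise_for_status()

    # Flatten the payload into a DataFrame and persist it
    df = pd.DataFrame(response.json())
    df.to_csv("posts.csv", index=False)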

35 of 45

What Remains: Insight Sharing and Storytelling

  • The result of our algorithms is insight
  • Effectively sharing that insight is almost as important as all of your previous endeavours
  • We are going to practice this in the final project presentation and report

Source: https://madelearningdesigns.com/2014/02/06/using-digital-stories-in-elearning/ + https://forbes.com/sites/jeffthomson/2021/05/28/the-power-of-storytelling-in-the-finance-function/

36 of 45

What Remains: Insight Sharing and Storytelling

  • Art of Storytelling
  • Proper visualizations
  • Complete and reader-friendly reports
  • Clear and actionable steps

Source: https://madelearningdesigns.com/2014/02/06/using-digital-stories-in-elearning/ + https://forbes.com/sites/jeffthomson/2021/05/28/the-power-of-storytelling-in-the-finance-function/

37 of 45

Additional Soft Skills That Could Help

  • Soft skills are really, really important for you to excel!
  • Be the owner/boss of what you do:
    • The owner of a task does not wait for others to tell her/him what needs to be done
    • The owner is curious and doesn’t let go until a solution is found (or all options are explored)
    • As a result, he/she almost always over-delivers

38 of 45

Additional Soft Skills That Could Help

  • Be meticulous and always double-check your results first,

  • Then triple-check them with your colleagues (don’t be like me!)
    • Others are going to use your data to take action!
  • Learn time management and respect deadlines

39 of 45

Additional Soft Skills That Could Help

  • Become an expert in finding the fine line between useful and harmful perfectionism:
    • When is it OK to go for 85% accuracy instead of 95%?
    • Have the end goal in mind!
  • Be creative and don’t be scared to test new ideas.
  • And again, be the owner/boss of your tasks and stay up to date!

40 of 45

Some Helpful Resources

Image: https://ideas.ted.com/teds-summer-culture-list-114-podcasts-books-tv-shows-movies-and-more-to-nourish-you/

41 of 45

Courses

Blogs

Community

  • Medium
    • Make sure to sign in to receive recommendations
  • Towards Data Science
  • Machine Learning Mastery
  • KDnuggets

42 of 45

Podcasts

Youtube Channels

  • DataFramed: biweekly podcast that features interviews with practitioners
  • Data Skeptic: covers both bite-sized topics and in-depth issues
  • Lex Fridman Podcast: conversations about science, technology, history and philosophy
  • AI in Business: AI applications in business and their challenges

43 of 45

Data Pipelines

AutoML

Tips for data science in the “real” world

44 of 45

https://towardsdatascience.com/the-absolute-beginners-guide-for-data-science-rookies-736e4fcbff0a

45 of 45

ToDo List for Next Session

  • Check out the Google Colab notebook before our lab session: https://a4re.ir/lab14
  • Don’t forget the survey :) https://forms.gle/4ds2ZkvbjTVC1z2N9
  • No homework on AutoML and pipelines :D