1 of 45

Lecture 14: A Few Tips for Data Science in the “Real” World

Applied Data Science Fall 2025

Amir Hesam Salavati

saloot@gmail.com

Hamed Shah-Mansouri

hamedsh@sharif.edu

2 of 45

Last Session We Covered...

Explainable AI

Global Explanation

Local Explanation

Imbalanced Data

3 of 45

In this session

  • Real world challenges and tools
    • Pipelines
    • Automated Machine Learning (AutoML)
    • Some examples and their solutions
  • Necessary/helpful skills for a successful data scientist
    • Hard skills
    • Soft skills
  • Podcasts and videos to keep your edge sharp

4 of 45

Data Pipelines

https://russiabusinesstoday.com/infrastructure/russia-to-build-20000-km-of-oil-gas-pipelines-by-2022-report/

5 of 45

Pipeline Purpose

Image: https://www.geeksforgeeks.org/whats-data-science-pipeline/

  • Data goes in, insight comes out
  • Like a physical pipeline, it takes data (the liquid) from one place to another

6 of 45

Pipeline Tools

  • A data pipeline is a tool that automates a series of operations performed on data, in consecutive steps (or in parallel)
  • The set of operations varies, but the main goals are:
    • Saving time on repeated tasks
    • Splitting large tasks into smaller ones (possibly with a different framework for each part)
    • Reproducibility

7 of 45

Spectrum of Pipelines

  • Extract and load data
    • ETL: Extract, Transform and Load
    • ELT: Extract, Load and Transform
  • Analytics-based: not only load the data, but also analyze it
  • Infrastructure
    • Across servers and platforms, with parallel processing
    • Language/library-specific, mainly for having cleaner, reproducible code
  • The final architecture depends on the project and our purpose
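
To make the ETL flavor concrete, here is a minimal sketch in Python; the file names and columns are hypothetical, and a real pipeline would add logging and error handling:

    import pandas as pd

    def extract(path):
        # Extract: read raw data (could equally come from an API or a database)
        return pd.read_csv(path)

    def transform(df):
        # Transform: clean and enrich the raw records
        df = df.dropna()
        return df.assign(revenue=df["price"] * df["quantity"])

    def load(df, path):
        # Load: persist the result for downstream consumers
        df.to_parquet(path, index=False)

    load(transform(extract("raw_sales.csv")), "clean_sales.parquet")

Swapping the order of transform() and load() turns this ETL sketch into ELT.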

8 of 45

Apache Spark

  • A lightning-fast cluster computing technology
  • A unified analytics engine for large-scale data processing
  • Pros:
    • Very very fast!
    • Supports multiple languages (Python, R, etc.)
  • Cons:
    • High RAM consumption
    • File management issues
  • More: https://trustradius.com/products/apache-spark/reviews?qs=pros-and-cons

Source: https://databricks.com/spark/about
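
As a small taste of the API, a minimal PySpark sketch (assumes pyspark is installed; the file and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start (or reuse) a local Spark session
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Read a CSV into a distributed DataFrame
    df = spark.read.csv("events.csv", header=True, inferSchema=True)

    # The aggregation is executed in parallel across the cluster
    df.groupBy("user_id").agg(F.count("*").alias("n_events")).show()

    spark.stop()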

9 of 45

Kafka

  • Open source
  • Very popular
  • Especially useful for analyzing real-time data feeds
  • Also useful in handling logs
  • Disadvantages:
    • Queuing mechanism is not the best
    • Integrated tools are far from ideal

Source: https://axual.com/what-is-apache-kafka/
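
For illustration, a rough sketch with the third-party kafka-python client (the broker address and topic name are hypothetical):

    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish an event to a topic
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("user-events", b'{"user": 42, "action": "click"}')
    producer.flush()

    # Consumer (normally a separate process): read the event stream back
    consumer = KafkaConsumer("user-events",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest")
    for message in consumer:
        print(message.value)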

10 of 45

Airflow

  • Originally built at Airbnb
  • Open source
  • Can distribute tasks for parallel processing
  • Integrates nicely with Python
  • Disadvantage: it has a steep learning curve.
    • To work well with Airflow you need DevOps knowledge
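
To give a feel for it, a minimal sketch of a two-task DAG (Airflow 2.x style; the task contents are hypothetical):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("fetching data")

    def transform():
        print("cleaning data")

    # A DAG that runs daily; extract must finish before transform starts
    with DAG(dag_id="daily_etl",
             start_date=datetime(2025, 1, 1),
             schedule_interval="@daily",
             catchup=False) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t1 >> t2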

11 of 45

Scikit-learn Pipelines + Transformers

  • Helpful in writing code that is:
    • Cleaner
    • Easier to expand/debug
    • Reproducible
  • Not necessarily useful in parallel processing or task distribution
  • We will see how to use them in the lab session

Image: https://youtube.com/watch?v=jzKSAeJpC6s
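
As a preview of the lab session, a minimal sketch chaining an imputer, a scaler and a model into one reproducible object:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Every step is applied in order; train and test data are guaranteed
    # to go through identical preprocessing
    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))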

12 of 45

Pandas Pipes

  • Similar to Scikit-learn pipelines in spirit
    • They help in writing cleaner code,
    • But not in parallel processing or task scheduling
  • Basically, we can wrap a lot of what we did in our Colab notebooks in a Pandas pipe (see the sketch below)
  • We will see how to use them in the lab session

https://hackersandslackers.com/merge-dataframes-with-pandas/
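
A minimal sketch of the idea (the columns and steps are hypothetical):

    import pandas as pd

    def drop_missing(df):
        return df.dropna()

    def add_total(df, price_col, qty_col):
        # Extra arguments are passed straight through by .pipe
        return df.assign(total=df[price_col] * df[qty_col])

    df = pd.DataFrame({"price": [10.0, 20.0, None], "qty": [1, 2, 3]})

    # Steps read top-to-bottom instead of being nested inside-out
    clean = (df
             .pipe(drop_missing)
             .pipe(add_total, "price", "qty"))
    print(clean)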

13 of 45

No-code Pipelines

  • A rather new trend to let less tech-savvy users benefit from pipelines
  • They integrate with data sources (e.g. Google, Facebook, Amazon, etc.), load the data and analyze it
    • Hevo Data
    • Datapipelines.com
    • Datado
    • Many more, and many to come!

Image: https://siliconangle.com/2021/12/17/hevo-data-closes-30m-round-provide-data-pipelines-cloud/

14 of 45

Cloud-based Pipeline

  • Pipelines can be defined in a cloud-based system
    • Your data is fetched from the source you specify,
    • The result is shown to you on a dashboard
    • Less effort on infrastructure
    • Usually not free
  • Examples
    • Keboola
    • Stitch
    • Etleap
    • Integrate.io
    • Many many more!

15 of 45

Some Examples of Real Data Pipelines

Pinterest: over 100M MAU doing 10B+ pageviews per month

Airbnb: over 100M users browsing over 2M listings

16 of 45

Automated Machine Learning (AutoML)

Image: https://datafloq.com/read/why-automated-machine-learning-important/

17 of 45

AutoML, Its Roots and Applications

  • AutoML tools were created to simplify the process of hyperparameter tuning
  • Currently, they can help in data cleaning/preprocessing, model selection, hyperparameter tuning, prediction and visualization!
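
To see what got automated first, here is the kind of manual hyperparameter search (a scikit-learn grid search over a toy dataset) that early AutoML was built to replace:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = load_iris(return_X_y=True)

    # Try every combination with cross-validation; AutoML tools replace
    # this hand-written grid with a smarter, broader search
    grid = GridSearchCV(RandomForestClassifier(random_state=0),
                        param_grid={"n_estimators": [50, 100],
                                    "max_depth": [3, None]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)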

18 of 45

AutoML Range of Applications

  • Hyperparameter Tuning
  • Data Exploration and Understanding
  • Feature Engineering
  • Model Selection
  • Time Efficiency

19 of 45

AutoML Tools

  • Many available options:
    • Auto-WEKA: simultaneous selection of the algorithm and its hyperparameters
    • Auto-sklearn: an extension of the Auto-WEKA idea to scikit-learn
    • AutoKeras/Auto-PyTorch: mainly for deep learning methods
    • TPOT: a data-science assistant that optimizes ML pipelines using genetic programming
    • Many many more!
  • We will explore some of them in the lab session
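
As one example, TPOT exposes a scikit-learn-style interface; a rough sketch (the exact arguments may vary between TPOT versions):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Genetic programming searches over whole pipelines, not just models
    tpot = TPOTClassifier(generations=5, population_size=20, random_state=0)
    tpot.fit(X_train, y_train)
    print(tpot.score(X_test, y_test))

    # Export the best pipeline it found as plain scikit-learn code
    tpot.export("best_pipeline.py")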

20 of 45

Cloud AutoML

  • Targets individuals and companies with limited coding resources
  • Designed to perform ML in the cloud, with a few clicks
  • Multiple options available
  • Could be used to benchmark algorithms developed locally on larger datasets
    • But these services are usually not free

21 of 45

Challenges of Data Science in “Real” Life Applications

Image: https://summitphotography.ca/

Real-life environments usually come with constraints (e.g. cost, resources, speed), which means we need to solve the challenge within those boundaries.

22 of 45

Accuracy vs. Speed vs. Cost Tradeoffs

  • A lot of the time, we want algorithms that are very accurate
  • But before/after deployment, they might:
    • Take too long to design! (engineering mindset)
    • Take too long to return the results
    • Consume a lot of resources (CPU/RAM/Storage)
  • We need to step back a little and probably sacrifice accuracy a bit
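
One way to keep this trade-off honest is to measure both sides; a small sketch comparing a light and a heavy model on the same data:

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for model in (LogisticRegression(max_iter=5000),
                  RandomForestClassifier(n_estimators=500, random_state=0)):
        model.fit(X_train, y_train)
        start = time.perf_counter()
        acc = model.score(X_test, y_test)
        elapsed = time.perf_counter() - start
        # The "best" model depends on how much latency/compute we can afford
        print(f"{type(model).__name__}: accuracy={acc:.3f}, predict time={elapsed:.4f}s")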

23 of 45

Accuracy vs. Explainability (vs. Manipulation) Trade Offs

  • Explainability might become an issue in certain cases (e.g. loan approval algorithms)
  • In some cases, we not only need to predict something, but may also need to manipulate user behavior in order to improve that prediction.
    • airline pricing
    • order fulfillment probability in Ubaar (MLP instead of RF)

24 of 45

Cold Start Problem: What to do When We Have No Data?!

25 of 45

Cold Start Problem

  • When you start from scratch, you do not have data to work with on day 1
    • Recommendation systems
    • Pricing algorithms
  • Possible “remedies”:
    • Get data from other resources (e.g. other datasets, crawling, social media, etc.)
    • Rule-based/intuition-based algorithms to get started
    • Deploy first, gather data later
    • More: https://expressanalytics.com/blog/cold-start-problem/
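
For instance, a pricing service could launch with a hand-written rule and only switch to a trained model once enough data has been logged; the rule and its constants below are purely hypothetical:

    def estimate_price(distance_km, weight_kg, model=None):
        # Day-1 fallback: an intuition-based linear rule (hypothetical constants)
        if model is None:
            base_fare = 50.0
            return base_fare + 1.2 * distance_km + 0.3 * weight_kg
        # Once enough orders are logged, swap in a trained model
        return model.predict([[distance_km, weight_kg]])[0]

    print(estimate_price(120, 800))  # rule-based until real data arrives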

26 of 45

A/B Testing: Why Do Limited Releases at First?

27 of 45

A/B Testing, Even When You are Pretty Sure!

  • You have designed an algorithm: accuracy is through the roof, resource consumption is low, the model is explainable. Sounds perfect!
  • But make sure to test it on a subset of users first!
    • There might be a feature we didn’t have access to
    • Our model might be based on “assumptions” that are not valid in practice
    • ….
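
A limited release also yields the numbers for a significance check before full rollout; a minimal sketch with a chi-squared test on hypothetical conversion counts:

    from scipy.stats import chi2_contingency

    # Rows: control (old algorithm) vs. treatment (new algorithm)
    # Columns: converted vs. did not convert (hypothetical counts)
    table = [[120, 880],   # control:   12.0% conversion
             [150, 850]]   # treatment: 15.0% conversion

    chi2, p_value, dof, expected = chi2_contingency(table)
    # Roll out to everyone only if the lift is unlikely to be chance
    print(f"p-value = {p_value:.4f}")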

28 of 45

Necessary Skills for an Applied Data Scientist

Image: http://www.leansolutions.gr/blog/what-are-benefits-multi-skilling-in-production/

29 of 45

Another Data Analysis Pipeline/Workflow

Problem → Gathering Data → Data Cleaning/Preprocess → Exploratory Data Analysis → Analysis → Knowledge & Insight

30 of 45

Which Parts Did We Cover in This Course?

31 of 45

What We Have Covered in This Course

Problem → Gathering Data → Data Cleaning/Preprocess → Exploratory Data Analysis → Analysis → Knowledge & Insight

32 of 45

What Remains: Problem Formulation

  • Probably the most important skill of a data analyst/scientist (and, to some extent, of an ML engineer)
  • Translate the business/technical/academic problem into a question that can be answered using data and ML algorithms

Image: https://techgrabyte.com/10-machine-learning-algorithms-application/

33 of 45

What Remains: Problem Formulation

  • What is the value our algorithm is going to create?
  • How do we implement it (given the constraints)?
  • How can it be measured?
  • Requires:
    • Domain knowledge or
    • Curiosity/Trial and error

Image: https://techgrabyte.com/10-machine-learning-algorithms-application/

34 of 45

What Remains: Data Gathering

  • We might need to gather data from data sources ourselves
  • Requires
    • Database management: MySQL, PostgreSQL, MongoDB
    • Working with 3rd party APIs (e.g. Twitter, Facebook, etc.)
    • Crawling existing sources: text/image/video from web
    • Distributed storage/processing: Hadoop, Apache Spark/Flink

Image: https://docs.eazybi.com/eazybi/data-import/external-data-sources
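
As a small taste of API-based gathering, a sketch with the requests library; the endpoint and its fields are hypothetical:

    import pandas as pd
    import requests

    # Fetch JSON records from a (hypothetical) REST endpoint
    response = requests.get("https://api.example.com/v1/posts",
                            params={"limit": 100},
                            timeout=10)
    response.raise_for_status()

    # Flatten the payload into a DataFrame and persist it
    df = pd.DataFrame(response.json())
    df.to_csv("posts.csv", index=False)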

35 of 45

What Remains: Insight Sharing and Storytelling

  • The result of our algorithms is insight
  • Effectively sharing that insight is almost as important as all of your previous endeavours
  • We are going to practice this in the final project presentation and report

Source: https://madelearningdesigns.com/2014/02/06/using-digital-stories-in-elearning/ + https://forbes.com/sites/jeffthomson/2021/05/28/the-power-of-storytelling-in-the-finance-function/

36 of 45

What Remains: Insight Sharing and Storytelling

  • Art of Storytelling
  • Proper visualizations
  • Complete and reader-friendly reports
  • Clear and actionable steps

Source: https://madelearningdesigns.com/2014/02/06/using-digital-stories-in-elearning/ + https://forbes.com/sites/jeffthomson/2021/05/28/the-power-of-storytelling-in-the-finance-function/

37 of 45

Additional Soft Skills That Could Help

  • Soft skills are really, really important for you to excel!
  • Be the owner/boss of what you do:
    • The owner of a task does not wait for others to tell her/him what needs to be done
    • The owner is curious and doesn’t let go until a solution is found (or all options are explored)
    • As a result, he/she almost always over-delivers

38 of 45

Additional Soft Skills That Could Help

  • Be meticulous and always double-check your results first,

  • Then triple-check them with your colleagues (don’t be like me!)
    • Others are going to use your data to take action!
  • Learn time management and respect deadlines

39 of 45

Additional Soft Skills That Could Help

  • Become an expert in finding the fine line between useful and harmful perfectionism:
    • When is it OK to go for 85% accuracy instead of 95%?
    • Have the end goal in mind!
  • Be creative and don’t be scared to test new ideas.
  • And again, be the owner/boss of your tasks and stay up to date!

40 of 45

Some Helpful Resources

Image: https://ideas.ted.com/teds-summer-culture-list-114-podcasts-books-tv-shows-movies-and-more-to-nourish-you/

41 of 45

Courses

Blogs

Community

  • Medium
    • Make sure to sign in to receive recommendations
  • Towards Data Science
  • Machine Learning Mastery
  • KDnuggets

42 of 45

Podcasts

Youtube Channels

  • DataFramed: biweekly podcast that features interviews with practitioners
  • Data Skeptic: covers both bite-sized topics and in-depth issues
  • Lex Fridman Podcast: conversations about science, technology, history and philosophy
  • AI in Business: AI applications in business and their challenges

43 of 45

Data Pipelines

AutoML

Tips for data science in the “real” world

44 of 45

https://towardsdatascience.com/the-absolute-beginners-guide-for-data-science-rookies-736e4fcbff0a

45 of 45

ToDo List for Next Session

  • Check out the Google Colab notebook before our lab session: https://a4re.ir/lab14
  • Don’t forget the survey :) https://forms.gle/4ds2ZkvbjTVC1z2N9
  • No homework on AutoML and pipelines :D