1 of 20

Data Science/ ML

Interviews

  • LLMs

2 of 20

Interview Pillars & Resources

...the never ending list

And now we have LLMs too!

3 of 20

BUT FIRST...

4 of 20

If you have no idea whatsoever...

Just do these courses to get started:

Machine Learning A-Z (Python & R in Data Science Course)

Learn Python for Data Structures, Algorithms & Interviews

... Create a GitHub repo. E.g. this

... Start solving past and present Kaggle competitions, starting with forecasting using LightGBM, XGBoost, and Bayesian optimization

5 of 20

  1. Python / Java Data Structures > LeetCode
  2. Data Engineering > SQL / NoSQL, window functions (lag/lead)
  3. Statistics > Probability [Bayes' Theorem] + Z-score, Expected Values + Markov Chain Principles
  4. A/B Testing > Bonferroni Correction, what happens with multiple metrics, OEC, MVT, Joint Distribution
  5. Product Sense > health of a product, improving the product, root cause analysis
  6. SWE System Design > Design Twitter
  7. ML System Design > Design a Personalized News Feed Rank
  8. Live Coding of data / modeling issue
  9. Writing algorithms from scratch
  10. Behavioral Interviews
  11. Data Science Concepts
  12. MLE Concepts such as NER, Deep Learning
  13. And, LLMs!

Disclaimer: All of this is based on my experience of failing many, many interviews and endlessly reading posts on Reddit, Blind, and LeetCode.

6 of 20

Resources

  1. LeetCode is LeetCode > consistency is key [topics vary from DS to MLE]
    1. https://leetcode.com/discuss/career/450215/How-to-use-LeetCode-to-help-yourself-efficiently-and-effectively-(for-beginners)
    2. https://www.mrcodeswildride.com/challenges/algorithms

https://www.youtube.com/watch?v=-WEpWH1NHGU [start with this]

https://www.interviewquery.com/questions

7 of 20

4. A/B Testing

Sample Size, Power, alpha

Multiple Metrics

Bonferroni Correction

Bonferroni correction - multiple testing

Trustworthy online experiments [you can find a pdf]

Type 1, type 2 errors

Experimental design - randomization unit, MDE

MVT

A/B/n testing, multi-armed bandits

Simpson's paradox

Sample ratio mismatch

Quasi-experiments

Z-test, t-test, ANOVA, ANCOVA, chi-square

Non-normal A/B test

Proportion testing

Summary of A/B Testing
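
Several of the A/B testing topics above (sample size, power, alpha, MDE, Bonferroni correction, proportion testing) reduce to a few lines of arithmetic. The following is a minimal sketch using only the Python standard library, not a production experimentation tool. The function names and the example numbers are illustrative assumptions, and the sample-size formula is the common normal approximation for a two-sided two-proportion z-test.

```python
from math import ceil, sqrt
from statistics import NormalDist


def bonferroni_alpha(alpha: float, num_tests: int) -> float:
    """Per-test significance level when checking several metrics at once."""
    return alpha / num_tests


def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group to detect an absolute lift of `mde`.

    Assumes a two-sided two-proportion z-test (normal approximation).
    """
    p_new = p_base + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical experiment: 10% baseline conversion, 2-point MDE, 5 metrics.
    per_test_alpha = bonferroni_alpha(0.05, 5)
    print(per_test_alpha)
    print(sample_size_per_group(0.10, 0.02))
    print(sample_size_per_group(0.10, 0.02, alpha=per_test_alpha))
    print(two_proportion_p_value(100, 1000, 150, 1000))
```

Note how tightening alpha through the Bonferroni correction increases the required sample size, which is one practical cost of tracking multiple metrics.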

8 of 20

5. Product Sense

  • An important metric goes down, how would you dig into the causes?
  • What metrics would you use to quantify the success of YouTube ads? (This could also be extended to other products such as Snapchat filters, Twitter live-streaming, new Fortnite features, etc.)
  • How do you measure the success or failure of a product or product feature?
  • Google has released a new version of their search algorithm, for which they used A/B testing. During the testing process, engineers realized that the new algorithm was not implemented correctly and returned less relevant results. Two things happened during testing:
    • People in the treatment group performed more queries than the control group.
    • Advertising revenue was higher in the treatment group as well.
    • What may be the cause of people in the treatment group performing more searches than the control group? There are different possible answers here.
  • Product Manager Interview Questions
  • The Product Manager Interview

9 of 20

Resources

6. System Design : Enough Tutorials on YouTube [Tech Dummies is great!]

7. ML System Design: www.boringbot.xyz

8. Live Coding on ML Problems: Kaggle Kaggle Kaggle! [learn how to write precision/recall from scratch]

9. Writing Algorithms from scratch: hamzafarooq/algos: Building ML Algorithms ground-up

10. Behavior : Amazon Leadership Principles

11/12. DS + ML Concepts: 100 Page ML Book

Bonus link for Data Science Concepts: ISLR Textbook Slides, Videos and Resources
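
Item 8 above suggests learning to write precision/recall from scratch. A minimal sketch of what that might look like for binary labels follows. The function name, the `positive` parameter, and the choice to return 0.0 when a denominator is zero are assumptions for illustration, not a fixed convention.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one positive class, without libraries."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    # Assumption: report 0.0 when there are no predicted or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 1, 0, 1, 0]
    print(precision_recall(y_true, y_pred))
```

In a live-coding round it can help to say out loud which edge cases (empty input, no positives) you are handling and why.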

10 of 20

DS Skills

Data Querying
  • Description: Ability to write queries (SQL, MySQL, Hive, etc.) for joining datasets and for summarizing and aggregating data from large-scale databases.
  • Minimum expectations (fixed): Different types of joins and when each is used, GROUP BY, DISTINCT, UNION, basic subqueries, comparators.
  • Good to have (depending on level): Window functions, date/time manipulation, string formatting, running totals, pivoting, lag and lead operations.

Statistics
  • Description: Understanding of the statistical intuition behind samples, populations, and hypothesis testing.
  • Minimum expectations (fixed): Central limit theorem, use cases of the top 3 statistical distributions, p-values, confidence intervals, linear regression, basic parametric tests such as the z-test and t-test.
  • Good to have (depending on level): Effect size, power analysis, sampling techniques, use cases of the top 10 statistical distributions, understanding of L1/L2 regularization.

Machine Learning
  • Description: Understanding of how machine learning works, and of the algorithms and thought process behind the top ML algorithms.
  • Minimum expectations (fixed): Understanding of over/underfitting and training/validation/test sets, ability to deal with uncleaned data, how trees, clustering, logistic regression, and dimensionality reduction work, cross-validation, and how to evaluate which algorithm is better.
  • Good to have (depending on level): Tree pruning, bootstrapping, ensemble models, boosting, ROC curves, parameter tuning, when and how to balance accuracy against interpretability of ML models.
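
The window-function items under Data Querying (lag, lead, running totals) can be practiced without a database server. The sketch below uses Python's built-in `sqlite3` module; the `daily_sales` table, its columns, and the `daily_changes` helper are hypothetical names chosen for illustration, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3


def daily_changes(rows):
    """Return (day, revenue, prev_revenue, next_revenue, running_total) per day."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table for illustration only.
    conn.execute("CREATE TABLE daily_sales (day TEXT, revenue INTEGER)")
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", rows)
    query = """
        SELECT day,
               revenue,
               LAG(revenue)  OVER (ORDER BY day) AS prev_revenue,
               LEAD(revenue) OVER (ORDER BY day) AS next_revenue,
               SUM(revenue)  OVER (ORDER BY day) AS running_total
        FROM daily_sales
        ORDER BY day
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("2024-01-01", 100), ("2024-01-02", 150), ("2024-01-03", 120)]
    for row in daily_changes(sample):
        print(row)
```

The first row has no previous day and the last row has no next day, so `LAG` and `LEAD` return NULL there (`None` in Python), a detail interviewers often probe.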

11 of 20

DS Skills

Fundamentals of Programming
  • Description: Experience in writing basic programs in any language, and understanding of basic data structures.
  • Minimum expectations (fixed): Pseudocode for common problems, loops, counters, edge cases, basic space and time complexity of any approach, ability to write common programs such as finding the area of a triangle or checking for a palindrome.
  • Good to have (depending on level): Object-oriented programming, dictionaries and hash maps, breaking a problem down into subproblems, solid understanding of big-O notation, experience in writing large programs, experience with version control (git).

Applied Math & Probability
  • Description: Solid fundamentals of high school math, numbers, and probability.
  • Minimum expectations (fixed): Understanding of permutations and combinations, fundamental properties of probability, Bayes' theorem, basics of linear algebra and matrices.
  • Good to have (depending on level): Optimization, intuition for inflection points.
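
Bayes' theorem, listed above as a minimum expectation, is often tested with a "positive test result" style question. The sketch below works one such case; the function name and the numbers (1% prior, 99% true-positive rate, 5% false-positive rate) are made up for illustration.

```python
def posterior(prior: float, true_positive_rate: float, false_positive_rate: float) -> float:
    """P(condition | positive signal) via Bayes' theorem."""
    # Total probability of seeing a positive signal at all.
    evidence = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: rare condition, accurate but imperfect test.
    print(posterior(prior=0.01, true_positive_rate=0.99, false_positive_rate=0.05))
```

With these made-up inputs the posterior is about one in six, which illustrates why a low base rate keeps the posterior small even when the test is accurate.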

12 of 20

Language Models

But why do I need to learn all this?

13 of 20

Resources

What is NLP anyways?

Resource: Ultimate Guide to Understand and Implement Natural Language Processing

Natural Language Processing (NLP) is defined as the branch of Artificial Intelligence that provides computers with the capability of understanding text and spoken words in the same way a human being can. It incorporates machine learning models, statistics, and deep learning models into computational linguistics i.e. rule-based modeling of human language to allow

The ultimate Notion Resource - built by my team

14 of 20

15 of 20

Different Kinds of Roles

16 of 20

  • Product Analyst vs data analyst vs business analyst
  • Research vs applied research
  • Research scientist vs research engineer
  • Data scientist vs machine learning engineer

17 of 20

Product Analyst

Data Analyst

Metrics for Product growth/health

Pretty much a Data Analyst

Ex: Our MAU, DAU are down by 10%

Ex: derive insights for all different kind of users across the board

Focus on immediate commercial outcome

Focus on long term outcome

18 of 20

Data scientist
  • Extract knowledge and insights from structured and unstructured data
  • Use data to help company make decisions
  • Is a scientist -> engineering isn’t a top priority

ML Engineers
  • ML models learn from data -> ML is part of data science
  • Develop models to turn data into products
  • Is an engineer -> engineering is a top priority

Caveats

  • MLEs at startups might spend most of their time wrangling data, understanding data, setting up infrastructure, and deploying models instead of training ML models.

19 of 20

Research
  • Find the answers for fundamental questions and expand the body of theoretical knowledge.
  • Ex: develop a new learning method for unsupervised transfer learning
  • Focus on long term outcome

Applied research
  • Find solutions to practical problems
  • Ex: develop techniques to make that new learning method work on a real world dataset
  • Focus on immediate commercial outcome

Caveats

  • Cutting-edge research is spearheaded by big corporations
  • Lacking theories to explain methods that work well empirically

20 of 20