Process and Technical Debt

Machine Learning in Production

Machine Learning in Production/AI Engineering · Claire Le Goues & Austin Henley · Spring 2025

Process...

Readings

Required Reading:

  • Alex Serban, Koen van der Blom, Holger Hoos, and Joost Visser. 2020. "Adoption and Effects of Software Engineering Best Practices in Machine Learning." In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM '20).

Suggested Readings:

  • Fowler and Highsmith. The Agile Manifesto
  • Steve McConnell. Software project survival guide. Chapter 3
  • Kruchten, Philippe, Robert L. Nord, and Ipek Ozkaya. "Technical debt: From metaphor to theory and practice." IEEE Software 29, no. 6 (2012): 18-21.

Learning Goals

  • Overview of common data science workflows (e.g., CRISP-DM)
    • Importance of iteration and experimentation
    • Role of computational notebooks in supporting data science workflows
  • Overview of software engineering processes and lifecycles: costs and benefits of process, common process models, role of iteration and experimentation
  • Contrasting data science and software engineering processes, goals and conflicts
  • Integrating data science and software engineering workflows in process model for engineering AI-enabled systems with ML and non-ML components; contrasting different kinds of AI-enabled systems with data science trajectories
  • Overview of technical debt as metaphor for process management; common sources of technical debt in AI-enabled systems

What is Process?

Software Process

“The set of activities and associated results that produce a software product”

A structured, systematic way of carrying out these activities

Q. Examples?

Example of Process Activities?

Developers dislike processes

What does a developer’s day look like?

  • How many hours do they spend in meetings, coding, testing, debugging, documenting, etc.?

Developers' view of processes

What developers want

What developers think of processes

What eventually happens anyway

Hypothesis: Process increases flexibility and efficiency; an upfront investment pays off with greater returns later

Survival Mode

Missed deadlines -> "solo development mode" to meet own deadlines

Ignore integration work

Stop interacting with testers, technical writers, managers, ...

-> Results in further project delays, added costs, poor product quality...

Example of Process Problems?

Example: Healthcare.gov

  • Launched October 2013; high demand (5x expected) caused the site to crash
  • UI incomplete (e.g., missing drop-down menu); missing/incomplete insurance data; log-in system also crashed for IT technicians
  • On the first day, 6 users managed to register
  • Initial budget: 93.7M USD; final cost: 1.7B USD

Example: Healthcare.gov

  • Lack of experience: "...and project managers had little knowledge on the amount of work required and typical product development processes"
  • Lack of leadership: "...no formal division of responsibilities in place...a lack of communication when key decisions were made"
  • Schedule pressure: "...employees were pressured to launch on time regardless of completion or the amount (and results) of testing"

Case Study: Real Estate Website

ML Component: Predicting Real Estate Value

Given a large database of house sales and statistical/demographic data from public records, predict the sales price of a house.
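
Before any modeling, a naive baseline gives later models something to beat. The sketch below is only an illustration: the sales records are made-up toy values, and the helper names (fit_price_per_sqft, predict_price) are hypothetical rather than part of the case study.

```python
from statistics import median

# Toy, made-up sales records (hypothetical values, not real data).
sales = [
    {"sqft": 1000, "price": 200_000},
    {"sqft": 1500, "price": 330_000},
    {"sqft": 2000, "price": 400_000},
    {"sqft": 2500, "price": 550_000},
]


def fit_price_per_sqft(records):
    """Return the median price per square foot across past sales."""
    return median(r["price"] / r["sqft"] for r in records)


def predict_price(sqft, price_per_sqft):
    """Naive baseline: area times the median price per square foot."""
    return sqft * price_per_sqft


rate = fit_price_per_sqft(sales)
estimate = predict_price(1800, rate)
print(f"median rate: {rate:.0f} USD/sqft, estimate for 1800 sqft: {estimate:.0f} USD")
```

A baseline like this ignores location and demographics entirely; its value is as a reference point when judging whether a learned model is worth its added complexity.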

What's your process?

Q. What steps would you take to build this component?

Exploratory Questions

  • What exactly are we trying to model and predict?
  • What types of data do we need?
  • What type of model works the best for this problem?
  • What are the right metrics to evaluate the model performance?
  • What is the user actually interested in seeing?
  • Will this product actually help with the organizational goals?
  • ...
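
The metric question is not cosmetic: the same predictions can look fine under one metric and poor under another. A minimal sketch, assuming two made-up houses and two common error metrics (mean absolute error and mean absolute percentage error) chosen here only for illustration:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same unit as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute difference relative to the actual price."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Made-up values: both predictions are off by 50,000 USD.
actual = [100_000, 1_000_000]
predicted = [150_000, 950_000]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
print(f"MAE: {mae:.0f} USD, MAPE: {mape:.1%}")
```

Both houses contribute equally to the absolute error, but the cheaper house is off by 50% and the expensive one by 5%. Which view matters depends on what the user is actually interested in seeing.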

Time estimation

Hofstadter’s Law

Is Estimation Evil?

Data Science: Iteration and Exploration

Data Science is Iterative and Exploratory

Science mindset: start with rough goal, no clear specification, unclear whether possible

Heuristics and experience to guide the process

Trial and error, refine iteratively, hypothesis testing

Go back to data collection and cleaning if needed, revise goals

Different Trajectories

Computational Notebooks

  • Origins in "literate programming", interleaving text and code, treating programs as literature (Knuth, 1984)
  • First notebook in Wolfram Mathematica 1.0 in 1988
  • Document with text and code cells, showing execution results under cells
  • Code is executed one cell at a time in a kernel
  • Many notebook implementations and supported languages, Python + Jupyter currently most popular

Notebooks Support Iteration and Exploration

Quick feedback, similar to REPL

Visual feedback including figures and tables

Incremental computation: reexecuting individual cells

Quick and easy: copy paste, no abstraction needed

Easy to share: document includes text, code, and results

Share Experience?

Brief Discussion: Notebook Limitations and Drawbacks?

Different Trajectories

Software Process Models

Ad-hoc Processes

  1. Discuss the software that needs to be written
  2. Write some code
  3. Test the code to identify the defects
  4. Debug to find causes of defects
  5. Fix the defects
  6. If not done, return to step 1

Waterfall Model

Understand requirements, plan & design before coding, test & deploy

Looks like mass manufacturing?

Problems with Waterfall?

Risk First: Spiral Model

Incremental prototypes, starting with most risky components

Constant iteration: Agile

  • Constant interactions with customers, constant replanning
  • Scrum: Break into sprints; daily meetings, sprint reviews, planning

Selecting Process Models

Individually, vote in #lecture slack: [1] Ad-hoc [2] Waterfall [3] Spiral [4] Agile

Data Science vs Software Engineering

Discussion: Iteration in Notebook vs Agile?

Poor Software Engineering Practices in Notebooks?

  • Little abstraction
  • Global state
  • No testing
  • Heavy copy and paste
  • Little documentation
  • Poor version control
  • Out of order execution
  • Poor development features (vs IDE)
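
Global state and out-of-order execution interact badly. The sketch below simulates this outside a notebook, under stated assumptions: each function stands in for one hypothetical notebook cell, and the dictionary named state plays the role of the kernel's global namespace. The prices are made-up values.

```python
def cell_1(state):
    # Load raw prices in USD.
    state["prices"] = [200_000, 330_000, 400_000]


def cell_2(state):
    # Convert to thousands of USD in place; not safe to run twice.
    state["prices"] = [p / 1000 for p in state["prices"]]


def cell_3(state):
    # Summarize whatever happens to be in the shared state right now.
    state["mean_price"] = sum(state["prices"]) / len(state["prices"])


def run(cells):
    """Execute cells in the given order against a fresh namespace."""
    state = {}
    for cell in cells:
        cell(state)
    return state["mean_price"]


top_to_bottom = run([cell_1, cell_2, cell_3])
cell_2_run_twice = run([cell_1, cell_2, cell_2, cell_3])
print(top_to_bottom, cell_2_run_twice)
```

The saved notebook looks identical in both cases, yet the reported mean differs by a factor of 1000 depending on how often cell 2 was executed. Restarting the kernel and running all cells top to bottom is the usual way to check for this.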

Understanding Data Scientist Workflows

Instead of blindly recommending "SE Best Practices", understand the context

Documentation and testing not a priority in exploratory phase

Help with transitioning into practice

  • From notebooks to pipelines
  • Support maintenance and iteration once deployed
  • Provide infrastructure and tools
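
One possible shape of the move from notebooks to pipelines is to turn exploratory cells into small pure functions that can be composed and tested. This is a sketch, not a prescribed method; the function names and the record fields (sqft, price) are hypothetical.

```python
def clean(rows):
    """Drop rows with a missing or non-positive area or price."""
    return [
        r for r in rows
        if r.get("sqft") and r.get("price") and r["sqft"] > 0 and r["price"] > 0
    ]


def add_features(rows):
    """Return new rows with a derived price_per_sqft feature."""
    return [{**r, "price_per_sqft": r["price"] / r["sqft"]} for r in rows]


def pipeline(rows):
    """Compose the stages; each stage takes and returns plain data."""
    return add_features(clean(rows))


def test_pipeline_drops_incomplete_rows():
    rows = [
        {"sqft": 1000, "price": 200_000},
        {"sqft": None, "price": 100_000},  # incomplete record
    ]
    result = pipeline(rows)
    assert len(result) == 1
    assert result[0]["price_per_sqft"] == 200.0


test_pipeline_drops_incomplete_rows()
```

Because the stages do not depend on global state, they can be rerun in any environment, which is what maintenance and iteration after deployment require.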

Data Science Practices by Software Eng.

  • Many software engineers get involved in data science without explicit training
  • Copying from public examples, little reading of documentation
  • Lack of data visualization/exploration/understanding, no focus on data quality
  • Strong preference for code editors, non-GUI tools
  • Improve model by adding more data or changing models, rarely feature engineering or debugging
  • Lack of awareness about overfitting/bias problems, single focus on accuracy, no monitoring
  • More system thinking about the product and its needs

Integrated Process for AI-Enabled Systems

Recall: ML models are system components

Process for AI-Enabled Systems

  • Integrate Software Engineering and Data Science processes
  • Establish system-level requirements (e.g., user needs, safety, fairness)
  • Inform data science modeling with system requirements (e.g., privacy, fairness)
  • Try risky parts first (most likely including ML components; ~spiral)
  • Incrementally develop prototypes, incorporate user feedback (~agile)
  • Provide flexibility to iterate and improve
  • Design system with characteristics of AI component (e.g., UI design, safeguards)
  • Plan for testing throughout the process and in production
  • Manage the project with an understanding of both software engineering and data science workflows
  • No existing "best practices" or workflow models

Trajectories

Not every project follows the same development process, e.g.

  • Small ML addition: Product first, add ML feature later
  • Research only: Explore feasibility before thinking about a product
  • Data science first: Model as central component of potential product, build system around it

Different focus on system requirements, qualities, and upfront planning

Manage interdisciplinary teams and different expectations

Technical debt

Technical Debt Metaphor

Analogy to financial debt

  • Make a decision for an immediate benefit (e.g., release now)
  • Accepting later cost (loss of productivity, higher maintenance and operating cost, rework)
  • Debt accumulates and can suffocate project

Ideally, a deliberate decision (short term tactical or long term strategic)

Ideally, track debt and plan for paying it down later

Q. Examples?

What causes technical debt?

Technical Debt: Examples

Prudent & deliberate: Skip using a CI platform

  • Reason for debt: Short deadline; test the product viability with alpha users using a prototype
  • Debt payback: Refactoring effort to integrate the system into CI

Reckless & inadvertent: Forget to encrypt user credentials in DB

  • Reason for debt: Lack of in-house security expertise
  • Debt payback: Security vulnerabilities & fallouts from an attack (loss of data); effort to retrofit security into the system
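
Tracking debt can be as simple as keeping a register with the same fields used in the two examples above. The sketch below is one hypothetical way to record them; the DebtItem class and its field names are assumptions made for illustration, not an established format.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    """One entry in a hypothetical technical debt register."""
    description: str
    prudent: bool
    deliberate: bool
    reason: str
    payback: str
    paid: bool = False

    @property
    def quadrant(self):
        care = "prudent" if self.prudent else "reckless"
        intent = "deliberate" if self.deliberate else "inadvertent"
        return f"{care} & {intent}"


register = [
    DebtItem(
        description="Skip using a CI platform",
        prudent=True,
        deliberate=True,
        reason="Short deadline; test product viability with alpha users",
        payback="Refactoring effort to integrate the system into CI",
    ),
    DebtItem(
        description="User credentials not encrypted in DB",
        prudent=False,
        deliberate=False,
        reason="Lack of in-house security expertise",
        payback="Fallout from an attack; effort to retrofit security",
    ),
]

open_items = [item.description for item in register if not item.paid]
for item in register:
    print(f"{item.quadrant}: {item.description}")
```

Inadvertent debt can only be entered once someone notices it, so a register like this mainly helps with the deliberate kind.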

Breakout: Technical Debt from ML

As a group in #lecture, tagging members: Post two plausible examples of technical debt in a housing price prediction system:

  1. Deliberate, prudent:
  2. Reckless, inadvertent:

Technical Debt through Notebooks?

Jupyter Notebooks are a gift from God to those who work with data. They allow us to do quick experiments with Julia, Python, R, and more -- John Paul Ada

ML and Technical Debt

Often reckless and inadvertent in inexperienced teams

ML can seem like an easy addition, but it may cause long-term costs

Needs to be maintained, evolved, and debugged

Goals may change, environment may change, some changes are subtle

Example problems: ML and Technical Debt

  • Systems and models are tangled and changing one has cascading effects on the other
  • Untested, brittle infrastructure; manual deployment
  • Unstable data dependencies, replication crisis
  • Data drift and feedback loops
  • Magic constants and dead experimental code paths
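
Data drift is one of the subtle changes that is easy to miss without a check. A minimal sketch, assuming a single numeric feature, made-up values, and a hypothetical alert threshold; real drift detection typically compares full distributions rather than means.

```python
from statistics import mean, stdev


def drift_score(training_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)


# Made-up feature values: house sizes seen in training vs. in production.
training_sqft = [1000, 1500, 2000, 2500]
live_sqft = [3000, 3200, 3400, 3600]

DRIFT_THRESHOLD = 2.0  # hypothetical; named and documented rather than a magic constant

score = drift_score(training_sqft, live_sqft)
drifted = score > DRIFT_THRESHOLD
print(f"drift score: {score:.2f}, drifted: {drifted}")
```

Naming the threshold and recording why it was chosen also addresses the magic-constant problem listed above.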

Controlling Technical Debt from ML Components

  • Avoid AI when not needed
  • Understand and document requirements, design for mistakes
  • Build reliable and maintainable pipelines, infrastructure, good engineering practices
  • Test infrastructure, system testing, testing and monitoring in production
  • Test and monitor data quality
  • Understand and model data dependencies, feedback loops, ...
  • Document design intent and system architecture
  • Strong interdisciplinary teams with joint responsibilities
  • Document and track technical debt
  • ...
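
Testing data quality can start with plain per-record checks run before training and on incoming production data. The sketch below is an assumption-laden illustration: the field names, the accepted range for sqft, and the validate_record helper are all hypothetical.

```python
def validate_record(record):
    """Return a list of data-quality problems found in one sales record."""
    problems = []
    for field in ("sqft", "price", "zip_code"):
        if record.get(field) is None:
            problems.append(f"missing {field}")
    sqft = record.get("sqft")
    if sqft is not None and not (100 <= sqft <= 20_000):  # hypothetical plausible range
        problems.append("sqft out of expected range")
    price = record.get("price")
    if price is not None and price <= 0:
        problems.append("price must be positive")
    return problems


good = {"sqft": 1200, "price": 250_000, "zip_code": "15213"}
bad = {"sqft": 5, "price": -1, "zip_code": None}

print(validate_record(good))
print(validate_record(bad))
```

Counting how many records fail each check over time turns the same function into a simple monitoring signal.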

Summary

Data scientists and software engineers follow different processes

ML projects need to consider process needs of both

Iteration and upfront planning are both important; process models codify good practices

Deliberate technical debt can be good, but too much debt can suffocate a project

Easy to amass (reckless) technical debt with machine learning

Further Reading
