
DataOps For The Modern Computer Vision Stack

James Le


Presenter Profile

James Le

Now

  • Data Advocate
  • Data Writer
  • Data Podcaster

Before

  • Machine Learning Researcher
  • Data Scientist
  • Data Journalist

Interests

  • Data/ML Infrastructure
  • Venture Capital
  • Community-Led Growth


Agenda

1. What Is DataOps?

2. Why DataOps For Computer Vision?

3. DataOps Key Principles

4. DataOps Pipeline for the Computer Vision Stack

5. Data Challenges for Computer Vision Teams

6. The Future of the Modern Computer Vision Stack


What Is DataOps?


DataOps vs DevOps

Source: DevOps vs DataOps (by Sprinkle Data)


DataOps vs MLOps


What Led To The Rise of DataOps?

  1. Massive Volumes of Complex Data
  2. Technology Overload
  3. Diverse Roles and Mandates

Source: Modern Analytics Stack (by Datafold)

Source: What is DataOps? (by Atlan)


The DataOps Landscape

Source: What is DataOps? (by Gradient Flow)


Why DataOps For Computer Vision?


Why DataOps For Computer Vision? (1/3)

Data Is More Important Than Models

This sentiment is conveyed by François Chollet, the creator of Keras (Source: Twitter)


Why DataOps For Computer Vision? (2/3)

Unstructured Data Preparation Is Challenging

The RarePlanes dataset incorporates both real and synthetically generated satellite imagery (Source: Superb AI)


Why DataOps For Computer Vision? (3/3)

Building Computer Vision Applications Is Iterative

The Two Loops of Building Algorithmic Products (Source: Taivo Pungas)


DataOps Key Principles


Principle 1 - Implement Best Practices for Development

Follow Software Engineering Cycle Guidelines

  • Version control
  • Code reviews
  • Unit testing
  • Artifacts management
  • Release automation
  • Infrastructure as code
  • OSS Tools: Git, Docker, Terraform

Source: Engineering Best Practices for ML (by Alex Serban)

Source: Rules of ML (by Google)
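
One of the practices listed above, unit testing, applies to data code just as it does to application code. A minimal sketch (the function and its test are illustrative, not from the talk):

```python
def normalize_pixels(pixels):
    """Scale raw 8-bit pixel values (0-255) into the [0.0, 1.0] range."""
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel values must be in [0, 255]")
    return [p / 255.0 for p in pixels]

def test_normalize_pixels():
    # Boundary values map to the ends of the target range.
    assert normalize_pixels([0, 255]) == [0.0, 1.0]
    # Out-of-range input fails loudly instead of silently corrupting data.
    try:
        normalize_pixels([300])
        assert False, "expected ValueError"
    except ValueError:
        pass

test_normalize_pixels()
```

Checks like this run in CI on every commit, the same way application tests do.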


Principle 2 - Automate and Orchestrate All Data Flows

Continuous Integration and Continuous Delivery

  • Automate deployment with CI/CD pipelines
  • Discourage manual data wrangling
  • Run the data flows using an orchestrator
    • Backfilling
    • Scheduling
    • Pipeline metrics
  • OSS Tools: Airflow, Dagster, Prefect
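
The backfilling and scheduling ideas above can be sketched in plain Python (names here are illustrative; a real deployment would hand this loop to Airflow, Dagster, or Prefect, which add retries, dependencies, and metrics):

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield every daily run date from start to end inclusive, oldest first,
    i.e. the schedule an orchestrator walks when backfilling a daily pipeline."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def run_pipeline(run_date):
    # Placeholder task body: in a real DAG this step would extract,
    # transform, and load the data partition for run_date.
    return f"processed partition {run_date.isoformat()}"

runs = [run_pipeline(d) for d in backfill_dates(date(2022, 1, 1), date(2022, 1, 3))]
```

The point is that every partition is produced by the same automated code path, never by one-off manual wrangling.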


Principle 3 - Test Data Quality In All Stages of the Data Lifecycle

Source: Why Data Quality Is Key to Successful MLOps (by Superconductive)

Continuous Testing

  • Test the data arriving from sources
    • Data unit tests
    • Schema/SQL/Streaming tests
  • Validate data at different stages in the data flow
  • Capture and publish metrics
  • Reuse test tools across projects
  • OSS Tool: great_expectations
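
A data unit test of the kind listed above can be sketched as follows. This is a tiny stand-in for a great_expectations-style check, with an output shaped loosely like that library's validation result (the function name and fields are illustrative):

```python
def expect_column_values_between(rows, column, low, high):
    """Pass only if every value in `column` falls within [low, high]."""
    bad = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not bad, "unexpected_count": len(bad)}

images = [
    {"path": "img_001.jpg", "width": 640},
    {"path": "img_002.jpg", "width": 8000},  # suspicious outlier
]
result = expect_column_values_between(images, "width", 1, 4096)
```

Run at ingestion, after transformation, and before training, the same check catches bad data wherever it enters the flow.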


Principle 4 - Monitor Quality and Performance Metrics Across Data Flows

Source: What is Data Observability? (by Monte Carlo)

Improve Observability

  • Define data quality metrics
    • Technical metrics
    • Functional metrics
    • Performance metrics
  • Visualize metrics
  • Configure meaningful alerts
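
Two of the metric kinds above, a functional metric and an alert condition on freshness, can be sketched like this (the record shape and thresholds are illustrative):

```python
from datetime import datetime, timedelta

def label_null_rate(records):
    """Functional metric: fraction of records with no label attached."""
    missing = sum(1 for r in records if r.get("label") is None)
    return missing / len(records)

def is_stale(last_update, now, max_age_hours=24):
    """Alert condition: fire when the feed has not been refreshed
    within max_age_hours."""
    return now - last_update > timedelta(hours=max_age_hours)

records = [{"label": "car"}, {"label": None}, {"label": "truck"}, {"label": None}]
rate = label_null_rate(records)  # half the records are unlabeled
```

In practice such metrics are exported to a dashboard and wired to alerts rather than computed ad hoc.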


Principle 5 - Build a Common Data and Metadata Model

Source: Automated Data Versioning (by Pachyderm)

Focus on Data Semantics

  • Create a common data model
  • Share the same terminology and schemas
    • Development teams
    • Data teams
    • Business teams
  • Use a data catalog to share knowledge
  • OSS Tools: dbt, Amundsen, DataHub, Marquez
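
A common data model can start as small as one shared record definition that labeling, training, and analytics all read the same way. A sketch (field names are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Annotation:
    """Shared schema for one bounding-box label, so every team agrees
    on what 'bbox' and 'class_name' mean."""
    image_id: str
    class_name: str
    bbox: tuple  # (x, y, width, height) in pixels

ann = Annotation(image_id="img_001", class_name="car", bbox=(10, 20, 50, 40))
record = asdict(ann)  # plain dict, ready for a catalog or warehouse row
```

Publishing the definition in a data catalog (Amundsen, DataHub) is what turns it from one team's convention into shared terminology.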


Principle 6 - Empower Collaboration Among Data Stakeholders

Cross-Functional Teams

  • Use knowledge in cross-functional teams
    • Define important metrics and KPIs
    • Shared-objectives with business goals
  • Remove bottlenecks for data usage
    • Self-service data monitoring
    • Democratize access to the data


DataOps For The Computer Vision Stack


Proposed DataOps for the Modern Computer Vision Stack


Key Data Challenges For Computer Vision Teams


Challenge 1: Curate High-Quality Data Points

Pain Points

  1. Require domain knowledge
  2. Can’t deal with the 4 Vs of big data (Volume, Velocity, Variety, Veracity)
  3. Narrow solutions

Solutions

  1. Visualize massive datasets
  2. Discover and retrieve data with ease
  3. Curate diverse scenarios
  4. Integrate seamlessly with existing workflows and tools
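
The "curate diverse scenarios" solution can be sketched as a per-scenario cap, so rare scenes survive curation instead of being drowned out by the dominant class (the field names and tags are illustrative):

```python
from collections import defaultdict

def curate_diverse(samples, per_scenario):
    """Keep at most per_scenario samples per scenario tag."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s["scenario"]].append(s)
    curated = []
    for group in buckets.values():
        curated.extend(group[:per_scenario])
    return curated

samples = (
    [{"id": i, "scenario": "day"} for i in range(100)]
    + [{"id": 900, "scenario": "night"}]
)
subset = curate_diverse(samples, per_scenario=2)  # 2 day + 1 night
```

Real curation tools rank within each bucket by embedding diversity or model uncertainty rather than taking the first few, but the stratification idea is the same.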


Challenge 2: Label and Audit Data at Massive Scale

Pain Points

  1. Manual labeling and quality assurance is painfully slow
  2. Label quality is bad when dealing with domain-specific datasets and hard edge cases

Solutions

  1. Automatically label data
  2. Identify and audit hard labels
  3. Use active learning for human verification of labels
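
The active-learning solution above reduces to a selection rule: route only the least certain model labels to humans. A minimal sketch (record shape is illustrative):

```python
def queue_for_review(predictions, budget):
    """Send the `budget` least-confident model labels to human annotators
    and auto-accept the rest."""
    ranked = sorted(predictions, key=lambda p: p["confidence"])
    return ranked[:budget], ranked[budget:]

preds = [
    {"image": "a.jpg", "confidence": 0.98},
    {"image": "b.jpg", "confidence": 0.41},
    {"image": "c.jpg", "confidence": 0.87},
]
to_review, auto_accepted = queue_for_review(preds, budget=1)
```

Each round of corrected labels retrains the model, which shifts where the low-confidence cases fall, which is what makes the loop "active".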


Challenge 3: Account For Data Drift

Pain Points

  1. Upstream process changes
  2. Data quality issues
  3. Natural drift in the data
  4. Covariate shift

Solutions

  1. Detect data drifts and raise alerts
  2. Analyze where and why drift happens
  3. Adapt to drift and improve model performance
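
The detect-and-alert step can be sketched with a deliberately crude check on a feature's mean. Production monitors use proper statistical tests (Kolmogorov-Smirnov, population stability index); this only illustrates the shape of the mechanism, and the threshold and feature are assumptions:

```python
def mean_shift_alert(reference, current, threshold=0.2):
    """Alert when a feature's mean moves by more than `threshold`
    (relative) versus the reference window."""
    ref_mean = sum(reference) / len(reference)
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) / abs(ref_mean) > threshold

brightness_ref = [0.52, 0.49, 0.51, 0.50]  # historical daytime footage
brightness_now = [0.21, 0.18, 0.22, 0.19]  # cameras now capture at night
drifted = mean_shift_alert(brightness_ref, brightness_now)
```

Raising the alert is the easy part; the analyze-and-adapt steps above are what close the loop.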


The Future of the Modern Computer Vision Stack


Following The Footsteps of The Modern Data Stack

The Modern Data Stack is a collection of cloud-native tools centered around a cloud data warehouse.

Benefits:

  1. Ease of Use
  2. Wide Adoption
  3. Automation
  4. Cost


The Canonical Stack for Machine Learning


Startup Opportunities in ML Infrastructure

Source: Startup Opportunities in ML Infrastructure (by Leigh-Marie Braswell)


Thank you!

James Le

Website: jameskle.com

Twitter: @le_james94

Email: james.le@superb-ai.com
