
DataOps For The Modern Computer Vision Stack

James Le


Presenter Profile

James Le

Now

  • Data Advocate
  • Data Writer
  • Data Podcaster

Before

  • Machine Learning Researcher
  • Data Scientist
  • Data Journalist

Interests

  • Data/ML Infrastructure
  • Venture Capital
  • Community-Led Growth


Agenda

1. What Is DataOps?

2. Why DataOps For Computer Vision?

3. DataOps Key Principles

4. DataOps Pipeline for the Computer Vision Stack

5. Data Challenges for Computer Vision Teams

6. The Future of the Modern Computer Vision Stack


What Is DataOps?


DataOps vs DevOps

Source: DevOps vs DataOps (by Sprinkle Data)


DataOps vs MLOps


What Led To The Rise of DataOps?

  1. Massive Volumes of Complex Data
  2. Technology Overload
  3. Diverse Roles and Mandates

Source: Modern Analytics Stack (by Datafold)

Source: What is DataOps? (by Atlan)


The DataOps Landscape

Source: What is DataOps? (by Gradient Flow)


Why DataOps For Computer Vision?


Why DataOps For Computer Vision? (1/3)

Data Is More Important Than Models

This sentiment is conveyed by François Chollet, the creator of Keras (Source: Twitter)


Why DataOps For Computer Vision? (2/3)

Unstructured Data Preparation Is Challenging

The RarePlanes dataset incorporates both real and synthetically generated satellite imagery (Source: Superb AI)


Why DataOps For Computer Vision? (3/3)

Building Computer Vision Applications Is Iterative

The Two Loops of Building Algorithmic Products (Source: Taivo Pungas)


DataOps Key Principles


Principle 1 - Implement Best Practices for Development

Follow Software Engineering Cycle Guidelines

  • Version control
  • Code reviews
  • Unit testing
  • Artifacts management
  • Release automation
  • Infrastructure as code
  • OSS Tools: Git, Docker, Terraform

Source: Engineering Best Practices for ML (by Alex Serban)

Source: Rules of ML (by Google)
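
One of the practices listed above, unit testing, applies to data code just as it does to application code. A minimal sketch (the function and its test are illustrative, not from the talk):

```python
def normalize_pixels(pixels):
    """Scale raw 8-bit pixel values (0-255) into the [0.0, 1.0] range."""
    if any(p < 0 or p > 255 for p in pixels):
        raise ValueError("pixel values must be in [0, 255]")
    return [p / 255.0 for p in pixels]

def test_normalize_pixels():
    # Boundary values map to the ends of the target range.
    assert normalize_pixels([0, 255]) == [0.0, 1.0]
    # Out-of-range input fails loudly instead of silently corrupting data.
    try:
        normalize_pixels([300])
        assert False, "expected ValueError"
    except ValueError:
        pass

test_normalize_pixels()
```

Checks like this run in CI on every commit, the same way application tests do.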


Principle 2 - Automate and Orchestrate All Data Flows

Continuous Integration and Continuous Delivery

  • Automate deployment with CI/CD pipelines
  • Discourage manual data wrangling
  • Run the data flows using an orchestrator
    • Backfilling
    • Scheduling
    • Pipeline metrics
  • OSS Tools: Airflow, Dagster, Prefect
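
The backfilling and scheduling ideas above can be sketched in plain Python (names here are illustrative; a real deployment would hand this loop to Airflow, Dagster, or Prefect, which add retries, dependencies, and metrics):

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield every daily run date from start to end inclusive, oldest first,
    i.e. the schedule an orchestrator walks when backfilling a daily pipeline."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def run_pipeline(run_date):
    # Placeholder task body: in a real DAG this step would extract,
    # transform, and load the data partition for run_date.
    return f"processed partition {run_date.isoformat()}"

runs = [run_pipeline(d) for d in backfill_dates(date(2022, 1, 1), date(2022, 1, 3))]
```

The point is that every partition is produced by the same automated code path, never by one-off manual wrangling.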


Principle 3 - Test Data Quality In All Stages of the Data Lifecycle

Source: Why Data Quality Is Key to Successful MLOps (by Superconductive)

Continuous Testing

  • Test the data arriving from sources
    • Data unit tests
    • Schema/SQL/Streaming tests
  • Validate data at different stages in the data flow
  • Capture and publish metrics
  • Reuse test tools across projects
  • OSS Tool: great_expectations
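
A data unit test of the kind listed above can be sketched as follows. This is a tiny stand-in for a great_expectations-style check, with an output shaped loosely like that library's validation result (the function name and fields are illustrative):

```python
def expect_column_values_between(rows, column, low, high):
    """Pass only if every value in `column` falls within [low, high]."""
    bad = [r for r in rows if not (low <= r[column] <= high)]
    return {"success": not bad, "unexpected_count": len(bad)}

images = [
    {"path": "img_001.jpg", "width": 640},
    {"path": "img_002.jpg", "width": 8000},  # suspicious outlier
]
result = expect_column_values_between(images, "width", 1, 4096)
```

Run at ingestion, after transformation, and before training, the same check catches bad data wherever it enters the flow.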


Principle 4 - Monitor Quality and Performance Metrics Across Data Flows

Source: What is Data Observability? (by Monte Carlo)

Improve Observability

  • Define data quality metrics
    • Technical metrics
    • Functional metrics
    • Performance metrics
  • Visualize metrics
  • Configure meaningful alerts
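
Two of the metric kinds above, a functional metric and an alert condition on freshness, can be sketched like this (the record shape and thresholds are illustrative):

```python
from datetime import datetime, timedelta

def label_null_rate(records):
    """Functional metric: fraction of records with no label attached."""
    missing = sum(1 for r in records if r.get("label") is None)
    return missing / len(records)

def is_stale(last_update, now, max_age_hours=24):
    """Alert condition: fire when the feed has not been refreshed
    within max_age_hours."""
    return now - last_update > timedelta(hours=max_age_hours)

records = [{"label": "car"}, {"label": None}, {"label": "truck"}, {"label": None}]
rate = label_null_rate(records)  # half the records are unlabeled
```

In practice such metrics are exported to a dashboard and wired to alerts rather than computed ad hoc.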


Principle 5 - Build a Common Data and Metadata Model

Source: Automated Data Versioning (by Pachyderm)

Focus on Data Semantics

  • Create a common data model
  • Share the same terminology and schemas
    • Development teams
    • Data teams
    • Business teams
  • Use a data catalog to share knowledge
  • OSS Tools: dbt, Amundsen, DataHub, Marquez
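
A common data model can start as small as one shared record definition that labeling, training, and analytics all read the same way. A sketch (field names are illustrative):

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Annotation:
    """Shared schema for one bounding-box label, so every team agrees
    on what 'bbox' and 'class_name' mean."""
    image_id: str
    class_name: str
    bbox: tuple  # (x, y, width, height) in pixels

ann = Annotation(image_id="img_001", class_name="car", bbox=(10, 20, 50, 40))
record = asdict(ann)  # plain dict, ready for a catalog or warehouse row
```

Publishing the definition in a data catalog (Amundsen, DataHub) is what turns it from one team's convention into shared terminology.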


Principle 6 - Empower Collaboration Among Data Stakeholders

Cross-Functional Teams

  • Use knowledge in cross-functional teams
    • Define important metrics and KPIs
    • Shared-objectives with business goals
  • Remove bottlenecks for data usage
    • Self-service data monitoring
    • Democratize access to the data


DataOps For The Computer Vision Stack


Proposed DataOps for the Modern Computer Vision Stack


Key Data Challenges For Computer Vision Teams


Challenge 1: Curate High-Quality Data Points

Pain Points

  1. Require domain knowledge
  2. Can’t deal with the 4 Vs of big data (Volume, Velocity, Variety, Veracity)
  3. Narrow solutions

Solutions

  1. Visualize massive datasets
  2. Discover and retrieve data with ease
  3. Curate diverse scenarios
  4. Integrate seamlessly with existing workflows and tools
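
The "curate diverse scenarios" solution can be sketched as a per-scenario cap, so rare scenes survive curation instead of being drowned out by the dominant class (the field names and tags are illustrative):

```python
from collections import defaultdict

def curate_diverse(samples, per_scenario):
    """Keep at most per_scenario samples per scenario tag."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[s["scenario"]].append(s)
    curated = []
    for group in buckets.values():
        curated.extend(group[:per_scenario])
    return curated

samples = (
    [{"id": i, "scenario": "day"} for i in range(100)]
    + [{"id": 900, "scenario": "night"}]
)
subset = curate_diverse(samples, per_scenario=2)  # 2 day + 1 night
```

Real curation tools rank within each bucket by embedding diversity or model uncertainty rather than taking the first few, but the stratification idea is the same.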


Challenge 2: Label and Audit Data at Massive Scale

Pain Points

  1. Manual labeling and quality assurance is painfully slow
  2. Label quality is bad when dealing with domain-specific datasets and hard edge cases

Solutions

  1. Automatically label data
  2. Identify and audit hard labels
  3. Use active learning for human verification of labels
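
The active-learning solution above reduces to a selection rule: route only the least certain model labels to humans. A minimal sketch (record shape is illustrative):

```python
def queue_for_review(predictions, budget):
    """Send the `budget` least-confident model labels to human annotators
    and auto-accept the rest."""
    ranked = sorted(predictions, key=lambda p: p["confidence"])
    return ranked[:budget], ranked[budget:]

preds = [
    {"image": "a.jpg", "confidence": 0.98},
    {"image": "b.jpg", "confidence": 0.41},
    {"image": "c.jpg", "confidence": 0.87},
]
to_review, auto_accepted = queue_for_review(preds, budget=1)
```

Each round of corrected labels retrains the model, which shifts where the low-confidence cases fall, which is what makes the loop "active".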


Challenge 3: Account For Data Drift

Pain Points

  1. Upstream process changes
  2. Data quality issues
  3. Natural drift in the data
  4. Covariate shift

Solutions

  1. Detect data drifts and raise alerts
  2. Analyze where and why drift happens
  3. Adapt to drift and improve model performance
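
The detect-and-alert step can be sketched with a deliberately crude check on a feature's mean. Production monitors use proper statistical tests (Kolmogorov-Smirnov, population stability index); this only illustrates the shape of the mechanism, and the threshold and feature are assumptions:

```python
def mean_shift_alert(reference, current, threshold=0.2):
    """Alert when a feature's mean moves by more than `threshold`
    (relative) versus the reference window."""
    ref_mean = sum(reference) / len(reference)
    cur_mean = sum(current) / len(current)
    return abs(cur_mean - ref_mean) / abs(ref_mean) > threshold

brightness_ref = [0.52, 0.49, 0.51, 0.50]  # historical daytime footage
brightness_now = [0.21, 0.18, 0.22, 0.19]  # cameras now capture at night
drifted = mean_shift_alert(brightness_ref, brightness_now)
```

Raising the alert is the easy part; the analyze-and-adapt steps above are what close the loop.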


The Future of the Modern Computer Vision Stack


Following The Footsteps of The Modern Data Stack

The Modern Data Stack is a collection of cloud-native tools centered around a cloud data warehouse.

Benefits:

  1. Ease of Use
  2. Wide Adoption
  3. Automation
  4. Cost


The Canonical Stack for Machine Learning


Startup Opportunities in ML Infrastructure

Source: Startup Opportunities in ML Infrastructure (by Leigh-Marie Braswell)


Thank you!

James Le

Website: jameskle.com

Twitter: @le_james94

Email: james.le@superb-ai.com
