1
Name | Cat | SubCat | Series | $$$ | Started | Website | OSS | Description | Note | IF ACQ
2
DataRobot | All-in-one | E | 430.6 | 2012
https://www.datarobot.com/
DataRobot combines a trusted enterprise AI platform and a trusted AI-native strategic partnership for global enterprises that want to harness the power of AI and their existing teams to succeed in today's Intelligence Revolution.
"We lived and breathed data science,"
Forbes 50 AI companies 2019
3
Luigi | All-in-one | Workflow orchestration | Spotify | 2012 | OSS
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
4
H2O | All-in-one | Framework | D | 146.1 | 2012
https://www.h2o.ai/
OSS
H2O.ai is the creator of H2O, the leading open source machine learning and artificial intelligence platform trusted by data scientists across 14K enterprises.
5
HIVE | All-in-one | Labeling | B | 20.2 | 2013
https://thehive.ai/
Hive is a full-stack deep learning company focused on solving visual intelligence problems. Let us help you join the AI Revolution. End-To-End Solutions. Full-Stack Approach.
6
Databricks | All-in-one | Data management | F | 897 | 2013
https://databricks.com/
Unified Data Analytics Platform - One cloud platform for massive scale data engineering and collaborative data science.
7
Iguazio | All-in-one | C | 72 | 2014
https://www.iguazio.com/
The Iguazio Data Science Platform automates your machine learning pipeline, transforming AI projects into real-world business outcomes.
8
Airflow | All-in-one | Workflow orchestration | Airbnb | 2015
https://airflow.apache.org/
OSS
Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
9
Polyaxon | All-in-one | Serving | 2016
https://polyaxon.com/
OSS
A platform for reproducing and managing the whole life cycle of machine learning and deep learning applications.
10
Dessa | All-in-one | Monitoring | Square | 9 | 2016
https://www.dessa.com/
Create more with machine learning. Build, run & monitor 1000s of ML experiments with Foundations
ACQ
11
Petuum | All-in-one | Data management | B | 108 | 2016
https://petuum.com/
Petuum accelerates and simplifies AI solutions so your enterprise can deploy it easily and maintain it effortlessly.
12
Supervisely | All-in-one | Computer vision | 2017
https://supervise.ly/
First available ecosystem to cover all aspects of training data development. Manage, annotate, validate and experiment with your data without coding.
13
Cadence | All-in-one | Workflow orchestration | Uber | 2017 | OSS
Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
14
Michelangelo | All-in-one | Workflow orchestration | Uber | 2015
Michelangelo, Uber’s machine learning (ML) platform, supports the training and serving of thousands of models in production across the company. Designed to cover the end-to-end ML workflow, the system currently supports classical machine learning, time series forecasting, and deep learning models that span a myriad of use cases ranging from generating marketplace forecasts, responding to customer support tickets, to calculating accurate estimated times of arrival (ETAs) and powering our One-Click Chat feature using natural language processing (NLP) models on the driver app.
15
MLFlow | All-in-one | Experiment tracking | Databricks | 2018
https://mlflow.org/
OSS
An open source platform for the machine learning lifecycle
16
Aible | All-in-one | 2018
https://www.aible.com/
Create AI that delivers impact, not accuracy, with cost-benefit tradeoffs & operational constraints, in a friendly, intuitive UI designed for real business.
17
dotData | All-in-one | Feature engineering | 43 | 2018
https://dotdata.com/
When AutoML is enhanced with AI-powered feature engineering, the result is dotData. We focus on delivering data science automation for the enterprise. End-to-end data science automation platform accelerates, democratizes, and operationalizes the entire data science process.
18
Prefect | All-in-one | Workflow orchestration | 2018
https://www.prefect.io/
OSS
The Global Leader in Dataflow Automation
19
Metaflow | All-in-one | Workflow orchestration | Netflix | 2019
https://metaflow.org/
OSS
Metaflow makes it quick and easy to build and manage real-life data science projects. Metaflow is built for data scientists, not just for machines.
metaflow.org
20
Flyte | All-in-one | Workflow orchestration | Lyft | 2019
https://flyte.org/
OSS
Lyft’s Cloud Native Machine Learning and Data Processing Platform, Now Open Sourced
21
Noodle.ai | All-in-one | AI-as-a-service | B | 51 | 2016
We're on a mission to create a world without waste. We push the limits of data science, helping plan, make, and move goods and resources for manufacturers and complex supply chains.
addresses each failure point in the data pipeline from edge device to on-prem and cloud
Forbes 50 AI companies 2019
22
kedro | All-in-one | Workflow orchestration | McKinsey | 2019 | OSS
Kedro is an open source development workflow tool that helps structure reproducible, scalable, deployable, robust and versioned data pipelines.
23
Valohai | All-in-one | Workflow orchestration | A | 2016
https://valohai.com/
The MLOps platform for the whole team. Valohai takes you from POC to production while managing the whole model lifecycle.
Focus on deep learning. Tooling, technology, framework, and cloud-agnostic.
24
Tecton | All-in-one | Deployment | A | 25 | 2019
https://tecton.ai/
The Data Platform for Machine Learning. Build a library of great features. Serve them in production. Do it at scale.
From the creators of Michelangelo
25
Datagrok | All-in-one | Data processing
https://datagrok.ai/
Datagrok: Swiss Army Knife for Data. A platform for turning data into actionable insights
Can interactively visualize datasets with millions of rows completely in the browser
26
Figure Eight | Data pipeline | Labeling | Appen | 2008
https://www.figure-eight.com/
Figure Eight combines the best of human and machine intelligence to provide high-quality annotated training data that powers the world's most innovative machine learning and business solutions
ACQ
27
Spark | Data pipeline | Data processing | 2009
https://spark.apache.org/
OSS
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
28
Scrapinghub | Data pipeline | Data generation | 2010
https://scrapinghub.com/
OSS
Turn websites into data with the world's leading web scraping services & tools from the creators of Scrapy. Data extraction trusted by industry leaders.
Web crawling | ELI5
29
Alteryx | Data pipeline | Data management | IPO | 163 | 2011
https://www.alteryx.com/
We are a leader in the self-service data analytics movement with a platform that can discover, prep, and analyze all your data, then deploy and share analytics at scale for deeper insights faster than you ever thought possible.
Control Meets Freedom: Unlock the Data Vault and Unleash Your Data Gurus in a Secure Way.
30
Tamr | Data pipeline | Data management | 69.2 | 2012
https://www.tamr.com/
Tamr's leading data management system and services work to create a data migration strategy that simplifies your data unification process. Talk with us today.
Forbes 50 AI companies 2019
31
Aircloak | Data pipeline | Privacy | 1.3 | 2012
https://aircloak.com/
Aircloak's unique approach ensures the existing primary database is not modified in any way. Aircloak handles all data types including unstructured text.
GDPR compliant
32
Prometheus | Data pipeline | Monitoring | 2012
https://prometheus.io/
OSS
An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
33
iMerit | Data pipeline | Labeling | B | 23.5 | 2012
https://imerit.net/
iMerit specializes in data labeling and annotation for purposes of training models for Machine Learning and Artificial Intelligence.
34
Presto | Data pipeline | Database/Query | 2012
https://prestodb.io/
OSS
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
35
Amazon Redshift | Data pipeline | Data warehouse | Amazon | 2012
https://aws.amazon.com/redshift/
Amazon Redshift is a fast, fully managed, and cost-effective data warehouse that gives you petabyte scale data warehousing and exabyte scale data lake analytics together in one service. Amazon Redshift is up to ten times faster than traditional on-premises data warehouses.
36
Apache Druid | Data pipeline | Database | Imply | 2012
https://druid.apache.org/
OSS
Apache Druid is a high performance real-time analytics database
column-oriented database
37
Waterline Data | Data pipeline | Data management | Hitachi Vantara | 37.5 | 2013
https://www.waterlinedata.com/
Waterline's enterprise data catalog enables data professionals to discover, govern, and rationalize an organization's data lake.
ACQ
38
Incorta | Data pipeline | Data processing | C | 72.6 | 2013
https://incorta.com/
Incorta aggregates large complex business data in real time, eliminating the need to reshape it. No Data Warehouse. No Transformations. Real-Time Insight.
39
Igneous | Data pipeline | Data management | C | 67.5 | 2013
https://www.igneous.io/
Igneous Unstructured Data Protection offers the scalability to handle hundreds of file systems, billions of files, and exabytes of enterprise data requiring backup
Unstructured data
40
Rubrik | Data pipeline | Data management | E | 553 | 2013
https://www.rubrik.com/en
We provide a powerful, policy-driven platform to simplify recovery and unlock insights from data residing in the data center and cloud.
41
Quobyte | Data pipeline | Storage | 2013
https://www.quobyte.com/
Quobyte is software defined storage that turns commodity servers into a reliable and highly automated data center file system.
42
Elastifile | Data pipeline | Storage | Google | 2013
https://www.elastifile.com/
Elastifile's cloud-native file storage helps organizations adapt and accelerate their business in the cloud era. Powered by a scalable, enterprise-grade distributed file system with intelligent object tiering, Elastifile augments existing public cloud services with a scalable, POSIX-compliant NAS, facilitating frictionless cloud adoption. With Elastifile, organizations enjoy low-touch file storage services, or deploy and manage cloud-native file storage themselves, eliminating the need for manual storage management and IT forecasting. Elastifile's unique combination of features and flexibility empowers organizations to seamlessly integrate cloud resources, with no application refactoring… thereby modernizing their infrastructure and achieving IT agility and efficiency goals.
43
Datera | Data pipeline | Storage | C | 63.9 | 2013
https://datera.io/
Get sub-200µS latency & millions of IOPS with 100% software-defined data automation. Save up to 70% on data infrastructure total-cost-of-ownership.
44
Cohesity | Data pipeline | Data management | D | 410 | 2013
https://www.cohesity.com/
Eliminate mass data fragmentation with Cohesity's modern approach to data management, beginning with backup. Gain instant recovery. Learn more today.
45
AtScale | Data pipeline | Data management | C | 95 | 2013
https://www.atscale.com/
Freedom of choice for the enterprise. Break free from the complexities and security risks associated with cloud migration and self-service analytics with Intelligent Data Virtualization—no matter where dat.
46
Apache ORC | Data pipeline | File format | 2013
https://orc.apache.org/
OSS
the smallest, fastest columnar storage for Hadoop workloads.
47
Parquet | Data pipeline | File format | Twitter, Cloudera | 2013 | OSS
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
48
Cazena | Data pipeline | Data management | 38 | 2014
https://www.cazena.com/
First Data Lake with a SaaS Experience. Cazena empowers enterprises to collect, store and analyze any data in the cloud, without any DevOps resources or admin time. Cazena's Data Lake as a Service includes everything, and is delivered as secure SaaS, ready to load, store and analyze data with any method: SQL, Spark, R, Python, and many more.
49
Confluent | Data pipeline | Realtime data stream | D | 205.9 | 2014
https://www.confluent.io/
Confluent is a fully managed Kafka service and enterprise stream processing platform. Real-time data streaming for AWS, GCP, Azure or serverless. Try free!
founded by the original creators of Apache Kafka
50
Yellowbrick Data | Data pipeline | Data warehouse | C | 173 | 2014
https://yellowbrick.com/
The ultimate solution for your data warehouse. Quick to deploy, easy to expand, and simple to manage. Yellowbrick Data can solve your data problems.
51
Naveego | Data pipeline | Data processing | Seed | 0.5 | 2014
https://www.naveego.com/
A leading provider of cloud-first, distributed data accuracy solutions for seamless, end-to-end data cleansing, Naveego enables organizations to proactively manage, detect and eliminate data accuracy issues across all enterprise data sources in real-time–regardless of structure or schema.
52
Gluent | Data pipeline | Visualization | Seed | 5.7 | 2014
https://gluent.com/
Data virtualization software eliminates data silos. Gluent's transparent data virtualization provides virtual access to all enterprise data, with zero code changes.
53
Vexata | Data pipeline | Storage | StorCentric | 54 | 2014
https://www.vexata.com/
Vexata is an active data infrastructure company that accelerates database and analytic platforms via groundbreaking storage solutions.
ACQ
54
Storbyte | Data pipeline | Storage | 2014
http://storbyte.com/
Storbyte designs and manufactures all-flash & hybrid flash enterprise storage arrays that offer performance, power management, availability, reliability, density, efficiency, flexibility, expandability, and affordability. Storbyte is providing innovative data storage solutions and has not lost sight of what is important to end users: a responsible, cost-correct price point.
NOT AI
55
Komprise | Data pipeline | Storage | C | 42 | 2014
https://www.komprise.com/
In 15 minutes, our free data management software trial will show you how you can save 70% on data management costs, on-premises and in the cloud.
56
Excelero | Data pipeline | Storage | B | 35 | 2014
https://www.excelero.com/
Local NVMe performance at data center scale through true convergence. Software-defined block storage for Cloud and Enterprise applications at any scale.
57
ClearSky Data | Data pipeline | Storage | B | 59 | 2014
https://www.clearskydata.com/
ClearSky Data offers enterprise storage as a hybrid cloud service delivering on-demand primary storage, offsite backup, and DR as a single service.
58
Pachyderm | Data pipeline | Versioning | A | 12.1 | 2014
https://www.pachyderm.com/
OSS
Data Lineage with End-to-End Pipelines on Kubernetes, engineered for the enterprise. And… It's open source!
59
Kimono Labs | Data pipeline | Data generation | Palantir | 5 | 2014
http://www.kimonolabs.com/
Kimono Labs is an online platform that allows its users to convert their websites into APIs.
Web scraping
60
Git LFS | Data pipeline | Versioning | Atlassian, GitHub | 2014
https://git-lfs.github.com/
OSS
Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.
An open source Git extension for versioning large files - 7.9k stars
61
Alluxio | Data pipeline | Data management | 16 | 2015
https://www.alluxio.io/
OSS
an open source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud.
62
Dremio | Data pipeline | Data management | 45 | 2015
https://www.dremio.com/
Get more value from your data, faster. Dremio makes your data engineers more productive, and your data consumers more self-sufficient.
Data lake
Founders of Apache Arrow and Apache Drill
63
Hammerspace | Data pipeline | Database/Query | 2015
https://hammerspace.com/
Hammerspace allows data to move freely, like the air you breathe, across clouds and services. Make data accessible exactly where you need it, when you need it – on demand.
Data-as-a-Service
64
Octopai | Data pipeline | Data management | B | 6.2 | 2015
https://www.octopai.com/
An automated, centralized, cross-platform metadata search engine that enables BI groups to quickly and precisely discover and govern shared metadata.
65
Kyvos Insights | Data pipeline | Database/Query | 2015
https://www.kyvosinsights.com/
Kyvos accelerates BI on trillions of rows of data on the cloud and on-premise platforms with a semantic layer powered by its next-generation OLAP technology.
It pre-calculates aggregates at multiple levels of dimensional hierarchies to improve query response times as compared to SQL-on-Hadoop platforms
66
Gemini Data | Data pipeline | Data management | 2015
https://www.geminidata.com/
Gemini Data provides Data Availability for AI/ML driven analysis and applications to enable unified enterprise knowledge and access.
67
DefinedCrowd | Data pipeline | Data generation | A | 13.1 | 2015
https://www.definedcrowd.com/
Leverage machine learning technology and human intelligence to source, structure, and enrich high quality training data in speech, NLP, and computer vision.
Forbes 50 AI companies 2019
68
Ascend.io | Data pipeline | Data management | A | 19 | 2015
https://www.ascend.io/
Experience continuously optimized data pipelines with less code and fewer breakages. Enter the new era of data engineering with Ascend's autonomous dataflow service.
69
Dask | Data pipeline | Data processing | 2015
https://dask.org/
OSS
Dask natively scales Python. Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
70
Quilt | Data pipeline | Versioning | Seed | 4.2 | 2015
https://quiltdata.com/
OSS
Quilt is a versioned data portal for AWS
71
Imply | Data pipeline | Data management | B | 45.3 | 2015
https://imply.io/
Imply delivers real-time analytics powered by Apache Druid. ... Stream or batch load data into Druid for high performance, ad-hoc analytic queries.
72
Vaex | Data pipeline | Data processing | 2015
https://vaex.io/
OSS
Power up your business with our data driven solutions. With our unique, state-of-the-art technology, we provide fast and scalable solutions that will make you more agile, while limiting unnecessary resources.
fast pandas
73
erwin | Data pipeline | Data management | Parallax Capital Partners | 2016
https://erwin.com/
Integrated enterprise architecture, business process and data modeling with data cataloging and data literacy for risk management and digital transformation.
Data governance | ACQ
74
Aparavi | Data pipeline | Data management | 2016
https://www.aparavi.com/
Aparavi's highly scalable data intelligence and automation solutions enable organizations to easily discover, classify, protect, and optimize their data.
backup solution
75
Scale AI | Data pipeline | Data generation | C | 122.6 | 2016
https://scale.com
Trusted by world class companies, Scale delivers high quality training data for AI applications such as self-driving cars, mapping, AR/VR, robotics, and more.
76
LabelImg | Data pipeline | Labeling | Amazon | 2016 | OSS
LabelImg is a graphical image annotation tool for labeling object bounding boxes in images.
Independent tool
77
Segments.ai | Data pipeline | Labeling | 2020
https://segments.ai/
Deep learning-fueled labeling technology with a focus on instance and semantic segmentation.
78
Playment | Data pipeline | Labeling | Seed | 2.5 | 2015
https://playment.io/
Build high-quality ground truth datasets with ML-assisted tools, sophisticated project management software, expert human workforce, and much more.
79
Snorkel | Data pipeline | Labeling | 2016 | OSS
Programmatically Building and Managing Training Data
80
Qri | Data pipeline | Versioning | 2016 | https://qri.io/ | OSS
Bigger than a spreadsheet, smaller than a database, datasets are all around us. Use Qri to browse, download, create, fork, & publish datasets across a network of peers.
81
Apache Hudi | Data pipeline | Data warehouse | Uber | 2016
https://hudi.apache.org/
OSS
Apache Hudi ingests & manages storage of large analytical datasets over DFS (hdfs or cloud stores)
Data lake
82
Starburst Data | Data pipeline | Database/Query | 22 | 2017
https://www.starburstdata.com/
Limitless Queries. Break boundaries and harness the power of the world's fastest SQL query engine.
83
Fluree | Data pipeline | Database | Seed | 4.7 | 2017
https://flur.ee/
Welcome to better data management. The Fluree platform organizes blockchain-secured data in a highly-scalable, highly-insightful graph database.
84
DVC | Data pipeline | Versioning | 2017
https://dvc.org/
OSS
Open-source version control system for Data Science and Machine Learning projects. Git-like experience to organize your data, models, and experiments.
85
Pilosa | Data pipeline | Database/Query | Seed | 3.7 | 2017
https://www.pilosa.com/
OSS
Pilosa is an open source, distributed bitmap index that dramatically accelerates continuous analysis across multiple, massive data sets.
Molecula
86
Prodigy | Data pipeline | Labeling | Explosion | 2017
https://prodi.gy/
Prodigy is a scriptable annotation tool so efficient that data scientists can do the annotation themselves, enabling a new level of rapid iteration. ... With Prodigy you can take full advantage of modern machine learning by adopting a more agile approach to data collection.
87
Datatable | Data pipeline | Data processing | h2o | 2017 | OSS
Python library for efficient multi-threaded data processing, with the support for out-of-memory datasets.
88
HYCU | Data pipeline | Data management | -- | 2018
https://www.hycu.com/
Keep hyper-converged infrastructure running with HYCU's powerful, simple backup & recovery and monitoring solutions. Deploy in seconds for superior results.
89
Dolt | Data pipeline | Versioning | Seed | 2 | 2018
https://www.liquidata.co/
Liquidata's mission is to make data move more efficiently. We built Dolt, an open-source version-controlled SQL database with Git-like semantics.
SQL database: We have a SQL database with Git versioning semantics called Dolt. As far as we know it's the only database with branch and merge functionality.
90
Dataturks | Data pipeline | Labeling | Walmart | 2018
https://dataturks.com/
ML data annotations made super easy for teams. Just upload data, add your team and build training/evaluation dataset in hours.
ACQ
91
Voxel51 // Scoop | Data pipeline | Labeling | Seed | 3.3 | 2018
https://voxel51.com/scoop/
Quickly Build Insights into Your Video Datasets. Scoop enables you to make sense of your video datasets quickly and effectively. Scoop's faceted search is one-of-a-kind in the industry to let you quickly distill large amounts of video into the answers you need.
92
Label Studio | Data pipeline | Labeling | Seed | 0.15 | 2018 | OSS
Label Studio is a multi-type data labeling and annotation tool with standardized output format
93
Doccano | Data pipeline | Labeling | 2018 | OSS
Text annotation for humans. Just create a project, upload data, and start annotating. You can build a dataset in hours.
94
Labelbox | Data pipeline | Labeling | A | 13.9 | 2018
https://labelbox.com/
A complete solution for your training data problem with fast labeling tools, human workforce, data management, a powerful API and automation features.
95
cuDF | Data pipeline | Data processing | NVIDIA | 2018
https://rapids.ai/
Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
96
Modin | Data pipeline | Data processing | 2018
https://github.com/modin-project/modin
OSS
Modin uses Ray to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.
97
FEAST | Data pipeline | Feature engineering | 2019
https://feast.dev/
OSS
Feast (Feature Store) is a tool for managing and serving machine learning features. Feast is the bridge between models and data.
Feature store
98
Tumult Labs | Data pipeline | Privacy | 2019
https://www.tmlt.io/
Unleashing the power of data with ironclad privacy protection
Differential privacy
99
AresDB | Data pipeline | Database/Query | Uber | 2019
https://github.com/uber/aresdb
OSS
A GPU-powered real-time analytics storage and query engine.
100
SQLFlow | Data pipeline | Database/Query | 2019
https://sql-machine-learning.github.io/
Extends SQL to support AI. Extract knowledge from data. Currently supports MySQL, Apache Hive, Alibaba MaxCompute, XGBoost and TensorFlow.