Created and Maintained by Joseph Jacks (OSS Capital) and Justin Cormack (Docker) in early December 2020
Criteria: This sheet lists FOSS technologies (and COSS companies if they exist) which focus on creating "git for data" solutions in various domains.
| COSS Company | Funding (M) | FOSS Technology | Stars | Launched | Status | Focus | Description | Lead Creator/PM |
|---|---|---|---|---|---|---|---|---|
| - | - | https://github.com/datahuborg/datahub | 200 | ~September 2013 | Dormant | data collaboration | Data collaboration platform | https://twitter.com/anantpb |
| AERGO | $30 | https://github.com/aergoio/litetree | 1,400 | ~August 2018 | Dormant | sqlite | Branch and merge SQLite | https://twitter.com/aergo_io |
| Attic Labs | $8 | https://github.com/attic-labs/noms | 7,300 | ~May 2018 | Dormant | Database | Declarative content addressed database | https://twitter.com/aboodman |
| DoltHub | $5 | https://github.com/dolthub/dolt | 2,000 | ~December 2018 | Active | SQL | Version SQL tables, merge, branch. Hosted hub for public data. | https://twitter.com/timsehn |
| Dotscience | $10 | https://github.com/dotmesh-oss/dotmesh | 500 | ~February 2018 | Dormant | ML | Originally general purpose versioned data, pivoted to replicatable experiments | https://twitter.com/lmarsden |
| GitLab | $434 | https://gitlab.com/meltano/meltano | 400 | ~July 2018 | Active | ETL | Orchestration of ELT pipelines | https://twitter.com/DouweM |
| Gretel.ai | $16 | https://github.com/gretelai/gretel-synthetics | 100 | ~March 2020 | Active | data generation | Synthetic privacy preserving data generation | https://twitter.com/AlexWatson405 |
| Grist Labs | - | https://github.com/paulfitz/daff | 550 | ~January 2013 | Dormant | Diff | Data diff tool | https://twitter.com/fitzyfitzyfitzy |
| Grist Labs | - | https://github.com/gristlabs/grist-core | 20 | ~May 2020 | Active | Spreadsheet | Versioned spreadsheet | https://twitter.com/fitzyfitzyfitzy |
| Iterative | $4 | https://github.com/iterative/dvc | 6,800 | ~March 2017 | Active | ML | Git/Git LFS and Makefiles for ML and data science | https://twitter.com/rkuprieiev |
| Pachyderm | $28 | https://github.com/pachyderm/pachyderm | 4,700 | ~October 2014 | Active | data science | Version controlled data ingestion and processing pipeline | https://twitter.com/jdoliner |
| Qri | - | https://github.com/qri-io/qri | 1,000 | ~October 2016 | Active | data management | Dataset version control | https://twitter.com/b_fiive |
| Quilt Data | $4 | https://github.com/quiltdata/quilt | 1,000 | ~February 2017 | Active | ML/data | Versioning for small and large data that don't fit in git, e.g. ML models. S3/AWS based. | https://twitter.com/akarve |
| Replicate | - | https://github.com/replicate/replicate | 500 | ~August 2020 | Active | ML | Version ML models, focus on simpler workflows and introducing people to version control | https://twitter.com/bfirsh |
| Tarides | - | https://github.com/mirage/irmin | 1,400 | ~August 2017 | Active | Blockchain/general | Git for merging distributed data models. OCaml. Used by Tezos. | https://twitter.com/eriangazag |
| TerminusDB | $1 | https://github.com/terminusdb/terminusdb | 1,000 | ~May 2019 | Active | Database | Revision controlled graph database | https://twitter.com/GavinMGleason |
| Treeverse | - | https://github.com/treeverse/lakeFS | 500 | ~September 2019 | Active | data management | Versioned data lake for ETL and data science | https://twitter.com/lakeFS |