ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster and Generalize Framework
August 28th, 2024
Joohyung Yun, Byungchul Tak, Wook-Shin Han
POSTECH
1
Data Systems Lab @POSTECH
Welcome to JSON World
Data exchange
{
“id”: “joohyung_1”, “pswd”: “1234”,
”info”: {“age”: 23, “gender”: “male”}
}
JavaScript Object Notation (JSON)
Data storage & query
AI & data analysis
Data Systems Lab @POSTECH
Welcome to JSON World
High degree of freedom in data modeling
Problem: An accompanying schema is often absent!
2
{“id”: “joohyung_1”, “pswd”: “1234”,
”info”: {“age”: 23, “gender”: “male”}}
{“id”: “wook-shin_2”, “pswd”: “qwer”,
”info”: {“age”: 52, “gender”: “male”}}
id | pswd |
joohyung_1 | 1234 |
wook-shin_2 | qwer |
{“id”: “byungchul_3”, “pswd”: “asdf”,
”info”: {“degree”: “PhD”, “place”: “KNU”}}
id | age | gender |
joohyung_1 | 23 | male |
wook-shin_2 | 52 | male |
id | pswd |
joohyung_1 | 1234 |
wook-shin_2 | qwer |
byungchul_3 | asdf |
id | degree | place |
byungchul_3 | PhD | KNU |
PEOPLE
INFO
ANOTHER_INFO
Data Systems Lab @POSTECH
JSON Schema
3
JSON Schema
JSON document
validation
JSON parsing
acceleration
Data migration
Data analytics
Data Systems Lab @POSTECH
JSON Schema: JSON Document Validation
3
JSON validator
JSON
document
rejected
Data Systems Lab @POSTECH
JSON Schema Discovery
4
JSON schema
JSON documents
Data Systems Lab @POSTECH
Background�JSON Document and Tree Representation
For visibility, we express a JSON document with a node-labeled, node-typed, edge-labeled tree called an instance tree.
5
{
“id”: “jhyun”,
“info”: [2000, true, null]
}
An example JSON document
An equivalent instance tree
id
info
“jhyun”
2000
true
null
Data Systems Lab @POSTECH
Background�JSON Document and Tree Representation
Primitive type document is
5
{
“id”: “jhyun”,
“info”: [2000, true, null]
}
id
info
“jhyun”
2000
true
null
Data Systems Lab @POSTECH
Background�JSON Document and Tree Representation
Array type document is
5
{
“id”: “jhyun”,
“info”: [2000, true, null]
}
element
element
element
element
element
element
id
info
“jhyun”
2000
true
null
Data Systems Lab @POSTECH
Background�JSON Document and Tree Representation
Object type document is
5
{
“id”: “jhyun”,
“info”: [2000, true, null]
}
key
value
key
value
id
info
“jhyun”
2000
true
null
key
value
key
value
Data Systems Lab @POSTECH
Background�JSON Schema
JSON schema
6
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Primitive type schemas trivially impose JSON documents to be a document of corresponding type.
7
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Primitive type schemas trivially impose JSON documents to be a document of corresponding type.
7
id
info
“jhyun”
2000
true
null
STR
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Primitive type schemas trivially impose JSON documents to be a document of corresponding type.
7
id
info
“jhyun”
2000
true
null
STR
NUM
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Primitive type schemas trivially impose JSON documents to be a document of corresponding type.
7
id
info
“jhyun”
2000
true
null
STR
NUM
BOOL
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Primitive type schemas trivially impose JSON documents to be a document of corresponding type.
7
id
info
“jhyun”
2000
true
null
STR
NUM
BOOL
NULL
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Array schemas additionally impose constraints on the number and schemas of elements.
7
id
info
“jhyun”
2000
true
null
STR
NUM
BOOL
NULL
ARR
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Array schemas additionally impose constraints on the number and schemas of elements.
7
STR
id
info
“jhyun”
2000
true
null
NUM
BOOL
NULL
ARR
ARR
accepted by
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Array schemas additionally impose constraints on the number and schemas of elements.
7
STR
id
info
“jhyun”
2000
true
null
NUM
BOOL
NULL
ARR
accepted by
ANYOF
*
*
ANYOF
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Object schemas additionally impose constraints on each keys and schema of each values.
7
id
info
“jhyun”
true
null
STR
NUM
BOOL
NULL
OBJ
ARR
ANYOF
*
accepted by
2000
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Object schemas additionally impose constraints on each keys and schema of each values.
7
id
info
“jhyun”
true
null
STR
OBJ
accepted by
2000
id
info
NUM
BOOL
NULL
ARR
ANYOF
*
“hyung”
nickname
STR
nickname
optional
rejected by
required
Data Systems Lab @POSTECH
Background�JSON Schema and Tree Representation
Object schemas additionally impose constraints on each keys and schema of each values.
7
id
info
“jhyun”
true
null
STR
OBJ
accepted by
2000
NUM
BOOL
NULL
ARR
ANYOF
*
ANYOF
*
foo
“beef”
true
bar
Data Systems Lab @POSTECH
JSON Schema Discovery Problem
8
Data Systems Lab @POSTECH
Challenges of JSON Schema Discovery
9
2019
id
“qwer”
pswd
2023
id
“Yun”
name
NUM
STR
OBJ
id
name
NUM
STR
OBJ
id
pswd
NUM
STR
OBJ
id
pswd
ANYOF
STR
NUM
STR
OBJ
ANYOF
name
*
Specific
General
Data Systems Lab @POSTECH
Challenges of JSON Schema Discovery
9
“item”
type
“Q147”
id
What if there are many, highly nested JSON documents?
claims
P373
P279
“item”
type
“Q732”
id
aliases
Ru
“item”
type
“Q117”
id
claims
P31
P361
“Q117”
title
P17
Data Systems Lab @POSTECH
Global Example
10
“jhyun”
id
“qwer”
pswd
info
23
“male”
age
“wshan”
name
info
52
“male”
age
“w@a.com”
gender
gender
Data Systems Lab @POSTECH
Motivation 1: Insufficient Information to Derive Schema
11
“jhyun”
id
“qwer”
pswd
info
“wshan”
name
info
23
“male”
age
gender
52
“male”
age
null
gender
OBJ
id
pswd
name
info
OBJ
pswd
info
OBJ
name
info
ANYOF
id
23
“male”
age
gender
52
“male”
age
“w@a.com”
gender
Same schema
Different schema
Ambiguous!
Data Systems Lab @POSTECH
Key Idea 1: Bottom-up Schema Generation
We propose deriving schema in a bottom-up manner.
12
id
pswd
info
name
info
23
“male”
52
“male”
“w@a.com”
“jhyun”
“qwer”
“wshan”
age
gender
age
gender
Schema
derivation
order
NUM
STR
NUM
STR
STR
Similar
OBJ
STR
STR
STR
OBJ
NUM
STR
STR
NUM
STR
NULL
age
age
gender
gender
Similar
Data Systems Lab @POSTECH
Motivation 2: Many Candidate Derivations
In most cases, it may be ambiguous to pick one set of schemas to derive.
13
“jhyun”
id
“qwer”
pswd
info
“wshan”
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
id
pswd
info
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
id
pswd
info
name
info
age
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
STR
gender
id
pswd
info
name
info
NUM
STR
STR
STR
OBJ
STR
OBJ
ANYOF
NUM
STR
ANYOF
*
*
Most specific
Moderate
Most general
Data Systems Lab @POSTECH
Key Idea 2: Candidate Schema Exploration
ReCG explores multiple schema derivation candidates rather than settling on just one.
“jhyun”
id
“qwer”
pswd
info
“wshan”
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
id
pswd
info
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
id
pswd
info
name
info
age
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
STR
gender
id
pswd
info
name
info
NUM
STR
STR
STR
OBJ
STR
OBJ
ANYOF
NUM
STR
ANYOF
*
*
State
: a bag of partially
derived schemas
Transition
: deriving schema nodes
of an upper level
Data Systems Lab @POSTECH
Key Idea 2: Candidate Schema Exploration
ReCG explores multiple schema derivation candidates rather than settling on just one.
“jhyun”
id
“qwer”
pswd
info
“wshan”
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
id
pswd
info
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
id
pswd
info
name
info
age
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
STR
gender
id
pswd
info
name
info
NUM
STR
STR
STR
OBJ
STR
OBJ
ANYOF
NUM
STR
ANYOF
*
*
Data Systems Lab @POSTECH
Key Idea 2: Candidate Schema Exploration
ReCG explores multiple schema derivation candidates rather than settling on just one.
14
Candidate final states:
Fully derived schemas
* The schemas within the candidate final states are not expressed for brevity.
It is likely that a promising schema will
eixst within the candidate final states!
Data Systems Lab @POSTECH
Motivation 3: Ambiguity in Choosing the Best Schema
We need a metric that correctly measure the “goodness” of a schema.
15
* The schemas within the candidate final states are not expressed for brevity.
Data Systems Lab @POSTECH
Motivation 3: Ambiguity in Choosing the Best Schema
We need a metric that correctly measure the “goodness” of a schema.
15
id
pswd
info
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
id
pswd
name
info
age
gender
NUM
STR
STR
STR
STR
STR
OBJ
STR
OBJ
NUM
STR
ANYOF
*
ANYOF
OBJ
*
OBJ
ANYOF
OBJ
OBJ
Data Systems Lab @POSTECH
Key Idea 3: Minimum Description Length Cost
Minimum description length principle: minimize the total amount of information required to describe the input data
16
SRC: Schema representation cost
DRC: Data representation cost
transform
transform
Data Systems Lab @POSTECH
Key Idea 3: Minimum Description Length Cost
16
id
pswd
info
name
info
age
gender
age
gender
NUM
STR
NUM
STR
STR
STR
STR
OBJ
STR
OBJ
id
pswd
name
info
age
gender
NUM
STR
STR
STR
STR
STR
OBJ
STR
OBJ
NUM
STR
ANYOF
*
ANYOF
OBJ
*
OBJ
ANYOF
OBJ
OBJ
SRC ↑
DRC ↓
SRC -
DRC -
SRC ↓
DRC ↑
Data Systems Lab @POSTECH
Contributions of ReCG
17
Data Systems Lab @POSTECH
Experiment
Algorithms compared
Machine
Datasets
18
Data Systems Lab @POSTECH
Accuracy Comparison
ReCG outperforms all competitors in F1 score, especially shows 46% increase compared to the SOTA
19
46%
18%
47%
Data Systems Lab @POSTECH
MDL Cost Analysis
20
3.8×
4.9×
14.2×
Data Systems Lab @POSTECH
Algorithm Execution Time
ReCG’s performance is positioned in the middle of KReduce and Jxplain
21
More experiments in our paper!
Data Systems Lab @POSTECH
Conclusion
22
Data Systems Lab @POSTECH
THANK YOU
23
Data Systems Lab @POSTECH