1 of 43

ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster and Generalize Framework

August 28th, 2024

Joohyung Yun, Byungchul Tak, Wook-Shin Han

POSTECH

1

Data Systems Lab @POSTECH

2 of 43

Welcome to JSON World

Data exchange

{

“id”: “joohyung_1”, “pswd”: “1234”,

”info”: {“age”: 23, “gender”: “male”}

}

JavaScript Object Notation (JSON)

Data storage & query

AI & data analysis

Data Systems Lab @POSTECH

3 of 43

Welcome to JSON World

High degree of freedom in data modeling

Problem: An accompanying schema is often absent!

2

{“id”: “joohyung_1”, “pswd”: “1234”,

”info”: {“age”: 23, “gender”: “male”}}

{“id”: “wook-shin_2”, “pswd”: “qwer”,

”info”: {“age”: 52, “gender”: “male”}}

id

pswd

joohyung_1

1234

wook-shin_2

qwer

{“id”: “byungchul_3”, “pswd”: “asdf”,

”info”: {“degree”: “PhD”, “place”: “KNU”}}

id

age

gender

joohyung_1

23

male

wook-shin_2

52

male

id

pswd

joohyung_1

1234

wook-shin_2

qwer

byungchul_3

asdf

id

degree

place

byungchul_3

PhD

KNU

PEOPLE

INFO

ANOTHER_INFO

Data Systems Lab @POSTECH

4 of 43

JSON Schema

3

JSON Schema

JSON document

validation

JSON parsing

acceleration

Data migration

Data analytics

Data Systems Lab @POSTECH

5 of 43

JSON Schema: JSON Document Validation

3

JSON validator

JSON

document

rejected

Data Systems Lab @POSTECH

6 of 43

JSON Schema Discovery

4

JSON schema

JSON documents

Data Systems Lab @POSTECH

7 of 43

Background�JSON Document and Tree Representation

For visibility, we express a JSON document with a node-labeled, node-typed, edge-labeled tree called an instance tree.

  • 6 data types in total

5

{

“id”: “jhyun”,

“info”: [2000, true, null]

}

An example JSON document

An equivalent instance tree

id

info

“jhyun”

2000

true

null

Data Systems Lab @POSTECH

8 of 43

Background�JSON Document and Tree Representation

Primitive type document is

  • one of string, number, boolean, null.
  • expressed with a primitive-typed node with its value as label.

5

{

“id”: “jhyun”,

“info”: [2000, true, null]

}

id

info

“jhyun”

2000

true

null

Data Systems Lab @POSTECH

9 of 43

Background�JSON Document and Tree Representation

Array type document is

  • a sequence of elements (JSON documents).
  • expressed with an array-typed node with elements as ordered children.

5

{

“id”: “jhyun”,

“info”: [2000, true, null]

}

element

element

element

element

element

element

id

info

“jhyun”

2000

true

null

Data Systems Lab @POSTECH

10 of 43

Background�JSON Document and Tree Representation

Object type document is

  • a set of key, value (JSON document) pairs.
  • expressed with an object-typed node with keys as edge labels and values as unordered children.

5

{

“id”: “jhyun”,

“info”: [2000, true, null]

}

key

value

key

value

id

info

“jhyun”

2000

true

null

key

value

key

value

Data Systems Lab @POSTECH

11 of 43

Background�JSON Schema

JSON schema

  • Either accepts or rejects a JSON document
  • Imposes constraints to JSON documents
  • Is recursively defined

6

Data Systems Lab @POSTECH

12 of 43

Background�JSON Schema and Tree Representation

Primitive type schemas trivially impose JSON documents to be a document of corresponding type.

7

Data Systems Lab @POSTECH

13 of 43

Background�JSON Schema and Tree Representation

Primitive type schemas trivially impose JSON documents to be a document of corresponding type.

  • String schema

7

id

info

“jhyun”

2000

true

null

STR

accepted by

Data Systems Lab @POSTECH

14 of 43

Background�JSON Schema and Tree Representation

Primitive type schemas trivially impose JSON documents to be a document of corresponding type.

  • String schema, number schema

7

id

info

“jhyun”

2000

true

null

STR

NUM

accepted by

Data Systems Lab @POSTECH

15 of 43

Background�JSON Schema and Tree Representation

Primitive type schemas trivially impose JSON documents to be a document of corresponding type.

  • String schema, number schema, boolean schema

7

id

info

“jhyun”

2000

true

null

STR

NUM

BOOL

accepted by

Data Systems Lab @POSTECH

16 of 43

Background�JSON Schema and Tree Representation

Primitive type schemas trivially impose JSON documents to be a document of corresponding type.

  • String schema, number schema, boolean schema, null schema

7

id

info

“jhyun”

2000

true

null

STR

NUM

BOOL

NULL

accepted by

Data Systems Lab @POSTECH

17 of 43

Background�JSON Schema and Tree Representation

Array schemas additionally impose constraints on the number and schemas of elements.

7

id

info

“jhyun”

2000

true

null

STR

NUM

BOOL

NULL

ARR

accepted by

Data Systems Lab @POSTECH

18 of 43

Background�JSON Schema and Tree Representation

Array schemas additionally impose constraints on the number and schemas of elements.

  • Homogeneous array schema constrains arrays to have
    • Fixed length
    • Fixed subschema for element at each index

7

STR

id

info

“jhyun”

2000

true

null

NUM

BOOL

NULL

ARR

ARR

accepted by

Data Systems Lab @POSTECH

19 of 43

Background�JSON Schema and Tree Representation

Array schemas additionally impose constraints on the number and schemas of elements.

  • Heterogeneous array schema constrains arrays to have
    • Random lengths
    • Fixed single subschema for every element

7

STR

id

info

“jhyun”

2000

true

null

NUM

BOOL

NULL

ARR

accepted by

ANYOF

*

*

ANYOF

Data Systems Lab @POSTECH

20 of 43

Background�JSON Schema and Tree Representation

Object schemas additionally impose constraints on each keys and schema of each values.

7

id

info

“jhyun”

true

null

STR

NUM

BOOL

NULL

OBJ

ARR

ANYOF

*

accepted by

2000

Data Systems Lab @POSTECH

21 of 43

Background�JSON Schema and Tree Representation

Object schemas additionally impose constraints on each keys and schema of each values.

  • Homogeneous object schema constrains objects to have
    • Homogeneous keys (either required or optional)
    • One subschema per key

7

id

info

“jhyun”

true

null

STR

OBJ

accepted by

2000

id

info

NUM

BOOL

NULL

ARR

ANYOF

*

“hyung”

nickname

STR

nickname

optional

rejected by

required

Data Systems Lab @POSTECH

22 of 43

Background�JSON Schema and Tree Representation

Object schemas additionally impose constraints on each keys and schema of each values.

  • Heterogeneous object schema constrains objects to have
    • Heterogeneous keys & random number of key-value pairs
    • Same subschema for every value

7

id

info

“jhyun”

true

null

STR

OBJ

accepted by

2000

NUM

BOOL

NULL

ARR

ANYOF

*

ANYOF

*

foo

“beef”

true

bar

Data Systems Lab @POSTECH

23 of 43

JSON Schema Discovery Problem

  •  

8

Data Systems Lab @POSTECH

24 of 43

Challenges of JSON Schema Discovery

  •  

9

2019

id

“qwer”

pswd

2023

id

“Yun”

name

NUM

STR

OBJ

id

name

NUM

STR

OBJ

id

pswd

NUM

STR

OBJ

id

pswd

ANYOF

STR

NUM

STR

OBJ

ANYOF

name

*

 

 

 

Specific

General

 

Data Systems Lab @POSTECH

25 of 43

Challenges of JSON Schema Discovery

  •  

9

“item”

type

“Q147”

id

What if there are many, highly nested JSON documents?

claims

P373

P279

“item”

type

“Q732”

id

aliases

Ru

“item”

type

“Q117”

id

claims

P31

P361

“Q117”

title

P17

 

Data Systems Lab @POSTECH

26 of 43

Global Example

  •  

10

“jhyun”

id

“qwer”

pswd

info

23

“male”

age

“wshan”

name

info

52

“male”

age

“w@a.com”

gender

 

gender

email

Data Systems Lab @POSTECH

27 of 43

Motivation 1: Insufficient Information to Derive Schema

  • Existing works derive schema in a top (root) – down (leaf) manner.
  • Information may be insufficient to correctly derive a schema
    • Schemas of children instances are not derived.

11

“jhyun”

id

“qwer”

pswd

info

“wshan”

name

info

23

“male”

age

gender

52

“male”

age

null

gender

email

OBJ

id

pswd

name

info

OBJ

pswd

info

OBJ

name

info

ANYOF

id

23

“male”

age

gender

52

“male”

age

“w@a.com”

gender

email

Same schema

Different schema

Ambiguous!

Data Systems Lab @POSTECH

28 of 43

Key Idea 1: Bottom-up Schema Generation

We propose deriving schema in a bottom-up manner.

  • Previously derived schema nodes provide information for correct schema derivation.

12

id

pswd

info

name

info

23

“male”

52

“male”

“w@a.com”

“jhyun”

“qwer”

“wshan”

age

gender

age

gender

email

Schema

derivation

order

NUM

STR

NUM

STR

STR

Similar

OBJ

STR

STR

STR

OBJ

NUM

STR

STR

NUM

STR

NULL

age

age

gender

email

gender

email

Similar

Data Systems Lab @POSTECH

29 of 43

Motivation 2: Many Candidate Derivations

In most cases, it may be ambiguous to pick one set of schemas to derive.

13

“jhyun”

id

“qwer”

pswd

info

“wshan”

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

id

pswd

info

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

email

id

pswd

info

name

info

age

email

age

gender

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

STR

gender

id

pswd

info

name

info

NUM

STR

STR

STR

OBJ

STR

OBJ

ANYOF

NUM

STR

ANYOF

*

*

Most specific

Moderate

Most general

Data Systems Lab @POSTECH

30 of 43

Key Idea 2: Candidate Schema Exploration

ReCG explores multiple schema derivation candidates rather than settling on just one.

“jhyun”

id

“qwer”

pswd

info

“wshan”

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

id

pswd

info

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

email

id

pswd

info

name

info

age

email

age

gender

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

STR

gender

id

pswd

info

name

info

NUM

STR

STR

STR

OBJ

STR

OBJ

ANYOF

NUM

STR

ANYOF

*

*

State

: a bag of partially

derived schemas

Transition

: deriving schema nodes

of an upper level

Data Systems Lab @POSTECH

31 of 43

Key Idea 2: Candidate Schema Exploration

ReCG explores multiple schema derivation candidates rather than settling on just one.

“jhyun”

id

“qwer”

pswd

info

“wshan”

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

id

pswd

info

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

email

id

pswd

info

name

info

age

email

age

gender

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

STR

gender

id

pswd

info

name

info

NUM

STR

STR

STR

OBJ

STR

OBJ

ANYOF

NUM

STR

ANYOF

*

*

Data Systems Lab @POSTECH

32 of 43

Key Idea 2: Candidate Schema Exploration

ReCG explores multiple schema derivation candidates rather than settling on just one.

14

 

Candidate final states:

Fully derived schemas

* The schemas within the candidate final states are not expressed for brevity.

It is likely that a promising schema will

eixst within the candidate final states!

Data Systems Lab @POSTECH

33 of 43

Motivation 3: Ambiguity in Choosing the Best Schema

We need a metric that correctly measure the “goodness” of a schema.

15

* The schemas within the candidate final states are not expressed for brevity.

 

 

 

Data Systems Lab @POSTECH

34 of 43

Motivation 3: Ambiguity in Choosing the Best Schema

We need a metric that correctly measure the “goodness” of a schema.

15

 

 

 

id

pswd

info

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

email

id

pswd

name

info

age

gender

NUM

STR

STR

STR

STR

STR

OBJ

STR

OBJ

NUM

STR

ANYOF

*

ANYOF

OBJ

*

OBJ

ANYOF

OBJ

OBJ

Data Systems Lab @POSTECH

35 of 43

Key Idea 3: Minimum Description Length Cost

Minimum description length principle: minimize the total amount of information required to describe the input data

16

 

 

SRC: Schema representation cost

DRC: Data representation cost

transform

transform

Data Systems Lab @POSTECH

36 of 43

Key Idea 3: Minimum Description Length Cost

  •  

16

 

 

 

id

pswd

info

name

info

age

gender

age

gender

email

NUM

STR

NUM

STR

STR

STR

STR

OBJ

STR

OBJ

email

id

pswd

name

info

age

gender

NUM

STR

STR

STR

STR

STR

OBJ

STR

OBJ

NUM

STR

ANYOF

*

ANYOF

OBJ

*

OBJ

ANYOF

OBJ

OBJ

SRC ↑

DRC ↓

SRC -

DRC -

SRC ↓

DRC ↑

 

Data Systems Lab @POSTECH

37 of 43

Contributions of ReCG

  • 46% higher F1 score while also performing 2.11× faster on average against the state-of-the-art algorithm
  • Bottom-up schema generation
  • Search for schema candidate
  • Minimum description length cost

17

Data Systems Lab @POSTECH

38 of 43

Experiment

Algorithms compared

  • Jxplain (SOTA), KReduce, LReduce, KSS (Kletteke et al.), FMC (Frozza et al.)

Machine

  • Intel Xeon E5-2680 v4 @2.40GHz CPU and 756 GB RAM

Datasets

  • .

18

Data Systems Lab @POSTECH

39 of 43

Accuracy Comparison

ReCG outperforms all competitors in F1 score, especially shows 46% increase compared to the SOTA

  • Shows upto 47% higher recall compared to SOTA
  • Shows upto 18% higher precision compared to SOTA

19

46%

18%

47%

Data Systems Lab @POSTECH

40 of 43

MDL Cost Analysis

  • ReCG’s MDL cost is 3.8×, 4.9× and 14.2× smaller than Jxplain, KReduce, and LReduce, respectively.
  • MDL costs show a strong negative correlation (-0.77) with F1 scores.

20

3.8×

4.9×

14.2×

Data Systems Lab @POSTECH

41 of 43

Algorithm Execution Time

ReCG’s performance is positioned in the middle of KReduce and Jxplain

  • ReCG outperforms Jxplain by 1.54× to 2.11× on average varying on dataset size
  • ReCG is slower than KReudce by about 2×
    • ReCG spends time searching for various candidates of schema

21

More experiments in our paper!

Data Systems Lab @POSTECH

42 of 43

Conclusion

  • We present an accurate and efficient JSON schema discovery algorithm called ReCG
  • ReCG solves the accuracy problems of existing algorithms
  • We propose the novel concept of bottom-up schema generation to provide more information within the schema derivation process
  • We solve the JSD problem as a search problem, while applying the minimum description length principle to choose the most promising schema

22

Data Systems Lab @POSTECH

43 of 43

THANK YOU

23

Data Systems Lab @POSTECH