1 of 25

Challenges and Progress in Dataset Search

Zhiyu Chen

Lehigh University

2 of 25

Background

  • Today many people rely on datasets for their work
    • data journalists need datasets to tell a good story
    • researchers use datasets for their research experiments

3 of 25

Background

  • Today many people rely on datasets for their work
    • data journalists need datasets to tell a good story
    • researchers use datasets for their research experiments
  • An increasing number of datasets are available online, which makes finding the target datasets a non-trivial task

4 of 25

What is a dataset?

  • Dataset = data content + metadata
    • data content: text, audio, images and video
    • metadata: summarizes the data content at a high level or describes its background, such as timestamp, file format and owners.

5 of 25

What is a dataset?

  • Dataset = data content + metadata
    • data content: text, audio, images and video
    • metadata: summarizes the data content at a high level or describes its background, such as timestamp, file format and owners.
  • Text data in tabular format is one of the most important types of data content

6 of 25

Why is dataset search a challenging task?

  • Heterogeneity of schema labels makes it hard to connect datasets to one another or to queries
    • no naming standard for column names

7 of 25

Why is dataset search a challenging task?

  • Heterogeneity of schema labels makes it hard to connect datasets to one another or to queries
    • no naming standard for column names
    • non-dictionary words (NDWs) commonly appear in data tables

8 of 25

Solution: Schema Label Generation[1]

[1] Chen et al., “Generating Schema Labels through Dataset Content Analysis”, Companion of The Web Conference 2018.

9 of 25

Features: Schema Label Generation

  • the maximum and minimum values of columns that may be numeric
    • Observation: the numbers that appear in different schema content differ considerably
    • For non-numeric columns, the averages of the maximum value (or minimum value) are used

10 of 25

Features: Schema Label Generation

  • content unique ratio describes the distribution of cell values.
    • Definition: the number of unique cell values divided by the total number of cells
    • Example:
      • the content unique ratio is 1/102 ≈ 0.01 for “State” if the table has 102 rows and all cell values under this schema label are “NY”
      • In contrast, the content unique ratio is 102/102 = 1 for “Farmers Market Name” if all cell values under this schema label are different
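
The definition above can be sketched in a few lines of Python. The function name and the handling of an empty column are assumptions made for illustration; the market names are made-up placeholders standing in for 102 distinct values.

```python
def content_unique_ratio(cells):
    """Number of distinct cell values divided by the total number of cells."""
    if not cells:
        return 0.0  # assumption: an empty column gets ratio 0
    return len(set(cells)) / len(cells)


# The two examples from the slide, each with 102 rows.
state = ["NY"] * 102
market_names = [f"Market {i}" for i in range(102)]  # placeholder distinct names

print(round(content_unique_ratio(state), 2))   # 0.01
print(content_unique_ratio(market_names))      # 1.0
```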

11 of 25

Features: Schema Label Generation

  • content histogram contains more detailed information about the content distribution
    • First rank the unique cell values by frequency (low-frequency first) and generate a vector where the i-th dimension is the frequency of the i-th ranked cell value
    • Resample the vector to a 20-dimensional vector using the fast Fourier transform (FFT)

Content histogram of “Zip Code”

Content histogram of “Farmers Market Name”
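
A minimal sketch of the two histogram steps (rank by frequency, then resample in the Fourier domain to 20 dimensions) follows. It is not the authors' implementation: to stay dependency-free it computes a plain discrete Fourier transform instead of calling an FFT routine (the transform is the same, only slower), it drops the Nyquist bin for simplicity, and the function name, the scaling, and the empty-column behavior are assumptions.

```python
import cmath
from collections import Counter


def content_histogram(cells, dims=20):
    """Rank unique values by frequency (low first), then resample to `dims` values."""
    freqs = sorted(Counter(cells).values())  # i-th entry: frequency of i-th ranked value
    n = len(freqs)
    if n == 0:
        return [0.0] * dims  # assumption: an empty column maps to all zeros

    # Forward DFT of the ranked frequency vector (an FFT would give the same result).
    spectrum = [
        sum(freqs[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

    # Copy the low-frequency bins that fit in both lengths; higher bins stay zero.
    half = (min(n, dims) - 1) // 2
    resized = [0j] * dims
    for k in range(half + 1):
        resized[k] = spectrum[k]                 # DC and positive frequencies
    for k in range(1, half + 1):
        resized[dims - k] = spectrum[n - k]      # matching negative frequencies

    # Inverse DFT at the new length; dividing by n keeps the original amplitude.
    return [
        (sum(resized[k] * cmath.exp(2j * cmath.pi * k * t / dims)
             for k in range(dims)) / n).real
        for t in range(dims)
    ]


flat = content_histogram([f"Market {i}" for i in range(102)])  # every value occurs once
print(len(flat))  # 20
```

With this scaling, a column whose values all occur equally often yields a flat histogram at that frequency, while a skewed column yields a rising curve.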

12 of 25

Features: Schema Label Generation

  • Bag-of-Words (BoWs) features
    • We treat each schema content as a document, so schema label prediction becomes a document classification task.
    • We use only character-level BoWs

BoWs features of column “Longitude” (assume 100 data rows in total)

  Character:  ...  4    5  6  7    8  9  0  -    ...
  Count:      ...  100  0  0  100  0  0  0  100  ...
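
As a rough illustration of character-level BoWs, the sketch below counts how often each character of a fixed vocabulary occurs across all cells of a column. The vocabulary, the sample longitudes, and the choice to count every occurrence (rather than, say, the number of rows containing the character) are assumptions; the slides do not specify these details.

```python
from collections import Counter


def char_bow(cells, vocabulary):
    """Count occurrences of each vocabulary character across all cells of a column."""
    counts = Counter()
    for cell in cells:
        counts.update(str(cell))  # a string is iterated character by character
    return [counts[ch] for ch in vocabulary]


vocabulary = list("-.0123456789")   # illustrative vocabulary
longitudes = ["-74.5", "-73.9"]     # made-up sample cells
print(dict(zip(vocabulary, char_bow(longitudes, vocabulary))))
```

The resulting fixed-length count vector is what a document classifier would consume as input for the column.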

13 of 25

Features: Schema Label Generation

  • Other features borrowed from the study of table header detection (Jing Fang et al. AAAI 2012)
    • number of characters
    • percentage of numeric characters
    • percentage of alphabetic characters
    • percentage of symbolic characters
    • percentage of numeric cells
    • average cell length
    • maximum cell length
    • minimum cell length
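
The eight features listed above could be computed along the following lines. The exact definitions are assumptions: percentages are returned as fractions between 0 and 1, a "symbolic" character is taken to be anything that is neither alphanumeric nor whitespace, and a cell counts as numeric when Python's `float()` can parse it.

```python
def is_numeric(cell):
    """Assumption: a cell is numeric if float() can parse it."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def curated_text_features(cells):
    """Character- and cell-level statistics for one column."""
    cells = [str(c) for c in cells]
    chars = "".join(cells)
    total_chars = len(chars) or 1          # avoid division by zero
    lengths = [len(c) for c in cells] or [0]
    return {
        "num_chars": len(chars),
        "pct_numeric_chars": sum(ch.isdigit() for ch in chars) / total_chars,
        "pct_alpha_chars": sum(ch.isalpha() for ch in chars) / total_chars,
        "pct_symbol_chars": sum(
            not ch.isalnum() and not ch.isspace() for ch in chars
        ) / total_chars,
        "pct_numeric_cells": sum(is_numeric(c) for c in cells) / (len(cells) or 1),
        "avg_cell_length": sum(lengths) / len(lengths),
        "max_cell_length": max(lengths),
        "min_cell_length": min(lengths),
    }


print(curated_text_features(["12", "ab", "3-4"]))
```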

14 of 25

Results: Top-N accuracy

  • Datasets
    • Data.gov (7,485 CSV files)
    • WikiTables (1.6M tables extracted from Wikipedia)
  • Random forests are trained on different feature sets (curated/BoWs/combined)
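
For reference, top-N accuracy counts a prediction as correct when the true schema label appears among the N highest-ranked candidates. The sketch below assumes each prediction is already a list of labels ordered from most to least likely; the labels and rankings are toy values, not results from the talk.

```python
def top_n_accuracy(ranked_predictions, true_labels, n):
    """Fraction of instances whose true label is among the top-n ranked labels."""
    hits = sum(
        truth in ranked[:n]
        for ranked, truth in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Toy rankings for three columns whose true label is "state".
ranked = [
    ["state", "city", "zip"],
    ["zip", "state", "city"],
    ["city", "zip", "state"],
]
truth = ["state", "state", "state"]

for n in (1, 2, 3):
    print(n, top_n_accuracy(ranked, truth, n))
```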

15 of 25

Why is dataset search a challenging task?

  • Heterogeneity of schema labels makes it hard to connect datasets to one another or to queries

  • Lack of dataset metadata
    • Higher-level semantics of the data content (e.g., quantity names) can only be inferred

16 of 25

Recognizing Quantity Names for Tabular Data[2]

  • A quantity name (also known as a quantity kind) is a kind of quantity that can be measured using defined and unrestricted units of measurement.

[2] Yi et al., “Recognizing Quantity Names for Tabular Data”, International Workshop on Data Search (DATA:SEARCH'18).

17 of 25

Recognizing Quantity Names for Tabular Data

  • Quantity names provide a broader search scope than units alone

Query asks for data in feet

System recognizes it as a length

Match to datasets containing lengths/meters/yards...
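
The three steps above can be sketched as a lookup from the query's unit to its quantity name, followed by a match against datasets annotated with quantity names. Everything concrete here is hypothetical: the unit table, the dataset file names, and their annotations are made up for illustration.

```python
# Hypothetical unit vocabulary mapped to quantity names.
UNIT_TO_QUANTITY = {
    "feet": "Length",
    "meters": "Length",
    "yards": "Length",
    "seconds": "Time",
    "tons": "Weight",
}

# Hypothetical index: dataset name -> quantity names recognized in its columns.
DATASET_QUANTITIES = {
    "trail_segments.csv": {"Length"},
    "bike_trips.csv": {"Time", "Length"},
    "emissions.csv": {"Weight"},
}


def search_by_unit(unit, index=DATASET_QUANTITIES):
    """Return datasets whose columns share the quantity name of the queried unit."""
    quantity = UNIT_TO_QUANTITY.get(unit.lower())
    if quantity is None:
        return []  # unknown unit: nothing to match on
    return sorted(name for name, kinds in index.items() if quantity in kinds)


print(search_by_unit("feet"))  # ['bike_trips.csv', 'trail_segments.csv']
```

A query in feet therefore also reaches datasets recorded in meters or yards, which matching on the unit string alone would miss.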

18 of 25

Recognizing Quantity Names for Tabular Data

  • Objective
    • Design and implement a model that recognizes and recommends quantity names for numeric columns that may have units, based on features extracted from the column name and the column content.

19 of 25

Recognizing Quantity Names for Tabular Data

  • Elevation, ft (quantity name: Length)
    • 1155, 0, 203, 204, 204, ..., 1074, 1100, 1354, 1090, 1090
  • duration_seconds (quantity name: Time)
    • 30.24, 30.56, 247.52, 97.34, 30.11, ..., 36.76, 49.52, 81.23, 198.53, 49.82
  • Total income (dollars in millions) (quantity name: Currency)
    • 342.1, 2279.1, 3995.9, 5978.8, 8431.3, ..., 20034.5, 28997, 134038.4, 230468.1
  • Confidence_limit_High (quantity name: Percent)
    • 23.6, 35, 38, 15.4, 7.4, ..., 41.3, 57, 57.2, 22.7, 87.3
  • CO2 (tons) (quantity name: Weight)
    • 26601.04, 29448.39, 9932.26, 15689.41, 23015.94, ..., 7324.18, 0, 0, 928126.66, 0

20 of 25

Dataset

  • Extract data from data.gov and assign an ID
  • Retain numeric columns only
  • Label each column with 0-5
  • Remove duplicate column names within the same dataset

  Quantity Name   # of Instances
  Length          896
  Time            352
  Percent         1031
  Currency        875
  Weight          233
  Total           3387

21 of 25

Features

Column Name: Canopy Height in meters

Column Name: Trip duration

22 of 25

Results: Recognizing Quantity Names

  • Upsample classes to handle imbalanced class distribution
  • Random forest with 10-Fold Cross Validation
  • Overall Accuracy: 89.5%
  • Confusion Matrix showing counts
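
The slides do not say which upsampling method was used. One common option is random oversampling with replacement, sketched below under that assumption: every class is padded with randomly repeated rows until it matches the largest class. The function name, the seed, and the toy class sizes are illustrative, and the classifier and cross-validation steps are left out.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, labels, seed=0):
    """Randomly repeat rows of smaller classes until all classes match the largest."""
    if not rows:
        return [], []
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy example: 5 Length, 2 Time and 1 Weight column.
toy_labels = ["Length"] * 5 + ["Time"] * 2 + ["Weight"]
toy_rows = list(range(len(toy_labels)))
balanced_rows, balanced_labels = upsample(toy_rows, toy_labels)
print(Counter(balanced_labels))
```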

23 of 25

Dataset Search

  • Schema label generation and quantity name recommendation are two ways to annotate datasets with semantic information

  • Indexing additional inferred metadata has the potential to improve dataset search results

24 of 25

Thank you

25 of 25

Dataset

  • Terms associated with five common quantity names
    • in parentheses: Perimeter (m)
    • after “in”: Dist. from Coop in miles
    • after a dash or underscore: segment_length_ft
    • tie with context terms: time seconds

  • Another class label “other”
    • Other quantity names, compound units, not quantity names, ...
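
The first three positions listed above (in parentheses, after “in”, after a dash or underscore) lend themselves to simple pattern matching on the column name. The sketch below is only an illustration: the unit vocabulary is hypothetical because the slides do not list the actual terms, the regular expressions are one possible reading of each pattern, and the fourth pattern is omitted because the slides do not define it precisely.

```python
import re

# Hypothetical unit terms; the real vocabulary is not given in the slides.
UNIT_TO_QUANTITY = {
    "m": "Length", "meters": "Length", "ft": "Length", "miles": "Length",
    "seconds": "Time",
    "tons": "Weight",
    "dollars": "Currency",
    "percent": "Percent", "%": "Percent",
}


def candidate_terms(column_name):
    """Collect words found in the three positions where unit terms tend to appear."""
    name = column_name.strip().lower()
    pieces = []
    pieces += re.findall(r"\(([^)]*)\)", name)       # in parentheses
    pieces += re.findall(r"\bin\s+(\w+)", name)      # after "in"
    pieces += re.findall(r"[-_]([^\W_]+)$", name)    # after the last dash or underscore
    terms = []
    for piece in pieces:
        terms.extend(re.findall(r"[a-z%]+", piece))  # split each piece into words
    return terms


def quantity_from_name(column_name):
    """Return the quantity name of the first known unit term, or None."""
    for term in candidate_terms(column_name):
        if term in UNIT_TO_QUANTITY:
            return UNIT_TO_QUANTITY[term]
    return None


for name in ["Perimeter (m)", "Dist. from Coop in miles", "segment_length_ft", "Trip duration"]:
    print(name, "->", quantity_from_name(name))
```

A column name without a recognizable unit term, such as “Trip duration”, yields `None` here, which is where features from the column content would have to carry the decision.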