Challenges and Progress in Dataset Search
Zhiyu Chen
Lehigh University
Background
What is a dataset?
Why is dataset search a challenging task?
Solution: Schema Label Generation [1]
[1] Chen et al., “Generating Schema Labels through Dataset Content Analysis”, Companion Proceedings of The Web Conference 2018.
Features: Schema Label Generation
Content histogram of “Zip Code”
Content histogram of “Farmers Market Name”
BoW features of column “Longitude” (assuming 100 data rows in total):
... | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | ... |
... | 100 | 0 | 0 | 100 | 0 | 0 | 0 | 100 | ... |
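As a rough illustration of the character-count table above, the sketch below counts, for each character, how many cells of a column contain it. Reading the BoW feature as "one count per character, at most one per row" is an assumption inferred from the example, and the function name `char_bow`, the vocabulary string, and the sample values are hypothetical.

```python
from collections import Counter

def char_bow(values, vocabulary="0123456789-."):
    """Count how many cells contain each vocabulary character at least once."""
    counts = Counter()
    for value in values:
        # set() makes sure a character is counted only once per cell
        for ch in set(str(value)):
            if ch in vocabulary:
                counts[ch] += 1
    # Report counts in vocabulary order, with 0 for characters never seen
    return {ch: counts[ch] for ch in vocabulary}

# Hypothetical longitude-like cells (not taken from the slides)
column = ["-74.47", "-77.44", "-74.74"]
features = char_bow(column)
print(features)
```

With 100 rows instead of three, a character present in every cell would get the value 100, as in the table.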
Results: Top-N accuracy
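Top-N accuracy is usually taken to mean the fraction of columns whose true label appears among the N highest-ranked predictions. The sketch below follows that common definition; the function name and the sample rankings are hypothetical and do not reproduce the reported results.

```python
def top_n_accuracy(ranked_predictions, true_labels, n):
    """Fraction of items whose true label is among the first n ranked predictions."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:n]
    )
    return hits / len(true_labels)

# Hypothetical ranked schema-label predictions for three columns
ranked = [
    ["Zip Code", "Longitude", "Latitude"],
    ["Latitude", "Longitude", "Zip Code"],
    ["Farmers Market Name", "Zip Code", "Longitude"],
]
truth = ["Zip Code", "Longitude", "Longitude"]

for n in (1, 2, 3):
    print(n, top_n_accuracy(ranked, truth, n))
```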
Why is dataset search a challenging task?
Recognizing Quantity Names for Tabular Data [2]
[2] Yi et al., “Recognizing Quantity Names for Tabular Data”, International Workshop on Data Search (DATA:SEARCH'18).
Query asks for data in feet
System recognizes it as a length
Match to datasets containing lengths/meters/yards...
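The three-step flow above can be sketched as a lookup from units to quantity names followed by a match on the quantity. The unit vocabulary, the function names, and the sample datasets here are all hypothetical; a real system would need a much larger vocabulary and a way to recognize units inside column names.

```python
# Hypothetical unit vocabulary mapping units to quantity names
UNIT_TO_QUANTITY = {
    "feet": "Length", "ft": "Length", "meters": "Length", "m": "Length",
    "yards": "Length", "miles": "Length",
    "seconds": "Time", "hours": "Time",
    "tons": "Weight", "kg": "Weight",
    "dollars": "Currency",
    "percent": "Percent",
}

def quantity_of(unit):
    """Return the quantity name for a unit, or None if the unit is unknown."""
    return UNIT_TO_QUANTITY.get(unit.strip().lower())

def match_datasets(query_unit, dataset_units):
    """Return names of datasets with a column whose unit shares the query's quantity."""
    target = quantity_of(query_unit)
    if target is None:
        return []
    return [
        name
        for name, units in dataset_units.items()
        if any(quantity_of(u) == target for u in units)
    ]

# Hypothetical datasets, each described by the units found in its columns
datasets = {
    "trails": ["meters"],
    "trips": ["seconds"],
    "fields": ["yards", "dollars"],
}
print(match_datasets("feet", datasets))
```

A query in feet matches the datasets measured in meters and yards because all three units map to Length.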
Quantity name | Column name | Sample values |
Length | Elevation, ft | 1155, 0, 203, 204, 204, ..., 1074, 1100, 1354, 1090, 1090 |
Time | duration_seconds | 30.24, 30.56, 247.52, 97.34, 30.11, ..., 36.76, 49.52, 81.23, 198.53, 49.82 |
Weight | CO2 (tons) | 26601.04, 29448.39, 9932.26, 15689.41, 23015.94, ..., 7324.18, 0, 0, 928126.66, 0 |
Percent | Confidence_limit_High | 23.6, 35, 38, 15.4, 7.4, ..., 41.3, 57, 57.2, 22.7, 87.3 |
Currency | Total income (dollars in millions) | 342.1, 2279.1, 3995.9, 5978.8, 8431.3, ..., 20034.5, 28997, 134038.4, 230468.1 |
Dataset
Extract data from data.gov and assign IDs
Retain numeric columns only
Label each column with 0-5
Remove duplicate column names within the same dataset
Quantity Name | # of Instances |
Length | 896 |
Time | 352 |
Percent | 1031 |
Currency | 875 |
Weight | 233 |
Total | 3387 |
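Two of the preparation steps listed above, keeping numeric columns and dropping repeated column names within a dataset, can be sketched as follows. This is a minimal sketch under assumptions: a column counts as numeric when every cell parses as a float, columns arrive as (name, values) pairs, and `prepare_columns` is a hypothetical helper. The 0-5 labeling step is not covered.

```python
def is_numeric(value):
    """True if the cell parses as a float (assumed test for 'numeric')."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def prepare_columns(dataset_id, columns):
    """Keep numeric columns and drop repeated column names within one dataset.

    `columns` is a list of (column_name, values) pairs.
    """
    seen = set()
    kept = []
    for name, values in columns:
        if name in seen:
            continue  # a column with this name was already kept
        if values and all(is_numeric(v) for v in values):
            seen.add(name)
            kept.append({"dataset_id": dataset_id, "name": name, "values": values})
    return kept

# Hypothetical dataset with one text column and one repeated column name
columns = [
    ("Elevation, ft", ["1155", "0", "203"]),
    ("Station", ["A", "B", "C"]),
    ("Elevation, ft", ["1", "2", "3"]),
    ("duration_seconds", ["30.24", "30.56", "247.52"]),
]
prepared = prepare_columns("ds-001", columns)
print([c["name"] for c in prepared])
```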
Features
Column Name: Canopy Height in meters
Column Name: Trip duration
Results: Recognizing Quantity Names
Dataset Search
Thank you
Dataset
Where the unit appears | Example column name |
in parentheses | Perimeter (m) |
after “in” | Dist. from Coop in miles |
after a dash or underscore | segment_length_ft |
tie with context terms | time seconds |
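The first three patterns above lend themselves to regular expressions. The sketch below is one possible reading of them; the regexes and the `candidate_unit` helper are hypothetical. The fourth pattern (context terms) is left out because the slides do not say how it is detected.

```python
import re

# Hypothetical regexes for three of the listed patterns, tried in order
PATTERNS = [
    ("parentheses", re.compile(r"\(([^)]+)\)\s*$")),
    ("after_in", re.compile(r"\bin\s+([A-Za-z]+)\s*$", re.IGNORECASE)),
    ("after_separator", re.compile(r"[-_]([A-Za-z]+)\s*$")),
]

def candidate_unit(column_name):
    """Return (pattern_label, candidate_unit) for the first matching pattern, else None."""
    for label, pattern in PATTERNS:
        match = pattern.search(column_name)
        if match:
            return label, match.group(1).strip()
    return None

for name in ["Perimeter (m)", "Dist. from Coop in miles", "segment_length_ft", "time seconds"]:
    print(name, "->", candidate_unit(name))
```

The result is only a candidate: a name such as Confidence_limit_High would yield "High", so candidates would still need to be checked against a unit vocabulary.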