1 of 73

Data Commons

Guha

2 of 73

Outline of Talk

  • Why the excitement around Data Science / Learning
  • Data and the motivation for Data Commons
  • Data Modeling and Knowledge Representation
  • Introduction to Data Commons
  • Demos
  • Schema.org and Data Commons
  • Conclusion and a call for participation

3 of 73

Evolution of Models

Much of the advance in the last 300 years has come from building models.

There was engineering before models, but building anything complex requires models

4 of 73

Analytic Models

  • Basic equations of continuum mechanics, materials, heat transfer, fluids, …. that capture the phenomenon in a mathematical form
  • System is modelled with these equations

5 of 73

Building complex artifacts

Finite element methods for complex cases

Manually built models using small number of equations capturing underlying phenomenon


6 of 73

Limits of Analytic models

We don’t have ‘basic equations’ for social, medical, behavioral, economic and other complex phenomena


7 of 73

Empirical Modelling

Take lots of data and fit the curve (i.e., machine learning)

No causal equations required

Lots of data and compute power

Massively successful in the last 10 years

8 of 73

Success of Empirical Modelling

Spell Correction

Web search and advertising

News feed

Perception: Vision, speech

Mostly web-ecosystem products

9 of 73

So much more can be done …

What do we need to apply these modelling techniques more widely?

10 of 73

Doing more with data science

What is holding us back from applying data science to 10x more problems?

Three pillars of data science / machine learning

Algorithms: Regression, SVMs, … DNNs, LSTMs, GANs, …

Compute: GPUs, TPUs, ...

Data --- this is why advances have come from web companies

11 of 73

Follow the Data

Progress comes from large datasets: Google, FB, et al.

Datasets set research direction

  • Skyserver: Sloan Digital Sky Survey
  • ImageNet

Machine learning flourishes with more quantity & variety of data

12 of 73

There is lot of data

There are a lot of datasets

data.gov: 177,928 data sets

dataMed: 1,541,00 data sets

dataverse: 48,112 data sets, …

+ Lots of private datasets

Very hard to use: one Web vs. 100k FTP sites / a million Word docs

13 of 73

Google for data

Google allows user to pretend that the Web is one site

Google for data, for use by programs: Enable developer to pretend all this data is in one database

Example use cases:

  • Understanding the Opioid epidemic
  • Explaining the impact of offering retail discounts

14 of 73

Issues in building a Google for data

Some design issues in building a ‘Google for data’

  • What is the data model?
  • What is the schema?
  • Where does the data come from?
  • How do we reconcile data from different sources?
  • Who do we believe?
  • How do we get others to give us the data?
  • What should be the first applications?

15 of 73

Data Model, Schema, etc.

Stepping back ...

We are trying to build intelligent systems

To be intelligent requires knowledge about the domain

Two very different approaches towards this knowledge

--- Telling the system (analytic models)

--- Having the system learn from examples (empirical models)

16 of 73

Learning and data models

Almost all learning systems use fairly rudimentary representations for the training data

--- Feature vectors, simple tables

--- Why? Learnability has dictated the representation

What if the domain data can’t easily be expressed as a table?

And if you can’t represent the data, how can you learn?

‘Knowledge Representation’ to the rescue?

17 of 73

The Knowledge Representation Program

Focus on representing what the system needs to know

1. ‘Representation language’ for representing the facts about the domain that the system needs to know

2. Algorithms for drawing conclusions from these facts

3. Figure out how to most effectively construct a database containing the facts. Often the hardest step.

18 of 73

The KR research program

  • Foundations from mathematical logic
    • Mostly built on ideas from first order predicate logic

  • Many variations
    • Formal systems, Frames, Production Systems, Expert Systems, Probabilistic systems

  • Feature vectors, RDBMS, Data Cubes are all specializations of the general class of models studied in KR

19 of 73

Formal Systems: McCarthy 1958

"…programs to manipulate in a suitable formal language (most likely a part of the predicate calculus) common instrumental statements. The basic program will draw immediate conclusions from a list of premises. These conclusions will be either declarative or imperative sentences. When an imperative sentence is deduced the program takes a corresponding action."

Work that influenced everything from functional

programming to program verification and databases

Deeply influenced by formal logic and philosophy

20 of 73

Frames: Minsky 1974

A frame is a data-structure for representing a stereotyped situation, like being in a certain kind of living room, or going to a child's birthday party. Attached to each frame are several kinds of information. Some of this information is about how to use the frame. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed.

Drew inspiration from Cognitive Science, Logic, …

Introduced defaults, taxonomies, classes, inheritance

Impacted not just AI, but also programming languages, etc.

21 of 73

Newell-Simon School

Started with the Logic Theorist (1958)

Deep roots in Psychology and Cognitive Science

Cognitively plausible, models of learning, problem solving, etc.

Reasoning as search, GPS/means-end-analysis, heuristic search

Knowledge level

GPS, SOAR, ...

22 of 73

Example: Simple Facts

Chris Smith is a student at UC Berkeley

typeOf(ChrisSmith, UCBerkeleyStudent)

or should it be

a. studiesAt(ChrisSmith, UCBerkeley)

b. typeOf(ChrisSmith, Student)

Should (b) follow from (a)? What does ‘follow’ mean?

What relations/properties/attributes (aka predicate) should we use?

What is a ‘type’? Set of students wearing red socks in DS100 who ...
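To make the choices concrete, here is a minimal sketch in Python using (subject, predicate, object) triples, where ‘follow’ is implemented as an explicit inference rule; the rule itself, and all names (ChrisSmith, UCBerkeley, ...), are illustrative assumptions.

```python
# Facts as (subject, predicate, object) triples.
facts = {("ChrisSmith", "studiesAt", "UCBerkeley")}

def infer(facts):
    """One possible meaning of 'follow': an explicit rule saying that
    anyone who studiesAt somewhere is of type Student."""
    derived = set(facts)
    for subj, pred, obj in facts:
        if pred == "studiesAt":
            derived.add((subj, "typeOf", "Student"))
    return derived

print(sorted(infer(facts)))
```

Nothing in the triples forces (b) to follow from (a); it follows only because we chose to add that rule.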

23 of 73

Example: Temporal Representation

ChrisSmith’s address in 2016 was xxx

ChrisSmith’s gender in 2016 was male

address(ChrisSmith, xxx, 2016)

gender(ChrisSmith, Male, 2016)

What is 2016? A point, an interval, …?

What was ChrisSmith’s address in 2017? gender in 2017?

The frame problem, frame axiom
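A minimal sketch of why 2017 is a problem: with only time-indexed facts and no frame axiom saying that facts persist, a query about 2017 simply has no answer (names and values are illustrative).

```python
# Time-indexed facts as (subject, predicate, value, year) tuples.
facts = {
    ("ChrisSmith", "address", "xxx", 2016),
    ("ChrisSmith", "gender", "Male", 2016),
}

def value_at(facts, subj, pred, year):
    """Return the stored value for that year, or None: without a frame
    axiom, nothing licenses carrying 2016 facts forward to 2017."""
    for s, p, v, t in facts:
        if (s, p, t) == (subj, pred, year):
            return v
    return None

print(value_at(facts, "ChrisSmith", "address", 2016))  # xxx
print(value_at(facts, "ChrisSmith", "address", 2017))  # None: the frame problem
```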

24 of 73

Example: Defaults, frames, approximations

Undergrads live near their university.

What about study abroad? How about leave of absence?

Defaults, non-monotonic logic, ...

Undergrad students are between the ages of 18-22, live at or near the university, … frames

How do we model ‘near’: near(x, y)? Is near transitive? How far?
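A toy sketch of the default-with-exceptions pattern above, not any particular nonmonotonic logic; the names and the exception set are illustrative.

```python
def lives_near_university(student, known_exceptions):
    """Default: undergrads live near their university...
    ...unless we learn otherwise (study abroad, leave of absence, ...)."""
    return student not in known_exceptions

exceptions = {"Ana"}  # Ana is on study abroad

print(lives_near_university("Ben", exceptions))  # True (by default)
print(lives_near_university("Ana", exceptions))  # False (default retracted)
```

Note the nonmonotonicity: adding a fact (an exception) retracts a conclusion, which plain first-order logic cannot do.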

25 of 73

Probability

Most probabilistic systems are propositional

Combining first order logic with probabilities is hard

Probabilities over populations vs over possible worlds

Frequentists vs Bayesians

26 of 73

Expert Systems

Focus on using these tools to solve practical problems

Started with production systems, evolved to incorporate ideas from frames, uncertainty, formal systems, …

Dendral, Mycin ….

Many industrial systems, some still around

Overhype led to the last AI winter

27 of 73

What happened to all this?

AI winter of the 90s

  • Too much had been promised
  • The web happened and …
  • Cyc, Yale Shooting Problem

But influences remain

  • In applications, CS theory, rule based systems, on the web

28 of 73

The rise of KR on the web

Basis for structured data on the web --- RDF, Linked Open Data, Schema.org, …

Used the flexibility of KR representations, without the lofty goals

Restricted use to ground atomic facts using binary relations

--- knowledge graphs

The big impetus from search

29 of 73

Knowledge Graphs are now ubiquitous

Search, Personal Assistants and other consumer apps

We reached the limits of what can be done with text

More form factors and more interaction modalities → Structured data is becoming more important …

Google (KG), Microsoft (Satori), Facebook (OGP), Amazon (Alexa), Apple …

Each has their own ‘knowledge graph’

30 of 73

Knowledge Graphs in search

31 of 73

In Personal assistants

[Screenshots: Google Now and Microsoft Cortana]

32 of 73

Data Commons: Google for data

Google for data, for use by programs: Enable developer to pretend all this data is in one Knowledge Graph

33 of 73

Data Commons

  • Pull together a number of interesting datasets into a single coherent knowledge graph
  • Schema from Schema.org + vocab for time, statistics, etc.
  • Enough of a core to enable interesting applications
  • Automated resolution of entities, etc. across datasets
  • Enable others to add to the knowledge graph
  • Expose data via apis to everyone

34 of 73

Many similarities with Web & Google

Anyone can publish: Some public, some for fee, some private

Analog of search engine:

Services that crawl, aggregate, index and provide apis

The best ‘search engine’ will win. Fastest, latest, biggest index, best quality, …

Anyone can build applications on top of these APIs

35 of 73

Provenance: Access, not truth

[Diagram: an entity X is described by sources W1, W2, W3; each can answer GetData(<X-Desc>, property), and a caller can restrict a query to chosen sources.]

GetData(< >, property, [W1, W2, W3, …])

GetData(< >, property, [‘good sources’])

Like the web, there will be wrong data, spam, etc.

Caller has to decide whom to trust. Over time, the system can help.
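A hypothetical sketch of this call pattern; the get_data name, signature, and graph contents are illustrative, not the real Data Commons API. The point is that the system provides access per source and the caller picks whom to trust.

```python
# Toy store: each (entity, property) maps to values keyed by source.
GRAPH = {
    ("X", "population"): {"W1": 1000, "W2": 1020, "W3": 900},
}

def get_data(entity, prop, sources=None):
    """Return {source: value}; no source is declared 'true' by the system."""
    by_source = GRAPH.get((entity, prop), {})
    if sources is None:
        return dict(by_source)
    return {w: v for w, v in by_source.items() if w in sources}

print(get_data("X", "population", ["W1", "W2"]))  # {'W1': 1000, 'W2': 1020}
```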

36 of 73

Github for data

Projects provide data

Data can be public or private (a la internet & intranet)

Projects can import data from other projects

Unlike Github for code

- bring code to data

- data, unlike code, can be joined

37 of 73

DataCommons.org

First version, with data from

  • Census (American Community Survey)
  • Bureau of Labor Statistics
  • NOAA
  • FBI
  • CollegeScoreCard
  • CDC
  • Voter data
  • Wikipedia

38 of 73

Demo

39 of 73

How to use DataCommons

  1. Ask a question / identify a topic
  2. Explore dataCommons and find pertinent data
  3. Query dataCommons to extract a table representation
  4. Proceed with analysis

40 of 73

How to use DataCommons

  • Ask a question / identify a topic
    1. How does the distribution of income differ between genders?
  • Explore dataCommons and find pertinent data
  • Query dataCommons to extract a table representation
  • Proceed with analysis
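The four steps can be sketched end to end, with a stub standing in for the real dataCommons query API; all names and numbers below are illustrative.

```python
def query_datacommons(place_type, statistic):
    """Stub: in practice this call would go to the dataCommons API."""
    return [
        {"place": "Berkeley", statistic: 121_000},
        {"place": "Palo Alto", statistic: 67_000},
    ]

# 1. Ask a question: how does population vary across cities?
# 2-3. Explore dataCommons and extract a table representation.
table = query_datacommons("City", "population")

# 4. Proceed with analysis.
mean_pop = sum(row["population"] for row in table) / len(table)
print(mean_pop)  # 94000.0
```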

41 of 73

Now it’s your turn

This week’s lab...

Train a linear model for predicting prevalence of obesity using statistics from three different datasets.

  • Learn more about knowledge graphs
  • Learn how to query dataCommons for statistics

42 of 73

Lecture Check-in

Please fill out the google form to check into lecture!

yellkey.com/three

43 of 73

DataCommons

Internals

  • Storage
  • Query

Technical Challenges

Making datacommons ‘web scale’

44 of 73

DataCommons Mixer

[Architecture diagram: the Mixer serves clients (Browser, iPython, other apps) from a set of tables: Places, Organizations, Statistical Populations, Observations, and Triples.]

45 of 73

Data Commons Internals

Storage: as a set of relational tables

  • Why?

Graph query/update → SQL (translation demo)

Higher level APIs

Browser
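A minimal sketch of the storage idea, assuming a single triples table: a graph pattern with a shared variable becomes a SQL self-join. The data is illustrative; the real system uses more tables and a real translator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ToriAmos", "type", "Musician"),
    ("ToriAmos", "birthplace", "NewtonNC"),
    ("NewtonNC", "containedIn", "NorthCarolina"),
])

# Graph query: ?x birthplace ?p . ?p containedIn ?state
# translates to a self-join on the shared variable ?p.
rows = conn.execute("""
    SELECT t1.subj, t2.obj
    FROM triples t1 JOIN triples t2 ON t1.obj = t2.subj
    WHERE t1.pred = 'birthplace' AND t2.pred = 'containedIn'
""").fetchall()
print(rows)  # [('ToriAmos', 'NorthCarolina')]
```

Storing the graph relationally is why off-the-shelf SQL engines can serve graph queries: each extra edge in the pattern is one more join.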

46 of 73

Technical Challenges

Two main challenges

  • Technical challenge of ‘stitching’ the graph
  • Preserving Schema integrity

47 of 73

Technical Challenge: The problem of names

48 of 73

As a communication problem

Sender: “John son Jane”

Receiver: Ok, got “John son Jane”, but: Which John? What is son? Which Jane?

Receiver: I think you mean johns@ son janeFoo@

Sender: “John, <desc of John> son Jane, <desc of Jane>”

49 of 73

An old problem

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.

Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.

--- Shannon, in ‘Mathematical Theory of Communication’

50 of 73

More recent variants

More recent versions of the problem

  • Record linkage
  • Database entity resolution
  • Feed ingestion

Most common solution is to have global ids for everything

All cases of ‘should have had the same key/id, but because of some noise/error, there is a mismatch that we have to fix.’

51 of 73

Coordinating Names

~1000s of terms like Actor, birthdate

10s for most sites

~1b-100b terms like Tori Amos and Newton, NC

Cannot expect 1000s of sites to coordinate on these

The problem is not generating URIs; the problem is coordination costs

Need to reduce shared vocabulary to minimum!

Agreements O(#cols) not O(#rows)

[Example graph: Tori Amos: type → Musician; birthplace → Newton, NC; birthdate → 8/22/1963; citizenOf → USA]

52 of 73

Alternate point of view

Look at it as a communication problem.

We need to communicate a reference to something

that we don’t share a name for.

How do we do it?

Design data schemas/formats that are optimized for ease of integration (as opposed to consistency, compactness, etc.)

53 of 73

Reference in human communication

How does this work in human communication?

We disambiguate using descriptions

Ambiguous names: John McCarthy, Stanford CS Prof, his son Tim,

Entities with no names: McCarthy’s first car, awm@’s left shoe, …

Complex descriptions: X who is married to Y whose mother went to school with principal of X’s high school who is ....

54 of 73

Solution: Reference by description

Humans: Reference by Description (RBD)

‘Jane Smith’, teacher at ‘Gunn HS’, located in city ‘Palo Alto’

Programmatic version of reference by description

example here

55 of 73

Reference by Description

Use some shared vocabulary and shared knowledge of the underlying domain to communicate references

Where does the shared vocabulary come from?

example: Person named Perez who teaches a course named DS100 in a University ...
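A minimal sketch of reference by description over a toy graph: a description over shared vocabulary, not a shared id, picks out the node. The graph, ids, and resolve function are all illustrative.

```python
# Toy graph as (subject, predicate, object) triples. Local ids p1, p2, c1
# are NOT shared with the other party; predicate names and values are.
graph = [
    ("p1", "name", "Perez"), ("p1", "teaches", "c1"),
    ("c1", "name", "DS100"), ("c1", "offeredBy", "u1"),
    ("p2", "name", "Perez"),  # a different Perez, who teaches nothing
]

def resolve(graph, constraints):
    """Ids of nodes satisfying every (predicate, object) constraint."""
    candidates = {s for (s, _, _) in graph}
    for pred, obj in constraints:
        candidates = {s for s in candidates if (s, pred, obj) in graph}
    return candidates

print(sorted(resolve(graph, [("name", "Perez")])))                     # ['p1', 'p2']
print(sorted(resolve(graph, [("name", "Perez"), ("teaches", "c1")])))  # ['p1']
```

The name alone is ambiguous; adding one more descriptive constraint makes the reference unique, which is exactly how human communication handles entities without shared names.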

56 of 73

Schema.org

Schema for embedding structured data in web pages, email, etc.

Collaborative effort started by Google, Microsoft, et al. in 2010

Data used by Google, Bing, Cortana, Siri, …

Today about 2000 core terms, in use by over 20m sites.

Gives us our core bootstrapping vocabulary
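For concreteness, here is the JSON-LD form of a small schema.org description, built in Python; the values are illustrative, while Person and HighSchool are real schema.org types.

```python
import json

# Schema.org markup of the kind sites embed in web pages (JSON-LD form).
markup = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Smith",
    "jobTitle": "Teacher",
    "worksFor": {"@type": "HighSchool", "name": "Gunn HS"},
}

print(json.dumps(markup, indent=2))
```

Because thousands of sites agree only on the vocabulary (the predicate and type names), agreement costs stay O(#cols), not O(#rows).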

57 of 73

58 of 73

Schema.org applications: search

59 of 73

Reservations ➔ Personal Assistant

Open Table → confirmation email → Now/Cortana Reminder

60 of 73

Schema.org … the numbers

In use by ~20 million sites: 20% growth over last 18 months

Roughly 40% of pages in search index have markup

~50% of US/EU ecommerce emails

Vocab: Core (~ 2k terms) + extensions (real estate, finance, etc.)

Supported by most major web publishing platforms

61 of 73

Schema.org: Major sites

News: Nytimes, guardian, bbc,

Movies: imdb, rottentomatoes, movies.com

Products: ebay, alibaba, sears, cafepress, sulit, fotolia

Local: yelp, allmenus, urbanspoon

Events: wherevent, meetup, zillow, eventful

….

62 of 73

Schema.org and DataCommons

Including all this data, expressed in Schema.org, into datacommons:

  • Scientific data sets: NYBG, NOAA, EPA, NASA
  • Census, public statistics
  • Wikipedia
  • Structured data from X million sites using Schema.org: events, jobs, venues, offers, publishers, …

63 of 73

Data Commons futures

Current focus on students

Data as a service for ML/DS courses, etc.

    • python APIs, python notebooks

Next set of applications

  • Platform for data journalism
  • Research data platform

64 of 73

Concluding

Data is critical to Data Science.

There are many richer data models and representations

Data Commons is an effort to bring together a very large collection of data into a single coherent representation

Many technical challenges, but we have made some progress

Much more work needs to be done ...

65 of 73

A long long time ago ...

The context: 1990-1994

The idea of `online’ was in the air

Contenders from universities

    • Gopher, Archie, WAIS, Usenet, Jughead, WWW

From private sector

    • AOL, Prodigy, CompuServe, MSN
  • Most had more features than WWW

66 of 73

The Arrival of Mosaic

  • NCSA Mosaic was released in fall 1993
    • It had images, forms, cgi-bin, support for multiple protocols, …
  • The Web had 350,000% growth rate in 1993
  • Why WWW and why this growth?

67 of 73

TimBL’s Web was more flexible

  • TimBL did not present it to the world as a finished product, but as a ‘work in progress’ and invited others to contribute
  • Many groups came up with different derivatives
  • Andreessen, Bina & McCool extended it to create Mosaic by adding images, forms, cgi, mime-types, …

In that spirit we invite you to take DataCommons in new directions!

68 of 73

We need your help!

Use it in your next project

Research project

Hobby project, ...

Add data

Climate

Elections: Voting data, polling data

So much more …

69 of 73

We need your help: APIs

Higher level APIs:

  • Currently built on datalog. Other graph query languages?
  • Higher level apis for specific tasks
  • Web interface for generating tables

Tools for checking sanity of data

Help improve the tutorial, documentation, lab ...

70 of 73

We need your help: Visualization tools

[Example visualizations: Palo Alto and Berkeley]

71 of 73

Natural language tools

  • Better search
  • What is the average age of veterans in …
  • Response as chart, Response as table

‘Show me a plot of median age vs population for ...’

72 of 73

Going beyond ...

From curve fitting to explanations … what happened in Palo Alto?

73 of 73

Questions?