1 of 73

Data Commons

Guha

2 of 73

Outline of Talk

  • Why the excitement around Data Science / Learning
  • Data and the motivation for Data Commons
  • Data Modeling and Knowledge Representation
  • Introduction to Data Commons
  • Demos
  • Schema.org and Data Commons
  • Conclusion and a call for participation

3 of 73

Evolution of Models

Much of the advance in the last 300 years has come from building models.

There was engineering before models, but building anything complex requires models

4 of 73

Analytic Models

  • Basic equations of continuum mechanics, materials, heat transfer, fluids, …. that capture the phenomenon in a mathematical form
  • System is modelled with these equations

5 of 73

Building complex artifacts

Finite element methods for complex cases

Manually built models using small number of equations capturing underlying phenomenon


6 of 73

Limits of Analytic models

We don’t have ‘basic equations’ for social, medical, behavioral, economic and other complex phenomena


7 of 73

Empirical Modelling

Take lots of data and fit the curve (i.e., machine learning)

No causal equations required

Lots of data and compute power

Massively successful in the last 10 years

8 of 73

Success of Empirical Modelling

Spell Correction

Web search and advertising

News feed

Perception: Vision, speech

Mostly web-ecosystem products

9 of 73

So much more can be done …

What do we need to apply these modelling techniques more widely?

10 of 73

Doing more with data science

What is holding us back from applying data science to 10x more problems?

Three pillars of data science / machine learning

Algorithms: Regression, SVMs, … DNNs, LSTMs, GANs, …

Compute: GPUs, TPUs, ...

Data --- this is why advances have come from web companies

11 of 73

Follow the Data

Progress comes from large datasets: Google, FB, et al.

Datasets set research direction

  • Skyserver: Sloan Digital Sky Survey
  • ImageNet

Machine learning flourishes with more quantity & variety of data

12 of 73

There is lot of data

There are a lot of datasets

data.gov: 177,928 data sets

dataMed: 1,541,00 data sets

dataverse: 48,112 data sets, …

+ Lots of private datasets

Very hard to use: one Web vs. 100k FTP sites / a million Word docs

13 of 73

Google for data

Google allows user to pretend that the Web is one site

Google for data, for use by programs: Enable developer to pretend all this data is in one database

Example use cases:

  • Understanding the Opioid epidemic
  • Explaining the impact of offering retail discounts

14 of 73

Issues in building a Google for data

Some design issues in building a ‘Google for data’

  • What is the data model?
  • What is the schema?
  • Where does the data come from?
  • How do we reconcile data from different sources?
  • Who do we believe?
  • How do we get others to give us the data?
  • What should be the first applications?

15 of 73

Data Model, Schema, etc.

Stepping back ...

We are trying to build intelligent systems

To be intelligent requires knowledge about the domain

Two very different approaches towards this knowledge

--- Telling the system (analytic models)

--- Having the system learn from examples (empirical models)

16 of 73

Learning and data models

Almost all learning systems use fairly rudimentary representations for the training data

--- Feature vectors, simple tables

--- Why? Learnability has dictated the representation

What if the domain data can’t easily be expressed as a table?

And if you can’t represent the data, how can you learn?

‘Knowledge Representation’ to the rescue?

17 of 73

The Knowledge Representation Program

Focus on representing what the system needs to know

1. ‘Representation language’ for representing the facts about the domain that the system needs to know

2. Algorithms for drawing conclusions from these facts

3. Figure out how to most effectively construct a database containing the facts. Often the hardest step.

18 of 73

The KR research program

  • Foundations from mathematical logic
    • Mostly built on ideas from first order predicate logic

  • Many variations
    • Formal systems, Frames, Production Systems, Expert Systems, Probabilistic systems

  • Feature vectors, RDBMS, Data Cubes are all specializations of the general class of models studied in KR

19 of 73

Formal Systems: McCarthy 1958

"…programs to manipulate in a suitable formal language (most likely a part of the predicate calculus) common instrumental statements. The basic program will draw immediate conclusions from a list of premises. These conclusions will be either declarative or imperative sentences. When an imperative sentence is deduced the program takes a corresponding action."

Work that influenced everything from functional

programming to program verification and databases

Deeply influenced by formal logic and philosophy

20 of 73

Frames: Minsky 1974

A frame is a data-structure for representing a stereotyped situation, like being in a certain kind of living room, or going to a child's birthday party. Attached to each frame are several kinds of information. Some of this information is about how to use the frame. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed.

Drew inspiration from Cognitive Science, Logic, …

Introduced defaults, taxonomies, classes, inheritance

Impacted not just AI, but also programming languages, etc.

21 of 73

Newell-Simon School

Started with the Logic Theorist (1958)

Deep roots in Psychology and Cognitive Science

Cognitively plausible, models of learning, problem solving, etc.

Reasoning as search, GPS/means-end-analysis, heuristic search

Knowledge level

GPS, SOAR, ...

22 of 73

Example: Simple Facts

Chris Smith is a student at UC Berkeley

typeOf(ChrisSmith, UCBerkeleyStudent)

or should it be

a. studiesAt(ChrisSmith, UCBerkeley)

b. typeOf(ChrisSmith, Student)

Should (b) follow from (a)? What does ‘follow’ mean?

What relations/properties/attributes (aka predicate) should we use?

What is a ‘type’? Set of students wearing red socks in DS100 who ...
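To make the choices concrete, here is a minimal sketch in Python using (subject, predicate, object) triples, where ‘follow’ is implemented as an explicit inference rule; the rule itself, and all names (ChrisSmith, UCBerkeley, ...), are illustrative assumptions.

```python
# Facts as (subject, predicate, object) triples.
facts = {("ChrisSmith", "studiesAt", "UCBerkeley")}

def infer(facts):
    """One possible meaning of 'follow': an explicit rule saying that
    anyone who studiesAt somewhere is of type Student."""
    derived = set(facts)
    for subj, pred, obj in facts:
        if pred == "studiesAt":
            derived.add((subj, "typeOf", "Student"))
    return derived

print(sorted(infer(facts)))
```

Nothing in the triples forces (b) to follow from (a); it follows only because we chose to add that rule.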

23 of 73

Example: Temporal Representation

ChrisSmith’s address in 2016 was xxx

ChrisSmith’s gender in 2016 was male

address(ChrisSmith, xxx, 2016)

gender(ChrisSmith, Male, 2016)

What is 2016? A point, an interval, …?

What was ChrisSmith’s address in 2017? gender in 2017?

The frame problem, frame axiom
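A minimal sketch of why 2017 is a problem: with only time-indexed facts and no frame axiom saying that facts persist, a query about 2017 simply has no answer (names and values are illustrative).

```python
# Time-indexed facts as (subject, predicate, value, year) tuples.
facts = {
    ("ChrisSmith", "address", "xxx", 2016),
    ("ChrisSmith", "gender", "Male", 2016),
}

def value_at(facts, subj, pred, year):
    """Return the stored value for that year, or None: without a frame
    axiom, nothing licenses carrying 2016 facts forward to 2017."""
    for s, p, v, t in facts:
        if (s, p, t) == (subj, pred, year):
            return v
    return None

print(value_at(facts, "ChrisSmith", "address", 2016))  # xxx
print(value_at(facts, "ChrisSmith", "address", 2017))  # None: the frame problem
```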

24 of 73

Example: Defaults, frames, approximations

Undergrads live near their university.

What about study abroad? How about leave of absence?

Defaults, non-monotonic logic, ...

Undergrad students are between the ages of 18-22, live at or near the university, … frames

How do we model ‘near’: near(x, y)? Is near transitive? How far?
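A toy sketch of the default-with-exceptions pattern above, not any particular nonmonotonic logic; the names and the exception set are illustrative.

```python
def lives_near_university(student, known_exceptions):
    """Default: undergrads live near their university...
    ...unless we learn otherwise (study abroad, leave of absence, ...)."""
    return student not in known_exceptions

exceptions = {"Ana"}  # Ana is on study abroad

print(lives_near_university("Ben", exceptions))  # True (by default)
print(lives_near_university("Ana", exceptions))  # False (default retracted)
```

Note the nonmonotonicity: adding a fact (an exception) retracts a conclusion, which plain first-order logic cannot do.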

25 of 73

Probability

Most probabilistic systems are propositional

Combining first order logic with probabilities is hard

Probabilities over populations vs over possible worlds

Frequentists vs Bayesians

26 of 73

Expert Systems

Focus on using these tools to solve practical problems

Started with production systems, evolved to incorporate ideas from frames, uncertainty, formal systems, …

Dendral, Mycin ….

Many industrial systems, some still around

Overhype led to the last AI winter

27 of 73

What happened to all this?

AI winter of the 90s

  • Too much had been promised
  • The web happened and …
  • Cyc, Yale Shooting Problem

But influences remain

  • In applications, CS theory, rule based systems, on the web

28 of 73

The rise of KR on the web

Basis for structured data on the web --- RDF, Linked Open Data, Schema.org, …

Used the flexibility of KR representations, without the lofty goals

Restricted use to ground atomic facts using binary relations

--- knowledge graphs

The big impetus from search

29 of 73

Knowledge Graphs are now ubiquitous

Search, Personal Assistants and other consumer apps

We reached the limits of what can be done with text

More form factors and more interaction modalities → Structured data is becoming more important …

Google (KG), Microsoft (Satori), Facebook (OGP), Amazon (Alexa), Apple …

Each has their own ‘knowledge graph’

30 of 73

Knowledge Graphs in search

31 of 73

In Personal assistants

[Screenshots: Google Now and Microsoft Cortana]

32 of 73

Data Commons: Google for data

Google for data, for use by programs: Enable developer to pretend all this data is in one Knowledge Graph

33 of 73

Data Commons

  • Pull together a number of interesting datasets into a single coherent knowledge graph
  • Schema from Schema.org + vocab for time, statistics, etc.
  • Enough of a core to enable interesting applications
  • Automated resolution of entities, etc. across datasets
  • Enable others to add to the knowledge graph
  • Expose data via apis to everyone

34 of 73

Many similarities with Web & Google

Anyone can publish: Some public, some for fee, some private

Analog of search engine:

Services that crawl, aggregate, index and provide apis

The best ‘search engine’ will win. Fastest, latest, biggest index, best quality, …

Anyone can build applications on top of these APIs

35 of 73

Provenance: Access, not truth

[Diagram: an entity X is described by sources W1, W2, W3; each can answer GetData(<X-Desc>, property), and a caller can restrict a query to chosen sources.]

GetData(< >, property, [W1, W2, W3, …])

GetData(< >, property, [‘good sources’])

Like the web, there will be wrong data, spam, etc.

Caller has to decide whom to trust. Over time, the system can help.
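A hypothetical sketch of this call pattern; the get_data name, signature, and graph contents are illustrative, not the real Data Commons API. The point is that the system provides access per source and the caller picks whom to trust.

```python
# Toy store: each (entity, property) maps to values keyed by source.
GRAPH = {
    ("X", "population"): {"W1": 1000, "W2": 1020, "W3": 900},
}

def get_data(entity, prop, sources=None):
    """Return {source: value}; no source is declared 'true' by the system."""
    by_source = GRAPH.get((entity, prop), {})
    if sources is None:
        return dict(by_source)
    return {w: v for w, v in by_source.items() if w in sources}

print(get_data("X", "population", ["W1", "W2"]))  # {'W1': 1000, 'W2': 1020}
```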

36 of 73

Github for data

Projects provide data

Data can be public or private (a la internet & intranet)

Projects can import data from other projects

Unlike Github for code

- bring code to data

- data, unlike code, can be joined

37 of 73

DataCommons.org

First version, with data from

  • Census (American Community Survey)
  • Bureau of Labor Statistics
  • NOAA
  • FBI
  • CollegeScoreCard
  • CDC
  • Voter data
  • Wikipedia

38 of 73

Demo

39 of 73

How to use DataCommons

  1. Ask a question / identify a topic
  2. Explore dataCommons and find pertinent data
  3. Query dataCommons to extract a table representation
  4. Proceed with analysis

40 of 73

How to use DataCommons

  • Ask a question / identify a topic
    1. How does the distribution of income differ between genders?
  • Explore dataCommons and find pertinent data
  • Query dataCommons to extract a table representation
  • Proceed with analysis
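The four steps can be sketched end to end, with a stub standing in for the real dataCommons query API; all names and numbers below are illustrative.

```python
def query_datacommons(place_type, statistic):
    """Stub: in practice this call would go to the dataCommons API."""
    return [
        {"place": "Berkeley", statistic: 121_000},
        {"place": "Palo Alto", statistic: 67_000},
    ]

# 1. Ask a question: how does population vary across cities?
# 2-3. Explore dataCommons and extract a table representation.
table = query_datacommons("City", "population")

# 4. Proceed with analysis.
mean_pop = sum(row["population"] for row in table) / len(table)
print(mean_pop)  # 94000.0
```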

41 of 73

Now it’s your turn

This week’s lab...

Train a linear model for predicting prevalence of obesity using statistics from three different datasets.

  • Learn more about knowledge graphs
  • Learn how to query dataCommons for statistics

42 of 73

Lecture Check-in

Please fill out the google form to check into lecture!

yellkey.com/three

43 of 73

DataCommons

Internals

  • Storage
  • Query

Technical Challenges

Making datacommons ‘web scale’

44 of 73

DataCommons Mixer

[Architecture diagram: the Mixer serves clients (Browser, iPython, other apps) from a set of tables: Places, Organizations, Statistical Populations, Observations, and Triples.]

45 of 73

Data Commons Internals

Storage: as a set of relational tables

  • Why?

Graph query/update → SQL (translation demo)

Higher level APIs

Browser
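A minimal sketch of the storage idea, assuming a single triples table: a graph pattern with a shared variable becomes a SQL self-join. The data is illustrative; the real system uses more tables and a real translator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ToriAmos", "type", "Musician"),
    ("ToriAmos", "birthplace", "NewtonNC"),
    ("NewtonNC", "containedIn", "NorthCarolina"),
])

# Graph query: ?x birthplace ?p . ?p containedIn ?state
# translates to a self-join on the shared variable ?p.
rows = conn.execute("""
    SELECT t1.subj, t2.obj
    FROM triples t1 JOIN triples t2 ON t1.obj = t2.subj
    WHERE t1.pred = 'birthplace' AND t2.pred = 'containedIn'
""").fetchall()
print(rows)  # [('ToriAmos', 'NorthCarolina')]
```

Storing the graph relationally is why off-the-shelf SQL engines can serve graph queries: each extra edge in the pattern is one more join.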

46 of 73

Technical Challenges

Two main challenges

  • Technical challenge of ‘stitching’ the graph
  • Preserving Schema integrity

47 of 73

Technical Challenge: The problem of names

48 of 73

As a communication problem

Sender: “John son Jane”

Receiver: Ok, got “John son Jane”, but: Which John? What is son? Which Jane?

Receiver: I think you mean johns@ son janeFoo@

Sender: “John, <desc of John> son Jane, <desc of Jane>”

49 of 73

An old problem

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.

Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.

--- Shannon, in ‘Mathematical Theory of Communication’

50 of 73

More recent variants

More recent versions of the problem

  • Record linkage
  • Database entity resolution
  • Feed ingestion

Most common solution is to have global ids for everything

All cases of ‘should have had the same key/id, but because of some noise/error, there is a mismatch that we have to fix.’

51 of 73

Coordinating Names

~1000s of terms like Actor, birthdate

10s for most sites

~1b-100b terms like Tori Amos and Newton, NC

Cannot expect 1000s of sites to coordinate on these

The problem is not generating URIs; the problem is coordination costs

Need to reduce shared vocabulary to minimum!

Agreements O(#cols) not O(#rows)

[Example graph: Tori Amos: type → Musician; birthplace → Newton, NC; birthdate → 8/22/1963; citizenOf → USA]

52 of 73

Alternate point of view

Look at it as a communication problem.

We need to communicate a reference to something

that we don’t share a name for.

How do we do it?

Design data schemas/formats that are optimized for ease of integration (as opposed to consistency, compactness, etc.)

53 of 73

Reference in human communication

How does this work in human communication?

We disambiguate using descriptions

Ambiguous names: John McCarthy, Stanford CS Prof, his son Tim,

Entities with no names: McCarthy’s first car, awm@’s left shoe, …

Complex descriptions: X who is married to Y whose mother went to school with principal of X’s high school who is ....

54 of 73

Solution: Reference by description

Humans: Reference by Description (RBD)

‘Jane Smith’, teacher at ‘Gunn HS’, located in city ‘Palo Alto’

Programmatic version of reference by description

example here

55 of 73

Reference by Description

Use some shared vocabulary and shared knowledge of the underlying domain to communicate references

Where does the shared vocabulary come from?

example: Person named Perez who teaches a course named DS100 in a University ...
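A minimal sketch of reference by description over a toy graph: a description over shared vocabulary, not a shared id, picks out the node. The graph, ids, and resolve function are all illustrative.

```python
# Toy graph as (subject, predicate, object) triples. Local ids p1, p2, c1
# are NOT shared with the other party; predicate names and values are.
graph = [
    ("p1", "name", "Perez"), ("p1", "teaches", "c1"),
    ("c1", "name", "DS100"), ("c1", "offeredBy", "u1"),
    ("p2", "name", "Perez"),  # a different Perez, who teaches nothing
]

def resolve(graph, constraints):
    """Ids of nodes satisfying every (predicate, object) constraint."""
    candidates = {s for (s, _, _) in graph}
    for pred, obj in constraints:
        candidates = {s for s in candidates if (s, pred, obj) in graph}
    return candidates

print(sorted(resolve(graph, [("name", "Perez")])))                     # ['p1', 'p2']
print(sorted(resolve(graph, [("name", "Perez"), ("teaches", "c1")])))  # ['p1']
```

The name alone is ambiguous; adding one more descriptive constraint makes the reference unique, which is exactly how human communication handles entities without shared names.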

56 of 73

Schema.org

Schema for embedding structured data in web pages, email, etc.

Collaborative effort started by Google, Microsoft, et al. in 2010

Data used by Google, Bing, Cortana, Siri, …

Today about 2000 core terms, in use by over 20m sites.

Gives us our core bootstrapping vocabulary
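For concreteness, here is the JSON-LD form of a small schema.org description, built in Python; the values are illustrative, while Person and HighSchool are real schema.org types.

```python
import json

# Schema.org markup of the kind sites embed in web pages (JSON-LD form).
markup = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Smith",
    "jobTitle": "Teacher",
    "worksFor": {"@type": "HighSchool", "name": "Gunn HS"},
}

print(json.dumps(markup, indent=2))
```

Because thousands of sites agree only on the vocabulary (the predicate and type names), agreement costs stay O(#cols), not O(#rows).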

57 of 73

58 of 73

Schema.org applications: search

59 of 73

Reservations ➔ Personal Assistant

Open Table → confirmation email → Now/Cortana Reminder

60 of 73

Schema.org … the numbers

In use by ~20 million sites: 20% growth over last 18 months

Roughly 40% of pages in search index have markup

~50% of US/EU ecommerce emails

Vocab: Core (~ 2k terms) + extensions (real estate, finance, etc.)

Supported by most major web publishing platforms

61 of 73

Schema.org: Major sites

News: Nytimes, guardian, bbc,

Movies: imdb, rottentomatoes, movies.com

Products: ebay, alibaba, sears, cafepress, sulit, fotolia

Local: yelp, allmenus, urbanspoon

Events: wherevent, meetup, zillow, eventful

….

62 of 73

Schema.org and DataCommons

Including all this data, expressed in Schema.org, into datacommons:

  • Scientific data sets: NYBG, NOAA, EPA, NASA
  • Census, public statistics
  • Wikipedia
  • Structured data from X million sites using Schema.org: events, jobs, venues, offers, publishers, …

63 of 73

Data Commons futures

Current focus on students

Data as a service for ML/DS courses, etc.

    • python APIs, python notebooks

Next set of applications

  • Platform for data journalism
  • Research data platform

64 of 73

Concluding

Data is critical to Data Science.

There are many richer data models and representations

Data Commons is an effort to bring together a very large collection of data into a single coherent representation

Many technical challenges, but we have made some progress

Much more work needs to be done ...

65 of 73

A long long time ago ...

The context: 1990-1994

The idea of `online’ was in the air

Contenders from universities

    • Gopher, Archie, WAIS, Usenet, Jughead, WWW

From private sector

    • AOL, Prodigy, CompuServe, MSN
  • Most had more features than WWW

66 of 73

The Arrival of Mosaic

  • NCSA Mosaic was released in fall 1993
    • It had images, forms, cgi-bin, support for multiple protocols, …
  • The Web had 350,000% growth rate in 1993
  • Why WWW and why this growth?

67 of 73

TimBL’s Web was more flexible

  • TimBL did not present it to the world as a finished product, but as a ‘work in progress’ and invited others to contribute
  • Many groups came up with different derivatives
  • Andreessen, Bina & McCool extended it to create Mosaic by adding images, forms, cgi, mime-types, …

In that spirit we invite you to take DataCommons in new directions!

68 of 73

We need your help!

Use it in your next project

Research project

Hobby project, ...

Add data

Climate

Elections: Voting data, polling data

So much more …

69 of 73

We need your help: APIs

Higher level APIs:

  • Currently built on datalog. Other graph query languages?
  • Higher level apis for specific tasks
  • Web interface for generating tables

Tools for checking sanity of data

Help improve the tutorial, documentation, lab ...

70 of 73

We need your help: Visualization tools

[Example visualizations: Palo Alto and Berkeley]

71 of 73

Natural language tools

  • Better search
  • What is the average age of veterans in …
  • Response as chart, Response as table

‘Show me a plot of median age vs population for ...’

72 of 73

Going beyond ...

From curve fitting to explanations … what happened in Palo Alto?

73 of 73

Questions?