Data Commons
Guha
Outline of Talk
Evolution of Models
Much of the advance in the last 300 years has come from building models.
There was engineering before models, but building anything complex requires models
Analytic Models
Building complex artifacts
Finite element methods for complex cases
Manually built models using small number of equations capturing underlying phenomenon
Limits of Analytic models
We don’t have ‘basic equations’ for social, medical, behavioral, economic and other complex phenomena
Empirical Modelling
Take lots of data and fit the curve (i.e., machine learning; a minimal sketch follows below)
No causal equations required
Lots of data and compute power
Massively successful in the last 10 years
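To make "fit the curve" concrete, here is a minimal sketch: synthetic observations of an unknown phenomenon, and a model fit directly from the data with no causal equations. The data and coefficients are invented purely for illustration.

```python
# A minimal sketch of empirical modelling: fit a curve to observations.
# The "phenomenon" and all numbers here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=500)  # unknown phenomenon + noise

# Empirical model: a degree-1 polynomial fit to the data.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"learned model: y ~ {slope:.2f} * x + {intercept:.2f}")
```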
Success of Empirical Modelling
Spell Correction
Web search and advertising
News feed
Perception: Vision, speech
Mostly web-ecosystem products
So much more can be done …
What do we need to apply these modelling techniques more widely?
Doing more with data science
What is holding us back from applying data science to 10x more problems?
Three pillars of data science / machine learning
Algorithms: Regression, SVMs, … DNNs, LSTMs, GANs, …
Compute: GPUs, TPUs, ...
Data --- this is why advances have come from web companies
Follow the Data
Progress comes from large datasets: Google, FB, et al.
Datasets set research direction
Machine learning flourishes with greater quantity and variety of data
There is a lot of data
There are a lot of datasets
data.gov: 177,928 data sets
DataMed: 1,541,000 data sets
dataverse: 48,112 data sets, …
+ Lots of private datasets
Very hard to use: one Web vs. 100k FTP sites / a million Word docs
Google for data
Google allows user to pretend that the Web is one site
Google for data, for use by programs: enable developers to pretend all this data is in one database
Example use cases:
Issues in building a Google for data
Some design issues in building a ‘Google for data’
Data Model, Schema, etc.
Stepping back ...
We are trying to build intelligent systems
To be intelligent requires knowledge about the domain
Two very different approaches towards this knowledge
--- Telling the system (analytic models)
--- Having the system learn from examples (empirical models)
Learning and data models
Almost all learning systems use fairly rudimentary representations for the training data
--- Feature vectors, simple tables
--- Why? Learnability has dictated the representation
What if the domain data can’t easily be expressed as a table?
And if you can’t represent the data, how can you learn?
‘Knowledge Representation’ to the rescue?
The Knowledge Representation Program
Focus on representing what the system needs to know
1. ‘Representation language’ for representing the facts about the domain that the system needs to know
2. Algorithms for drawing conclusions from these facts
3. Figure out how to most effectively construct a database containing the facts. Often the hardest step.
The KR research program
Formal Systems: McCarthy 1958
"…programs to manipulate in a suitable formal language (most likely a part of the predicate calculus) common instrumental statements. The basic program will draw immediate conclusions from a list of premises. These conclusions will be either declarative or imperative sentences. When an imperative sentence is deduced the program takes a corresponding action."
Work that influenced everything from functional programming to program verification and databases
Deeply influenced by formal logic and philosophy
Frames: Minsky 1974
A frame is a data-structure for representing a stereotyped situation, like being in a certain kind of living room, or going to a child's birthday party. Attached to each frame are several kinds of information. Some of this information is about how to use the frame. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed.
Drew inspiration from Cognitive Science, Logic, …
Introduced defaults, taxonomies, classes, inheritance
Impacted not just AI, but also programming languages, etc.
Newell-Simon School
Started with the Logic Theorist (1958)
Deep roots in Psychology and Cognitive Science
Cognitively plausible, models of learning, problem solving, etc.
Reasoning as search, GPS/means-end-analysis, heuristic search
Knowledge level
GPS, SOAR, ...
Example: Simple Facts
Chris Smith is a student at UC Berkeley
typeOf(ChrisSmith, UCBerkeleyStudent)
or should it be
a. studiesAt(ChrisSmith, UCBerkeley)
b. typeOf(ChrisSmith, Student)
Should (b) follow from (a)? What does ‘follow’ mean? (One answer is sketched below.)
What relations/properties/attributes (aka predicate) should we use?
What is a ‘type’? Set of students wearing red socks in DS100 who ...
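One concrete reading of ‘follow’ is forward-chaining a rule. The sketch below, with invented predicate names, derives (b) from (a) via the rule studiesAt(x, u) ⇒ typeOf(x, Student).

```python
# A sketch (invented identifiers) of one sense of "follows": a rule that
# derives typeOf facts from studiesAt facts.
facts = {("studiesAt", "ChrisSmith", "UCBerkeley")}

def apply_rule(facts):
    """studiesAt(x, u) entails typeOf(x, Student)."""
    derived = set(facts)
    for (pred, subj, obj) in facts:
        if pred == "studiesAt":
            derived.add(("typeOf", subj, "Student"))
    return derived

print(sorted(apply_rule(facts)))
# [('studiesAt', 'ChrisSmith', 'UCBerkeley'), ('typeOf', 'ChrisSmith', 'Student')]
```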
Example: Temporal Representation
ChrisSmith’s address in 2016 was xxx
ChrisSmith’s gender in 2016 was male
address(ChrisSmith, xxx, 2016)
gender(ChrisSmith, Male, 2016)
What is 2016? A point, an interval, …?
What was ChrisSmith’s address in 2017? gender in 2017?
The frame problem, frame axiom (see the sketch below)
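A common fix, sketched below with invented names, is to reify time into each fact. The empty answer for 2017 is the frame problem in miniature: nothing in the data says the 2016 facts persist.

```python
# A sketch of temporalized facts: each fact carries the time it holds.
# All names and values are invented for illustration.
facts = [
    ("address", "ChrisSmith", "xxx", 2016),
    ("gender", "ChrisSmith", "Male", 2016),
]

def query(pred, subj, year):
    return [v for (p, s, v, t) in facts if (p, s, t) == (pred, subj, year)]

print(query("address", "ChrisSmith", 2016))  # ['xxx']
print(query("address", "ChrisSmith", 2017))  # [] -- the frame problem:
# no frame axiom carries the address forward, so nothing can be concluded.
```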
Example: Defaults, frames, approximations
Undergrads live near their university.
What about study abroad? How about leave of absence?
Defaults, non-monotonic logic, ...
Undergrad students are between the ages of 18 and 22, live at or near the university, … (frames)
How do we model ‘near’: near(x, y)? Is near transitive? How far?
Probability
Most probabilistic systems are propositional
Combining first order logic with probabilities is hard
Probabilities over populations vs over possible worlds
Frequentists vs Bayesians
Expert Systems
Focus on using these tools to solve practical problems
Started with production systems, evolved to incorporate ideas from frames, uncertainty, formal systems, …
Dendral, Mycin ….
Many industrial systems, some still around
Overhype led to the last AI winter
What happened to all this?
AI winter of the 90s
But influences remain
The rise of KR on the web
Basis for structured data on the web --- RDF, Linked Open Data, Schema.org, …
Used the flexibility of KR representations, without the lofty goals
Restricted use to ground atomic facts using binary relations
--- knowledge graphs
The big impetus from search
Knowledge Graphs are now ubiquitous
Search, Personal Assistants and other consumer apps
We reached the limits of what can be done with text
More form factors and more interaction modalities → Structured data is becoming more important …
Google (KG), Microsoft (Satori), Facebook (OGP), Amazon (Alexa), Apple …
Each has their own ‘knowledge graph’
Knowledge Graphs in search
In Personal assistants
Examples: Google Now, Microsoft Cortana
Data Commons: Google for data
Google for data, for use by programs: enable developers to pretend all this data is in one Knowledge Graph (a toy sketch follows)
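A toy sketch of that abstraction: facts that in reality live in many datasets are queried through a single interface. The graph contents, ids and get_data below are hypothetical stand-ins; the real Data Commons Python client exposes calls in this spirit.

```python
# Pretend the world's data is one Knowledge Graph, queried by (node, property).
# All ids and values here are invented for illustration.
GRAPH = {
    ("geoId/06085", "name"): ["Santa Clara County"],
    ("geoId/06085", "population"): [1938000],  # illustrative value
    ("geoId/06085", "containedIn"): ["geoId/06"],
}

def get_data(node, prop):
    """Look up a (node, property) pair in the unified graph."""
    return GRAPH.get((node, prop), [])

print(get_data("geoId/06085", "population"))
```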
Data Commons
Many similarities with Web & Google
Anyone can publish: Some public, some for fee, some private
Analog of search engine:
Services that crawl, aggregate, index and provide APIs
The best ‘search engine’ will win. Fastest, latest, biggest index, best quality, …
Anyone can build applications on top of these APIs
Provenance: Access, not truth
GetData(<X-Desc>, property, [W1, W2, W3, …])
The service fans this out as GetData(<X-Desc>, property) to each of the sources W1, W2, W3 holding data about X.
Like the web, there will be wrong data, spam, etc.
Caller has to decide whom to trust. Over time, the system can help:
GetData(<X-Desc>, property, [‘good sources’])
(A sketch of this follows below.)
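A minimal sketch of provenance-aware access, with invented sources and values: every fact carries its source, and the caller chooses which sources to trust. The system returns what the chosen sources say; deciding truth is left to the caller.

```python
# Provenance as access, not truth: facts keep their source, callers filter.
# Sources W1..W3 and all values are invented for illustration.
FACTS = [
    ("X", "population", 1200, "W1"),
    ("X", "population", 1210, "W2"),
    ("X", "population", 9999, "W3"),  # e.g. spam or an error
]

def get_data(node, prop, sources=None):
    return [(v, src) for (n, p, v, src) in FACTS
            if n == node and p == prop and (sources is None or src in sources)]

print(get_data("X", "population"))                # everything, W3 included
print(get_data("X", "population", ["W1", "W2"]))  # only 'good sources'
```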
GitHub for data
Projects provide data
Data can be public or private (à la internet & intranet)
Projects can import data from other projects
Unlike GitHub for code:
- bring code to data
- data, unlike code, can be joined
DataCommons.org
First version, with data from …
Demo
How to use DataCommons
Now it’s your turn
This week’s lab...
Train a linear model predicting the prevalence of obesity using statistics from three different datasets (a sketch of the workflow follows below).
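A hedged sketch of that workflow; the table names, columns, features and values below are all invented, and the real lab supplies its own datasets:

```python
# Join statistics from three hypothetical sources on a shared place id,
# then fit a linear model. Data is invented for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression

health = pd.DataFrame({"fips": [1, 2, 3], "obesity_rate": [0.31, 0.27, 0.35]})
income = pd.DataFrame({"fips": [1, 2, 3], "median_income": [48000, 61000, 42000]})
food = pd.DataFrame({"fips": [1, 2, 3], "fast_food_per_capita": [2.1, 1.4, 2.8]})

df = health.merge(income, on="fips").merge(food, on="fips")
X, y = df[["median_income", "fast_food_per_capita"]], df["obesity_rate"]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```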
Lecture Check-in
Please fill out the Google Form to check into lecture!
yellkey.com/three
DataCommons Internals
Technical Challenges
Making DataCommons ‘web scale’
DataCommons Mixer
[Architecture diagram: clients (Browser, iPython, other apps) call the DataCommons Mixer, which is backed by relational tables (Statistical Populations, Observations, Triples) over entities such as Places and Organizations]
Data Commons Internals
Storage: as a set of relational tables
Graph query/update → SQL (translation demo; a toy translation is sketched below)
Higher level APIs
Browser
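A toy illustration of the translation idea: a single triples table in SQLite, and a two-constraint graph query compiled into a SQL self-join. The schema and data are invented; the real system translates its graph queries in this spirit.

```python
# Store a graph as one relational triples table; answer a graph query
# with a SQL self-join. All data here is invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("BerkeleyCA", "containedIn", "CaliforniaUSA"),
    ("BerkeleyCA", "typeOf", "City"),
    ("PaloAltoCA", "containedIn", "CaliforniaUSA"),
    ("PaloAltoCA", "typeOf", "City"),
])

# Graph query: ?city typeOf City AND ?city containedIn CaliforniaUSA
rows = db.execute("""
    SELECT t1.subj FROM triples t1 JOIN triples t2 ON t1.subj = t2.subj
    WHERE t1.pred = 'typeOf' AND t1.obj = 'City'
      AND t2.pred = 'containedIn' AND t2.obj = 'CaliforniaUSA'
""").fetchall()
print(rows)  # [('BerkeleyCA',), ('PaloAltoCA',)]
```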
Technical Challenges
2 main challenges
Technical Challenge: The problem of names
As a communication problem
Sender: “John son Jane”
Receiver: OK, got “John son Jane”, but: which John? What is son? Which Jane?
Receiver, guessing: “I think you mean johns@ son janeFoo@”
Sender, with descriptions: “John, <desc of John> son Jane, <desc of Jane>”
An old problem
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem.
--- Shannon, in ‘Mathematical Theory of Communication’
More recent variants
Most common solution is to have global ids for everything
All cases of ‘should have had the same key/id, but because of some noise/error, there is a mismatch that we have to fix.’
Coordinating Names
~1000s of terms like Actor, birthdate
10s for most sites
~1b-100b terms like Tori Amos and Newton, NC
Cannot expect 1000s of sites to coordinate on these
The problem is not generating URIs; the problem is coordination costs
Need to reduce shared vocabulary to minimum!
Agreements O(#cols) not O(#rows)
[Example knowledge graph: Tori Amos has type Musician, birthdate 8/22/1963, birthplace Newton, NC, and citizenOf USA (rendered as triples below)]
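The same subgraph written as triples makes the coordination point concrete: entity ids can stay local to a publisher, and only the handful of property names needs cross-site agreement. The ids below are invented.

```python
# Agreements O(#cols), not O(#rows): publishers coordinate on properties,
# not on ids for every entity.
triples = [
    ("ToriAmos", "type", "Musician"),
    ("ToriAmos", "birthdate", "1963-08-22"),
    ("ToriAmos", "birthplace", "NewtonNC"),
    ("ToriAmos", "citizenOf", "USA"),
]
shared_vocabulary = {pred for (_, pred, _) in triples}
print(shared_vocabulary)  # only these terms need cross-site agreement
```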
Alternate point of view
Look at it as a communication problem.
We need to communicate a reference to something that we don’t share a name for.
How do we do it?
Design data schemas/formats that are optimized for ease of integration (as opposed to consistency, compactness, etc.)
Reference in human communication
How does this work in human communication?
We disambiguate using descriptions
Ambiguous names: John McCarthy, Stanford CS Prof, his son Tim,
Entities with no names: McCarthy’s first car, awm@’s left shoe, …
Complex descriptions: X who is married to Y whose mother went to school with principal of X’s high school who is ....
Solution: Reference by description
Humans: Reference by Description (RBD)
‘Jane Smith’, teacher at ‘Gunn HS’, located in city ‘Palo Alto’
Programmatic version of reference by description
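A minimal sketch, with an invented graph and resolver, of the ‘Jane Smith, teacher at Gunn HS, located in city Palo Alto’ example: the sender ships a description built from shared vocabulary, and the receiver resolves it against its own graph instead of needing a shared id.

```python
# Reference by description: resolve a description against a local graph.
# The graph, keys and resolver are invented for illustration.
GRAPH = [
    {"type": "Person", "name": "Jane Smith", "teachesAt": "school7"},
    {"type": "School", "name": "Gunn HS", "city": "Palo Alto", "id": "school7"},
    {"type": "Person", "name": "Jane Smith", "teachesAt": "school9"},  # another Jane
]

def resolve(description, graph):
    """Return every node matching all properties in the description."""
    return [n for n in graph
            if all(n.get(k) == v for k, v in description.items())]

# "Jane Smith, teacher at Gunn HS, located in city Palo Alto"
school = resolve({"type": "School", "name": "Gunn HS", "city": "Palo Alto"}, GRAPH)[0]
print(resolve({"type": "Person", "name": "Jane Smith",
               "teachesAt": school["id"]}, GRAPH))
```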
Reference by Description
Use some shared vocabulary and shared knowledge of the underlying domain to communicate references
Where does the shared vocabulary come from?
Example: a Person named Perez who teaches a course named DS100 in a University ...
Schema.org
Schema for embedding structured data in web pages, email, etc.
Collaborative effort started by Google, Microsoft, et al. in 2010
Data used by Google, Bing, Cortana, Siri, …
Today about 2000 core terms, in use by over 20m sites.
Gives us our core bootstrapping vocabulary (example markup sketched below)
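As an illustration of the embedding, here is a hypothetical page's JSON-LD markup built with Python. Person, name, jobTitle, worksFor and HighSchool are real schema.org terms; the values are invented.

```python
# A page describes an entity in JSON-LD using schema.org's shared vocabulary.
import json

markup = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Smith",
    "jobTitle": "Teacher",
    "worksFor": {"@type": "HighSchool", "name": "Gunn HS"},
}
print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```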
Schema.org applications: search
Reservations ➔ Personal Assistant
OpenTable → confirmation email → Now/Cortana reminder
Schema.org … the numbers
In use by ~20 million sites: 20% growth over last 18 months
Roughly 40% of pages in search index have markup
~50% of US/EU ecommerce emails
Vocab: Core (~ 2k terms) + extensions (real estate, finance, etc.)
Supported by most major web publishing platforms
Schema.org: Major sites
News: nytimes, guardian, bbc, …
Movies: imdb, rottentomatoes, movies.com
Products: ebay, alibaba, sears, cafepress, sulit, fotolia
Local: yelp, allmenus, urbanspoon
Events: wherevent, meetup, zillow, eventful
….
Schema.org and DataCommons
Including all this data, expressed in Schema.org, into DataCommons:
- Scientific data sets: NYBG, NOAA, EPA, NASA
- Census, public statistics
- Structured data from X million sites using Schema.org: Wikipedia, Events, Jobs, Venues, Offers, Publishers, …
...
Data Commons futures
Current focus on students
Data as a service for ML/DS courses, etc.
Next set of applications
Concluding
Data is critical to Data Science.
There are many richer data models and representations
Data Commons is an effort to bring together a very large collection of data into a single coherent representation
Many technical challenges, but we have made some progress
Much more work needs to be done ...
A long long time ago ...
The context: 1990-1994
The idea of ‘online’ was in the air
Contenders from universities
From private sector
The Arrival of Mosaic
TimBL’s Web was more flexible
In that spirit, we invite you to take DataCommons in new directions!
We need your help!
Use it in your next project
Research project
Hobby project, ...
Add data
Climate
Elections: Voting data, polling data
So much more …
We need your help: APIs
Higher level APIs:
Tools for checking sanity of data
Help improve the tutorial, documentation, lab ...
We need your help: Visualization tools
[e.g., map visualizations of Palo Alto and Berkeley]
Natural language tools
‘Show me a plot of median age vs. population for ...’
Going beyond ...
From curve fitting to explanations … what happened in Palo Alto?
Questions?