Intro to Data Science
Juan E. Vargas
Outline
Who am I?
Where are we?
Where do we need to be?
How do we get there?
How can we ensure sustainable success?
Dr. George Heilmeier
Who are you?
What are you trying to accomplish?
What is the state of the art? What are its limitations?
What is the new approach you propose?
Who Am I?
Cloud Computing Academic Research Initiatives (NSF-CLuE, NSF-CiC, EU-VenusC, JPN, AUS)
Goals
Accelerate scientific exploration and discovery
Build partnerships with government-sponsored research agencies and university consortia to provide cloud services to academic and research communities worldwide
Broaden the research capabilities of scientists, foster collaborative research communities to accelerate scientific discovery at a global scale
Help researchers interact with massively scalable data analysis tools and services directly accessible from their desktop, laptop, or mobile devices in the same way they interact with Web search and online resources
80+ Projects
National Science Foundation
Florida
Georgia
Mass.
Virginia
North Carolina
South Carolina
Indiana
Delaware
Japan
InfoPlosion
Europe
WA DC
Seattle
Project HQ
Penn
Louisiana
Washington
New York
New Mexico
North Dakota
California
Colorado
Michigan
Texas
Taiwan- starting
China
Australia
Partners
NSF (29 projects)
EU (28 projects)
Innovation
… where are we, where are we going?
Advances in STEM+C
propelled our existence into the digital
age. Smart sensors embedded in
devices everywhere feed data 24/7 into
cloud computing clusters where
new types of data analysis, modeling and simulation
are performed at scales and security levels
never seen before.
Science and Engineering are multidisciplinary and
data-centric, unifying theory,
experimentation, modeling, and simulation.
When well orchestrated to nurture each other,
these synergistic interactions open
innovation opportunities for those who
are ready
In the 2020’s a new breed
of students who were born
during this century is coming to
college … living a continuously
connected, computer-based,
social network existence.
The new generations must learn
to operate in and out of that existence.
As educators and professionals we must
find innovative ways to transform that
existence into the learning
experiences and products
of the future
Once again we are at
an inflection point
Of the 7.2 billion people on Earth, only about
45% enjoy the benefits of the internet.
The number of “things” (smart devices) getting
connected to the internet is orders of magnitude larger.
Conservative estimates predict that
by year 2025 there will be about
50 billion “things” connected to the internet.
Those “things” will be designed and produced by
professionals with the knowledge, skills
and resources ready for such a huge and
transformative
innovation opportunity.
Science, Technology,
Engineering, Math …
and Computing
are at the center of these trends.
The promises and challenges are real,
but we must find innovative ways to
build and operate new types of
systems at scales never
seen before.
The STEM+C Professionals of the
future will define and build new technology
and software platforms, from device to cloud,
covering Financials, Transportation, Healthcare,
Manufacturing, Aviation, Energy Management,
Power & Water, Oil & Gas, etc.
There are many opportunities but also significant challenges.
I will outline some of these
challenges and opportunities
Invention, Innovation, Invisible Technologies
Innovation is the outcome of inventions leading to the creation of economic impact and social value.
Successful technologies are those that become “invisible” . . . .
Invention, Innovation, Invisible Technologies
Invention, Innovation, Simpler Technologies
Pace of Innovation
Pace of Innovation
In most cases, computing has been at the center of innovation during the last 50 years
ORG | Founded | Valuation
APPLE | 1976 | $2.76 T USD
MSFT | 1975 | $2.57 T USD
GOOGL | 1998 | $1.60 T USD
AMZN | 1994 | $1.42 T USD
META | 2004 | $797 B USD
NVIDIA | 1993 | $1.07 T USD
TESLA | 2003 | $678 B USD
Where are we?
CORE
FDS
SDS
EDS
FDS = Foundations of Data Science
SDS = Software for Data Science
EDS = Engineering of Data Science
“… applications must drive R&D to provide concrete goals and metrics to evaluate progress …”
David Patterson
Artificial
Intelligence
Data
Science
Machine
Learning
Deep
Learning
Virtuous Circles
FAI
SAI
EAI
FDS
SDS
EDS
CRA Report
The 3V’s
Where are we?
Acquisition & Recording
Extraction Cleaning Annotation
Integration Aggregation Representation
Analysis & Modeling
Interpretation & Reports
Scale
Heterogeneity
Time
Privacy
Human Collaboration
Variety
Velocity
Volume
Where are we?
Serving
Infrastructure
Configuration
Data Collection
Feature
Extraction
Data
Verification
Machine
Resource
Management
Process
Management Tools
Monitoring
ML/DS
Code
The hardest part of Data Science is getting the data ready
Analysis
Tools
Acquisition & Recording
Extraction Cleaning Annotation
Integration Aggregation Representation
Analysis & Modeling
Interpretation & Reports
Where are we?
Embracing Data Science: a recent study of 1,500 C-suite executives from organizations across 16 industries revealed *
Proof of Concept: 80%
Trying to Scale: ~15%
Industrialized for Growth: < 5%
* Accenture
Data Science as the Fourth Paradigm of Science
Jim Gray said "...everything about science is changing because of the impact of IT and the data deluge...” He called DS the "fourth paradigm" of science:
Empirical -> Theoretical -> Computational -> Data-driven
-> Cloud
-> Quantum
-> AI
DS is an interdisciplinary process; algorithms and systems are used to extract knowledge and insights from structured or unstructured data.
DS unifies Statistics, Probability, EDA, Machine Learning, and Neural Networks to "understand and analyze actual phenomena" with data.
DS Employs methods from several fields (math, statistics, probability, IS, CS).
Where are we?
Sensors
Experiments
Hadron Collider
15 PB/year
Modeling and Simulations
Molecular Dynamics
Anton
Smart sensors feed data 24/7 into clusters where new types of data analysis need to be performed at levels of variety, velocity, volume, security never seen before.
Algorithms
Given a set of data (D) and some measurement of certainty (C), find statements (S) or patterns (P) that describe relationships among subsets of D with certainty C.
Interesting patterns with sufficient certainty become new pieces of knowledge that are added to a knowledge base.
Machine Learning
Where are we?
John Tukey
Rehashing EDA as Data Science
Where are we?
Serving
Infrastructure
Configuration
Data Collection
Feature
Extraction
Data
Verification
Machine
Resource
Management
Process
Management Tools
Monitoring
ML
Code
The hardest part of Data Science is getting the data ready
"Hidden Technical Debt in Machine Learning Systems", Google, NIPS 2015
Analysis
Tools
Where are we?
Database Management Systems (DBMS) (~1990’s to ~2000)
Where are we?
Data Warehouse: First Gen Data Analytics Platform (~2000 - 2009)
Multiple
Relational
Tables
Where were we?
Where are we?
Warehouse + Data Lake: Second Gen Data Analytics Platform (~2010 - 2019)
Where are we?
Embracing Data Science: a recent study of 1,500 C-suite executives from organizations across 16 industries revealed *
Proof of Concept: 80%
Trying to Scale: ~15%
Industrialized for Growth: < 5%
* Accenture
Where do we need to be?
How do we get there?
Critical success factors of those who progressed beyond Proof of Concept *
How do we get there?
How do we get there?
Lake House: Third Gen Data Analytics Platform (~2020)
Data Science Foundations
As Richard Hamming said:
“ ...do not learn as if you walk through a picture gallery without learning how to mix paints, how to compose pictures... “
“...we must learn and teach how knowledge was generated, not just how to retrieve it… …so that we may generate the results that we need, even if no one has ever done it before us...”
Many courses and tutorials in DS focus on the use of tools, not on a deep understanding of the methods. Academia only covers the math foundations.
Yet, in practice, DS is a set of skills based on a deep understanding of methods from calculus, probability and statistics.
Statistics
The goal of statistics is to make inferences from data. Statistical inference proceeds as follows:
We postulate a set of hypotheses, carry out experiments to collect data, describe the results, and then make inferences from the results to draw conclusions concerning our hypotheses.
Note resemblance to the scientific method.
Probability
P(X = 6) = C(10,6) × 0.5^6 × 0.5^4
= 210 × 0.000977 ≈ 0.205
Counting events, trials, permutations, combinations; set theory, contingency tables, probability distribution functions, conditional probability, joint probabilities, Bayes Theorem, maximum likelihood estimation (MLE), Bayesian networks, junction trees, other graphical models, ...
Probability of observing exactly
6 heads in 10 tosses?
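The calculation above can be checked in a few lines of Python. This is a sketch using only the standard library; the helper name binomial_pmf is ours:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 6 heads in 10 fair-coin tosses:
prob = binomial_pmf(6, 10, 0.5)   # C(10,6) * 0.5^6 * 0.5^4
print(round(prob, 3))             # 0.205
```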
Statistics and Probability
In Probability we go from a model to what we expect to see in the data.
In Statistics we use the observations to obtain estimates of the model parameters.
learn how to mix paints, how to compose pictures
Data is typically organized as a set of P parameters and a set of N records. Hence, data is represented as a matrix having N rows and P columns, or X[N,P]. The values in the cells may be numerical (continuous or discrete) or categorical (binary or multiple categories).
Problem Setup
Yes, Linear Algebra is a must*
* When dealing with continuous data
Yes, Linear Algebra is a must.
The set of observed data typically is configured as
{ ( x1,y1),
( x2,y2),
. . .
( xn,yn) }
where each row of X[N,P] is a row vector xi of length p, and Y is a column vector of length n.
Problem Setup
Given a set of p predictors X = (x1, x2, …, xp ) we would like to predict the value of a response Y.
We assume the existence of a function f such that
Y = f(X) + e
where e is a random error term. In principle, f captures the systematic information that X provides about Y.
Problem Setup
The figure shows income vs years of education.
The figure suggests that a function fitting the data might exist.
Problem Setup
We need to find the best estimate of f.
Y = f(x) + e
Given that Y depends on both f and e, the accuracy of our prediction depends on two quantities: the reducible error and the irreducible error.
The reducible error is the inaccuracy in our estimate of f, which we can try to reduce with better models.
The irreducible error comes from e, not from f; since e cannot be predicted from X, it cannot be reduced.
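A small simulation makes this concrete. Everything here is invented for illustration (the true f, the noise level, the sample sizes): even a near-perfect estimate of f cannot push the test error below Var(e).

```python
import random

random.seed(42)                      # deterministic run

def f(x):                            # the true (normally unknown) function
    return 2.0 + 3.0 * x

def sample(n, noise_sd=1.0):         # y = f(x) + e, with Var(e) = 1.0
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):                # ordinary least squares, y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

b0, b1 = fit_line(*sample(200))      # estimate f from 200 noisy points
xt, yt = sample(100_000)             # large fresh test set
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xt, yt)) / len(xt)
# b1 lands close to 3.0, and mse hovers near Var(e) = 1.0: the reducible
# error is almost gone, but the irreducible floor remains.
```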
Problem Setup
Let f′ be an estimate of f, and Y′ = f′(X) the resulting prediction. The expected squared error then decomposes as

E[ (Y − Y′)² ] = [ f(X) − f′(X) ]² + Var(e)

where the first term is the reducible error and Var(e) is the irreducible error.
The focus of statistical learning is to estimate the best possible f, i.e., the function that minimizes the reducible error.
To be successful, we must determine:
Which predictors are associated to the outcome?
Which predictors are NOT associated to the outcome?
Are some predictors more important than others? If so, how can we measure their importance?
What type of models can we use to answer these questions:
Parametric models?
Linear (first degree),
higher degrees?
Non-parametric models?
Prediction accuracy vs Model interpretability
Methods:
Classification vs regression
Supervised Learning ?
Unsupervised Learning?
Tasks
Is this e-Mail to be trusted?
Who is more likely to respond to these mails?
Will this group of patients' health improve or deteriorate?
Is this transaction OK, suspicious, or fraudulent?
Is network traffic normal, low, or too high?
Tasks
Is this record an outlier?
What are the features in my records that will help me partition the data?
Tasks
What are the features in my records that will help me partition the data?
How can I partition/classify/segment these data records?
Tasks
Which events occur together?
Which services are used together?
Which events never occur together?
Which set of remedies should I recommend given the current alert level (red, green, etc.)?
Tasks
What will the risk of a cyber attack be in October?
What is the risk of running out of beer in Oktoberfest?
What types of risk events might occur during Christmas or Cinco de Mayo?
Which type of flu will be the most dangerous next season? Which one will be the most widespread?
Tasks
Analyze unstructured data:
How can we use web feedback?
How can we classify e-Mail?
How can we identify spam?
How can I improve help desk (call center) calls?
Tasks
Use visualization tools to further explore relations and patterns not seen at first
Use IR to reveal relations and patterns not visible with more traditional methods
Algorithms
Tasks Algorithms
Decision Trees
Neural Networks
Naïve Bayes
Logistic Regression
Linear Discriminant Analysis
Tasks Algorithms
Decision Trees
Neural Networks
Logistic Regression
Linear Regression
Tasks Algorithms
Clustering
Sequence Clustering
Tasks Algorithms
Association Rules
Decision Trees
Tasks Algorithms
Time Series
Kalman Filters
Hidden Markov Models
Tasks Algorithms
Information Retrieval Algos
Latent Dirichlet Allocation
Tasks Algorithms
Now we are talking...
… E D A !
“R” Rstudio
Python/numpy
Java/Kafka
Go/Gonum
Julia
Data Science (Software) Tools
Spark MLlib (Python, Java, “R”)
Mahout
TensorFlow
Keras
PyTorch
Nia
Catapult
Maana
A relatively simple algorithm for supervised learning.
Useful because it illustrates the basic working premises of ML.
A good understanding of LR provides a solid foundation for the general concepts, processes, and steps involved in more complex ML algorithms.
Example: Linear Regression
The steps include how to set up the data, how to apply an algorithm to the data in order to determine a good fit for F(X,Y), and how to use F to perform predictions with previously unseen data sets.
Example: Linear Regression
We will use 200 data records collected to assess the effect
of advertising media (TV, Radio, Newspaper) on Sales.
Three input variables and one output variable, or { X1, X2, X3, Y }.
Note the different slopes of the blue lines
In the case of a single variable, the regression model F is a function of one independent variable X
Therefore p = 1 and
We could estimate
Sales | TV
Sales | Radio
Sales | Newspaper
Using residual sum of squares, or RSS
Solving for the betas gives us the intercept and the slope of the line,
which we can use to make predictions on data never seen before.
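The closed-form fit can be sketched in a few lines of dependency-free Python. The numbers below are made-up stand-ins for one advertising column, not the real 200-record dataset:

```python
def simple_ols(x, y):
    """Minimize RSS(b0, b1) = sum((y_i - b0 - b1*x_i)^2) in closed form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx                 # intercept from the means
    return b0, b1

# Hypothetical TV budgets vs. sales (illustration only):
tv    = [10, 20, 30, 40, 50]
sales = [8, 11, 13, 16, 17]
b0, b1 = simple_ols(tv, sales)        # b0 = 6.1, b1 = 0.23
pred = b0 + b1 * 60                   # prediction for an unseen budget: 19.9
```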
Now that we know how to obtain the model F for a single variable X,
how could we examine any combination of X parameters, including
X1 & X2
X2 & X3
X1 & X3
X1 & X2 & X3
In the multivariate linear regression case, F is a function of a set of p explanatory variables
X = [ x1 x2 , . . . , xp ]
Multi-Variable Linear Regression
Now we have an expression in which
X is a matrix of n rows and p columns,
Beta is a column vector of size p, and
Y is still a column vector of size n.
Using gradient descent we can iteratively find the parameters of the model (the betas).
The closed-form solutions, in matrix form and in summation form, are:
By expressing the problem in terms of linear algebra, we can compute the B’s using the linear algebra libraries in Python’s numpy, Go’s Gonum, R, Julia, etc.
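As a sketch of the iterative route, here is batch gradient descent on the mean squared error; the learning rate, iteration count, and toy data are all our own choices:

```python
def gd_linreg(X, y, lr=0.1, iters=2000):
    """Batch gradient descent for linear regression.
    Each row of X starts with a 1 for the intercept term."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p                       # gradient of (1/n)*RSS
        for xi, yi in zip(X, y):
            err = sum(b * v for b, v in zip(beta, xi)) - yi
            for j in range(p):
                grad[j] += 2.0 * err * xi[j] / n
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data generated by y = 1 + 2*x (so the betas should converge there):
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
beta = gd_linreg(X, y)                          # -> approximately [1.0, 2.0]
```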
Pseudo Inverse Matrix
The solution in essence involves finding the “pseudo-inverse” or Moore-Penrose matrix X+ = (X^T X)^(-1) X^T, which is a (p+1, n) matrix:
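In numpy this is numpy.linalg.pinv (computed via SVD); as a self-contained sketch we can instead solve the equivalent normal equations (X^T X) B = X^T y with a small Gaussian-elimination helper. All names and data below are ours:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """beta = (X^T X)^(-1) X^T y, i.e., the pseudo-inverse applied to y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Columns: intercept, x1, x2; y generated exactly by y = 1 + 2*x1 + 3*x2
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 3]]
y = [1, 3, 4, 6, 14]
beta = ols(X, y)          # -> approximately [1.0, 2.0, 3.0]
```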
Should I use “R” . . . or Python/numpy, . . . or something else ?
Why Go?
"Go combines the ease of programming of an interpreted, dynamically typed language with the efficiency and safety of a statically typed, compiled language. Go is modern, with support for networked and multicore computing."
https://golang.org/doc/faq
Why Go?
language features
Why Go?
go tools
Scientific Computing Around 2008
Scientific Computing Around 2021
Julia
Go
DEMO
Code in GoLang
DEMO
DEMO
DEMO
DEMO
DEMO
There are several Web Servers and Internet products developed in “Pure Go” : Kubernetes, OpenShift, Hugo, Caddy, ...
In fact, Go's standard library (net/http) makes it easy to write your own web server and/or your own microservice, so…
Engineering of Data Science:
Go as a System Architecture Tool
Web Client
Web Server
makinacognica
Engineering of Data Science:
Go as a System Architecture Tool
GoWaMain
GoWa
DEMO
GoWa (Web Server)
GoWaMain (Web Client)
makinacognica
DEMO
GoWa (Web Server)
GoWaMain (Web Client)
makinacognica
DEMO
Scientific Computing Around 2021
Julia
Go
ISL Book Ch03 LinReg (in Julia)
https://github.com/JuanVargas/ISL/blob/main/chap_03/isl_ch03_linReg_jl.ipynb
https://bit.ly/3dZgqZu
MovieLens in Go and Julia
Spark was the first unified analytics engine facilitating large-scale data processing, SQL analytics, and ML, with in-memory workloads up to 100x faster than Hadoop MapReduce.
The Berkeley View
“… An unfortunate academic tradition is that we build research prototypes, then wonder why people don’t use them. Applications must drive research and provide concrete goals and metrics to evaluate progress …”
Field of Dreams: “If you build it, they will come”
Problem Solving and Active Learning
Increases engagement by encouraging a collaborative learning experience in which instructors and students work together to learn about real problems, propose solutions, and take action
Who will continue the journey?
References
R. Hamming: “The Art of Probability”
R. Hamming: “Methods of Mathematics Applied to Calculus, Probability and Statistics”
John Tukey: “Exploratory Data Analysis”
James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISLR Book)
Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)
References as .PDF
James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISLR Book)
Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)
Thank you !
Concurrency is a first-class citizen in Go
GoRoutines : https://tour.golang.org/concurrency/1
Channels : https://tour.golang.org/concurrency/2
Buffered Channels : https://tour.golang.org/concurrency/3
Select lets a goroutine wait on multiple communication ops
Mutex: https://tour.golang.org/concurrency/9
Go as a System Architecture Tool
Go Cloud
Modern Systems are Massive, … and Complex
MSFT Catapult
TensorFlow
Who will continue the journey?