Intro to Data Science
Juan E. Vargas
Outline
Who am I?
Where are we?
Where do we need to be?
How do we get there?
How can we ensure sustainable success?
Dr. George Heilmeier
Who are you?
What are you trying to accomplish?
What is the state of the art? What are its limitations?
What is the new approach you propose?
Who Am I?
Cloud Computing Academic Research Initiatives (NSF-CLuE, NSF-CiC, EU-VenusC, JPN, AUS)
Goals
Accelerate scientific exploration and discovery
Build partnerships with government-sponsored research agencies and university consortia to provide cloud services to academic and research communities worldwide
Broaden the research capabilities of scientists, foster collaborative research communities to accelerate scientific discovery at a global scale
Help researchers interact with massively scalable data analysis tools and services directly accessible from their desktop, laptop, or mobile devices in the same way they interact with Web search and online resources
80+ Projects
National Science Foundation
Florida
Georgia
Mass.
Virginia
North Carolina
South Carolina
Indiana
Delaware
Japan
InfoPlosion
Europe
WA DC
Seattle
Project HQ
Penn
Louisiana
Washington
New York
New Mexico
North Dakota
California
Colorado
Michigan
Texas
Taiwan- starting
China
Australia
Partners
NSF (29 projects)
EU (28 projects)
Innovation
… where are we, where are we going?
Advances in STEM+C
propelled our existence into the digital
age. Smart sensors embedded in
devices everywhere feed data 24/7 into
cloud computing clusters where
new types of data analysis, modeling and simulation
are performed at scales and security levels
never seen before.
Science and Engineering are multidisciplinary and
data-centric, unifying theory,
experimentation, modeling, and simulation.
When well orchestrated to nurture each other,
these synergistic interactions open
innovation opportunities for those who
are ready
In the 2020’s a new breed
of students who were born
during this century is coming to
college … living a continuously
connected, computer-based,
social network existence.
The new generations must learn
to operate in and out of that existence.
As educators and professionals we must
find innovative ways to transform that
existence into the learning
experiences and products
of the future
Once again we are at
an inflection point
Of the 7.2 billion people on Earth, only about
45% enjoy the benefits of the internet.
The number of “things” (smart devices) getting
connected to the internet is orders of magnitude larger.
Conservative estimates predict that
by year 2025 there will be about
50 billion “things” connected to the internet.
Those “things” will be designed and produced by
professionals with the knowledge, skills
and resources ready for such a huge and
transformative
innovation opportunity.
Science, Technology,
Engineering, Math …
and Computing
are at the center of these trends.
The promises and challenges are real,
but we must find innovative ways to
build and operate new types of
systems at scales never
seen before.
The STEM+C Professionals of the
future will define and build new technology
and software platforms, from device to cloud,
covering Financials, Transportation, Healthcare,
Manufacturing, Aviation, Energy Management,
Power & Water, Oil & Gas, etc.
There are many opportunities but also significant challenges.
I will outline some of these
challenges and opportunities
Invention, Innovation, Invisible Technologies
Innovation is the outcome of inventions leading to the creation of economic impact and social value.
Successful technologies are those that become “invisible” . . . .
Invention, Innovation, Invisible Technologies
Invention, Innovation, Simpler Technologies
Pace of Innovation
Pace of Innovation
In most cases, computing has been at the center of innovation during the last 50 years
ORG | Founded | Valuation
APPLE | 1976 | $2.76 T USD
MSFT | 1975 | $2.57 T USD
GOOGL | 1998 | $1.60 T USD
AMZN | 1994 | $1.42 T USD
META | 2004 | $797 B USD
NVIDIA | 1993 | $1.07 T USD
TESLA | 2003 | $678 B USD
Where are we?
CORE
FDS
SDS
EDS
FDS = Foundations of Data Science
SDS = Software for Data Science
EDS = Engineering of Data Science
“… applications must drive R&D to provide concrete goals and metrics to evaluate progress …”
David Patterson
Artificial
Intelligence
Data
Science
Machine
Learning
Deep
Learning
Virtuous Circles
FAI
SAI
EAI
FDS
SDS
EDS
CRA Report
The 3V’s
Where are we?
Acquisition & Recording
Extraction Cleaning Annotation
Integration Aggregation Representation
Analysis & Modeling
Interpretation & Reports
Scale
Heterogeneity
Time
Privacy
Human Collaboration
Variety
Velocity
Volume
Where are we?
Serving
Infrastructure
Configuration
Data Collection
Feature
Extraction
Data
Verification
Machine
Resource
Management
Process
Management Tools
Monitoring
ML/DS
Code
The hardest part of Data Science is getting the data ready
Analysis
Tools
Acquisition & Recording
Extraction Cleaning Annotation
Integration Aggregation Representation
Analysis & Modeling
Interpretation & Reports
Where are we?
Embracing Data Science: a recent study of 1,500 C-suite executives from organizations across 16 industries revealed *
Proof of Concept: 80%
Trying to Scale: ~15%
Industrialized for Growth: < 5%
* Accenture
Data Science as the Fourth Paradigm of Science
Jim Gray said "...everything about science is changing because of the impact of IT and the data deluge...” He called DS the "fourth paradigm" of science:
Empirical -> Theoretical -> Computational -> Data-driven
-> Cloud
-> Quantum
-> AI
DS is an interdisciplinary process; algorithms and systems are used to extract knowledge and insights from structured or unstructured data.
DS unifies Statistics, Probability, EDA, Machine Learning, and Neural Networks to "understand and analyze actual phenomena" with data.
DS Employs methods from several fields (math, statistics, probability, IS, CS).
Where are we?
Sensors
Experiments
Hadron Collider
15 PB/year
Modeling and Simulations
Molecular Dynamics
Anton
Smart sensors feed data 24/7 into clusters where new types of data analysis need to be performed at levels of variety, velocity, volume, security never seen before.
Algorithms
Given a set of data (D) and some measurement of certainty (C), find statements (S) or patterns (P) that describe relationships among subsets of D with certainty C.
Interesting patterns with sufficient certainty become new pieces of knowledge that are added to a knowledge base.
Machine Learning
Where are we?
John Tukey
Rehashing EDA as Data Science
Where are we?
Serving
Infrastructure
Configuration
Data Collection
Feature
Extraction
Data
Verification
Machine
Resource
Management
Process
Management Tools
Monitoring
ML
Code
The hardest part of Data Science is getting the data ready
"Hidden Technical Debt in Machine Learning Systems", Google, NIPS 2015
Analysis
Tools
Where are we?
Database Management Systems (DBMS) (~1990’s to ~2000)
Where are we?
Data Warehouse: First Gen Data Analytics Platform (~2000 - 2009)
Multiple
Relational
Tables
Where were we?
Where are we?
Warehouse + Data Lake: Second Gen Data Analytics Platform (~2010 - 2019)
Where are we?
Embracing Data Science: a recent study of 1,500 C-suite executives from organizations across 16 industries revealed *
Proof of Concept: 80%
Trying to Scale: ~15%
Industrialized for Growth: < 5%
* Accenture
Where do we need to be?
How do we get there?
Critical success factors of those who progressed beyond Proof of Concept *
How do we get there?
How do we get there?
Lake House: Third Gen Data Analytics Platform (~2020)
Data Science Foundations
As Richard Hamming said:
“ ...do not learn as if you walk through a picture gallery without learning how to mix paints, how to compose pictures... “
“...we must learn and teach how knowledge was generated, not just how to retrieve it… …so that we may generate the results that we need, even if no one has ever done it before us...”
Many courses and tutorials in DS focus on the use of tools, not on a deep understanding of the methods. Academia only covers the math foundations.
Yet, in practice, DS is a set of skills based on a deep understanding of methods from calculus, probability and statistics.
Statistics
The goal of statistics is to make inferences from data. Statistical inference proceeds as follows:
We postulate a set of hypotheses, carry out experiments to collect data, describe the results, and then make inferences from the results to draw conclusions concerning our hypotheses.
Note resemblance to the scientific method.
Probability
P(X = 6) = C(10,6) × 0.5^6 × 0.5^4
= 210 × 0.000977 ≈ 0.205
Counting events, trials, permutations, combinations; set theory, contingency tables, probability distribution functions, conditional probability, joint probabilities, Bayes Theorem, maximum likelihood estimation (MLE), Bayesian networks, junction trees, other graphical models, ...
Probability of observing exactly
6 heads in 10 tosses?
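The calculation above can be checked in a few lines of Python. This is a sketch using only the standard library; the helper name binomial_pmf is ours:

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 6 heads in 10 fair-coin tosses:
prob = binomial_pmf(6, 10, 0.5)   # C(10,6) * 0.5^6 * 0.5^4
print(round(prob, 3))             # 0.205
```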
Statistics and Probability
In Probability we go from a model to what we expect to see in the data.
In Statistics we use the observations to obtain estimates of the model parameters.
learn how to mix paints, how to compose pictures
Data is typically organized as a set of P parameters and a set of N records. Hence, data is represented as a matrix having N rows and P columns, or X[N,P]. The values in the cells may be numerical (continuous or discrete) or categorical (binary or multiple categories).
Problem Setup
Yes, Linear Algebra is a must*
* When dealing with continuous data
Yes, Linear Algebra is a must.
The set of observed data typically is configured as
{ ( x1,y1),
( x2,y2),
. . .
( xn,yn) }
where each row of X[N,P] is a row vector xi of length p, and Y is a column vector of length n.
Problem Setup
Given a set of p predictors X = (x1, x2, …, xp ) we would like to predict the value of a response Y.
We assume the existence of a function f such that
Y = f(X) + e
where e is a random error term. In principle, f captures the systematic information that X provides about Y.
Problem Setup
The figure shows income vs years of education.
The figure suggests that a function fitting the data might exist.
Problem Setup
We need to find the best estimate of f.
Y = f(x) + e
Given that Y depends on both f and e, the accuracy of our prediction depends on two quantities: the reducible error and the irreducible error.
The reducible error is the inaccuracy in our estimate of f, which we can try to reduce with better models.
The irreducible error comes from e, not from f; since e cannot be predicted from X, it cannot be reduced.
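A small simulation makes this concrete. Everything here is invented for illustration (the true f, the noise level, the sample sizes): even a near-perfect estimate of f cannot push the test error below Var(e).

```python
import random

random.seed(42)                      # deterministic run

def f(x):                            # the true (normally unknown) function
    return 2.0 + 3.0 * x

def sample(n, noise_sd=1.0):         # y = f(x) + e, with Var(e) = 1.0
    xs = [random.uniform(0.0, 10.0) for _ in range(n)]
    ys = [f(x) + random.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):                # ordinary least squares, y = b0 + b1*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

b0, b1 = fit_line(*sample(200))      # estimate f from 200 noisy points
xt, yt = sample(100_000)             # large fresh test set
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xt, yt)) / len(xt)
# b1 lands close to 3.0, and mse hovers near Var(e) = 1.0: the reducible
# error is almost gone, but the irreducible floor remains.
```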
Problem Setup
Let f′ be an estimate of f, and Y′ = f′(X) the resulting prediction. The expected squared error then decomposes as

E[ (Y − Y′)² ] = [ f(X) − f′(X) ]² + Var(e)

where the first term is the reducible error and Var(e) is the irreducible error.
The focus of statistical learning is to estimate the best possible f, i.e., the function that minimizes the reducible error.
To be successful, we must determine:
Which predictors are associated to the outcome?
Which predictors are NOT associated to the outcome?
Are some predictors more important than others? If so, how can we measure their importance?
What type of models can we use to answer these questions:
Parametric models?
Linear (first degree),
higher degrees?
Non-parametric models?
Prediction accuracy vs Model interpretability
Methods:
Classification vs regression
Supervised Learning ?
Unsupervised Learning?
Tasks
Is this e-Mail to be trusted?
Who is more likely to respond to these mails?
Will this group of patients' health improve or deteriorate?
Is this transaction OK, suspicious, or fraudulent?
Is network traffic normal, low, or too high?
Tasks
Is this record an outlier?
What are the features in my records that will help me partition the data?
Tasks
What are the features in my records that will help me partition the data?
How can I partition/classify/segment these data records?
Tasks
Which events occur together?
Which services are used together?
Which events never occur together?
Which set of remedies should I recommend given the current alert level (red, green, etc.)?
Tasks
What will the risk of a cyber attack be in October?
What is the risk of running out of beer in Oktoberfest?
What types of risk events might occur during Christmas or Cinco de Mayo?
Which type of flu will be the most dangerous next season? Which one will be the most widespread?
Tasks
Analyze unstructured data:
How can we use web feedback?
How can we classify e-Mail?
How can we identify spam?
How can I improve help desk (call center) calls?
Tasks
Use visualization tools to further explore relations and patterns not seen at first
Use IR to reveal relations and patterns not visible with more traditional methods
Algorithms
Tasks Algorithms
Decision Trees
Neural Networks
Naïve Bayes
Logistic Regression
Linear Discriminant Analysis
Tasks Algorithms
Decision Trees
Neural Networks
Logistic Regression
Linear Regression
Tasks Algorithms
Clustering
Sequence Clustering
Tasks Algorithms
Association Rules
Decision Trees
Tasks Algorithms
Time Series
Kalman Filters
Hidden Markov Models
Tasks Algorithms
Information Retrieval Algos
Latent Dirichlet Allocation
Tasks Algorithms
Now we are talking...
… E D A !
“R” Rstudio
Python/numpy
Java/Kafka
Go/Gonum
Julia
Data Science (Software) Tools
Spark MLlib (Python, Java, “R”)
Mahout
TensorFlow
Keras
PyTorch
Nia
Catapult
Maana
A relatively simple algorithm for supervised learning.
Useful because it illustrates the basic working premises of ML.
A good understanding of LR provides a solid foundation for the general concepts, processes, and steps involved in more complex ML algorithms.
Example: Linear Regression
The steps include how to set up the data, how to apply an algorithm to the data in order to determine a good fit for F(X,Y), and how to use F to perform predictions with previously unseen data sets.
Example: Linear Regression
We will use 200 data records collected to assess the effect
of advertising media (TV, Radio, Newspaper) on Sales.
Three input variables and one output variable, or { X1, X2, X3, Y }.
Note the different slopes of the blue lines
In the case of a single variable, the regression model F is a function of one independent variable X
Therefore p = 1 and
We could estimate
Sales | TV
Sales | Radio
Sales | Newspaper
Using residual sum of squares, or RSS
Solving for the betas gives us the intercept and the slope of the line,
which we can use to make predictions on data never seen before.
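The closed-form fit can be sketched in a few lines of dependency-free Python. The numbers below are made-up stand-ins for one advertising column, not the real 200-record dataset:

```python
def simple_ols(x, y):
    """Minimize RSS(b0, b1) = sum((y_i - b0 - b1*x_i)^2) in closed form."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx                 # intercept from the means
    return b0, b1

# Hypothetical TV budgets vs. sales (illustration only):
tv    = [10, 20, 30, 40, 50]
sales = [8, 11, 13, 16, 17]
b0, b1 = simple_ols(tv, sales)        # b0 = 6.1, b1 = 0.23
pred = b0 + b1 * 60                   # prediction for an unseen budget: 19.9
```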
Now that we know how to obtain the model F for a single variable X,
how could we examine any combination of X parameters, including
X1 & X2
X2 & X3
X1 & X3
X1 & X2 & X3
In the multivariate linear regression case, F is a function of a set of p explanatory variables
X = [ x1 x2 , . . . , xp ]
Multi-Variable Linear Regression
Now we have an expression in which
X is a matrix of n rows and p columns,
Beta is a column vector of size p, and
Y is still a column vector of size n.
Using gradient descent we can iteratively find the parameters of the model (the betas).
The closed-form solutions, in matrix form and in summation form, are:
By expressing the problem in terms of linear algebra, we can compute the B’s using the linear algebra libraries in Python’s numpy, Go’s Gonum, R, Julia, etc.
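As a sketch of the iterative route, here is batch gradient descent on the mean squared error; the learning rate, iteration count, and toy data are all our own choices:

```python
def gd_linreg(X, y, lr=0.1, iters=2000):
    """Batch gradient descent for linear regression.
    Each row of X starts with a 1 for the intercept term."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p                       # gradient of (1/n)*RSS
        for xi, yi in zip(X, y):
            err = sum(b * v for b, v in zip(beta, xi)) - yi
            for j in range(p):
                grad[j] += 2.0 * err * xi[j] / n
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data generated by y = 1 + 2*x (so the betas should converge there):
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 5, 7]
beta = gd_linreg(X, y)                          # -> approximately [1.0, 2.0]
```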
Pseudo Inverse Matrix
The solution in essence involves finding the “pseudo-inverse” or Moore-Penrose matrix X+ = (X^T X)^(-1) X^T, which is a (p+1, n) matrix:
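In numpy this is numpy.linalg.pinv (computed via SVD); as a self-contained sketch we can instead solve the equivalent normal equations (X^T X) B = X^T y with a small Gaussian-elimination helper. All names and data below are ours:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """beta = (X^T X)^(-1) X^T y, i.e., the pseudo-inverse applied to y."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Columns: intercept, x1, x2; y generated exactly by y = 1 + 2*x1 + 3*x2
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 3]]
y = [1, 3, 4, 6, 14]
beta = ols(X, y)          # -> approximately [1.0, 2.0, 3.0]
```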
Should I use “R” . . . or Python/numpy, . . . or something else ?
Why Go?
"Go combines the ease of programming of an interpreted, dynamically typed language with the efficiency and safety of a statically typed, compiled language. Go is modern, with support for networked and multicore computing."
https://golang.org/doc/faq
Why Go?
language features
Why Go?
go tools
Scientific Computing Around 2008
Scientific Computing Around 2021
Julia
Go
DEMO
Code in GoLang
DEMO
DEMO
DEMO
DEMO
DEMO
There are several Web Servers and Internet products developed in “Pure Go” : Kubernetes, OpenShift, Hugo, Caddy, ...
In fact, Go's standard library (net/http) makes it easy to write your own web server and/or your own microservice, so…
Engineering of Data Science:
Go as a System Architecture Tool
Web Client
Web Server
makinacognica
Engineering of Data Science:
Go as a System Architecture Tool
GoWaMain
GoWa
DEMO
GoWa (Web Server)
GoWaMain (Web Client)
makinacognica
DEMO
GoWa (Web Server)
GoWaMain (Web Client)
makinacognica
DEMO
Scientific Computing Around 2021
Julia
Go
ISL Book Ch03 LinReg (in Julia)
https://github.com/JuanVargas/ISL/blob/main/chap_03/isl_ch03_linReg_jl.ipynb
https://bit.ly/3dZgqZu
MovieLens in Go and Julia
Spark was the first unified analytics engine facilitating large-scale data processing, SQL analytics, and ML, with in-memory workloads up to 100x faster than Hadoop MapReduce.
The Berkeley View
“… An unfortunate academic tradition is that we build research prototypes, then wonder why people don’t use them. Applications must drive research and provide concrete goals and metrics to evaluate progress …”
Field of Dreams: “If you build it, they will come”
Problem Solving and Active Learning
Increases engagement by encouraging a collaborative learning experience in which instructors and students work together to learn about real problems, propose solutions, and take action
Who will continue the journey?
References
R. Hamming: “The Art of Probability”
R. Hamming: “Methods of Mathematics Applied to Calculus, Probability and Statistics”
John Tukey: “Exploratory Data Analysis”
James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISLR Book)
Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)
References as .PDF
James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISLR Book)
Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)
Thank you !
Concurrency is a first-class citizen in Go
GoRoutines : https://tour.golang.org/concurrency/1
Channels : https://tour.golang.org/concurrency/2
Buffered Channels : https://tour.golang.org/concurrency/3
Select lets a goroutine wait on multiple communication ops
Mutex: https://tour.golang.org/concurrency/9
Go as a System Architecture Tool
Go Cloud
Modern Systems are Massive, … and Complex
MSFT Catapult
TensorFlow
Who will continue the journey?