1 of 120

Intro to Data Science

Juan E. Vargas

New URL to the slides: https://bit.ly/3CzCTV5

Juan Vargas Site: https://goo.gl/Ww2HN6

2 of 120

Outline

Who am I?

Where are we?

Where do we need to be?

How do we get there?

How can we ensure sustainable success?

Dr. George Heilmeier

Who are you?

What are you trying to accomplish?

What is the state of the art? What are its limitations?

What is the new approach you propose?

3 of 120

  • BSEE from UTEP
  • MS from CINVESTAV-IPN (MEX) (that included the basic sciences of the medical program)
  • Prof of Neurophysiology, School of Medicine, UNAM (MEX)
  • PhD from the Applied AI Program in Biomedical Engineering @ Vanderbilt University
  • Prof of CSCE at USC; research on applied ML, AI, Data Science, embedded and distributed systems, sensor networks, and biomedical engineering. Published 60+ papers and several book chapters, and presented at professional conferences.
  • MSFT Academic Relations Manager (2004-2007)
  • 2008 Gold Nugget Award, given each year to one exceptional graduate from UTEP College of Engineering.
  • Google Sr University Relation Manager (2007-2009)
  • MSFT Principal Research Manager (MSR/XCG, 2009-2012)
  • Assoc Dean of Engineering and Prof of CS at Georgia Southern Univ (2013-2017)
  • VP of Innovation and Education at Infosys; part of the team that founded 6 Innovation Centers in the USA (Indianapolis, Raleigh, Providence, Stamford, Dallas, Phoenix) (2017-2019)
  • Consultant/Advisor to Databricks (2021 - June 2022)
  • Director of Data Science, Vericast (June 2022 to date)

Who Am I?

4 of 120

Cloud Computing Academic Research Initiatives: NSF-CLuE, NSF-CiC, EU-VenusC, JPN, AUS

Goals

Accelerate scientific exploration and discovery

Build partnerships with government-sponsored research agencies and university consortia to provide cloud services to academic and research communities worldwide

Broaden the research capabilities of scientists, foster collaborative research communities to accelerate scientific discovery at a global scale

Help researchers interact with massively scalable data analysis tools and services directly accessible from their desktop, laptop, or mobile devices, in the same way they interact with Web search and online resources

5 of 120

80+ Projects

[Map: project sites worldwide, with the Project HQ in Seattle]

National Science Foundation (USA): Washington, California, Colorado, Delaware, Florida, Georgia, Indiana, Louisiana, Massachusetts, Michigan, New Mexico, New York, North Carolina, North Dakota, Pennsylvania, South Carolina, Texas, Virginia, WA DC

Japan: InfoPlosion (Tokyo, Kyoto)

Europe: Brussels (Venus-C); England (University of Nottingham); France (Inria); plus Italy, Spain, Greece, Denmark, Switzerland, Germany

Asia-Pacific: Taiwan (starting); China; Australia (partners: NICTA, ANU, CSIRO)

6 of 120

  • Inferring Pattern and Processes of Genome Evolution through Cloud Computing
  • GIS Vector Data Overlay Processing on Azure Platform
  • Porting the Structure-Adaptive Materials Prediction (SAMP) to the Azure Platform
  • Cooperative Developer Testing with Test Intentions
  • Towards automated and assurable enterprise network migration
  • Data Intensive Grid Computing on Active Storage Clusters
  • Moving Polarizable Force Field Simulations to the Microsoft Azure Platform
  • Maximizing the Utility of Orthologs and Phylogenetic Profiles for Systems-Scale Comparative Genomics.
  • Web-scale Language Modeling Features for Machine Translation
  • Stork Data Scheduler for Azure
  • Exploring Social Classification on Microsoft Azure
  • Where the Ocean Meets the Cloud: Ad Hoc Longitudinal Analysis and Collaboration Over Massive Mesh Data
  • Transforming Morphological Systematics to Web Applications: Development of the Workspace Morphobank.org 3.0
  • Semantic Web Informatics for Species in Space and Time
  • Systems Biology
  • Drug Discovery
  • Bioinformatics
  • Civil Engineering: Building Information Management.
  • Civil Engineering: Structural Analysis of Buildings.
  • Civil Protection and Emergencies.
  • Data for Science: Aqua maps.
  • UK: Project Horizon
  • France: INRIA

NSF (29 projects)

EU (28 projects)

7 of 120

Innovation

… where are we, where are we going?

8 of 120

Advances in STEM+C propelled our existence into the digital age. Smart sensors embedded in devices everywhere feed data 24/7 into cloud computing clusters, where new types of data analysis, modeling, and simulation are performed at scales and security levels never seen before.

Science and Engineering are multidisciplinary and data-centric, unifying theory, experimentation, modeling, and simulation. When well orchestrated to nurture each other, these synergistic interactions open innovation opportunities for those who are ready.

9 of 120

In the 2020’s a new breed of students who were born during this century is coming to college, living a continuously connected, computer-based, social-network existence. The new generations must learn to operate in and out of that existence. As educators and professionals we must find innovative ways to transform that existence into the learning experiences and products of the future.

10 of 120

Once again we are at an inflection point.

Of the 7.2 billion people on Earth, only about 45% enjoy the benefits of the internet. The number of “things” (smart devices) getting connected to the internet is orders of magnitude larger: conservative estimates predict that by the year 2025 there will be about 50 billion “things” connected to the internet. Those “things” will be designed and produced by professionals with the knowledge, skills, and resources ready for such a huge and transformative innovation opportunity.

11 of 120

Science, Technology, Engineering, Math … and Computing are at the center of these trends. The promises and challenges are real, but we must find innovative ways to build and operate new types of systems at scales never seen before.

12 of 120

The STEM+C professionals of the future will define and build new technology and software platforms, from device to cloud, covering Financials, Transportation, Healthcare, Manufacturing, Aviation, Energy Management, Power & Water, Oil & Gas, etc.

There are many opportunities but also significant challenges.

13 of 120

I will outline some of these challenges and opportunities.

14 of 120

Invention, Innovation, Invisible Technologies

Innovation is the outcome of inventions leading to the creation of economic impact and social value.

Successful technologies are those that become “invisible” . . . .

15 of 120

Invention, Innovation, Invisible Technologies

16 of 120

Invention, Innovation, Simpler Technologies

17 of 120

Pace of Innovation

    • It took about 55 years to spread the use of automobiles to ¼ of the US population…
    • 35 years for the telephone …
    • 22 years for the radio …
    • 16 years for the PC …
    • 13 years for the cell phone …
    • 7 years for the Internet…
    • ... 4 years for Google to become a verb…
    • … 3.5 years for FB to reach staggering numbers…
    • 3 years for “cloud computing” …
    • 2 years for Big Data and Machine Learning, to take center stage…

18 of 120

Pace of Innovation

    • 2 years for Airbnb to reach millions of users
    • 1.5 year for Uber to do the same
    • 1.5 years for the “renaissance of AI...”
    • 1 Year for Generative AI
    • ChatGPT (launched November 2022) is the fastest-adopted technology in human history
    • . . . wait, there is more, let's have a different perspective ….
    • ~3000 years for India and China to reach ~ 1.4 billion
    • ~18 years for Facebook to reach ~ 3.0 billion

In most cases, computing has been at the center of innovation during the last 50 years

19 of 120

ORG        Founded    Valuation
APPLE      1976       $2.76 T USD
MSFT       1975       $2.57 T USD
GOOGLE     1998       $1.60 T USD
AMZN       1994       $1.42 T USD
Facebook   2004       $797 B USD
NVIDIA     1993       $1.07 T USD
TESLA      2003       $678 B USD

20 of 120

Where are we?

[Diagram: a CORE linking FDS, SDS, and EDS]

FDS = Foundations of Data Science

SDS = Software for Data Science

EDS = Engineering of Data Science

“… applications must drive R&D to provide concrete goals and metrics to evaluate progress …”

David Patterson

21 of 120

Virtuous Circles

[Diagram: virtuous circles linking Artificial Intelligence, Data Science, Machine Learning, and Deep Learning; the foundations/software/engineering triad appears for both AI (FAI, SAI, EAI) and DS (FDS, SDS, EDS)]

22 of 120

Where are we?

CRA Report: the big-data pipeline runs Acquisition & Recording -> Extraction/Cleaning/Annotation -> Integration/Aggregation/Representation -> Analysis & Modeling -> Interpretation & Reports, cross-cut by the challenges of Scale, Heterogeneity, Time, Privacy, and Human Collaboration.

The 3 V’s: Volume, Velocity, Variety.

23 of 120

Where are we?

[Diagram: the ML/DS Code box is only a small fraction of a real system, surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Process Management Tools, Serving Infrastructure, Monitoring, and Analysis Tools]

The hardest part of Data Science is getting the data ready:

Acquisition & Recording -> Extraction/Cleaning/Annotation -> Integration/Aggregation/Representation -> Analysis & Modeling -> Interpretation & Reports

24 of 120

Where are we?

Embracing Data Science: a recent study involving 1,500 C-suite executives from organizations across 16 industries revealed: *

Proof of Concept (80%)

  • Orgs conduct experiments and pilots as siloed efforts confined within a lab, a department, or a team
  • Unable to extract real value
  • Struggle to scale due to unrealistic expectations on resources and time required
  • No connection to biz OKRs or to a long-term strategy connected to the org priorities
  • Effort and investment to scale are underestimated, yielding low ROI and poor results

Trying to Scale (~15%)

  • Long-term strategy and operating model linked to OKRs
  • Processes and accountability identified or being defined
  • Multi-dimensional team(s) active and supported across the org
  • Championed by a Chief Data or Analytics Officer
  • However: scaled DS is still not mature, not fully adopted

Industrialized for Growth (< 5%)

  • These orgs have a digital platform mindset and a DS culture already in place
  • Data and analytics democratized across the organization drive biz decisions
  • Have scaled models with a responsible DS framework
  • Promote product and service innovation
  • Realize benefits from increased visibility into customer and employee expectations

* Accenture

25 of 120

Data Science as the Fourth Paradigm of Science

Jim Gray said "...everything about science is changing because of the impact of IT and the data deluge...". He called DS the "fourth paradigm" of science:

Empirical -> Theoretical -> Computational -> Data-driven (-> Cloud -> Quantum -> AI)

DS is an interdisciplinary process; algorithms and systems are used to extract knowledge and insights from structured or unstructured data.

DS unifies statistics, probability, EDA, machine learning, and neural networks to "understand and analyze actual phenomena" with data.

DS employs methods from several fields (math, statistics, probability, IS, CS).

26 of 120

Where are we?

Examples of the data deluge: sensors and experiments (the Large Hadron Collider produces ~15 PB/year); modeling and simulations (molecular dynamics on the Anton supercomputer).

Smart sensors feed data 24/7 into clusters where new types of data analysis need to be performed at levels of variety, velocity, volume, and security never seen before.

27 of 120

Algorithms

      • Linear Regression
      • Logistic Regression
      • Linear Discriminant Analysis (LDA)
      • PCA, SVD
      • Clustering
      • Decision Trees
      • Genetic Algorithms
      • Naïve Bayes
      • Sequence Clustering
      • Time Series
      • Association Rules
      • Support Vector Machines
      • Bayesian Networks
      • Neural Networks
      • Deep Neural Networks

Given a set of data (D) and some measurement of certainty (C), find statements (S) or patterns (P) that describe relationships among subsets of D with certainty C.

Interesting patterns with sufficient certainty become new pieces of knowledge that are added to a knowledge base.

Machine Learning

28 of 120

Where are we?

John Tukey

Rehashing EDA as Data Science

29 of 120

Where are we?

[Diagram from “Hidden Technical Debt in ML Systems”, Google, NIPS 2015: the ML Code box is only a small fraction of a real system, surrounded by Configuration, Data Collection, Feature Extraction, Data Verification, Machine Resource Management, Process Management Tools, Serving Infrastructure, Monitoring, and Analysis Tools]

The hardest part of Data Science is getting the data ready.

30 of 120

Where are we?

Database Management Systems (DBMS) (~1990’s to ~2000)

  • Multiple data sources, with a variety of integration and management levels
  • Structured data encoded as relational DBs and SQL schemas
  • No BI, simple analytics via SQL queries
  • Some ACID transactions
  • Compute and storage in single, on-premises appliance
  • Data reporting only on data already stored in system
  • Limited compute and scalability
  • Proprietary formats
  • High Cost

31 of 120

Where are we?

Data Warehouse: First Gen Data Analytics Platform (~2000 - 2009)

  • Integrate data from multiple sources to reduce data silos
  • Structured data encoded as relational DBs and SQL schemas
  • Optimized for downstream BI consumption
  • Full support for ACID transactions
  • Fast reporting of data already stored in system
  • Compute and storage coupled in one or more on-premises appliances
  • Unable to store or query unstructured data
  • Only support BI and reports
  • Limited compute and storage scalability
  • Limited data flexibility (no video, audio, text, raw data, streams)
  • Proprietary formats, lock-in vendors
  • High maintenance cost
  • Expensive, especially at large scale
  • No support for DS, ML, AI, . . .

[Figure: multiple relational tables]

32 of 120

Where were we?

33 of 120

Where are we?

Warehouse + Data Lake: Second Gen Data Analytics Platform (~2010 - 2019)

  • Clusters running open data standards (ORC, Parquet)
  • Hadoop -> Spark (100x faster)
  • Elastic compute and storage
  • Data flexibility (video, audio, text, raw, stream …)
  • Access to structured data in warehouse via ETL/ELT
  • Can work in-house, in cloud, mixed mode
  • Lakes don’t support transactions
  • Lakes don't enforce data quality, consistency, isolation
  • When combined with warehouses, additional pipelines for ETL/ELT are needed, resulting in “accidental complexity” (Brooks), delays, failure modes
  • Added complexity for users and developers
  • Non-uniform availability of data
  • Some but limited support for DS, ML, AI
  • Increased total cost of ownership
  • Major warehouses added support for external tables (Parquet, ORC) but connectors perform poorly

34 of 120

Where are we?

Embracing Data Science: a recent study involving 1,500 C-suite executives from organizations across 16 industries revealed: *

Proof of Concept (80%)

  • Orgs conduct experiments and pilots as siloed efforts confined within a lab, a department, or a team
  • Unable to extract real value
  • Struggle to scale due to unrealistic expectations on resources and time required
  • No connection to biz OKRs or to a long-term strategy connected to the org priorities
  • Effort and investment to scale are underestimated, yielding low ROI and poor results

Trying to Scale (~15%)

  • Long-term strategy and operating model linked to OKRs
  • Processes and accountability identified or being defined
  • Multi-dimensional team(s) active and supported across the org
  • Championed by a Chief Data or Analytics Officer
  • However: scaled DS is still not mature, not fully adopted

Industrialized for Growth (< 5%)

  • These orgs have a digital platform mindset and a DS culture already in place
  • Data and analytics democratized across the organization drive biz decisions
  • Have scaled models with a responsible DS framework
  • Promote product and service innovation
  • Realize benefits from increased visibility into customer and employee expectations

* Accenture

35 of 120

Where do we need to be?

  • Establish and maintain a digital platform mindset and DS/AI culture
  • Make DS/AI a “first-class” priority and endeavor across the org, driving biz decisions
  • Democratize data and analytics across the entire org
  • Have scaled models with a responsible DS/AI framework
  • Promote product and service innovation
  • Realize benefits from increased visibility into customer and employee expectations
  • Define a long-term strategy and operating model linked to OKRs
  • Define processes, metrics, accountability
  • Set and nurture multidisciplinary team(s) across the entire org

36 of 120

How do we get there?

Critical success factors of those who progressed beyond Proof of Concept *

  • Drive intentionally
    • There is an understanding that DS is a long-term journey to a dynamic destination
    • Have structure and governance in place
    • Define processes, metrics, accountability
  • Improve data quality, management, governance
    • Have clear operating models for end-to-end data management (generation, custody, consumption)
  • DS is a (multidisciplinary) “Team Sport”
    • It is not about a single leader, it is about…
    • Data {scientists, engineers, modelers, GUI and visualization experts, product/process Engineers}
  • Focus on the “I” of ROI (it is a long-term Investment, not a cost)
  • Adopt a digital platform mindset to scale
    • Platforms drive scale, accelerate and extend value, break down silos, foster collaboration
  • Build Trust through responsible DS
    • Ethical, transparent, accountable practices, consistent with org values, laws, social norms

37 of 120

How do we get there?

38 of 120

How do we get there?

Lake House: Third Gen Data Analytics Platform (~2020)

  • Based on open standards (ORC, Parquet, DataFrames, Pandas)
  • Elastic workloads (compute and storage)
  • Data flexibility (video, audio, text, raw, stream …)
  • Support transactions, zero-copy cloning
  • Enforce data quality, consistency, isolation
  • Schema enforcement
  • Access control via constraints API and audit logging
  • Simpler data management for DS, ML, AI
  • Performance comparable to major warehouse architectures
  • Low (operational) cost

39 of 120

Data Science Foundations

As Richard Hamming said:

“ ...do not learn as if you walk through a picture gallery without learning how to mix paints, how to compose pictures... “

“...we must learn and teach how knowledge was generated, not just how to retrieve it… …so that we may generate the results that we need, even if no one has ever done it before us...”

Many courses and tutorials in DS focus on the use of tools, not on a deep understanding of the methods; academia often covers only the math foundations.

Yet, in practice, DS is a set of skills based on a deep understanding of methods from calculus, probability and statistics.

40 of 120

Statistics

The goal of statistics is to make inferences from data. Statistical inference consists of:

  1. Collecting data
  2. Describing data
  3. Analyzing data
  4. Drawing new conclusions from the results (via inference) to validate (or not) a hypothesis

We postulate a set of hypotheses, then carry on experiments to collect data, then describe the results, then make inferences from the results to draw conclusions concerning our hypotheses.

Note resemblance to the scientific method.

41 of 120

Probability

Counting events, trials, permutations, combinations; set theory, contingency tables, probability distribution functions, conditional probability, joint probabilities, Bayes Theorem, maximum likelihood estimation (MLE), Bayesian networks, junction trees, other graphical models, ...

Example: what is the probability of getting exactly 6 heads in 10 tosses of a fair coin?

P(X = 6) = C(10,6) * 0.5^6 * 0.5^4 = 210 * 0.000977 ≈ 0.205
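
A quick sanity check of this arithmetic, as a minimal Go sketch (Go is the language used for the demos later in the deck; the binomial helper here is our own, not a library call):

```go
package main

import (
	"fmt"
	"math"
)

// binomial returns C(n, k), the number of ways to choose k of n items.
func binomial(n, k int) float64 {
	c := 1.0
	for i := 0; i < k; i++ {
		c = c * float64(n-i) / float64(i+1)
	}
	return c
}

func main() {
	// P(X = 6) for X ~ Binomial(n = 10, p = 0.5):
	// C(10,6) * 0.5^6 * 0.5^4 = 210 * 0.000977 ≈ 0.205
	n, k, p := 10, 6, 0.5
	prob := binomial(n, k) * math.Pow(p, float64(k)) * math.Pow(1-p, float64(n-k))
	fmt.Printf("P(X = %d) = %.3f\n", k, prob) // prints 0.205
}
```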

42 of 120

Statistics and Probability

In Probability we go from a model to what we expect to see in the data.

In Statistics we use the observations to obtain estimates of the model parameters.

learn how to mix paints, how to compose pictures

43 of 120

Data is typically organized as a set of P parameters and a set of N records. Hence, data is represented as a matrix having N rows and P columns, or X[N,P]. The values in the cells may be numerical (continuous or discrete) or categorical (binary or multiple categories).

Problem Setup

44 of 120

Yes, Linear Algebra is a must*

* When dealing with continuous data

45 of 120

Yes, Linear Algebra is a must.

46 of 120

The set of observed data typically is configured as

{ ( x1,y1),

( x2,y2),

. . .

( xn,yn) }

where each row in X[N,P] is configured as a row vector xi of length p, and Y is a column vector of length n.

Problem Setup
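
To make the setup concrete, here is a minimal sketch of X[N,P] and Y using Gonum's mat package (Gonum is named later in the deck); the values are invented for illustration:

```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

func main() {
	// Toy data: N = 4 records, P = 2 parameters per record.
	// X is an N x P matrix; Y is a column vector of length N.
	X := mat.NewDense(4, 2, []float64{
		1.0, 2.0,
		2.0, 1.5,
		3.0, 3.5,
		4.0, 2.5,
	})
	Y := mat.NewVecDense(4, []float64{3.1, 4.2, 7.3, 8.1})

	r, c := X.Dims()
	fmt.Printf("X is %d x %d, Y has length %d\n", r, c, Y.Len())

	// Row i of X is the row vector x_i paired with response y_i.
	fmt.Println("x_2 =", mat.Formatted(X.RowView(2)))
}
```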

47 of 120

Given a set of p predictors X = (x1, x2, …, xp ) we would like to predict the value of a response Y.

We assume the existence of a function f such that

Y = f(X) + e

where e is the error term. In principle, f is in a form that captures the systematic nature of the data.

Problem Setup

48 of 120

The figure shows income vs years of education.

The figure suggests that a function fitting the data might exist.

Problem Setup

49 of 120

We need to find the best estimate of f.

Y = f(x) + e

The accuracy of our prediction depends on two components: the reducible error and the irreducible error.

The reducible error is the inaccuracy in our estimate of f, which we can try to reduce with better models.

The other component, the irreducible error, comes from e itself: it cannot be predicted from X and therefore cannot be reduced.

Problem Setup

50 of 120

Problem Setup

Let Y’ = f’(X) be an estimate of Y. The expected squared prediction error decomposes into a reducible and an irreducible part:

E[ (Y - Y’)^2 ] = [ f(X) - f’(X) ]^2 + Var(e)

where [ f(X) - f’(X) ]^2 is the reducible error and Var(e) is the irreducible error.

The focus of statistical learning is to estimate the best possible f, i.e., the function that minimizes the reducible error.

51 of 120

To be successful, we must determine:

    • Which predictors are associated with the outcome? Which predictors are NOT?
    • Are some predictors more important than others? If so, how can we measure their importance?
    • What type of models can we use to answer these questions: parametric models (linear, or higher degree)? non-parametric models?
    • Prediction accuracy vs model interpretability
    • Methods: classification vs regression; supervised vs unsupervised learning?

52 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Is this e-Mail to be trusted?

Who is more likely to respond to these mails?

Will this group of patients’ health improve or deteriorate?

Is this transaction OK, suspicious, or fraudulent?

Is network traffic normal, low, too high?

53 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Is this record an outlier?

What are the features in my records that will help me partition the data?

54 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

What are the features in my records that will help me partition the data?

How can I partition/classify/segment these data records?

55 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Which events occur together?

Which services are used together?

Which events never occur together?

Which set of remedies should I recommend given that we are under a red, green, or other alert level?

56 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

What will the risk of a cyber attack be in October?

What is the risk of running out of beer in Oktoberfest?

What type of risk events might occur at Christmas or on Cinco de Mayo?

Which type of flu will be the most dangerous next season? Which one the most disseminated?

57 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Analyze unstructured data:

  • search keywords and phrases
  • convert to structured data
  • feed into algorithms for classification, segmentation, association

How can we use web feedback?

How can we classify e-Mail?

How can we identify spam?

How can I improve help desk (centers) calls?

58 of 120

Tasks

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Use visualization tools to further explore relations and patterns not seen at first

Use IR to reveal relations and patterns not visible with more traditional methods

59 of 120

Algorithms

      • Linear Regression
      • Logistic Regression
      • Linear Discriminant Analysis (LDA)
      • Clustering
      • Decision Trees
      • Support Vector Machines
      • Naïve Bayes
      • Sequence Clustering
      • Time Series
      • Association Rules
      • Neural Networks
      • Genetic Algorithms
      • Bayesian Networks
      • Deep Neural Networks

60 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Decision Trees

Neural Networks

Naïve Bayes

Logistic Regression

Linear Discriminant Analysis

61 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Decision Trees

Neural Networks

Logistic Regression

Linear Regression

62 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Clustering

Sequence Clustering

63 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Association Rules

Decision Trees

64 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Time Series

Kalman Filters

Hidden Markov Models

65 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Information Retrieval Algos

Latent Dirichlet Allocations

66 of 120

Tasks Algorithms

    • Classification
    • Estimation
    • Segmentation
    • Association
    • Forecasting
    • Text Analysis
    • Data Exploration

Now we are talking...

… E D A !

67 of 120

Data Science (Software) Tools

Languages: “R”/RStudio, Python/numpy, Java/Kafka, Go/Gonum, Julia

Libraries and platforms: Spark MLlib (Python, Java, “R”), Mahout, TensorFlow, Keras, PyTorch, Nia, Catapult, Maana

68 of 120

Linear regression is a relatively simple algorithm for supervised learning. It is useful because it illustrates the basic working premises of ML: a good understanding of LR provides a solid foundation for the general concepts, processes, and steps involved in more complex ML algorithms.

Example: Linear Regression

69 of 120

The steps include how to set up the data, how to apply an algorithm to the data in order to determine a good fit F, and how to use F to perform predictions on previously unseen data sets.

Example: Linear Regression

We will use 200 data records collected to assess the effect of advertising media (TV, Radio, Newspaper) on Sales: three input variables and one output variable, or { X1, X2, X3, Y }.

70 of 120

Note the different slopes of the blue lines

71 of 120

In the case of a single variable, the regression model F is a function of one independent variable X. Therefore p = 1 and

Y = B0 + B1*X + e

We could estimate any of

Sales | TV

Sales | Radio

Sales | Newspaper

72 of 120

Using the residual sum of squares,

RSS = sum_i ( y_i - B0 - B1*x_i )^2

and solving for the Betas gives us the intercept and the slope of the line:

B1 = sum_i ( x_i - xbar )( y_i - ybar ) / sum_i ( x_i - xbar )^2

B0 = ybar - B1*xbar

which we can use to make predictions on data never seen before.
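
A minimal Go sketch of these two formulas; the five (TV, Sales) pairs below are just a small stand-in for the 200 Advertising records:

```go
package main

import "fmt"

// simpleOLS returns the intercept b0 and slope b1 that minimize
// RSS = sum_i (y_i - b0 - b1*x_i)^2.
func simpleOLS(x, y []float64) (b0, b1 float64) {
	n := float64(len(x))
	var xbar, ybar float64
	for i := range x {
		xbar += x[i]
		ybar += y[i]
	}
	xbar /= n
	ybar /= n

	var num, den float64
	for i := range x {
		num += (x[i] - xbar) * (y[i] - ybar)
		den += (x[i] - xbar) * (x[i] - xbar)
	}
	b1 = num / den
	b0 = ybar - b1*xbar
	return b0, b1
}

func main() {
	// Illustrative (TV budget, Sales) pairs.
	tv := []float64{230.1, 44.5, 17.2, 151.5, 180.8}
	sales := []float64{22.1, 10.4, 9.3, 18.5, 12.9}

	b0, b1 := simpleOLS(tv, sales)
	fmt.Printf("Sales ≈ %.3f + %.3f * TV\n", b0, b1)

	// Predict sales for a previously unseen TV budget.
	newTV := 100.0
	fmt.Printf("Predicted sales at TV = %.0f: %.2f\n", newTV, b0+b1*newTV)
}
```

Running it prints the fitted intercept and slope, then a prediction for an unseen TV budget.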

73 of 120

Now that we know how to obtain the model F for a single variable X, how could we examine any combination of the X parameters, including

X1 & X2

X2 & X3

X1 & X3

X1 & X2 & X3 ?

74 of 120

Multi-Variable Linear Regression

In the multivariate linear regression case, F is a function of a set of p explanatory variables X = [ x1, x2, . . . , xp ]. Now we have the expression

Y = X*B + e

in which X is a matrix of n rows and p columns, B is now a column vector of size p, and Y is still a column vector of size n.

75 of 120

We can find the parameters of the model (the B’s) by gradient descent, or directly in closed form. The solution in matrix form is

B = (X^T X)^-1 X^T Y

By expressing the problem in terms of linear algebra, we can compute the B’s using the linear algebra libraries in Python’s numpy, Go’s Gonum, R, Julia, etc.

Pseudo Inverse Matrix

76 of 120

The solution in essence involves finding the “pseudo-inverse” or Moore-Penrose Matrix (MPM) X+ = (X^T X)^-1 X^T of the design matrix X, which (with the intercept column of ones) is an (n, p+1) matrix.
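
A minimal sketch of this solution with Gonum's mat package, assuming X already carries the leading column of ones for the intercept; the data values are invented:

```go
package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

// olsNormal computes B = (X^T X)^-1 X^T Y, i.e. applies the
// Moore-Penrose pseudo-inverse X+ to Y.
func olsNormal(X *mat.Dense, Y *mat.VecDense) *mat.VecDense {
	var xtx mat.Dense
	xtx.Mul(X.T(), X)

	var xty mat.VecDense
	xty.MulVec(X.T(), Y)

	var beta mat.VecDense
	if err := beta.SolveVec(&xtx, &xty); err != nil {
		panic(err) // X^T X singular: predictors are collinear
	}
	return &beta
}

func main() {
	// Four records: intercept column of ones + two predictors (p = 2).
	X := mat.NewDense(4, 3, []float64{
		1, 1.0, 2.0,
		1, 2.0, 1.5,
		1, 3.0, 3.5,
		1, 4.0, 2.5,
	})
	Y := mat.NewVecDense(4, []float64{3.1, 4.2, 7.3, 8.1})

	beta := olsNormal(X, Y)
	fmt.Println("B =", mat.Formatted(beta))
}
```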

77 of 120

Should I use “R” . . . or Python/numpy, . . . or something else ?

78 of 120

Why Go?

"Go combines the ease of programming of an interpreted, dynamically typed language with the efficiency and safety of a statically typed, compiled language. Go is modern, with support for networked and multicore computing."

https://golang.org/doc/faq

  • Robert Griesemer (Cray Vectorization, Sawzall, Java HotSpot)
  • Rob Pike (UTF-8, Squeak, Plan9, Inferno)
  • Ken Thompson (C, Unix, UTF-8)

79 of 120

Why Go?

  • Open-Source ( http://github.com/golang/go )
  • Compiled
  • Statically-typed
  • Garbage collected
  • Speed: within 10% of “C”
  • Short compile times
  • Simple syntax
  • Stack traces on nil dereferences ( no segmentation faults )
  • Composition types via interfaces, not inheritance
  • Concurrency
  • Large standard library growing very fast, e.g., GoNum
  • List of packages available or under dev: https://awesome-go.com/
  • My language of choice, the one I use when I need to push the envelope

language features

80 of 120

Why Go?

  • Easy to update/download packages (go get)
  • No makefiles (go install)
  • Easy to create/use test suites (go test)
  • Race detection (go test -race, go install -race)
  • Automatic doc generation (godoc)
  • Code formatting (gofmt)
  • Automatic import detection (goimports)
  • Profiling (go tool pprof)
  • IMHO Go is better than any other language for
    • Using external libs
    • Managing complexity
    • Writing legible, elegant code that will last

go tools

81 of 120

Scientific Computing Around 2008

82 of 120

Scientific Computing Around 2021

Julia

Go

83 of 120

84 of 120

85 of 120

86 of 120

DEMO

87 of 120

Code in GoLang

DEMO

88 of 120

DEMO

89 of 120

DEMO

90 of 120

DEMO

91 of 120

DEMO

92 of 120

There are several web servers and internet products developed in “pure Go”: Kubernetes, OpenShift, Hugo, Caddy, ...

In fact, the standard library is fantastic for writing your own web server and/or your own microservice, so ...

Engineering of Data Science:

Go as a System Architecture Tool
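
As an illustration, here is a minimal web server sketch using only the standard library's net/http. This is a generic example, not the GoWa code from the demos; the /predict route and its canned response are made up:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Register a handler: a real microservice would parse features
	// from the request and run a model; here we echo a canned reply.
	http.HandleFunc("/predict", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"sales": 14.02}`)
	})

	log.Println("listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```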

93 of 120

Web Client

Web Server

makinacognica

Engineering of Data Science:

Go as a System Architecture Tool

GoWaMain

GoWa

94 of 120

DEMO

GoWa (Web Server)

GoWaMain (Web Client)

makinacognica

95 of 120

DEMO

96 of 120

GoWa (Web Server)

GoWaMain (Web Client)

makinacognica

DEMO

97 of 120

Scientific Computing Around 2021

Julia

Go

98 of 120

ISL Book Ch03 LinReg (in Julia)

https://github.com/JuanVargas/ISL/blob/main/chap_03/isl_ch03_linReg_jl.ipynb

99 of 120

https://bit.ly/3dZgqZu

MovieLens in Go and Julia

100 of 120

Spark was the first unified analytics engine that facilitated large-scale data processing, SQL analytics, and ML. For many workloads, Spark ran up to 100x faster than Hadoop MapReduce.

101 of 120

The Berkeley View

“… An unfortunate academic tradition is that we build research prototypes, then wonder why people don’t use them. Applications must drive research and provide concrete goals and metrics to evaluate progress …”

Field of Dreams: If you build it, they will come

102 of 120


103 of 120

Problem Solving and Active Learning

Increases engagement by encouraging a collaborative learning experience in which instructors and students work together to learn about real problems, propose solutions, and take action

  • Require students to do something versus just learn about something
  • Require that students document the experience/process from challenge to solution
  • Focus on 'real-life' challenges and solutions
  • Multiple points of entry
  • Multiple possible solutions
  • Connection with multiple disciplines
  • Focus on developing 21st-century skills for students and instructors
  • Leverage modern technology tools and resources geared towards online distance learning
  • Use modern technology tools for organizing, collaborating, and sharing

104 of 120

Who will continue the journey?

105 of 120

References

R. Hamming: “The Art of Probability”

R. Hamming: “Methods of Mathematics Applied to Calculus, Probability and Statistics”

John Tukey: “Exploratory Data Analysis”

James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISL Book)

Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)

106 of 120

References as .PDF

James, Witten, Hastie, Tibshirani: “An Introduction to Statistical Learning” (ISL Book)

Hastie, Tibshirani, Friedman: “Elements of Statistical Learning” (ESL Book)

107 of 120

Thank you !

108 of 120

Concurrency is a first-class citizen in Go

GoRoutines : https://tour.golang.org/concurrency/1

Channels : https://tour.golang.org/concurrency/2

Buffered Channels : https://tour.golang.org/concurrency/3

Select lets a goroutine wait on multiple communication ops

Mutex: https://tour.golang.org/concurrency/9

Go as a System Architecture Tool
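
A minimal sketch that ties these primitives together: two goroutines feeding an unbuffered and a buffered channel, with select multiplexing between them:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	fast := make(chan string)    // unbuffered: send blocks until received
	slow := make(chan string, 1) // buffered: one send won't block

	go func() {
		for {
			fast <- "tick"
			time.Sleep(100 * time.Millisecond)
		}
	}()
	go func() {
		slow <- "tock"
	}()

	for i := 0; i < 5; i++ {
		select { // wait on multiple communication ops
		case m := <-fast:
			fmt.Println("fast:", m)
		case m := <-slow:
			fmt.Println("slow:", m)
		}
	}
}
```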

109 of 120

Go Cloud

110 of 120

Modern Systems are Massive, … and Complex

111 of 120


112 of 120


MSFT Catapult

113 of 120


114 of 120


115 of 120


116 of 120


117 of 120


118 of 120


119 of 120


TensorFlow

120 of 120

Who will continue the journey?