1 of 83

Correlation and Causality:

Case Studies

Prof. Shih-wei Liao

2 of 83

Reference:

3 of 83

Judea Pearl: The Book of Why

4 of 83

More References:

  • Judea Pearl: Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution

5 of 83

History of AI

  • 70s: Boom of rule-based Expert Systems
  • 80s: Boom of Neural Networks
  • 90s:
    • Return of Non-Monotonic Reasoning
    • NLP: Natural Language Processing & Speech Recognition
      • Bayesian Networks: Causality
    • Support Vector Machines: Correlation
  • After 2014: Boom of AI based on big data & big compute (2014: a local high of AI investment in the US)
    • Correlation
    • Deep Neural Networks
  • Next: Causality?

6 of 83

Correlation ≠ Causality

  • Case 1: X correlates with Y, but X doesn’t cause Y.
    • Z causes Y.
    • If we know Z, we can predict Y.

  • Case 2: X causes Z, and Z causes Y, so X causes Y.
    • If we know X, we can predict Y.

[Diagrams: Case 1 — Z → X and Z → Y (common cause); Case 2 — X → Z → Y (causal chain)]
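To see Case 1 concretely, here is a minimal simulation of a common cause (the coefficients and sample size are illustrative, not from the slides): Z drives both X and Y, so X and Y correlate strongly even though neither causes the other.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Common cause Z drives both X and Y (coefficients are illustrative).
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)   # Z causes X
Y = 3.0 * Z + rng.normal(size=n)   # Z causes Y; X does NOT cause Y

# X and Y correlate strongly...
print("corr(X, Y) =", np.corrcoef(X, Y)[0, 1])   # ~0.85

# ...but setting X by intervention (breaking the Z -> X link)
# leaves Y untouched: the correlation was not causation.
X_do = rng.normal(size=n)
print("corr(do(X), Y) =", np.corrcoef(X_do, Y)[0, 1])  # ~0.0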

7 of 83

Causality ⇒ Correlation

Correlation ⇏ Causality

  • For example: the correlation between ice cream sales and sunglasses sales
    • As ice cream sales increase, so do sunglasses sales.

8 of 83

3 Variables: Heat, Ice cream, Sunglasses

[Diagram: Heat → Ice cream sales; Heat → Sunglasses sales]

9 of 83

“99% Correlation” ≠ Causality

The explanatory power of regression parameters is often a subjective judgment call (自由心證).

[Diagram: Wine sales vs. #fund managers — 99% correlation, but the causal link is oftentimes just a “?”]

10 of 83

Correlation ≠ Causality

  • For example: [figure]

11 of 83

Why Steve Jobs Doesn’t Poll:

  • Correlation tells you what
    • Data tells you the “result”, possibly even insight
    • but not necessarily the cause and effect
  • Causality tells you why

Note: Decision science cannot just count on correlation.

  • Steve Jobs doesn’t decide based on data input alone.
    • A visionary (like Steve Jobs) has intuition.
    • But what about the rest of us?
      • System simulation or experimental design is required
      • But is it always possible to experiment?

12 of 83

Correlation ≠ Causality

  • For example: [figure]

13 of 83

Recap: Causality vs. Correlation:

  • Why vs. What
  • “Why” is hard:
    • Usually needs domain knowledge
    • Usually needs to design the experiments!

14 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

15 of 83

Domain Knowledge vs. Data-Driven

  • Some AI pundits seem to think that we won’t need domain experts in the future.
    • However, domain experts understand a lot of causal models from their experience.
  • For example: “Does the sunrise cause the rooster to wake up early, or is the rooster responsible for making the sun rise?”
    • Machine learning will find the relationship between roosters and the sunrise, but will never know which causes which from the observed data alone.

16 of 83

Challenges in Causality Science:

  • Domain knowledge
  • Designing the experiments!
    • Big data typically doesn’t come from big, expensive experiments.
    • Some experiments are intrusive
      • Medical
  • So, it often needs:
    • intuition (gut-feel decisions)
    • qualitative study or insight (behavioral, psychological)
    • exploratory research
    • UI/UX
  • As a result, big-data AI so far counts on correlation.

17 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

18 of 83

Machine Learning vs. Causality Science:

19 of 83

Mathematical Foundation of Causality:

  • Bayesian network
    • a probabilistic graphical model
    • represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG)
    • DAG whose nodes represent variables in the Bayesian sense.

20 of 83

Causality uses Bayesian Networks:

For example:

The joint probability function of a Bayesian network factorizes over the DAG:

P(x1, …, xn) = ∏i P(xi | parents(xi))

21 of 83

Causality: Use Bayesian Networks

G = Grass wet (true/false), S = Sprinkler turned on (true/false), and R = Raining (true/false)

The joint factorizes as P(G, S, R) = P(G | S, R) · P(S | R) · P(R).

What is the probability that it is raining, given the grass is wet?

P(R = T | G = T) = Σ_S P(G = T, S, R = T) / Σ_{S,R} P(G = T, S, R)
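As a sanity check, the query can be computed by brute-force enumeration. A minimal sketch, using the conditional probability tables of the classic sprinkler illustration (these numbers are the standard textbook values, not taken from these slides):

from itertools import product

# CPTs from the classic sprinkler example (illustrative values).
P_R = {True: 0.2, False: 0.8}                              # P(R)
P_S_given_R = {True: 0.01, False: 0.4}                     # P(S=T | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # P(G=T | S, R)

def joint(g, s, r):
    """P(G=g, S=s, R=r) = P(G|S,R) * P(S|R) * P(R)."""
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    return pg * ps * P_R[r]

# P(R=T | G=T) = sum_S P(G=T, S, R=T) / sum_{S,R} P(G=T, S, R)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(f"P(R=T | G=T) = {num / den:.4f}")  # ≈ 0.3577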

22 of 83

Challenges in Causality Science:

  • The wet-grass example above has only 3 variables
  • Hard to scale to many variables!
    • We need to know the relationships among all of them
    • Causality is usually used for explanation.
      • It is hard to produce the accurate values we want to predict.

23 of 83

System Dynamics

24 of 83

The Three-Layer Causal Hierarchy

25 of 83

Structural Causal Models (SCM)

  • The SCM deploys three parts
    1. Graphical models,
    2. Structural equations, and
    3. Counterfactual and interventional logic

26 of 83

Extension: Multimodal Metadata Fusion Using Causal Strength

  • ACM Multimedia paper by Yi Wu, Ed Chang, Belle Tseng
    • Image classification:
    • Integrate time, location, and indoor/outdoor information
      • E.g., sunrise or sunset
    • “Because the time is 5pm, we know this is a sunset photo”: a good explanation
  • These patterns (causality rules) vs. Deep Learning
    • Note: Deep Learning offers low explainability, but it is not limited by the ability to obtain causality rules.

27 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

28 of 83

Machine Learning uses Correlation or Causality?

  • Take house-price prediction for example:
    • Input: Zoning, Area, YearBuilt, …
    • Output: Price
    • In fact, policy will also influence it.
    • Machine learning uses big data to find the correlation between these inputs and the price (a minimal sketch follows).
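For instance, a minimal correlational model of this kind might look as follows; the features and numbers are invented for illustration, and the model makes no claim about what causes the price:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented stand-ins for (Zoning, Area, YearBuilt) -> Price.
X = np.array([[1, 120, 1990],
              [2,  80, 2005],
              [1, 200, 1975],
              [3, 150, 2015]], dtype=float)
prices = np.array([300_000, 250_000, 400_000, 500_000], dtype=float)

model = LinearRegression().fit(X, prices)
# A purely correlational prediction: how inputs co-vary with price.
print(model.predict(np.array([[2, 100, 2000]], dtype=float)))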

29 of 83

Deep Learning uses Correlation or Causality?

  • Correlation: it finds correlations between the inputs and the outputs.

30 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

31 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

32 of 83

Challenges & Opportunities: Causality

  • Domain knowledge
  • Design the experiments

  • Challenges:
    • The gravitational attraction of the moon causes the oceans to bulge out in the direction of the moon.
    • Genetic cause of sexuality?

33 of 83

Challenge: Where Correlation is Effective

  • Machine Translation: where correlation is effective. The three input representations below perform equally well (a = b = c)
  • Geoffrey Hinton: Knowing the metadata (Chinese vs. English) doesn’t help deep learning results!

    • Unicode with metadata (Chinese vs. English)
    • Unicode without metadata
    • Character pictures (images of the characters)

34 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

35 of 83

Case Study: Beer & Diaper

  • “Beer and Diapers” study by Walmart
    • Correlated
    • But why?
      • Young dads were asked by their wives to fetch diapers.
        • They grab beers in the same trip
      • Decision: Walmart should co-locate beer and diapers
  • Metadata: Friday evening
  • Correlation helps decision science somewhat
  • Correlation is less precise than causality
      • Is correlation usually effective, anyway?
  • Note: Causal parameters make the output more accurate.

36 of 83

Case Study: Medical

37 of 83

38 of 83

Multi-Armed Bandit

39 of 83

40 of 83

Case Study: Bitcoin Tracing

    • Time zone information:
      • We used to classify Chinese traders vs. US traders based on time-zone information
        • Metadata is dangerous: the rules may be wrong. Some traders stay up late to trade.
        • Unlike the traditional stock market
    • “Directionless money laundering” (無方向性洗錢)

41 of 83

Bitcoin Tracing by Causality

  • Because cryptocurrency is used for money laundering, certain specific transaction patterns appear:
    • Stage 1: Convert cash into a mainstream cryptocurrency (e.g., Bitcoin)
    • Stage 2: Mix the mainstream coin with anonymity coins
    • Stage 3: Layering (converting back and forth among multiple anonymity coins, exchanges, and public-key addresses)
    • Stage 4: Cash out

42 of 83

Unsupervised Analyzer

  • Network Representation Learning (NRL)
    • DeepWalk is one NRL algorithm: it turns each node’s random-walk neighborhood into an N-dimensional vector (see the sketch below)
    • Walk length (L) × window size (K)
      • E.g., L = 10, K = 4 below
      • Complexity per sample = L/k(L-k)
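For intuition, here is a minimal sketch of DeepWalk’s two steps: truncated random walks, then skip-gram training on the walks. It assumes the networkx and gensim libraries and a toy graph standing in for the Bitcoin transaction graph; L = 10 and K = 4 mirror the slide, and all other settings are illustrative.

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_len=10, seed=0):
    """Generate truncated random walks (DeepWalk's training corpus)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

# Toy graph standing in for the Bitcoin transaction graph.
G = nx.karate_club_graph()
walks = random_walks(G, walk_len=10)                 # L = 10
model = Word2Vec(walks, vector_size=128, window=4,   # K = 4
                 min_count=0, sg=1, epochs=5)
print(model.wv[str(0)][:5])  # first 5 dims of node 0's embedding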


43 of 83

Unsupervised Analyzer

  • Social network analysis (SNA)
    • Girvan–Newman algorithm: edge betweenness.
      • Time complexity = O(E² · N)
    • Node2vec algorithm
      • Time complexity = O(|V| · d)


44 of 83

Unsupervised Analyzer

  • Transactions on 2015/12/25 and 2016/2/16:
    • 553,103 addresses, 1,350,917 tx., 51 mixers
  • Optimized Node2Vec parameters: p = 1.0, q = 2.0
    • 128 dimensions
  • Average cosine similarity (across the 128 dimensions, over the C(52, 2) pairs) = 0.86185, as sketched below
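A sketch of that similarity computation, assuming emb is the matrix of 128-dimensional Node2Vec embeddings for the 52 addresses being compared (the variable names and the random stand-in data are illustrative):

import numpy as np
from itertools import combinations

def avg_pairwise_cosine(emb: np.ndarray) -> float:
    """Mean cosine similarity over all C(n, 2) row pairs."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = [unit[i] @ unit[j] for i, j in combinations(range(len(unit)), 2)]
    return float(np.mean(sims))

emb = np.random.rand(52, 128)   # stand-in for the learned embeddings
print(avg_pairwise_cosine(emb))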


45 of 83

Unsupervised Analyzer

  • K-means into 1000 clusters
    • Input data: 1,350,917 tx.
      • I.e., 1,350,917 transactions in these 2 days
    • Comparison: MJIB (Ministry of Justice Investigation Bureau) has 58 money-laundering patterns (“態樣”)
    • We found 12 clusters that contain mixers.
  • How to decide on “1000”?
    • If too few clusters: e.g., all mixers land in 1 cluster
    • If too many clusters: can’t observe the mixers’ pattern
    • 1000: [slide shows the fractions 7/52 and 3/52]

46 of 83

Supervised Analyzer

  • 26,313 labelled Bitcoin addresses


47 of 83

Supervised Analyzer

  • Bitcoin feature extraction: adding 24 more features raised precision from 70% to 73%
    • Transaction volume, number of trading days
    • 4 moments: mean, variance, …
    • Rescale numbers whose span > 10000; otherwise training doesn’t converge (see the sketch below)
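A sketch of that feature extraction, assuming the four moments are the standard mean, variance, skewness, and kurtosis, and that a log transform is one way to tame values spanning more than four orders of magnitude (the exact transform isn’t specified on the slide):

import numpy as np
from scipy import stats

def address_features(amounts: np.ndarray) -> dict:
    """First four moments of one address's transaction amounts."""
    return {
        "mean": float(np.mean(amounts)),
        "variance": float(np.var(amounts)),
        "skewness": float(stats.skew(amounts)),
        "kurtosis": float(stats.kurtosis(amounts)),
    }

amounts = np.array([0.001, 0.5, 12.0, 30000.0])  # span > 1e4
# Compress the huge dynamic range before training; otherwise
# gradient-based models may fail to converge.
log_amounts = np.log1p(amounts)
print(address_features(log_amounts))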


48 of 83

Supervised Analyzer

  • Random forest classification (71%)
  • Support: 9,731 training samples for the Exchange class, …
  • Test set size = 26,313 − 24,476 = 1,837 (a sketch of this step follows)
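A minimal sketch of this step, assuming X is the per-address feature matrix and y the address labels; the random stand-in data and hyperparameters are illustrative, not those of the study:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-ins for the 26,313 labelled addresses and their 24 features.
X = np.random.rand(26313, 24)
y = np.random.randint(0, 5, size=26313)   # e.g., Exchange, Mixer, ...

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1837, random_state=0)  # 26313 - 24476 = 1837

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
# Random stand-in data gives chance-level accuracy; with the real
# features the study reports about 71%.
print("accuracy:", clf.score(X_test, y_test))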


49 of 83

Supervised Analyzer

  • XGBoost classification (75%)


50 of 83

Monitor


51 of 83

Conclusion

  • Today, most people cannot tell the difference between causality and correlation.

  • Causality Science: the future

52 of 83

Backup

  • Correlation Science
    • Clustering
    • Smart Explorer

53 of 83

K-means Clustering

From: http://stanford.edu/~cpiech/cs221/img/kmeansViz.png

54 of 83

K-means Clustering

Input: A set of data points (x, y), and the number of clusters, K.

Output: The centers of the K clusters.

Algorithm (a runnable sketch follows):

  • Initialize the cluster centers.
  • Do until convergence:
    • For each data point, find the closest cluster center, and assign the point to that cluster.
    • For each cluster, update the center as the mean of its members.
    • Convergence is reached when no cluster center changes.
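A runnable sketch of the algorithm above, with centers initialized by sampling K input points (one common choice; the slide leaves initialization open):

import numpy as np

def kmeans(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Lloyd's algorithm: returns the K cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            return new_centers
        centers = new_centers

pts = np.random.rand(500, 2)
print(kmeans(pts, k=3))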

55 of 83

K-means Clustering using MapReduce

"centers": Initialize the K cluster centers.

Mapper:

  • Load the current values of "centers".
  • map(data_point) -> (NearestCenterId(data_point), data_point)

Reducer:

  • reducer(center_id, [p1, p2, ...]) -> ( center_id, avg([p1, p2, ...]) )

Repeat the above Mapper/Reducer steps, until convergence.
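A single-machine sketch that emulates one round of those Mapper/Reducer steps (the shuffle is mimicked with an in-memory dict; a real job would run on a MapReduce framework):

import numpy as np
from collections import defaultdict

def nearest_center_id(point, centers):
    return int(np.argmin(np.linalg.norm(centers - point, axis=1)))

def kmeans_mapreduce_round(points, centers):
    """One Mapper/Reducer round: returns the updated centers."""
    # Mapper: point -> (NearestCenterId(point), point)
    shuffled = defaultdict(list)
    for p in points:
        shuffled[nearest_center_id(p, centers)].append(p)
    # Reducer: (center_id, [p1, p2, ...]) -> (center_id, avg([...]))
    new_centers = centers.copy()
    for cid, members in shuffled.items():
        new_centers[cid] = np.mean(members, axis=0)
    return new_centers

points = np.random.rand(100, 2)
centers = points[:3].copy()
for _ in range(20):                      # repeat until convergence
    updated = kmeans_mapreduce_round(points, centers)
    if np.allclose(updated, centers):
        break
    centers = updated
print(centers)

Each round corresponds to one MapReduce pass over the data; only the small "centers" array is shared between rounds.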

56 of 83

MapReduce Programming Model

Programmer specifies two functions:

  • Map: Procedure for processing each data record independently.
    • map (input_data) -> list(out_key, intermediate_value)
  • Reduce: Procedure for summarizing values with the same key.
    • reduce (out_key, list(intermediate_value)) -> list(out_value)

Main data structure:

  • (Key, Value) pairs.

57 of 83

Mappers and Reducers

Mapper/Reducer Template:

  • map (input_data) ->

list(out_key, intermediate_value)

  • reduce (out_key, list(intermediate_value)) ->

list(out_value)

Mapper

Reducer

58 of 83

Mappers and Reducers: Observations

  • Memoryless: Work independently and process one record at a time.
    • Easier to achieve parallelization and fault-tolerance.
  • Limited functionality, but surprisingly sufficient for most data-processing applications.

59 of 83

Example: Count Word Occurrences

Input: A set of words.

Output: Number of occurrences of each word.

Mapper/Reducer implementation:

  • map(word) -> (word, 1)
  • reduce(word, list(1, 1, ..., 1)) -> [(word, total_count)]

Mapper/Reducer template:

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)
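The same Mapper/Reducer pair written out as a runnable single-machine sketch (the shuffle stage is emulated by grouping intermediate values on their key):

from collections import defaultdict

def mapper(word):
    # map(word) -> (word, 1)
    return (word, 1)

def reducer(word, counts):
    # reduce(word, [1, 1, ..., 1]) -> (word, total_count)
    return (word, sum(counts))

words = ["the", "quick", "the", "fox", "the"]

# Emulated shuffle: group intermediate values by key.
groups = defaultdict(list)
for key, value in map(mapper, words):
    groups[key].append(value)

print([reducer(w, c) for w, c in groups.items()])
# [('the', 3), ('quick', 1), ('fox', 1)]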

60 of 83

Supporting Parallelized and Distributed Processing

Mapper/Reducer Template:

  • map (input_data) ->

list(out_key, intermediate_value)

  • reduce (out_key, list(intermediate_value)) ->

list(out_value)

Mapper

Reducer

61 of 83

Parallel, Distributed Execution

Mapper

Reducer

62 of 83

Parallel, Distributed Execution

Combiner

Shuffler

Mapper

Reducer

63 of 83

Key components of MapReduce

Mapper:

  • Process an input record and emit one or more (key, value) pairs.

Combiner (optional, on the same machine as a mapper):

  • Combine output data on the local machine, before emitting to the Shuffler.

Shuffler:

  • Send values of the same key to a particular Reducer machine.
  • Group (and sort) the values with the same key.

Reducer:

  • Process a (key, value_list) tuple and output one or more (key, value) pairs.

64 of 83

From Smart Explorer to Big Explorer

Big Data

Google Cloud

Smart Explorer: Machine learning-based data center optimization

See Google’s paper

Big Explorer: Machine learning-based Big Data 3S optimization

65 of 83

Tunables + Observables → Performance↑

[Diagram: Hardware prefetchers (tunables) + CPI / cache miss rate (observables) → Smart Explorer]

66 of 83

Parameter Space Search Problem

  • Observables (Performance events) <o1, o2, o3, …, om>
    • Memory
    • Pipeline stalls
    • Branch prediction
    • Resource utilization
    • FP operations
  • Tunables (Hardware prefetchers) <t1, t2, t3, …, tn>
    • Prefetch from memory to L2 cache
    • Prefetch a cache line to increase bus usage
    • Prefetch from L2 cache to L1 cache

E.g., with linear regression:

t1 = a0 + a1·o1 + a2·o2 + a3·o3 + … + am·om

Machine Learning!
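A sketch of that regression under stated assumptions: each row of O holds one application’s observables <o1, o2>, and t holds the best tunable setting found offline. All numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Observables <o1, o2> = [cache_miss, bus_util] per training app.
O = np.array([[0.30, 0.10],
              [0.25, 0.15],
              [0.05, 0.80],
              [0.02, 0.90]])
# Best tunable setting t1 found by offline exhaustive search
# (1 = prefetcher ON, 0 = OFF).
t = np.array([1, 1, 0, 0])

model = LinearRegression().fit(O, t)   # t1 = a0 + a1*o1 + a2*o2
print("coefficients:", model.intercept_, model.coef_)
print("predicted t1 for a new app:", model.predict([[0.28, 0.12]]))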

67 of 83

Hardware Prefetcher

int a[1000000], b[1000000], c[1000000];
int flag = 1;

for (int i = 0; i < 1000000; i++) {
    if (flag) a[i] = b[i] + c[i];   /* fixed-stride memory accesses */
}

Observed addresses: 10000, 10004, 10008, 10012, …

Stride prefetcher: detects the constant stride and prefetches 10028, 10032, 10036, …

68 of 83

How to Apply Machine Learning (ML)

  • Goal
    • Run the program only once to find the best config.
  • Strategy
    • myprofile gathers the relevant performance events as observables during a single execution
    • smarty (ML) performs correlation analysis & predicts the best config.

Applications

myprofile

smarty

Observables

Predicted Configuration

69 of 83

A Simplified Example

  • 5 applications (1, 2, 3, 4, 5) on a machine with 1 hardware prefetcher (Y for “ON” and N for “OFF”) and 2 hardware counters (CM for “cache miss” and BU for “bus utilization”)
    • Apps 1, 2, 3, 4 are used to train the ML model, for example
    • The ML model is then used to predict the best configuration for app 5

70 of 83

smarty: 3 Phases in Machine Learning

  • Labeling
    • Find label via exhaustive search for each app in the training set
    • Label = Y/N, denoting the best prefetching config.
      • E.g., (CM1, BU1, Y), (CM2, BU2, Y), (CM3, BU3, N), (CM4, BU4, N)
  • Training
    • Correlate the observables (CM, BU) with the tunables (HW prefetcher) and build a model for classification

  • Evaluation
    • Use the model to predict (CM5, BU5, ?)

high CM and low BU → should prefetch!
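A sketch of the three phases with a simple classifier. The (CM, BU, label) tuples mirror the slide’s notation; the numeric values and the choice of a decision tree are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Labeling: best config per training app, found by exhaustive search.
#            CM    BU    label (1 = prefetch ON)
train = [(0.40, 0.10, 1),   # app 1
         (0.35, 0.20, 1),   # app 2
         (0.05, 0.85, 0),   # app 3
         (0.08, 0.90, 0)]   # app 4
X = [[cm, bu] for cm, bu, _ in train]
y = [label for _, _, label in train]

# Training: correlate the observables (CM, BU) with the tunable.
model = DecisionTreeClassifier().fit(X, y)

# Evaluation: predict the best config for app 5 from (CM5, BU5).
cm5, bu5 = 0.45, 0.15        # high CM, low BU -> should prefetch
print("prefetcher ON?", bool(model.predict([[cm5, bu5]])[0]))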

71 of 83

Workflow of Model Construction

72 of 83

Workflow of Model Construction

Select representative programs as training dataset

1, 2, 3, 4

73 of 83

Workflow of Model Construction

Collect the observables with all possible combinations off-line

myprofile

Configuration Tuner

(CM1, BU1, Y), (CM1, BU1, N), (CM2, BU2, Y), (CM2, BU2, N),

(CM3, BU3, Y), (CM3, BU3, N), (CM4, BU4, Y), (CM4, BU4, N)

74 of 83

Workflow of Model Construction

Label the observables and feed them into the model for training

(CM1, BU1, Y), (CM1, BU1, N), (CM2, BU2, Y), (CM2, BU2, N),

(CM3, BU3, Y), (CM3, BU3, N), (CM4, BU4, Y), (CM4, BU4, N)

75 of 83

Workflow of Model Construction

Select machine learning algorithm and train the model

smarty

(CM1, BU1, Y)

(CM2, BU2, Y)

(CM3, BU3, N)

(CM4, BU4, N)

76 of 83

Workflow of Model Evaluation

77 of 83

Workflow of Model Evaluation

A new program for which we want good performance

5

78 of 83

Workflow of Model Evaluation

Collect the observables only once

myprofile

(CM5, BU5)

5

79 of 83

Workflow of Model Evaluation

Feed the collected observables into the model

smarty

(CM5, BU5, ?)

5

80 of 83

Workflow of Model Evaluation

Configuration Tuner

A guided configuration is generated for future runs

Y

5

(CM5, BU5)

81 of 83

In Short, Model Construction and Evaluation

    • Specify the tunable parameters <t1, t2, t3, …, tn>
      • Hardware prefetcher
    • Identify the observable events <o1, o2, o3, …, om>
      • Cache miss, bus utilization
    • Create training dataset for the machine learning algorithm
      • Labeling
    • Build the prediction model
    • Verify/evaluate the prediction model
    • Refine the model if necessary

82 of 83

Performance

Figure 3. Individual speedup for each program in the DCA benchmark.

83 of 83

From Smart Explorer to Big Explorer

Big Data

Google Cloud

Smart Explorer: Machine learning-based data center optimization

See Google’s paper

Big Explorer: Machine learning-based Big Data 3S optimization