1 of 83

Correlation and Causality:

Case Studies

Prof. Shih-wei Liao

2 of 83

Reference:

3 of 83

Judea Pearl: The Book of Why

4 of 83

More References:

  • Judea Pearl: Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution

5 of 83

History of AI

  • 70s: Boom of rule-based Expert Systems
  • 80s: Boom of Neural Networks
  • 90s:
    • Return of Non-Monotonic Reasoning
    • NLP: Natural Language Processing & Speech Recognition
      • Bayesian Networks: Causality
    • Support Vector Machines: Correlation
  • After 2014: Boom of AI based on big data & big compute (2014: a local high of AI investment in the US)
    • Correlation
    • Deep Neural Networks
  • Next: Causality?

6 of 83

Correlation ≠ Causality

  • Case 1: X correlates with Y, but X doesn’t cause Y.
    • Z causes Y.
    • If we know Z, we can predict Y.

  • Case 2: X causes Z, and Z causes Y, so X causes Y.
    • If we know X, we can predict Y.

[Diagrams: Case 1 — Z → X and Z → Y (common cause); Case 2 — X → Z → Y (causal chain)]
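To see Case 1 concretely, here is a minimal simulation of a common cause (the coefficients and sample size are illustrative, not from the slides): Z drives both X and Y, so X and Y correlate strongly even though neither causes the other.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Common cause Z drives both X and Y (coefficients are illustrative).
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)   # Z causes X
Y = 3.0 * Z + rng.normal(size=n)   # Z causes Y; X does NOT cause Y

# X and Y correlate strongly...
print("corr(X, Y) =", np.corrcoef(X, Y)[0, 1])   # ~0.85

# ...but setting X by intervention (breaking the Z -> X link)
# leaves Y untouched: the correlation was not causation.
X_do = rng.normal(size=n)
print("corr(do(X), Y) =", np.corrcoef(X_do, Y)[0, 1])  # ~0.0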

7 of 83

Causality ⇒ Correlation

Correlation ⇏ Causality

  • For example: the correlation between ice cream sales and sunglasses sales
    • As ice cream sales increase, so do sunglasses sales.

8 of 83

3 Variables: Heat, Ice cream, Sunglasses

[Diagram: Heat → Ice cream sales; Heat → Sunglasses sales]

9 of 83

“99% Correlation” ≠ Causality

The explanatory power of regression parameters is often a subjective judgment call (自由心證).

[Diagram: Wine sales vs. #fund managers — 99% correlation, but the causal link is oftentimes just a “?”]

10 of 83

Correlation ≠ Causality

  • For example: [figure]

11 of 83

Why Steve Jobs Doesn’t Poll:

  • Correlation tells you what
    • Data tells you the “result”, possibly even insight
    • but not necessarily the cause and effect
  • Causality tells you why

Note: Decision science cannot just count on correlation.

  • Steve Jobs doesn’t decide based on data input alone.
    • A visionary (like Steve Jobs) has intuition.
    • But what about the rest of us?
      • System simulation or experimental design is required
      • But is it always possible to experiment?

12 of 83

Correlation ≠ Causality

  • For example: [figure]

13 of 83

Recap: Causality vs. Correlation:

  • Why vs. What
  • “Why” is hard:
    • Usually needs domain knowledge
    • Usually needs to design the experiments!

14 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

15 of 83

Domain Knowledge vs. Data-Driven

  • Some AI pundits seem to think that we won’t need domain experts in the future.
    • However, domain experts understand a lot of causal models from their experience.
  • For example: “Does the sunrise cause the rooster to wake up early, or is the rooster responsible for making the sun rise?”
    • Machine learning will find the relationship between roosters and the sunrise, but will never know which causes which from the observed data alone.

16 of 83

Challenges in Causality Science:

  • Domain knowledge
  • Designing the experiments!
    • Big data typically doesn’t come from big, expensive experiments.
    • Some experiments are intrusive
      • Medical
  • So, it often needs:
    • intuition (gut-feel decisions)
    • qualitative study or insight (behavioral, psychological)
    • exploratory research
    • UI/UX
  • As a result, big-data AI so far counts on correlation.

17 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

18 of 83

Machine Learning vs. Causality Science:

19 of 83

Mathematical Foundation of Causality:

  • Bayesian network
    • a probabilistic graphical model
    • represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG)
    • DAG whose nodes represent variables in the Bayesian sense.

20 of 83

Causality uses Bayesian Networks:

For example:

The joint probability function of a Bayesian network factorizes over the DAG:

P(x1, …, xn) = ∏i P(xi | parents(xi))

21 of 83

Causality: Use Bayesian Networks

G = Grass wet (true/false), S = Sprinkler turned on (true/false), and R = Raining (true/false)

The joint factorizes as P(G, S, R) = P(G | S, R) · P(S | R) · P(R).

What is the probability that it is raining, given the grass is wet?

P(R = T | G = T) = Σ_S P(G = T, S, R = T) / Σ_{S,R} P(G = T, S, R)
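As a sanity check, the query can be computed by brute-force enumeration. A minimal sketch, using the conditional probability tables of the classic sprinkler illustration (these numbers are the standard textbook values, not taken from these slides):

from itertools import product

# CPTs from the classic sprinkler example (illustrative values).
P_R = {True: 0.2, False: 0.8}                              # P(R)
P_S_given_R = {True: 0.01, False: 0.4}                     # P(S=T | R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # P(G=T | S, R)

def joint(g, s, r):
    """P(G=g, S=s, R=r) = P(G|S,R) * P(S|R) * P(R)."""
    pg = P_G_given_SR[(s, r)] if g else 1 - P_G_given_SR[(s, r)]
    ps = P_S_given_R[r] if s else 1 - P_S_given_R[r]
    return pg * ps * P_R[r]

# P(R=T | G=T) = sum_S P(G=T, S, R=T) / sum_{S,R} P(G=T, S, R)
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
print(f"P(R=T | G=T) = {num / den:.4f}")  # ≈ 0.3577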

22 of 83

Challenges in Causality Science:

  • The wet-grass example above has only 3 variables
  • Hard to scale to many variables!
    • We need to know the relationships among all of them
    • Causality is usually used for explanation.
      • It is hard to produce the accurate values we want to predict.

23 of 83

System Dynamics

24 of 83

The Three-Layer Causal Hierarchy

25 of 83

Structural Causal Models (SCM)

  • The SCM deploys three parts
    1. Graphical models,
    2. Structural equations, and
    3. Counterfactual and interventional logic

26 of 83

Extension: Multimodal Metadata Fusion Using Causal Strength

  • ACM Multimedia paper by Yi Wu, Ed Chang, Belle Tseng
    • Image classification:
    • Integrate time, location, and indoor/outdoor information
      • E.g., sunrise or sunset
    • “Because the time is 5pm, we know this is a sunset photo”: a good explanation
  • These patterns (causality rules) vs. Deep Learning
    • Note: Deep Learning offers low explainability, but it is not limited by the ability to obtain causality rules.

27 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

28 of 83

Machine Learning uses Correlation or Causality?

  • Take house-price prediction for example:
    • Input: Zoning, Area, YearBuilt, …
    • Output: Price
    • In fact, policy will also influence it.
    • Machine learning uses big data to find the correlation between these inputs and the price (a minimal sketch follows).
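For instance, a minimal correlational model of this kind might look as follows; the features and numbers are invented for illustration, and the model makes no claim about what causes the price:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented stand-ins for (Zoning, Area, YearBuilt) -> Price.
X = np.array([[1, 120, 1990],
              [2,  80, 2005],
              [1, 200, 1975],
              [3, 150, 2015]], dtype=float)
prices = np.array([300_000, 250_000, 400_000, 500_000], dtype=float)

model = LinearRegression().fit(X, prices)
# A purely correlational prediction: how inputs co-vary with price.
print(model.predict(np.array([[2, 100, 2000]], dtype=float)))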

29 of 83

Deep Learning uses Correlation or Causality?

  • Correlation: it finds correlations between the inputs and the outputs.

30 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

31 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

32 of 83

Challenges & Opportunities: Causality

  • Domain knowledge
  • Design the experiments

  • Challenges:
    • The gravitational attraction of the moon causes the oceans to bulge out in the direction of the moon.
    • Genetic cause of sexuality?

33 of 83

Challenge: Where Correlation is Effective

  • Machine Translation: where correlation is effective. The three input representations below perform equally well (a = b = c)
  • Geoffrey Hinton: Knowing the metadata (Chinese vs. English) doesn’t help deep learning results!

    • Unicode with metadata (Chinese vs. English)
    • Unicode without metadata
    • Character pictures (images of the characters)

34 of 83

Outline:

  • Correlation vs. Causality
  • Causality Science:
    • Domain Knowledge & Experiment Design
    • Mathematical Foundation of Causality
      • Bayesian Networks
      • System Dynamics
  • Correlation Science
    • Mathematical Foundation of Correlation
  • Causality & Correlation use different maths
  • Challenges & Opportunities: Causality
    • Where Correlation is effective
  • Case Studies:
    • Diaper Study
    • Medical
    • Bitcoin Tracing: Correlation + Causality

35 of 83

Case Study: Beer & Diaper

  • “Beer and Diapers” study by Walmart
    • Correlated
    • But why?
      • Young dads were asked by their wives to fetch diapers.
        • They grab beers in the same trip
      • Decision: Walmart should co-locate beer and diapers
  • Metadata: Friday evening
  • Correlation helps decision science somewhat
  • Correlation is less precise than causality
      • Is correlation usually effective, anyway?
  • Note: Causal parameters make the output more accurate.

36 of 83

Case Study: Medical

37 of 83

38 of 83

Multi-Armed Bandit

39 of 83

40 of 83

Case Study: Bitcoin Tracing

    • Time zone information:
      • We used to classify Chinese traders vs. US traders based on time-zone information
        • Metadata is dangerous: the rules may be wrong. Some traders stay up late to trade.
        • Unlike the traditional stock market
    • “Directionless money laundering” (無方向性洗錢)

41 of 83

Bitcoin Tracing by Causality

  • Because cryptocurrency is used for money laundering, certain specific transaction patterns appear:
    • Stage 1: Convert cash into a mainstream cryptocurrency (e.g., Bitcoin)
    • Stage 2: Mix the mainstream coin with anonymity coins
    • Stage 3: Layering (converting back and forth among multiple anonymity coins, exchanges, and public-key addresses)
    • Stage 4: Cash out

42 of 83

Unsupervised Analyzer

  • Network Representation Learning (NRL)
    • DeepWalk is one NRL algorithm: it turns each node’s random-walk neighborhood into an N-dimensional vector (see the sketch below)
    • Walk length (L) × window size (K)
      • E.g., L = 10, K = 4 below
      • Complexity per sample = L/k(L-k)
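For intuition, here is a minimal sketch of DeepWalk’s two steps: truncated random walks, then skip-gram training on the walks. It assumes the networkx and gensim libraries and a toy graph standing in for the Bitcoin transaction graph; L = 10 and K = 4 mirror the slide, and all other settings are illustrative.

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_len=10, seed=0):
    """Generate truncated random walks (DeepWalk's training corpus)."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

# Toy graph standing in for the Bitcoin transaction graph.
G = nx.karate_club_graph()
walks = random_walks(G, walk_len=10)                 # L = 10
model = Word2Vec(walks, vector_size=128, window=4,   # K = 4
                 min_count=0, sg=1, epochs=5)
print(model.wv[str(0)][:5])  # first 5 dims of node 0's embedding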


43 of 83

Unsupervised Analyzer

  • Social network analysis (SNA)
    • Girvan–Newman algorithm: edge betweenness.
      • Time complexity = O(E² · N)
    • Node2vec algorithm
      • Time complexity = O(|V| · d)


44 of 83

Unsupervised Analyzer

  • Transactions on 2015/12/25 and 2016/2/16:
    • 553,103 addresses, 1,350,917 tx., 51 mixers
  • Optimized Node2Vec parameters: p = 1.0, q = 2.0
    • 128 dimensions
  • Average cosine similarity (across the 128 dimensions, over the C(52, 2) pairs) = 0.86185, as sketched below
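A sketch of that similarity computation, assuming emb is the matrix of 128-dimensional Node2Vec embeddings for the 52 addresses being compared (the variable names and the random stand-in data are illustrative):

import numpy as np
from itertools import combinations

def avg_pairwise_cosine(emb: np.ndarray) -> float:
    """Mean cosine similarity over all C(n, 2) row pairs."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = [unit[i] @ unit[j] for i, j in combinations(range(len(unit)), 2)]
    return float(np.mean(sims))

emb = np.random.rand(52, 128)   # stand-in for the learned embeddings
print(avg_pairwise_cosine(emb))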


45 of 83

Unsupervised Analyzer

  • K-means into 1000 clusters
    • Input data: 1,350,917 tx.
      • I.e., 1,350,917 transactions in these 2 days
    • Comparison: MJIB (Ministry of Justice Investigation Bureau) has 58 money-laundering patterns (“態樣”)
    • We found 12 clusters that contain mixers.
  • How to decide on “1000”?
    • If too few clusters: e.g., all mixers land in 1 cluster
    • If too many clusters: can’t observe the mixers’ pattern
    • 1000: [slide shows the fractions 7/52 and 3/52]

46 of 83

Supervised Analyzer

  • 26,313 labelled Bitcoin addresses


47 of 83

Supervised Analyzer

  • Bitcoin feature extraction: adding 24 more features raised precision from 70% to 73%
    • Transaction volume, number of trading days
    • 4 moments: mean, variance, …
    • Rescale numbers whose span > 10000; otherwise training doesn’t converge (see the sketch below)
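A sketch of that feature extraction, assuming the four moments are the standard mean, variance, skewness, and kurtosis, and that a log transform is one way to tame values spanning more than four orders of magnitude (the exact transform isn’t specified on the slide):

import numpy as np
from scipy import stats

def address_features(amounts: np.ndarray) -> dict:
    """First four moments of one address's transaction amounts."""
    return {
        "mean": float(np.mean(amounts)),
        "variance": float(np.var(amounts)),
        "skewness": float(stats.skew(amounts)),
        "kurtosis": float(stats.kurtosis(amounts)),
    }

amounts = np.array([0.001, 0.5, 12.0, 30000.0])  # span > 1e4
# Compress the huge dynamic range before training; otherwise
# gradient-based models may fail to converge.
log_amounts = np.log1p(amounts)
print(address_features(log_amounts))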


48 of 83

Supervised Analyzer

  • Random forest classification (71%)
  • Support: 9,731 training samples for the Exchange class, …
  • Test set size = 26,313 − 24,476 = 1,837 (a sketch of this step follows)
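A minimal sketch of this step, assuming X is the per-address feature matrix and y the address labels; the random stand-in data and hyperparameters are illustrative, not those of the study:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-ins for the 26,313 labelled addresses and their 24 features.
X = np.random.rand(26313, 24)
y = np.random.randint(0, 5, size=26313)   # e.g., Exchange, Mixer, ...

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1837, random_state=0)  # 26313 - 24476 = 1837

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
# Random stand-in data gives chance-level accuracy; with the real
# features the study reports about 71%.
print("accuracy:", clf.score(X_test, y_test))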


49 of 83

Supervised Analyzer

  • XGBoost classification (75%)


50 of 83

Monitor


51 of 83

Conclusion

  • Today, most people cannot tell the difference between causality and correlation.

  • Causality Science: the future

52 of 83

Backup

  • Correlation Science
    • Clustering
    • Smart Explorer

53 of 83

K-means Clustering

From: http://stanford.edu/~cpiech/cs221/img/kmeansViz.png

54 of 83

K-means Clustering

Input: A set of data points (x, y), and the number of clusters, K.

Output: The centers of the K clusters.

Algorithm (a runnable sketch follows):

  • Initialize the cluster centers.
  • Do until convergence:
    • For each data point, find the closest cluster center, and assign the point to that cluster.
    • For each cluster, update the center as the mean of its members.
    • Convergence is reached when no cluster center changes.
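A runnable sketch of the algorithm above, with centers initialized by sampling K input points (one common choice; the slide leaves initialization open):

import numpy as np

def kmeans(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Lloyd's algorithm: returns the K cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            return new_centers
        centers = new_centers

pts = np.random.rand(500, 2)
print(kmeans(pts, k=3))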

55 of 83

K-means Clustering using MapReduce

"centers": Initialize the K cluster centers.

Mapper:

  • Load the current values of "centers".
  • map(data_point) -> (NearestCenterId(data_point), data_point)

Reducer:

  • reducer(center_id, [p1, p2, ...]) -> ( center_id, avg([p1, p2, ...]) )

Repeat the above Mapper/Reducer steps, until convergence.
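A single-machine sketch that emulates one round of those Mapper/Reducer steps (the shuffle is mimicked with an in-memory dict; a real job would run on a MapReduce framework):

import numpy as np
from collections import defaultdict

def nearest_center_id(point, centers):
    return int(np.argmin(np.linalg.norm(centers - point, axis=1)))

def kmeans_mapreduce_round(points, centers):
    """One Mapper/Reducer round: returns the updated centers."""
    # Mapper: point -> (NearestCenterId(point), point)
    shuffled = defaultdict(list)
    for p in points:
        shuffled[nearest_center_id(p, centers)].append(p)
    # Reducer: (center_id, [p1, p2, ...]) -> (center_id, avg([...]))
    new_centers = centers.copy()
    for cid, members in shuffled.items():
        new_centers[cid] = np.mean(members, axis=0)
    return new_centers

points = np.random.rand(100, 2)
centers = points[:3].copy()
for _ in range(20):                      # repeat until convergence
    updated = kmeans_mapreduce_round(points, centers)
    if np.allclose(updated, centers):
        break
    centers = updated
print(centers)

Each round corresponds to one MapReduce pass over the data; only the small "centers" array is shared between rounds.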

56 of 83

MapReduce Programming Model

Programmer specifies two functions:

  • Map: Procedure for processing each data record independently.
    • map (input_data) -> list(out_key, intermediate_value)
  • Reduce: Procedure for summarizing values with the same key.
    • reduce (out_key, list(intermediate_value)) -> list(out_value)

Main data structure:

  • (Key, Value) pairs.

57 of 83

Mappers and Reducers

Mapper/Reducer Template:

  • map (input_data) ->

list(out_key, intermediate_value)

  • reduce (out_key, list(intermediate_value)) ->

list(out_value)

Mapper

Reducer

58 of 83

Mappers and Reducers: Observations

  • Memoryless: Work independently and process one record at a time.
    • Easier to achieve parallelization and fault-tolerance.
  • Limited functionality, but surprisingly sufficient for most data-processing applications.

59 of 83

Example: Count Word Occurrences

Input: A set of words.

Output: Number of occurrences of each word.

Mapper/Reducer implementation:

  • map(word) -> (word, 1)
  • reduce(word, list(1, 1, ..., 1)) -> [(word, total_count)]

Mapper/Reducer template:

  • map (input_data) -> list(out_key, intermediate_value)
  • reduce (out_key, list(intermediate_value)) -> list(out_value)
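The same Mapper/Reducer pair written out as a runnable single-machine sketch (the shuffle stage is emulated by grouping intermediate values on their key):

from collections import defaultdict

def mapper(word):
    # map(word) -> (word, 1)
    return (word, 1)

def reducer(word, counts):
    # reduce(word, [1, 1, ..., 1]) -> (word, total_count)
    return (word, sum(counts))

words = ["the", "quick", "the", "fox", "the"]

# Emulated shuffle: group intermediate values by key.
groups = defaultdict(list)
for key, value in map(mapper, words):
    groups[key].append(value)

print([reducer(w, c) for w, c in groups.items()])
# [('the', 3), ('quick', 1), ('fox', 1)]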

60 of 83

Supporting Parallelized and Distributed Processing

Mapper/Reducer Template:

  • map (input_data) ->

list(out_key, intermediate_value)

  • reduce (out_key, list(intermediate_value)) ->

list(out_value)

Mapper

Reducer

61 of 83

Parallel, Distributed Execution

Mapper

Reducer

62 of 83

Parallel, Distributed Execution

Combiner

Shuffler

Mapper

Reducer

63 of 83

Key components of MapReduce

Mapper:

  • Process an input record and emit one or more (key, value) pairs.

Combiner (optional, on the same machine as a mapper):

  • Combine output data on the local machine, before emitting to the Shuffler.

Shuffler:

  • Send values of the same key to a particular Reducer machine.
  • Group (and sort) the values with the same key.

Reducer:

  • Process a (key, value_list) tuple and output one or more (key, value) pairs.

64 of 83

From Smart Explorer to Big Explorer

Big Data

Google Cloud

Smart Explorer: Machine learning-based data center optimization

See Google’s paper

Big Explorer: Machine learning-based Big Data 3S optimization

65 of 83

Tunables + Observables → Performance↑

[Diagram: Hardware prefetchers (tunables) + CPI / cache miss rate (observables) → Smart Explorer]

66 of 83

Parameter Space Search Problem

  • Observables (Performance events) <o1, o2, o3, …, om>
    • Memory
    • Pipeline stalls
    • Branch prediction
    • Resource utilization
    • FP operations
  • Tunables (Hardware prefetchers) <t1, t2, t3, …, tn>
    • Prefetch from memory to L2 cache
    • Prefetch a cache line to increase bus usage
    • Prefetch from L2 cache to L1 cache

E.g., with linear regression:

t1 = a0 + a1·o1 + a2·o2 + a3·o3 + … + am·om

Machine Learning!
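A sketch of that regression under stated assumptions: each row of O holds one application’s observables <o1, o2>, and t holds the best tunable setting found offline. All numbers are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Observables <o1, o2> = [cache_miss, bus_util] per training app.
O = np.array([[0.30, 0.10],
              [0.25, 0.15],
              [0.05, 0.80],
              [0.02, 0.90]])
# Best tunable setting t1 found by offline exhaustive search
# (1 = prefetcher ON, 0 = OFF).
t = np.array([1, 1, 0, 0])

model = LinearRegression().fit(O, t)   # t1 = a0 + a1*o1 + a2*o2
print("coefficients:", model.intercept_, model.coef_)
print("predicted t1 for a new app:", model.predict([[0.28, 0.12]]))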

67 of 83

Hardware Prefetcher

int a[1000000], b[1000000], c[1000000];
int flag = 1;

for (int i = 0; i < 1000000; i++) {
    if (flag) a[i] = b[i] + c[i];   /* fixed-stride memory accesses */
}

Observed addresses: 10000, 10004, 10008, 10012, …

Stride prefetcher: detects the constant stride and prefetches 10028, 10032, 10036, …

68 of 83

How to Apply Machine Learning (ML)

  • Goal
    • Run the program only once to find the best config.
  • Strategy
    • myprofile gathers the relevant performance events as observables during a single execution
    • smarty (ML) performs correlation analysis & predicts the best config.

Applications

myprofile

smarty

Observables

Predicted Configuration

69 of 83

A Simplified Example

  • 5 applications (1, 2, 3, 4, 5) on a machine with 1 hardware prefetcher (Y for “ON” and N for “OFF”) and 2 hardware counters (CM for “cache miss” and BU for “bus utilization”)
    • Apps 1, 2, 3, 4 are used to train the ML model, for example
    • The ML model is then used to predict the best configuration for app 5

70 of 83

smarty: 3 Phases in Machine Learning

  • Labeling
    • Find label via exhaustive search for each app in the training set
    • Label = Y/N, denoting the best prefetching config.
      • E.g., (CM1, BU1, Y), (CM2, BU2, Y), (CM3, BU3, N), (CM4, BU4, N)
  • Training
    • Correlate the observables (CM, BU) with the tunables (HW prefetcher) and build a model for classification

  • Evaluation
    • Use the model to predict (CM5, BU5, ?)

high CM and low BU → should prefetch!
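A sketch of the three phases with a simple classifier. The (CM, BU, label) tuples mirror the slide’s notation; the numeric values and the choice of a decision tree are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Labeling: best config per training app, found by exhaustive search.
#            CM    BU    label (1 = prefetch ON)
train = [(0.40, 0.10, 1),   # app 1
         (0.35, 0.20, 1),   # app 2
         (0.05, 0.85, 0),   # app 3
         (0.08, 0.90, 0)]   # app 4
X = [[cm, bu] for cm, bu, _ in train]
y = [label for _, _, label in train]

# Training: correlate the observables (CM, BU) with the tunable.
model = DecisionTreeClassifier().fit(X, y)

# Evaluation: predict the best config for app 5 from (CM5, BU5).
cm5, bu5 = 0.45, 0.15        # high CM, low BU -> should prefetch
print("prefetcher ON?", bool(model.predict([[cm5, bu5]])[0]))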

71 of 83

Workflow of Model Construction

72 of 83

Workflow of Model Construction

Select representative programs as training dataset

1, 2, 3, 4

73 of 83

Workflow of Model Construction

Collect the observables with all possible combinations off-line

myprofile

Configuration Tuner

(CM1, BU1, Y), (CM1, BU1, N), (CM2, BU2, Y), (CM2, BU2, N),

(CM3, BU3, Y), (CM3, BU3, N), (CM4, BU4, Y), (CM4, BU4, N)

74 of 83

Workflow of Model Construction

Label the observables and feed them into the model for training

(CM1, BU1, Y), (CM1, BU1, N), (CM2, BU2, Y), (CM2, BU2, N),

(CM3, BU3, Y), (CM3, BU3, N), (CM4, BU4, Y), (CM4, BU4, N)

75 of 83

Workflow of Model Construction

Select machine learning algorithm and train the model

smarty

(CM1, BU1, Y)

(CM2, BU2, Y)

(CM3, BU3, N)

(CM4, BU4, N)

76 of 83

Workflow of Model Evaluation

77 of 83

Workflow of Model Evaluation

A new program for which we want good performance

5

78 of 83

Workflow of Model Evaluation

Collect the observables only once

myprofile

(CM5, BU5)

5

79 of 83

Workflow of Model Evaluation

Feed the collected observables into the model

smarty

(CM5, BU5, ?)

5

80 of 83

Workflow of Model Evaluation

Configuration Tuner

A guided configuration is generated for future runs

Y

5

(CM5, BU5)

81 of 83

In Short, Model Construction and Evaluation

    • Specify the tunable parameters <t1, t2, t3, …, tn>
      • Hardware prefetcher
    • Identify the observable events <o1, o2, o3, …, om>
      • Cache miss, bus utilization
    • Create training dataset for the machine learning algorithm
      • Labeling
    • Build the prediction model
    • Verify/evaluate the prediction model
    • Refine the model if necessary

82 of 83

Performance

Figure 3. Individual speedup for each program in the DCA benchmark.

83 of 83

From Smart Explorer to Big Explorer

Big Data

Google Cloud

Smart Explorer: Machine learning-based data center optimization

See Google’s paper

Big Explorer: Machine learning-based Big Data 3S optimization