1 of 22

PART 2

Computational & Statistical Thinking

Mathematical Foundation

KU PSYC 500

Jiin Jung

2 of 22

Course Objectives

Specifically, at the conclusion of the course, students will be able to:

  • Curate data from open sources and develop relevant questions
  • Understand data and organize information
  • Produce visualizations of data and communicate information
  • Identify and apply statistical methods and analyze data
  • Evaluate results to recommend policy or organizational changes
  • Demonstrate understanding of Python for use in data science and statistical modeling

3 of 22

Open source data

Datasets are given by organizations

Subsetting

Slicing

Vectorized operation

Reshaping

Combining

Data visualization

Descriptive statistics

Suggesting hypotheses

Assessing assumptions for inferential statistics

4 of 22

Simulation-based statistics

  1. Observe empirical data (EDA)
  2. Develop a research question/Choose a statistic of interest
    1. Mean, median, mean difference, correlation, etc.
  3. State H0 (Null Hypothesis)
  4. Simulate data assuming that H0 is true
    • Bootstrapping
    • Permutation
  5. Calculate a replicate from a simulated dataset
    • Define a function
  6. Repeat steps 4 and 5 many times (iteration)
  7. Decision
    • Compute p-value
    • A p-value is the fraction of replicates that are at least as extreme as the observed value of the statistic.
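The steps above can be sketched as a small permutation test. This is only an illustration under stated assumptions: the two groups of scores are made up, and the names `diff_of_means` and `permutation_sample` are hypothetical helpers, not functions from the course materials. H0 here is that both groups come from the same distribution, and "at least as extreme" is read one-sided (replicate >= observed).

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

# Step 1: empirical data (made-up scores, for illustration only)
group_a = np.array([5.1, 6.3, 5.8, 7.0, 6.1, 5.5])
group_b = np.array([4.2, 5.0, 4.8, 5.9, 4.4, 5.2])

# Step 2: statistic of interest (mean difference)
def diff_of_means(x, y):
    return np.mean(x) - np.mean(y)

# Steps 3-4: under H0 the group labels are interchangeable,
# so shuffle the pooled data and split it back into two groups
def permutation_sample(x, y):
    pooled = rng.permutation(np.concatenate((x, y)))
    return pooled[:len(x)], pooled[len(x):]

observed = diff_of_means(group_a, group_b)

# Steps 5-6: compute one replicate per simulated dataset, many times
n_reps = 10_000
replicates = np.empty(n_reps)
for i in range(n_reps):
    perm_a, perm_b = permutation_sample(group_a, group_b)
    replicates[i] = diff_of_means(perm_a, perm_b)

# Step 7: fraction of replicates at least as extreme as the observed value
p_value = np.mean(replicates >= observed)
print("observed difference:", observed)
print("p-value:", p_value)
```

A small p-value means that a difference this large rarely appears when the labels are shuffled, which counts as evidence against H0.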

5 of 22

Why Simulation?

“Resampling techniques are rapidly entering mainstream data analysis; some statisticians believe that resampling procedures will soon supplant common nonparametric procedures and may displace most parametric procedures as well.” (Berger, 2012)

  • Requiring fewer assumptions: when assumptions of traditional parametric tests are not met, as with small samples from non-normal distributions
  • Simpler and more flexible: Methods for testing means, medians, ratios, or other parameters are the same, so we do not need new methods for these different applications.
  • Easier to interpret: e.g. p-value
  • Practical: The general availability of cheap rapid computing and new software

6 of 22

Simulation

Generate a simulated dataset

  • Bootstrapping - Mean difference
  • Permutation - Comparing Distribution, Correlation

A Replicate

Calculate a replicate from the simulated dataset

  • Bootstrap replicates
  • Permutation replicates

Iteration

Process

(10k~)
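As a rough sketch of the simulate, replicate, iterate cycle for bootstrapping: resample the data with replacement, compute the statistic on each resample, and repeat about 10,000 times. The data values and the helper name `bootstrap_replicate` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up measurements, for illustration only
data = np.array([2.3, 3.1, 4.8, 5.0, 3.7, 4.2, 2.9, 3.5])

def bootstrap_replicate(values, func):
    """Draw one bootstrap sample (with replacement) and apply func to it."""
    sample = rng.choice(values, size=len(values), replace=True)
    return func(sample)

# Iteration: collect many bootstrap replicates of the mean
n_reps = 10_000
bs_means = np.empty(n_reps)
for i in range(n_reps):
    bs_means[i] = bootstrap_replicate(data, np.mean)

# The middle 95% of the replicates gives a percentile confidence interval
ci_low, ci_high = np.percentile(bs_means, [2.5, 97.5])
print("sample mean:", np.mean(data))
print("95% bootstrap interval:", ci_low, ci_high)
```

Passing the statistic in as `func` means the same helper works for a median or any other statistic, for example `bootstrap_replicate(data, np.median)`.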

7 of 22

Part 2

Computational & Statistical Thinking

Mathematical Foundation

Week 5: Data Ethics. NumPy

  • Data Ethics.
  • Iteration. For loop. Defining functions.

Week 6: Define Functions. Simulation

  • Iteration. For loop. Defining functions. Cumulative distribution function (CDF). Random number generation
  • 2nd Project Announcement

Week 7: Statistical Inference

  • Exploratory data analysis (EDA). Statistical Inference. Probability and Uncertainty. Bernoulli Trials. Binomial Distribution.

Week 8: 2nd Project Presentation

  • Lied Center of Art.

8 of 22

Questions?

9 of 22

Intro to NumPy

Lec 11

10 of 22

Objectives

  • Intro to NumPy
  • Iteration. For loop
  • Define Functions

11 of 22

NumPy

NumPy is a library for the Python programming language.

  • It contains a large collection of mathematical functions.
  • It operates on large, multi-dimensional arrays and matrices.

12 of 22

Essential libraries and projects that depend on NumPy’s API gain access to new array implementations that support NumPy’s array protocols (Fig. 3).

13 of 22

Import NumPy

import numpy as np

14 of 22

Array: the basic data structure of NumPy

np.array()
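A minimal sketch of creating an array from a Python list with `np.array()`. The variable name `scores` and its values are made up for illustration.

```python
import numpy as np

# Build an array from an ordinary Python list
scores = np.array([3, 1, 2])

print(scores)        # the array's contents
print(type(scores))  # the object is a NumPy ndarray
print(len(scores))   # number of elements
```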

15 of 22

ndarray

ndarray: an n-dimensional array
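To make "n-dimensional" concrete, the sketch below inspects the `ndim`, `shape`, and `size` attributes of a 1d and a 2d array. The array names are chosen for this example.

```python
import numpy as np

one_d = np.array([1, 2, 3])                 # 1 dimension
two_d = np.array([[1, 2], [3, 4], [5, 6]])  # 2 dimensions: 3 rows, 2 columns

print(one_d.ndim, one_d.shape)  # -> 1 (3,)
print(two_d.ndim, two_d.shape)  # -> 2 (3, 2)
print(two_d.size)               # total number of elements -> 6
```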

16 of 22

Sorting & adding elements

np.sort()

np.append()

np.concatenate()
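A short sketch of the three functions above, using made-up values. Each call returns a new array and leaves the original unchanged.

```python
import numpy as np

values = np.array([3, 1, 2])
more = np.array([9, 8])

sorted_values = np.sort(values)         # -> [1 2 3]
appended = np.append(values, [4, 5])    # -> [3 1 2 4 5]
joined = np.concatenate((values, more)) # -> [3 1 2 9 8]

print(sorted_values)
print(appended)
print(joined)
print(values)  # the original array is unchanged -> [3 1 2]
```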

17 of 22

Indexing & slicing elements - 1d array

data = np.array([1,2,3])
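A few indexing and slicing patterns on this 1d array, as a sketch. Positions start at 0, and a slice `start:stop` excludes the `stop` position.

```python
import numpy as np

data = np.array([1, 2, 3])

first = data[0]        # first element -> 1
last = data[-1]        # last element -> 3
first_two = data[0:2]  # positions 0 and 1 -> [1 2]
from_one = data[1:]    # position 1 to the end -> [2 3]
last_two = data[-2:]   # the last two elements -> [2 3]

print(first, last, first_two, from_one, last_two)
```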

18 of 22

Indexing & slicing elements - 2d array

data = np.array([[1,2],[3,4],[5,6]])
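For a 2d array, an index takes the form `data[row, column]`. The sketch below shows a few common selections on the array above.

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])

element = data[0, 1]          # row 0, column 1 -> 2
last_rows = data[1:3]         # rows 1 and 2 -> [[3 4] [5 6]]
first_col_top = data[0:2, 0]  # column 0 of rows 0 and 1 -> [1 3]
second_col = data[:, 1]       # column 1 of every row -> [2 4 6]

print(element)
print(last_rows)
print(first_col_top)
print(second_col)
```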

19 of 22

Selecting a subset

a = np.array([1,2,3,4,5,6,7,8,9,10])

b = a[3:8]  # select by index positions (positions 3 through 7)

c = a[a > 5]  # select with a boolean condition
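The sketch below shows what these selections return and how two conditions can be combined. The names `mask` and `d` are added for illustration. Conditions are joined with `&` (and) or `|` (or), and each condition needs its own parentheses.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

b = a[3:8]    # -> [4 5 6 7 8]
mask = a > 5  # an array of True/False values, one per element
c = a[mask]   # -> [ 6  7  8  9 10]

# Combine conditions: greater than 2 AND less than 8
d = a[(a > 2) & (a < 8)]  # -> [3 4 5 6 7]

print(b)
print(mask)
print(c)
print(d)
```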

20 of 22

Array operations

a = np.array([1,2,3])

b = np.array([4,5,6])

a + b

a - b

a * b

a / b
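These operations are vectorized: they work element by element, with no loop needed. The sketch below stores each result so it can be printed. The result names (`total`, `scaled`, and so on) are chosen for this example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

total = a + b       # -> [5 7 9]
difference = a - b  # -> [-3 -3 -3]
product = a * b     # -> [ 4 10 18]
ratio = a / b       # -> 0.25, 0.4, 0.5

# A single number is applied to every element
scaled = a * 2      # -> [2 4 6]

print(total, difference, product, ratio, scaled)
```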

21 of 22

Math formulas & summary statistics

np.square()

np.sum()

np.mean()

np.var()

np.std()

np.percentile()

np.corrcoef()
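A sketch of these functions on made-up data. Note that `np.var()` and `np.std()` divide by n by default (population formulas); passing `ddof=1` gives the sample versions that divide by n - 1.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values
y = 2 * x + 1                           # perfectly linear in x

mean_x = np.mean(x)                     # -> 5.0
ss = np.sum(np.square(x - mean_x))      # sum of squared deviations -> 32.0
var_pop = np.var(x)                     # ss / n -> 4.0
std_pop = np.std(x)                     # square root of var_pop -> 2.0
var_sample = np.var(x, ddof=1)          # ss / (n - 1)
median_x = np.percentile(x, 50)         # 50th percentile (median) -> 4.5
r = np.corrcoef(x, y)[0, 1]             # correlation of x and y, close to 1

print(mean_x, ss, var_pop, std_pop, var_sample, median_x, r)
```

`np.corrcoef()` returns a 2 x 2 correlation matrix, so `[0, 1]` picks out the correlation between `x` and `y`.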

22 of 22

Questions?