1 of 40

COSC3000:

Visualisation

Week 2 Lecture 1: Univariate data

Tuesday - 27/02/2024

Ben Roberts [b.roberts@uq.edu.au]

2 of 40

Univariate data

  • Single quantitative variable
    • E.g., temperature, or wind speed

  • Multivariate: matched data
    • Not simply two sets of univariate data
    • E.g., temperature and wind speed (at same time/location)

  • Labels are not quantitative variables
    • Station 1 is not 2x station 2! (just labels)
    • Categorical vs quantitative variables


3 of 40

Histogram

  • Often the first “go-to”
  • Shows frequency of occurrences vs values
    • Or: normalised to approximate the “probability density” (area = 1)
  • Gives a good feel for the data
    • But can hide structure in the data
  • Discrete values: simple
  • Continuous variables: group into “chunks” (bins)
  • Choosing the bin width is important
    • Extreme cases (too few or too many bins) are useless
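A minimal matplotlib sketch of the two variants mentioned above (raw counts vs a normalised density); the sample data and bin count are illustrative assumptions, not course data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)    # illustrative sample

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=15)                  # raw frequency of occurrences per bin
ax2.hist(data, bins=15, density=True)    # normalised: area under the bars = 1
plt.show()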


4 of 40

Histogram bins

  • Choosing bins is important, and the best choice is not always obvious
  • Particularly with sparse/limited data
  • Some rules of thumb (but they often need adjusting):


[Sturges’ rule: k = ceil(log2(n)) + 1]

[Square-root choice: k = ceil(sqrt(n)); default in Excel]

[Rice’s rule: k = ceil(2 n^(1/3))]
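A rough sketch of computing these bin counts with NumPy, assuming the three standard rules named above; the sample size n is an illustrative assumption:

import numpy as np

n = 130                                    # illustrative sample size
k_sturges = int(np.ceil(np.log2(n))) + 1   # Sturges' rule
k_sqrt    = int(np.ceil(np.sqrt(n)))       # square-root choice
k_rice    = int(np.ceil(2 * n ** (1/3)))   # Rice's rule
print(k_sturges, k_sqrt, k_rice)           # 9 12 11 for n = 130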

5 of 40

Example: height data

  • New York Choral Society singers
  • Four groups:
    • 39 Sopranos [60-72 inches]
    • 35 Altos [60-72 inches]
    • 20 Tenors [64-76 inches]
    • 36 Basses [66-75 inches]
  • Voice part: categorical variable; height: single quantitative variable


Source: WS Cleveland “Visualizing data”

6 of 40


Thoughts?

7 of 40


Thoughts?

  • Difficult to compare
  • Axes are different!

8 of 40


Thoughts?

  • Fixed axes

  • Not much resolution

9 of 40


Thoughts?

  • Better bin distribution

  • Careful! The groups don’t have equal sample sizes!

  • Sometimes better to normalise, e.g.

plt.hist(data, density=True)

10 of 40

Probability distributions

  • For continuous variables, Prob(x) -> 0
  • Instead: Probability Density Function (PDF)
  • PDF(x) * dx
    • probability to observe the variable in a region dx around x
  • Area under the curve => probability
  • Histogram: approximates the PDF of the distribution
    • Once normalised!
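In symbols (a standard statement of the idea above, added for clarity rather than copied from the slide):

P(a \le x \le b) = \int_a^b \mathrm{PDF}(x)\,dx , \qquad \int_{-\infty}^{\infty} \mathrm{PDF}(x)\,dx = 1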


11 of 40

Cumulative Distribution Function (CDF)

  • CDF(x): probability to observe a value < x
  • Monotonically increases from 0 to 1
  • Quantiles (later) approximate the CDF
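Written out in terms of the PDF (standard definition, added here for reference):

\mathrm{CDF}(x) = P(X < x) = \int_{-\infty}^{x} \mathrm{PDF}(t)\,dt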


12 of 40

Gaussian: Normal distribution

  • Gaussian, normal, bell curve
  • Many real-world random variables (in particular, uncorrelated random errors) are roughly Gaussian
    • Central limit theorem
  • This isn’t exact though, so caution is needed
  • Completely described by its mean and width
  • Average of N random Gaussian variables:
    • μ ± σ / √N
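For reference, the Gaussian PDF with mean μ and standard deviation σ (standard formula, added here):

\mathrm{PDF}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)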


13 of 40

KDE - Kernel density estimation

  • Method of estimating the PDF from samples: see population-sample.ipynb
  • A “smooth fit” to the histogram (very roughly)
  • Can be a nice way to plot histogram
    • But: can be misleading, particularly for small data sets
    • Tempting, but be very careful
    • Best practice is to also show actual histogram
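A minimal sketch of the best practice above, using scipy.stats.gaussian_kde as one common KDE option; the small sample and default bandwidth are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(size=50)            # small sample: KDE can be misleading here

kde = gaussian_kde(data)              # default (Scott's rule) bandwidth
xs = np.linspace(data.min() - 1, data.max() + 1, 200)

plt.hist(data, bins=10, density=True, alpha=0.5)   # always show the histogram too
plt.plot(xs, kde(xs))                              # KDE estimate of the PDF
plt.show()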


Two random samples from the same (Gaussian) distribution

14 of 40

Example:


Credit: Ciaran O’Hare

15 of 40

Example:


Credit: Ciaran O’Hare

Don’t get carried away!

16 of 40

Quantiles (CDF)


17 of 40


  • Computing quantiles manually (directly) is not hard
  • NumPy provides a method
  • E.g., in roughly 5% steps:

x = np.linspace(0.0, 1.0, 20)   # 20 evenly spaced quantile levels from 0 to 1 (~5% steps)
q = np.quantile(data, x)        # the corresponding quantiles of the data

18 of 40


19 of 40

Comparing two univariate data sets: Q-Q plots


  • Quantiles are useful for comparing two sets of data
  • Simply plot the quantiles of one set against the quantiles of the other
    • Use the same quantile levels for both
    • This example: 5% steps (20 points)
    • Show the x = y line too, and plot with a 1:1 aspect ratio

  • If the medians are the same, but the widths of the distributions are different, the Q-Q plot will cross the line y=x in the middle, but lie at an angle to it.
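A rough sketch of such a Q-Q plot; the two samples and the number of quantile levels are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
a = rng.normal(loc=68, scale=3, size=150)    # illustrative data set 1
b = rng.normal(loc=70, scale=3, size=90)     # illustrative data set 2

levels = np.linspace(0.0, 1.0, 20)           # same quantile levels for both sets
qa = np.quantile(a, levels)
qb = np.quantile(b, levels)

plt.plot(qa, qb, "o")                        # one set's quantiles vs the other's
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
plt.plot(lims, lims, "k--")                  # x = y reference line
plt.gca().set_aspect("equal")                # 1:1 aspect ratio
plt.show()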

20 of 40


21 of 40

Box-whisker plot


  • Conveys a lot of information in an intuitive way
  • Some choice in how outliers are handled: see the documentation
  • Often: “notches” for the confidence interval (standard error) around the median

plt.boxplot(data, notch=True, showmeans=True)

22 of 40


plt.boxplot(data, notch=True, showmeans=True)

  • Can be great for quickly comparing different data sets
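A minimal sketch of comparing several groups at once; the group data and labels are illustrative assumptions (loosely modelled on the heights example earlier):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=2.5, size=n)        # illustrative groups
          for m, n in [(64, 39), (65, 35), (70, 20), (71, 36)]]

plt.boxplot(groups, notch=True, showmeans=True)       # one box per group
plt.xticks([1, 2, 3, 4], ["Soprano", "Alto", "Tenor", "Bass"])
plt.ylabel("Height (inches)")
plt.show()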

23 of 40

Example:

  • See heights.ipynb
  • Also: a MATLAB version is given


24 of 40

Descriptive Statistics


25 of 40

Distribution measures: descriptive statistics

  • Attempt to capture as much detail as possible with just a few numbers
    • E.g., if Gaussian: completely determined by mean + variance
  • We like to pretend all data uncertainties are Gaussian
    • Easy to model and interpret
  • Often a reasonable approximation, rarely exact


26 of 40

Descriptive statistics: central tendency

  • Mean: average
    • Typically the arithmetic mean
    • Also: geometric mean (useful for data spanning orders of magnitude)
  • Median: midpoint of the data (50th percentile)
    • Robust against outliers
    • Gaussian: mean = median. But for non-normal distributions they can be very different
  • Mode: most frequently occurring value
    • Completely robust against outliers
    • Usually requires (and in practice depends on) binning
  • Combining/comparing them can be informative


27 of 40

Descriptive statistics: variation

Measures of the width (spread) and shape of the distribution:

  • Variance
    • 2nd central moment
  • Standard deviation
    • Square root of the variance
  • Skewness
    • 3rd standardised moment
  • Kurtosis
    • 4th standardised moment
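For reference, the standard formulas (added here, not copied from the slide), with μ the mean, σ the standard deviation, and angle brackets denoting the average:

\sigma^2 = \left\langle (x-\mu)^2 \right\rangle, \qquad
\text{skewness} = \left\langle \left(\frac{x-\mu}{\sigma}\right)^{3} \right\rangle, \qquad
\text{kurtosis} = \left\langle \left(\frac{x-\mu}{\sigma}\right)^{4} \right\rangle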


28 of 40


  • For a perfect Gaussian:
  • “68 – 95 – 99.7 rule” (1σ – 2σ – 3σ)

29 of 40

“Sample” mean, standard deviation

  • We sample from a true distribution
    • Infer properties of the true distribution from a small sample
  • Corrected standard deviation
    • Divide by N-1 instead of N
    • Only N-1 “independent” deviations (x_i - x̄)
    • For large N, this makes little to no difference
    • Important for small N


sigma   = np.std(sample)            # uncorrected: divides by N
sigma_c = np.std(sample, ddof=1)    # corrected (“sample”) std: divides by N-1

30 of 40

Standard Error (in the Mean): SEM

  • We estimate the mean of the real distribution by sampling it
  • Our estimate will have some uncertainty
  • This uncertainty is not just the standard deviation
    • The true distribution has an exact mean, and a non-zero σ
  • Standard Error in the Mean: SEM = σ / √N
    • Usually use the “corrected” sample σ
  • Important to distinguish:
    • Standard deviation of the population (true σ)
    • Standard deviation of the sample
    • Standard deviation of the mean (SEM)
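A minimal sketch of computing these quantities with NumPy; the sample and variable names are illustrative assumptions:

import numpy as np

sample = np.random.default_rng(4).normal(loc=10.0, scale=2.0, size=25)   # illustrative sample

std_sample = np.std(sample, ddof=1)        # corrected sample standard deviation
sem = std_sample / np.sqrt(len(sample))    # standard error in the mean
print(np.mean(sample), "+/-", sem)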


31 of 40

Example:

  • Important to understand, so let’s look at example
  • See population-sample.ipynb


32 of 40

Accuracy vs. precision


Random error

  • Varies between individual values.
  • Affects the precision.
  • If the random variation is large, the set of data has low precision.
  • Averages down with more data

Systematic error

  • Affects all values the same.
  • Affects the accuracy of the data.
  • Does not average down with more data
  • Very difficult to control
  • Not accounted for in standard error

33 of 40

Uncertainty, error bars, significance

  • Require a measure of uncertainty to know whether results are meaningful
    • Useful to display it: error bars
  • Standard error: 1σ (68% confidence level)
    • Confidence levels are independent of the distribution type (95% = 2σ holds only for a Gaussian)
    • The standard differs by field; it never hurts to state it explicitly
  • “Statistical significance”
    • Not uniquely defined. Common choices include:
    • 95% Confidence Level (roughly 2σ) [P < 0.05]
      • Non-overlapping 83% C.L. error bars correspond roughly to a comparison at the P = 0.05 level


34 of 40

Example:


  • Example from me: [PhysRevA.107.052812]
  • Good and bad examples:
  • Lots of info (too much?)
  • 1σ error bars: not obvious which deviations are significant

35 of 40

Some basic plotting tips


36 of 40


Credit: Ciaran O’Hare

37 of 40

Some general tips:

  • Use vector graphics so plots scale cleanly
    • Rasterise if the file size gets too large
  • Use notebooks!
    • I usually discourage overuse of notebooks,
    • but they are useful for examples in lectures, and also very useful for producing plots
    • (Be careful with saved kernel data)
  • Font sizes
    • We often create plots separately from where they will be used
    • Look at them at the size, and in the context, they will be viewed!
  • Label plots neatly and carefully: don’t make viewers work
  • Use colour and shading to enhance readability first
  • Tons of matplotlib examples online: make use of them
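A small sketch of the vector-output and font-size points above; the file names, figure size, and font size are illustrative choices, not from the slides:

import matplotlib.pyplot as plt

plt.rcParams.update({"font.size": 14})      # set fonts for the size the plot will be viewed at

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4], "o-")
ax.set_xlabel("x")                           # label neatly: don't make viewers work
ax.set_ylabel("y")

fig.savefig("example.pdf")                   # vector format (PDF/SVG) scales cleanly
fig.savefig("example.png", dpi=300)          # rasterise only if file size is an issue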


38 of 40


Credit: Ciaran O’Hare

39 of 40


Credit: Ciaran O’Hare

40 of 40

Matplotlib cheat sheets
