1 of 40

COSC3000:

Visualisation

Week 2 Lecture 1: Univariate data

Tuesday - 27/02/2024

Ben Roberts [b.roberts@uq.edu.au]

2 of 40

Univariate data

  • Single quantitative variable
    • E.g., temperature, or wind speed

  • Multivariate: matched data
    • Not simply two sets of univariate data
    • E.g., temperature and wind speed (at same time/location)

  • Labels are not quantitative variables
    • Station 1 is not 2x station 2! (just labels)
    • Categorical vs quantitative variables


3 of 40

Histogram

  • Often the first “go-to”
  • Shows frequency of occurrences vs values
    • Or: normalised to approximate the “probability density” (area = 1)
  • Gives a good feel for the data
    • But can hide structure in the data
  • Discrete values: simple
  • Continuous variables: group into “chunks” (bins)
  • Choosing the bin width is important
    • Extreme cases (too few or too many bins) are useless
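A minimal matplotlib sketch of the two variants mentioned above (raw counts vs a normalised density); the sample data and bin count are illustrative assumptions, not course data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)    # illustrative sample

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(data, bins=15)                  # raw frequency of occurrences per bin
ax2.hist(data, bins=15, density=True)    # normalised: area under the bars = 1
plt.show()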


4 of 40

Histogram bins

  • Choosing bins is important, and the best choice is not always obvious
  • Particularly with sparse/limited data
  • Some rules of thumb (but they often need adjusting):


[Sturges’ rule: k = ceil(log2(n)) + 1]

[Square-root choice: k = ceil(sqrt(n)); default in Excel]

[Rice’s rule: k = ceil(2 n^(1/3))]
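A rough sketch of computing these bin counts with NumPy, assuming the three standard rules named above; the sample size n is an illustrative assumption:

import numpy as np

n = 130                                    # illustrative sample size
k_sturges = int(np.ceil(np.log2(n))) + 1   # Sturges' rule
k_sqrt    = int(np.ceil(np.sqrt(n)))       # square-root choice
k_rice    = int(np.ceil(2 * n ** (1/3)))   # Rice's rule
print(k_sturges, k_sqrt, k_rice)           # 9 12 11 for n = 130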

5 of 40

Example: height data

  • New York Choral Society singers
  • Four groups:
    • 39 Sopranos [60-72 inches]
    • 35 Altos [60-72 inches]
    • 20 Tenors [64-76 inches]
    • 36 Basses [66-75 inches]
  • Voice part: categorical variable; height: single quantitative variable


Source: WS Cleveland “Visualizing data”

6 of 40


Thoughts?

7 of 40


Thoughts?

  • Difficult to compare
  • Axes are different!

8 of 40


Thoughts?

  • Fixed axes

  • Not much resolution

9 of 40


Thoughts?

  • Better bin distribution

  • Careful! The groups don’t have equal sample sizes!

  • Sometimes better to normalise, e.g.

plt.hist(data, density=True)

10 of 40

Probability distributions

  • For continuous variables, Prob(x) -> 0
  • Instead: Probability Density Function (PDF)
  • PDF(x) * dx
    • probability to observe the variable in a region dx around x
  • Area under the curve => probability
  • Histogram: approximates the PDF of the distribution
    • Once normalised!
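In symbols (a standard statement of the idea above, added for clarity rather than copied from the slide):

P(a \le x \le b) = \int_a^b \mathrm{PDF}(x)\,dx , \qquad \int_{-\infty}^{\infty} \mathrm{PDF}(x)\,dx = 1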


11 of 40

Cumulative Distribution Function (CDF)

  • CDF(x): probability to observe a value < x
  • Monotonically increases from 0 to 1
  • Quantiles (later) approximate the CDF
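Written out in terms of the PDF (standard definition, added here for reference):

\mathrm{CDF}(x) = P(X < x) = \int_{-\infty}^{x} \mathrm{PDF}(t)\,dt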


12 of 40

Gaussian: Normal distribution

  • Gaussian, normal, bell curve
  • Many real-world random variables (in particular, uncorrelated random errors) are roughly Gaussian
    • Central limit theorem
  • This isn’t exact though, so caution is needed
  • Completely described by its mean and width
  • Average of N random Gaussian variables:
    • μ ± σ / √N
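For reference, the Gaussian PDF with mean μ and standard deviation σ (standard formula, added here):

\mathrm{PDF}(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)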


13 of 40

KDE - Kernel density estimation

  • Method of estimating the PDF from samples: see population-sample.ipynb
  • A “smooth fit” to the histogram (very roughly)
  • Can be a nice way to plot histogram
    • But: can be misleading, particularly for small data sets
    • Tempting, but be very careful
    • Best practice is to also show actual histogram
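A minimal sketch of the best practice above, using scipy.stats.gaussian_kde as one common KDE option; the small sample and default bandwidth are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = rng.normal(size=50)            # small sample: KDE can be misleading here

kde = gaussian_kde(data)              # default (Scott's rule) bandwidth
xs = np.linspace(data.min() - 1, data.max() + 1, 200)

plt.hist(data, bins=10, density=True, alpha=0.5)   # always show the histogram too
plt.plot(xs, kde(xs))                              # KDE estimate of the PDF
plt.show()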


Two random samples from the same (Gaussian) distribution

14 of 40

Example:


Credit: Ciaran O’Hare

15 of 40

Example:


Credit: Ciaran O’Hare

Don’t get carried away!

16 of 40

Quantiles (CDF)


17 of 40


  • Computing quantiles manually (directly) is not hard
  • NumPy provides a method
  • E.g., in roughly 5% steps:

x = np.linspace(0.0, 1.0, 20)   # 20 evenly spaced quantile levels from 0 to 1 (~5% steps)
q = np.quantile(data, x)        # the corresponding quantiles of the data

18 of 40


19 of 40

Comparing two univariate data sets: Q-Q plots


  • Quantiles are useful for comparing two sets of data
  • Simply plot the quantiles of one set against the quantiles of the other
    • Use the same quantile levels for both
    • This example: 5% steps (20 points)
    • Show the x = y line too, and plot with a 1:1 aspect ratio

  • If the medians are the same, but the widths of the distributions are different, the Q-Q plot will cross the line y=x in the middle, but lie at an angle to it.
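A rough sketch of such a Q-Q plot; the two samples and the number of quantile levels are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
a = rng.normal(loc=68, scale=3, size=150)    # illustrative data set 1
b = rng.normal(loc=70, scale=3, size=90)     # illustrative data set 2

levels = np.linspace(0.0, 1.0, 20)           # same quantile levels for both sets
qa = np.quantile(a, levels)
qb = np.quantile(b, levels)

plt.plot(qa, qb, "o")                        # one set's quantiles vs the other's
lims = [min(qa.min(), qb.min()), max(qa.max(), qb.max())]
plt.plot(lims, lims, "k--")                  # x = y reference line
plt.gca().set_aspect("equal")                # 1:1 aspect ratio
plt.show()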

20 of 40


21 of 40

Box-whisker plot


  • Conveys a lot of information in an intuitive way
  • Some choice in how outliers are handled: see the documentation
  • Often: “notches” for the confidence interval (standard error) around the median

plt.boxplot(data, notch=True, showmeans=True)

22 of 40


plt.boxplot(data, notch=True, showmeans=True)

  • Can be great for quickly comparing different data sets
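A minimal sketch of comparing several groups at once; the group data and labels are illustrative assumptions (loosely modelled on the heights example earlier):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=2.5, size=n)        # illustrative groups
          for m, n in [(64, 39), (65, 35), (70, 20), (71, 36)]]

plt.boxplot(groups, notch=True, showmeans=True)       # one box per group
plt.xticks([1, 2, 3, 4], ["Soprano", "Alto", "Tenor", "Bass"])
plt.ylabel("Height (inches)")
plt.show()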

23 of 40

Example:

  • See heights.ipynb
  • Also: a MATLAB version is given


24 of 40

Descriptive Statistics


25 of 40

Distribution measures: descriptive statistics

  • Attempt to capture as much detail as possible with just a few numbers
    • E.g., if Gaussian: completely determined by mean + variance
  • We like to pretend all data uncertainties are Gaussian
    • Easy to model and interpret
  • Often a reasonable approximation, rarely exact


26 of 40

Descriptive statistics: central tendency

  • Mean: average
    • Typically the arithmetic mean
    • Also: geometric mean (useful for data spanning orders of magnitude)
  • Median: midpoint of the data (50th percentile)
    • Robust against outliers
    • Gaussian: mean = median. But for non-normal distributions they can be very different
  • Mode: most frequently occurring value
    • Completely robust against outliers
    • Usually requires (and in practice depends on) binning
  • Combining/comparing them can be informative


27 of 40

Descriptive statistics: variation

Measures of the width (spread) and shape of the distribution:

  • Variance
    • 2nd central moment
  • Standard deviation
    • Square root of the variance
  • Skewness
    • 3rd standardised moment
  • Kurtosis
    • 4th standardised moment
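For reference, the standard formulas (added here, not copied from the slide), with μ the mean, σ the standard deviation, and angle brackets denoting the average:

\sigma^2 = \left\langle (x-\mu)^2 \right\rangle, \qquad
\text{skewness} = \left\langle \left(\frac{x-\mu}{\sigma}\right)^{3} \right\rangle, \qquad
\text{kurtosis} = \left\langle \left(\frac{x-\mu}{\sigma}\right)^{4} \right\rangle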


28 of 40


  • For a perfect Gaussian:
  • “68 – 95 – 99.7 rule” (1σ – 2σ – 3σ)

29 of 40

“Sample” mean, standard deviation

  • We sample from a true distribution
    • Infer properties of the true distribution from a small sample
  • Corrected standard deviation
    • Divide by N-1 instead of N
    • Only N-1 “independent” deviations (x_i - x̄)
    • For large N, this makes little to no difference
    • Important for small N


sigma   = np.std(sample)            # uncorrected: divides by N
sigma_c = np.std(sample, ddof=1)    # corrected (“sample”) std: divides by N-1

30 of 40

Standard Error (in the Mean): SEM

  • We estimate the mean of the real distribution by sampling it
  • Our estimate will have some uncertainty
  • This uncertainty is not just the standard deviation
    • The true distribution has an exact mean, and a non-zero σ
  • Standard Error in the Mean: SEM = σ / √N
    • Usually use the “corrected” sample σ
  • Important to distinguish:
    • Standard deviation of the population (true σ)
    • Standard deviation of the sample
    • Standard deviation of the mean (SEM)
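A minimal sketch of computing these quantities with NumPy; the sample and variable names are illustrative assumptions:

import numpy as np

sample = np.random.default_rng(4).normal(loc=10.0, scale=2.0, size=25)   # illustrative sample

std_sample = np.std(sample, ddof=1)        # corrected sample standard deviation
sem = std_sample / np.sqrt(len(sample))    # standard error in the mean
print(np.mean(sample), "+/-", sem)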


31 of 40

Example:

  • Important to understand, so let’s look at example
  • See population-sample.ipynb


32 of 40

Accuracy vs. precision


Random error

  • Varies between individual values.
  • Affects the precision.
  • If the random variation is large, the set of data has low precision.
  • Averages down with more data

Systematic error

  • Affects all values the same.
  • Affects the accuracy of the data.
  • Does not average down with more data
  • Very difficult to control
  • Not accounted for in standard error

33 of 40

Uncertainty, error bars, significance

  • Require a measure of uncertainty to know whether results are meaningful
    • Useful to display it: error bars
  • Standard error: 1σ (68% confidence level)
    • Confidence levels are independent of the distribution type (95% = 2σ holds only for a Gaussian)
    • The standard differs by field; it never hurts to state it explicitly
  • “Statistical significance”
    • Not uniquely defined. Common choices include:
    • 95% Confidence Level (roughly 2σ) [P < 0.05]
      • Non-overlapping 83% C.L. error bars correspond roughly to a comparison at the P = 0.05 level


34 of 40

Example:


  • Example from me: [PhysRevA.107.052812]
  • Good and bad examples:
  • Lots of info (too much?)
  • 1σ error bars: not obvious which deviations are significant

35 of 40

Some basic plotting tips


36 of 40


Credit: Ciaran O’Hare

37 of 40

Some general tips:

  • Use vector graphics so plots scale cleanly
    • Rasterise if the file size gets too large
  • Use notebooks!
    • I usually discourage overuse of notebooks,
    • but they are useful for examples in lectures, and also very useful for producing plots
    • (Be careful with saved kernel data)
  • Font sizes
    • We often create plots separately from where they will be used
    • Look at them at the size, and in the context, they will be viewed!
  • Label plots neatly and carefully: don’t make viewers work
  • Use colour and shading to enhance readability first
  • Tons of matplotlib examples online: make use of them
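A small sketch of the vector-output and font-size points above; the file names, figure size, and font size are illustrative choices, not from the slides:

import matplotlib.pyplot as plt

plt.rcParams.update({"font.size": 14})      # set fonts for the size the plot will be viewed at

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot([0, 1, 2], [0, 1, 4], "o-")
ax.set_xlabel("x")                           # label neatly: don't make viewers work
ax.set_ylabel("y")

fig.savefig("example.pdf")                   # vector format (PDF/SVG) scales cleanly
fig.savefig("example.png", dpi=300)          # rasterise only if file size is an issue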


38 of 40


Credit: Ciaran O’Hare

39 of 40


Credit: Ciaran O’Hare

40 of 40

Matplotlib cheat sheets
