Introduction and Data Gathering (Chapter 1)
At the end of this lecture, the student should:
What I hear, I forget.
What I see, I remember.
What I do, I understand
Chinese Proverb
A Motivating Example: The HIP Trial
Breast cancer: common malignancy among women in rich countries.
Mammography (screening): known today to lead to fewer deaths.
HIP Trial (1960s). First study to conclusively show merits of screening.
If we compare “screened” (1.1) vs. “refused” (1.5), there’s hardly a difference…?
(More later!)
What Is Statistics?
We will concentrate on (2), although the distinction will not always be clear.
Basic Definitions
Population vs. Sample
Using the sample average to make statements about the population average is an example of inferential statistics.
Descriptive statistical methods: describe the sample.
Inferential statistical methods: make statements about the population based on the sample.
[Figure: inference runs from the Sample to the Population.]
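The distinction between describing a sample and making statements about a population can be sketched with a simulated population. Everything numeric here (population size, mean, spread, sample size, seed) is a hypothetical choice for illustration, not data from the lecture.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated measurements (not real data).
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Draw a sample of 100 units without replacement.
sample = rng.sample(population, 100)

# Descriptive statistics: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential step: treat the sample mean as an estimate of the population
# mean, with a rough standard error to express its uncertainty.
standard_error = sample_sd / len(sample) ** 0.5

# Known here only because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean     = {sample_mean:.2f} (standard error about {standard_error:.2f})")
print(f"population mean = {population_mean:.2f}")
```

In a real study the last quantity is unknown, which is why the inference step is needed at all.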
First Principle of Statistical Inference
You make inference about the population from which the sample was obtained. (Seems obvious, but is often forgotten.)
In each of the examples below, identify the population being sampled and the inference being made:
Scientific Method
Roles of Statistics
Logical Arguments
In statistics we attempt to formalize and use these concepts in a quantitative way.
Scientific Progress
[Figure labels: Hypothesis (Model, Conjecture); Data (Measurements); Inductive Argument; Deductive Argument; New Hypothesis (New Model); New Data; Progress and Understanding.]
We gain knowledge by iterating between models and data.
Basic Study Steps
Study design and study implementation may require iteration.
Graphical Depiction of Scientific Study
[Figure labels: Problem; Objectives & Hypotheses; DESIGN; Sample; Experiment; How to measure?; Constraints; DATA; STATISTICAL ANALYSIS; Graphics & Visualization; Interpretation; Conclusions; Knowledge Base.]
Research Design Categories
Observational Study Design
Example: Study lung cancer rates among smokers and non-smokers.
Many of these questions are answered by subject matter experts, some can be answered by a statistical analysis.
Observational Study (Mensuration Experiment)
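[Figure: Population 1 and Population 2 each yield a sample (Sample 1, Sample 2). Labels: How Selected?; Characteristics; Observations; What is measured? Sample 1 data: observations 1, 2, 3, …, n, each tagged with sample number 1 and followed by its measured characteristics (x x x x x …). Sample 2 data: observations 1, 2, 3, …, m, each tagged with sample number 2 and followed by its measured characteristics.]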
How are individuals selected?
Each possible set of individuals has the same probability of selection (Simple Random Sampling).
In special situations, other sampling schemes allow for more efficient selection.
Simple Random Sampling: Example
A researcher wishes to determine the prevalence of a disease in a greenhouse of tomato seedlings. Each seedling tested for the disease is destroyed in the process, so only a minimal number should be tested. It is expected that only about 0.01% of the roughly 50,000 seedlings in the greenhouse have the disease.
How to select a simple random sample?
Table 13 in Ott and Longnecker.
Random number tables are constructed in such a way that, no matter where you start and no matter in which direction you move, the digits occur randomly with equal probability. These numbers can also be generated with statistical software packages.
Ex: Greenhouse seedlings
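As a software alternative to a random number table, the seedlings can be numbered 1 to 50,000 and a sample of labels drawn without replacement. This is a minimal sketch: the function name `simple_random_sample`, the sample size of 20, and the seed are hypothetical choices, not values from the example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted unit labels (1..population_size) drawn without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# 50,000 seedlings as in the greenhouse example; n = 20 is a hypothetical size.
chosen = simple_random_sample(50_000, 20, seed=13)
print(chosen)
```

`random.sample` draws without replacement, so no seedling label can appear twice.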
Simple Random Sample
Textbook definition: A simple random sample of n units is defined such that each possible sample of size n is equally likely to be drawn.
Practical definition: This sampling principle assures that each unit in the population has the same probability (likelihood) of being selected in the sample.
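Why equally likely samples imply equal selection probabilities for units can be shown in one line. The symbol N (population size) is introduced here for illustration; n is the sample size as above. There are $\binom{N}{n}$ possible samples, each with probability $1/\binom{N}{n}$, and a given unit belongs to $\binom{N-1}{n-1}$ of them:

```latex
\[
P(\text{unit } i \text{ is in the sample})
  = \frac{\binom{N-1}{n-1}}{\binom{N}{n}}
  = \frac{n}{N}
\]
```

For instance, with a hypothetical n = 100 drawn from N = 50,000 units, every unit has selection probability 100/50,000 = 0.002.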
Stratified Sampling
Pine forest: Estimate expected yield from plot.
[Figure: the plot is divided into three strata: 22 years, healthy; 16 years, healthy; 20 years, diseased.]
Individuals are selected at random within each stratum.
Variability in the diseased subpopulation is expected to be much greater than in the healthy areas. Mean yield is greater at 22 years than at 16 years.
Stratification allows us to take into account a factor we already know affects the response of interest, that is, to “remove a source of known variability”.
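Stratified selection amounts to running a separate simple random sample inside each stratum. In the sketch below the plot labels, the number of plots per stratum, and the allocation (more plots in the diseased stratum, where greater variability is expected) are all hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample independently within each stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(units, n_per_stratum[name]))
        for name, units in strata.items()
    }

# Hypothetical plot labels for the three pine-forest strata.
strata = {
    "22 years, healthy": list(range(1, 41)),
    "16 years, healthy": list(range(41, 71)),
    "20 years, diseased": list(range(71, 101)),
}

# Hypothetical allocation: sample more where variability is expected to be larger.
n_per_stratum = {
    "22 years, healthy": 4,
    "16 years, healthy": 4,
    "20 years, diseased": 8,
}

selected = stratified_sample(strata, n_per_stratum, seed=7)
for name, plots in selected.items():
    print(name, plots)
```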
Cluster Sampling
Estimate the average sponge size on natural reefs.
[Figure: seven reefs, each labeled with the number of sponges on the reef: 25, 9, 14, 5, 21, 7, 12.]
Selecting individual sponges at random would use resources very inefficiently.
It is cheaper to select reefs (sponge clusters) at random with probability proportional to size. All sponges on the selected reefs are then measured (a cheap way to increase the sample size).
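Selection with probability proportional to size can be sketched with the seven sponge counts shown for the reefs (25, 9, 14, 5, 21, 7, 12). The reef labels 1 to 7, the number of draws, and the seed are hypothetical, and the sketch draws with replacement for simplicity, so a reef could be selected more than once; a real survey design may handle this differently.

```python
import random

# Sponge counts per reef from the figure; the reef labels 1-7 are hypothetical.
sponges_per_reef = {1: 25, 2: 9, 3: 14, 4: 5, 5: 21, 6: 7, 7: 12}

def pps_with_replacement(sizes, n_draws, seed=None):
    """Draw cluster labels with probability proportional to cluster size."""
    rng = random.Random(seed)
    labels = list(sizes)
    weights = [sizes[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_draws)

total = sum(sponges_per_reef.values())

# Per-draw selection probability of each reef: its share of all sponges.
selection_prob = {reef: size / total for reef, size in sponges_per_reef.items()}

draws = pps_with_replacement(sponges_per_reef, n_draws=3, seed=5)
print("total sponges:", total)
print("selected reefs:", draws)
```

Under this scheme the largest reef (25 sponges) is five times as likely to be drawn as the smallest (5 sponges) on any single draw.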
Multi-Stage Sampling
Typically large areas or large complex populations can be more effectively sampled in stages. At the first stage, natural or synthetic clusters are selected. At subsequent stages the selected clusters are subdivided into units and samples of these are selected.
Example: National crop yield survey.
Greenhouse Example
Stratification: Maybe we have observed that plants near the door seem less healthy than those farther into the greenhouse. Divide the room into plants near the door and plants “inside”, then take random samples from each stratum.
Cluster: Suppose plants are arranged on tables. We could select tables at random, then examine all plants on each selected table. Note that if one plant on a table is diseased, all plants on that table have an increased probability of also being diseased.
Multi-Stage: Again suppose plants are on tables. Select some tables at random. Next select a few plants from each selected table for testing. First stage unit is the table. Second stage unit is the plant. Third stage unit could be the leaf on the plant, etc.
Systematic: Imagine plants arranged on a large table. Randomly pick a row and column to start. Then, following a systematic route, pick, say, every 10th plant.
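The multi-stage and systematic schemes above can be sketched as follows. The layout (10 tables of 50 plants), the plant labels, and the sample sizes are hypothetical, and the systematic route is simplified to a single ordered list of plants with a random starting position rather than a random row and column.

```python
import random

def two_stage_sample(tables, n_tables, n_plants, seed=None):
    """Stage 1: pick tables at random. Stage 2: pick plants within each chosen table."""
    rng = random.Random(seed)
    chosen_tables = rng.sample(sorted(tables), n_tables)
    return {t: sorted(rng.sample(tables[t], n_plants)) for t in chosen_tables}

def systematic_sample(units, step, seed=None):
    """Pick a random start among the first `step` units, then every `step`-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return units[start::step]

# Hypothetical greenhouse layout: 10 tables with 50 plants each.
tables = {t: [f"T{t}-P{p}" for p in range(1, 51)] for t in range(1, 11)}

two_stage = two_stage_sample(tables, n_tables=3, n_plants=5, seed=2)

# One ordered route through all plants, then every 10th plant along it.
route = [plant for t in sorted(tables) for plant in tables[t]]
systematic = systematic_sample(route, step=10, seed=2)

print(two_stage)
print(len(systematic), "plants chosen systematically")
```

Examining every plant on the chosen tables instead of a subsample would turn the two-stage scheme into the cluster scheme.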
What is measured?
Variable: apt or liable to vary or change from individual to individual; capable of being varied or changed (factor); alterable; inconsistent; having much variation or diversity; a quantity that may assume any given value from a set of values (the variable’s range).
Examples:
Types of Variables: Categorical
Categorical, classification, or qualitative variable
Discrete; essentially describes some characteristic of a sample unit. E.g. color, gender, grade, health status, treatment group. Further subdivided into:
In ordinal data the order is meaningful, but the difference between responses isn’t. Also, arithmetic is sometimes done, but its meaning is debatable.
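One way to see why arithmetic on ordinal data is debatable: the mean depends on the numeric codes assigned to the categories, while an order-based summary such as the median category does not. The responses and both coding schemes below are hypothetical.

```python
import statistics

# Hypothetical ordinal responses, ordered poor < fair < good < excellent.
responses = ["fair", "good", "good", "poor", "excellent", "good", "fair"]

# Two codings that both respect the order but space the categories differently.
codings = {
    "equal spacing": {"poor": 1, "fair": 2, "good": 3, "excellent": 4},
    "unequal spacing": {"poor": 1, "fair": 2, "good": 3, "excellent": 10},
}

results = {}
for name, scores in codings.items():
    coded = [scores[r] for r in responses]
    label_of = {score: level for level, score in scores.items()}
    # median_low always returns an observed value, so it maps back to a category.
    median_level = label_of[statistics.median_low(coded)]
    results[name] = (median_level, statistics.mean(coded))
    print(f"{name}: median category = {median_level}, mean code = {results[name][1]:.2f}")
```

Both codings give the same median category, but the mean code changes with the spacing, which is arbitrary for ordinal data.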
Types of Variables: Quantitative
Quantitative or amount variable
Can be either discrete or continuous; measures the amount or level of a characteristic of a sample unit. For example: age, weight, height, temperature, biomass, volume. Further subdivided into:
In this course we will deal primarily with quantitative variables (ratio).
Study Design Questions
It is important to be able to define the underlined words.
Terminology
Terminology (Cont)
Experimental Study
Ex: Factorial Experiment
FACTORS: Nitrogen Level, Phosphorus Level
LEVELS: 0 kg/ha, 10 kg/ha; 0 kg/ha, 10 kg/ha, 20 kg/ha
TREATMENTS: 0 / 0, 0 / 10, 10 / 0, 10 / 10, 20 / 0, 20 / 10
BLOCKED LAYOUT (complete block - all treatments in each block): SITE 1 (block 1) and SITE 2 (block 2) each contain the plots 0 / 10, 10 / 10, 20 / 10, 0 / 0, 10 / 0, 20 / 0.
EXPERIMENTAL UNIT: plot
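The treatments of a factorial experiment are all combinations of the factor levels, and a complete block contains every treatment once. The sketch below builds the six treatments and assigns them to plots in random order within each site. Two assumptions: the slide does not make clear which factor has three levels, so generic names are used, and randomizing the order within each block is common practice rather than something stated here.

```python
import itertools
import random

# Levels in kg/ha. Which factor (nitrogen or phosphorus) has three levels is
# not recoverable from the slide, so generic hypothetical names are used.
three_level_factor = [0, 10, 20]
two_level_factor = [0, 10]

# Every combination of levels is one treatment: 3 x 2 = 6 treatments.
treatments = list(itertools.product(three_level_factor, two_level_factor))

def complete_block_layout(treatments, blocks, seed=None):
    """Assign every treatment once per block, in random plot order (assumed)."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)
        layout[block] = order
    return layout

layout = complete_block_layout(treatments, ["Site 1", "Site 2"], seed=3)
for block, plots in layout.items():
    print(block, [f"{a} / {b}" for a, b in plots])
```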
Standard Form for a Data Set
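Observation Number | strata | gender | color | Other categorical variable | weight | Other quantitative variable
1 | 1 | F | RED | x x ... | 10.2 | x x ...
2 | 1 | F | WHITE | x x ... | 12.9 | x x ...
3 | 1 | M | BLUE | x x ... | 20.1 | x x ...
. . .
n | 1 | F | BLUE | x x ... | 16.0 | x x ...
CATEGORIES: strata, gender, color, other categorical variables.
AMOUNTS: weight, other quantitative variables.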
Example Data Set in Spreadsheet Format
Indicator of missing data
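A data set in standard form has one row per observation and one column per variable, with a reserved code marking missing values. The sketch below reads such a layout and skips missing weights when averaging. The "." missing-data code, the column names in the CSV header, and row 5 are hypothetical; rows 1 to 4 reuse the values from the standard-form illustration.

```python
import csv
import io
import statistics

MISSING = "."  # hypothetical missing-data indicator

# One row per observation, one column per variable.
raw = """obs,strata,gender,color,weight
1,1,F,RED,10.2
2,1,F,WHITE,12.9
3,1,M,BLUE,20.1
4,1,F,BLUE,16.0
5,1,M,RED,.
"""

def read_records(text):
    """Parse rows into dicts, converting weight to float or None if missing."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["weight"] = None if row["weight"] == MISSING else float(row["weight"])
        records.append(row)
    return records

records = read_records(raw)

# Summaries of a quantitative variable use only the observed values.
observed = [r["weight"] for r in records if r["weight"] is not None]
mean_weight = statistics.mean(observed)
print(f"{len(observed)} of {len(records)} weights observed, mean = {mean_weight:.2f}")
```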
Inventor's Paradox
The more ambitious the plan, the more chances of success, and the more opportunity for failure.
How does one decide on what to do?
Are there open questions?
Are there available resources?
Does someone really want the answer?
Can a study be done?
Will the study be able to answer the question?
Statistics may help answer the last question!
The HIP Trial Revisited