1 of 28

INTRODUCTION TO DATA SCIENCE

Experimental Design

Lecture: 02

– FARDINA FATHMIUL ALAM

(fardina@umd.edu)

CMSC 320: 2026

2 of 28

Today’s Objectives

Reading: Chapter 3, https://ffalam.github.io/CMSC320TextBook/chapter3/Chapter_3_0.html

What is Experimental Design?
Variables & Hypothesis
Confounders & Bias
Briefly: What is a Hypothesis? (LATER TOPIC)
Data Collection Methods
Case Study: Online Retail CTR

Today, we’ll cover the basics of experimental design, including how to plan and conduct experiments.

The goal is to help you design and analyze experiments more effectively.

Learn to identify variables, hypotheses, and confounding factors.

3 of 28

Experimental Design in Data Science?

The process of planning, conducting, and analyzing experiments to test hypotheses and gather meaningful data for data-driven decisions.

Data science fundamentally involves making decisions based on data.

4 of 28

What will the weather be like for the next 10 days?

How many visitors did the website ‘X’ receive last week?

Does offering free shipping increase the number of purchases?

Which courses should the department offer to maximize enrollment next semester?

Optimization criteria: the objective (goal) function we want to maximize or minimize.

Asking the right questions before solving a DS problem is a great start! And be specific!

Experimental design helps us collect good data in general. Causal experimental design (for causal and prescriptive problems) is needed only when we want to test the effect of an action in the real world (e.g., A/B testing).

Different questions lead to different models, data needs, and evaluation criteria.

Why did website traffic drop last weekend?

5 of 28

Topics

An example of Experimental Design (ED)
Identifying Variables & Population/Sample of the Study

Independent variable
Dependent variables

Hypothesis
A potential problem in ED: Confounder Variable
How to Deal with Confounder

Control
Randomization
Replication

Methods for Collecting Data

Experiments
Observational studies

Cross sectional studies
Retrospective (case control) studies
Prospective (longitudinal or cohort) studies

Surveys.
Simulations

Bias in Experiments: Placebo & Blinding

Placebo Effect
A common method to minimize bias in Experimental Design

6 of 28

Example: Online Retail

Let's say you're a data scientist working for an online retailer, and you want to test whether changing the color of the "Buy Now" button on your website affects the click-through rate (CTR)

Click-through rate (CTR) → the percentage of users who click on an item (here, the "Buy Now" button) after seeing it

What is your problem definition?

Find which version of the "Buy Now" button, Option A (default) or Option B (red), is more likely to maximize the CTR.

What is your optimization criterion? What do we want to maximize?

CTR → we want to select the "Buy Now" button option that leads to a higher CTR

Buy It Now

Ques: How can we set up an experiment to collect data in this case?

7 of 28

Ques: How can we set up an experiment to collect data in this case?

Data Size / Sample? → No. of website visitors

Split the visitors into two groups:

Group 1: Views the original website with the existing button color. Experiences no changes; baseline for comparison.

Group 2: Sees the same website but with a different color for the "Buy Now" button. Experiences the change you want to test.

Collect data on the click-through rates for both groups over a specific period.
Compare the click-through rates (CTR) between the groups after the experiment.
Check whether there is a significant difference [LATER TOPIC IN THIS COURSE]; if there is, we can infer that the change in button color influenced the click-through rate.
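The collect-and-compare steps can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `GroupStats` structure, the `ctr_lift` helper, and the visitor and click counts are all made up for the example. Significance testing is a later topic, so the sketch stops at computing and comparing the two rates.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Aggregated counts for one experiment group (hypothetical structure)."""
    visitors: int  # users who saw the page
    clicks: int    # users who clicked "Buy Now"

    @property
    def ctr(self) -> float:
        # CTR = clicks / visitors, expressed as a percentage
        return 100.0 * self.clicks / self.visitors if self.visitors else 0.0


def ctr_lift(control: GroupStats, treatment: GroupStats) -> float:
    """Absolute CTR difference (treatment minus control), in percentage points."""
    return treatment.ctr - control.ctr


# Made-up counts for illustration only; not real data.
control = GroupStats(visitors=5000, clicks=400)    # default button
treatment = GroupStats(visitors=5000, clicks=450)  # red button

print(f"Control CTR:   {control.ctr:.1f}%")
print(f"Treatment CTR: {treatment.ctr:.1f}%")
print(f"Difference:    {ctr_lift(control, treatment):.1f} percentage points")
```

A raw difference like this is only a starting point; whether it is large enough to act on depends on the sample size and on the significance check covered later.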

8 of 28

Ques: How can we set up an experiment to collect data in this case?

Data Size / Sample? → No. of website visitors

CONTROL GROUP: Views the original website with the existing button color. This group experiences no changes.

TREATMENT GROUP: Sees the same website but with a different color for the "Buy Now" button. This is the group that experiences the change you want to test.

Comparing against a control group lets us draw more reliable conclusions about the impact of the independent (manipulated) variable.

Ques: What are the variables here?

Independent Variable: Color of the "Buy Now" button (what we manipulate)
Dependent Variable: Click-through rate (what we measure: outcome)

9 of 28

Summary: Variables, Population, and Groups in a Study

Once the problem is defined, identify the variable(s) of interest that are relevant to your research question.

Independent Variable (IV): The variable that is manipulated or changed by the researcher.

Why? To observe its effect
Examples: new algorithm, marketing strategy, drug dosage

Dependent Variable (DV): The outcome we measure

Expected to change in response to the IV
Examples: user engagement, sales, patient recovery rate

Treatment vs. Control Groups

Treatment group: receives the intervention (IV applied)
Control group: does not receive the intervention

Comparing these groups helps identify the effect of the IV on the DV

Also, specify the population or sample that your study will focus on.
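As a rough sketch of how a sample might be split into treatment and control groups, the snippet shuffles a list of visitors and cuts it in half. The function name `assign_groups`, the seed value, and the visitor IDs are assumptions made for illustration.

```python
import random


def assign_groups(visitor_ids, seed=320):
    """Randomly split visitors into (nearly) equal control and treatment groups."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(visitor_ids)   # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


visitors = [f"user_{i}" for i in range(10)]  # hypothetical visitor IDs
groups = assign_groups(visitors)
print("Control:  ", groups["control"])
print("Treatment:", groups["treatment"])
```

Because membership is decided by chance rather than by the researcher or the user, the two groups should look similar on average in everything except the button color.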


11 of 28

Come up with a Hypothesis

A hypothesis is a testable statement you want to evaluate.

If X is true, then Y should happen.

What is a hypothesis?

A testable explanation
An educated guess
Describes the relationship between variables and outcomes

How do we test it?

Run an experiment or make observations
If results match the prediction → hypothesis is supported
If not → revise or form a new hypothesis

12 of 28

Brainstorming Time

Hypothesis: "If the amount of average study time (___ variable?) is increased, then average exam scores (___ variable?) will also increase."

Ques: What is the optimization goal/criterion here? How are the independent and dependent variables related?

As the number of books read increases, average literacy also increases.
If the exercise duration is extended, then the average calories burned will also increase.
If the temperature rises, then average ice cream sales will also increase.

While independent variables affect dependent variables, multiple independent variables ______________ (can / can not) influence each other.

Do you think this is a problem? (Yes/No)

Why?

Good experimental design aims to minimize correlation among the independent variables.


14 of 28

Confounders (Before Data Collection)

Confounder: An external variable that affects the DV (and is often related to the IV), distorting the IV → DV relationship if not controlled.

Why it matters

Can lead to incorrect conclusions.
Not the study focus, but must be managed.

Examples

If the exercise duration is extended, then the average calories burned will also increase.
Confounder: metabolic rate

As books read increases, average literacy also increases.
Confounders: age, socioeconomic status, etc.
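To see how a confounder can mislead, here is a small simulation of the books and literacy example. It is a sketch under invented assumptions: age drives both the number of books read and the literacy score, and books have no direct effect on literacy at all. The coefficients, noise levels, and function names are made up for illustration.

```python
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=6000, seed=0):
    """Generate (age, books, literacy) rows where age drives both other variables."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.randint(6, 17)
        books = 2 * age + rng.gauss(0, 3)     # depends on age only
        literacy = 5 * age + rng.gauss(0, 5)  # depends on age only, NOT on books
        rows.append((age, books, literacy))
    return rows


rows = simulate()
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Hold the confounder fixed: correlate books and literacy within each age.
by_age = defaultdict(list)
for age, books, literacy in rows:
    by_age[age].append((books, literacy))
within = [pearson([b for b, _ in g], [lit for _, lit in g]) for g in by_age.values()]
mean_within = sum(within) / len(within)

print(f"Correlation ignoring age:    {overall:.2f}")
print(f"Mean correlation within age: {mean_within:.2f}")
```

In this toy setup the overall correlation is strong even though books never influence literacy, and it largely disappears once age is held fixed. Real data would rarely be this clean, but the pattern is the reason confounders must be managed.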

15 of 28

Another Example: Experimental Design Flow: Polling

Problem Formulation: Imagine you're a data scientist tasked with predicting the outcome of a political election using a dataset of voter preferences. Your goal is to design an experiment that accurately represents the entire population's voting behavior.

Question: How do we know which candidate is ahead?

Can we create the IDEAL POLLING?

Eliminate confounding factors as much as possible (sample bias, poor geographic representation, population proportion bias, demographic mismatch, and many more) to make the data as accurate as feasible.
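The effect of sample bias on a poll can be illustrated with a toy electorate. Everything here is an assumption for the sake of the sketch: the 60/40 urban/rural split, the support levels for a hypothetical candidate A, and the sample sizes.

```python
import random


def build_population(n=20000, seed=1):
    """Hypothetical electorate: a list of (region, votes_for_A) pairs."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        region = "urban" if rng.random() < 0.6 else "rural"
        p_support = 0.65 if region == "urban" else 0.35  # assumed preferences
        population.append((region, rng.random() < p_support))
    return population


def support_for_a(voters):
    """Share of voters in the list who prefer candidate A."""
    return sum(vote for _, vote in voters) / len(voters)


population = build_population()
rng = random.Random(2)

# Random sample: every voter is equally likely to be polled.
random_sample = rng.sample(population, 2000)

# Biased sample: only urban voters are polled, rural voters never are.
urban_only = [v for v in population if v[0] == "urban"]
biased_sample = rng.sample(urban_only, 2000)

print(f"True support for A:   {support_for_a(population):.3f}")
print(f"Random-sample poll:   {support_for_a(random_sample):.3f}")
print(f"Urban-only poll:      {support_for_a(biased_sample):.3f}")
```

The random sample tracks the true level of support closely, while the urban-only poll overstates it, even though both polls ask the same number of people.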


17 of 28

Some Ways to Deal with Confounders

Design Stage (Before Data Collection)

Randomization: randomly assign units to groups so confounders balance on average

Random Sampling: select a representative subset to reduce bias.
Stratified Randomization: group (stratify) by an important characteristic, then randomize within groups.

Restriction: limit the study to one level of a confounder so it cannot vary.
Matching: pair or group units with similar confounder values across treatment and control.
Replication: repeat the experiment to check consistency and reduce the influence of confounders.

Analysis Stage (After Data Collection): Regression / multivariable models, Statistical adjustment

18 of 28

Example: Controlling a Confounder Variable in an Experiment

If the amount of study time (independent variable) is increased, then exam scores (dependent variable) will also increase.

Exp. Design Idea 1: Stratified Randomization

Stratify students by prior knowledge (high/medium/low), then randomly assign control (normal study time) and treatment (increased study time) within each group to balance prior knowledge.

Exp. Design Idea 2: Block Design (Matched Pair)

Pair participants by prior knowledge, then randomly assign one to control (normal study time) and one to treatment (increased study time) to control for prior knowledge within each pair.
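Both design ideas can be sketched in code. The snippet is an illustration under assumptions: the function names, student IDs, prior-knowledge levels, and prior scores are all invented, and the matched-pair version assumes an even number of students.

```python
import random
from collections import defaultdict


def stratified_randomization(students, seed=0):
    """Idea 1: randomize to control/treatment separately within each stratum.

    `students` maps a student ID to a prior-knowledge level (the stratum).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for student, level in students.items():
        strata[level].append(student)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        half = len(members) // 2
        for student in members[:half]:
            assignment[student] = "control"    # normal study time
        for student in members[half:]:
            assignment[student] = "treatment"  # increased study time
    return assignment


def matched_pairs(scores, seed=0):
    """Idea 2: pair students with similar prior scores, then randomize per pair.

    `scores` maps a student ID to a numeric prior-knowledge score.
    Assumes an even number of students.
    """
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # neighbours have similar scores
    assignment = {}
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)                    # random order within the pair
        assignment[pair[0]] = "control"
        assignment[pair[1]] = "treatment"
    return assignment


# Hypothetical students; levels and scores are made up for illustration.
levels = {f"s{i}": lvl for i, lvl in enumerate(["high", "medium", "low"] * 4)}
scores = {"a": 55, "b": 91, "c": 58, "d": 88, "e": 70, "f": 72}

print(stratified_randomization(levels))
print(matched_pairs(scores))
```

In both designs the researcher fixes how prior knowledge is distributed across groups and leaves only the final assignment to chance, so prior knowledge cannot explain a difference in exam scores between control and treatment.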

19 of 28

Try it yourself: control the effect of "age"

"As books read increases, avg. literacy also increases."

We can measure the age of each individual to see the effect of age on literacy.

Experimental Design: How to design

Random Sampling
Stratified Randomization
Block Design (Matched Pair)


21 of 28

Methods for Collecting Data (if pre-existing datasets are not available)

A. Observational studies → Observe and record data (variables) without intervening or manipulating variables (Observe; don’t change anything intentionally).

E.g. Observing animal behavior in a natural habitat without any external influence.

B. Surveys → Collect information through structured questionnaires or interviews.

E.g. Conducting a survey to gather opinions on a political issue.

C. Experiments → We actively change something to see what happens.

D. Simulations → Create artificial scenarios to model real-world situations for data collection.

E.g. Using a computer simulation to study traffic patterns in a city.

22 of 28

Observational Studies

Cross-sectional studies: data collected at one time point. Example: survey people’s exercise habits today
Retrospective (case-control) studies: look back at past exposure. Example: compare smoking history of lung cancer patients vs. non-patients
Prospective (longitudinal/cohort) studies: follow a group (cohort) over time. Example: track smokers and non-smokers for 10 years

23 of 28

B. Surveys (A specific type of observational study)

Collect data using questions or questionnaires

Commonly used to study people and populations
No intervention; we observe, not manipulate
Question wording and order must be carefully designed to avoid bias

Example: Survey students about study habits and exam stress.

24 of 28

D. Simulation

Use a computer or mathematical model to mimic real-world systems

Useful when real experiments are too costly, risky, or impractical
Allows testing “what-if” scenarios under controlled settings

Example: Simulate traffic flow to study congestion without changing real roads.
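A "what-if" traffic simulation can be very small. The sketch models a single traffic light as a queue; the model itself is an assumption made for illustration (one possible car arrival per second with a fixed probability, one car leaving per green second), as are the function name and all parameter values.

```python
import random


def average_queue(green_s, red_s, arrival_prob=0.4, duration_s=3600, seed=0):
    """Average number of waiting cars at one signal over `duration_s` seconds.

    Toy model (all assumptions): each second a car arrives with probability
    `arrival_prob`; while the light is green, one waiting car leaves per second.
    """
    rng = random.Random(seed)
    cycle = green_s + red_s
    queue = 0
    total = 0
    for t in range(duration_s):
        if rng.random() < arrival_prob:
            queue += 1                    # a new car joins the queue
        is_green = (t % cycle) < green_s  # green phase comes first in each cycle
        if is_green and queue > 0:
            queue -= 1                    # one car passes through
        total += queue
    return total / duration_s


# "What-if" comparison of two signal timings with the same 60-second cycle.
print(f"30s green / 30s red: {average_queue(30, 30):.1f} cars waiting on average")
print(f"20s green / 40s red: {average_queue(20, 40):.1f} cars waiting on average")
```

Using the same seed for both runs means both timings face the same sequence of arrivals, so any difference in the average queue comes from the signal timing alone.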


26 of 28

Placebo Effect

Improvement occurs due to belief in treatment, not the treatment itself

Can bias results if not controlled
Common in medical and behavioral studies

Example: Patients feel better after receiving a sugar pill they believe is real medicine

Minimizing Bias in Experimental Design

Blinding: participants (and/or researchers) do not know who receives the treatment or placebo

Single-blind: one side doesn’t know
Double-blind: both sides don’t know

Using a placebo helps keep participants unaware of their group, reducing bias.
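One way blinding might be organized in practice is to give each participant an opaque arm label and keep the label-to-arm key separate until analysis. The sketch shows that idea only; the function name, the "Arm A"/"Arm B" labels, and the participant IDs are hypothetical.

```python
import random


def blinded_allocation(participant_ids, seed=0):
    """Return (labels, key): an opaque arm label per participant and the sealed key.

    `labels` is what participants and assessing researchers may see;
    `key` maps each opaque label to the real arm and stays hidden until analysis.
    """
    rng = random.Random(seed)
    arms = ["treatment", "placebo"]
    rng.shuffle(arms)
    key = {"Arm A": arms[0], "Arm B": arms[1]}  # which label is which is random

    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    labels = {pid: "Arm A" for pid in shuffled[:half]}
    labels.update({pid: "Arm B" for pid in shuffled[half:]})
    return labels, key


labels, key = blinded_allocation([f"p{i}" for i in range(8)])
print(labels)  # safe to share: reveals no treatment information
# `key` would be stored separately and opened only after data collection.
```

If neither the participants nor the people measuring outcomes can see `key`, the setup corresponds to a double-blind design.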

27 of 28

The Fundamental Rule of Data Collection

Your data must be representative of the population you want to study.

Keep in mind that

It is almost impossible to be certain that your experiment has completely removed all forms of bias. It is necessary to consider possible sources of bias and highlight them in your analysis. Ideally, future experiments would improve upon your method by iteratively eliminating those sources of bias.

Key Takeaways

28 of 28

Quick Class Task

Identify which method for collecting data (observational study, an experiment, a simulation, or a survey) is best in each of the following situations and explain your answer.

The effect a severe earthquake would have on the Salt Lake Valley.
Whether or not a certain coupon attached to the outside of a catalog makes recipients more likely to order products from a mail-order company.
Whether or not smoking has an effect on coronary heart disease.
Determining the average household income of homes in Salt Lake City.