1 of 25

Session 5: Data and Statistics

2 of 25

ICEBREAKER & CHECK IN

3 of 25

Why are data and statistics important for public health research?

4 of 25

What is Data?

  • Data are bits of information about individuals

Examples:

  • patient demographics - occupation, ethnicity, area of residence, income

  • patient outcomes in clinical trials - response to treatment, adverse events, assay results
  • genetic sequencing - Ancestry and 23andMe, also a huge area of focus for bioinformatics
  • public health data - COVID-19 case numbers, positivity rates, hospitalizations, variant frequencies

5 of 25

What is Data Science?

  • Data is often NOISY and there’s a LOT of it
  • Data science - area of study that attempts to make meaning out of this “big data”
  • Why is it important?
    • The digitization of the modern world (e.g. the internet) has made large amounts of data available to individuals
    • Collecting data is crucial for answering questions and solving problems - data-driven science

6 of 25

What is Statistics?

  • One of the most popular data science methods
  • We usually only have data on a sample: a subset of individuals out of an entire population
  • Statistics summarize this sample data in a meaningful way
    • Enable us to make inferences without needing data from the entire population
    • Allow for comparisons between groups
    • Allow us to make sense of experiments

7 of 25

What is Statistics?

  • You’re probably familiar with some of these
    • Mean
    • Median
    • Standard Deviation
    • Interquartile Range (IQR)
  • They give us information about a certain group/sample
  • There are also statistical methods for COMPARING groups
    • Confidence Intervals
    • Hypothesis Testing
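
The summary statistics listed above can be computed in a few lines. This is a minimal sketch, assuming Python's built-in `statistics` module; the `summarize` helper and the daily case counts are made up for illustration and do not come from the session.

```python
from statistics import mean, median, quantiles, stdev


def summarize(values):
    """Return the mean, median, sample standard deviation, and IQR of a sample."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, Q2, Q3.
    q1, _q2, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "median": median(values),
        "standard_deviation": stdev(values),  # sample (n - 1) standard deviation
        "iqr": q3 - q1,                       # spread of the middle half of the data
    }


# Hypothetical daily case counts (illustrative numbers only).
daily_cases = [4, 8, 15, 16, 23, 42]

for name, value in summarize(daily_cases).items():
    print(f"{name}: {value:.2f}")
```

Note that the exact IQR depends on how quartiles are interpolated; different tools (Excel, R, Python) may give slightly different values for small samples.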

8 of 25

Statistics Terminology

What is a “p-value?” What about a “confidence interval?”

9 of 25

P-value

  • The probability of seeing a result at least as extreme as the one in our data set if chance alone were at work (i.e. if there were no real effect)
  • In experiments, a small p-value is used to support the idea that our experiment actually caused the effects we observed
  • In other words: what are the odds that we would observe this result from chance alone?

10 of 25

P-values, cont’d.

  • The general baseline that you’re aiming for is a p-value of <0.05
  • That means that, if there were no real effect, there would be a <5% chance of seeing a result this extreme completely at random
  • You want a lower p-value, because it means the result is harder to explain by chance alone = stronger evidence that your “treatment” is what caused the change (it is NOT the probability that the treatment worked)

11 of 25

How do you get a p-value?

  • Statistical tests known as t-tests or z-tests, which we won’t cover at the moment
  • Calculated by comparing an experimental group with a control group (the control group stands in for the “null” case, where the treatment has no effect)
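
To make the idea of "chance alone" concrete, here is a minimal sketch of a permutation test. It is a different technique from the t-tests and z-tests mentioned above, chosen because it needs no formulas. It asks: if the group labels were shuffled, how often would the difference between groups be at least as large as the one we saw? The function name and the measurements are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(treatment, control):
    """Exact two-sided permutation test on the difference in group means."""
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)

    as_extreme = 0
    total = 0
    # Try every way of relabeling the pooled values as "treatment" vs "control".
    for indices in combinations(range(len(pooled)), n_treatment):
        chosen = set(indices)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        difference = abs(mean(group_a) - mean(group_b))
        # Small tolerance guards against floating-point rounding on ties.
        if difference >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total


# Hypothetical measurements (illustrative numbers only).
treatment_group = [12.1, 11.8, 12.6, 13.0, 12.4]
control_group = [10.9, 11.2, 10.7, 11.5, 11.0]

p = permutation_p_value(treatment_group, control_group)
print(f"p-value: {p:.4f}")  # 2 of the 252 possible relabelings are this extreme
```

Trying every relabeling only works for small groups. With larger samples, people usually shuffle the labels at random a few thousand times, or use a t-test instead.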

12 of 25

Confidence Interval

  • A range of values, computed from a sample, that would capture the “true” population value (e.g. the mean for an entire population) some percentage of the time if we repeated the sampling MANY times
  • Usually aim for a 95% confidence interval
  • Also has a formula, but we won’t go into it
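
For the curious, here is a minimal sketch of one common way to compute a confidence interval for a mean: mean ± z × (standard deviation / √n), where z is about 1.96 for 95%. It assumes the large-sample normal approximation; for small samples a t-based interval is usually preferred and would be somewhat wider. The function name and the readings are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    # For 95% confidence this critical value is roughly 1.96.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    center = mean(values)
    # Standard error: sample standard deviation divided by sqrt(sample size).
    margin = z * stdev(values) / sqrt(len(values))
    return center - margin, center + margin


# Hypothetical readings from a sample of 12 people (illustrative numbers only).
readings = [118, 125, 130, 121, 127, 119, 133, 124, 128, 122, 126, 120]

low, high = mean_confidence_interval(readings)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```

A larger sample shrinks the margin, which is why bigger studies report narrower intervals.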

13 of 25

Confidence Interval

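  • If the confidence intervals for two groups don’t overlap, we know there is a statistically significant difference between the groups being compared (overlapping intervals do not automatically mean there is no difference)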

14 of 25

Confidence Intervals

The picture shows early results on Pfizer’s vaccine before it was sponsored by the company

15 of 25

Confidence Intervals

This picture shows a comparison between different vaccines, again before they were sponsored by companies.

Keep in mind the confidence intervals are represented by shading here instead of error bars

16 of 25

What is Data Visualization?

  • Ok - we have all this data summarized with statistics
    • What do we do with it?
  • Data visualization is displaying those pieces of information so we can see the insights they hold
    • KEY to scientific communication

  • Handy tools:
    • Excel - small amounts of data
    • Programming languages like R and Python - big data

17 of 25

Data Massaging

(aka “Fudging the Numbers”)

  • Sometimes, people may manipulate/adjust data in order to show the results they want in their studies or papers
  • What might be a few reasons people would do this?
    • Funding - some institutions provide resources to their labs based on successful results or paper publications
    • Agenda - some people might want to just push an agenda and thus cheat their results accordingly
    • etc.

18 of 25

Case Study: Theranos

  • Theranos: company that supposedly offered highly informative blood testing that would revolutionize the medical industry
    • The company had at least a billion dollar valuation $$$
  • Investigations found that management pressured employees to fake data and keep problems secret
  • Became a WIDELY publicized scandal

19 of 25

Takeaways

  • ALWAYS take reported statistics and data with a grain of salt
  • Make sure that the data has been mined responsibly and that a random sample is being used if possible
    • Some individuals may cherry pick their data to push an agenda or fake promising results
  • Peer-reviewed sources are generally more reliable, but peer review is not a guarantee, so read them critically too

20 of 25

Sources Activity

Mentees: Take a look at your collected sources and analyze the data they provide (if any). Discuss how you can interpret these findings.

21 of 25

Community Interviews!

22 of 25

Communicating with Interviewees

  • Most professional method of contact: email
  • Expect at least a 1 week delay between emailing your interviewee and receiving a response
    • Send a friendly follow-up email if the interviewee hasn’t responded for 1 week
  • Responses are NOT guaranteed - if no response for some time, don’t take it personally
    • Find a backup interviewee and reach out

23 of 25

Communicating with Interviewees

  • Make SURE you have done your research on your interviewee
    • Position, background, experiences, achievements, etc.
  • Use these to your advantage
    • Ask your interviewees questions that allow them to draw from their experiences - remember your interviewee is giving you expert testimony

24 of 25

Interview Activity

  1. Mentors: Give your mentees some background about your interests, major, extracurriculars, etc.
  2. Mentees: Draft some questions for your mentors based on their background, and conduct a practice “interview” with them

25 of 25

To Do

Over the coming weeks…

  1. Finalize an annotated list of ~5 sources for your project
  2. Draft and send out emails to your prospective interviewees
  3. Draft the questions you’d like to ask during your interview.