1 of 24

CSE 163

Data Communication

Suh Young Choi

🎶 Listening to: Max LL

💬 Before Class: Happy June! Do you have a favorite book, movie, or piece of media?

2 of 24

Announcements

Project Notebooks due Monday, June 9

Mapping Interview grades released

Final resubmission window closes Friday, June 6

    • Any assignment is fair game, but you can only submit 1
    • Make sure to address all TA feedback!
    • See Ed post #509 for more details

Code Interview Makeup/Retakes on Monday, June 9

    • 2:30-4:20pm in CSE2 G20 during final exam slot
    • See Ed post #361 for more details


3 of 24

This Time

  • Data Literacy
  • Data Context
  • Data Storytelling
  • Tips for Writing ☺

Last Time

  • Machine Learning
  • Large Language Models


4 of 24

What can this code tell us about the data that’s used? Answer on Ed!


import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

data = pd.read_csv('home/data.csv')

data = data[['name', 'fin_length', 'age']]

data = data.dropna()

sns.relplot(data=data, kind='line', x='age', y='fin_length', hue='name')

plt.title('Shark Ages vs. Fin Length')

plt.xlabel('Age (months)')

plt.ylabel('Fin Length (in)')

plt.savefig('/home/plot.png')
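One thing the code above does not show is how many rows `dropna()` removes, which is part of the data's context. The sketch below counts missing values before dropping them. The rows are made up for illustration (the real `data.csv` is not shown in these slides), and the `tail_length` values are an assumption based on the data dictionary.

```python
import pandas as pd

# Made-up rows for illustration only; not the real shark data.
raw = pd.DataFrame({
    "name": ["mako", "mako", "whale", "leopard"],
    "fin_length": [10.5, None, 48.0, 7.2],
    "age": [24.0, 30.0, None, 18.0],
    "tail_length": [20.1, 22.3, 90.0, None],
})

# Same column selection as the slide code.
subset = raw[["name", "fin_length", "age"]]

# Count missing values per column before anything is dropped.
missing_per_column = subset.isna().sum()

# dropna() removes every row with at least one missing value.
cleaned = subset.dropna()
rows_dropped = len(subset) - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {len(subset)} rows")
```

Reporting the number of dropped rows alongside the plot lets a reader judge how much of the original data the figure actually represents.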

5 of 24

Data Settings Revisited

It’s all about the context!


6 of 24

Data Context Guiding Questions

  • Who: the people represented in the data or responsible for its collection and usage
  • What: the information that is represented in the data, such as demographics or measurements
  • When: the timeframe associated with data collection or represented in the data
  • Where: the location of the data, virtual or otherwise
  • Why: the stated (or unstated) purpose of collecting the data
  • How: the methods used for data collection and storage


7 of 24

Where do we find context?

  • Data dictionaries

  • Metadata (including README.md or CHANGELOG.md files)

  • Contacting the researchers/data collectors


8 of 24

Data dictionary example


| Column name | Feature | Encodings | Type |
|---|---|---|---|
| id | Unique identifier per shark | N/A | int |
| name_code | Encoded name of shark species | 0 – great white; 1 – whale; 2 – mako; 3 – leopard | string |
| name | Common name of shark species | N/A | string |
| name_scientific | Scientific name of shark species | N/A | string |
| age | Shark’s age at time of capture, in months | N/A | float |
| fin_length | Length of pectoral fin, in inches | N/A | float |
| tail_length | Length of caudal fin, in inches | N/A | float |
| sex | Sex of the shark | 0 – female; 1 – male | int |
| health | Treated by on-site veterinarian? | True – yes; False – no | Boolean |
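A data dictionary like the one above can be turned directly into lookup tables for decoding columns. The sketch below is one possible way to do that with pandas. The rows are made up for illustration, and the new column names `species` and `sex_label` are hypothetical.

```python
import pandas as pd

# Encodings copied from the data dictionary above.
NAME_CODES = {0: "great white", 1: "whale", 2: "mako", 3: "leopard"}
SEX_CODES = {0: "female", 1: "male"}

# Made-up rows for illustration only; not real shark data.
sharks = pd.DataFrame({
    "id": [1, 2, 3],
    # The dictionary lists name_code as a string, so the codes arrive as text.
    "name_code": ["2", "0", "3"],
    "sex": [0, 1, 0],
})

decoded = sharks.assign(
    # Convert the string codes to integers before looking them up.
    species=sharks["name_code"].astype(int).map(NAME_CODES),
    sex_label=sharks["sex"].map(SEX_CODES),
)

print(decoded[["id", "species", "sex_label"]])
```

Note that the Type column matters here: without the `astype(int)` step, the string codes would not match the integer keys and every species would come back as missing.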

9 of 24

README example

  • Usually written in Markdown
  • Information about data, usage, and file organization


# README

This repository contains data about sharks tagged by the Shark Stewards organization between June 2011 and August 2011.

## Folder layout:

.
├── data
│   ├── data.csv
│   ├── data.parquet
│   └── data.json
└── README.md

## Etc…

10 of 24

CHANGELOG example

  • Usually written in Markdown
  • Information about what changes have occurred for the data/files


# CHANGELOG

All notable changes to this repository will be tracked in this file.

## Added

  • v.1: initial commit
  • v.2: added .json and .csv files

## Fixed

  • v.1: values for the mako shark were in the wrong units

## Removed

  • v.2: removed .stdf file

11 of 24

Data Storytelling

Let’s talk about parallel universes…


12 of 24

Shark Multiverse…

  • A set of bar graphs illustrating different categorical qualities of the different shark species
  • A YouTube tutorial using this dataset to demonstrate how to drop or replace missing values
  • A neural network using all columns to predict whether the shark is endangered or not
  • A neural network using all columns to predict shark tail size
  • A single string representing the most common shark species in the dataset


13 of 24

Narrative Plot Mountain


[Figure: plot mountain with stages labeled, in order: Exposition, Rising Action, Climax, Falling Action, Resolution]

14 of 24

Data Story Plot Mountain


[Figure: plot mountain labeled with stages of a data project]

  • Finding data
  • Posing research questions
  • Pre-processing
  • Coding
  • Testing
  • Visualizations
  • Writing reports
  • Presenting findings
  • Analysis to answer questions
  • Interpreting results
  • Communications

15 of 24

For what research question(s) might this plot be useful? Answer on Ed!

Suppose that this is the plot we created from the code earlier:

For what research question(s) might this plot be useful?


16 of 24

Finding Main Characters

  • Make sure the questions are relevant to your data, and the data is relevant to your questions.

  • Ask questions that don't have simple answers.

  • Use your domain knowledge to come up with interesting questions.

  • Pacing matters!


17 of 24

Writing Tips

  • We care more about what you have to say than how you say it.

  • You will need some amount of markdown cells in your final report.

  • Keep your writing relevant and to the point!

  • Some strategies for answering your research questions:
    • Answer a question, then explain that answer
    • Answer the question, think about a counterexample/counterargument, then refute it
    • Answer the question, pose a follow-up question, then answer the follow-up


18 of 24

What are possible questions we might have after seeing this plot? Answer on Ed!

Maybe we’re missing some context…


19 of 24

Avoiding the Vacuum

Interpret numbers and/or trends in context!

Do NOT leave free-floating p-values or ML model evaluation metrics. (If you do, Suh Young will be a bit sad ☹)

Your reader may not know your code as well as you do!

Think about explaining your project/code to someone who is not in this class.


20 of 24

Interpreting Numbers

p-values

  • The probability of observing results at least as extreme as yours, assuming that your null hypothesis is true.
  • Do you reject or fail to reject your null hypothesis?
  • Restate both the hypothesis and rejection decision in plain English.
  • Failing to reject the null hypothesis does NOT mean the null hypothesis is true; it only means the data did not provide enough evidence against it.
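To make the p-value bullets concrete, here is a minimal sketch that estimates a p-value with a permutation test and then restates the decision in plain English. The permutation test is just one convenient way to get a p-value using only the standard library; it is not necessarily the test used in this course. The fin lengths, the 0.05 threshold, and the function name are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for a difference in means by shuffling labels."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Made-up fin lengths (inches) for illustration only.
mako = [10.1, 11.3, 9.8, 10.7, 11.0]
leopard = [6.2, 7.1, 6.8, 7.4, 6.5]

ALPHA = 0.05  # assumed significance threshold
p_value = permutation_p_value(mako, leopard)
decision = "reject" if p_value < ALPHA else "fail to reject"

print(
    f"p = {p_value:.4f}: we {decision} the null hypothesis that mako and "
    "leopard sharks have the same mean fin length."
)
```

The final sentence is the part that belongs in a report: it names the hypothesis and the decision instead of leaving the number on its own.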

Pearson’s correlation coefficient (also called r or R)

  • Is there positive or negative association between your variables?
  • How strong is this correlation?
  • Correlation is NOT causation, but it may suggest it. Use domain knowledge here.
  • For a simple linear regression with one predictor, R-squared is just the square of Pearson’s coefficient.
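The relationship between Pearson's coefficient and R-squared can be checked numerically. The sketch below computes both by hand for a one-predictor linear fit. The age and fin-length values are made up, and the helper names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_squared_simple_regression(xs, ys):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up ages (months) and fin lengths (inches) for illustration only.
ages = [12, 24, 36, 48, 60]
fin_lengths = [5.0, 7.5, 9.0, 11.5, 13.0]

r = pearson_r(ages, fin_lengths)
r_squared = r_squared_simple_regression(ages, fin_lengths)

direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({direction} association), r**2 = {r**2:.3f}, R-squared = {r_squared:.3f}")
```

With more than one predictor the two quantities are no longer interchangeable, so it helps to say which one a report is quoting.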


21 of 24

Interpreting Numbers, cont’d.

Linear model coefficients

  • What variables are accounted for in your model?
  • Are there interaction terms?
  • “A one-unit increase in X is associated with a change of N in Y (holding all other variables fixed)”
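As a rough illustration of reading coefficients in context, the sketch below fits a two-predictor linear model with NumPy and prints one interpretation sentence per coefficient. The data is synthetic, generated from coefficients chosen for this example (2.0, 0.15, and 1.0), so the fit simply recovers them. There are no interaction terms in this sketch.

```python
import numpy as np

# Made-up predictors: age in months and sex (0 = female, 1 = male).
age = np.array([12.0, 24.0, 36.0, 48.0, 18.0, 30.0, 42.0, 54.0])
sex = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Synthetic response built from known coefficients so the fit can be checked.
fin_length = 2.0 + 0.15 * age + 1.0 * sex

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(age), age, sex])
coefs, *_ = np.linalg.lstsq(X, fin_length, rcond=None)
intercept, coef_age, coef_sex = coefs

print(
    f"Each additional month of age is associated with a {coef_age:.2f} inch "
    "change in fin length, holding sex fixed."
)
print(
    f"Male sharks differ from female sharks by {coef_sex:.2f} inches of fin "
    "length, holding age fixed."
)
```

Stating the units and what is held fixed is what turns a bare coefficient into something a reader outside the project can use.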

ML model evaluation metrics

  • What metric is appropriate for your ML task?
  • What about false positives and negatives?
  • What are the consequences for getting a prediction wrong?
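The questions above can be explored with a small confusion-matrix sketch. It uses made-up labels for a hypothetical "is this shark endangered?" classifier (1 = endangered, 0 = not) and shows how accuracy can look acceptable while recall is poor.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up labels for illustration only: 1 = endangered, 0 = not endangered.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the sharks flagged as endangered, how many were?
recall = tp / (tp + fn)     # of the endangered sharks, how many were flagged?

print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
print(f"{fn} endangered sharks were missed (false negatives); {fp} was flagged by mistake.")
```

Here the model is right 70% of the time but misses two of the three endangered sharks, which is the kind of consequence a report should spell out.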


22 of 24

Sentence-Level Details

A few mechanical things to think about:

  • Writing in the first-person ("I/We did X for my/our analysis to find Y" as opposed to "X was used to find Y")
  • Active voice vs. passive voice ("The results of A proved B" as opposed to "B was proven using A")
  • Present vs. past tense
  • Using abbreviations or shorthand instead of full names


23 of 24

Final Notes

Think about how your analysis might be used or interpreted

Consider biases in your data and analysis—even the ones that might come from you!

Consider the impact, ethics, and consequences of your analysis

No piece of information exists in a vacuum. Contextualize!

Your projects tell a story—who is your “main character”, and what do you want to focus on?


24 of 24

Before Next Time

  • Get your resubs in ☺
  • Continue working on your projects ☺☺

Next Time

  • What comes after 163
  • Hear from TAs!
