1 of 30

Remixing Writing Pedagogies

Writing Code and Data Through Exploratory Data Analysis

bit.ly/4c25slides

Chris Lindgren, PhD

Assistant Professor of Techcomm

NC State University

clndgrn.com

@lingeringcode

2 of 30

Coding as Writing with Data

What is our discipline’s role in teaching and advocating for critical approaches to coding and data practices?

  • genAI tool practices & issues (Owusu-Ansah, 2023; Sano-Franchini, McIntyre & Fernandes, 2024; Vee, Laquintano & Schnitzler, 2023 …)
  • Critical approaches to programmatic writing of data, i.e., coding (Byrd, 2020; Beck, 2016; Beveridge, 2015; Brock & Shepherd, 2016; Easter, 2018, 2020; Lindgren, 2021, 2024; Rea, 2022; Vee, 2017 …)

3 of 30

Coding as Writing with Data Programmatically

“... take seriously a new set of responsibilities to teach machines what to do and what not to do with powerful rhetorical strategies … [and] stay involved in the kind of work that mobilizes our disciplinary knowledge” (p. 253).

Hart-Davidson (2018)

4 of 30

Objectives

  1. Context of the course
  2. Primer on logistic regression
  3. Primer on Exploratory Data Analysis (EDA)
  4. Activity with computational notebook from the course with discussion prompts

5 of 30

Demonstrate implications of different perspective

Background of UG Course Designed with Programmatic Writing with Data

  • Course Objectives: Ethical Reasoning & Discourse general ed. objectives
  • Topics & Skills: Ethics & data design justice issues; Writing and coding with digital data in relation to digital media produced by code, LLMs, and ML.
  • Modality: Originally designed & taught fully online, async (!!!!)
  • Programmatic Goal:
    • Interdisciplinary learning space at the intersections of TPC and data science programs
    • Originally designed at Virginia Tech
    • Now, redesigning as UG/G-split at NC State University

6 of 30

Phase 1 - Course Weeks 1-4

Read & Discuss Ethical Concepts & Theories of Data/ML:

  • Designed to define, understand, and challenge ethical concepts and how such concepts/ideas help us recognize issues and interrogate what we know about data and ML.
  • Provided prompts for every set of readings, which frame them and ask students to create different types of media to encourage participation and creativity with their ideas.
  • Many students reported in their midterm participation memos how they appreciated this varied approach to the discussions.

7 of 30

Phase 1 - Topics & Readings Used

Week #1 - Social Justice vs. Ethic of Expediency

  1. Jones, N. N., and Walton, R. (2023). Social Justice. In H. Yu & J. Buehl (Eds.), Keywords in Technical and Professional Communication (pp. 267–272). UP of Colorado.
    1. Promo! :-) Lindgren, Gerdes & Lawrence. The future of ethics and social justice. In D. Ross (Ed.), Routledge Handbook of Ethics in Technical and Professional Communication.
  2. Katz, S. B. (1992). The Ethic of Expediency: Classical Rhetoric, Technology, and the Holocaust. College English, 54(3), 255–275.

Week #2 - Defining AI/ML and the human decisions that shape them

  1. Stephanie Yee & Tony Chu. A visual introduction to ML - Part 1 & Part 2
  2. Baack, S. (2023, August 2). The human decisions that shape generative AI: Who is accountable for what? [Tech Foundation]. Mozilla Foundation.

Week #3 - How are data always situated and contextual?

Week #4 - What are some existing ethical approaches and ideas in AI/ML?

  1. Birhane, A. (2021). Algorithmic injustice: A relational ethics approach. Patterns, 2, 1–9.
  2. Dylan Baker. (2022). Datasets have worldviews.

8 of 30

Phase 2 - Course Weeks 5-9

Learn & Practice Python Language Fundamentals:

  • Students begin to use a digital textbook
    • OG GitHub repo with computational notebooks
    • Read about and learn how to code in the Python programming language.
  • Worked through chapters 1-2. (See the chapters in the "book" folder.)

9 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze, and Create an ML (Logistic Regression) Model:

  • Chp 3 - Exploratory Data Analysis: Students learn how to …
    • Use the Python pandas code library and its super helpful "panel data" approach to structuring, processing, analyzing, and visualizing data.
    • Apply Exploratory Data Analysis (estimates of location, variance, and distribution) and explain why it is a useful means to "read" the data.
    • Create various visualizations/plots as a means to understand and analyze example data provided in the notebooks.

10 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze, and Create an ML (Logistic Regression) Model:

  • Chp 4 - Textual analysis with LLMs (Docuscope): Students …
    • Build on their EDA work
    • Learn how to conduct textual analysis methods:
      • Parts-of-Speech analysis and "Categorical" analysis with the Docuscope LLM;
      • Apply the DS LLM via the Python docuscospacy module.

11 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze, and Create an ML (Logistic Regression) Model:

  • Chp 5 - Training a News Genre Classifier:
    • Train a logistic regression classifier model that predicts the news genre of a news article, based on its headline.
    • Conduct an EDA of the data, attempting to identify any issues with the data in relationship with the ML goals.
    • Precursor for the final team project, where they will train a LR model with different data with new goals in mind.
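
A minimal sketch of what "training a logistic regression classifier on headlines" involves, using only the Python standard library. This is not the course notebook's code: the four toy headlines, the two-genre labels, and the function names `tokenize`, `train`, and `predict` are all assumptions made for illustration. In practice a library such as scikit-learn's `LogisticRegression` would typically handle the fitting.

```python
import math

def tokenize(text):
    """Split a headline into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

def train(headlines, labels, epochs=200, learning_rate=0.5):
    """Fit one weight per word plus a bias using stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(headlines, labels):
            words = tokenize(text)
            p = sigmoid(bias + sum(weights.get(w, 0.0) for w in words))
            error = p - y  # gradient of the log loss with respect to the weighted sum
            bias -= learning_rate * error
            for w in words:
                weights[w] = weights.get(w, 0.0) - learning_rate * error
    return weights, bias

def predict(text, weights, bias):
    """Return the estimated probability that a headline belongs to class 1."""
    return sigmoid(bias + sum(weights.get(w, 0.0) for w in tokenize(text)))

# Hypothetical toy data: 1 = POLITICS, 0 = FOOD & DRINK
headlines = [
    "senate votes on budget",
    "president signs new bill",
    "easy pumpkin pie recipe",
    "best halloween candy ranked",
]
labels = [1, 1, 0, 0]

weights, bias = train(headlines, labels)
print(round(predict("senate passes budget", weights, bias), 3))
print(round(predict("halloween candy recipe", weights, bias), 3))
```

With only four headlines the model memorizes its training words, which is one reason the EDA step matters: what the data contains is what the model can learn.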

12 of 30

FINAL PROJECT - Course Weeks 13-Finals

_Team Project_ to Develop & Interrogate Their Own ML Model: In teams, students …

  • Apply what they have learned throughout the course by creating their own ML project to code and document in Python.
  • Research:
    • Find a data set and develop a LR model with it.
    • Conduct research about the provenance of the data and how their modeling goals align or not with this provenance.
  • Document their process and outcomes with two genres:
  • Teams are formed so that each has a range of strengths to draw on: some are better coders, some are great at analysis and ideation, and others are really organized and can do great research, writing/documenting, and review work.
  • Since I know the students' strengths, I assign teams accordingly.

13 of 30

Logistic Regression Primer

14 of 30

Review: Primer on how ML works

Traditional Programming: Rules & decisions on/about data + (Digital) Data → Output / Answer

Machine Learning: (Digital) Data + Outputs/Answers → Rules on/about data
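
The contrast between the two approaches can be sketched in Python. Everything here is hypothetical and for illustration only: the toy headlines, the labels, and both function names. The "learning" step is a crude stand-in for model training, not an actual ML algorithm from the course.

```python
from collections import Counter

# Traditional programming: a person writes the rule; data goes in, answers come out.
def rule_based_genre(headline):
    """Hand-written rule: any headline mentioning 'candy' is FOOD & DRINK."""
    return "FOOD & DRINK" if "candy" in headline.lower() else "OTHER"

# Machine learning: data plus known answers go in; a rule comes out.
def learn_keyword_rule(headlines, labels, target):
    """Find the word most frequent in `target` headlines that never appears elsewhere."""
    inside, outside = Counter(), set()
    for text, label in zip(headlines, labels):
        words = set(text.lower().split())
        if label == target:
            inside.update(words)
        else:
            outside.update(words)
    candidates = {w: c for w, c in inside.items() if w not in outside}
    return max(candidates, key=candidates.get)

headlines = [
    "Best Halloween candy ranked",
    "Candy recipes for fall",
    "Senate passes budget bill",
    "Halloween costumes for kids",
]
labels = ["FOOD & DRINK", "FOOD & DRINK", "POLITICS", "STYLE"]

learned_keyword = learn_keyword_rule(headlines, labels, "FOOD & DRINK")
print(learned_keyword)  # candy
```

In the first function a human decided that "candy" signals the genre; in the second, that decision was derived from the labeled examples, so it changes whenever the data or labels change.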

15 of 30

Logistic Regression - What is it?

Buy A Coffee/Espresso = p(morning|tired|stressed) * mornings

54.75 days = p(0.15) * 365

= p(evening|refreshed|caffeinated) * 365

3.65 days = p(0.01) * 365

  • Stochastic: It’s conjecture, yo! Randomly distributed descriptions.
  • Returns a probability score that something may or may not happen or be
  • Ex. What’s the likelihood that I’m going to buy a coffee/espresso?
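
The coffee arithmetic can be checked with a few lines of Python. The function name `expected_days` is an assumption for illustration; the probabilities 0.15 and 0.01 and the 365-day year come from the slide.

```python
DAYS_PER_YEAR = 365

def expected_days(probability, days=DAYS_PER_YEAR):
    """Expected number of days an event occurs, given a per-day probability."""
    return probability * days

morning = expected_days(0.15)  # p(morning|tired|stressed)
evening = expected_days(0.01)  # p(evening|refreshed|caffeinated)

print(f"{morning:.2f} days")  # 54.75 days
print(f"{evening:.2f} days")  # 3.65 days
```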

16 of 30

Logistic Regression - What is it?

  • Uses binary (dichotomous or discrete) data to predict the probability of a binary decision: Yes or No?
  • 1 or more independent variables: features, i.e., words from a corpus of headlines
    • “Predictors”
  • Variables used to determine a binary decision about the output, i.e., the dependent variable (news genre of article)
    • “Criterion”

Independent Variables (the headline's words): The Most Uniquely Popular Halloween Candy In Each U.S. State

Dependent Variable (the news genre): Food & Drink

???
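
One way to see how a headline becomes a set of independent variables is a bag-of-words count. This is a sketch under assumptions: the tokenizer regex and the function name `headline_features` are hypothetical, and the course notebook may prepare features differently.

```python
import re
from collections import Counter

def headline_features(headline):
    """Lowercase the headline and count each word (a bag-of-words representation)."""
    return Counter(re.findall(r"[a-z0-9']+", headline.lower()))

features = headline_features(
    "The Most Uniquely Popular Halloween Candy In Each U.S. State"
)
label = "FOOD & DRINK"  # the dependent variable paired with these features

print(sorted(features))
```

Note that this tokenizer splits "U.S." into the separate features "u" and "s". Small preprocessing choices like this one shape what the model can and cannot "read" in a headline.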

17 of 30

Logistic Regression - What is it?

  • Uses probability, or likelihood, scores to output predictions based on the available features in the input as compared with the trained model's features
  • LR models use a classification threshold value to ultimately decide how to classify the input
  • New genres present us with an interesting case, where inputs may interact with the features scored in the trained model in interesting ways :-)

Independent Variables (the headline's words): President Biden announces his favorite Halloween candy before U.S. Congress

Dependent Variable (the news genre): POLITICS

0.990
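
A classification threshold can be sketched as a single comparison. This assumes the 0.990 on the slide is the model's probability score for POLITICS; the function name `classify`, the default threshold of 0.5, and the "NOT POLITICS" label are assumptions for illustration.

```python
def classify(probability, threshold=0.5, positive="POLITICS", negative="NOT POLITICS"):
    """Turn a probability score into a discrete label using a threshold."""
    return positive if probability >= threshold else negative

print(classify(0.990))                   # POLITICS
print(classify(0.990, threshold=0.995))  # NOT POLITICS
```

The same score yields a different label when the threshold moves, so choosing the threshold is itself a decision about the data.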

18 of 30

Logistic Regression == Discrete classification

  • Input/Output, Yes/No, 0/1:
    • Very crude way to decide how to classify an input.
    • No perfect rule to make that decision
    • So, how can we nuance that decision?

[Figure: discrete classification with two output levels, 1 and 0]

19 of 30

Logistic Regression - Nuancing discrete classification

Enter Stochastic Models: A conditional distribution of probabilities

  • Goal: Probability of event occurring / thing being (P)
  • Use: Classification problems

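P = w_1 x_1 + w_2 x_2 + w_3 x_3 + … + w_k x_k + b

P(News Genre) = w_1(halloween) + w_2(candy) + w_3(popular) + … + w_k x_k + b

Here each x is a word (feature) and P is the probability estimate.

[Figure: probability estimate axis marked at 0.5 and 1.0]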
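
In standard logistic regression, the weighted sum of features is passed through the sigmoid (logistic) function so that the result falls between 0 and 1 and can be read as a probability. The sketch below shows that step. The word weights and bias are invented for illustration and do not come from any trained model.

```python
import math

def sigmoid(z):
    """Map a weighted sum onto the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical weights: positive values push toward POLITICS, negative away from it.
weights = {"halloween": -1.2, "candy": -0.8, "president": 2.0, "congress": 1.5}
bias = -0.5

def predict_probability(headline, weights, bias):
    """Sum the weights of the words present, add the bias, and apply the sigmoid."""
    words = headline.lower().split()
    z = bias + sum(weights.get(w, 0.0) for w in words)
    return sigmoid(z)

headline = "President Biden announces his favorite Halloween candy before U.S. Congress"
p_politics = predict_probability(headline, weights, bias)
print(round(p_politics, 3))  # 0.731
```

Words without a learned weight contribute nothing here, which is one way an input can interact with a trained model's features in unexpected ways.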

20 of 30

Logistic Regression - Goods & Caveats

  • Good for …
    • Supervised ML, i.e., data with correct and accurate labels
    • Classification: Predicting categorical (discrete) data vs continuous (linear)
    • Example: Predict news genre, based on X
  • Not good for data with potential inaccuracies or biases toward certain labels
  • The lesson we’re going to review was meant to provide an opportunity for students to interrogate, deliberate, and evaluate if this dataset is good for the goal of the model via EDA + new (to them) assessment measures.
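
A quick check for bias toward certain labels is to compute each label's share of the dataset. The label list below is made up for illustration, and `label_shares` is a hypothetical helper; with pandas, `Series.value_counts(normalize=True)` gives a comparable result.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the dataset, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Hypothetical, deliberately imbalanced labels
labels = ["POLITICS"] * 6 + ["FOOD & DRINK"] * 3 + ["STYLE"] * 1

shares = label_shares(labels)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

A model trained on data like this sees six POLITICS examples for every STYLE example, which can tilt its predictions toward the majority label.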

21 of 30

Activity Time!

EDA

22 of 30

get(notebook)

23 of 30

Activity Time!

  1. (~5 mins) Acclimate to the notebook by reviewing sections 0 & 1 in the notebook together.
    • Note the parts: EDA, Training, and Assessment
  2. (~10 mins) Read through the EDA part of the notebook:
    • Use the 4c25-lr.html.
    • Take notes about your experience navigating the notebook: questions, feelings, anything.
    • Work through some of the prompts as best you can: 2.3.1, 2.4.1, 2.7.2, 2.8.1
  3. Reconvene and talk through our responses.
  4. If time, discuss insights about coding as writing data – thoughts about writing as a material and rhetorical perspective to bring to these types of practices.

24 of 30

EDA - Estimates of Location

Values that measure the central tendency of a dataset.

Measure | Definition | Sensitive to Outliers? | Best Use Cases
Mean | Average of all values | Yes | Symmetrical data, all values important
Median | Middle value | No | Skewed data, presence of outliers
Mode | Most frequent value | No | Categorical data, identifying common values
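
The three estimates of location can be computed with Python's built-in `statistics` module. The headline word counts below are hypothetical and include one deliberate outlier; pandas offers the equivalent `Series.mean()`, `Series.median()`, and `Series.mode()`.

```python
import statistics

# Hypothetical headline lengths in words; 40 is an outlier
headline_lengths = [5, 6, 6, 7, 8, 9, 40]

mean_length = statistics.mean(headline_lengths)
median_length = statistics.median(headline_lengths)
mode_length = statistics.mode(headline_lengths)

print(f"mean:   {mean_length:.2f}")  # mean:   11.57
print(f"median: {median_length}")    # median: 7
print(f"mode:   {mode_length}")      # mode:   6
```

The single outlier pulls the mean well above every typical value, while the median and mode stay put.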

25 of 30

EDA - Estimates of Distribution

Values that describe the spread, variability, or dispersion of data points in a dataset.

Measure | Definition | Sensitive to Outliers? | Focus
Range | Max minus min | Yes | Entire dataset extent
Variance | Average squared deviation from the mean | Yes | Spread around the mean
Standard Deviation | Square root of variance | Yes | Typical deviation from the mean
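
These estimates can be worked through with the `statistics` module. The values are a made-up example. The sketch uses the population formulas (`pvariance`, `pstdev`) because they match the "average squared deviation" definition; note that pandas' `Series.var()` and `Series.std()` default to the sample versions, which divide by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with a mean of 5

value_range = max(values) - min(values)
variance = statistics.pvariance(values)  # average squared deviation from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

print(value_range)  # 7
print(variance)     # 4
print(std_dev)      # 2.0
```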

26 of 30

2.3.1 Exercise

Observations About

the ‘headline’ Column

Observation 1

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 2

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 3

  • Supporting evidence
  • Model impact?
    • Enter response

27 of 30

2.4.1 Exercise

Observations About

the ‘date’ Column

Fewer 2018 articles compared to the rest of the dataset.

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 2

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 3

  • Supporting evidence
  • Model impact?
    • Enter response
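
An observation such as a year having fewer articles can be supported by counting rows per year. This sketch assumes the 'date' column holds ISO-formatted strings (YYYY-MM-DD); the dates and the function name `articles_per_year` are invented for illustration and are not taken from the notebook's dataset.

```python
from collections import Counter
from datetime import date

def articles_per_year(dates):
    """Count how many ISO-formatted dates fall in each year."""
    return Counter(date.fromisoformat(d).year for d in dates)

# Hypothetical 'date' column values
dates = [
    "2016-03-01", "2016-07-15",
    "2017-01-20", "2017-11-02", "2017-12-25",
    "2018-02-14",
]

per_year = articles_per_year(dates)
for year in sorted(per_year):
    print(year, per_year[year])
```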

28 of 30

2.7.2 Exercise

Observations About

the ‘category’ Column

Observation 1

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 2

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 3

  • Supporting evidence
  • Model impact?
    • Enter response

29 of 30

2.8.1 Exercise

Observations About the

short descriptions Columns

Observation 1

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 2

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 3

  • Supporting evidence
  • Model impact?
    • Enter response

30 of 30

EDA & Model Accuracy

Observations Comparing EDA Work Against the LR Model

(See sections 4.4.3 and on …)

Observation 1

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 2

  • Supporting evidence
  • Model impact?
    • Enter response

Observation 3

  • Supporting evidence
  • Model impact?
    • Enter response