1 of 30

Remixing Writing Pedagogies

Writing Code and Data Through Exploratory Data Analysis

bit.ly/4c25slides

Chris Lindgren, PhD

Assistant Professor of Techcomm

NC State University

clndgrn.com

@lingeringcode

2 of 30

Coding as Writing with Data

What is our discipline’s role in teaching and advocating for critical approaches to coding and data practices?

genAI tool practices & issues (Owusu-Ansah, 2023; Sano-Franchini, McIntyre & Fernandes, 2024; Vee, Laquintano & Schnitzler, 2023 …)
Critical approaches to programmatic writing of data, i.e., coding (Byrd, 2020; Beck, 2016; Beveridge, 2015; Brock & Shepherd, 2016; Easter, 2018, 2020; Lindgren, 2021, 2024; Rea, 2022; Vee, 2017 …)

3 of 30

Coding as Writing with Data Programmatically

“... take seriously a new set of responsibilities to teach machines what to do and what not to do with powerful rhetorical strategies … [and] stay involved in the kind of work that mobilizes our disciplinary knowledge” (p. 253).

Hart-Davidson (2018)

4 of 30

Objectives

Context of the course
Primer on logistic regression
Primer on Exploratory Data Analysis (EDA)
Activity with computational notebook from the course with discussion prompts

5 of 30

Demonstrate implications of different perspective

Background of UG Course Designed with Programmatic Writing with Data

Course Objectives: Ethical Reasoning & Discourse general ed. objectives
Topics & Skills: Ethics & data design justice issues; Writing and coding with digital data in relation to digital media produced by code, LLMs, and ML.
Modality: Originally designed & taught fully online, async (!!!!)
Programmatic Goal:

Interdisciplinary learning space at the intersections of TPC and data science programs
Originally designed at Virginia Tech
Now, redesigning as UG/G-split at NC State University

6 of 30

Phase 1 - Course Weeks 1-4

Read & Discuss Ethical Concepts & Theories of Data/ML:

Designed to define, understand, and challenge ethical concepts and how such concepts/ideas help us recognize issues and interrogate what we know about data and ML.
Provided prompts for every set of readings, which frame them and ask students to create different types of media to encourage participation and creativity with their ideas.
Many students reported in their midterm participation memos how they appreciated this varied approach to the discussions.

7 of 30

Phase 1 - Topics & Readings Used

Week #1 - Social Justice vs. Ethic of Expediency

Jones, N. N., and Walton, R. (2023). Social Justice. In H. Yu & J. Buehl (Eds.), Keywords in Technical and Professional Communication (pp. 267–272). UP of Colorado.

Promo! :-) Lindgren, Gerdes & Lawrence. The future of ethics and social justice. In D. Ross (Ed.), Routledge Handbook of Ethics in Technical and Professional Communication.

Katz, S. B. (1992). The Ethic of Expediency: Classical Rhetoric, Technology, and the Holocaust. College English, 54(3), 255–275.

Week #2 - Defining AI/ML and the human decisions that shape them

Stephanie Yee & Tony Chu. A visual introduction to ML - Part 1 & Part 2
Baack, S. (2023, August 2). The human decisions that shape generative AI: Who is accountable for what? [Tech Foundation]. Mozilla Foundation.

Week #3 - How are data always situated and contextual?

Kevin Guyan. Queer Data. Introduction "Data and difference" – Chp. 2 "Moving targets" is optional
Natasha Jones & Miriam Williams. (2022). Archives, Rhetorical Absence, and Critical Imagination: Examining Black Women’s Mental Health Narratives at Virginia’s Central State Hospital. IEEE Transactions on Procomm.
Lily Ng and Lena Nahorna. (22 May 2022). Grammarly AI-NLP Club #15: How to Collect High-Quality Data to Power Your Machine Learning Systems. Grammarly Ukraine [YouTube].

Week #4 - What are some existing ethical approaches and ideas in AI/ML?

Birhane, A. (2021). Algorithmic injustice: A relational ethics approach. Patterns, 2, 1–9.
Dylan Baker. (2022). Datasets have worldviews.

8 of 30

Phase 2 - Course Weeks 5-9

Learn & Practice Python Language Fundamentals:

Students begin to use a digital textbook

OG GitHub repo with computational notebooks
Read about and learn how to code in the Python programming language.

Worked through chapters 1-2. (See the chapters in the "book" folder.)

9 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze and Create a ML (Logistic Regression) Model:

Chp 3 - Exploratory Data Analysis: Students learn how to …

Use the Python pandas code library and its super helpful "panel data" approach to structuring, processing, analyzing, and visualizing data.
Conduct Exploratory Data Analysis (estimates of location, variance, and distribution), and explain why it is a useful means to "read" the data.
Create various visualizations/plots as a means to understand and analyze example data provided in the notebooks.

10 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze and Create a ML (Logistic Regression) Model:

Chp 4 - Textual analysis with LLMs (Docuscope): Students …

Build on their EDA work
Learn how to conduct textual analysis methods:

Parts-of-Speech analysis and "Categorical" analysis with the Docuscope LLM;
Apply the DS LLM via the Python docuscospacy module.

11 of 30

Phase 3 - Course Weeks 10-12

Process, Analyze and Create a ML (Logistic Regression) Model:

Chp 5 - Training a News Genre Classifier:

Train a logistic regression classifier model that predicts the news genre of a news article, based on its headline.
Conduct an EDA of the data, attempting to identify any issues with the data in relationship with the ML goals.
Precursor for the final team project, where they will train a LR model with different data with new goals in mind.

12 of 30

FINAL PROJECT - Course Weeks 13-Finals

_Team Project_ to Develop & Interrogate Their Own ML Model: In teams, students …

Apply what they have learned throughout the course by creating their own ML project to code and document in Python.
Research:

Find a data set and develop a LR model with it.
Conduct research about the provenance of the data and how their modeling goals align or not with this provenance.

Document their process and outcomes with two genres:

Datasheet (Gebru et al., 2021)
Model Card (Mitchell et al., 2019)

Teams are used so that each one has a range of strengths to draw on: some students are better coders, some are great at analysis and ideation, and others are really organized and can do great research, writing/documenting, and review work.
Since I know the students' strengths, I assign teams accordingly.

13 of 30

Logistic Regression Primer

14 of 30

Review: Primer on how ML works

Traditional Programming: Rules & decisions on/about data + (Digital) Data → Output / Answer

Machine Learning: (Digital) Data + Outputs/Answers → Rules on/about data

15 of 30

Logistic Regression - What is it?

Buy A Coffee/Espresso = p(morning|tired|stressed) * mornings

54.75 days = p(0.15) * 365

= p(evening|refreshed|caffeinated) * 365

3.65 days = p(0.01) * 365

Stochastic: It’s conjecture, yo! Randomly distributed descriptions.
Returns a probability score that something may or may not happen or be
Ex. What’s the likelihood that I’m going to buy a coffee/espresso?
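
The coffee arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function name `expected_days` is made up for illustration, and the two probabilities (0.15 and 0.01) are the ones shown on the slide.

```python
DAYS_PER_YEAR = 365

def expected_days(probability: float, days: int = DAYS_PER_YEAR) -> float:
    """Expected number of days on which the event happens."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * days

# Probabilities taken from the slide.
tired_morning = expected_days(0.15)      # p(morning|tired|stressed)
refreshed_evening = expected_days(0.01)  # p(evening|refreshed|caffeinated)

print(f"Tired mornings: about {tired_morning:.2f} coffee days per year")
print(f"Refreshed evenings: about {refreshed_evening:.2f} coffee days per year")
```

The point is that a probability score is not a yes/no answer by itself. It only says how often we would expect the event across many days.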

16 of 30

Logistic Regression - What is it?

Uses binary (dichotomous or discrete) data to predict the probability of a binary decision: Yes or No?
1 or more independent variables: features, i.e., words from a corpus of headlines

“Predictors”

Variables used to determine a binary decision about the output (dependent variable: news genre of article)

“Criterion”

Independent Variables: “The Most Uniquely Popular Halloween Candy In Each U.S. State”

Dependent Variable: Food & Drink

???

17 of 30

Logistic Regression - What is it?

Uses probability, or likelihood, scores to output predictions based on the available features in the input as compared with the trained model's features
LR models use a classification threshold value to ultimately decide how to classify the input
News genres present us with an interesting case, where inputs may interact with the features scored in the trained model in interesting ways :-)

Independent Variables: “President Biden announces his favorite Halloween candy before U.S. Congress”

Dependent Variable: POLITICS

0.990

Let's say our LR model returns a value of 0.990 for a particular headline's news genre as being "POLITICS". This probability score is very likely to accurately predict that this headline is indeed an article about POLITICS. Conversely, another headline with a prediction score of 0.005 on that same logistic regression model is very likely not about POLITICS. Yet, what about a headline with a prediction score of 0.6?

In this lesson, the LR model converts those probability estimates into a binary category. To do so, we must decide on what's called a "classification threshold" or "decision threshold". Any value above that threshold indicates a headline is about POLITICS, and any value below the threshold indicates that the headline is not POLITICS, but some other news genre.

The default decision threshold in the scikit-learn code library that we will use is 0.5. But, this library also enables us to "tune" the LR model based on our problem-dependency / context, as well as take the best/top probability score to predict the news genre of the input headline.
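
A minimal sketch of the thresholding step, written in plain Python so the decision is visible. The function `classify` and the stricter 0.7 threshold are assumptions for illustration, not part of the notebook. The three scores are the ones discussed above.

```python
def classify(probability: float, threshold: float = 0.5) -> str:
    """Turn a probability score into a binary label.

    Scores strictly above the threshold count as POLITICS. Sending ties
    to the negative class is a choice made for this sketch.
    """
    return "POLITICS" if probability > threshold else "NOT POLITICS"

scores = [0.990, 0.005, 0.6]  # prediction scores from the example above

default_labels = [classify(p) for p in scores]
strict_labels = [classify(p, threshold=0.7) for p in scores]

for p, a, b in zip(scores, default_labels, strict_labels):
    print(f"score={p:<5}  threshold 0.5 -> {a:<12}  threshold 0.7 -> {b}")
```

Only the 0.6 headline changes label when the threshold moves. That is why the choice of threshold depends on the problem and its context.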

18 of 30

Logistic Regression == Discrete classification

Input/Output, Yes/No, 0/1 :

Very crude way to decide how to classify an input.
No perfect rule to make that decision
So, how can we nuance that decision?

19 of 30

Logistic Regression - Nuancing discrete classification

Enter Stochastic Models: A conditional distribution of probabilities

Goal: Probability of event occurring / thing being (P)
Use: Classification problems

0.5

P = σ(z), where z = w₁•x₁ + w₂•x₂ + w₃•x₃ + … + w_k•x_k + b and σ(z) = 1 / (1 + e^(−z))

P(News Genre) = σ(w₁•(halloween) + w₂•(candy) + w₃•(popular) + … + w_k•x_k + b)

Words (features)

Probability Estimate

1.0
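
To make the formula concrete, here is a small sketch in plain Python. In logistic regression the weighted sum of features is passed through the logistic (sigmoid) function, which squeezes it into a probability between 0 and 1. Every weight and the bias below are invented for illustration; a trained model would learn them from labeled headlines. The genre being scored is assumed to be "Food & Drink".

```python
import math

# Hypothetical weights for "is this headline Food & Drink?".
# Positive weights push toward the genre, negative weights push away.
WEIGHTS = {
    "halloween": 1.2,
    "candy": 1.5,
    "popular": 0.4,
    "president": -1.8,
    "congress": -2.0,
}
BIAS = -0.5  # hypothetical intercept (b)

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def genre_probability(headline: str) -> float:
    """Sum the weights of the words present, add the bias, apply sigmoid."""
    words = [w.strip(".,!?\"'").lower() for w in headline.split()]
    z = BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)
    return sigmoid(z)

candy = genre_probability(
    "The Most Uniquely Popular Halloween Candy In Each U.S. State"
)
biden = genre_probability(
    "President Biden announces his favorite Halloween candy before U.S. Congress"
)
print(f"Candy headline: {candy:.3f}")
print(f"Biden headline: {biden:.3f}")
```

Both headlines share "Halloween" and "candy", yet the made-up weights for "president" and "congress" pull the second score below 0.5. This is the kind of feature interaction the news genre case raises.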

20 of 30

Logistic Regression - Goods & Caveats

Good for …

Supervised ML, i.e., data with correct and accurate labels
Classification: Predicting categorical (discrete) data vs continuous (linear)
Example: Predict news genre, based on X

Not good for data with potential inaccuracies or biases toward certain labels
The lesson we’re going to review was meant to provide an opportunity for students to interrogate, deliberate, and evaluate if this dataset is good for the goal of the model via EDA + new (to them) assessment measures.
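
One quick EDA check for bias toward certain labels is to count how often each label appears. The sketch below uses a tiny, made-up list of genre labels (not the course dataset) and the standard-library `Counter`. The course notebook works with pandas instead.

```python
from collections import Counter

# Hypothetical labels: 6 POLITICS, 3 WELLNESS, 1 FOOD & DRINK.
labels = ["POLITICS"] * 6 + ["WELLNESS"] * 3 + ["FOOD & DRINK"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset taken up by each label.
shares = {label: n / total for label, n in counts.items()}

# A "model" that always guesses the most common label would score this
# accuracy without learning anything about the headlines.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

for label, share in shares.items():
    print(f"{label:<13} {share:.0%}")
print(f"Always guessing {majority_label}: {baseline_accuracy:.0%} accuracy")
```

If one label dominates, a high accuracy score may say more about the dataset than about the model. That is one reason to compare EDA findings against the assessment measures.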

21 of 30

Activity Time!

EDA

22 of 30

get(notebook)

https://bit.ly/4c25zen

23 of 30

Activity Time!

(~5 mins) Acclimate to the notebook by reviewing sections 0 & 1 in the notebook together.

Note the parts: EDA, Training, and Assessment

(~10 mins) Read through EDA part of the notebook:

Use the 4c25-lr.html.
Take notes about your experience navigating the notebook: questions, feelings, anything.
Work through some of the prompts as best you can: 2.3.1, 2.4.1, 2.7.2, 2.8.1

Reconvene and talk through our responses.
If time, discuss insights about coding as writing data – thoughts about writing as a material and rhetorical perspective to bring to these types of practices.

24 of 30

EDA - Estimates of Location

Measure	Definition	Sensitive to Outliers?	Best Use Cases
Mean	Average of all values	Yes	Symmetrical data, all values important
Median	Middle value	No	Skewed data, presence of outliers
Mode	Most frequent value	No	Categorical data, identifying common values

Measures the central tendency of a dataset.
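
A short sketch of the three measures using Python's built-in `statistics` module. The numbers are hypothetical headline word counts, with one unusually long headline added to show how an outlier affects each measure.

```python
import statistics

# Hypothetical word counts per headline; 48 is a deliberate outlier.
headline_lengths = [8, 9, 9, 10, 11, 12, 48]

mean_len = statistics.mean(headline_lengths)
median_len = statistics.median(headline_lengths)
mode_len = statistics.mode(headline_lengths)

# Same data without the outlier, for comparison.
trimmed = [n for n in headline_lengths if n != 48]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"With outlier:    mean={mean_len:.2f}  median={median_len}  mode={mode_len}")
print(f"Without outlier: mean={mean_trimmed:.2f}  median={median_trimmed}")
```

The mean jumps when the long headline is included, while the median barely moves. This matches the "Sensitive to Outliers?" column in the table.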

25 of 30

EDA - Estimates of Distribution

Values that describe the spread, variability, or dispersion of data points in a dataset.

Measure	Definition	Sensitive to Outliers?	Focus
Range	Max - Min	Yes	Entire dataset extent
Variance	Average squared deviation from the mean	Yes	Spread around the mean
Standard Deviation	Square root of variance	Yes	Typical deviation from the mean
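
The three measures in the table can be computed with the built-in `statistics` module. This sketch uses a small hypothetical list of values. It uses the population versions (`pvariance`, `pstdev`) because the table defines variance as the average squared deviation from the mean. The sample versions (`variance`, `stdev`) divide by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

value_range = max(values) - min(values)      # largest minus smallest
variance = statistics.pvariance(values)      # average squared deviation
std_dev = statistics.pstdev(values)          # square root of the variance

print(f"range={value_range}  variance={variance}  std dev={std_dev}")
```

The standard deviation is in the same units as the data, which usually makes it easier to read than the variance.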

26 of 30

2.3.1 Exercise

Observations About

the ‘headline’ Column

Observation 1

Supporting evidence
Model impact?

Enter response

Observation 2

Supporting evidence
Model impact?

Enter response

Observation 3

Supporting evidence
Model impact?

Enter response

27 of 30

2.4.1 Exercise

Observations About

the ‘date’ Column

Fewer 2018 articles compared to the rest of the dataset.

Supporting evidence
Model impact?

Enter response

Observation 2

Supporting evidence
Model impact?

Enter response

Observation 3

Supporting evidence
Model impact?

Enter response

28 of 30

2.7.2 Exercise

Observations About

the ‘category’ Column

Observation 1

Supporting evidence
Model impact?

Enter response

Observation 2

Supporting evidence
Model impact?

Enter response

Observation 3

Supporting evidence
Model impact?

Enter response

29 of 30

2.8.1 Exercise

Observations About the

short descriptions Columns

Observation 1

Supporting evidence
Model impact?

Enter response

Observation 2

Supporting evidence
Model impact?

Enter response

Observation 3

Supporting evidence
Model impact?

Enter response

30 of 30

EDA & Model Accuracy

Observations Comparing EDA Work Against the LR Model

(See sections 4.4.3 and onward …)

Observation 1

Supporting evidence
Model impact?

Enter response

Observation 2

Supporting evidence
Model impact?

Enter response

Observation 3

Supporting evidence
Model impact?

Enter response