1 of 54

Amir H. Payberah

payberah@kth.se

Oct. 10, 2025

Data Colonization

2 of 54

Civilization

https://tinyurl.com/bdhbexwf

3 of 54

Colonialism and Land Grabbing

  • Explore
  • Expand
  • Exploit
  • Exterminate

https://tinyurl.com/mry959j7

4 of 54

Data Grab - Ulises A. Mejias and Nick Couldry

https://tinyurl.com/yc62yz9h

5 of 54

Data Colonialism

  • Data is often harvested without consent, reinforcing existing inequalities and creating new forms of exploitation.

https://tinyurl.com/39w2zfkf

6 of 54

Impact of Data Colonialism

  • Silences marginalized voices
  • Erases narratives
  • Perpetuates oppression

https://tinyurl.com/5xe26xx3

7 of 54

Challenge These Practices

  • Embrace pluralism in data science.
  • Value and include marginalized communities' viewpoints at all stages of the process.
  • Recognize a multiplicity of voices to get a complete picture.

8 of 54

Anti-Eviction Mapping Project

9 of 54

Anti-Eviction Mapping Project (AEMP)

  • AEMP: an example of how data can be used to resist the 4X model (explore, expand, exploit, exterminate) and serve the interests of marginalized communities.

10 of 54

Housing Crisis and Evictions in San Francisco

  • High income gap between the richest and poorest in San Francisco.
  • Wealth gap has racial and gender dimensions.
  • Escalating rate of evictions since 2003.

https://tinyurl.com/ycx2fjz2

11 of 54

AEMP Community

  • A collective of housing justice activists, researchers, artists, and historians (founded in 2013).
  • Works with many designers and nonexperts (local residents) to map evictions.

12 of 54

Feminist Counterdata and Countervisualization

  • AEMP documents the displacement of residents through gentrification and eviction, particularly in communities of color and low-income neighborhoods.
  • A set of counterdata collection and countervisualization strategies, asking:
    • Where do people go after they are evicted?
    • How many of those people end up homeless?
    • Which landlords are responsible for the evictions?

https://tinyurl.com/ywnzbnxx

13 of 54

https://antievictionmap.com

  • Maps of displaced residents
  • Maps of evictions
  • Maps of tech buses
  • Maps of property owners
  • Maps of the Filipino diaspora
  • Maps of the declining numbers of Black residents in the city
  • And more …

https://antievictionmap.com

14 of 54

Embracing Pluralism in Data Science

  • AEMP exemplifies the "embrace pluralism" principle of data feminism.
  • Valuing various perspectives at all stages of the process, from data collection to cleaning to analysis to communication.

https://tinyurl.com/2kks39b6

15 of 54

Questions for Discussion

  • Can pluralism and profit coexist in the field of data science, or are they fundamentally at odds?
  • How effective is collaborative data science in truly shifting power to marginalized communities?
  • What are the potential risks of using these methods in ways that still benefit those in power?

16 of 54

Embracing Pluralism in Data Science

  • Embracing pluralism also means recognizing how data science methods can unintentionally silence voices in the name of clarity, cleanliness, and control.

https://tinyurl.com/39yzjt99

17 of 54

Data Cleaning

18 of 54

Cleaning and Preparing Data

  • Data cleaning is considered a crucial part of data science.
  • What might be lost in this process?
  • Whose perspectives might be lost in this process?
  • Whose perspectives might be additionally imposed?

19 of 54

Historical Roots and Ethical Considerations of Cleanliness

  • Ideas of cleanliness in data have historical ties to eugenics.
    • Eugenics: the aim of improving the genetic quality of human populations through selective breeding.
  • Early statisticians like Karl Pearson and Francis Galton were also leaders in the eugenics movement.

https://tinyurl.com/ydv28yh7

https://tinyurl.com/4b9j47yd

Pearson Correlation Coefficient
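
For reference, the Pearson correlation coefficient named on this slide measures the strength of linear association between two variables. A minimal sketch in plain Python follows; the function name `pearson_r` and the sample numbers are assumptions made up for illustration, not material from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up numbers, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
```

The coefficient summarizes linear association only; it carries no information about who collected the data, how, or why.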

20 of 54

Ghost Stories for Darwin - Banu Subramaniam

  • Argues that scientific practices are deeply embedded in cultural and political contexts.
  • Critiques how the desire for purity and uniformity in scientific data has often mirrored broader societal efforts to control and categorize human populations.

https://tinyurl.com/ynky768r

21 of 54

The TESCREAL Bundle

  • Transhumanism
  • Extropianism
  • Singularitarianism
  • Cosmism
  • Rationalism
  • Effective Altruism
  • Longtermism

22 of 54

The Cleaning Paradigm and Its Critique

  • Katie Rawson and Trevor Muñoz argue in "Against Cleaning" that cleaning assumes an underlying correct order.
  • Rich information can be lost during cleaning.
  • Cleaning can function as a diversity-hiding trick.
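
A small sketch of how normalization can hide diversity. The dish names and the `normalize` rule below are hypothetical, chosen only to illustrate the point; they are not taken from any real dataset or from the authors' own workflow. The sketch keeps every original value next to its cleaned form so the variation stays recoverable.

```python
from collections import defaultdict

# Hypothetical raw entries; not taken from any real dataset.
raw_names = [
    "Potatoes au Gratin",
    "potatoes, au gratin",
    "Au Gratin Potatoes",
    "Potatoes au gratin (new)",
]

def normalize(name):
    """A deliberately aggressive cleaning rule: lowercase, drop
    parenthetical notes and punctuation, and sort the words."""
    name = name.lower().split("(")[0]
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name)
    return " ".join(sorted(cleaned.split()))

# Map each cleaned value back to every original variant it replaced.
variants = defaultdict(list)
for name in raw_names:
    variants[normalize(name)].append(name)

distinct_before = len(set(raw_names))
distinct_after = len(variants)
```

Here several distinct spellings, including one carrying an extra note, collapse into a single cleaned value. Keeping a mapping such as `variants` is one way to retain what cleaning would otherwise discard.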

23 of 54

Epistemic Violence - Gayatri Chakravorty Spivak

  • The harm caused by silencing or invalidating marginalized knowledge and voices.

24 of 54

Situated Knowledges - Donna Haraway

  • All knowledge is partial: no single person or group can claim an objective view of the Truth.
  • People make knowledge from a particular standpoint: from a situated, embodied location in the world.

https://tinyurl.com/y6zzjhj5

25 of 54

All Data Are Local - Yanni Loukissas

  • Data is inherently local: it reflects the specific conditions, practices, and power structures of the settings in which it is created.
  • Considering data settings rather than datasets.
  • E.g., Clemson University Library's "upstate" term is clear locally, but confusing to outsiders.

https://lmc.gatech.edu/featured-people/yanni-loukissas

26 of 54

  • How do we start gaining a deeper understanding in data science?

27 of 54

Transparency and Reflexivity

  • Transparency: revealing technical details.
  • Reflexivity: revealing who is doing the work and the process behind it.

https://tinyurl.com/5n99sjuj

28 of 54

Beyond Transparency and Reflexivity

  • Actively and deliberately inviting other perspectives into the data analysis and storytelling process.

https://tinyurl.com/mrxy76mz

29 of 54

From Data for Good (Data Ethics) to Data for Co-liberation (Data Justice)

30 of 54

Questions for Discussion

  • Is it possible to achieve both technical accuracy and ethical responsibility in data cleaning, or are these goals sometimes in conflict?
  • How should data scientists approach situations where prioritizing one might compromise the other?

31 of 54

Situating Data on the Wild Web

32 of 54

https://tinyurl.com/3p9r2fnr

33 of 54

Context and Metadata in Data

  • Many datasets lack context or metadata.
  • Lack of context makes data exploration difficult.
  • Without local knowledge, understanding power dynamics is challenging.

34 of 54

Context-Free Data Analysis

  • Chris Anderson's "The End of Theory" claims data speak for themselves, eliminating the need for context.
  • Statistical inference relies on sampling, but big data suggests using all data directly.

35 of 54

Is Correlation Enough?

  • Anderson insists: correlation is enough.
  • E.g., Google Search, which ranks pages by how many links point to them.
  • But what happens when the number of links is also highly correlated with sexist, racist, and pornographic results?
  • Correlation can reinforce societal biases.

36 of 54

"Raw Data" Is an Oxymoron - Lisa Gitelman

  • “The numbers speak for themselves” rests on the premise that data are a raw input.
  • Data is anything but raw.

https://tinyurl.com/2f7v5325

37 of 54

Raw Data, Cooked Data, Cooking

  • Data are not raw inputs; they are cooked through social, political, and historical contexts.
  • One strategy for considering context is to consider the cooking process that produces “raw” data.
  • Exploring and analyzing what is missing from a dataset.
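
One way to start exploring what is missing is to count missing values per field within each group. The sketch below uses hypothetical records; the field names `neighborhood` and `destination` and all values are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical records; None marks a value that was never recorded.
records = [
    {"neighborhood": "A", "destination": "moved within city"},
    {"neighborhood": "A", "destination": None},
    {"neighborhood": "B", "destination": None},
    {"neighborhood": "B", "destination": None},
    {"neighborhood": "B", "destination": "left the city"},
]

def missing_share_by_group(rows, group_key, field):
    """Share of rows in each group where `field` is missing."""
    totals = Counter(row[group_key] for row in rows)
    missing = Counter(row[group_key] for row in rows if row[field] is None)
    return {group: missing[group] / totals[group] for group in totals}

shares = missing_share_by_group(records, "neighborhood", "destination")
```

If missingness differs sharply between groups, that difference is itself a clue about how the data were produced and whose situations went unrecorded.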

https://tinyurl.com/yvhs3v6w

38 of 54

Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes

  • Word embeddings
  • Explore gender and ethnic stereotypes
    • Across Google Books and the New York Times
  • Words like intelligent, logical, and thoughtful were predominantly associated with men until the 1960s.
  • From the 1960s onwards, these words increasingly became associated with women.
  • Attributed to the women's movement in the 1960s and 1970s.
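
A rough sketch of the general idea behind such measurements: compare how close an adjective's vector lies to a group of "men" words versus a group of "women" words. The three-dimensional vectors below are invented toy values, not real embeddings, and `gender_association` is a simplified stand-in rather than the exact metric used in the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Toy vectors, invented for illustration only.
embeddings = {
    "he": [1.0, 0.1, 0.0],
    "man": [0.9, 0.2, 0.1],
    "she": [0.1, 1.0, 0.0],
    "woman": [0.2, 0.9, 0.1],
    "logical": [0.8, 0.3, 0.5],
}

def gender_association(word, men_words, women_words, emb):
    """Positive: closer to the men group; negative: closer to the women group."""
    men = mean_vector([emb[w] for w in men_words])
    women = mean_vector([emb[w] for w in women_words])
    return cosine(emb[word], men) - cosine(emb[word], women)

score = gender_association("logical", ["he", "man"], ["she", "woman"], embeddings)
```

Repeating a calculation of this kind with embeddings trained separately on text from each decade would give a time series of associations, which is the sort of trend the bullets above describe.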

39 of 54

Formerly Gang-Involved Youth as Domain Experts for Analyzing Unstructured Twitter Data

  • SAFELab: use Twitter data to understand and prevent gang violence.
  • Deep listening and contextual analysis of social media.

40 of 54

Consider Context

41 of 54

Consider Context

  • Where did the data come from?
  • Who collected it?
  • When?
  • How was it collected?
  • Why was it collected?

42 of 54

Consider Context in Data

  • Datasheets for datasets
  • Dataset nutrition label
  • A participatory approach for data work

https://tinyurl.com/3a4av458

43 of 54

Datasheets for Datasets

  • Address the needs of two stakeholder groups: dataset creators and consumers
  • Objectives for dataset creators
    • Encourage careful reflection on the creation, distribution, and maintenance of a dataset, considering assumptions, risks, and potential impacts.
  • Objectives for dataset consumers
    • Ensure they have the information they need to make informed decisions about using a dataset.

https://www.dair-institute.org/projects/documentation-accountability/

44 of 54

Datasheets for Datasets - Key Components

  • Motivation
    • Why was the dataset created? What is its intended use?
  • Composition
    • What does the dataset consist of? Who are the subjects?
  • Collection process
    • How was the data collected? Any ethical concerns?
  • Preprocessing
    • Any cleaning or transformations applied to the data?
  • Distribution
    • How can the dataset be accessed or shared?
  • Maintenance
    • Who is responsible for maintaining the dataset?

45 of 54

Datasheets for Datasets - Example (Health Records)

  • Motivation
    • Collected to improve healthcare outcomes
  • Composition
    • Contains anonymized patient records from 10 hospitals
  • Collection process
    • Data collected through surveys and hospital records
  • Preprocessing
    • Data anonymized and cleaned for missing entries
  • Distribution
    • Available to researchers under a strict data-sharing agreement
  • Maintenance
    • Updated annually by the hospital consortium
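
A datasheet can also be kept as a small structured record next to the dataset. The sketch below is one possible representation, not a format defined by the Datasheets for Datasets paper; the class name `Datasheet`, its field names, and the `missing_sections` helper are assumptions that mirror the six components listed earlier, filled in with the health-records example from this slide.

```python
from dataclasses import dataclass, fields

@dataclass
class Datasheet:
    # One free-text answer per datasheet section (assumed structure).
    motivation: str
    composition: str
    collection_process: str
    preprocessing: str
    distribution: str
    maintenance: str

    def missing_sections(self):
        """Names of sections left empty, which a reviewer should question."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

health_records = Datasheet(
    motivation="Collected to improve healthcare outcomes",
    composition="Contains anonymized patient records from 10 hospitals",
    collection_process="Data collected through surveys and hospital records",
    preprocessing="Data anonymized and cleaned for missing entries",
    distribution="Available to researchers under a strict data-sharing agreement",
    maintenance="Updated annually by the hospital consortium",
)

unanswered = health_records.missing_sections()
```

A check like `missing_sections` only verifies that each section has some text; whether the answers are adequate still needs human review.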

46 of 54

Dataset Nutrition Label

  • A structured approach to provide key information about datasets.
  • Inspired by food nutrition labels: offers transparency on data content and quality.

https://tinyurl.com/2dx7xe7x

47 of 54

Dataset Nutrition Label - Key Components

  • Data provenance
    • Information about the origin and source of the data.
  • Data collection methods
    • How the data was gathered, including any ethical concerns.
  • Limitations and biases
    • Acknowledging data biases, gaps, or potential issues.
  • Data use and misuse
    • Guidelines for appropriate use and potential risks of misuse.
  • Maintenance and updates
    • Who manages the data and how often it’s updated.

48 of 54

Dataset Nutrition Label - Example (Health Records)

  • Data provenance
    • Collected from multiple healthcare providers.
  • Data collection methods
    • Patient data from electronic health records.
  • Limitations and biases
    • Underrepresentation of rural populations and minority groups.
  • Data use and misuse
    • Intended for predictive healthcare models; should not be used to determine insurance premiums because of potential biases.
  • Maintenance and updates
    • Maintained by a health consortium, updated annually.
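
A limitation such as underrepresentation can be made concrete by comparing each group's share of the dataset with its share of a reference population. The sketch below is a minimal illustration; the function name `representation_gap` and all counts and population shares are hypothetical placeholders, not figures from any real dataset.

```python
def representation_gap(dataset_counts, population_shares):
    """Dataset share minus population share for each group.
    Negative values indicate underrepresentation."""
    total = sum(dataset_counts.values())
    if total == 0:
        raise ValueError("dataset is empty")
    return {
        group: dataset_counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical numbers, for illustration only.
dataset_counts = {"urban": 900, "rural": 100}
population_shares = {"urban": 0.75, "rural": 0.25}

gaps = representation_gap(dataset_counts, population_shares)
underrepresented = [group for group, gap in gaps.items() if gap < 0]
```

The choice of reference population is itself a contextual decision, so a label would ideally state which reference was used and why.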

49 of 54

Participatory Documentation

  • Involves multiple stakeholders (data creators, curators, users) in the documentation process.
  • Encourages diverse perspectives and shared ownership over data production practices.

https://tinyurl.com/4ndhe9hm

50 of 54

Participatory Documentation - Key Steps

  • Involve stakeholders early
    • Engage data collectors, curators, and users from the beginning of the project.
  • Map data creation
    • Document the entire data production process, from collection to final output, noting key decisions.
  • Record rationales
    • Explain why certain data collection, cleaning, or curation choices were made.
  • Highlight biases and gaps
    • Identify and openly document any biases or limitations in the data.

51 of 54

Questions for Discussion

  • Who should provide context?
    • End users?
    • Data publishers?
    • Data intermediaries (e.g., librarians, journalists, educators)?

52 of 54

Summary

53 of 54

54 of 54

  • Data Feminism, Catherine D'Ignazio and Lauren F. Klein (ch. 4-5)
  • Data colonialism: Rethinking big data’s relation to the contemporary subject, Nick Couldry et al., 2019
  • Against Cleaning, Katie Rawson and Trevor Muñoz
  • All Data Are Local: Thinking Critically in a Data-Driven Society, Yanni Loukissas (ch. introduction)
  • Situated knowledges: The science question in feminism and the privilege of partial perspective, Donna Haraway, Feminist Studies, 1988
  • Social media for large studies of behavior, Derek Ruths and Jürgen Pfeffer, 2014
  • Artificial Intelligence and Inclusion: Formerly Gang-Involved Youth as Domain Experts for Analyzing Unstructured Twitter Data, William R. Frey et al., 2020
  • Datasheets for datasets, Timnit Gebru et al., Communications of the ACM, 2021
  • The Dataset Nutrition Label (2nd Gen), Kasia S. Chmielinski et al., 2018
  • Documenting Data Production Processes: A Participatory Approach for Data Work, Milagros Miceli et al., 2022
