1 of 83

Cracking the Black Box:

Investigating Algorithms

2025 Global Investigative Journalism Conference

2 of 83

Who’s who

Karol Ilagan (she/they), University of the Philippines Diliman / Pulitzer Center, Moderator

Gabriel Geiger�(he/him)�Lighthouse Reports

Jasper Jackson�(he/him)�Transformer (previously TBIJ)

Lam Thuy Vo�(she/they)�CUNY / Documented NY

3 of 83

What we’ll cover today

  • AI investigations framework
  • Story case studies

4 of 83

What is an algorithm?

  • An algorithm is a sequence of rules performed to carry out a certain task.
  • It generates an output from a given input, similar to solving a mathematical problem or cooking a meal through a recipe.
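To make the definition concrete, here is a toy illustration (ours, not from the deck): a few fixed rules that turn an input into an output.

```python
def average_fare(fares):
    """A minimal algorithm: fixed rules that turn an input (a list of
    fares) into an output (their average), the way a recipe turns
    ingredients into a meal."""
    if not fares:
        raise ValueError("need at least one fare")
    total = 0.0
    for fare in fares:          # rule 1: add up every input value
        total += fare
    return total / len(fares)   # rule 2: divide by the count

print(average_fare([120.0, 150.0, 180.0]))  # prints 150.0
```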

5 of 83

[Diagram: AI development stages (data, AI models, compute, applications) mapped to related issues, actors, and impacted people. Related issues include labor exploitation, loss of privacy/consent, surveillance capitalism, erosion of IP, job loss/degradation, surveillance, energy/water use, mineral extraction, emissions, geopolitics, automation, embedded bias, errors/hallucinations, embedded abuse, discrimination, and mis/disinformation. Actors and impacted people include companies, governments, investors, businesses, workers, talent, individuals, communities, consumers, citizens, democracies, and the planet.]

6 of 83

Jasper Jackson

Now: Managing editor of Transformer

Previously: Tech editor of the Bureau of Investigative Journalism

7 of 83

Input

8 of 83

[Diagram repeated from slide 5: AI development stages (data, AI models, compute, applications) mapped to related issues, actors, and impacted people.]

9 of 83

AI/Machine learning is fundamentally based on data

The key feature enabling the algorithmic tools of today is data: huge amounts of it.

The numbers start to become meaningless, but GPT-5 was reportedly trained on something like 10 trillion words of text.

But, while we hear a lot about data used to train LLMs — the kinds of data different AI tools require vary massively. It’s not just sucking up the internet.

And a lot of that data requires human beings to help turn it into something that is actually useful.

10 of 83

Areas where you may uncover wrongdoing or systemic problems

  • Labor — Lots and lots of people are involved in producing this data, and many aspects of this work can be harmful in a range of ways.
  • Intellectual property — Plenty of legal battles over this. From a legal perspective, most of those battles will probably be lost, but that is largely because IP law is inadequate for what AI, particularly generative AI, does.
  • Surveillance — All the old issues, but cranked up a notch.
  • Bias — Again, many similar issues, but cranked up, particularly due to complexity, lack of transparency and the spread of AI systems into key areas where discrimination is possible.
  • Quality — This is often overlooked. The quality of the data used has huge repercussions for how a system performs once deployed, and the data is also a target for bad actors (as I’ll mention later).

11 of 83

Two key focuses for investigation:

I’m going to focus on two areas that I think can be good starting points/routes of attack for investigations into algorithms and AI.

Labour: The people who work on data!

Data itself: The actual stuff that is put into AI!

(Note: There is often a lot of overlap between these. Access to the data can often be passed on by the people working on it, and the data itself can often be derived from those people — pictures, IDs, etc.)

12 of 83

Labour

13 of 83

Training data is a big industry

All this data needs labelling! Normally by human beings!

Worth circa $20bn…

14 of 83

And it employs lots of people…

15 of 83

Example 1: Facial recognition

Compelling issue:

  • Increasingly common worldwide.
  • Clear privacy and bias issues.
  • Commonly used as a tool of repression.

But also has unique data requirements that require input from human workers:

  • Databases trained on online data alone are ineffective in real-world scenarios.
  • Historically ineffective at identifying non-white faces.
  • Requires extensive labelling which means workers!

16 of 83

Example 1: Facial recognition

17 of 83

18 of 83

19 of 83


20 of 83

21 of 83

Key ways into AI labour investigations:

  • Public discussions
  • Closed groups
  • Videos
  • Unions
  • Community groups
  • Lawyers
  • Public contracts

22 of 83

Data

23 of 83

Example 2: Data Poisoning

“Data poisoning is a type of cyber-attack in which an adversary intentionally compromises a training dataset used by an AI or machine learning (ML) model.”

  • Relatively underexplored.
  • Potentially big implications…

Recent Anthropic/AISI study: “By injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.”

24 of 83

Pravda/Portal Kombat

25 of 83

26 of 83

“The network may have been custom-built to flood LLMs with pro-Russia content. …

This final finding poses foundational implications for the intersection of disinformation and artificial intelligence (AI), which threaten to turbocharge highly automated, global information operations in the future.”

“A well-funded Moscow-based propaganda machine has successfully infiltrated leading artificial intelligence models, flooding Western AI systems with Russian disinformation.”

“By prompting popular AI chatbots such as OpenAI’s ChatGPT and Google’s Gemini, we found that content posted by Pravda news portals had found its way into the generated responses.”

27 of 83

28 of 83

Key ways into data set investigations:

  • Look at what’s on the market - a lot of data sets are for sale.
  • Think about what kind of data can be harmful.
  • Identify new data sources that will be ingested by an LLM.
  • Identify data that doesn’t talk to humans.

29 of 83

Compute

30 of 83

[Diagram: the compute stage of AI development. Related issues: energy/water use, mineral extraction, emissions, geopolitics. Actors and impacted people: governments, companies, investors, communities, citizens, the planet.]

31 of 83

[Diagram: the applications stage of AI development. Related issues: automation, embedded bias, errors/hallucinations, embedded abuse. Actors and impacted people: talent, companies, individuals, communities, investors, businesses.]

32 of 83


TWO APPROACHES

GLASS BOX
GOAL: You want to understand how an AI model actually works.
SYSTEM TYPE: “Simpler” machine learning models (usually)
ACTOR: Governments (usually)
METHODS: Public records requests, sourcing

BLACK BOX
GOAL: You want to understand how an AI model operates in the real world.
SYSTEM TYPE: More advanced AI models (e.g. generative AI, social media algorithms)
ACTOR: Companies (usually)
METHODS: Sourcing, systematic testing


33 of 83


TWO APPROACHES

GLASS BOX
GOAL: You want to understand how an AI model actually works.
SYSTEM TYPE: “Simpler” machine learning models (usually)
ACTOR: Governments (usually)
METHODS: Public records requests, sourcing


34 of 83

Case Study #1


35 of 83

What we knew


36 of 83

How can we interrogate bias across the AI Lifecycle?

  • Input variables — Does the AI system use variables that are unfair, like a person’s race or gender? Does it use proxy variables for these characteristics, like postcode?
  • Training data — Does the training data contain historical biases? Is it representative of the real world?
  • Model type — Does the machine learning technique inject randomness?
  • Outcomes — Is there disparate impact against vulnerable groups?
  • Accuracy — Does the system perform equally well across different groups?
  • Deployment — Is the application of an AI system in itself biased?

37 of 83

Our public records request

1.a The machine learning model file and source code used to train the model.

2.a Documents containing the full set of input variables.
2.b Data dictionaries or other documents that describe how the input variables are defined.

3.a Technical documentation related to the model.

3.b Handbooks or other manuals for how end-users should interpret and act upon algorithmic outputs.

4.a Documents containing tests and/or evaluations of the model or algorithm.

4.b Documents describing the training data for the model.

5. Data protection, privacy, and/or human rights impact assessments.

38 of 83


39 of 83


40 of 83

A source points us to a buried report from the agency’s auditor


41 of 83

Simple data!!!

42 of 83


43 of 83


TWO APPROACHES

BLACK BOX
GOAL: You want to understand how an AI model operates in the real world.
SYSTEM TYPE: More advanced AI models (e.g. generative AI, social media algorithms)
ACTOR: Companies (usually)
METHODS: Sourcing, systematic testing


44 of 83

  • Midjourney AI reproduces bias and stereotypes.
  • Flattens cultural differences and hierarchies.

45 of 83

46 of 83

47 of 83

48 of 83

No backend access to Midjourney.

49 of 83

Methodology

  • NOT a fancy statistical experiment; two people manually labeling data in a spreadsheet.

  • Analyze 3,000 images with consistent prompts.

  • Breakdown by country.
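Once two people have labeled the images in a spreadsheet, the per-country breakdown is a simple tally. A minimal sketch, assuming a CSV export with hypothetical 'country' and 'label' columns (not the newsroom's actual schema):

```python
import csv
from collections import Counter, defaultdict

def label_share_by_country(csv_path, label="stereotyped"):
    """Tally the hand-applied labels per country from the labeling
    spreadsheet (exported as CSV) and return each country's share of
    images carrying `label`."""
    counts = defaultdict(Counter)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["country"]][row["label"]] += 1
    return {
        country: c[label] / sum(c.values())
        for country, c in counts.items()
    }
```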

50 of 83

51 of 83

Applications: Social media and misinformation

52 of 83

Applications of algorithms in social media

Social media algorithms promote some content more than others. It’s not necessarily important for us to know how the algorithm works; it can be more useful for our audiences to track what is amplified and what kinds of actors and content are able to thrive on these platforms.

53 of 83

Three approaches

  • How your audience experiences the social web: Quantified selfie
  • How bad actors manipulate the web: looking at content
  • Investigating the system: running experiments to prove it does harm

54 of 83

Looking at your audience

55 of 83

Quantified Selfie

Understanding media ecosystems: By interviewing community members and analyzing the YouTube viewing history of one person, The Markup found that many influencers translated conspiracy theories and misinformation from right-wing news sites into Vietnamese.

56 of 83

Quantified Selfie

Understanding impact on people’s relationships: asking for access to a person’s personal data to tell their data story.

We analyzed 2,367 posts from the Facebook News Feeds of a politically divided mother and daughter to show them just how different their online worlds are.

57 of 83

Scaling the quantified selfie

Documented worked with five migrants to download their TikTok history.

Documented identified about 300 videos that had been watched by most of the men. The videos provided either partial or inaccurate content about vital matters, such as how to fill out legal forms for requesting asylum, that, if acted upon, could derail their asylum processes and integration into American society.
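Finding the videos shared across several participants' histories is a small counting exercise. A sketch under stated assumptions (the viewer threshold and the shape of the exported data are ours, not Documented's exact method):

```python
from collections import Counter

def widely_watched(histories, min_viewers=3):
    """Given each participant's watch history (a list of video IDs from
    the platform's data-download export), return the videos seen by at
    least `min_viewers` different participants."""
    seen_by = Counter()
    for history in histories:
        for video_id in set(history):  # count each participant once per video
            seen_by[video_id] += 1
    return {vid: n for vid, n in seen_by.items() if n >= min_viewers}
```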

58 of 83

Bad actors

59 of 83

60 of 83

Looking at questionable content

Documented found 50+ YouTube channels selling the same product but advertising it using different phone numbers.

61 of 83

Testing the system

62 of 83

Testing the system

In this story from ProPublica, reporters set up an account to place Facebook ads and found that the platform would not stop them from placing racist and discriminatory ads.

63 of 83

Testing the system

The Wall Street Journal created over 100 automated accounts, or bots, assigning each a set of specific interests without disclosing them to TikTok. They found that TikTok needs only one important piece of information to figure out what you want: the amount of time you linger over a piece of content. Every second you hesitate or rewatch, the app is tracking you.

64 of 83

Investigating algorithms

in data-scarce contexts

Case study: Ride-hailing apps

in the Philippines

65 of 83

[Diagram repeated from slide 5: AI development stages (data, AI models, compute, applications) mapped to related issues, actors, and impacted people.]

66 of 83

[Diagram repeated from slide 5.]

67 of 83

Philippines

68 of 83

69 of 83

Is an algorithm the answer to the complex problem of commuting in Metro Manila?

70 of 83

The reporting challenge

  • Old laws and weak regulation (or the lack thereof) meant that no data were available from the government.
  • Grab is a near-monopoly in the Philippines; it bought out Uber’s local operations in 2018. It is quite popular and has become a lifeline for many Filipinos.
  • Opaque system.
  • Tech literacy is low.

71 of 83

What we had at the beginning

  • Numerous accounts from users about steep fares
  • Multiple stories from workers about their experiences
  • Personal experiences

72 of 83

What we did: Collect and analyze data systematically

Manual:

We formed a team of 20 researchers who attempted to book rides for 10 routes across Metro Manila every hour from 6 a.m. to midnight for one week.
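The manual collection plan above can be sketched as a schedule generator. The defaults mirror the slide's setup (hourly from 6 a.m. to midnight, for one week), but the exact parameters and route names are assumptions:

```python
from datetime import datetime, timedelta

def booking_schedule(routes, start_date, days=7, first_hour=6, last_hour=24):
    """Enumerate every (timestamp, route) slot researchers should attempt."""
    slots = []
    for day in range(days):
        day_start = start_date + timedelta(days=day)
        for hour in range(first_hour, last_hour + 1):
            ts = day_start + timedelta(hours=hour)
            for route in routes:
                slots.append((ts, route))
    return slots

# 10 routes x 19 hourly slots x 7 days = 1,330 booking attempts
slots = booking_schedule([f"route-{i}" for i in range(10)], datetime(2025, 1, 6))
print(len(slots))  # prints 1330
```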

73 of 83

What we did: Collect and analyze data systematically

API data:

74 of 83

Our data collection yielded more than 8,000 entries

Explore the data:

https://bit.ly/40Ez130

75 of 83

Data: Grab rides always included surge fees
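A finding like this can be checked with a few lines over the collected entries. A minimal sketch, assuming each logged booking carries a quoted fare and a notional base fare (field names are ours, not the project's actual schema):

```python
def surge_share(entries):
    """Share of collected fare quotes that include a surge component.
    Any quote above the base fare counts as surged."""
    if not entries:
        return 0.0
    surged = sum(1 for e in entries if e["fare"] > e["base_fare"])
    return surged / len(entries)
```

If every Grab ride really included a surge fee, this share would come out at 1.0 over the full data set.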

76 of 83

Grab Philippines acknowledges our findings

“In some cases, high demand periods may persist, leading to prolonged surge pricing. The aim is to attract more drivers to areas with high demand, thereby reducing wait times over time. Continuous adjustments are made to ensure optimal service delivery.”

77 of 83

Illustrating the impact of gamifying apps on workers

We interviewed nearly 50 drivers and riders over the course of four months to learn about the risks.

78 of 83

Completing the story: Accountability questions

  1. Who is supposed to do what, when, and how?
  2. How is this process or system supposed to work? How did it work, actually?
  3. What standards are supposed to be in place, how were they established, and who enforces them?

79 of 83

Learn more about our methodology

https://bit.ly/48aJKo2

80 of 83

Q&A + Reflections

81 of 83

Training opportunity: Reporting on AI Intensive

Applications due Nov. 27!

82 of 83

AI Spotlight Series Now Available Online

83 of 83

Thank you.

bit.ly/gjic2025-algo-blackbox