Cracking the Black Box:
Investigating Algorithms
2025 Global Investigative Journalism Conference
Who’s who
Karol Ilagan (she/they) · University of the Philippines Diliman / Pulitzer Center · Moderator
Gabriel Geiger (he/him) · Lighthouse Reports
Jasper Jackson (he/him) · Transformer (previously TBIJ)
Lam Thuy Vo (she/they) · CUNY / Documented NY
What we’ll cover today
What is an algorithm?
[Diagram: the AI development stages (data, compute, AI models, applications), each mapped to related issues, the actors involved, and the people impacted]
Data: labor exploitation, loss of privacy/consent, surveillance capitalism (workers, individuals, communities)
Compute: energy/water use, mineral extraction, geopolitics, emissions (communities, planet, citizens; governments, companies, investors)
AI models: automation, embedded bias, errors/hallucinations, embedded abuse (talent, individuals, communities; companies, investors, businesses)
Applications: discrimination, mis/disinformation, erosion of IP, job loss/degradation, surveillance (consumers, democracies, workers; businesses, governments)
Jasper Jackson
Now: Managing editor of Transformer
Previously: Tech editor of the Bureau of Investigative Journalism
Input
AI/Machine learning is fundamentally based on data
The key feature enabling the algorithmic tools of today is data: huge amounts of it.
Numbers start to become meaningless at this scale, but GPT-5 was reportedly trained on something like 10 trillion words of text.
But, while we hear a lot about data used to train LLMs — the kinds of data different AI tools require vary massively. It’s not just sucking up the internet.
And a lot of that data requires human beings to help turn it into something that is actually useful.
Areas where you may uncover wrongdoing or systemic problems
Two key focuses for investigation:
I'm going to focus on two areas that can be good starting points and routes of attack for investigations into algorithms and AI.
Labour: The people who work on data!
Data itself: The actual stuff that is put into AI!
Note: There is often a lot of overlap between these. The way into the data is often through the people working on it, and the data itself is often derived from those people (pictures, IDs, etc.).
Labour
Training data is a big industry
All this data needs labelling! Normally by human beings!
Worth circa $20bn…
And it employs lots of people…
Example 1: Facial recognition
Compelling issue:
But it also has unique data requirements that depend on input from human workers:
Key ways into AI labour investigations:
Data
Example 2: Data Poisoning
“Data poisoning is a type of cyber-attack in which an adversary intentionally compromises a training dataset used by an AI or machine learning (ML) model.”
Recent Anthropic/AISI study: “By injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.”
Pravda/Portal Kombat
“The network may have been custom-built to flood LLMs with pro-Russia content. …
This final finding poses foundational implications for the intersection of disinformation and artificial intelligence (AI), which threaten to turbocharge highly automated, global information operations in the future.”
“A well-funded Moscow-based propaganda machine has successfully infiltrated leading artificial intelligence models, flooding Western AI systems with Russian disinformation.”
“By prompting popular AI chatbots such as OpenAI’s ChatGPT and Google’s Gemini, we found that content posted by Pravda news portals had found its way into the generated responses.”
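One systematic test along these lines can be automated: send the same prompts to several chatbots and check each response against a watchlist of network domains. A minimal sketch, where the watchlist entries are placeholders (not real findings) and a canned string stands in for a real chatbot API call:

```python
import re

# Hypothetical watchlist: domains attributed to a disinformation network
# would go here. These names are placeholders, not real findings.
WATCHLIST = {"example-pravda-portal.test", "another-portal.test"}

DOMAIN_RE = re.compile(r"\b((?:[\w-]+\.)+[a-z]{2,})\b", re.IGNORECASE)

def flag_watchlisted_sources(response_text: str) -> set:
    """Return watchlisted domains that appear in a chatbot response."""
    found = {d.lower() for d in DOMAIN_RE.findall(response_text)}
    return found & WATCHLIST

# In a real test you would log responses from live chatbots; here a
# canned response stands in for the API call.
response = ("According to example-pravda-portal.test, the claim is true. "
            "Reuters.com reported otherwise.")
hits = flag_watchlisted_sources(response)
```

Logging which prompts surface watchlisted sources, across models and over time, turns anecdotes into a measurable pattern.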
Key ways into data set investigations:
Compute
Related issues: energy/water use, mineral extraction, geopolitics, emissions
Impacted people: communities, planet, citizens
Actors: governments, companies, investors
AI models
Related issues: automation, embedded bias, errors/hallucinations, embedded abuse
Impacted people: talent, individuals, communities
Actors: companies, investors, businesses
TWO APPROACHES

GLASS BOX
Goal: you want to understand how an AI model actually works
System type: "simpler" machine learning models (usually)
Actor: governments (usually)
Methods: public records requests, sourcing

BLACK BOX
Goal: you want to understand how an AI model operates in the real world
System type: more advanced AI models (e.g. generative AI, social media algos)
Actor: companies (usually)
Methods: sourcing, systematic testing
GLASS BOX
Case Study #1
What we knew
How can we interrogate bias across the AI Lifecycle?
Input Variables
Does the AI system use variables that are unfair, like a person’s race or gender? Does it use proxy variables for these characteristics, like postcode?
Training Data
Does the training data contain historical biases? Is it representative of the real world?
Model Type
Does the machine learning technique inject randomness?
Outcomes
Is there disparate impact against vulnerable groups?
Accuracy
Does the system perform equally well across different groups?
Deployment
Is the application of an AI system in itself biased?
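Two of these questions, disparate impact in outcomes and accuracy across groups, reduce to simple arithmetic once you have the system's predictions and some ground truth. A minimal sketch on invented toy data; the "four-fifths rule" referenced in the comment is a common benchmark from US employment guidelines, not a universal standard:

```python
from collections import defaultdict

# Toy records: (group, model_flagged, ground_truth). Invented data for illustration.
records = [
    ("A", True,  True), ("A", False, False), ("A", True, False), ("A", False, False),
    ("B", True,  False), ("B", True, False), ("B", True, True),  ("B", False, False),
]

def selection_rates(rows):
    """Share of each group flagged by the model."""
    flagged, total = defaultdict(int), defaultdict(int)
    for group, pred, _ in rows:
        total[group] += 1
        flagged[group] += pred
    return {g: flagged[g] / total[g] for g in total}

def group_accuracy(rows):
    """Share of correct predictions per group."""
    correct, total = defaultdict(int), defaultdict(int)
    for group, pred, truth in rows:
        total[group] += 1
        correct[group] += (pred == truth)
    return {g: correct[g] / total[g] for g in total}

rates = selection_rates(records)
# Disparate impact ratio: lowest selection rate over highest.
# Values below 0.8 are often treated as a red flag (the "four-fifths rule").
di_ratio = min(rates.values()) / max(rates.values())
acc = group_accuracy(records)
```

On real systems the hard part is obtaining the predictions and ground truth, which is exactly what the records requests below aim at.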
Our public records request
1a. The machine learning model file and source code used to train the model.
2a. Documents containing the full set of input variables.
2b. Data dictionaries or other documents that describe how the input variables are defined.
3a. Technical documentation related to the model.
3b. Handbooks or other manuals for how end users should interpret and act upon algorithmic outputs.
4a. Documents containing tests and/or evaluations of the model or algorithm.
4b. Documents describing the training data for the model.
5. Data protection, privacy, and/or human rights impact assessments.
A source points us to a buried report from the agency’s auditor
Simple data!!!
BLACK BOX
No backend access to Midjourney.
Methodology
Applications: Social media and misinformation
Applications of algorithms in social media
Social media algorithms promote some content more than others. It's not necessarily important for us to know how the algorithm works; it can be more useful for our audiences to track what is amplified and what kinds of actors and content are able to thrive on these platforms.
Three approaches
Looking at your audience
Quantified Selfie
Understanding media ecosystems: By interviewing community members and analyzing the YouTube viewing history of one person, The Markup found that many influencers translated conspiracy theories and misinformation from right-wing news sites into Vietnamese.
Quantified Selfie
Understanding impact on people’s relationships: asking for access to a person’s personal data to tell their data story.
We analyzed 2,367 posts from the Facebook News Feeds of a politically divided mother and daughter to show them just how different their online worlds are.
Scaling the quantified selfie
Documented worked with five migrants to download their TikTok history.
Documented identified about 300 videos that had been watched by most of the men. The men watched videos that provided partial or inaccurate information about vital matters, like how to fill out legal forms for requesting asylum, which, if acted upon, could derail their asylum processes and integration into American society.
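Finding videos shared across several users' histories is a simple aggregation once each history is exported. A sketch on invented placeholder data, assuming each history is a list of video IDs:

```python
from collections import defaultdict

# Hypothetical exported watch histories: user -> list of video IDs.
# All names and IDs are placeholders.
histories = {
    "user1": ["vid_asylum_form", "vid_music"],
    "user2": ["vid_asylum_form", "vid_news"],
    "user3": ["vid_asylum_form", "vid_music"],
    "user4": ["vid_news", "vid_asylum_form"],
    "user5": ["vid_music"],
}

def widely_watched(histories, min_users):
    """Video IDs appearing in at least `min_users` distinct histories."""
    viewers = defaultdict(set)
    for user, vids in histories.items():
        for vid in set(vids):  # de-duplicate rewatches within one history
            viewers[vid].add(user)
    return sorted(v for v, users in viewers.items() if len(users) >= min_users)

shared = widely_watched(histories, min_users=4)
```

The shared set is then the sample you fact-check by hand, as Documented did with the roughly 300 videos it identified.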
Bad actors
Looking at questionable content
Documented found 50+ YouTube channels selling the same product but advertising it using different phone numbers.
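Extracting and comparing the advertised phone numbers is one way to surface such a network at scale. A sketch with made-up channel descriptions and numbers, assuming a single phone-number format for simplicity:

```python
import re

# Hypothetical channel descriptions; names and numbers are invented.
channels = {
    "chanA": "Miracle cure! Call 0917-555-0101 now",
    "chanB": "Miracle cure! Call 0917-555-0199 now",
    "chanC": "Miracle cure! Call 0917-555-0101 now",
}

# Assumes one number format; real descriptions would need broader patterns.
PHONE_RE = re.compile(r"\b\d{4}-\d{3}-\d{4}\b")

def phones_by_channel(channels):
    """Map each channel to the set of phone numbers in its description."""
    return {c: set(PHONE_RE.findall(text)) for c, text in channels.items()}

numbers = phones_by_channel(channels)
# Channels pitching the same product under multiple distinct numbers:
distinct_numbers = set().union(*numbers.values())
```

Identical ad copy paired with many distinct contact numbers is the tell that a single operation is running the channels.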
Testing the system
Testing the system
In this story from ProPublica, reporters set up an account to place Facebook ads and found that the platform would not stop them from placing racist and discriminatory ads.
Testing the system
The Wall Street Journal created over 100 automated accounts, or bots, and assigned each a set of specific interests that were not disclosed to TikTok. They found that TikTok needs only one important piece of information to figure out what you want: the amount of time you linger over a piece of content. Every second you hesitate or rewatch, the app is tracking you.
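The core of that finding can be illustrated with a toy model: rank topics by total dwell time per video. All topic names and timings below are invented; this is an illustration of the signal the experiment isolated, not TikTok's actual algorithm:

```python
# Toy watch log for one bot: (topic_label, seconds_lingered). Invented values.
watch_log = [
    ("conspiracy", 38.0), ("cooking", 2.1), ("conspiracy", 41.5),
    ("sports", 3.0), ("conspiracy", 29.9), ("cooking", 1.5),
]

def inferred_interest(log):
    """Rank topics by total seconds lingered; return the strongest one."""
    totals = {}
    for topic, seconds in log:
        totals[topic] = totals.get(topic, 0.0) + seconds
    return max(totals, key=totals.get)

top = inferred_interest(watch_log)
```

Even this crude aggregation converges on the bot's assigned interest, which is why dwell time alone was enough for the Journal's bots to be profiled.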
Investigating algorithms
in data-scarce contexts
Case study: Ride-hailing apps
in the Philippines
Philippines
Is an algorithm the answer to the complex problem of commuting in Metro Manila?
The reporting challenge
What we had at the beginning
What we did: Collect and analyze data systematically
Manual:
We formed a team of 20 researchers who attempted to book rides for 10 routes across Metro Manila every hour from 6 a.m. to midnight for one week.
What we did: Collect and analyze data systematically
API data:
Our data collection yielded more than 8,000 entries
Data: Grab rides always included surge fees
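A claim like "rides always included surge fees" reduces to counting quotes with a nonzero surge component across the collected entries. A sketch on invented records, assuming each booking attempt logs its quoted surge fee:

```python
# Hypothetical booking records: (route, hour_of_day, surge_fee_in_pesos).
# The project's real dataset had 8,000+ entries; these values are invented.
bookings = [
    ("route1", 6, 25.0), ("route1", 7, 40.0), ("route1", 8, 15.0),
    ("route2", 6, 30.0), ("route2", 7, 0.0),  ("route2", 8, 55.0),
]

def surge_share(rows):
    """Fraction of quotes that included a nonzero surge fee."""
    return sum(1 for _, _, fee in rows if fee > 0) / len(rows)

share = surge_share(bookings)
```

Breaking the same count down by route and hour is what lets you say whether "surge" behaves like a temporary demand response or a permanent markup.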
Grab Philippines acknowledges our findings
“In some cases, high demand periods may persist, leading to prolonged surge pricing. The aim is to attract more drivers to areas with high demand, thereby reducing wait times over time. Continuous adjustments are made to ensure optimal service delivery.”
Illustrating the impact of gamifying apps on workers
We interviewed nearly 50 drivers and riders over the course of four months to learn about the risks.
Completing the story: Accountability questions
Learn more about our methodology
https://bit.ly/48aJKo2
Q&A + Reflections
Training opportunity: Reporting on AI Intensive
Applications due Nov. 27!
AI Spotlight Series Now Available Online