1 of 26

Data

Fall 2024

2 of 26

Outline

  1. History of “Big Data”
  2. Symbiosis between data and algorithms
  3. Big Data & Society
  4. The Great Hack

3 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

4 of 26

Big Data

Boyd & Crawford (2012) – in their influential paper about big data – examine the idea of mass-scale data collection. They define “big data” along three dimensions:

  1. Technology - Means of gathering, aggregating, linking, comparing data
  2. Analysis - Means of identifying patterns to make knowledge claims
  3. “Mythology” - Idea of a higher, more objective way of producing knowledge


5 of 26

What was the world like before Big Data?

In the pre-Internet era, databases were...

  • Not networked
  • Designed for one context and used in that context (no “boomerang effect”)
  • Not generated and queried at scale in realtime
  • Not storing a lot of ‘personal data’
  • Accompanied by metadata documenting data provenance; data collected about people often involved a consent process.


6 of 26

...but of course today...

  • New data collection practices (at scale): sensors; trackers; shopping, browsing, posting; logging; intercepting communication transmissions. “A surveillance apparatus.”
  • Ease of replication and aggregation: data copied, replicated, remixed, stitched together, bought and sold, reprocessed, etc.
  • New statistical & computational methods: ML/Data Science (finding correlations, deriving features, classifying entities, and making predictions)


7 of 26

How have these new developments impacted us?

  • Who can you call if you want to take it down?
  • What is the difference between a one-off records request and actually downloading an entire public records database?


8 of 26

Atlas of AI, Chapter 3. Data

“It’s that the NIST databases foreshadow the emergence of a logic that has now thoroughly pervaded the tech sector: the unswerving belief that everything is data and is there for the taking. It doesn’t matter where a photograph was taken or whether it reflects a moment of vulnerability or pain or if it represents a form of shaming the subject. It has become so normalized across the industry to take and use whatever is available that few stop to question the underlying politics. … It is all treated as data to be run through functions, material to be ingested to improve technical performance. This is a core premise in the ideology of data extraction.”


9 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
    • Quick ML / Deep Learning primer
  • Big Data & Society
  • The Great Hack

10 of 26

Machine Learning: “Traditional Approach”

  • Find features using known, domain-specific math functions / approaches.
  • Classify resulting patterns to outputs (cat, dog, bird).
  • Eventually, your model will be able to recognize new instances it hasn’t seen before.

Source: Mathworks
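The pipeline above can be sketched in code. This is a minimal toy illustration, not any particular library's API: the "images", the hand-crafted feature function, and the nearest-centroid classifier are all illustrative stand-ins for domain-specific feature extraction followed by classification.

```python
# Traditional ML sketch: hand-crafted features first, then a simple
# classifier maps feature vectors to labels ("cat", "dog", "bird"...).

def extract_features(image):
    """Hand-crafted, domain-specific features:
    mean brightness and fraction of dark pixels."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    dark = sum(1 for p in flat if p < 0.5) / len(flat)
    return (mean, dark)

def train_centroids(labeled_images):
    """Average the feature vectors per class (the learned "pattern")."""
    sums, counts = {}, {}
    for image, label in labeled_images:
        f = extract_features(image)
        s = sums.setdefault(label, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in s) for lab, s in sums.items()}

def classify(image, centroids):
    """Assign a new, unseen instance to the nearest class centroid."""
    f = extract_features(image)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(f, centroids[lab])))

# Tiny toy "images" (2x2 grids of brightness values), two classes.
train = [([[0.9, 0.8], [0.9, 0.7]], "bright"),
         ([[0.1, 0.2], [0.3, 0.1]], "dark")]
centroids = train_centroids(train)
print(classify([[0.85, 0.9], [0.8, 0.95]], centroids))  # → bright
```

The key point of the "traditional" approach is that `extract_features` is written by a human expert; only the pattern-to-label mapping is learned from data.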

11 of 26

Machine Learning: Traditional Approach

Known, domain-specific math functions used

Source: Mathworks

12 of 26

Deep Learning (type of ML): Newer Approach

  • Features are learned by the network over time.
  • Each “layer” of the network represents a type of learned feature, but those features need not have any theoretical basis (instead, they are derived from the data).

Source: Mathworks
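The contrast with the previous sketch can be shown with a tiny pure-Python network. Here nothing is hand-crafted: the hidden layer's weights (the "features") start random and are adjusted by gradient descent. The task (XOR), the network sizes, and the learning rate are all arbitrary illustrative choices.

```python
# Deep-learning sketch: features are learned, not designed.
import math, random

random.seed(0)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# 2 inputs -> 2 hidden units -> 1 output; all weights start random.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # incl. bias
w_out = [random.uniform(-1, 1) for _ in range(3)]                         # incl. bias

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR task

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss_before = mse()
for _ in range(5000):                      # gradient-descent training loop
    for x, t in data:
        h, y = forward(x)
        d_y = (y - t) * y * (1 - y)        # output-layer error signal
        for j in range(2):                 # backpropagate into hidden layer:
            d_h = d_y * w_out[j] * h[j] * (1 - h[j])
            w_hidden[j][0] -= 0.5 * d_h * x[0]
            w_hidden[j][1] -= 0.5 * d_h * x[1]
            w_hidden[j][2] -= 0.5 * d_h
        w_out[0] -= 0.5 * d_y * h[0]       # then update the output layer
        w_out[1] -= 0.5 * d_y * h[1]
        w_out[2] -= 0.5 * d_y
loss_after = mse()
print(loss_before, "->", loss_after)       # loss drops as features are learned
```

Note that `w_hidden` ends up encoding feature detectors no one designed; as the slide says, they are derived from the data rather than from theory.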

13 of 26

Deep Learning: Newer Approach

Source: Mathworks

14 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

15 of 26

Big Data: Big Ideas and Societal Implications

  1. Big data changes the definition of knowledge
  2. Claims to objectivity and accuracy are misleading
  3. Bigger data are not always better data
  4. Just because it’s accessible, doesn’t make it ethical
  5. Limited access to big data creates new digital divides


16 of 26

1. Big data changes the definition of knowledge

  • Privileges quantification: “Other methods for ascertaining why people do things, write things, or make things are lost in the sheer volume of numbers.”
  • Influences the direction of knowledge production: The available data drive the research questions.
    • Example: if we’re missing historical data, we just don’t ask those questions
  • Diminishes the importance of understanding mechanism


17 of 26

Example 2. WordNet, ImageNet, and mTurk

Ghost Work / Gig Work: “Services delivered by companies like Amazon, Google, Microsoft, and Uber rely on a vast, invisible human labor force – who usually earn less than legal minimums for traditional work, have no health benefits, and can be fired at any time for any reason, or none.”

  • Data labeling for $1/hour imbues the labelers’ subjectivities and biases into the data (e.g. WordNet and ImageNet);
  • Training Chatbots to respond in a more human-like way
    • Content moderation tasks: poorly compensated and profoundly psychologically damaging

18 of 26

2. Claims to Objectivity & Accuracy are Misleading

  • Human judgement is a fundamental part of data science practice — from the research questions asked, to how data are procured, cleaned, clustered & classified, analyzed, and applied. Yet, we talk about data as if they were objective.
  • As datasets propagate without information about how they were made, we lose our ability to account for bias
  • Datasets that inform our language models:
    • Enron emails
    • IBM Lawsuit
    • South American Terrorists in the 1980s


19 of 26

2. Claims to Objectivity & Accuracy are Misleading

  • Risk of seeing statistically significant connections that aren’t there
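A small simulation makes this risk concrete: if you screen enough unrelated variables against an outcome, pure noise will produce "significant-looking" correlations. Everything below is simulated random data, so any strong correlation found is spurious by construction.

```python
# Multiple-comparisons sketch: mining many random "features" for
# correlation with a random target yields impressive-looking hits.
import random

random.seed(1)
n = 30          # observations
k = 200         # unrelated random "features" we screen

target = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Screen k independent noise features against the noise target.
best = max(abs(corr([random.gauss(0, 1) for _ in range(n)], target))
           for _ in range(k))
print(f"strongest 'correlation' among {k} random features: {best:.2f}")
```

With big data the number of candidate variables is enormous, so this effect is the default, not the exception; the "connection" found here would vanish on fresh data.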


20 of 26

Example. Gender Shades: Joy Buolamwini

Systematically compared three facial recognition systems (Microsoft, Face++, and IBM) across male and female faces with different skin shades.
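The core evaluation move in Gender Shades is to disaggregate: instead of one overall accuracy number, report error rates per subgroup. The sketch below shows that computation; the prediction records are made-up illustrative data, not results from any real system in the study.

```python
# Disaggregated evaluation sketch: error rates by (gender, skin) subgroup.
from collections import defaultdict

# (true_gender, skin_group, predicted_gender) — hypothetical records
results = [
    ("F", "darker", "M"), ("F", "darker", "F"), ("F", "darker", "M"),
    ("F", "lighter", "F"), ("F", "lighter", "F"),
    ("M", "darker", "M"), ("M", "darker", "M"),
    ("M", "lighter", "M"), ("M", "lighter", "M"), ("M", "lighter", "M"),
]

errors, totals = defaultdict(int), defaultdict(int)
for true, skin, pred in results:
    key = (true, skin)
    totals[key] += 1
    errors[key] += (pred != true)

for key in sorted(totals):
    rate = errors[key] / totals[key]
    print(key, f"error rate = {rate:.0%}")
```

In this toy data the aggregate accuracy looks decent while one subgroup (darker-skinned female faces) bears most of the errors, which is exactly the pattern aggregate metrics hide.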

21 of 26

3. Bigger data are not always better data

Importance of systematicity over volume:

  • “How people behave / act” != Reddit + Twitter + Tiktok
    • Huge biases (over / underrepresentation) in these data.
  • Data are also skewed from the beginning, because we don’t know how platforms have already altered them
    • Example: researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate.


22 of 26

3. Bigger data are not always better data

  • Access issues: very few people actually have access to all of the Twitter data (“the firehose”). In general, ‘the public’ can only access a subset.
  • Compounding errors as data are stitched together


23 of 26

4. Just because it’s accessible, doesn’t make it ethical

  • Just because content is publicly accessible does not mean that it was meant to be consumed by just anyone.
  • Considerable difference between being in public (i.e. sitting in a park) and being public (i.e. actively courting attention) (boyd & Marwick 2011).
  • Issues of Power: There are also significant questions of truth, control, and power in Big Data studies: researchers (and companies and governments) have the tools and the access, while social media users as a whole do not.
  • Clearview AI Lawsuit


24 of 26

5. Limited access creates new digital divides

  • Who gets access? For what purposes? In what contexts? And with what constraints? For the most part, only the social media companies themselves (or privileged researchers)
  • Who has the skills?
  • Who is asking the questions determines which questions are asked (Harding 2010; Forsythe 2001).
  • If you have to get permission from a SM company, what kinds of questions will you be allowed to ask?
  • ‘Effective democratisation can always be measured by… participation in and access to the archive, its constitution, and its interpretation’ (Derrida, 1996)


25 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

26 of 26

We’re going to start watching�The Great Hack
