1 of 26

Data

Fall 2024

2 of 26

Outline

  1. History of “Big Data”
  2. Symbiosis between data and algorithms
  3. Big Data & Society
  4. The Great Hack

3 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

4 of 26

Big Data

Boyd & Crawford (2012) – in their influential paper about big data – examine the idea of mass-scale data collection. They define “big data” along three dimensions:

  1. Technology - Means of gathering, aggregating, linking, comparing data
  2. Analysis - Means of identifying patterns to make knowledge claims
  3. “Mythology” - Idea of a higher, more objective way of producing knowledge


5 of 26

What was the world like before Big Data?

In the pre-Internet era, databases were...

  • Not networked
  • Designed for one context and used in that context (no “boomerang effect”)
  • Not generated and queried at scale in realtime
  • Not storing a lot of ‘personal data’
  • Accompanied by metadata documenting data provenance; data collected about people often involved a consent process.


6 of 26

...but of course today...

  • New data collection practices (at scale): sensors; trackers; shopping, browsing, posting; logging; intercepting communication transmissions. “A surveillance apparatus.”
  • Ease of replication and aggregation: data copied, replicated, remixed, stitched together, bought and sold, reprocessed, etc.
  • New statistical & computational methods: ML/Data Science (finding correlations, deriving features, classifying entities, and making predictions)


7 of 26

How have these new developments impacted us?

  • Who can you call if you want to take it down?
  • What is the difference between a one-off records request and actually downloading an entire public records database?


8 of 26

Atlas of AI, Chapter 3. Data

“It’s that the NIST databases foreshadow the emergence of a logic that has now thoroughly pervaded the tech sector: the unswerving belief that everything is data and is there for the taking. It doesn’t matter where a photograph was taken or whether it reflects a moment of vulnerability or pain or if it represents a form of shaming the subject. It has become so normalized across the industry to take and use whatever is available that few stop to question the underlying politics. … It is all treated as data to be run through functions, material to be ingested to improve technical performance. This is a core premise in the ideology of data extraction.”


9 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
    • Quick ML / Deep Learning primer
  • Big Data & Society
  • The Great Hack

10 of 26

Machine Learning: “Traditional Approach”

  • Find features using known, domain-specific math functions / approaches.
  • Classify resulting patterns to outputs (cat, dog, bird).
  • Eventually, your model will be able to recognize new instances it hasn’t seen before.

Source: Mathworks
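The pipeline above can be sketched in code. This is a minimal toy illustration, not any particular library's API: the "images", the hand-crafted feature function, and the nearest-centroid classifier are all illustrative stand-ins for domain-specific feature extraction followed by classification.

```python
# Traditional ML sketch: hand-crafted features first, then a simple
# classifier maps feature vectors to labels ("cat", "dog", "bird"...).

def extract_features(image):
    """Hand-crafted, domain-specific features:
    mean brightness and fraction of dark pixels."""
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    dark = sum(1 for p in flat if p < 0.5) / len(flat)
    return (mean, dark)

def train_centroids(labeled_images):
    """Average the feature vectors per class (the learned "pattern")."""
    sums, counts = {}, {}
    for image, label in labeled_images:
        f = extract_features(image)
        s = sums.setdefault(label, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: tuple(v / counts[lab] for v in s) for lab, s in sums.items()}

def classify(image, centroids):
    """Assign a new, unseen instance to the nearest class centroid."""
    f = extract_features(image)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2
                                   for a, b in zip(f, centroids[lab])))

# Tiny toy "images" (2x2 grids of brightness values), two classes.
train = [([[0.9, 0.8], [0.9, 0.7]], "bright"),
         ([[0.1, 0.2], [0.3, 0.1]], "dark")]
centroids = train_centroids(train)
print(classify([[0.85, 0.9], [0.8, 0.95]], centroids))  # → bright
```

The key point of the "traditional" approach is that `extract_features` is written by a human expert; only the pattern-to-label mapping is learned from data.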

11 of 26

Machine Learning: Traditional Approach

Known, domain-specific math functions used

Source: Mathworks

12 of 26

Deep Learning (type of ML): Newer Approach

  • Features are learned by the network over time.
  • Each “layer” of the network represents a type of learned feature, but those features need not have any theoretical basis (instead, they are derived from the data).

Source: Mathworks
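The contrast with the previous sketch can be shown with a tiny pure-Python network. Here nothing is hand-crafted: the hidden layer's weights (the "features") start random and are adjusted by gradient descent. The task (XOR), the network sizes, and the learning rate are all arbitrary illustrative choices.

```python
# Deep-learning sketch: features are learned, not designed.
import math, random

random.seed(0)
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

# 2 inputs -> 2 hidden units -> 1 output; all weights start random.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # incl. bias
w_out = [random.uniform(-1, 1) for _ in range(3)]                         # incl. bias

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR task

def forward(x):
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, y

def mse():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

loss_before = mse()
for _ in range(5000):                      # gradient-descent training loop
    for x, t in data:
        h, y = forward(x)
        d_y = (y - t) * y * (1 - y)        # output-layer error signal
        for j in range(2):                 # backpropagate into hidden layer:
            d_h = d_y * w_out[j] * h[j] * (1 - h[j])
            w_hidden[j][0] -= 0.5 * d_h * x[0]
            w_hidden[j][1] -= 0.5 * d_h * x[1]
            w_hidden[j][2] -= 0.5 * d_h
        w_out[0] -= 0.5 * d_y * h[0]       # then update the output layer
        w_out[1] -= 0.5 * d_y * h[1]
        w_out[2] -= 0.5 * d_y
loss_after = mse()
print(loss_before, "->", loss_after)       # loss drops as features are learned
```

Note that `w_hidden` ends up encoding feature detectors no one designed; as the slide says, they are derived from the data rather than from theory.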

13 of 26

Deep Learning: Newer Approach

Source: Mathworks

14 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

15 of 26

Big Data: Big Ideas and Societal Implications

  1. Big data changes the definition of knowledge
  2. Claims to objectivity and accuracy are misleading
  3. Bigger data are not always better data
  4. Just because it’s accessible, doesn’t make it ethical
  5. Limited access to big data creates new digital divides


16 of 26

1. Big data changes the definition of knowledge

  • Privileges quantification: “Other methods for ascertaining why people do things, write things, or make things are lost in the sheer volume of numbers.”
  • Influences the direction of knowledge production: The available data drive the research questions.
    • Example: if we’re missing historical data, we just don’t ask those questions
  • Diminishes the importance of understanding mechanism


17 of 26

Example 2. WordNet, ImageNet, and mTurk

Ghost Work / Gig Work: “Services delivered by companies like Amazon, Google, Microsoft, and Uber rely on a vast, invisible human labor force – who usually earn less than legal minimums for traditional work, have no health benefits, and can be fired at any time for any reason, or none.”

  • Data labeling for $1/hour imbues the labelers’ subjectivities and biases into the data (e.g. WordNet and ImageNet);
  • Training Chatbots to respond in a more human-like way
    • Content moderation tasks: poorly compensated and profoundly psychologically damaging

18 of 26

2. Claims to Objectivity & Accuracy are Misleading

  • Human judgement is a fundamental part of data science practice — from the research questions asked, to how data are procured, cleaned, clustered & classified, analyzed, and applied. Yet, we talk about data as if they were objective.
  • As datasets propagate without information about how they were made, we lose our ability to account for bias
  • Datasets that inform our language models:
    • Enron emails
    • IBM Lawsuit
    • South American Terrorists in the 1980s


19 of 26

2. Claims to Objectivity & Accuracy are Misleading

  • Risk of seeing statistically significant connections that aren’t there
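A small simulation makes this risk concrete: if you screen enough unrelated variables against an outcome, pure noise will produce "significant-looking" correlations. Everything below is simulated random data, so any strong correlation found is spurious by construction.

```python
# Multiple-comparisons sketch: mining many random "features" for
# correlation with a random target yields impressive-looking hits.
import random

random.seed(1)
n = 30          # observations
k = 200         # unrelated random "features" we screen

target = [random.gauss(0, 1) for _ in range(n)]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Screen k independent noise features against the noise target.
best = max(abs(corr([random.gauss(0, 1) for _ in range(n)], target))
           for _ in range(k))
print(f"strongest 'correlation' among {k} random features: {best:.2f}")
```

With big data the number of candidate variables is enormous, so this effect is the default, not the exception; the "connection" found here would vanish on fresh data.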


20 of 26

Example. Gender Shades: Joy Buolamwini

Systematically compared three facial recognition systems (Microsoft, Face++, and IBM) across male and female faces with different skin shades.
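The core evaluation move in Gender Shades is to disaggregate: instead of one overall accuracy number, report error rates per subgroup. The sketch below shows that computation; the prediction records are made-up illustrative data, not results from any real system in the study.

```python
# Disaggregated evaluation sketch: error rates by (gender, skin) subgroup.
from collections import defaultdict

# (true_gender, skin_group, predicted_gender) — hypothetical records
results = [
    ("F", "darker", "M"), ("F", "darker", "F"), ("F", "darker", "M"),
    ("F", "lighter", "F"), ("F", "lighter", "F"),
    ("M", "darker", "M"), ("M", "darker", "M"),
    ("M", "lighter", "M"), ("M", "lighter", "M"), ("M", "lighter", "M"),
]

errors, totals = defaultdict(int), defaultdict(int)
for true, skin, pred in results:
    key = (true, skin)
    totals[key] += 1
    errors[key] += (pred != true)

for key in sorted(totals):
    rate = errors[key] / totals[key]
    print(key, f"error rate = {rate:.0%}")
```

In this toy data the aggregate accuracy looks decent while one subgroup (darker-skinned female faces) bears most of the errors, which is exactly the pattern aggregate metrics hide.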

21 of 26

3. Bigger data are not always better data

Importance of systematicity over volume:

  • “How people behave / act” != Reddit + Twitter + Tiktok
    • Huge biases (over / underrepresentation) in these data.
  • Data are also skewed from the beginning, because we don’t know how platforms have already altered them
    • Example: researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate.


22 of 26

3. Bigger data are not always better data

  • Access issues: very few people actually have access to all of the Twitter data (“the firehose”). In general, ‘the public’ can only access a subset.
  • Compounding errors as data are stitched together


23 of 26

4. Just because it’s accessible, doesn’t make it ethical

  • Just because content is publicly accessible does not mean that it was meant to be consumed by just anyone.
  • Considerable difference between being in public (i.e. sitting in a park) and being public (i.e. actively courting attention) (boyd & Marwick 2011).
  • Issues of Power: There are also significant questions of truth, control, and power in Big Data studies: researchers (and companies and governments) have the tools and the access, while social media users as a whole do not.
  • Clearview AI Lawsuit


24 of 26

5. Limited access creates new digital divides

  • Who gets access? For what purposes? In what contexts? And with what constraints? For the most part, only the social media companies themselves (or privileged researchers)
  • Who has the skills?
  • Who is asking the questions determines which questions are asked (Harding 2010; Forsythe 2001).
  • If you have to get permission from a SM company, what kinds of questions will you be allowed to ask?
  • ‘Effective democratisation can always be measured by… participation in and access to the archive, its constitution, and its interpretation’ (Derrida, 1996)


25 of 26

Outline

  • History of “Big Data”
  • Symbiosis between data and algorithms
  • Big Data & Society
  • The Great Hack

26 of 26

We’re going to start watching�The Great Hack
