1 of 61

CMPSC 442: Artificial Intelligence

Lecture 16. AI Ethics

Rui Zhang

Spring 2024


2 of 61

3 of 61

How should AI systems behave, and who should decide?


https://openai.com/blog/how-should-ai-systems-behave/

4 of 61

The Human Factor in NLP

"The common misconception is that language has to do with words and what they mean. It doesn’t. It has to do with people and what they mean."

--- Herbert H. Clark & Michael F. Schober, 1992


5 of 61

Harm Caused by Bias in NLP Technology


6 of 61

Harm Caused by Bias in NLP Technology


https://www.theguardian.com/technology/2017/oct/24/facebook-palestine-israel-translates-good-morning-attack-them-arrest

7 of 61

Gender Bias in Word Embeddings
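A common way to surface this bias is to probe analogy directions in pretrained embeddings. Below is a minimal sketch using gensim and a public GloVe model; the specific model and word pairs are illustrative choices, not taken from the slide.

```python
import gensim.downloader as api

# Load a small public pretrained embedding (illustrative model choice).
model = api.load("glove-wiki-gigaword-100")

# Analogy probe: "man is to doctor as woman is to ?"
# Stereotyped completions (e.g., "nurse") expose gender bias in the vectors.
print(model.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))

# Direct comparison along a gendered direction.
print(model.similarity("doctor", "he"), model.similarity("doctor", "she"))
```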


8 of 61

Gender Bias in Text-to-Image Retrieval

Image search query “Doctor” (June 2017)


Slide Credit: Yulia Tsvetkov

9 of 61

Gender Bias in Text-to-Image Retrieval

Image search query “Nurse” (June 2017)


Slide Credit: Yulia Tsvetkov

10 of 61

Gender Bias in Machine Translation


https://arxiv.org/pdf/1809.02208.pdf

11 of 61

Gender Bias in Machine Translation

Google Translate renders gender-neutral Turkish sentences into gendered English, defaulting to stereotyped choices (e.g., the genderless pronoun "o" becomes "he" for a doctor and "she" for a nurse).


https://blog.google/products/translate/reducing-gender-bias-google-translate/

12 of 61

Social/Racial Bias in NLG of Dialog Systems


https://aclanthology.org/2020.findings-emnlp.291.pdf

13 of 61

Human Bias in Data

Human Reporting Bias

  • The frequency with which people write about actions, outcomes, or properties is not a reflection of real-world frequencies or the degree to which a property is characteristic of a class of individuals.
  • e.g., "Doctor" vs "Female Doctor"
  • e.g., "Banana" vs "Yellow Banana"


14 of 61

Human Bias in Data Collection and Annotation

Selection Bias

  • Selection does not reflect a random sample


http://turktools.net/crowdsourcing/

https://ai.googleblog.com/2018/09/introducing-inclusive-images-competition.html

15 of 61

Inductive Bias

The assumptions used by our model

  • recurrent neural networks for NLP assume that the sequential ordering of words is meaningful
  • features in discriminative models are assumed to be useful to map inputs to outputs


https://people.cs.umass.edu/~miyyer/cs685_f20/slides/18-ethics.pdf

16 of 61

Bias Amplification in Learned Models

Figure: dataset gender bias vs. model bias after training; the learned model amplifies the bias present in the training data.

Slide Credit: Mark Yatskar

17 of 61

Human Bias in Interpretation

Confirmation Bias: The tendency to search for, interpret, favor, and recall information in a way that confirms preexisting beliefs.

Overgeneralization: Coming to a conclusion based on information that is too general or not specific enough (related: overfitting).

Correlation Fallacy: Confusing correlation with causation.

Automation Bias: The propensity for humans to favor suggestions from automated decision-making systems over contradictory information made without automation.


Slide Credit: Margaret Mitchell

18 of 61

Algorithmic bias: Unjust, unfair, or prejudicial treatment of people related to race, income, sexual orientation, religion, gender, and other characteristics historically associated with discrimination and marginalization, when and where they manifest in algorithmic systems or algorithmically aided decision-making.


Diagram: human bias enters at every stage of the pipeline: data collection, data annotation, model training, and result interpretation.

19 of 61

Understand and Document your Data


https://arxiv.org/pdf/1803.09010.pdf

20 of 61

Also Be Responsible for your Model


https://arxiv.org/pdf/1810.03993.pdf

21 of 61

AI Ethics

Fairness and Bias

Security and Privacy

Transparency and Explainability

22 of 61

Appendix

23 of 61

Fair Abstractive Summarization of Diverse Perspectives

Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi

Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang

NAACL 2024


24 of 61

Are Large Language Models Fair Summarizers?


25 of 61

Conflicting Product Reviews


26 of 61

Diverse Perspectives and Conflicting Opinions

  • Product and Restaurant Reviews
  • Political Stances
  • Legal Cases
  • Scientific Debates

27 of 61

Value Pluralism and Fairness of Summarization

Value Pluralism: There are several values which may be equally correct and fundamental, and yet in conflict with each other.

Fair Summarization: A fair summary of user-generated data provides an accurate and comprehensive view of the various perspectives held by these groups.


28 of 61

PerspectiveSumm: A Benchmark for Fair Abstractive Summarization

Characteristics

  • Quality: Human-written inputs marked with clear, precise social attributes
  • Diversity: Cover various domains, forms, attributes, and values


29 of 61

PerspectiveSumm: Examples of Claritin and US Election


30 of 61

PerspectiveSumm: Examples of Yelp and Amazon


31 of 61

PerspectiveSumm: Examples of Supreme Court and IQ2 Debate


32 of 61

Summarization of Diverse Perspectives with Social Attributes

Social attributes indicate the properties that form groups of people:

  • Sentiment: Positive, Negative
  • Gender: Male, Female, Other
  • Party: Conservatism, Liberalism, Moderates

Diagram: sources (Source 1: positive review, Source 2: negative review) mapped to a target summary containing positive, negative, and neutral segments.

33 of 61

Definition of Fairness of Summarization

Figure: example contrasting a fair summary with an unfair summary.

34 of 61

Probing Fairness of LLMs through Summarization

Figure: fair vs. unfair summary examples when probing LLMs.

35 of 61

Existing Metrics are not Enough for Evaluating Fairness

  • We do not always have reference summaries.
  • Even if reference summaries are available, they are not always fair. (In fact, our later experiments show they are often not fair.)
  • Even if the reference summaries are fair, existing metrics (ROUGE/BLEU/BERTScore) capture similarity but cannot capture the notion of fairness.


36 of 61

Our Approach to Quantifying Fairness of Summaries


1. Quantify the distribution of values in both sources and targets.

2. Quantify the differences of value distributions between sources and targets.

37 of 61

Our Approach to Quantifying Fairness of Summaries


1. Quantify the distribution of values in both sources and targets.

2. Quantify the differences of value distributions between sources and targets.

38 of 61

Value Distribution of Text

We view text as a probability distribution of semantic units, e.g., tokens.

Each semantic unit maps to social attribute values.

This gives us the value distribution of the text!


39 of 61

Value Distribution of Source

Diagram: source text (positive and negative reviews) mapped to a source value distribution.

40 of 61

Value Distribution of Source


This is easy, as the metadata already contains the values: we can simply count the number of tokens for each value.
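A minimal counting sketch, assuming each source unit carries its attribute value in the metadata; the dictionary field names are illustrative.

```python
from collections import Counter

def source_value_distribution(sources):
    """Count tokens per attribute value and normalize to a distribution.

    sources: list of dicts like {"text": "...", "value": "positive"},
    where "value" is the social attribute taken from the metadata.
    """
    counts = Counter()
    for item in sources:
        counts[item["value"]] += len(item["text"].split())  # token count per value
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

# Example:
# source_value_distribution([
#     {"text": "great battery life, love it", "value": "positive"},
#     {"text": "screen cracked after a week", "value": "negative"},
# ])
```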

41 of 61

Value Distribution of Target

Diagram: target text (positive, negative, and neutral summary content) mapped to a target value distribution.

42 of 61

Value Distribution of Target

This is not easy due to the abstractive nature of summaries!

We explore two methods for estimating it (see the sketch after this list):

  1. N-gram Matching: find n-gram overlap between target and source for hard matching
  2. Neural Matching: use BERTScore/BARTScore for soft matching
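A minimal sketch of the hard n-gram (here, unigram) matching idea, reusing the labeled sources from above; the handling of ambiguous or unmatched tokens is a simplifying assumption.

```python
from collections import Counter

def target_value_distribution(summary, sources):
    """Attribute each summary token to a value via hard unigram overlap.

    summary: the generated summary text.
    sources: list of dicts like {"text": "...", "value": "positive"}.
    Tokens that match several values, or none, are skipped here for
    simplicity (a real implementation must resolve such cases).
    """
    vocab_by_value = {}
    for item in sources:
        vocab_by_value.setdefault(item["value"], set()).update(item["text"].split())

    counts = Counter()
    for token in summary.split():
        matches = [v for v, vocab in vocab_by_value.items() if token in vocab]
        if len(matches) == 1:          # unambiguous match only
            counts[matches[0]] += 1
    total = sum(counts.values()) or 1
    return {value: n / total for value, n in counts.items()}
```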


43 of 61

Our Approach to Quantifying Fairness of Summaries

43

1. Quantify the distribution of values in both sources and targets.

2. Quantify the differences of value distributions between sources and targets.

44 of 61

Summarization Fairness - Ratio Fairness

The target value distribution should follow the source value distribution.

Diagram: sources (Source 1: positive review, Source 2: negative review) and a target summary with positive, negative, and neutral segments.

45 of 61

Summarization Fairness - Equal Fairness

The target value distribution should follow the uniform value distribution, regardless of the source.

Diagram: sources (Source 1: positive review, Source 2: negative review) and a target summary with positive, negative, and neutral segments.

46 of 61

Summarization Fairness - User-Defined Fairness

The target value distribution should follow a user-defined distribution.

Diagram: sources (Source 1: positive review, Source 2: negative review) and a target summary.
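The three notions differ only in the reference distribution the target is compared against. A minimal sketch, read directly off the definitions above; the function and argument names are illustrative.

```python
def reference_distribution(kind, source_dist, user_dist=None):
    """Pick the reference distribution for a given fairness notion.

    kind: "ratio" -> follow the source value distribution
          "equal" -> follow the uniform distribution over values
          "user"  -> follow a user-defined distribution
    """
    if kind == "ratio":
        return dict(source_dist)
    if kind == "equal":
        values = list(source_dist)
        return {v: 1.0 / len(values) for v in values}
    if kind == "user":
        return dict(user_dist)
    raise ValueError(f"unknown fairness notion: {kind}")
```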

47 of 61

Metric 1 - Binary Unfair Rate (BUR)

Definition.

Binary Unfair Rate (BUR) outputs 1 if the sample is unfair and 0 otherwise.

A summary is fair if and only if, for every social attribute value, its share of the target value distribution is at least its share of the reference distribution.

This means no value is under-represented.
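A minimal sketch of such a check, assuming value distributions are dictionaries mapping attribute values to proportions; the tolerance argument is an illustrative detail, not from the slides.

```python
def binary_unfair(target_dist, reference_dist, tol=1e-6):
    """Return 1 if any value is under-represented in the target, else 0.

    target_dist / reference_dist: dicts mapping attribute values
    (e.g., "positive", "negative") to proportions summing to 1.
    """
    for value, ref_share in reference_dist.items():
        if target_dist.get(value, 0.0) < ref_share - tol:
            return 1  # at least one value is under-represented -> unfair
    return 0  # no value is under-represented -> fair
```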


48 of 61

Metric 2 - Unfair Error Rate (UER)

Definition.

Unfair Error Rate (UER) measures the distance between value distributions of sources and targets.

It computes the average percentage of values that are underrepresented.
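A hedged sketch of one plausible reading of this metric, averaging the per-value shortfall relative to the reference distribution; the exact formula in the paper may differ.

```python
def unfair_error_rate(target_dist, reference_dist):
    """One plausible reading of UER: the average shortfall (in percentage
    points) of each value's target share relative to its reference share."""
    shortfalls = [
        max(0.0, ref_share - target_dist.get(value, 0.0))
        for value, ref_share in reference_dist.items()
    ]
    return 100.0 * sum(shortfalls) / len(shortfalls)
```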


49 of 61

Sanity Check on Our Metric Quality by Extreme Synthetic Examples

We create pseudo-summaries by sampling from the source:

  • Biased Summary: Sample only male tweets
  • Balanced Summary: Sample both male and female tweets with balanced ratios


Our metrics do capture the difference in value distributions, and can thus be used to measure fairness.
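A minimal sketch of how such extreme pseudo-summaries could be constructed, assuming the source tweets come as (text, gender) pairs; the sampling details are illustrative.

```python
import random

def make_pseudo_summary(tweets, mode, k=10, seed=0):
    """Build an extreme pseudo-summary by sampling source tweets.

    tweets: list of (text, gender) pairs, gender in {"male", "female"}.
    mode:   "biased"   -> sample only male tweets
            "balanced" -> sample male and female tweets in equal numbers
    """
    rng = random.Random(seed)
    males = [t for t, g in tweets if g == "male"]
    females = [t for t, g in tweets if g == "female"]
    if mode == "biased":
        chosen = rng.sample(males, min(k, len(males)))
    else:  # balanced
        half = k // 2
        chosen = (rng.sample(males, min(half, len(males)))
                  + rng.sample(females, min(half, len(females))))
    return " ".join(chosen)
```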

50 of 61

Sanity Check on Our Metric Quality by Human Evaluation

We perform a two-stage human evaluation to understand how humans perceive the fairness of abstractive summaries.


  1. Sentence Fact Identification
  2. Summary Fairness Identification

51 of 61

Sanity Check on Our Metric Quality by Human Evaluation


High correlation between the proposed metrics and human evaluation.

52 of 61

How fair are the abstractive summaries generated by LLMs?


  • Many summaries generated by LLMs are not fair, as judged by our automatic metrics.
  • gpt-3.5-turbo and gpt-4 are in general better than the older text-davinci-003 and other small open-source models.
  • But we do not find strong evidence that gpt-4 is better than gpt-3.5-turbo.

53 of 61

How fair are the abstractive summaries generated by LLMs?


While an individual summary can be unfair, over the entire test set models do not generate more unfair summaries for one side than for the other.

54 of 61

How do humans perceive the fairness of abstractive summaries?


Our results indicate that many summaries generated by LLMs are not fair, as judged by our human evaluators.

55 of 61

How fair are the existing human-written reference summaries?


Interestingly, existing reference summaries are not fair either; they are even less fair than LLM-generated summaries.

56 of 61

How can we improve fairness of abstractive summarization?

We experimented with three simple ways to improve fairness, without modifying the LLMs themselves (see the sketch after this list):

  • Improving through Instructions
  • Improving through Hyperparameters
    • Summary Length
    • Decoding Temperature
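A minimal sketch of the instruction-prompting idea using an OpenAI-style chat API; the model name, prompt wording, temperature, and length values are illustrative assumptions, not the exact setup from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def fair_summarize(reviews, temperature=1.0, max_tokens=150):
    """Prompt an LLM for a summary with an explicit fairness instruction.

    reviews: list of review strings with mixed sentiments.
    The instruction wording, temperature, and length are illustrative;
    the slides only report that such knobs alleviate, but do not
    eliminate, unfairness.
    """
    prompt = (
        "Summarize the following reviews. Cover positive and negative "
        "opinions in proportion to how often they appear in the input, "
        "so that no perspective is under-represented.\n\n"
        + "\n".join(f"- {r}" for r in reviews)
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",                   # assumed model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,                 # higher values gave fairer summaries
        max_tokens=max_tokens,                   # medium lengths worked best
    )
    return response.choices[0].message.content
```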


57 of 61

How can we improve fairness of abstractive summarization?


58 of 61

How can we improve fairness of abstractive summarization?


Instruction prompting only alleviates, but does not eliminate, the issue.

59 of 61

How can we improve fairness of abstractive summarization?


Medium summary length works best: when there are too many or too few sentences, balancing the values in the summary is more difficult.

60 of 61

How can we improve fairness of abstractive summarization?


A higher decoding temperature helps because it allows more diverse generation, which improves fairness.

61 of 61

Conclusions and Future Work

  • Define fair abstractive summarization over diverse perspectives.
  • Collect a dataset PerspectiveSumm over various domains and social attributes.
  • Quantify the fairness of abstractive summarization by proposing several new metrics.
  • Benchmark LLMs to show that many summaries generated by LLMs are not fair, as judged by both our metrics and human evaluators.
  • Future work: While some simple methods can help, we need fundamental solutions: fairness-aware RLHF, fine-grained controllable generation, and complex-instruction-following capability.
