1 of 9

Reza Zafarani

rzafaran@syr.edu

CausalBench 2026

Evaluation in Social Media Research:

Challenges and Opportunities

2 of 9

Bias in Earlier Studies

Earlier studies in sociology or psychology

      • Small scale
      • Conducted with WEIRD (Western, Educated, Industrialized, Rich, and Democratic) participants
        • Roughly 12 percent of the world population
        • Accounting for 80% of participants in psychology and sociology experiments
      • Non-representative samples

Using Web and Social Media as an observation tool

      • Testing Well-Established Theories
      • Developing New Theories
      • Correcting for biases introduced by small sample sizes and WEIRD participants


3 of 9

This is open-ended…

    • Testing Well-Established Theories

    • Correcting for biases introduced by small sample sizes / WEIRD participants
            • Understanding users better
            • Large-scale studies

    • Developing New Theories

Engineering and Computer Science | Syracuse University


Dunbar’s Number: “humans can only comfortably maintain 150 stable relationships”

4 of 9

Data, Data, Data, D4ta

  • With the constant, rapid growth of social media data, researchers now have access to massive datasets for mining human behavior.
  • However, as in traditional machine learning, evaluating findings typically requires ground truth, which can be particularly challenging in social media, where ground truth may be unavailable, limited, or biased.

  • Finally, digital behavior, even when completely observed, can differ substantially from real-world behavior.

Zafarani, Reza, and Huan Liu. "Evaluation without ground truth in social media research." Communications of the ACM 58.6 (2015): 54-60.

5 of 9

No Ground Truth

  • Unlike traditional scientific domains, social media research often lacks ground truth, due to limited access to users and the difficulty of confirming the true intentions behind online behaviors. Possible remedies:
    • Leveraging the periodic nature of human behavior (e.g., in location-based social networks, LBSNs)
    • Crowdsourcing judgments
    • Using ensemble methods
    • Performing (quasi-)controlled experiments
    • Conducting randomized and natural experiments
    • Utilizing nonequivalent control groups
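One of the strategies above, crowdsourcing judgments, ultimately requires aggregating noisy worker labels into a proxy ground truth. A minimal sketch of majority-vote aggregation; the function name and the worker judgments below are hypothetical illustrations, not taken from the cited papers:

```python
from collections import Counter

def aggregate_crowd_labels(annotations):
    """Majority-vote aggregation of crowdsourced labels.

    annotations: dict mapping item id -> list of labels from workers.
    Returns dict mapping item id -> (majority label, agreement ratio).
    """
    aggregated = {}
    for item, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[item] = (label, votes / len(labels))
    return aggregated

# Hypothetical worker judgments for three posts
crowd = {
    "post1": ["spam", "spam", "ham"],
    "post2": ["ham", "ham", "ham"],
    "post3": ["spam", "ham", "spam"],
}
result = aggregate_crowd_labels(crowd)  # e.g., result["post2"] == ("ham", 1.0)
```

The agreement ratio alongside each label is a cheap signal of annotation reliability; items with low agreement can be re-labeled or down-weighted.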

  • Kumar, S., Zafarani, R., and Liu, H. "Understanding user migration patterns in social media." Proceedings of the AAAI Conference on Artificial Intelligence 25.1 (2011): 1204-1209.
  • Zhou, Xinyi, Kai Shu, Vir V. Phoha, Huan Liu, and Reza Zafarani. "'This is Fake! Shared it by Mistake': Assessing the Intent of Fake News Spreaders." The Web Conference 2022.

6 of 9

Limited Ground Truth

  • While the traditional solution in ML has been to collect more data, this is often infeasible in social media research.
    • Instead, achieving maximum performance by
      • Utilizing limited (or minimal) information
      • Relying heavily on domain knowledge
      • Performing additional processing and applying more advanced techniques

Cao, Zhaoyang, John Nguyen, and Reza Zafarani. "Is Less Really More? Fake News Detection with Limited Information." ACM SIGKDD Explorations Newsletter 27.1 (2025): 20-31.

Cai, Weibin, Jiayu Li, and Reza Zafarani. "Unpacking Hateful Memes: Presupposed Context and False Claims." arXiv preprint arXiv:2510.09935 (2025).

Rajadesingan, Ashwin, Reza Zafarani, and Huan Liu. "Sarcasm detection on twitter: A behavioral modeling approach." Proceedings of the eighth ACM international conference on web search and data mining. 2015.

7 of 9

Biased or Subjective Ground Truth

  • As labeling human behavior data can be subjective, ground truth can be biased in many tasks. In hate speech detection, for example, annotators differ in their perspectives, criteria, and background knowledge, and may label the same content differently.
  • Treating such labels as uniform ground truth can mislead models.

Cai, Weibin, and Reza Zafarani. "Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection." arXiv preprint arXiv:2510.13837 (2025).

Pairwise hate-speech label agreement ratios between annotators from five countries: United Kingdom (GB), United States (US), Australia (AU), South Africa (ZA), and Singapore (SG).
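Agreement ratios like those in the figure can be computed directly from parallel annotations. A minimal sketch; the two annotators' binary hate/non-hate judgments below are made up for illustration, not the study's data:

```python
def pairwise_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical judgments (1 = hateful, 0 = not) from a US and a GB annotator
us = [1, 0, 1, 1, 0, 0, 1, 0]
gb = [1, 0, 0, 1, 0, 1, 1, 0]
ratio = pairwise_agreement(us, gb)  # 6 of 8 items agree -> 0.75
```

Computing this for every pair of annotator groups yields the kind of cross-country agreement matrix the figure summarizes, and low off-diagonal values are exactly the signal that a single "uniform" ground truth is missing.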

8 of 9

Decomposing Humans/Subjectivity

  • Idea: each individual lies at the intersection of a series of cultural parameters
  • Learning hate subspaces (a variation of matrix factorization)

  • where the Y terms capture culture-to-content relationships
  • Results: substantially outperforms state-of-the-art (LLM and non-LLM) baselines

  • Opportunities for estimating personalized/biased causal effects (estimating treatment effects on an individualized basis)
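The factorization idea can be sketched as decomposing an annotator-by-content label matrix into low-rank factors, where one factor plays the role of the culture-to-content Y terms. This is a generic matrix-factorization toy in NumPy on a synthetic label matrix; it is not the paper's actual model, data, or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotator-by-content label matrix R (1 = hateful, 0 = not):
# two pairs of annotators from two cultural backgrounds who disagree on items.
R = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
], dtype=float)

k = 2                                           # latent cultural dimensions
U = 0.1 * rng.standard_normal((R.shape[0], k))  # annotator (culture) factors
Y = 0.1 * rng.standard_normal((k, R.shape[1]))  # culture-to-content factors

lr, reg = 0.05, 0.01
for _ in range(2000):
    E = R - U @ Y                  # reconstruction error
    U += lr * (E @ Y.T - reg * U)  # gradient step on regularized squared loss
    Y += lr * (U.T @ E - reg * Y)

R_hat = U @ Y  # annotator-specific predicted labels
```

Rather than collapsing disagreeing annotators into one label, the factorization keeps a separate prediction per annotator, which is what makes individualized (culture-aware) effect estimation conceivable.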

9 of 9

Finally… External Validity

  • In studies of online human behavior, external validity concerns how online human behavior relates to real-world human behavior.
  • In some cases, we can conduct "digital twin" RCTs: one in social media and one in the real world. There are opportunities in connecting the two to correct biases.
  • There have been some limited studies in this domain, but perhaps this community can help move this area further…