1 of 9

Reza Zafarani

rzafaran@syr.edu

CausalBench 2026

Evaluation in Social Media Research:

Challenges and Opportunities

2 of 9

Bias in Earlier Studies

Earlier studies in sociology or psychology

      • Small scale
      • Conducted with WEIRD (Western, Educated, Industrialized, Rich, and Democratic) participants
        • Roughly 12 percent of the world population
        • Accounting for 80% of participants in psychology and sociology experiments
      • Non-representative samples

Using Web and Social Media as an observation tool

      • Testing Well-Established Theories
      • Developing New Theories
      • Correcting for biases introduced by small sample sizes and WEIRD participants


3 of 9

This is open-ended…

    • Testing Well-Established Theories

    • Correcting for biases introduced by small sample sizes / WEIRD participants
            • Understanding users better
            • Large-scale studies

    • Developing New Theories

Engineering and Computer Science | Syracuse University


Dunbar’s Number: “humans can only comfortably maintain 150 stable relationships”

4 of 9

Data, Data, Data, D4ta

  • With the constant, rapid growth of social media data, researchers now have access to massive datasets for mining human behavior.
  • However, as in traditional machine learning, evaluating findings typically requires ground truth, which can be particularly challenging in social media, where ground truth may be unavailable, limited, or biased.

  • Finally, digital behavior, even when completely observed, can differ substantially from real-world behavior.

Zafarani, Reza, and Huan Liu. "Evaluation without ground truth in social media research." Communications of the ACM 58.6 (2015): 54-60.

5 of 9

No Ground Truth

  • Unlike traditional scientific domains, social media research often lacks ground truth, due to limited access to users and the difficulty of confirming the true intentions behind online behaviors. Possible remedies:
    • Leveraging the periodic nature of human behavior (e.g., in location-based social networks, LBSNs)
    • Crowdsourcing judgments
    • Using ensemble methods
    • Performing (quasi-)controlled experiments
    • Conducting randomized and natural experiments
    • Utilizing nonequivalent control groups
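One of the strategies above, crowdsourcing judgments, ultimately requires aggregating noisy worker labels into a proxy ground truth. A minimal sketch of majority-vote aggregation; the function name and the worker judgments below are hypothetical illustrations, not taken from the cited papers:

```python
from collections import Counter

def aggregate_crowd_labels(annotations):
    """Majority-vote aggregation of crowdsourced labels.

    annotations: dict mapping item id -> list of labels from workers.
    Returns dict mapping item id -> (majority label, agreement ratio).
    """
    aggregated = {}
    for item, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[item] = (label, votes / len(labels))
    return aggregated

# Hypothetical worker judgments for three posts
crowd = {
    "post1": ["spam", "spam", "ham"],
    "post2": ["ham", "ham", "ham"],
    "post3": ["spam", "ham", "spam"],
}
result = aggregate_crowd_labels(crowd)  # e.g., result["post2"] == ("ham", 1.0)
```

The agreement ratio alongside each label is a cheap signal of annotation reliability; items with low agreement can be re-labeled or down-weighted.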

  • Kumar, S., Zafarani, R., and Liu, H. "Understanding user migration patterns in social media." Proceedings of the AAAI Conference on Artificial Intelligence 25.1 (2011): 1204-1209.
  • Zhou, Xinyi, Kai Shu, Vir V. Phoha, Huan Liu, and Reza Zafarani. "'This is Fake! Shared it by Mistake': Assessing the Intent of Fake News Spreaders." The Web Conference 2022.

6 of 9

Limited Ground Truth

  • While the traditional solution in ML has been to collect more data, this is often infeasible in social media research.
    • Instead, achieving maximum performance by
      • Utilizing limited (or minimal) information
      • Relying heavily on domain knowledge
      • Performing additional processing and applying more advanced techniques

Cao, Zhaoyang, John Nguyen, and Reza Zafarani. "Is Less Really More? Fake News Detection with Limited Information." ACM SIGKDD Explorations Newsletter 27.1 (2025): 20-31.

Cai, Weibin, Jiayu Li, and Reza Zafarani. "Unpacking Hateful Memes: Presupposed Context and False Claims." arXiv preprint arXiv:2510.09935 (2025).

Rajadesingan, Ashwin, Reza Zafarani, and Huan Liu. "Sarcasm detection on twitter: A behavioral modeling approach." Proceedings of the eighth ACM international conference on web search and data mining. 2015.

7 of 9

Biased or Subjective Ground Truth

  • As labeling human behavior data can be subjective, ground truth can be biased in many tasks. In hate speech detection, for example, annotators differ in their perspectives, criteria, and background knowledge, and may label the same content differently.
  • Treating such labels as uniform ground truth can mislead models.

Cai, Weibin, and Reza Zafarani. "Seeing Hate Differently: Hate Subspace Modeling for Culture-Aware Hate Speech Detection." arXiv preprint arXiv:2510.13837 (2025).

Pairwise hate-speech label agreement ratios between annotators from five countries: United Kingdom (GB), United States (US), Australia (AU), South Africa (ZA), and Singapore (SG).
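Agreement ratios like those in the figure can be computed directly from parallel annotations. A minimal sketch; the two annotators' binary hate/non-hate judgments below are made up for illustration, not the study's data:

```python
def pairwise_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical judgments (1 = hateful, 0 = not) from a US and a GB annotator
us = [1, 0, 1, 1, 0, 0, 1, 0]
gb = [1, 0, 0, 1, 0, 1, 1, 0]
ratio = pairwise_agreement(us, gb)  # 6 of 8 items agree -> 0.75
```

Computing this for every pair of annotator groups yields the kind of cross-country agreement matrix the figure summarizes, and low off-diagonal values are exactly the signal that a single "uniform" ground truth is missing.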

8 of 9

Decomposing Humans/Subjectivity

  • Idea: each individual lies at the intersection of a series of cultural parameters
  • Learning hate subspaces (a variation of matrix factorization)

  • where the Y terms capture culture-to-content relationships
  • Results: substantially outperforms state-of-the-art (LLM and non-LLM) baselines

  • Opportunities for estimating personalized/biased causal effects (estimating treatment effects on an individualized basis)
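The factorization idea can be sketched as decomposing an annotator-by-content label matrix into low-rank factors, where one factor plays the role of the culture-to-content Y terms. This is a generic matrix-factorization toy in NumPy on a synthetic label matrix; it is not the paper's actual model, data, or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotator-by-content label matrix R (1 = hateful, 0 = not):
# two pairs of annotators from two cultural backgrounds who disagree on items.
R = np.array([
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
], dtype=float)

k = 2                                           # latent cultural dimensions
U = 0.1 * rng.standard_normal((R.shape[0], k))  # annotator (culture) factors
Y = 0.1 * rng.standard_normal((k, R.shape[1]))  # culture-to-content factors

lr, reg = 0.05, 0.01
for _ in range(2000):
    E = R - U @ Y                  # reconstruction error
    U += lr * (E @ Y.T - reg * U)  # gradient step on regularized squared loss
    Y += lr * (U.T @ E - reg * Y)

R_hat = U @ Y  # annotator-specific predicted labels
```

Rather than collapsing disagreeing annotators into one label, the factorization keeps a separate prediction per annotator, which is what makes individualized (culture-aware) effect estimation conceivable.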

9 of 9

Finally… External Validity

  • In studies of online human behavior, external validity concerns how online human behavior relates to real-world human behavior.
  • In some cases, we can conduct "digital twin" RCTs: one in social media and one in the real world. There are opportunities in connecting the two to correct biases.
  • There have been some limited studies in this domain, but perhaps this community can help move this area further…