
Respondent-Driven Sampling: An Overview

Ashton M. Verdery

Duke Network Analysis Center

May, 2019


Outline

  • Intuition about network sampling
  • Leveraging social networks for sampling
    • Why?
    • How?
  • What is RDS?
    • Hidden populations
    • RDS origins and concepts
    • RDS applications
    • Pitfalls and promises of RDS
    • New directions


Samples from social networks

[Figure, shown across eight successive slides: an 11-node social network illustrating samples drawn by following network ties]



Why do it?

  • Future of social science research
    • New populations of interest are hard to survey
      • e.g., undocumented migrants, people who use drugs
    • New theories & tools require new types of data
      • e.g., social network analysis
    • Existential threat of declining survey participation
      • i.e., all groups are becoming hidden populations


Silliness

[Figure: Pew Research Center chart on telephone survey response rates]

http://www.pewresearch.org/2017/05/15/what-low-response-rates-mean-for-telephone-surveys/


Hidden populations

  • Collecting data from hidden populations is difficult because of the absence of a sampling frame
    • Stigma
    • Nonresponse
    • Lack of trust
    • Rarity


[Figure: household-based sampling in Lilongwe, Malawi (Escamilla et al. 2014)]


How to sample hidden populations?

  • Traditional approaches
    • Convenience samples
    • Clinical samples
    • Location samples
  • Problems
    • Are we learning about people other than those sampled?
      • Limited ability to infer representativeness
      • Poor coverage of the sampling frame
      • Often time-intensive, costly, and yields very small samples


Respondent-Driven Sampling (RDS)

  • A sociological method with wide applications
    • Heckathorn 1997
  • The most popular solution to the problem of sampling hidden populations in recent decades (as of May 2019)
    • 544+ studies
    • 1.2k+ papers, 24k+ citations
    • H-index of 59
    • Over $213 million from NIH
  • Compare to “egocentric” designs
    • 254 studies funded
    • $59 million since 1990


RDS applications

  • Hidden populations of many stripes
    • Men who have sex with men
    • People who inject drugs
    • Commercial sex workers
    • High risk heterosexuals
    • Other drug users (opioids, methamphetamines)
    • Domestic violence victims
    • Victims of sexual violence (child prostitution, sex trafficking, war-time rape)
    • Jazz musicians
    • Vegetarians and vegans in Argentina
    • Wheelchair users
    • Non-institutionalized older adults (85+)
  • Most common questions
    • Can we sample this population?
    • What are the characteristics of this population?
    • What is the size of this population?


Top 10 fields

[Figure: top 10 fields publishing RDS studies (Web of Science, May 2019)]


RDS overview

Two parts

1) Chain referral / peer recruitment

    • “Seed” participants receive 2 coupons
      • Recruit 2 new participants each
      • Dual incentives for participation & recruitment
      • Each new respondent given 2 coupons to recruit others
      • Process continues until desired sample size is obtained
      • (No one participates more than once)
      • Researchers lack control over the sampling process

2) Post-recruitment weighting of cases

    • Correct for the theoretical sampling process
    • Make inferences about the population & quantify uncertainty
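The two-part process above can be sketched in code. Below is a minimal simulation of coupon-based chain referral; the toy network, seed choice, and parameters are invented for illustration, not drawn from any cited study:

```python
import random
from collections import deque

def rds_sample(neighbors, seeds, n_coupons=2, target=8, rng=random):
    """Simulate coupon-based chain referral on a network.

    neighbors: dict mapping each node to a list of its contacts.
    Each participant gets n_coupons and passes them to contacts who
    have not yet participated (no one participates more than once).
    """
    sampled = list(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < target:
        recruiter = queue.popleft()
        # Recruit among not-yet-sampled contacts, at random
        eligible = [v for v in neighbors[recruiter] if v not in sampled]
        rng.shuffle(eligible)
        for recruit in eligible[:n_coupons]:
            if len(sampled) >= target:
                break
            sampled.append(recruit)
            queue.append(recruit)
    return sampled

# Toy 11-node network (invented for illustration)
net = {1: [2, 6], 2: [1, 3, 8], 3: [2, 4, 11], 4: [3, 5], 5: [4, 9],
       6: [1, 7], 7: [6, 10], 8: [2, 9], 9: [5, 8], 10: [7], 11: [3]}
print(rds_sample(net, seeds=[1], target=6))
```

Note the last line of the function's docstring mirrors the point marked above: once the seeds are chosen, the researcher does not control which nodes enter the sample.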


Seeds & coupons

  • Seeds
    • 7-10 population members
    • Convenience selection
      • Willing to participate
      • Large personal networks
      • Diverse on relevant attributes
  • Coupons
    • Give 2 to 3 per respondent
      • Non-seeds can only participate with a coupon
    • Uniquely coded for tracking
      • Codes given out & redeemed
    • Non-physical coupons
      • Possible, but challenging


Wirtz et al. 2017


Coupons

[Figure: example coupon, annotated with contact number, consent and study description (on back), valid dates, interview site location, and tracking codes]


Example


(Fisher and Merli, Net. Sci. 2014)


Example

[Figure: sampled nodes shown in black (Verdery et al., Soc. Meth. 2017)]


Core resources

  • Useful website from Handcock, Gile, & collaborators
    • http://hpmrg.org
  • Manuals for RDS survey design
    • Johnson tutorial, with questionnaires, consent forms, etc.
      • http://applications.emro.who.int/dsaf/EMRPUB_2013_EN_1539.pdf
    • CDC, UNAIDS, & others also have useful manuals
      • https://www.cdc.gov/hiv/pdf/statistics/systems/nhbs/nhbs-idu3_nhbs-het3-protocol.pdf
      • https://globalhealthsciences.ucsf.edu/sites/globalhealthsciences.ucsf.edu/files/ibbs-rds-protocol.pdf
  • Software for RDS analysis
    • Stand-alone software for RDS coupon management & analysis
      • http://www.respondentdrivensampling.org/main.htm
    • R package “RDS” for analysis & diagnostics
      • https://cran.r-project.org/web/packages/RDS/index.html
    • Stata packages for analysis
      • http://www.stata-journal.com/article.html?article=st0247
      • I have unreleased Stata packages for many RDS estimators and RDS multivariate regression
  • Diagnostics for RDS preplanning and post-survey analysis
    • http://www.princeton.edu/~mjs3/gile_diagnostics_2014.pdf


Key concepts & assumptions

  • Baseline assumptions
    • Population members are linked in a social network & will refer other members into the study
  • Key concepts
    • Primary & secondary interviews
    • Respondent degree
    • Random recruitment
    • Bottlenecks
    • Bias, sampling variance, & RMSE

  • Different estimators make different assumptions about the recruitment process and the underlying network


Network structure assumptions
  • There is a social network
  • Population size is large (N >> n)
  • Homophily is weak
  • Community structure is weak
  • Connected graph with one component (giant component)
  • All ties reciprocated (undirected)
  • Known population size N

Sampling assumptions
  • Sampling with replacement
  • Single, non-branching chain (1 seed; 1 coupon)
  • Sufficiently many sample waves
  • Initial sample of seeds unbiased
  • Degree accurately measured
  • Conditionally random referrals (random recruitment)

(see Gile 2011:144)


Primary & secondary interviews



Respondent degree

  • Degree
    • Popularity
    • How many incoming ties
      • network assumed undirected
  • Typical solicitation
    • “how many people do you know (you know their name and they know yours) who have exchanged sex for money in the past six months?”
    • Often, successive restrictions
      • Last 30 days, live in area, etc.


Merli, et al Soc. Sci. Med. 2015


Assumption: “random recruitment”

[Figure: a recruiter (A) and contacts; under random recruitment, recruitment probabilities are equal (3/9, 3/9, 3/9)]


In practice: “preferential recruitment”

[Figure: a recruiter (A) and contacts; under preferential recruitment, recruitment probabilities are unequal (4/9, 4/9, 1/9)]


Reasons for preferential recruitment

  • NOT A REASON
    • Has more connections to similar people
      • In principle, the weighting approaches should deal with this
  • Reasons (not exhaustive)
    • Better relationships with similar people
    • Wants to help friend who needs money
    • Wants friend to get HIV test
    • Only friends who do riskier things want to get tested
    • Unemployed friends more likely to be encountered
    • Etc.


“Bottlenecks”

  • Few ties between clusters
    • Assumed to matter substantially
    • Somewhat overstated
  • General advice:
    • Split sample
    • Tough to achieve a priori

With n = 500, RDS on this network exhibits 150× the sampling variance of SRS, and the estimated sampling variance bears no relation to the true value. We see this in network after network.

Mouw & Verdery Soc. Meth. 2012

Salganik & Goel, Stat. Med. 2009


Key concepts

  • Bias
    • “Accuracy”
    • How far from the population parameter is the average sample?
  • Sampling variance
    • “Precision”
    • How variable are the results, sample to sample, on average?
    • Often expressed as Design Effects
      • Ratio of RDS to SRS sampling variance
      • Interpretable as sample size multiplier
  • Root Mean Square Error (RMSE)
    • Balancing accuracy and precision
      • There are many other error metrics

Verdery, Merli, et al., Epid. 2015

[Figure: accuracy vs. precision (“Just right?”)]
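The design effect defined above is easy to compute once repeated estimates from each design are in hand: it is simply the ratio of sampling variances. A minimal sketch (the two lists of estimates are invented numbers, not results from any cited study):

```python
from statistics import pvariance

def design_effect(rds_estimates, srs_estimates):
    """Design effect: ratio of RDS to SRS sampling variance.

    Interpretable as a sample-size multiplier: an RDS sample needs
    roughly DE times as many cases as an SRS sample to achieve the
    same precision.
    """
    return pvariance(rds_estimates) / pvariance(srs_estimates)

# Invented estimates of a population proportion from repeated samples
rds = [0.42, 0.61, 0.35, 0.58, 0.49, 0.70, 0.31, 0.54]
srs = [0.49, 0.52, 0.48, 0.51, 0.50, 0.53, 0.47, 0.50]
print(round(design_effect(rds, srs), 1))
```

Both sets of invented estimates center on 0.5 (no bias), but the RDS estimates scatter far more widely, so the design effect is large; this is the precision problem the slide describes.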


Contrast with SRS

Network: Project 90 (N=4413)

Variable: Percent White

RDS

    • Unbiased, 10 seeds, 3 coupons
    • Without replacement
    • n=150

SRS

    • Without replacement
    • n=150

[Figure: Project 90 network; red nodes = non-white]

Verdery et al. 2017

Contrast with SRS

[Figure, shown across three slides: RDS vs. SRS comparison]

Contrast with SRS (n=400)


Bias, sampling variance, & uncertainty

  • Early RDS work focused on bias, but sampling variance is also critical
  • A related concern:
    • Quantifying uncertainty
    • After data collection, can you say:
      • How biased your sample is?
      • How results would vary sample to sample?
    • Key feature of inferential statistics
      • E.g., if sampling conformed to assumptions, we can provide a confidence interval for an estimate and be reasonably sure the confidence interval is accurate
      • Is this true in RDS?


Quantifying uncertainty

  • Traditional estimators of RDS sampling variance perform poorly

  • Example
    • Sampling variance (SV)
      • RDS mean estimators have high SV
    • Estimated sampling variance
      • RDS SV estimators have high bias

Verdery et al., PLOS ONE 2015


Recent progress on estimating RDS sampling variance


Baraff et al., PNAS 2016


Estimators

  • Of the population mean
    • At least 11 in current use
      • Table on right
      • McCreesh et al. 2013
      • Crawford 2016
      • Gile & Handcock 2015
      • Berchenko 2017
  • Of the sampling variance
    • 5 primary methods in use
      • Bootstrap (Salganik 2006)
      • Analytical (Volz & Heckathorn 2008)
      • Successive Sampling (Gile 2011)
      • Model assisted (Gile & Handcock 2015)
      • Tree Bootstrap (Baraff et al. 2016)


Verdery, et al., Epid. 2015
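Of the mean estimators listed above, Volz-Heckathorn (often called RDS-II) is the simplest to write down: it weights each respondent inversely to reported degree, since under the RDS assumptions inclusion probability is roughly proportional to degree. A minimal sketch with invented data:

```python
def volz_heckathorn(values, degrees):
    """Volz-Heckathorn (RDS-II) estimator of a population mean.

    Weights each respondent by 1/degree, because high-degree members
    are more likely to be recruited under the RDS assumptions.
    """
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

# Invented sample: y = 1 if positive on some trait, deg = reported degree
y = [1, 0, 1, 0, 0, 1]
deg = [2, 8, 4, 10, 5, 2]
naive = sum(y) / len(y)        # unweighted (naive) sample mean
vh = volz_heckathorn(y, deg)   # degree-weighted estimate
print(round(naive, 3), round(vh, 3))  # prints: 0.5 0.746
```

In this invented sample the trait is concentrated among low-degree respondents, so down-weighting the well-connected pushes the estimate above the naive mean.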


General comments on estimators

  • For the population mean
    • “linked ego networks” is best
      • Requires respondents know peer attributes reasonably well
      • Can’t calculate for many variables of interest
    • Naïve estimator often works
    • Most common
      • Volz-Heckathorn
      • Successive Sampling
        • (In general, SS is better)
  • For the sampling variance
    • Only the tree bootstrap method seems to have anything resembling reasonable properties


Verdery, et al., Epid. 2015


Diagnostics

  • Embed questions in the survey to allow you to estimate whether assumptions were met
    • E.g., ask why people recruited those they did, how many people they tried to recruit who had already participated, etc.
  • Assess potential bottlenecks and seed bias with convergence plots


Johnston, et al., Epid. 2015
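A convergence plot of the kind mentioned above is just the running estimate computed in recruitment order; a curve that flattens well before the end of sampling suggests the estimate has escaped the influence of the seeds. A minimal, unweighted sketch (real diagnostics would use a weighted estimator; the data here are invented):

```python
def running_means(values):
    """Running mean of an indicator in recruitment order.

    Plotting this against sample size is the basic convergence
    diagnostic: the curve should flatten as seed influence fades.
    """
    out, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        out.append(total / i)
    return out

# Invented recruitment-ordered indicator (e.g., 1 = has the trait)
ys = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
print([round(m, 2) for m in running_means(ys)])
```

Producing one such series per seed, rather than pooling, also helps diagnose seed bias: chains started from different seeds should converge to the same value.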


A few notes on web-based RDS

  • Developing area with challenges but lots of potential
  • Recommendations
    • Differences from traditional
      • Be prepared to expand to 30-60 seeds; 20+ waves
    • Verification
      • Respondent Uniqueness
        • IP address verification; web-cam interview?
      • Respondent is in target population
        • In geographic area of interest? Fits other criteria?
      • Coupon management
        • Careful with secondary incentives
    • Remember limitations
      • Internet access, etc.


If problems…

  • Expand recruitment
    • Expand number of seeds
    • Expand allowable recruits
    • Raise incentives
    • Reduce burdens
      • Greater emphasis on anonymity
      • Shorten survey
      • Drop secondary interview
  • If all else fails…
    • Convenience sample
    • Lean on other features
  • It won’t always look like it does on paper


My recommendations

  • 1) Embed additional data collection in RDS
    • Qualitative interviews
    • Ego network rosters
    • Minimally identifiable information about alters
  • 2) Examine more than just prevalence
    • Population size
    • Network structure
    • Multivariate relationships


Promises & pitfalls

Weighting/estimation can yield asymptotically unbiased estimates of the population mean

    • But unrealistic, hard-to-verify assumptions are required

Design effects remain high

    • Orders-of-magnitude larger samples are needed

But…

    • New data on understudied populations
    • Effective, fast method (50 cases/week)
    • Possible to learn a lot about networks (underutilized)



Thank you!

Portions of this work were supported by a grant from the National Institutes of Health (1 R03 SH000056-01; Verdery PI): “Multivariate Regression with Respondent-Driven Sampling Data.”

I also appreciate assistance from the Justice Center for Research, the Institute for CyberScience, the Social Science Research Institute, the College of the Liberal Arts, and the Population Research Institute at Penn State University, the last of which is supported by an infrastructure grant from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (P2CHD041025 & R24 HD041025).

Other portions of this work benefitted from support from the Duke Network Analysis Center, the Duke Population Research Institute, and the Carolina Population Center.

Ashton M. Verdery: amv5430@psu.edu

I thank many coauthors: M. Giovanna Merli, James Moody, Ted Mouw, Peter J. Mucha, Jacob C. Fisher, Shawn Bauldry, Nalyn Siripong, Jeff Smith, Kahina Abdessalem, Sergio Chavez, Heather Edelblute, Jing Li, Jose Luis Molina, Miranda Lubbers, Sara Francisco, Claire Kelling, Anne DeLessio-Parson, & David Hunter.


Alternate link-tracing designs

  • Network Sampling with Memory
    • Collect network data from respondents
    • Minimally identifying information links nominated but not-yet-sampled individuals
    • A “search” algorithm explores the network more efficiently based on currently uncovered data
    • Recovers the sampling frame
    • A “list” algorithm then samples the frame as if at random


Mouw & Verdery. 2012. Sociological Methodology.


Network sampling with memory

  • Two sampling modes:
    • Search
      • Push the sample to explore the network by seeking bridge ties
    • List
      • Keep a list L of unique members, both nominated & sampled
      • Sample with replacement from L
      • “Even” sampling of nodes ensured by probabilistic selection
      • When the whole network has been nominated, converges to SRS
  • Simulated sampling showed the hybrid design (Search → List) performs best


Network sampling with memory


Test network

  • Add Health high school
    • 1,281 students
    • 67.3% white
    • 10,414 edges in the data
    • 587 cross-race ties (white → non-white)
    • 8% of whites’ friends are non-white
  • Conclusions:
    • Homophily in the data
    • But no “choke points”
    • Lots of cross-group ties
  • Method
    • Test simulated FNSM
      • 500 samples, 500 cases each
        • RDS, NSM, FNSM
      • Calculate CIs and DEs


Results in test network


Empirical results
