1 of 33

Community Detection

MGT 780 SPRING 2022

STEVE BORGATTI

(C) 2022 STEPHEN P BORGATTI

1

05-Mar-22

8 MAR

2 of 33

Agenda

  • Cliques
  • Hierarchical clustering

(C) 2022 STEPHEN P BORGATTI

2

05-Mar-22

3 of 33

Why we care

  • They exist!
    • Real networks often clumpy. We want to know it �when it happens
    • Number of groups can predict ability of network to �accomplish something as a whole
  • Need for data reduction
    • Can analyze each group separately
    • Can aggregate subgroups into “super-nodes” and visualize network of groups
      • Build simplified model of a complex network (blockmodeling)

(C) 2021 STEPHEN P BORGATTI

3

5 March 2022

4 of 33

Why we care – cont.

  • Groups affect social processes we care about
    • Groups are powerful influencers/constrainers of members
      • Create homogeneity with respect to attitudes, behaviors. i.e., contagion
        • Canonical hypotheses: people in groups become more similar to each other
      • Disease, information, customs spread more quickly within groups than between
      • Intellectual communities, communities of practice, invisible colleges
  • Conflict often emerges along subgroup fault lines
    • Ingroup/outgroup attributions
    • Group fission
  • Identify roles such as “inner group member” or “boundary spanner”

(C) 2021 STEPHEN P BORGATTI

4

5 March 2022

UCINET: Examine pv504

Social conformity, enforcement of norms, culture

5 of 33

Why we care …

  • Groups affect social processes we care about
    • Groups are powerful influencers/constrainers of members
      • Create homogeneity with respect to attitudes, behaviors. i.e., contagion
        • Canonical hypotheses: people in groups become more similar to each other
      • Disease, information, customs spread more quickly within groups than between
      • Intellectual communities, communities of practice, invisible colleges
  • Conflict often emerges along subgroup faultlines
    • Ingroup/outgroup attributions
    • Group fission
  • Identify roles such as “inner group member” or “boundary spanner”

(C) 2021 STEPHEN P BORGATTI

5

5 March 2022

UCINET: Examine pv504 in netdraw

Social conformity, enforcement of norms, culture

6 of 33

Why we care … cont.

  • Groups are the outcomes of social processes we care about
    • Emergent patterns from local rules
      • E.g., transitivity due to balance theoretic causes
    • Homophily
      • Homophily implies transitivity
    • Propinquity provides�opportunity

(C) 2021 STEPHEN P BORGATTI

6

5 March 2022

7 of 33

Multiple approaches/types

  • Definitional vs algorithmic
    • Properties vs methods
  • Network-specific vs general
    • Graph partitioning algorithms
      • Girvan-Newman
    • Multivariate cluster analysis

  • Mutually exclusive vs overlapped
  • Flat vs hierarchical
    • Optimal grouping or spectrum of refinement

(C) 2021 STEPHEN P BORGATTI

7

5 March 2022

8 of 33

Definitional

  • Mathematical ethnography
    • Take an idea from sociology / social psychology and formalize it
    • Once mathematically formalized, can be found by an algorithm
  • Thinking through key characteristics
    • Dense – many ties within
    • Close – short distances within
    • Robust – difficult to disconnect through loss of ties or nodes

(C) 2021 STEPHEN P BORGATTI

8

5 March 2022

9 of 33

Clique

  • Maximal complete subgraph
  • Complete
    • Every member of the group has direct tie to every other
  • Maximal
    • There is no other node outside the clique that could be added to the clique without violating the completeness condition
    • IOW, if anyone outside the clique is connected to every member of the clique, then they must be added to the clique
  • Characteristics
    • Density = 1.0 ~ number of ties aspect
    • Avg geodesic distance = 1.0 ~ length of paths aspect

(C) 2021 STEPHEN P BORGATTI

9

5 March 2022

Example of the definitional approach

10 of 33

Difficulties with cliques

  • Real networks tend to have very small cliques
  • Often very numerous
  • Often highly overlapping

(C) 2021 STEPHEN P BORGATTI

10

5 March 2022

1: HOLLY MICHAEL DON HARRY

2: BRAZEY LEE STEVE BERT

3: CAROL PAT PAULINE

4: CAROL PAM PAULINE

5: PAM JENNIE ANN

6: PAM PAULINE ANN

7: MICHAEL BILL DON HARRY

8: JOHN GERY RUSS

9: GERY STEVE RUSS

10: STEVE BERT RUSS

->c = clique(campnet)�->draw c

Example of the definitional approach

11 of 33

K-plexes

  • A set of n nodes is a k-plex if every node in the set is connected to at least n-k others in the set
    • A clique is a 1-plex
  • K-plex is a relaxation of a clique such that every node doesn’t have to be connected to every single other node in the set.
    • You can miss up to k ties
  • In a 2-plex, each node is connected to n-2 others

(C) 2021 STEPHEN P BORGATTI

11

5 March 2022

1

HOLLY

CAROL

PAM

PAT

2

HOLLY

PAM

PAT

JENNIE

3

HOLLY

PAM

PAT

PAULINE

4

HOLLY

PAM

ANN

5

HOLLY

PAM

MICHAEL

6

HOLLY

PAM

DON

7

HOLLY

PAM

HARRY

8

HOLLY

PAT

MICHAEL

9

HOLLY

PAT

DON

10

HOLLY

PAT

HARRY

11

HOLLY

MICHAEL

BILL

DON

HARRY

12

HOLLY

MICHAEL

GERY

13

BRAZEY

LEE

STEVE

BERT

14

BRAZEY

GERY

STEVE

15

BRAZEY

STEVE

BERT

RUSS

16

CAROL

PAM

PAT

JENNIE

17

CAROL

PAM

PAT

PAULINE

18

CAROL

PAM

PAULINE

ANN

19

CAROL

PAULINE

JOHN

20

PAM

PAT

JENNIE

PAULINE

21

PAM

JENNIE

PAULINE

ANN

22

PAM

PAULINE

JOHN

23

PAT

JENNIE

PAULINE

ANN

24

PAT

PAULINE

JOHN

25

PAULINE

ANN

JOHN

26

PAULINE

JOHN

GERY

27

PAULINE

JOHN

RUSS

28

MICHAEL

BILL

GERY

29

MICHAEL

DON

GERY

30

MICHAEL

JOHN

GERY

31

MICHAEL

HARRY

GERY

32

MICHAEL

GERY

STEVE

33

MICHAEL

GERY

RUSS

34

LEE

GERY

STEVE

35

LEE

STEVE

BERT

RUSS

36

JOHN

GERY

STEVE

RUSS

37

JOHN

BERT

RUSS

38

GERY

STEVE

BERT

RUSS

12 of 33

Definitional approaches

  • Clique. Maximal sets of nodes in which all have tie to all
  • N-clique. Maximal sets in which no pair of nodes is separated by more than N links (i.e., distance btw every pair of nodes is <= n)
  • N-clan. N-cliques in which diameter of induced subgraph <= N
  • N-club. Maximal subgraphs s.t. diameter of induced subgraph <= N
  • K-plex. Maximal subgraphs of size n in which every node is connected to n-k others
  • LS-sets. Let V be the set of all nodes, H a subset of V, and K a proper subset of H. Then H is an LS-set if α(K,H-K) > α(K,V-H) for all K, where α(A,B) means # of ties from members of A to members of B
    • Every subset is more connected to the ls set than to the outside world

(C) 2021 STEPHEN P BORGATTI

12

5 March 2022

13 of 33

Clustering methods

FROM MULTIVARIATE STATISTICS

(C) 2021 STEPHEN P BORGATTI

13

5 March 2022

14 of 33

Johnson’s Hierarchical Clustering -- HiClus

  • Output is a set of nested partitions, starting with identity partition and ending with the complete partition
    • Partition represented as vector that associates each node with one and only one “group” (mutually exclusive) – is a cluster id variable
    • Identity partition is where every node gets its own cluster
    • Complete partition is where there is only one cluster and every node belongs to it
  • Different flavors of hiclus based on how distance from a cluster to outside point/node is defined
    • Single linkage; connectedness; minimum
    • Complete linkage; diameter; maximum
    • Average, median, etc.

(C) 2021 STEPHEN P BORGATTI

14

5 March 2022

From multivariate statistics

15 of 33

(C) 2021 STEPHEN P BORGATTI

15

5 March 2022

 

BOS

NY

DC

MIA

CHI

SEA

SF

LA

DEN

BOS

0

206

429

1504

963

2976

3095

2979

1949

NY

206

0

233

1308

802

2815

2934

2786

1771

DC

429

233

0

1075

671

2684

2799

2631

1616

MIA

1504

1308

1075

0

1329

3273

3053

2687

2037

CHI

963

802

671

1329

0

2013

2142

2054

996

SEA

2976

2815

2684

3273

2013

0

808

1131

1307

SF

3095

2934

2799

3053

2142

808

0

379

1235

LA

2979

2786

2631

2687

2054

1131

379

0

1059

DEN

1949

1771

1616

2037

996

1307

1235

1059

0

Closest distance is NY-BOS = 206, so merge these.

16 of 33

(C) 2021 STEPHEN P BORGATTI

16

5 March 2022

BOS�NY

DC

MIA

CHI

SEA

SF

LA

DEN

BOS/ NY

0

233

1308

802

2815

2934

2786

1771

DC

233

0

1075

671

2684

2799

2631

1616

MIA

1308

1075

0

1329

3273

3053

2687

2037

CHI

802

671

1329

0

2013

2142

2054

996

SEA

2815

2684

3273

2013

0

808

1131

1307

SF

2934

2799

3053

2142

808

0

379

1235

LA

2786

2631

2687

2054

1131

379

0

1059

DEN

1771

1616

2037

996

1307

1235

1059

0

Closest pair is DC to BOSNY combo @ 233. So merge these.

17 of 33

(C) 2021 STEPHEN P BORGATTI

17

5 March 2022

 

BOS/NY/�DC

MIA

CHI

SEA

SF

LA

DEN

BOS/NY�DC

0

1075

671

2684

2799

2631

1616

MIA

1075

0

1329

3273

3053

2687

2037

CHI

671

1329

0

2013

2142

2054

996

SEA

2684

3273

2013

0

808

1131

1307

SF

2799

3053

2142

808

0

379

1235

LA

2631

2687

2054

1131

379

0

1059

DEN

1616

2037

996

1307

1235

1059

0

18 of 33

(C) 2021 STEPHEN P BORGATTI

18

5 March 2022

BOS/

NY/DC

MIA

CHI

SEA

SF/LA

DEN

BOS/NY/DC

0

1075

671

2684

2631

1616

MIA

1075

0

1329

3273

2687

2037

CHI

671

1329

0

2013

2054

996

SEA

2684

3273

2013

0

808

1307

SF/LA

2631

2687

2054

808

0

1059

DEN

1616

2037

996

1307

1059

0

19 of 33

(C) 2021 STEPHEN P BORGATTI

19

5 March 2022

BOS/NY/DC/

CHI

MIA

SEA

SF/LA

DEN

BOS/NY/DC/CHI

0

1075

2013

2054

996

MIA

1075

0

3273

2687

2037

SEA

2013

3273

0

808

1307

SF/LA

2054

2687

808

0

1059

DEN

996

2037

1307

1059

0

20 of 33

(C) 2021 STEPHEN P BORGATTI

20

5 March 2022

BOS/NY/DC/CHI

MIA

SF/LA/SEA

DEN

BOS/NY/DC/CHI

0

1075

2013

996

MIA

1075

0

2687

2037

SF/LA/SEA

2054

2687

0

1059

DEN

996

2037

1059

0

21 of 33

(C) 2021 STEPHEN P BORGATTI

21

5 March 2022

BOS/NY/DC/CHI/DEN

MIA

SF/LA/SEA

BOS/NY/DC/�CHI/DEN

0

1075

1059

MIA

1075

0

2687

SF/LA/SEA

1059

2687

0

22 of 33

(C) 2021 STEPHEN P BORGATTI

22

5 March 2022

BOS/NY/DC/CHI/DEN/SF/LA/SEA

MIA

BOS/NY/DC/CHI/DEN/SF/LA/SEA

0

1075

MIA

1075

0

Done!

23 of 33

Applying HiClus to Network Data

  • Wants symmetric data
  • Doesn’t work well with matrix of 0s and 1s – not enough variation to play with
  • So we must symmetrize, and if the data are not valued, then we must construct a valued matrix from the adjacency matrix.
    • What can we use?

(C) 2021 STEPHEN P BORGATTI

23

5 March 2022

Geodesic Distances

1 1 1 1 1 1 1 1 1

1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8

H B C P P J P A M B L D J H G S B R

- - - - - - - - - - - - - - - - - -

1 HOLLY 0 4 2 1 1 2 2 2 1 2 4 1 3 1 2 3 4 3

2 BRAZEY 4 0 5 5 5 6 4 5 3 4 1 4 3 4 2 1 1 2

3 CAROL 2 5 0 1 1 2 1 2 3 4 5 3 2 3 3 4 4 3

4 PAM 1 5 1 0 2 1 1 1 2 3 5 2 2 2 3 4 4 3

5 PAT 1 5 1 2 0 1 1 2 2 3 5 2 2 2 3 4 4 3

6 JENNIE 2 6 2 1 1 0 2 1 3 4 6 3 3 3 4 5 5 4

7 PAULINE 2 4 1 1 1 2 0 1 3 4 4 3 1 3 2 3 3 2

8 ANN 2 5 2 1 2 1 1 0 3 4 5 3 2 3 3 4 4 3

9 MICHAEL 1 3 3 2 2 3 3 3 0 1 3 1 2 1 1 2 3 2

10 BILL 2 4 4 3 3 4 4 4 1 0 4 1 3 1 2 3 4 3

11 LEE 4 1 5 5 5 6 4 5 3 4 0 4 3 4 2 1 1 2

12 DON 1 4 3 2 2 3 3 3 1 1 4 0 3 1 2 3 4 3

13 JOHN 3 3 2 2 2 3 1 2 2 3 3 3 0 3 1 2 2 1

14 HARRY 1 4 3 2 2 3 3 3 1 1 4 1 3 0 2 3 4 3

15 GERY 2 2 3 3 3 4 2 3 1 2 2 2 1 2 0 1 2 1

16 STEVE 3 1 4 4 4 5 3 4 2 3 1 3 2 3 1 0 1 1

17 BERT 4 1 4 4 4 5 3 4 3 4 1 4 2 4 2 1 0 1

18 RUSS 3 2 3 3 3 4 2 3 2 3 2 3 1 3 1 1 1 0

24 of 33

Hiclus of geo distances

(C) 2021 STEPHEN P BORGATTI

24

5 March 2022

P M

A J I B

C U H E C H R S

A L O N B H A B A T G J R

P R I P L N A I A R D L E Z E E O U

A O N A L I N L E R O E R E V R H S

T L E M Y E N L L Y N E T Y E Y N S

1 1 1 1 1 1 1 1 1

Level 5 3 7 4 1 6 8 0 9 4 2 1 7 2 6 5 3 8

----- - - - - - - - - - - - - - - - - - -

1.000 XXXXX XXX XXX XXXXXXX XXXXXXX XXXXX

1.333 XXXXX XXXXXXX XXXXXXX XXXXXXX XXXXX

1.457 XXXXX XXXXXXX XXXXXXX XXXXXXXXXXXXX

1.481 XXXXXXXXXXXXX XXXXXXX XXXXXXXXXXXXX

2.723 XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXX

3.142 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

25 of 33

Reciprocal Distance

(C) 2021 STEPHEN P BORGATTI

25

5 March 2022

  • For undefined distances, we can define the reciprocal distance to be 0
  • Directed data often have undefined distances
  • So, as a general solution for all kinds of data, use reciprocal distance

26 of 33

Alters in common

  • For a given pair of nodes, count # of alters in common (how big is the intersection between N(u) and N(v)?)
    • Outgoing ties
    • Incoming
    • Symmetrized
  • If we add adjacency matrix (AIC + ADJ) we get better sense of proximity of each pair
  • Same concept as embeddedness, except embeddedness normally applied only to ties
    • Simmelian ties are embedded ties which are also reciprocated

(C) 2021 STEPHEN P BORGATTI

26

5 March 2022

-> common = aic(campnet outgoing)

27 of 33

Clique redux

  • By itself, clique analysis is a bust – too many highly overlapping cliques.
  • However, we can apply secondary analyses to the clique results
    • 2-mode analysis of the node-by-clique participation matrix
    • 1-mode analysis of node-by-node clique overlap matrix

(C) 2021 STEPHEN P BORGATTI

27

5 March 2022

28 of 33

The node-by-clique participation matrix

  • Analyze the undirected rdgam matrix from with Wiring dataset

(C) 2021 STEPHEN P BORGATTI

28

5 March 2022

5 cliques found.

1: I1 W1 W2 W3 W4

2: W1 W2 W3 W4 S1

3: W1 W3 W4 W5 S1

4: W6 W7 W8 W9

5: W7 W8 W9 S4

1 2 3 4 5

----- ----- ----- ----- -----

1 I1 1.000 0.800 0.600 0.000 0.000

2 I3 0.000 0.000 0.000 0.000 0.000

3 W1 1.000 1.000 1.000 0.000 0.000

4 W2 1.000 1.000 0.800 0.000 0.000

5 W3 1.000 1.000 1.000 0.000 0.000

6 W4 1.000 1.000 1.000 0.000 0.000

7 W5 0.600 0.800 1.000 0.250 0.250

8 W6 0.000 0.000 0.000 1.000 0.750

9 W7 0.000 0.000 0.200 1.000 1.000

10 W8 0.000 0.000 0.000 1.000 1.000

11 W9 0.000 0.000 0.000 1.000 1.000

12 S1 0.800 1.000 1.000 0.000 0.000

13 S2 0.000 0.000 0.000 0.000 0.000

14 S4 0.000 0.000 0.000 0.750 1.000

29 of 33

Visualizing node-by-clique participation

(C) 2021 STEPHEN P BORGATTI

29

5 March 2022

30 of 33

Factor analysis of clique participation matrix

(C) 2021 STEPHEN P BORGATTI

30

5 March 2022

EIGENVALUES

FACTOR VALUE PERCENT CUM % RATIO

------- -------- ------- ------- -------

1: 4.23198 84.6 84.6 6.290

2: 0.67282 13.5 98.1 9.811

3: 0.06858 1.4 99.5 3.115

4: 0.02201 0.4 99.9 4.771

5: 0.00461 0.1 100.0

======= ======== ======= ======= =======

5.00000 100.0

Unrotated Factor Loadings

1 0.951

2 0.959

3 0.923

4 -0.882

5 -0.882

Factor scores

1 I1 0.809

2 I3 -0.338

3 W1 1.095

4 W2 0.999

5 W3 1.095

6 W4 1.095

7 W5 0.576

8 W6 -1.148

9 W7 -1.168

10 W8 -1.264

11 W9 -1.264

12 S1 0.999

13 S2 -0.338

14 S4 -1.148

31 of 33

Clique participation with factor scores

(C) 2021 STEPHEN P BORGATTI

31

5 March 2022

Try zache

32 of 33

Node-by-node clique overlap matrix

(C) 2021 STEPHEN P BORGATTI

32

5 March 2022

1 1 1 1 1

1 2 3 4 5 6 7 8 9 0 1 2 3 4

I I W W W W W W W W W S S S

- - - - - - - - - - - - - -

1 I1 1 0 1 1 1 1 0 0 0 0 0 0 0 0

2 I3 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 W1 1 0 3 2 3 3 1 0 0 0 0 2 0 0

4 W2 1 0 2 2 2 2 0 0 0 0 0 1 0 0

5 W3 1 0 3 2 3 3 1 0 0 0 0 2 0 0

6 W4 1 0 3 2 3 3 1 0 0 0 0 2 0 0

7 W5 0 0 1 0 1 1 1 0 0 0 0 1 0 0

8 W6 0 0 0 0 0 0 0 1 1 1 1 0 0 0

9 W7 0 0 0 0 0 0 0 1 2 2 2 0 0 1

10 W8 0 0 0 0 0 0 0 1 2 2 2 0 0 1

11 W9 0 0 0 0 0 0 0 1 2 2 2 0 0 1

12 S1 0 0 2 1 2 2 1 0 0 0 0 2 0 0

13 S2 0 0 0 0 0 0 0 0 0 0 0 0 0 0

14 S4 0 0 0 0 0 0 0 0 1 1 1 0 0 1

HIERARCHICAL CLUSTERING OF OVERLAP MATRIX

I I W W W W W S S W W W W S

3 1 5 2 1 3 4 1 2 6 7 8 9 4

1 1 1 1 1

Level 2 1 7 4 3 5 6 2 3 8 9 0 1 4

----- - - - - - - - - - - - - - -

3.000 . . . . XXXXX . . . . . . .

2.000 . . . XXXXXXX . . . XXXXX .

1.800 . . . XXXXXXXXX . . XXXXX .

1.000 . . . XXXXXXXXX . XXXXXXX .

0.911 . . XXXXXXXXXXX . XXXXXXX .

0.800 . . XXXXXXXXXXX . XXXXXXXXX

0.381 . XXXXXXXXXXXXX . XXXXXXXXX

0.000 XXXXXXXXXXXXXXXXXXXXXXXXXXX

33 of 33

Rdgam with 4 cluster solution imposed

(C) 2021 STEPHEN P BORGATTI

33

5 March 2022