1 of 17

Semi-Supervised Community Detection Using Structure and Size

Arjun Bakshi, Srinivasan Parthasarathy, and Kannan Srinivasan

Presented at: IEEE International Conference on Data Mining

November 17-20, 2018 in Singapore

2 of 17

Community Detection

  • What if some communities are known?

Source: http://snap.stanford.edu/agm/

  • Based on connectedness
  • Based on node attributes or embeddings

Source: Learning Community Embedding with Community Detection and Node Embedding on Graphs, CIKM’17

3 of 17

Semi-Supervised Community Detection

  • If some communities in the graph are known, then can we detect others like them?
  • What do we mean by “like them”?
    • Same size distribution as known communities
    • Similar patterns in structure or composition as known communities

4 of 17

Size Information

  • Sizes of communities depends on dataset
  • Detected communities too large → low precision
  • Detected communities too small → low recall

Data source: http://snap.stanford.edu/data/

5 of 17

Structure in Communities

  • Nodes → Roles
    • Core,
    • Peripheral
    • Bridge
    • Homophilic, etc.

  • Community → Collection of nodes with certain roles
    • Nodes with same labels
    • Specific distribution of labels

Image source: RolX/Discovering Roles and Anomalies in Graphs: Theory and Applications, Eliassi-Rad et al. ECML PKDD 2013

6 of 17

Defining Structure of a Community

  • Given:
    • G, labeled graph
    • C, a community within it
  • Goal:
    • A feature vector for community’s structure/composition

*

*

Generate labels using existing techniques

un

7 of 17

Patterns in Communities vs Random Subgraphs

  • Compare random subgraphs vs communities in the defined feature space

  • DBLP dataset

More common in random subgraphs

More common in communities

8 of 17

Semi-Supervised Community Detection

  • If some communities in the graph are known, then can we detect others like them?

  • What do we mean by “like them”?
    • Same size distribution as known communities
    • Similar patterns in structure or composition as known communities

  • How do we detect them in a large graph?
    • Motif matching: Computationally expensive
    • Heuristic based:
      • Summarize patterns in communities
      • Find a node neighborhood with similar patterns
      • Expand from that node

9 of 17

Summarizing Community Patterns

DBLP Communities

DBLP Community Patterns

CP1

CP2

CP3

CP4

CP5

10 of 17

Identifying Good Seed Nodes

  • For a community pattern CPi and a node n:
    • Generate feature vector for ego net of n, FVn
    • Update FVn based on FVs of n’s neighbors
    • Score(CPi, n) = Similarity of CPi and FVn

  • If Score(CPi, n) is high, then n is a good seed for generating a community with a pattern CPi

Image source: RolX/Discovering Roles and Anomalies in Graphs: Theory and Applications, Eliassi-Rad et al. ECML PKDD 2013

Communities

Node ego nets

11 of 17

Putting it together: The Bespoke Algorithm

Input:

Graph + known communities

Node CP1 CP2 CP3 CP4

n1 0.9 0.2 0.5 0.3

n2 0.1 0.9 0.4 0.2

n3 0.3 0.2 0.1 0.1

CP1

CP4

Size of communities in CP1

Size of communities in CP4

Size of communities in CP3

Size of communities in CP2

CP2

CP3

Training

12 of 17

Putting it together: The Bespoke Algorithm

  • Community detection:
    • Pick CPi based on frequency in the training dataset
    • Pick a node, n, with a high Score(CPi, n)
    • Pick a target community size, T, from size distribution associated with CPi
    • Expand in Breadth First manner till T nodes are reached

CP1

CP4

Size of communities in CP1

Size of communities in CP4

Size of communities in CP3

Size of communities in CP2

CP2

CP3

Node CP1 CP2 CP3 CP4

n1 0.9 0.2 0.5 0.3

n2 0.1 0.9 0.4 0.2

n3 0.3 0.2 0.1 0.1

13 of 17

Results: Datasets

  • SNAP Stanford data repository 1
  • Bespoke trains on <100 communities in all cases

Nodes

Edges

Communities

Amazon product network

334,863

935,873

5,000

DBLP collaborations

317,080

1,049,866

5,000

Youtube

1,134,890

2,987,624

5,000

Twitter (ego)

81,306

1,768,149

3,662

Facebook (ego)

4,039

88.234

191

1: http://snap.stanford.edu/data/

14 of 17

Results: F1 scores

  • Better than meta-data (CESNA), embedding (ComE), and supervised (BigClam-Assisted) methods

15 of 17

Results: F1 Scores

  • Importance of Size and Structure information

16 of 17

Results: Community Patterns (Amazon)

Bespoke

Actual Communities

BigClam

Random Subgraphs

17 of 17

Conclusion

  • Present a feature space for exploring patterns in communities
    • Uses quick node labeling method for unlabeled graphs
  • Show significantly divergent patterns in communities vs random subgraphs
  • Proposed algorithm achieves higher F1 scores, lower run times