1 of 17

Semi-Supervised Community Detection Using Structure and Size

Arjun Bakshi, Srinivasan Parthasarathy, and Kannan Srinivasan

Presented at: IEEE International Conference on Data Mining

November 17-20, 2018 in Singapore

2 of 17

Community Detection

What if some communities are known?

Source: http://snap.stanford.edu/agm/

Based on connectedness

Based on node attributes or embeddings

Source: Learning Community Embedding with Community Detection and Node Embedding on Graphs, CIKM’17

3 of 17

Semi-Supervised Community Detection

If some communities in the graph are known, then can we detect others like them?
What do we mean by “like them”?

Same size distribution as known communities
Similar patterns in structure or composition as known communities

4 of 17

Size Information

Sizes of communities depends on dataset
Detected communities too large → low precision
Detected communities too small → low recall

Data source: http://snap.stanford.edu/data/

5 of 17

Structure in Communities

Nodes → Roles

Core,
Peripheral
Bridge
Homophilic, etc.

Community → Collection of nodes with certain roles

Nodes with same labels
Specific distribution of labels

Image source: RolX/Discovering Roles and Anomalies in Graphs: Theory and Applications, Eliassi-Rad et al. ECML PKDD 2013

6 of 17

Defining Structure of a Community

Given:

G, labeled graph
C, a community within it

Goal:

A feature vector for community’s structure/composition

*

Generate labels using existing techniques

un

7 of 17

Patterns in Communities vs Random Subgraphs

Compare random subgraphs vs communities in the defined feature space

DBLP dataset

More common in random subgraphs

More common in communities

8 of 17

Semi-Supervised Community Detection

If some communities in the graph are known, then can we detect others like them?

What do we mean by “like them”?

Same size distribution as known communities
Similar patterns in structure or composition as known communities

How do we detect them in a large graph?

Motif matching: Computationally expensive
Heuristic based:

Summarize patterns in communities
Find a node neighborhood with similar patterns
Expand from that node

9 of 17

Summarizing Community Patterns

DBLP Communities

DBLP Community Patterns

CP₁

CP₂

CP₃

CP₄

CP₅

10 of 17

Identifying Good Seed Nodes

For a community pattern CP_i and a node n:

Generate feature vector for ego net of n, FV_n
Update FV_n based on FVs of n’s neighbors
Score(CP_i, n) = Similarity of CP_i and FV_n

If Score(CP_i, n) is high, then n is a good seed for generating a community with a pattern CP_i

Image source: RolX/Discovering Roles and Anomalies in Graphs: Theory and Applications, Eliassi-Rad et al. ECML PKDD 2013

Communities

Node ego nets

✅

11 of 17

Putting it together: The Bespoke Algorithm

Input:

Graph + known communities

Node CP₁CP₂ CP₃ CP₄

n₁ 0.9 0.2 0.5 0.3

n₂ 0.1 0.9 0.4 0.2

n₃0.3 0.2 0.1 0.1

CP₁

CP₄

Size of communities in CP₁

Size of communities in CP₄

Size of communities in CP₃

Size of communities in CP₂

CP₂

CP₃

Training

12 of 17

Putting it together: The Bespoke Algorithm

Community detection:

Pick CP_i based on frequency in the training dataset
Pick a node, n, with a high Score(CP_i, n)
Pick a target community size, T, from size distribution associated with CP_i
Expand in Breadth First manner till T nodes are reached

CP₁

CP₄

Size of communities in CP₁

Size of communities in CP₄

Size of communities in CP₃

Size of communities in CP₂

CP₂

CP₃

Node CP₁CP₂ CP₃ CP₄

n₁ 0.9 0.2 0.5 0.3

n₂ 0.1 0.9 0.4 0.2

n₃0.3 0.2 0.1 0.1

13 of 17

Results: Datasets

SNAP Stanford data repository ¹

Bespoke trains on <100 communities in all cases

	Nodes	Edges	Communities
Amazon product network	334,863	935,873	5,000
DBLP collaborations	317,080	1,049,866	5,000
Youtube	1,134,890	2,987,624	5,000
Twitter (ego)	81,306	1,768,149	3,662
Facebook (ego)	4,039	88.234	191

1: http://snap.stanford.edu/data/

14 of 17

Results: F1 scores

Better than meta-data (CESNA), embedding (ComE), and supervised (BigClam-Assisted) methods

15 of 17

Results: F1 Scores

Importance of Size and Structure information

16 of 17

Results: Community Patterns (Amazon)

Bespoke

Actual Communities

BigClam

Random Subgraphs

17 of 17

Conclusion

Present a feature space for exploring patterns in communities

Uses quick node labeling method for unlabeled graphs

Show significantly divergent patterns in communities vs random subgraphs
Proposed algorithm achieves higher F1 scores, lower run times