1 of 74

Graphs

Introduction to Graph Theory

Fardina Fathmiul Alam

CMSC 320 - Introduction to Data Science

2025

2 of 74

When you think about graph what comes to your mind ?

BUT WE ARE NOT TALKING ABOUT THEM

3 of 74

Outline

Today we will talk about “Graph Theory”

Graph Processing

  • What is Graph and its Applications
  • Representing graphs- Storing a Graph - 3 ways
  • Graph Featurization types
  • Importance of vertices: Centrality measures
    • Common Measures:
      • Degree Centrality
      • Closeness Centrality
      • Betweenness Centrality

4 of 74

The branch of data science that deals with extracting information from graphs by performing analysis on them is known as “Graph Analytics”.

The advent of social networks, big data and e-commerce has re-emphasized the importance of analyzing “Graph” - a unique type of data structure- which depicts relationships among its entities.

5 of 74

NETWORKS? GRAPHS?

Networks are systems of interrelated objects

Graphs are the mathematical models used to represent networks

In data science, we use algorithms on graphs to answer questions about real-world networks.

5

Trump

Obama

6 of 74

Graph Related Problems

  • Find the shortest distance between nodes
  • Predict who will become friends if presented
  • Pages you may like
  • Many more

7 of 74

GRAPHS

A graph G = (V,E) is a set of vertices V and edges E

Edges can be undirected or directed

7

C

B

D

A

V = {A, B, C, D}

E = {(A,B), (B,C), (C,D), (A,C)}

C

B

D

A

V = {A, B, C, D}

E = {(A,C), (C,A), (B,C), (B,D)}

Examples of directed vs undirected graphs

Nodes = Vertices

Edges = Arcs

8 of 74

GRAPHS

Edges can be unweighted or weighted

  • Unweighted : all edges have unit weight

8

C

B

D

A

C

B

D

A

Unweighted

Weighted

1

2

4

4

Examples of unweighted and weighted graphs

9 of 74

Where do we see them?

10 of 74

Text as graphs

GRAPHS AND THE NETWORKS THEY REPRESENT

Text as graphs

11 of 74

Molecular Structures as Graph

(Left) 3d representation of the Citronellal molecule

(Center) Adjacency matrix of the bonds in the molecule

(Right) Graph representation of the molecule.

12 of 74

Social Networks as Graphs

A graph representing groups of people by modelling individuals as nodes, and their relationships as edges.

(Left) Image of a scene from the play “Othello”. (Center) Adjacency matrix of the interaction between characters in the play. (Right) Graph representation of these interactions.

13 of 74

IMAGES CAN BE VIEWED AS GRAPHS

A way of visualizing the connectivity of a graph is through its adjacency matrix.

14 of 74

NETWORKX

NetworkX is a Python library for storing, manipulating, and analyzing (small- and medium-sized) graphs

  • Uses Matplotlib for rendering
  • https://networkx.github.io/
  • conda install -c anaconda networkx

14

import networkx as nx

G=nx.Graph()

G.add_node("spam”)

G.add_edge(1,2)

print(list(G.nodes()))

print(list(G.edges()))

[1, 2, ‘spam’]

[(1,2)]

15 of 74

STORING A GRAPH

Three main ways to represent a graph in memory:

  1. Adjacency lists
  2. Adjacency dictionaries
  3. Adjacency matrix

15

16 of 74

Graph Terms

Adjacency: Two vertices are adjacent if they are connected by an edge.

E.g: A and C are adjacent, D and F are adjacent etc.

17 of 74

(1) ADJACENCY LISTS

For each vertex, store an array of the vertices it connects to

17

Vertex

Neighbors

A

[C]

B

[C, D]

C

[A]

D

[]

C

B

D

A

Pros:

Simple to use for edge-specific operations:

  • Easy to iterate over all outgoing edges and the neighbors of any vertex.
  • Easy to add an edge (simply append it)

Cons: Slow for vertex-specific operations:

To find all neighbors of a vertex, you need to scan all edges (O(|E|)). (O(|V|)), deleting is hard (involves finding and removing a specific neighbor from the list of neighbors of a vertex)

  • store each vertex neighbors in a list

18 of 74

(2) ADJACENCY DICTIONARIES

For each vertex, store a dictionary of vertices it connects to

18

Vertex

Neighbors

A

{C: 1.0}

B

{C: 1.0, D: 1.0}

C

{A: 1.0}

D

{}

C

B

D

A

Pros: O(1) to add, remove, query edges

  • Accessing a specific key in a dictionary is fast and predictable.

Cons: Overhead (memory, caching, etc).

  • Each vertex maintains a dictionary for its neighbors, potentially increasing memory usage.
  • Instead of lists, adjacency dictionaries use key-value pairs, where the key represents a vertex and the value is a list of its neighbors

19 of 74

(3) ADJACENCY MATRIX

Store the connectivity of the graph in a matrix

Pros:

Easy to implement, Fast Edge Queries (O(1)):

Cons:

  • O(|V|2) space regardless of the number of edges

Almost always stored as a sparse matrix

19

C

B

D

A

A

B

C

D

A

0

0

1

0

B

0

0

0

0

C

1

1

0

0

D

0

1

0

0

From

To

  • represents connections between vertices in a matrix (2D array), where each cell indicates whether there's an edge between two vertices.

[i][j] represents the presence (or weight) of an edge between vertex i and vertex j.

20 of 74

(3) ADJACENCY MATRIX

20

[i][j] represents the presence (or weight) of an edge between vertex i and vertex j.

21 of 74

Choosing the Right Graph Representation: Adjacency List, Dictionary, and Matrix

Sparse Graphs (few edges compared to vertices):

    • Adjacency List: Best for memory efficiency and quick neighbor iteration.
    • Adjacency Dictionary: Great for fast edge lookups and when you need to store extra data on edges (e.g., edge weights).

Dense Graphs (many edges compared to vertices):

    • Adjacency Matrix: Best choice, as it allows quick edge lookups and is space-efficient when the graph is dense.

Graphs with Many Vertex and Edge Modifications (dynamic graphs):

    • Adjacency List: Easy to update, especially for adding or removing edges.
    • Adjacency Dictionary: Good for fast edge modifications and adding attributes to edges.

Graph Traversal (BFS, DFS, Dijkstra's Algorithm, etc.):

    • Adjacency List: Typically the best choice for performance in traversal operations, especially with sparse graphs.

21

22 of 74

Graph Featurization

By transforming graph data into suitable numerical features or representations, machine learning models can be applied to various tasks.

The most natural way to do this for a graph is to use something like the adjacency matrix, but each graph can have many different ones!

Featurization = turning a graph (or its nodes, edges, entire structure) into numerical vectors that ML models can use.

23 of 74

Graph Featurization

Graph featurization transforms graph structural information into numerical features that can be used for tasks such as classification, clustering, or link prediction.

  1. Graph Level Features: features describe the entire graph as a whole and can capture global structural patterns (such as graph diameter, average path length, modularity, graph centralization etc.).

  1. Edge Level Features: features describe the relationships or connections between two nodes (Edge Betweenness Centrality, Common Neighbors etc) .

  1. Node Level Features (Today’s Topic): features describe individual nodes in the graph.
    1. Degree ( In and Out)
    2. Centrality Measure
  • Nodes → feature vectors
  • Edges → feature vectors
  • Whole graph → single vector (graph embedding)

Purpose: classification, link prediction, clustering, graph regression, etc.

We convert:

24 of 74

Graph Level Feature

Features that describe the entire graph.

Example:� If a graph has 10 shortest-path distances totaling 24,

Average Path Length=2410=2.4\text{Average Path Length} = \frac{24}{10} = 2.4Average Path Length=1024​=2.4

25 of 74

Edge Level Features

Edge-level features describe the relationship between two nodes. These are essential for link prediction, relationship ranking, recommendation systems, and graph similarity tasks.

Counts how many neighbors two nodes share.

Example:� Nodes A and B:� Neighbors of A = {C, D}� Neighbors of B = {C, D, E}� Common neighbors = 2 (C and D)

Used heavily in link prediction.

26 of 74

Edge Level Features

Edge-level features describe the relationship between two nodes. These are essential for link prediction, relationship ranking, recommendation systems, and graph similarity tasks.

Jaccard Coefficient

Normalizes common neighbors by total neighbors.

Example:� If A and B share 2 neighbors out of 5 total unique neighbors:

J=25=0.4J = \frac{2}{5} = 0.4J=52​=0.4

27 of 74

Graph Featurization: Examples

  1. Graph Level Features: Describe the entire graph structure, capturing global patterns.

i.e: Describe properties of a graph: in social network classification, computing graph density helps identify if the network is sparse or tightly connected.

  1. Edge Level Features: Focus on relationships or connections between two nodes.

i.e Common Neighbors feature can help predict new friendships between users who share several mutual friends (usefully for link prediction).

  1. Node Level Features : describe individual nodes in the graph.

i.e In a social media network like Twitter

    • Degree ( In and Out): Reflect how popular they are (how many followers or friends they have).
    • Centrality Measure: Provide insights into a user's importance within the network.

28 of 74

THE VALUE (IMPORTANCE) OF A VERTEX

28

Node Level Features

29 of 74

Centrality

Represents a “measure of importance”

A very important concept in identifying important nodes in a graph in Graph Analytics.

Centrality measures how "central" or important a node is within a graph.

  • Usually for nodes.
  • Some measures can also be defined for edges (or subgraphs, in general).

There are various types of centrality metrics, each providing a different perspective on a node's importance in the network.

30 of 74

IMPORTANCE OF VERTICES

Not all vertices are equally important

Centrality Analysis:

  • Find out the most important node(s) in one network
  • Used as a feature in classification, for visualization, etc …
    • highlighting key nodes based on their significance in the network

Commonly-used Measures

  1. Degree Centrality
  2. Closeness Centrality
  3. Betweenness Centrality
  4. Eigenvector Centrality (will not cover)

30

31 of 74

1. DEGREE CENTRALITY

32 of 74

1. DEGREE CENTRALITY

Degree Centrality: defines the importance of a node based on the degree of that node. The higher the degree, the more crucial it becomes in the graph.

  • Quantifies the number of direct connections a node has in a graph.
  • Used to find the individual nodes’ “popularity”, by looking into the number of incoming and outgoing relationships.

33 of 74

Total Degree Centrality(v)=

In-degree(v)+Out-degree(v)

34 of 74

1. DEGREE CENTRALITY: EQUATIONS

Degree Centrality:

Normalized Degree Centrality:

34

n = the number of nodes in the network

di = Number of edges connected to node vi

35 of 74

1. DEGREE CENTRALITY

The importance of a vertex is determined by the number of vertices adjacent to it

  • The larger the degree, the more important the vertex is
  • Only a small number of vertex have high degrees in many real-life networks

35

For Vertex 1,

  • degree centrality is 3;
  • Normalized degree centrality is 3/(9-1)=3/8.

Example: Consider a social network where each node represents a person, and edges represent friendships. Nodes with higher degree centrality are those with more friends (such as popular individuals in social network).

36 of 74

2. CLOSENESS CENTRALITY

37 of 74

2. CLOSENESS CENTRALITY

“Central” vertices are important;

Closeness centrality identifies a node's importance based on how close it is to all the other nodes in the graph.

  • The closeness is also known as geodesic distance (GD).

37

Intuition

A node has high closeness if it can reach all others quickly.

38 of 74

Quickly Recap: Geodesic Distance or shortest path distance

Geodesic distance is the shortest path between two nodes in a network, measured in terms of the number of edges traversed.

  • It represents the minimum number of edges that must be traversed to go from one node to another in a graph.

To measure “closeness centrality” , we calculate the average of the shortest path length from “the node” to every other node in the network/graph.

39 of 74

Geodesic Distance

The Geodesic distance d between two nodes a and b is defined as the number of edges/links between these two nodes on the shortest path(path with minimum number of edges) between them.

40 of 74

2. CLOSENESS CENTRALITY: Example

Emphasizing nodes that can quickly interact with the rest of the network as they can reach the whole network more quickly than non-central vertices (Celebrities ideal for marketing or an influencing strategy).

A high closeness centrality means that the node can reach all other nodes quickly, i.e., it is near to many nodes. This is an indicator of being a central or influential node in the network

    • A low average shortest distance indicates higher centrality, suggesting the node is well-connected and central to the network.

40

41 of 74

2. CLOSENESS CENTRALITY: Equations

Importance measured by how close (central) a vertex is to other vertices.

Normalized Closeness Centrality:

41

d(vi,vj) = the distance (length of the shortest path) between vertices

vi and vj

42 of 74

Steps to Calculate CLOSENESS CENTRALITY

43 of 74

2. CLOSENESS CENTRALITY

43

Vertex 4 is more central than vertex 3

Table shows the shortest path from each one of the nodes to all others.

Step 01

Step 02:

44 of 74

Closeness Centrality: For Weighted Network

In this case, the shortest path is not the one with the least hops but the one with the minimum total weighted distance. For instance, in the example shown in Figure, accounting for weighted paths, the shortest distance between nodes D and G includes two hopes (D-F-G=2), instead of the single hop option (D-G=5)

45 of 74

Next: Betweenness Centrality

46 of 74

Degree Centrality vs Closeness Centrality vs Betweenness centrality

"You know everyone" - Degree Centrality:

If a node has a high degree centrality, it means that it is directly connected to a large number of other nodes in the network. In social network terms, this could be interpreted as "knowing" many people directly.

"You know a bunch of people who know everyone" - Closeness Centrality:

Closeness centrality measures how quickly a node can reach all other nodes in the network. If a node has high closeness centrality, it is, on average, close to all other nodes. In social network terms, it means that you (or the node) are well-connected to others either directly or through a short path.

Another concept: “betweenness centrality”

47 of 74

Degree Centrality vs Closeness Centrality vs Betweenness centrality

“You're like the go-to person (bottleneck) who sits on many important communication routes (network), making sure everyone stays connected” - Betweenness Centrality

If you know a bunch of people who act as bridges between different groups in the network, you might have high betweenness centrality.

  • Play a crucial role in connecting different parts of the network, even if don't have the highest number of direct connections (high degree).

This is a different measure than other two!

Measures the importance of a node in a network based upon how many times it occurs in the shortest path between all pairs of nodes (finding bridge nodes) in a graph.

48 of 74

3. BETWEENNESS CENTRALITY

49 of 74

3. BETWEENNESS CENTRALITY

Betweenness centrality is a measure in graph theory that identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network.

For example, in a telecommunications network, a node with higher betweenness centrality would have more control over the network, because more information will pass through that node

  • A node with high betweenness centrality is one that lies on many of the shortest paths between other nodes in the network.

Intuition

A node has high betweenness if it acts as a bridge between different parts of the graph.

50 of 74

3. BETWEENNESS CENTRALITY

  • Vertex betweenness (traditional): measures the importance of nodes (vertices) in a graph. It indicates how many shortest paths between other nodes pass through a particular node.

  • Edge Betweenness: measures the importance of edges in a network, rather than nodes. It calculates how many shortest paths between pairs of nodes pass through a particular edge. This can help identify critical connections (edges) that hold parts of the network together.

We focus on “Vertex betweenness” for now!

51 of 74

3. BETWEENNESS CENTRALITY: EQUATION

Vertex betweenness counts the number of shortest paths that pass through one vertex

Betweenness Centrality:

51

The number of shortest paths between s and t

The number of shortest paths between s and t that pass vi

52 of 74

3. BETWEENNESS CENTRALITY

53 of 74

Steps to Calculate BETWEENNESS CENTRALITY

54 of 74

3. BETWEENNESS CENTRALITY: EXAMPLE

54

The number of shortest paths between s and t

The number of shortest paths between s and t that pass vi

What is the betweenness centrality for node 4 ?????????

55 of 74

EXAMPLE 1

3. BETWEENNESS CENTRALITY

56 of 74

EXAMPLE 1: BETWEENNESS CENTRALITY

57 of 74

EXAMPLE 1: BETWEENNESS CENTRALITY

58 of 74

EXAMPLE 1

3. BETWEENNESS CENTRALITY

59 of 74

EXAMPLE 1

3. BETWEENNESS CENTRALITY

Try Yourself:

60 of 74

3. Example: BETWEENNESS CENTRALITY

Betweenness centrality is a measure in graph theory that identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network.

Alice is an employee who regularly interacts with people from various departments.

Betweenness Centrality: Alice has high betweenness centrality because she acts as a bridge between different groups. Even though Alice may not have the most direct connections (high degree), her position allows her to control or influence the flow of information between departments significantly.

Impact: If Alice were to be absent or unavailable, it might disrupt communication between departments because she plays a crucial role in ensuring information flows smoothly across the organization.

61 of 74

4. EIGENVECTOR CENTRALITY

A vertex’s importance is determined by the importance of the friends of that vertex

If one has many important friends, he should be important as well.

The centrality corresponds to the top eigenvector of the adjacency matrix A.

A variant of this eigenvector centrality is the PageRank score.

61

62 of 74

NETWORKX: CENTRALITY

Many other centrality measures implemented for you!

Degree, in-degree, out-degree

Closeness

Betweenness

  • Applied to both edges and vertices; hard to compute

Load: similar to betweenness

Eigenvector, Katz (provides additional weight to close neighbors)

62

63 of 74

Summary

  • A graph consists of nodes (vertices) and edges, representing relationships.
  • Representing & Storing a Graph
    • Adjacency Matrix: Matrix of connections between nodes.
    • Adjacency List: List of neighbors for each node.
    • Adjacency Dictionary: Representing a graph using a dictionary
  • Graph Featurization Types
    • Graph-Level: Global properties (e.g., density, diameter).
    • Node-Level: Properties of individual nodes (e.g., degree, centrality).
    • Edge-Level: Features of relationships between nodes (e.g., edge betweenness).
  • Centrality Measures (Vertex Importance)
    • Degree Centrality: Number of direct connections a node has.
    • Closeness Centrality: Proximity to all other nodes.
    • Betweenness Centrality: Role as a bridge in the network.

64 of 74

HELPING SLIDES

65 of 74

What is Graph

In graph theory, a "graph" refers to a mathematical structure that the relations between a collection of object

(edges)

(nodes)

A Graph G Consists of:

  • Vertices V: a set of nodes

  • Edges E: A set of edges connecting pairs of nodes

Formally defined as : G= (V,E)

Vertex

Edge

66 of 74

Graph Terms

There are 6 vertices (A,B,C,D,E,F) and 6 edges in the Graph below

67 of 74

Graph Terms

Adjacency: Two vertices are adjacent if they are connected by an edge.

E.g: A and C are adjacent, D and F are adjacent etc.

68 of 74

Graph Terms

Degree of a vertex(node): number of edges incident/connected to that vertex.

E.g: Degree of A = 3; degree of F = 2.

In a directed graph, nodes have both an in-degree (number of incoming edges) and an out-degree (number of outgoing edges).

69 of 74

Graph Terms

  • There can be multiple edges between 2 vertices
  • An edge can be a loop, i.e. an edge between a vertex and itself
  • E.g:
    • 2 edges between A and D
    • There is a loop (an edge from C to C)

70 of 74

Different types of graph

Directed Graph (Digraph): edges have a direction (go from one node to another). A →B, doesn't necessarily mean there is an edge from B to A.

Undirected Graph: edges have no direction; traverse the edge in either direction.

Weighted Graph: Each edge is assigned a numerical value or weight, representing some measure such as distance, cost, or time.

71 of 74

In - Degree and Out-Degree in a Directed Graph

In-degree : count of incoming edges to a node.

In-degree : count of outgoing edges from a node.

72 of 74

73 of 74

Examples:

V (Vertices) = {0, 1, 2, 3}

E (Edges) = {(0,1), (1,2), (2,3), (0,3)}

G (Graph) = {V, E}

74 of 74

Graph Representation: Adjacency Matrix

Adjacency matrix (2D matrix) is mostly used to represent a graph where rows and columns are denote vertices.

Tells us which nodes are connected

The values of this matrix Aij are defined as: