Graphs
Introduction to Graph Theory
Fardina Fathmiul Alam
CMSC 320 - Introduction to Data Science
2025
When you think about graph what comes to your mind ?
BUT WE ARE NOT TALKING ABOUT THEM
Outline
Today we will talk about “Graph Theory”
Graph Processing
The branch of data science that deals with extracting information from graphs by performing analysis on them is known as “Graph Analytics”.
The advent of social networks, big data and e-commerce has re-emphasized the importance of analyzing “Graph” - a unique type of data structure- which depicts relationships among its entities.
NETWORKS? GRAPHS?
Networks are systems of interrelated objects
Graphs are the mathematical models used to represent networks
In data science, we use algorithms on graphs to answer questions about real-world networks.
5
Trump
Obama
Graph Related Problems
GRAPHS
A graph G = (V,E) is a set of vertices V and edges E
Edges can be undirected or directed
7
C
B
D
A
V = {A, B, C, D}
E = {(A,B), (B,C), (C,D), (A,C)}
C
B
D
A
V = {A, B, C, D}
E = {(A,C), (C,A), (B,C), (B,D)}
Examples of directed vs undirected graphs
Nodes = Vertices
Edges = Arcs
GRAPHS
Edges can be unweighted or weighted
8
C
B
D
A
C
B
D
A
Unweighted
Weighted
1
2
4
4
Examples of unweighted and weighted graphs
Where do we see them?
Text as graphs
GRAPHS AND THE NETWORKS THEY REPRESENT
Text as graphs
Molecular Structures as Graph
(Left) 3d representation of the Citronellal molecule
(Center) Adjacency matrix of the bonds in the molecule
(Right) Graph representation of the molecule.
Social Networks as Graphs
A graph representing groups of people by modelling individuals as nodes, and their relationships as edges.
(Left) Image of a scene from the play “Othello”. (Center) Adjacency matrix of the interaction between characters in the play. (Right) Graph representation of these interactions.
IMAGES CAN BE VIEWED AS GRAPHS
A way of visualizing the connectivity of a graph is through its adjacency matrix.
NETWORKX
NetworkX is a Python library for storing, manipulating, and analyzing (small- and medium-sized) graphs
14
import networkx as nx
G=nx.Graph()
G.add_node("spam”)
G.add_edge(1,2)
print(list(G.nodes()))
print(list(G.edges()))
[1, 2, ‘spam’]
[(1,2)]
STORING A GRAPH
Three main ways to represent a graph in memory:
15
Graph Terms
Adjacency: Two vertices are adjacent if they are connected by an edge.
E.g: A and C are adjacent, D and F are adjacent etc.
(1) ADJACENCY LISTS
For each vertex, store an array of the vertices it connects to
17
Vertex | Neighbors |
A | [C] |
B | [C, D] |
C | [A] |
D | [] |
C
B
D
A
Pros:
Simple to use for edge-specific operations:
Cons: Slow for vertex-specific operations:
To find all neighbors of a vertex, you need to scan all edges (O(|E|)). (O(|V|)), deleting is hard (involves finding and removing a specific neighbor from the list of neighbors of a vertex)
(2) ADJACENCY DICTIONARIES
For each vertex, store a dictionary of vertices it connects to
18
Vertex | Neighbors |
A | {C: 1.0} |
B | {C: 1.0, D: 1.0} |
C | {A: 1.0} |
D | {} |
C
B
D
A
Pros: O(1) to add, remove, query edges
Cons: Overhead (memory, caching, etc).
(3) ADJACENCY MATRIX
Store the connectivity of the graph in a matrix
Pros:
Easy to implement, Fast Edge Queries (O(1)):
Cons:
Almost always stored as a sparse matrix
19
C
B
D
A
| A | B | C | D |
A | 0 | 0 | 1 | 0 |
B | 0 | 0 | 0 | 0 |
C | 1 | 1 | 0 | 0 |
D | 0 | 1 | 0 | 0 |
From
To
[i][j] represents the presence (or weight) of an edge between vertex i and vertex j.
(3) ADJACENCY MATRIX
20
[i][j] represents the presence (or weight) of an edge between vertex i and vertex j.
Choosing the Right Graph Representation: Adjacency List, Dictionary, and Matrix
Sparse Graphs (few edges compared to vertices):
Dense Graphs (many edges compared to vertices):
Graphs with Many Vertex and Edge Modifications (dynamic graphs):
Graph Traversal (BFS, DFS, Dijkstra's Algorithm, etc.):
21
Graph Featurization
By transforming graph data into suitable numerical features or representations, machine learning models can be applied to various tasks.
The most natural way to do this for a graph is to use something like the adjacency matrix, but each graph can have many different ones!
Featurization = turning a graph (or its nodes, edges, entire structure) into numerical vectors that ML models can use.
Graph Featurization
Graph featurization transforms graph structural information into numerical features that can be used for tasks such as classification, clustering, or link prediction.
Purpose: classification, link prediction, clustering, graph regression, etc.
We convert:
Graph Level Feature
Features that describe the entire graph.
Example:� If a graph has 10 shortest-path distances totaling 24,
Average Path Length=2410=2.4\text{Average Path Length} = \frac{24}{10} = 2.4Average Path Length=1024=2.4
Edge Level Features
Edge-level features describe the relationship between two nodes. These are essential for link prediction, relationship ranking, recommendation systems, and graph similarity tasks.
Counts how many neighbors two nodes share.
Example:� Nodes A and B:� Neighbors of A = {C, D}� Neighbors of B = {C, D, E}� Common neighbors = 2 (C and D)
Used heavily in link prediction.
Edge Level Features
Edge-level features describe the relationship between two nodes. These are essential for link prediction, relationship ranking, recommendation systems, and graph similarity tasks.
Jaccard Coefficient
Normalizes common neighbors by total neighbors.
Example:� If A and B share 2 neighbors out of 5 total unique neighbors:
J=25=0.4J = \frac{2}{5} = 0.4J=52=0.4
Graph Featurization: Examples
i.e: Describe properties of a graph: in social network classification, computing graph density helps identify if the network is sparse or tightly connected.
i.e Common Neighbors feature can help predict new friendships between users who share several mutual friends (usefully for link prediction).
i.e In a social media network like Twitter
THE VALUE (IMPORTANCE) OF A VERTEX
28
Node Level Features
Centrality
Represents a “measure of importance”
A very important concept in identifying important nodes in a graph in Graph Analytics.
Centrality measures how "central" or important a node is within a graph.
There are various types of centrality metrics, each providing a different perspective on a node's importance in the network.
IMPORTANCE OF VERTICES
Not all vertices are equally important
Centrality Analysis:
Commonly-used Measures
30
1. DEGREE CENTRALITY
1. DEGREE CENTRALITY
Degree Centrality: defines the importance of a node based on the degree of that node. The higher the degree, the more crucial it becomes in the graph.
Total Degree Centrality(v)=
In-degree(v)+Out-degree(v)
1. DEGREE CENTRALITY: EQUATIONS
Degree Centrality:
Normalized Degree Centrality:
34
n = the number of nodes in the network
di = Number of edges connected to node vi
1. DEGREE CENTRALITY
The importance of a vertex is determined by the number of vertices adjacent to it
35
For Vertex 1,
Example: Consider a social network where each node represents a person, and edges represent friendships. Nodes with higher degree centrality are those with more friends (such as popular individuals in social network).
2. CLOSENESS CENTRALITY
2. CLOSENESS CENTRALITY
“Central” vertices are important;
Closeness centrality identifies a node's importance based on how close it is to all the other nodes in the graph.
37
Intuition
A node has high closeness if it can reach all others quickly.
Quickly Recap: Geodesic Distance or shortest path distance
Geodesic distance is the shortest path between two nodes in a network, measured in terms of the number of edges traversed.
To measure “closeness centrality” , we calculate the average of the shortest path length from “the node” to every other node in the network/graph.
Geodesic Distance
The Geodesic distance d between two nodes a and b is defined as the number of edges/links between these two nodes on the shortest path(path with minimum number of edges) between them.
2. CLOSENESS CENTRALITY: Example
Emphasizing nodes that can quickly interact with the rest of the network as they can reach the whole network more quickly than non-central vertices (Celebrities ideal for marketing or an influencing strategy).
A high closeness centrality means that the node can reach all other nodes quickly, i.e., it is near to many nodes. This is an indicator of being a central or influential node in the network
40
2. CLOSENESS CENTRALITY: Equations
Importance measured by how close (central) a vertex is to other vertices.
Normalized Closeness Centrality:
41
d(vi,vj) = the distance (length of the shortest path) between vertices
vi and vj
Steps to Calculate CLOSENESS CENTRALITY
2. CLOSENESS CENTRALITY
43
Vertex 4 is more central than vertex 3
Table shows the shortest path from each one of the nodes to all others.
Step 01
Step 02:
Closeness Centrality: For Weighted Network
In this case, the shortest path is not the one with the least hops but the one with the minimum total weighted distance. For instance, in the example shown in Figure, accounting for weighted paths, the shortest distance between nodes D and G includes two hopes (D-F-G=2), instead of the single hop option (D-G=5)
Next: Betweenness Centrality
Degree Centrality vs Closeness Centrality vs Betweenness centrality
"You know everyone" - Degree Centrality:
If a node has a high degree centrality, it means that it is directly connected to a large number of other nodes in the network. In social network terms, this could be interpreted as "knowing" many people directly.
"You know a bunch of people who know everyone" - Closeness Centrality:
Closeness centrality measures how quickly a node can reach all other nodes in the network. If a node has high closeness centrality, it is, on average, close to all other nodes. In social network terms, it means that you (or the node) are well-connected to others either directly or through a short path.
Another concept: “betweenness centrality”
Degree Centrality vs Closeness Centrality vs Betweenness centrality
“You're like the go-to person (bottleneck) who sits on many important communication routes (network), making sure everyone stays connected” - Betweenness Centrality
If you know a bunch of people who act as bridges between different groups in the network, you might have high betweenness centrality.
This is a different measure than other two!
Measures the importance of a node in a network based upon how many times it occurs in the shortest path between all pairs of nodes (finding bridge nodes) in a graph.
3. BETWEENNESS CENTRALITY
3. BETWEENNESS CENTRALITY
Betweenness centrality is a measure in graph theory that identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network.
For example, in a telecommunications network, a node with higher betweenness centrality would have more control over the network, because more information will pass through that node
Intuition
A node has high betweenness if it acts as a bridge between different parts of the graph.
3. BETWEENNESS CENTRALITY
We focus on “Vertex betweenness” for now!
3. BETWEENNESS CENTRALITY: EQUATION
Vertex betweenness counts the number of shortest paths that pass through one vertex
Betweenness Centrality:
51
The number of shortest paths between s and t
The number of shortest paths between s and t that pass vi
3. BETWEENNESS CENTRALITY
Steps to Calculate BETWEENNESS CENTRALITY
3. BETWEENNESS CENTRALITY: EXAMPLE
54
The number of shortest paths between s and t
The number of shortest paths between s and t that pass vi
What is the betweenness centrality for node 4 ?????????
EXAMPLE 1
3. BETWEENNESS CENTRALITY
EXAMPLE 1: BETWEENNESS CENTRALITY
EXAMPLE 1: BETWEENNESS CENTRALITY
EXAMPLE 1
3. BETWEENNESS CENTRALITY
EXAMPLE 1
3. BETWEENNESS CENTRALITY
Try Yourself:
3. Example: BETWEENNESS CENTRALITY
Betweenness centrality is a measure in graph theory that identifies nodes that act as bridges along the shortest paths between pairs of nodes in a network.
Alice is an employee who regularly interacts with people from various departments.
Betweenness Centrality: Alice has high betweenness centrality because she acts as a bridge between different groups. Even though Alice may not have the most direct connections (high degree), her position allows her to control or influence the flow of information between departments significantly.
Impact: If Alice were to be absent or unavailable, it might disrupt communication between departments because she plays a crucial role in ensuring information flows smoothly across the organization.
4. EIGENVECTOR CENTRALITY
A vertex’s importance is determined by the importance of the friends of that vertex
If one has many important friends, he should be important as well.
The centrality corresponds to the top eigenvector of the adjacency matrix A.
A variant of this eigenvector centrality is the PageRank score.
61
NETWORKX: CENTRALITY
Many other centrality measures implemented for you!
Degree, in-degree, out-degree
Closeness
Betweenness
Load: similar to betweenness
Eigenvector, Katz (provides additional weight to close neighbors)
62
Summary
HELPING SLIDES
What is Graph
In graph theory, a "graph" refers to a mathematical structure that the relations between a collection of object
(edges)
(nodes)
A Graph G Consists of:
Formally defined as : G= (V,E)
Vertex
Edge
Graph Terms
There are 6 vertices (A,B,C,D,E,F) and 6 edges in the Graph below
Graph Terms
Adjacency: Two vertices are adjacent if they are connected by an edge.
E.g: A and C are adjacent, D and F are adjacent etc.
Graph Terms
Degree of a vertex(node): number of edges incident/connected to that vertex.
E.g: Degree of A = 3; degree of F = 2.
In a directed graph, nodes have both an in-degree (number of incoming edges) and an out-degree (number of outgoing edges).
Graph Terms
Different types of graph
Directed Graph (Digraph): edges have a direction (go from one node to another). A →B, doesn't necessarily mean there is an edge from B to A.
Undirected Graph: edges have no direction; traverse the edge in either direction.
Weighted Graph: Each edge is assigned a numerical value or weight, representing some measure such as distance, cost, or time.
In - Degree and Out-Degree in a Directed Graph
In-degree : count of incoming edges to a node.
In-degree : count of outgoing edges from a node.
Examples:
V (Vertices) = {0, 1, 2, 3}
E (Edges) = {(0,1), (1,2), (2,3), (0,3)}
G (Graph) = {V, E}
Graph Representation: Adjacency Matrix
Adjacency matrix (2D matrix) is mostly used to represent a graph where rows and columns are denote vertices.
Tells us which nodes are connected
The values of this matrix Aij are defined as: