1 of 33

Module 6

Big Data Analytics Applications

2 of 33

Contents

  • Recommendation Systems
    • Introduction
    • A Model for Recommendation Systems: Utility matrix, Long-tail
    • Collaborative-Filtering System
  • Mining Social-Network Graphs
    • Social Networks as Graphs
    • Types of Social-Network
    • Clustering of Social Graphs

3 of 33

Recommendation Systems: Introduction

  • A typical recommendation system involves predicting user responses to options
  • Recommendation systems use several different technologies which can be broadly classified into two groups:
  • Content-based systems examine properties of the items recommended
    • For instance, if a Netflix user has watched many action movies, then recommend a movie classified in the database as having the “action” genre
  • Collaborative filtering systems recommend items based on similarity measures between users and/or items
    • The items recommended to a user are those preferred by similar users

4 of 33

A few common applications of recommendation systems

Product Recommendations

  • The most important use of recommendation systems is at online retailers
  • Amazon and similar online vendors strive to present each returning user with suggestions of products they might like to buy
  • These suggestions are not random, but are based on the purchasing decisions made by similar customers

Movie Recommendations

  • Netflix offers its customers recommendations of movies they might like
  • These recommendations are based on ratings provided by users in general

News Articles

  • News services have attempted to identify articles of interest to readers, based on the articles that they have read in the past
  • The similarity might be based on the similarity of important words in the documents, or on the articles that are read by people with similar reading tastes
  • The same principles apply to recommending blogs from among the millions of blogs available, videos on YouTube, or other sites where content is provided regularly

5 of 33

Utility Matrix

  • In a recommendation-system application there are two classes of entities, ‘users’ and ‘items’
  • Users have preferences for certain items, and these preferences must be teased out of the data
  • The data is represented as a utility matrix, giving for each user-item pair, a value that represents what is known about the degree of preference of that user for that item
  • Values come from an ordered set, e.g., integers 1–5 representing the number of stars that the user gave as a rating for that item
  • One can assume that the matrix is sparse, meaning that most entries are “unknown”
  • An unknown rating implies that we have no explicit information about the user’s preference for the item

6 of 33

Utility Matrix (Contd..)

  • The given matrix represents users’ ratings of movies on a 1–5 scale, with 5 the highest rating
  • Blanks represent the situation where the user has not rated the movie
  • The movie names are HP1, HP2, and HP3 for Harry Potter I, II, and III, TW for Twilight, and SW1, SW2, and SW3 for Star Wars episodes 1, 2, and 3
  • The users are represented by capital letters A through D
  • Most user-movie pairs have blanks, meaning the user has not rated the movie
  • In practice, the matrix would be even sparser, with the typical user rating only a tiny fraction of all available movies
  • The goal of a recommendation system is to predict the blanks in the utility matrix
  • For example, would user A like SW2?
  • One may design a recommendation system to take into account properties of movies, such as their producer, director, stars, or even the similarity of their names
  • If so, one might note the similarity between SW1 and SW2, and conclude that since A did not like SW1, A is unlikely to enjoy SW2 either
  • Alternatively, with much more data, one might observe that the people who rated both SW1 and SW2 tended to give them similar ratings
  • Thus, we could conclude that A would also give SW2 a low rating, similar to A’s rating of SW1
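The matrix described above matches the standard textbook example these slides follow (Mining of Massive Datasets); as a sketch, it can be stored as a sparse dict-of-dicts, with missing keys playing the role of blanks (the specific entries below are assumed from that textbook figure):

```python
# Sparse utility matrix: ratings[user][movie] = stars on a 1-5 scale.
# Missing keys are the "blanks" the recommendation system must predict.
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def rating(user, movie):
    """Return the star rating, or None for an unknown (blank) entry."""
    return ratings.get(user, {}).get(movie)

print(rating("A", "SW1"))   # 1 -- A did not like SW1
print(rating("A", "SW2"))   # None -- the blank the system should predict
```

Storing only the nonblank entries reflects the sparsity noted above: in practice a user rates only a tiny fraction of all items.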

7 of 33

Utility Matrix (Contd..)

  • Without a utility matrix, it is almost impossible to recommend items
  • However, acquiring data from which to build a utility matrix is often difficult
  • There are two general approaches to discovering the value users place on items

Approach 1: Ask users to rate items

  • Users can be asked to rate items explicitly
  • Movie ratings are generally obtained this way, and some online stores try to obtain ratings from their purchasers
  • Sites providing content, such as some news sites or YouTube, also ask users to rate items
  • This approach is limited in its effectiveness: users are generally unwilling to provide responses, and the information from those who do may be biased by the very fact that it comes from people willing to provide ratings

Approach 2: Infer preferences from behavior

  • One can make inferences from users’ behavior. Most obviously, if a user buys a product at Amazon, watches a movie on YouTube, or reads a news article, then the user can be said to “like” this item
  • Note that this sort of rating system really has only one value: 1 means that the user likes the item
  • Often, a utility matrix with this kind of data is shown with 0’s rather than blanks where the user has not purchased or viewed the item
  • However, in this case 0 is not a lower rating than 1; it is no rating at all
  • More generally, one can infer interest from behavior other than purchasing
  • For example, if an Amazon customer views information about an item, one can infer that they are interested in the item, even if they don’t buy it

8 of 33

Long tail

  • Physical delivery systems are characterized by a scarcity of resources
  • A physical store has limited shelf space, and can show the customer only a small fraction of all the choices that exist
  • On the other hand, online stores can make anything that exists available to the customer
  • Thus, a physical bookstore may have several thousand books on its shelves, but Amazon offers millions of books
  • Recommendation in the physical world is fairly simple: it is not possible to tailor the store to each individual customer
  • Thus, the choice of what is made available is governed only by the aggregate numbers
  • Typically, a bookstore will display only the books that are most popular, and a newspaper will print only the articles it believes the most people will be interested in

9 of 33

Long tail (Contd..)

  • The distinction between the physical and on-line worlds has been called the ‘long tail’ phenomenon
  • In the usual long-tail plot, the vertical axis represents popularity (the number of times an item is chosen)
  • The items are ordered on the horizontal axis according to their popularity
  • Physical stores provide only the most popular items to the left of the vertical line, while the corresponding online sites provide the entire range of items:
    • The tail as well as the popular items
  • The long-tail phenomenon forces online sites to recommend items to individual users
  • It is not possible to present all available items to the user, the way physical stores can
  • Neither can we expect users to have heard of each of the items they might like

10 of 33

Content-Based systems*

  • Content-Based systems focus on properties of items

  • Similarity of items is determined by measuring the similarity in their properties

  • In a content-based system, a profile for each item needs to be constructed, which is a record or collection of records representing important characteristics of that item

  • In simple cases, the profile consists of some characteristics of the item that are easily discovered

*Not explicitly mentioned in syllabus, however there is no harm to study something beyond syllabus!

11 of 33

Content-Based systems

  • E.g.: Features of a movie that might be relevant to a recommendation system
    • Set of actors
    • Director
    • Year
    • Genre
  • Genre is a vaguer concept
  • However, movie reviews generally assign a genre from a set of commonly used terms
  • For example, the Internet Movie Database (IMDb) assigns a genre or genres to every movie

12 of 33

Content-Based systems

  • Many other classes of items also allow us to obtain features from available data, even if that data must at some point be entered by hand
  • For instance, products often have descriptions written by the manufacturer, giving features relevant to that class of product (e.g., for a TV, it will be screen size, resolution, display technology like LED or LCD etc.)
  • Books have descriptions similar to those for movies, so we can obtain features such as author, year of publication, and genre
  • Music products such as CD’s and MP3 downloads have available features such as artist, composer, and genre
  • The ultimate goal for content-based recommendation is to create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the utility matrix
  • P.S. There are some classes of items where it is not immediately apparent what the values of features should be!!
    • E.g. Documents

13 of 33

E.g.: Discovering Features of Documents

  • Documents do not tend to have readily available information giving features
  • A substitute that has been useful in practice is the identification of words that characterize the topic of a document
  • First, eliminate stop words – the several hundred most common words, which tend to say little about the topic of a document
  • For the remaining words, compute the TF-IDF score for each word in the document
  • The ones with the highest scores are the words that characterize the document
    • TF-IDF stands for term frequency-inverse document frequency
    • This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
    • The TF-IDF weight is composed of two terms:
      • The first is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document
      • The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears

14 of 33

E.g.: Discovering Features of Documents

  • Consider a document containing 100 words wherein the word cat appears 3 times
  • The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03
  • Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = log10(10,000) = 4
  • Thus, the TF-IDF weight is the product of these quantities: 0.03 × 4 = 0.12
  • The higher the TF-IDF score, the more important or relevant the term is; as a term gets less relevant, its TF-IDF score approaches 0
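The arithmetic above can be checked in a few lines of Python (the logarithm is base 10, matching the worked example):

```python
import math

tf = 3 / 100                            # "cat" appears 3 times in 100 words
idf = math.log10(10_000_000 / 1_000)    # base-10 log, as in the example
weight = tf * idf

print(tf)                 # 0.03
print(idf)                # 4.0
print(round(weight, 2))   # 0.12
```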

15 of 33

Collaborative-Filtering System

  • Collaborative-Filtering systems focus on the relationship between users and items
  • Similarity of items is determined by the similarity of the ratings of those items by the users who have rated both items
  • Instead of using features of items to determine their similarity, the focus is on the similarity of the user ratings for two items
  • That is, in place of the item-profile vector for an item, we use its column in the utility matrix
  • Users are similar if their vectors are close according to some distance measure such as Jaccard or cosine distance
  • Recommendation for a user U is then made by looking at the users that are most similar to U in this sense, and recommending items that these users like
  • The process of identifying similar users and recommending what similar users like is called collaborative filtering

16 of 33

Ways of measuring similarity

  • The first question one must deal with is how to measure similarity of users or items from their rows or columns in the utility matrix
  • The example data is too small to draw any reliable conclusions, but its small size makes clear some of the pitfalls in picking a distance measure
  • Observe specifically users A and C
  • They rated two movies in common, but they appear to have almost diametrically opposite opinions of these movies
  • One would expect that a good distance measure would make them rather far apart

17 of 33

Alternative measures to consider

  • Jaccard distance
  • A and B have an intersection of size 1 and a union of size 5
  • Thus, their Jaccard similarity is 1/5, and their Jaccard distance is 4/5; i.e., they are very far apart
  • In comparison, A and C have a Jaccard similarity of 2/4 = 1/2, so their Jaccard distance is also 1/2
  • Thus, A appears closer to C than to B
  • Yet that conclusion seems intuitively wrong
  • A and C disagree on the two movies they both watched, while A and B seem both to have liked the one movie they watched in common
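These numbers can be reproduced by treating each user as the set of items they rated (the sets below are assumed from the textbook example these slides follow):

```python
# Each user as the set of movies they have rated
# (sets assumed from the textbook utility-matrix example)
A = {"HP1", "TW", "SW1"}
B = {"HP1", "HP2", "HP3"}
C = {"TW", "SW1", "SW2"}

def jaccard_similarity(s, t):
    """|intersection| / |union| of two sets of rated items."""
    return len(s & t) / len(s | t)

print(jaccard_similarity(A, B))   # 0.2 -> Jaccard distance 4/5
print(jaccard_similarity(A, C))   # 0.5 -> Jaccard distance 1/2
```

Note the pitfall the slide describes: by this measure A looks closer to C than to B, even though A and C disagree on every movie they both rated.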

18 of 33

Alternative measures to consider

  • Cosine Distance
  • Treating blank values as 0 (This can be misleading since it has the effect of treating the lack of a rating as more similar to disliking the movie than liking it)
  • The cosine of the angle between A and B is:

    cos(A, B) = (4 × 5) / (√(4² + 5² + 1²) × √(5² + 5² + 4²)) = 20 / (√42 × √66) ≈ 0.380

  • The cosine of the angle between A and C is:

    cos(A, C) = (5 × 2 + 1 × 4) / (√(4² + 5² + 1²) × √(2² + 4² + 5²)) = 14 / (√42 × √45) ≈ 0.322
  • Since a larger (positive) cosine implies a smaller angle and therefore a smaller distance, this measure tells us that A is slightly closer to B than to C
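With blanks treated as 0, the cosines can be computed directly; the vectors below are assumed from the textbook example, with columns ordered HP1, HP2, HP3, TW, SW1, SW2, SW3:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two rating vectors (blanks as 0)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

# Columns: HP1, HP2, HP3, TW, SW1, SW2, SW3 (assumed textbook values)
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]

print(round(cosine(A, B), 3))   # 0.38
print(round(cosine(A, C), 3))   # 0.322 -> A is slightly closer to B than to C
```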

19 of 33

Alternative measures to consider

  • Rounding the Data
  • One could try to eliminate the apparent similarity between movies a user rates highly and those with low scores by rounding the ratings
  • For instance, we could consider ratings of 3, 4, and 5 as a “1” and ratings 1 and 2 as unrated; the utility matrix would then be rounded accordingly

  • Now, the Jaccard distance between A and B is 3/4, while between A and C it is 1; i.e., C appears further from A than B does, which is intuitively correct
  • Applying cosine distance to the rounded matrix, what conclusion can be drawn?
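One way to answer the question left open above is to apply the same cosine computation to the rounded 0/1 matrix (the 0/1 values below are assumed from the textbook example; columns ordered HP1, HP2, HP3, TW, SW1, SW2, SW3):

```python
import math

def cosine(u, v):
    """Cosine between 0/1 vectors; 0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Rounded matrix: ratings 3, 4, 5 -> 1; ratings 1, 2 and blanks -> 0
A = [1, 0, 0, 1, 0, 0, 0]
B = [1, 1, 1, 0, 0, 0, 0]
C = [0, 0, 0, 0, 1, 1, 0]

print(round(cosine(A, B), 3))   # 0.408 -- A and B fairly close
print(round(cosine(A, C), 3))   # 0.0   -- A and C maximally far apart
```

So cosine distance now agrees with Jaccard distance on the rounded data: C appears further from A than B does, which is intuitively correct.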

20 of 33

Alternative measures to consider

  • Normalizing Ratings
  • If ratings are normalized by subtracting from each rating the average rating of that user, low ratings are turned into negative numbers and high ratings into positive numbers (there are exceptional cases to this too!)
  • Below is the matrix with the values normalized (blanks remain blank):

        HP1    HP2    HP3    TW     SW1    SW2    SW3
    A   2/3                  5/3   -7/3
    B   1/3    1/3   -2/3
    C                       -5/3    1/3    4/3
    D          0                                  0

21 of 33

Alternative measures to consider

  • Normalizing Ratings (Contd..)
  • If cosine distance is taken, users with opposite views of the movies they viewed in common will have vectors in almost opposite directions and can be considered as far apart as possible
  • However, users with similar opinions about the movies rated in common will have a relatively small angle between them
  • The cosine of the angle between A and B is:

    cos(A, B) = (2/3 × 1/3) / (√((2/3)² + (5/3)² + (7/3)²) × √((1/3)² + (1/3)² + (2/3)²)) ≈ 0.092

  • The cosine of the angle between A and C is:

    cos(A, C) = ((5/3)(-5/3) + (-7/3)(1/3)) / (√((2/3)² + (5/3)² + (7/3)²) × √((5/3)² + (1/3)² + (4/3)²)) ≈ -0.559

Inferences that can be drawn:

  • A and C are much further apart than A and B, and neither pair is very close.
  • Both these observations make intuitive sense, given that A and C disagree on the two movies they rated in common, while A and B give similar scores to the one movie they rated in common
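The normalized cosines can be verified numerically; the normalized values below (written as exact fractions) are assumed from the textbook example, with columns ordered HP1, HP2, HP3, TW, SW1, SW2, SW3:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (blanks stored as 0)."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

# Normalized ratings: each user's mean rating subtracted from nonblanks
A = [2/3, 0, 0, 5/3, -7/3, 0, 0]
B = [1/3, 1/3, -2/3, 0, 0, 0, 0]
C = [0, 0, 0, -5/3, 1/3, 4/3, 0]

print(round(cosine(A, B), 3))   # 0.092  -- weakly similar
print(round(cosine(A, C), 3))   # -0.559 -- nearly opposite directions
```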

22 of 33

Clustering Users and Items

  • A few drawbacks arise when trying to draw inferences from a utility matrix:
    • We have little information about user-item pairs in the sparse utility matrix
    • Even if two items belong to the same genre, there are likely to be very few users who bought or rated both
    • Even if two users both like a genre or genres, they may not have bought any items in common
  • How to deal with the above scenario:
    • Cluster items and/or users
    • Select any of the distance measures suggested in previous few slides, or any other distance measure, and use it to perform a clustering of, say, items

23 of 33

Clustering Users and Items

  • Columns represent clusters of items
  • The entry for any user U and any cluster C is the average rating that U gave to those members of cluster C that U did rate
  • If U has rated none of the cluster members, the entry for U and C is blank
  • The revised utility matrix can be used to cluster users, again using the distance measure we consider most appropriate
  • Revise the utility matrix, so the rows correspond to clusters of users, just as the columns correspond to clusters of items
  • As for item-clusters, compute the entry for a user cluster by averaging the ratings of the users in the cluster
  • The above steps can be repeated several times if one likes, i.e., one can cluster the item clusters and again merge the columns of the utility matrix that belong to one cluster
  • One can then turn to the users again, and cluster the user clusters
  • The process can repeat until one has an intuitively reasonable number of clusters of each kind
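The item-clustering step above can be sketched as follows: given a clustering of items, the revised matrix entry for a (user, cluster) pair is the average of that user's ratings of the cluster members they did rate. The data and clusters below are illustrative assumptions, not from the slides:

```python
# Illustrative sparse utility matrix and an assumed item clustering
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
}
clusters = {"HP": ["HP1", "HP2", "HP3"], "SW": ["SW1"], "TW": ["TW"]}

def revised(user):
    """Row of the revised utility matrix: one entry per item cluster."""
    row = {}
    for cname, items in clusters.items():
        rated = [ratings[user][i] for i in items if i in ratings[user]]
        if rated:                      # entry stays blank if nothing rated
            row[cname] = sum(rated) / len(rated)
    return row

print(revised("A"))   # {'HP': 4.0, 'SW': 1.0, 'TW': 5.0}
print(revised("B"))   # {'HP': 4.666666666666667}
```

The same averaging applied to rows instead of columns gives the user-cluster step; alternating the two shrinks the matrix until the cluster counts are reasonable.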

24 of 33

Exercise: The figure is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, a through h, by three users A, B, and C. Compute the following from the data of this matrix.

Q.1. Treat ratings of 3, 4, and 5 as 1, and ratings of 1, 2, and blank as 0. Compute the Jaccard and cosine distances between each pair of users.

        a   b   c   d   e   f   g   h
    A   1   1       1           1
    B       1   1   1
    C               1       1   1   1

    Sr. No.                                          Ans
    1        Jaccard similarity between A & B        0.4
    2        Jaccard distance between A & B          0.6
    3        Jaccard similarity between B & C        0.1667
    4        Jaccard distance between B & C          0.8333
    5        Jaccard similarity between A & C        0.3333
    6        Jaccard distance between A & C          0.6667
    7        Cosine similarity between A & B         0.5774
    8        Cosine similarity between B & C         0.2887
    9        Cosine similarity between A & C         0.5

(The cosine rows give the cosine of the angle between the 0/1 user vectors; the corresponding cosine distance is the angle itself, so a larger cosine means a smaller distance.)
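As a check on Q.1, the rounded matrix can be encoded as sets of liked items; the 0/1 column positions are assumed from the textbook exercise these slides reproduce:

```python
import math

# Items each user "likes" after rounding (assumed from the textbook exercise)
A = {"a", "b", "d", "g"}
B = {"b", "c", "d"}
C = {"d", "f", "g", "h"}

def jaccard(s, t):
    return len(s & t) / len(s | t)

def cosine_sets(s, t):
    """Cosine between the 0/1 vectors that the sets represent."""
    return len(s & t) / (math.sqrt(len(s)) * math.sqrt(len(t)))

print(round(jaccard(A, B), 4))       # 0.4
print(round(jaccard(B, C), 4))       # 0.1667
print(round(jaccard(A, C), 4))       # 0.3333
print(round(cosine_sets(A, B), 4))   # 0.5774
print(round(cosine_sets(B, C), 4))   # 0.2887
print(round(cosine_sets(A, C), 4))   # 0.5
```

Note that the Jaccard similarity between A and C comes out to 2/6 = 0.3333 (distance 0.6667): an intersection of size 2, consistent with the cosine value of 0.5.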

25 of 33

Exercise: The figure is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, a through h, by three users A, B, and C. Compute the following from the data of this matrix.

Q.2. Normalize the matrix by subtracting from each nonblank entry the average value for its user, and compute the cosine of the angle between each pair of users.

         a       b       c       d       e       f       g       h
    A    0.67    1.67            1.67   -2.33           -0.33   -1.33
    B            0.67    1.67    0.67   -1.33   -0.33   -1.33
    C   -1              -2       0               1       2       0

    Sr. No.                                          Ans
    1        Cosine similarity between A & B         0.5841
    2        Cosine similarity between B & C         -0.739
    3        Cosine similarity between A & C         -0.1151

(These are cosines, not distances: a negative cosine means the two users’ normalized vectors point in roughly opposite directions, i.e., those users are far apart.)
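Q.2 can be verified end to end by normalizing the original ratings and computing the cosines; the raw ratings below are assumed from the textbook exercise these slides reproduce:

```python
import math

# Original 1-5 star ratings (assumed from the textbook exercise)
ratings = {
    "A": {"a": 4, "b": 5, "d": 5, "e": 1, "g": 3, "h": 2},
    "B": {"b": 3, "c": 4, "d": 3, "e": 1, "f": 2, "g": 1},
    "C": {"a": 2, "c": 1, "d": 3, "f": 4, "g": 5, "h": 3},
}

def normalize(row):
    """Subtract the user's average rating from each nonblank entry."""
    mean = sum(row.values()) / len(row)
    return {k: v - mean for k, v in row.items()}

def cosine(u, v):
    """Cosine over sparse rows; absent items contribute 0 to the dot product."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

norm = {u: normalize(r) for u, r in ratings.items()}
print(round(cosine(norm["A"], norm["B"]), 3))   # 0.584
print(round(cosine(norm["B"], norm["C"]), 3))   # -0.74
print(round(cosine(norm["A"], norm["C"]), 3))   # -0.115
```

The slide's answers (0.5841, -0.739, -0.1151) were computed from the two-decimal rounded entries; exact fractions give 0.584, -0.740, and -0.115, leading to the same conclusions.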

26 of 33

Mining Social-Network Graphs

27 of 33

Introduction

  • The best-known example of a social network is the “friends” relation found on sites like Facebook
  • An important question about a social network is how to identify “communities,” that is, subsets of the nodes (people or other entities that form the network) with unusually strong connections
  • Communities almost never partition the set of nodes in a network; rather, communities usually overlap
    • For example, you may belong to several communities of friends or classmates
  • The people from one community tend to know each other, but people from two different communities rarely know each other

28 of 33

Social Networks as Graphs

  • The essential characteristics of a social network are:
  • There is a collection of entities that participate in the network
    • Typically, these entities are people, but they could be something else entirely
  • There is at least one relationship between entities of the network
    • On Facebook or its ilk, this relationship is called friends
    • Sometimes the relationship is all-or-nothing; two people are either friends or they are not
    • However, in other examples of social networks, the relationship has a degree
    • This degree could be discrete; e.g., friends, family, acquaintances, or none
  • There is an assumption of non-randomness or locality
    • This condition is the hardest to formalize, but the intuition is that relationships tend to cluster
    • That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related

29 of 33

Social Networks as Graphs

  • Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph

  • The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network

  • If there is a degree associated with the relationship, this degree is represented by labeling the edges
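As a minimal sketch (the names below are illustrative, not from the slides), a social graph can be stored as an adjacency list, with a separate map of edge labels when the relationship has a degree:

```python
# Unlabeled "friends" graph stored as an adjacency list (sets of neighbors)
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob":   {"Alice", "Carol"},
    "Carol": {"Alice", "Bob", "Dave"},
    "Dave":  {"Carol"},
}

# When the relationship has a degree, label the (undirected) edges
edge_labels = {
    frozenset({"Alice", "Bob"}): "family",
    frozenset({"Alice", "Carol"}): "acquaintance",
}

# Locality: Alice is related to both Bob and Carol, and indeed Bob and
# Carol are also related, closing the triangle
print("Bob" in friends["Carol"])                  # True
print(edge_labels[frozenset({"Bob", "Alice"})])   # family
```

Using `frozenset` keys makes the edge lookup order-independent, matching the undirected nature of a friends relationship.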

30 of 33

Types of Social-Network

  • There are many examples of social networks other than “friends” networks

Telephone Networks

  • Here the nodes represent phone numbers, which are really individuals
  • There is an edge between two nodes if a call has been placed between those phones in some fixed period of time
  • The edges could be weighted by the number of calls made between these phones during a given period
  • Communities in a telephone network will form from groups of people that communicate frequently: groups of friends, members of a club, or people working at the same company, for example

Email Networks

  • The nodes represent email addresses, which are again individuals
  • An edge represents the fact that there was at least one email in at least one direction between the two addresses
  • Alternatively, we may only place an edge if there were emails in both directions
  • In that way, we avoid viewing spammers as “friends” with all their victims
  • Another approach is to label edges as weak or strong
  • Strong edges represent communication in both directions, while weak edges indicate that the communication was in one direction only
  • The communities seen in email networks come from the same sorts of groupings we mentioned in connection with telephone networks

Collaboration Networks

  • Nodes represent individuals who have published research papers
  • There is an edge between two individuals who published one or more papers jointly
  • Optionally, we can label edges by the number of joint publications
  • The communities in this network are authors working on a particular topic
  • An alternative view of the same data is as a graph in which the nodes are papers
  • Two papers are connected by an edge if they have at least one author in common
  • Now, we form communities that are collections of papers on the same topic
  • Data involved in collaborative filtering can often be viewed as forming a pair of networks, one for the customers and one for the products: customers who buy the same sorts of products, e.g., science-fiction books, will form communities, and dually, products that are bought by the same customers will form communities

31 of 33

Clustering of Social Graphs

  • An important aspect of social networks is that they contain communities of entities that are connected by many edges
  • If one were to apply standard clustering techniques to a social-network graph, the first step would be to define a distance measure
  • When the edges of the graph have labels, these labels might be usable as a distance measure, depending on what they represented
  • But when the edges are unlabeled, as in a “friends” graph, there is not much one can do to define a suitable distance
  • The first instinct is to assume that nodes are close if they have an edge between them and distant if not
  • Thus, one could say that the distance d(x, y) is 0 if there is an edge (x, y) and 1 if there is no such edge
  • One could use any other two values, such as 1 and ∞, as long as the distance is closer when there is an edge

32 of 33

Clustering of Social Graphs using ‘Betweenness’

Examples worked out as numerical problems

33 of 33

Clustering of Social Graphs using Girvan-Newman Algorithm

Examples worked out as numerical problems
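As a sketch of the numerical examples, edge betweenness (the quantity the Girvan-Newman algorithm repeatedly removes) can be computed by brute force on the small seven-node example graph commonly used with this algorithm in Mining of Massive Datasets; the graph below is that assumed textbook example, and a production implementation would use the efficient BFS-based credit computation instead of enumerating paths:

```python
from collections import deque
from itertools import combinations

# Assumed textbook example graph: communities {A, B, C} and {D, E, F, G}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
         ("D", "E"), ("D", "F"), ("D", "G"), ("E", "F"), ("F", "G")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def all_shortest_paths(src, dst):
    """Enumerate every shortest path from src to dst (BFS + parent sets)."""
    dist, parents = {src: 0}, {src: []}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], parents[w] = dist[u] + 1, [u]
                q.append(w)
            elif dist[w] == dist[u] + 1:
                parents[w].append(u)
    paths = []
    def backtrack(node, suffix):
        if node == src:
            paths.append([src] + suffix)
            return
        for p in parents[node]:
            backtrack(p, [node] + suffix)
    backtrack(dst, [])
    return paths

# Each node pair spreads one unit of "credit" evenly over its shortest
# paths; an edge's betweenness is the total credit it collects
betweenness = {tuple(sorted(e)): 0.0 for e in edges}
for x, y in combinations(sorted(adj), 2):
    paths = all_shortest_paths(x, y)
    for path in paths:
        for u, v in zip(path, path[1:]):
            betweenness[tuple(sorted((u, v)))] += 1 / len(paths)

# Girvan-Newman removes the highest-betweenness edge first
print(max(betweenness, key=betweenness.get))   # ('B', 'D')
print(betweenness[("B", "D")])                 # 12.0
```

Removing the top edge (B, D) disconnects the graph into {A, B, C} and {D, E, F, G}, which is exactly the community split the betweenness heuristic is designed to find.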