Let’s imagine we work for a dating website that asks its users to fill out a short 6-item questionnaire. We’ve had 100 users complete the questionnaire. We want to segment our users into groups with similar personality types (as represented by average responses on the questionnaire) to aid us in recommending matches.

For example, our questionnaire might ask “Have you seen a Star Wars film recently?” to which the user can reply with a Yes, No, or Unsure. If there were such a thing as a “Professional” personality type its response profile on the question might look like:

 YESNOUNSURE ******************************

whereas a “Nerdy” personality type’s response profile might look like:

 YESNOUNSURE *****************************

We assume these response profiles are “out there in the world” and we try to recover them with our model, but we do not know a priori what they are.

We assume our data is generated in the following way:

• A user taking the questionnaire is one of three personality types with uniform probability.
• Each personality type corresponds to a questionnaire response profile which gives the probability of a particular response on a particular question for that personality type.
• To answer each question, the user samples from her personality type’s distribution for that question.

It is our task to recover:

• The personality type of every user in our data set
• The response profile of every personality type on each question

Note the chicken-and-egg nature of the problem: if we knew the response profiles a priori we could easily predict which personality type a user belongs to. Or, if we knew the personality type of every user in our dataset, we could then calculate the response profiles for that personality type. But we know neither.

We do, however, have a generative probabilistic model and a dataset of questionnaire responses.

We know how to calculate the probability of a question q response r given the personality type p and response profile :

To calculate an entire questionnaire response we simply multiply the individual question response probabilities together:

But we are interested in the inverse; for questionnaire response i the probability that it was generated by a person of personality type pi and the response profiles :

And, in particular, we are interested in values of  and pi that have relatively high probability under the function.

The generic approach to estimating a probability function is by Gibbs sampling, which will sample states of the model (in this case, the values of , pi, and questionnaire responses) proportional to their probability.

In a Gibbs sampler we start at an arbitrary assignment of the model state, then update each piece of the model by sampling from its conditional distribution given the rest of the model. Eventually, the frequency at which various states of the model state space are visited will be proportional to their probability under the model.

To derive the Gibbs sampler, first note that  gives the multinomial distribution over question responses for a particular question and personality type. Since there are 6 questions and we have assumed 3 personality types,  represents 18 independent multinomial distributions. pi is a vector representing that data item i is of personality type pi.

In order to update pi given , for every personality type we simply lookup in  the probability of that type responding the way the i’th data item did. We then sample from resultant multinomial over personality types (note in the code listing we call pi ahat and   is rhat):

 void updateAhat(){  float[] pr = new float[PRSN_SIZE];  for(int d = 0; d < DATA_SIZE; d++){    for(int p = 0; p < PRSN_SIZE; p++){      pr[p]=1;      for(int q = 0; q < QSTN_SIZE; q++){        pr[p] *= rhat[q][p][data[d][q]];                   }                 }    pr = normalize(pr);    ahat[d] = sampleMultinomial(pr);  }  }

To update  given pi, we simply count response frequencies across the dataset for each question for each personality type as determined by pi:

 void updateRhat(){  //Tabulate the data  rhat = new float[QSTN_SIZE][PRSN_SIZE][3];  for(int d = 0; d < DATA_SIZE; d++){    for(int q = 0; q < QSTN_SIZE; q++){      rhat[q][ahat[d]][data[d][q]]++;      }  }    //Resample  for(int q = 0; q < QSTN_SIZE; q++){    for(int p = 0; p < PRSN_SIZE; p++){      rhat[q][p] = dirichletSample(rhat[q][p]);    }  }   }

We now have a complete Gibbs sampler for the problem domain. We will generate some test questionnaire data using 3 personality types and arbitrary response profiles for each type and then see if our sampler can recover the structure of the actual generating process.

Here is an example run of the sampler:

The colors represent the 3 personality types, a column represents the state of pi at a moment in time, time moves from left to right. One can observe horizontal bands indicating stability in the sampler regarding the type of that data item, elsewhere there is less stability.

To determine how well the sampler is recovering the actual groupings, we randomly picked same-group pairs in the actual assignments and compared if the sampler grouping also placed the pair in the same group. Rather than averaging over a number of samples for the following table of results we just used one sample at the sampler iteration indicated.

 0 0.333 1 0.333-0.436 10 0.600-0.800 10000 0.750

So at 0 iterations of the sampler, the sampler is 33.3% accurate. After 10 iterations it is around 60-80% accurate at any point in time. There is little difference between 10 and 10000 iterations. Note that since the sampler assignments are produced by sampling, one should really average over a number of samples to achieve higher accuracy.

Now let’s compare the sampler personality response profiles with the “real” profiles:

In the above graphic, the first 3 columns represent the response profiles inferred by the sampler. The remaining 3 columns are the actual profiles. The rows are for each question in the questionnaire. Compare the first column (orange) with the fifth (brown). Compare the second (olive) with the last (red). The sampler seems to be recovering the actual generating response profiles with qualitatively good fit. And this is without ever having had access to the actual personality groupings nor the response profile of any group. Inasmuch as a human might be able to understand a particular response profile, the sampler essentially induced “from thin air” a rich concept from the data.