How do we build an automated tool for text sentiment analysis that learns from crowdsourced human annotations? This is the challenge addressed by the Bayesian Classifier Combination with Words (BCCWords) model presented in the paper:
Edwin Simpson, Matteo Venanzi, Steven Reece, Pushmeet Kohli, John Guiver, Stephen Roberts and Nicholas R. Jennings (2015). Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning. In Proceedings of the 24th International World Wide Web Conference (WWW 2015).
The problem involves classifying the sentiment of a large corpus of text snippets (hundreds of thousands of items, e.g., tweets) using only a small set of crowdsourced sentiment labels provided by human annotators. This problem is relevant to various text mining tasks, such as weather sentiment classification from Twitter [1] and disaster response applications, e.g., the Ushahidi-Haiti project, where 40,000 emergency reports were received from victims in the first week after the 2010 Haiti earthquake [2].
There are three key aspects of this problem that are relevant to the design of a crowdsourced-data-driven model for tweet classification. Firstly, each annotator may have a different reliability in labelling tweets correctly, depending on the content of the tweet. Interpreting the sentiment or relevance of a piece of text is highly subjective and, along with variations in annotators’ skill levels, this can result in disagreement amongst the annotators. Secondly, there are typically so many tweets that a small number of dedicated expert labellers would be overwhelmed, as was the case during the Haiti earthquake. As a result, the human labels may not cover the whole set of tweets: a tweet may have a single label, multiple (perhaps conflicting) labels, or none at all. Thirdly, each distinct term in the dictionary has a different probability of appearing in tweets of each sentiment class. For example, the terms "Good" and "Nice" are more likely to be used in tweets with a positive sentiment. Thus, we must be able to provide reliable classifications of each tweet by leveraging the language model inferred from the aggregated crowdsourced labels to classify the entire corpus (i.e., the full tweet set).
To solve this problem, BCCWords extends the core structure of the Bayesian Classifier Combination (BCC) model for aggregating crowdsourced labels, which was described in this other blog post, with a new feature: learning language models for automated text classification. In detail, BCC represents the reliability of each annotator through a confusion matrix expressing the labelling probabilities for each possible sentiment class. Here are examples of confusion matrices for two annotators rating tweets in five sentiment classes [negative, neutral, positive, not related, unknown] from the CrowdFlower (CF) dataset described in the paper:
As in BCC, we have K workers classifying N tweets into C possible sentiment classes. In addition, we represent the dictionary of D distinct words that appear in the tweets (after applying stemming and removing stop words). These counts are shown in the corresponding plates (the rectangles) of the factor graph illustrated below; arrows are included to show the generative flow.
Range n = new Range(N);
Range k = new Range(K);
Range c = new Range(C);
Range d = new Range(D);
Next, to deal with the sparsity of the workers' labels and the brevity of the tweets (typically workers judge only a small subset of tweets, and each tweet contains only a small subset of the words in the dictionary), we use two extra ranges, kn and nw, to represent the subset of tweets labelled by worker k and the subset of words contained in tweet n, respectively. Thus, after creating the two count arrays, WorkerTaskCount and WordCount, we initialise these ranges as follows:
VariableArray<int> WorkerTaskCount = Variable.Array<int>(k);
VariableArray<int> WordCount = Variable.Array<int>(n);
Range kn = new Range(WorkerTaskCount[k]);
Range nw = new Range(WordCount[n]);
In BCCWords, we assume that the observed worker’s labels are randomly drawn from categorical distributions with parameters specified by the rows of the worker’s confusion matrix, WorkerConfusionMatrix[k][c]. This assumption is also common to BCC. However, the key feature of BCCWords is to assume that the observed tweet’s words are randomly drawn from a categorical distribution with parameters conditioned on the tweet’s true sentiment class, ProbWord[c]. Here, both WorkerConfusionMatrix and ProbWord are latent variables, so they are initialised with conjugate Dirichlet prior distributions. In Infer.NET, this process can be coded as follows:
WorkerConfusionMatrix = Variable.Array(Variable.Array<Vector>(c), k);
WorkerConfusionMatrix[k][c] = Variable<Vector>.Random(ConfusionMatrixPrior[k][c]);
var ProbWord = Variable.Array<Vector>(c);
ProbWord[c] = Variable<Vector>.Random(ProbWordPrior).ForEach(c);
using (Variable.ForEach(k))
{
    var tweetTrueLabel = Variable.Subarray(TrueLabel, WorkerTaskIndex[k]);
    tweetTrueLabel.SetValueRange(c);
    using (Variable.ForEach(kn))
    {
        // Label inference
        using (Variable.Switch(tweetTrueLabel[kn]))
        {
            WorkerLabel[k][kn] = Variable.Discrete(WorkerConfusionMatrix[k][tweetTrueLabel[kn]]);
        }
    }
}
// Words inference
using (Variable.ForEach(n))
{
    using (Variable.Switch(TrueLabel[n]))
    {
        Words[n][nw] = Variable.Discrete(ProbWord[TrueLabel[n]]).ForEach(nw);
    }
}
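To make the generative assumptions concrete, here is a minimal stand-alone sketch of the process in plain Python (not Infer.NET code); the class count, dictionary size, and all probability values below are toy numbers invented for illustration:

```python
import random

random.seed(0)
C, D = 3, 6  # toy numbers of sentiment classes and dictionary words

# p(word | class): each class favours different words (rows sum to 1)
prob_word = [
    [0.4, 0.3, 0.1, 0.1, 0.05, 0.05],
    [0.1, 0.1, 0.4, 0.3, 0.05, 0.05],
    [0.05, 0.05, 0.1, 0.1, 0.4, 0.3],
]
# One confusion matrix per worker: row = true class, column = reported class
confusion = [
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # a reliable worker
    [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]],  # a noisy worker
]

def sample_tweet(true_label, n_words, worker_ids):
    """Draw one tweet's words and crowd labels under the BCCWords assumptions:
    each word ~ Categorical(prob_word[true_label]),
    each worker's label ~ Categorical(confusion[k][true_label])."""
    words = random.choices(range(D), weights=prob_word[true_label], k=n_words)
    labels = [random.choices(range(C), weights=confusion[k][true_label])[0]
              for k in worker_ids]
    return words, labels

words, labels = sample_tweet(true_label=2, n_words=5, worker_ids=[0, 1])
```

Inference in BCCWords runs this process in reverse: given only the words and the (possibly conflicting) labels, it recovers the confusion matrices, the per-class word distributions and the true labels.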
Then, we can observe the workers’ labels and the tweets’ words, encoded in 2D jagged arrays. Specifically, the first array, WorkerLabel, is indexed by worker, and each row contains that worker’s labels. The second array, Words, is indexed by tweet, and each row contains the dictionary indices of the words in that tweet:
int[][] workerLabel;
int[][] words;
WorkerLabel.ObservedValue = workerLabel;
Words.ObservedValue = words;
We can query the Infer.NET inference engine to obtain the posterior probabilities of the tweet’s true label and the words in each class as follows:
Discrete[] TrueLabelPosterior = Engine.Infer<Discrete[]>(TrueLabel);
Dirichlet[][] WorkerConfusionMatrixPosterior = Engine.Infer<Dirichlet[][]>(WorkerConfusionMatrix);
Dirichlet[] ProbWordPosterior = Engine.Infer<Dirichlet[]>(ProbWord);
In this way, we can obtain the estimated confusion matrix of each worker, the true label of each tweet and the word probabilities of each sentiment class.
We have applied BCCWords to the CrowdFlower dataset, which contains up to five weather sentiment annotations for tens of thousands of tweets, provided by thousands of workers. The model correctly associated positive, negative and neutral words with the corresponding sentiment classes. We can see this in the word clouds below*. These show the most probable words in each class, with word size proportional to the estimated probability of the word conditioned on the true label:
*The black boxes hide some swear words that were inferred by BCCWords within the feature set for the tweets with negative sentiment.
We can also identify the most discriminative words in each class by normalizing the word probabilities in each class by the total probability of the word across all the classes. In this way, we obtain a new set of word clouds with the most discriminative words:
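This normalization can be illustrated with a short Python sketch (the four-word dictionary and all probability values below are made up for illustration, not taken from the model):

```python
# Hypothetical p(word | class) values for a made-up four-word dictionary
vocab = ["beautiful", "dammit", "weather", "today"]
prob_word = {
    "positive": [0.5, 0.2, 0.2, 0.1],
    "negative": [0.1, 0.5, 0.2, 0.2],
    "neutral":  [0.2, 0.2, 0.3, 0.3],
}

def discriminative_scores(prob_word, vocab):
    """Divide p(word | class) by the word's total probability across classes,
    so words that are common to every class stop dominating the clouds."""
    classes = list(prob_word)
    totals = [sum(prob_word[c][i] for c in classes) for i in range(len(vocab))]
    return {c: [prob_word[c][i] / totals[i] for i in range(len(vocab))]
            for c in classes}

scores = discriminative_scores(prob_word, vocab)
```

With these toy numbers, "beautiful" scores highest for the positive class and "dammit" for the negative class, even though "weather" has the same raw probability as "beautiful" in the positive class.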
These word clouds show that the words “beautiful” and “perfect” are more discriminative for positive tweets, while the words “stayinghometweet” and “dammit” are more likely to occur in negative tweets. Notice that the word clouds of the neutral and “not related” classes mostly contain random words that are not tied to a particular sentiment. We also found that some words, like “complain”, “snowstorm” and “warm”, do not necessarily imply a positive or negative sentiment, as their interpretation is highly context dependent; therefore most of the annotators classified the corresponding tweets as “unknown”. You can find out more about the classification accuracy of BCCWords and its ability to exploit the language model to predict labels for the entire set of tweets in the paper.
[1] See http://www.crowdflower.com/blog/2013/12/crowdsourcingatscalesharedtaskchallengewinners
[2] See https://www.linkedin.com/pulse/howsocialmediacaninformunassessmentsduringmajorpatrickmeier
The source code is now available here.
A standard framework to learn the latent worker’s confusion matrix is the Bayesian Classifier Combination model [1]. This model assumes that there are K workers classifying N objects (or tasks) among C possible labels. These counts are shown in the corresponding plates (the rectangles) of the factor graph illustrated below; arrows are included to show the generative flow.
Recall that plates represent parts of the graphical model that are duplicated: if N is the size of a plate, there are N copies of all factors and variables within the plate. In Infer.NET, plates are represented by ranges:
Range n = new Range(N);
Range k = new Range(K);
Range c = new Range(C);
We also have another range, kn, that is set up to represent the jagged intersection of the worker and task plates; workers typically judge only a small subset of the tasks, and a jagged range is the syntax that allows you to specify this. Then, after creating an array of task counts, WorkerTaskCount, whose values are the number of tasks labelled by each worker (observed at runtime), the kn range is initialized as follows:
VariableArray<int> WorkerTaskCount = Variable.Array<int>(k);
Range kn = new Range(WorkerTaskCount[k]);
Then, the generative model of BCC attempts to explain the observed worker's labels as a random draw from a categorical distribution with parameters specified by the rows of the worker's confusion matrix. Specifically, each row is selected by the true task's label, which is an unobserved random variable. In Infer.NET, this process can be coded as follows:
using (Variable.ForEach(k))
{
    var workerTrueLabel = Variable.Subarray(trueLabel, WorkerTaskIndex[k]);
    workerTrueLabel.SetValueRange(c);
    using (Variable.ForEach(kn))
    {
        using (Variable.Switch(workerTrueLabel[kn]))
        {
            WorkerLabel[k][kn] = Variable.Discrete(WorkerConfusionMatrix[k][workerTrueLabel[kn]]);
        }
    }
}
For a more efficient implementation, we use the Subarray factor to extract the subset of tasks labelled by each worker from the full array of true labels, which is faster than indexing the full array of task labels.
In CommunityBCC, we make a further modelling step by adding a latent worker type variable, which we call a community. Communities are represented by confusion matrices that encode similarity patterns among the workers’ confusion matrices. Thus, we assume that the workers’ confusion matrices are not completely random, but rather tend to follow some underlying clustering patterns – such patterns are readily observable by plotting the confusion matrices of workers as learned by BCC. See this example from a dataset with three-point scale labels (-1, 0, 1):
The CommunityBCC model is designed to encode the assumptions that (i) the crowd is composed of an unknown number of communities, (ii) each worker belongs to one of these communities, and (iii) each worker’s confusion matrix is a noisy copy of their community’s confusion matrix. More formally, we assume that there are M communities of workers. Each community m has a score matrix, CommunityScoreMatrix[m], representing the unnormalized log probabilities of the community’s confusion matrix, CommunityConfusionMatrix[m]. This means that CommunityConfusionMatrix[m] is equal to the softmax of CommunityScoreMatrix[m]. Then, each worker has a community membership, Community[k], and a score matrix, WorkerScoreMatrix[k], representing the unnormalized log probabilities of the worker's confusion matrix, WorkerConfusionMatrix[k]. To relate workers to their communities, the model assumes that the worker's score matrix is a Gaussian draw from the community's score matrix, which is then converted into a discrete probability vector representing each row of the worker's confusion matrix through the softmax factor. The factor graph of the model is shown below.
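The softmax mapping from score-matrix rows to confusion-matrix rows can be sketched in a few lines of Python (the score values below are invented for illustration):

```python
import math

def softmax(scores):
    """Map a row of unnormalized log probabilities to a probability vector."""
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Invented score-matrix rows for a 3-class task: a community row with a
# strong on-diagonal score, and a worker's perturbed ("noisy") copy of it
community_row = [2.0, -1.0, -1.0]
worker_row = [2.3, -0.8, -1.4]

community_probs = softmax(community_row)
worker_probs = softmax(worker_row)
```

Because the worker's scores are only a small perturbation of the community's, both rows map to similar confusion-matrix rows that put most of their mass on the correct (first) label.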
To implement this model in Infer.NET, we need to add a community range to the BCC ranges:
Range m = new Range(M);
Then, we code the steps of the generative process of CommunityBCC as follows:
using (Variable.ForEach(k))
{
    using (Variable.Switch(Community[k]))
    {
        WorkerScoreMatrix[k][c] = Variable.VectorGaussianFromMeanAndPrecision(
            CommunityScoreMatrix[Community[k]][c],
            NoiseMatrix);
    }
    WorkerConfusionMatrix[k][c] = Variable.Softmax(WorkerScoreMatrix[k][c]);
    using (Variable.ForEach(kn))
    {
        using (Variable.Switch(trueLabel[kn]))
        {
            WorkerLabel[k][kn] = Variable.Discrete(WorkerConfusionMatrix[k][trueLabel[kn]]);
        }
    }
}
Notice that the row of the worker’s confusion matrix, WorkerConfusionMatrix[k][c], is used to generate the worker’s label in the same way as described above for BCC. At this point, we can observe a data set encoded as a matrix of task indices (i.e., the indices of the tasks labelled by each worker) and the workers’ labels:
WorkerTaskIndex.ObservedValue = taskIndices;
WorkerLabel.ObservedValue = workerLabels;
Finally, we run inference on this model in the standard way by creating an inference engine and calling Infer.
WorkerConfusionMatrix = Engine.Infer<Dirichlet[][]>(WorkerConfusionMatrix);
TrueLabel = Engine.Infer<Discrete[]>(TrueLabel);
CommunityScoreMatrix = Engine.Infer<VectorGaussian[][]>(CommunityScoreMatrix);
CommunityConfusionMatrix = Engine.Infer<Dirichlet[][]>(CommunityConfusionMatrix);
Community = Engine.Infer<Discrete[]>(Community);
How to find the number of communities
For a given data set, we can find the optimal number of communities using standard model selection. In particular, we can perform a search over a range of community counts, comparing their model evidence. So, if we assume that the community count lies within a range of 1..X, we can run CommunityBCC in a loop over this range and compute the model evidence for each community count using the Infer.NET engine as follows:
double logEvidence = Engine.Infer<Bernoulli>(Evidence).LogOdds;
Take a look at computing model evidence for model selection in Infer.NET here. Then, we can select the optimal number of communities by looking at the community count with the maximum model evidence.
How to set the other hyperparameters
In addition to the community count, the model has other hyperparameters that need to be set for training:
// Precision of the noise factor generating a worker's score matrix
double NoisePrecision;
// Per-community on- and off-diagonal values of the mean of the community score-matrix prior
Tuple<double, double>[] ScoreMeanParameters;
// Per-community precision of the community score-matrix prior
double[] ScorePrecisionParameters;
// Parameter of the community membership prior
double CommunityPseudoCount;
// Parameter of the true label prior
double BackgroundLabelProb;
You can set these hyperparameters at run time using ObservedValue. As with the community count, you can use model evidence to guide these settings. For training on our data sets, BackgroundLabelProb and CommunityPseudoCount are set uninformatively, with uniform probabilities across the labels and communities respectively, and NoisePrecision = 5. For the community score matrices, we provide two ways of specifying the mean (ScoreMeanParameters) and the precision (ScorePrecisionParameters) of their multivariate Gaussian prior. Specifically, you can either set them directly using ObservedValue, or set the values of the community confusion matrices and infer the community score matrices by calling the method GetScoreMatrixPrior(). The latter (which uses the BCC hyperparameter InitialWorkerBelief) is useful because the community confusion matrices are usually more interpretable than the community score matrices.
Evaluation
We tested this model on four different crowdsourced datasets, and our results show that it provides a number of advantages over BCC, Majority Voting (MV) and Dawid and Skene’s Expectation Maximization (EM) method. Firstly, its community reasoning enables a two-way transfer of information between workers and communities that makes the model more robust to sparse label sets, i.e., sets that contain only a few labels per worker and task. As a result, CommunityBCC converges faster to the highest classification accuracy using fewer labels. Secondly, the model provides useful information about the number of latent worker communities, their individual confusion matrices, and the community membership of each worker. The full set of results and more details about this model can be found in the paper.
Take a look at the full code (details below) and let us know your experiences in using this model.
Enjoy!
Matteo Venanzi, John Guiver and the Infer.NET team.
Full code for this model:
The full C# code can be viewed in the file attached to this post. This code includes classes for BCC and CBCC, the classes for handling their posteriors (BCCPosteriors and CBCCPosteriors), the worker’s label (Datum), and a data mapping class. The latter provides a parser for data files in the format <workerID, taskID, label, goldLabel (optional)>. This class automatically computes the label range and ensures that mappings between these entities and model indices are handled safely and consistently. There is also an active learning class (ActiveLearning) that provides the functions to reproduce an iterative label selection process, with enums for the various task selection methods (TaskSelectionMethod) and worker selection methods (WorkerSelectionMethod). The code includes a class to collect the results (Results) and the main class (Program) to run the experiments from the paper. If you want to try out this model, you must download the Infer.NET libraries from here and link them to the Visual Studio solution that runs this program. You must also provide a data set in a CSV file with rows in the format specified above. Finally, note that if your data set is large, you may want to configure your Visual Studio project to target x64 to avoid hitting 32-bit memory limits.
References:
[1] H.C. Kim and Z. Ghahramani. Bayesian classifier combination. International Conference on Artificial Intelligence and Statistics, pages 619–627, 2012.
An oft-quoted phrase is "correlation does not imply causation". It means that if A tends to be true when B is true (i.e. A and B are correlated), it is not correct to conclude that A causes B (or vice versa).
For example, if you observe that people taking a new cancer drug are surviving longer than people taking the old drug, you cannot assume that the new drug is better. It may be that the new drug is more expensive and so the people who can afford to take it can afford better healthcare in general than those who can't. In order to detect a causal relationship, you need to have data about interventions i.e. where one of the variables is directly controlled. A randomized study is an example of this, where doctors control whether to give a person a new drug or not, according to a random assignment. Because people are randomly assigned to the treated group (given the new drug) or the control group (not given the drug), the only systematic difference between these two groups is whether they had the drug or not. Other factors, like how rich they are, will not vary systematically between the two groups and so misleading conclusions about whether the drug works better can be avoided.
Inferring the probability that a causal link exists between two variables can be highly complex. In today's blog, I will show how you can use Infer.NET to do this inference automatically.
We'll consider a simple example with two binary variables A and B. The question we want to ask is whether A causes B or B causes A (in this example we won't consider any other possibilities e.g. that a third variable C causes both A and B).
We'll first look at how to model 'A causes B'. For this model, we'll assume that A is selected to be true 50% of the time and false 50% of the time (a Bernoulli distribution with parameter 0.5). We're going to have N observations for each of A and B, so in Infer.NET we write this as:
A[N] = Variable.Bernoulli(0.5);
We'll then assume that B is a noisy version of A, such that B is set to the same value as A with probability (1-q) and to the opposite value with probability q. So large values of q (e.g. 0.4) mean that B is a very noisy version of A, whereas small values of q (e.g. 0.01) mean that B is almost identical to A.
In Infer.NET we write this as:
B[N] = A[N] != Variable.Bernoulli(q);
So that's it for our first model! The complete code with array allocations and loop over N is:
var A = Variable.Array<bool>(N).Named("A"); // Array allocation for A
var B = Variable.Array<bool>(N).Named("B"); // Array allocation for B
// First model code (A causes B)
using (Variable.ForEach(N)) // Loop over N
{
A[N] = Variable.Bernoulli(0.5);
B[N] = A[N] != Variable.Bernoulli(q);
}
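If you want to sanity-check this generative model outside Infer.NET, here is a small plain-Python sketch that samples data from it; with q = 0.1, roughly 90% of the sampled pairs should agree:

```python
import random

random.seed(1)

def sample_a_causes_b(n, q):
    """Sample n (A, B) pairs from the 'A causes B' model:
    A ~ Bernoulli(0.5), B = A XOR Bernoulli(q)."""
    data = []
    for _ in range(n):
        a = random.random() < 0.5
        flip = random.random() < q       # noise: flip B with probability q
        data.append((a, a != flip))
    return data

pairs = sample_a_causes_b(10000, q=0.1)
agree = sum(a == b for a, b in pairs) / len(pairs)
```

Note that the model with A and B swapped produces data with exactly the same joint distribution, which is why observational data alone cannot distinguish the two, as we see below.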
We now want to consider the second model where B causes A. The model is going to be defined just as for the first model, but with A and B swapped. However, we don't just want to define this model, we want Infer.NET to work out which of the two models is the right one. To do this we need to introduce a binary switch variable, that I will call AcausesB. If this variable is true, then we will use the first model (A causes B) and if it is false, we will use the second model (B causes A).
The code looks like this:
var AcausesB = Variable.Bernoulli(0.5); // The model switch variable
var A = Variable.Array<bool>(N).Named("A"); // Arrays are allocated once
var B = Variable.Array<bool>(N).Named("B"); // and used for both models
using (Variable.If(AcausesB))
{
// First model code goes here (A causes B)
// i.e. the ForEach loop above
}
using (Variable.IfNot(AcausesB))
{
// Second model code goes here (B causes A)
// This is the same code as for the first model but with A and B swapped
}
To use this code to work out which model is true, we must: create an inference engine, attach data (boolean arrays) to the model by observing the values of A and B, and use the inference engine to infer the posterior distribution of AcausesB. Here's the code:
var engine = new InferenceEngine(); // Create an inference engine
A.ObservedValue = dataA; // Attach data to A
B.ObservedValue = dataB; // Attach data to B
Bernoulli AcausesBdist = engine.Infer<Bernoulli>(AcausesB); // Infer posterior
If you run the whole program, you will find that the posterior distribution AcausesBdist is always Bernoulli(0.5), that is, a 50% chance that each model is true, no matter what data you attach to A and B. In other words, Infer.NET is saying "I don't know". This is because, without interventions, it is impossible to say which of the two models is the true one.
To get this example to work, we're going to add in interventions on the variable B. For a subset of our data points, we will intervene to set the value of B directly. We will record which data points we intervene on using another variable doB. When doB is true, it indicates that we set the value of B directly, overriding the existing model above. When doB is false, B will be set according to the existing model.
Here is the code for an intervention where we set the value of B according to a coin flip i.e. Bernoulli(0.5). It should be placed above the two model definitions, since it is common to both.
var doB = Variable.Array<bool>(N).Named("doB"); // True if we intervene on B, false otherwise
using (Variable.ForEach(N))
{
using (Variable.If(doB[N]))
{
B[N] = Variable.Bernoulli(0.5); // If we intervene, set B at random.
}
}
To make sure that we do not define B more than once, we also need to modify each model so that B[N] is not set when an intervention happens. So in the first model we need to add an IfNot statement around the line setting B[N]:
using (Variable.IfNot(doB[N]))
{
B[N] = A[N] != Variable.Bernoulli(q);
}
In the second model the IfNot statement is still placed around the line setting B[N], but in this case it is the first line because A[N] and B[N] are swapped:
using (Variable.IfNot(doB[N]))
{
B[N] = Variable.Bernoulli(0.5);
}
Because the intervention affects the two models in different ways, we will now be able to tell the difference between them.
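For intuition, in this two-variable example the model comparison can even be done in closed form outside Infer.NET. The sketch below (plain Python, independent of the Infer.NET code) evaluates the exact likelihood of sampled data under both causal directions and forms the posterior; it also confirms that without interventions the posterior stays at 0.5:

```python
import math
import random

random.seed(2)

def sample_with_interventions(n, n_do, q):
    """Sample (A, B, doB) triples from the true model (A causes B);
    the first n_do points have B forced to an independent coin flip."""
    data = []
    for i in range(n):
        a = random.random() < 0.5
        do_b = i < n_do
        b = (random.random() < 0.5) if do_b else (a != (random.random() < q))
        data.append((a, b, do_b))
    return data

def log_lik(data, a_causes_b, q):
    """Exact log likelihood under either causal direction. In both models the
    root variable is a fair coin; the only asymmetry is that an intervention
    on B breaks the A->B link but not the B->A link."""
    ll = 0.0
    for a, b, do_b in data:
        ll += math.log(0.5)                  # root variable (A or B)
        if a_causes_b and do_b:
            ll += math.log(0.5)              # intervened B ignores A
        else:
            ll += math.log(q if a != b else 1.0 - q)  # noisy copy
    return ll

q = 0.1
data = sample_with_interventions(n=300, n_do=100, q=q)
# Posterior P(A causes B | data) with a uniform 0.5/0.5 prior over models
p_a_causes_b = 1.0 / (1.0 + math.exp(log_lik(data, False, q) - log_lik(data, True, q)))

# Without interventions the two likelihoods coincide and the posterior stays 0.5
data0 = sample_with_interventions(n=300, n_do=0, q=q)
p_no_do = 1.0 / (1.0 + math.exp(log_lik(data0, False, q) - log_lik(data0, True, q)))
```

Intervened points where A and B disagree are cheap for the true model (B was a coin flip) but expensive for the reversed model (an unlikely noise flip), so the posterior concentrates on the true direction as interventions accumulate.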
To test out the finished model, let's assume that the true model is the first one, so that A actually does cause B. We can create sampled data sets with interventions from this true model with various numbers of interventions N and for varying noise levels q (to see how to do this look at the full code attached). For each data set, we use the above code to compute the posterior of the true model P(AcausesB).
The following plot shows the resulting computed probability for varying N and q. To account for random variation in datasets of the same size, the computed probability has been averaged over 1000 generated datasets. The plots show that Infer.NET has correctly worked out that the first model is the most probable one, but that the probability depends on the noise level and the number of interventions in the data. The less noise there is in the relationship between A and B, the fewer interventions are needed to be confident that A causes B. For example, when q=0.2 it takes about 20 interventions to be 90% sure that A causes B, whereas when q=0.1 it takes fewer than 10.
So there we have it: how to do causal inference in Infer.NET.
The complete code for this example is attached to this post below (CausalityExample.cs); you'll need to download and reference the Infer.NET DLLs to make it run.
Thanks for reading!
John W.
Original date of blog: February 3, 2009
Original blog author: John Guiver
NOTE: This example and the corresponding code are now directly available in the Infer.NET release.
It’s been a couple of months now since we released Infer.NET to the outside world, and a blog post is overdue. I hope those of you who have downloaded it are enjoying it and have had a chance at least to go through some of the tutorials to see the scope of what we’re providing. In the process of doing this release, one point of discussion has been the question of to what level we should provide and/or support standard black-box models. Here I’m talking about the likes of Bayesian factor analysis, Bayesian PCA, HMMs, Bayesian neural nets, and the Bayes point machine. We have been reluctant to provide these as stock models because we have been keen that Infer.NET should be applied in a more customised way than just throwing data at a generic model. The click-through model is an excellent example of a thoughtful customised model which tries to capture the behaviour of a user interacting with a page of results returned by a search engine.
That said, it is often useful in the early stages for a user to look at and employ some standard models just to get started. With the release, we have already provided a set of multiclass Bayes point machine examples which can be used for a variety of classification problems. In this blog I will look at Bayesian principal component analysis – see Chris Bishop’s book Pattern Recognition and Machine Learning, for example. Most of you will be familiar with standard Principal Component Analysis (PCA), which is a widely used technique for dimensionality reduction. Given an observation space of dimension D, we would like to find a latent space of lower dimension M (the ‘principal space’) which captures most of the data. The mapping from principal space to observation space is via a linear mapping represented by a mixing matrix W. PCA can be formulated in a probabilistic way (‘probabilistic PCA’), and this in itself gives many benefits, as itemised in Bishop. If in addition we adopt a Bayesian approach to inferring W, we can also determine the optimal value for M using ‘automatic relevance determination’ (ARD).
A fully Bayesian treatment of probabilistic PCA (including ARD) is quite complex and it might take several weeks of work to implement an efficient deterministic inference algorithm. In Infer.NET, defining the model is 20 to 25 lines of code and a few hours’ work.
I am going to plunge straight in and show the factor graph for Bayesian PCA. Factor graphs are a great representation for probabilistic models, and especially useful if you are going to build models in Infer.NET. If you can draw a factor graph, you should be able to code up the model in Infer.NET (subject to the currently available distributions, factors, and message operators).
I’ll jointly explain aspects of the factor graph and Bayesian PCA as I go along. As a reminder, the first thing to note is that the variables in the problem are represented as circles/ovals, and the factors (functions, distributions, or constraints) are represented as squares/rectangles. Each factor is only ever attached to its variables and each variable is only ever attached to its factors. The big enclosing rectangles represent ‘plates’ and show replications of variables and factors across arrays. Factor graphs are important both because they provide a clarifying visualisation of complex probabilistic models and because of the close relationship with the underlying inference and sampling algorithms (particularly those based on local message passing).
The accompanying factor graph shows PCA as a generative model for randomly generating the N observations – these observations can be represented as an NxD array, and so are represented in the factor graph in the intersection of two plates of size N and D respectively. I have explicitly shown all indices in the factor graph to make the comparison with the code clearer. The generative process starts at the top left where, for each observation and for each component in latent space, we sample from a standard Gaussian to give an NxM array Z (with elements z_{nm}).
In the top right we have the matrix variable W (elements w_{md}) which maps latent space to observation space – so this will be an MxD matrix which will post-multiply Z. We have specified a maximum M in advance, but what we really want to do is learn the ‘true’ number of components that explain the data, and we do this by drawing the rows of W from a zero-mean Gaussian whose precisions a (elements a_{m}) we are going to learn. If the precision on a certain row becomes very large (i.e. the variance becomes very small) then the row has negligible effect on the mapping, and the effective dimension of the principal space is reduced by the number of such rows. The elements of a are drawn from a common Gamma prior.
The MatrixMultiply factor multiplies Z with W; as this factor takes two double arrays of variables, it sits outside the plates (shown as a white hole in the surrounding plates). The result of this factor is an NxD matrix variable T (elements t_{nd}). We then add a bias vector variable m (elements m_{d}, drawn from a common Gaussian prior) to each row of T to give the matrix variable U (elements u_{nd}). Finally, the matrix X of observations is generated by drawing from Gaussians with mean U and precision given by the vector variable p (elements p_{d}, drawn from a common Gamma prior). The elements in this final array of Gaussian factors are the likelihood factors and represent the likelihood of the observations conditional on the parameters of the model. These parameters are W, m and p, which we want to infer.
This model is captured in the following few lines of C# code which, as you can hopefully see, directly matches the factor graph (I have omitted the variable and range declarations as these don’t add further insight, but the full code can be found here):
// Mixing matrix
vAlpha = Variable.Array<double>(rM).Named("Alpha");
vW = Variable.Array<double>(rM, rD).Named("W");
vAlpha[rM] = Variable.Random<double, Gamma>(priorAlpha).ForEach(rM);
vW[rM, rD] = Variable.GaussianFromMeanAndPrecision(0, vAlpha[rM]).ForEach(rD);
// Latent variables are drawn from a standard Gaussian
vZ = Variable.Array<double>(rN, rM).Named("Z");
vZ[rN, rM] = Variable.GaussianFromMeanAndPrecision(0.0, 1.0).ForEach(rN, rM);
// Multiply the latent variables with the mixing matrix...
vT = Variable.MatrixMultiply(vZ, vW).Named("T");
// ... add in a bias ...
vMu = Variable.Array<double>(rD).Named("mu");
vMu[rD] = Variable.Random<double, Gaussian>(priorMu).ForEach(rD);
vU = Variable.Array<double>(rN, rD).Named("U");
vU[rN, rD] = vT[rN, rD] + vMu[rD];
// ... and add in some observation noise ...
vPi = Variable.Array<double>(rD).Named("pi");
vPi[rD] = Variable.Random<double, Gamma>(priorPi).ForEach(rD);
// ... to give the likelihood of observing the data
vData[rN, rD] = Variable.GaussianFromMeanAndPrecision(vU[rN, rD], vPi[rD]);
In order to test out this model, I generated 1000 data points from a model with M = 3 and D = 10. However, when doing inference, I gave it M = 6 to see if the inference could correctly determine the number of components. Here is the printout from a run. As you can see, as expected, only three of the rows of W have significant non-zero entries, and the bias and noise vectors are successfully inferred.
Mean absolute means of rows of W: 0.05 0.24 0.20 0.02 0.21 0.01
True bias: 0.95 0.75 0.20 0.20 0.30 0.35 0.65 0.20 0.25 0.40
Inferred bias: 0.90 0.70 0.22 0.22 0.31 0.37 0.65 0.17 0.25 0.42
True noise: 1.00 2.00 4.00 2.00 1.00 2.00 3.00 4.00 2.00 1.00
Inferred noise: 0.98 1.99 4.09 2.00 1.08 2.00 2.70 3.51 2.16 1.01
The full C# code for this example can be found here. This code provides a class (BayesianPCA) for constructing a general Bayesian PCA model for which we can specify priors and data at inference time. There is also some example code for running inference and extracting marginals. One important aspect of running inference on this model is that we initially need to break symmetries in the model by initialising the W marginals to random values using the InitialiseTo() method – for another example of this see the Mixture of Gaussians tutorial.
I hope this provides a useful addition to your Infer.NET toolbox and also provides some insight into building models with Infer.NET. Keep the feedback and comments coming.
John Guiver and the Infer.NET team.
Original date of blog: November 24, 2010
Original blog author: David Knowles
The new release of Infer.NET supports a number of new factors you may find useful when designing your models. Some of these features have been well tested, while others are still experimental and should be used with caution. How do you know which features are experimental, I hear you ask? To answer just that question we have introduced quality bands, which are assigned to every piece of functionality. By default you'll get a warning if your model is using any experimental features. For more information and guidance on quality bands take a look at the user guide and the list of factors.
In this blog post I'll give one example of a new feature, the softmax factor, which lets you perform efficient multinomial logistic regression using Variational Message Passing (VMP). The inputs to the regression are continuous variables, and the output is categorical, i.e. belongs to one of C classes. One example would be predicting what political party someone will vote for based on some known features such as age and income. Another would be classifying cancer type based on gene expression data.
Multinomial regression involves constructing a separate linear regression for each class, the output of which is known as an auxiliary variable. The vector of auxiliary variables for all classes is put through the softmax function to give a probability vector, corresponding to the probability of being in each class. The factor graph for this model is shown below (see John G.'s blog post on Bayesian PCA for more on factor graphs).
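As a reminder of what the softmax factor computes, here is a plain C# sketch of the softmax function itself (illustrative helper code, not the Infer.NET factor): it exponentiates each auxiliary variable and normalises, so the output is a valid probability vector.

```csharp
using System;
using System.Linq;

static class SoftmaxSketch
{
    // softmax(g)_c = exp(g_c) / sum_k exp(g_k), computed with the usual
    // max-subtraction trick for numerical stability.
    public static double[] Softmax(double[] g)
    {
        double max = g.Max();
        double[] exp = g.Select(v => Math.Exp(v - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(v => v / sum).ToArray();
    }
}
```

For instance, Softmax(new[] { 0.0, 0.0 }) returns { 0.5, 0.5 }, and a larger auxiliary value always maps to a larger class probability.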
Multinomial softmax regression is very similar in spirit to the multiclass Bayes Point Machine (BPM) which we also have as an example. The key differences are that where the BPM uses a max operation constructed from multiple "greater than" factors, multinomial regression uses the softmax factor, and that where Expectation Propagation is the recommended algorithm for the BPM (although VMP is now supported), VMP is currently the only option when using the softmax factor. Two consequences of these differences are that multinomial regression scales computationally much better with the number of classes than the BPM (linear rather than quadratic complexity), and that it is possible to have multiple counts for a single sample. An example of where these differences would make multinomial regression appropriate would be predicting word counts in documents, based on some known features such as date of publication. The number of classes C would be the size of the vocabulary, which could be very large, making the quadratic O(C^2) scaling of the BPM prohibitive. The outputs are counts rather than single class assignments, which are straightforward to model with multinomial regression using the multinomial factor.
Now let's set up our model in Infer.NET. First we'll need ranges over the C classes and N samples.
var c = new Range(C);
var n = new Range(N);
Here I've used lower case to denote the Infer.NET ranges, and upper case to denote the (fixed) size of the range. We need a separate vector of coefficients B and mean m for each class c. The coefficient vectors B have length K, which is the number of features (inputs) to the model.
var B = Variable.Array<Vector>(c);
B[c] = Variable.VectorGaussianFromMeanAndPrecision(Vector.Zero(K),
PositiveDefiniteMatrix.Identity(K)).ForEach(c);
var m = Variable.Array<double>(c);
m[c] = Variable.GaussianFromMeanAndPrecision(0, 1).ForEach(c);
var x = Variable.Array<Vector>(n);
x.ObservedValue = xObs;
var yData = Variable.Array(Variable.Array<int>(c), n);
yData.ObservedValue = yObs;
var trialsCount = Variable.Array<int>(n);
trialsCount.ObservedValue = yObs.Select(o => o.Sum()).ToArray();
The variable trialsCount is simply the total count for each sample, which I've calculated on the fly here using a little LINQ magic. We are now ready to write down the core of the model: we create the auxiliary variables g, pass them through the softmax factor, and use the resulting probability vectors in the multinomial factor. If you have an individual class assignment for each sample rather than counts you could use the Discrete rather than Multinomial factor.
var g = Variable.Array(Variable.Array<double>(c), n);
g[n][c] = Variable.InnerProduct(B[c], x[n]) + m[c];
var p = Variable.Array<Vector>(n);
p[n] = Variable.Softmax(g[n]);
using (Variable.ForEach(n))
yData[n] = Variable.Multinomial(trialsCount[n], p[n]);
A problem we need to address here is that the model is not identifiable as we currently have it set up. You can see this because adding a constant value to all the auxiliary variables g will not change the output of the prediction. To make the model identifiable we enforce that the last auxiliary variable (corresponding to the last class) will be zero, by constraining the last coefficient vector and mean to be zero:
Variable.ConstrainEqual(B[C - 1], Vector.Zero(K));
Variable.ConstrainEqual(m[C - 1], 0);
Finally we run inference over the model:
var ie = new InferenceEngine(new VariationalMessagePassing());
ie.Compiler.GivePriorityTo(typeof(SaulJordanSoftmaxOp_NCVMP));
var bPost = ie.Infer<VectorGaussian[]>(B);
var meanPost = ie.Infer<Gaussian[]>(m);
A specific handling of the softmax factor is required because its input is deterministic: this is specified by the GivePriorityTo function. If you don't do this, you'll get an error saying the input to the softmax function cannot be deterministic. For some factors, multiple implementations are available with different characteristics, and GivePriorityTo allows you to choose which version you would like to use. The default implementation for the softmax factor assumes the input is stochastic, which allows a faster-converging algorithm to be used.
Let's test this model using synthetic data drawn from the same model with six features, four classes and a total count of ten per sample. For a sample size of 1000, the following plot shows the accuracy of the inferred coefficients and mean. We have recovered the coefficients really well; but of course we had quite a lot of data! Running 50 iterations of VMP took only three seconds.
A good sanity check is to see how the inference performance improves with increasing sample size. The following plot shows the improvement in the RMSE (root mean squared error) between the true and mean inferred coefficients. Of course, we would expect this to decrease as the number of samples increases because each sample gives us some more information to use. Indeed we see that this is the case: error rapidly decreases with sample size. Note the diminishing returns, which are typical of statistical estimation of unknown quantities: initially additional samples help us a lot, but once we have a reasonable amount of data, new samples don't tell us much we didn't already know.
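For completeness, the RMSE used in this comparison can be computed in a couple of lines of C#. This is a sketch, assuming the true and mean inferred coefficients have been flattened into plain arrays:

```csharp
using System;
using System.Linq;

static class ErrorMetrics
{
    // Root mean squared error between two equal-length vectors:
    // sqrt(mean((truth_i - estimate_i)^2)).
    public static double Rmse(double[] truth, double[] estimate) =>
        Math.Sqrt(truth.Zip(estimate, (t, e) => (t - e) * (t - e)).Average());
}
```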
The code including generating synthetic data, setting up the model, running inference and computing errors is available here. Feel free to download it, try it out on your own data or modify/extend the model however you please. If you don't have it yet, be sure to install the latest Infer.NET release.
I hope this blog has shown you the potential of just one of the new factors in the new release, the softmax factor. The possibilities extend far beyond simple multinomial regression. The great thing about the softmax factor is that any time you want to model probability vectors with complex dependencies, you can model those dependencies in continuous space and map to probability vectors using the softmax factor. A model of this type in the machine learning literature is the correlated topic model, a version of Latent Dirichlet Allocation where the document-specific distributions over topics are represented as continuous vectors drawn from a learnt multivariate Gaussian, and are put through a softmax function to give valid probability distributions.
I hope you've enjoyed reading about just one new piece of functionality available in the new release of Infer.NET, and that you feel inspired to get stuck in designing your own probabilistic models!
Happy modelling!
David K and the Infer.NET team.
Original date of blog: November 18, 2010
Original blog author: John Guiver
This is the second in a series of blogs about the Infer.NET 2.4 Beta 1 release, and highlights the benefits of an important design principle of Infer.NET: the separation of model and inference.
The Infer.NET API provides a way of declaratively describing a model, and the Infer.NET compiler generates the code that runs inference on the model. The model encodes prior domain knowledge whereas the inference specifies computational constraints. There are some important reasons to keep these separate: you can ask new questions of the same model, switch inference algorithms without changing the model, and extend the model without rewriting the inference code.
This blog illustrates each of these points. To make things concrete we will look at a simple example of modeling the time it takes to bike in to work. When we are biking into work the first time we may only have a vague idea of how long it might take, but as we record the timings each day we gradually get a better idea.
Our model is as follows. We have a number of timings, N; we make this a variable because we will want to run the model on several occasions. We want to be able to specify our prior belief in the average time it will take; this is averageTimePrior in the code below, which is a distribution we can set at runtime. The average time (averageTime) is then a variable which we want to infer and, prior to seeing any data, it is distributed according to our prior belief averageTimePrior. Finally, we need to attach our observed timings, and we make an assumption that our observations are randomly scattered around the unknown averageTime with a known precision of 1.0 (precision is 1.0/variance and is a more natural parameter with which to describe a Gaussian distribution). We could think of this noise as reflecting various random changes in traffic, weather etc.
var N = Variable.New<int>();
var n = new Range(N);
var averageTimePrior = Variable.New<Gaussian>();
var averageTime = Variable.Random<double, Gaussian>(averageTimePrior);
var noise = 1.0;
var timings = Variable.Array<double>(n);
timings[n] = Variable.GaussianFromMeanAndPrecision(averageTime, noise).ForEach(n);
Note that this makes very clear what the assumptions are, in this case, the data is generated from a Gaussian with fixed noise and random mean with an as yet unspecified Gaussian prior. At this point we also have not specified any data in the form of observed timings.
Suppose we now are given some data and we want to infer the average time it takes to cycle into work. At this point, we need to specify our prior belief for the average time (our initial assumption, on the first line below, says that it will take 20 minutes, but we are very unsure about this so the variance is large). We also need to hook up the data:
averageTimePrior.ObservedValue = Gaussian.FromMeanAndVariance(20, 100);
timings.ObservedValue = new double[] { 15, 17, 16, 19, 13 };
N.ObservedValue = timings.ObservedValue.Length;
We can infer the posterior average time as follows:
InferenceEngine engine = new InferenceEngine();
engine.Algorithm = new ExpectationPropagation();
Gaussian averageTimePosterior = engine.Infer<Gaussian>(averageTime);
This will give us a posterior Gaussian distribution which, based on the observations, updates our initial belief (20 minutes with large uncertainty) to an updated belief (16 minutes with a standard deviation of about 0.45). There is still uncertainty in the average time it takes to ride into work because we only have a few timings under our belt; the more timings we make, the more confident we'll be in the accuracy of the average time.
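Because this first model is conjugate, the posterior can also be worked out in closed form, which makes a handy sanity check on the inference. The following plain C# sketch (not part of the original blog code) implements the standard Gaussian update for a mean with known noise precision; with the prior and timings above it gives a mean of about 16.0 and a standard deviation of about 0.45, matching the Infer.NET result:

```csharp
using System;
using System.Linq;

static class GaussianPosterior
{
    // Conjugate update for a Gaussian mean with known noise precision:
    //   posterior precision = prior precision + n * noise precision
    //   posterior mean = (priorPrec * priorMean + noisePrec * sum(data)) / postPrec
    public static (double Mean, double StdDev) Infer(
        double priorMean, double priorVariance, double noisePrecision, double[] data)
    {
        double priorPrec = 1.0 / priorVariance;
        double postPrec = priorPrec + data.Length * noisePrecision;
        double postMean = (priorPrec * priorMean + noisePrecision * data.Sum()) / postPrec;
        return (postMean, Math.Sqrt(1.0 / postPrec));
    }
}
```

Calling GaussianPosterior.Infer(20, 100, 1.0, new double[] { 15, 17, 16, 19, 13 }) gives a mean of roughly 16.01 and a standard deviation of roughly 0.447.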
Now suppose we want to ask a slightly different question: when I ride in tomorrow, what is the probability that it will take less than 20 minutes? To answer this question we can add a 'tomorrowsTime' variable to the model and directly make that query of Infer.NET:
Variable<double> tomorrowsTime = Variable.GaussianFromMeanAndPrecision(averageTime, noise);
double probLessThan20Mins = engine.Infer<Bernoulli>(tomorrowsTime < 20).GetProbTrue();
The answer can be found in the table at the end of this blog.
The code above uses Expectation Propagation. However, Infer.NET supports three main algorithms, the other two being Variational Message Passing and Block Gibbs Sampling. As documented in the User Guide, these all have different characteristics, for example, trading off accuracy against speed. Also, not all algorithms are applicable to every model; refer to the list of factors and constraints for more guidance about this. For this simple model, all three of these algorithms are applicable. If we want to try a different algorithm, we don't need to change the model; instead we simply reset the engine's algorithm:
engine.Algorithm = new VariationalMessagePassing();
engine.Algorithm = new GibbsSampling();
Each engine can then be run in turn and, depending on the model, will give slightly different results due to the nature of the approximations. The Gibbs Sampling inference should be exact if enough iterations are run. Some comparative results are given at the end of this blog.
In our initial model we made an assumption about the size of the noise representing the day-to-day variation. Suppose now we want to change our model so that we learn the noise rather than make an assumption about its value. This is easily done by making the noise a random variable (with some prior) rather than a fixed variable:
var noise = Variable.GammaFromShapeAndRate(2, 2);
If the model code had been mingled with the inference code, this would have been much trickier. Putting all the above together, here are the results of running our second model using all three algorithms:
                                     Expectation    Variational        Gibbs
                                     Propagation    Message Passing    Sampling
AverageTime mean                     16.02          16.02              16.03
AverageTime standard deviation       1.00           0.77               0.89
Noise mean                           0.36           0.33               0.33
Noise standard deviation             0.24           0.16               0.17
Tomorrow's mean                      16.02          16.02              15.98
Tomorrow's standard deviation        1.73           1.05               1.68
Prob. tomorrow's ride < 20 minutes   98.91%         99.99%             99.15%
Now suppose that there are days where there are unforeseen events such as a puncture, or an accident which necessitates a detour. The distribution of timings on these days is different from our standard timings. We may want to improve our model to represent normal days and exceptional days. This can be done by incrementally extending the model to include branches which are conditioned on random variables. Alternatively, there are many other directions you could take to extend the model using the different modeling components of Infer.NET.
Hopefully this has given you something to think about on your ride into work, now highly confident that you will get there on time!
John G. and the Infer.NET team
Original date of blog: November 8, 2010
Original blog author: John Winn
Hello Infernauts!!
Now that version 2.4 of Infer.NET is released, we're planning a series of blog posts to describe the new features and capabilities that we've added over the last 12 months. The main focus of this version is performance, so I'm going to start off the series by showing how we've made improvements in the speed of running inference for a range of models. For those of you already using Infer.NET, the nice aspect of these changes is that your existing models will just run faster and use less memory without you having to do a thing!
I've picked two common models to compare on: a Mixture of Gaussians and a Bayes Point Machine.
The results are shown in the table below. For both versions I am using the Release DLLs, so as to get the best performance out of each version of the framework. They show the total time for inference (not compilation) as given by the ShowTimings option on the inference engine.
                       Infer.NET 2.3 beta 4   Infer.NET 2.4 beta 1   Speedup
Mixture of Gaussians   116.2                  7.5                    ×15.5
Bayes Point Machine    3186.2                 6.2                    ×514

(Times in seconds.)
As you can see, the changes we have made in version 2.4 provide significant speed-ups for several useful models when running VMP or EP. These speed-ups are mainly due to improvements in the compiler, which mean that the generated code is much more efficient. For example, we now check to see if more than one message in the message passing algorithm is being computed in the same way and, if so, remove the redundant computation and the redundant message. This optimisation means that, as well as an increase in speed, there are also significant reductions in memory usage.
Let's now look at Gibbs Sampling, which we have supported as an inference algorithm since Infer.NET 2.3. As a model, I'll use the BUGS Rats model, the 'Rats' normal hierarchical model from the examples that come with the WinBUGS Gibbs Sampling software. I ran inference on this model in both versions of Infer.NET and in WinBUGS, using 20,000 iterations of Gibbs Sampling in each case. For a further comparison, I also ran exactly the same model using 50 iterations of VMP.
                                      Infer.NET 2.3 beta 4   Infer.NET 2.4 beta 1   WinBUGS 1.4.3
BUGS Rats (20,000 Gibbs iterations)   6.1                    5.8                    12.6
BUGS Rats (50 VMP iterations)         0.07                   0.07                   n/a

(Times in seconds.)
These results show that there's only a small improvement in speed in version 2.4. However, the solution is still more than twice as fast as WinBUGS! What's even more interesting is that, by using Infer.NET's ability to switch inference algorithms on the same model, we can use VMP and run in about 70 milliseconds, i.e. about 80 times faster.
In addition to improvements to the generated code, we have added support for sparse vectors which allow much more memory-efficient inference in models with large discrete or Dirichlet distributions as described in this user guide page. We'll be looking at using this new sparse support in a blog post later in this series.
Also on the topic of performance, we have upgraded the support for parallel inference described in this blog post to use the .NET 4.0 Parallel class. Parallel inference is still an experimental feature but you may well get improved performance for your model if you are running on a multicore machine.
So that's it  the same model code now runs faster. Enjoy the new, more streamlined, Infer.NET!
John W. and the Infer.NET team
Original date of blog: February 1, 2010
Original blog author: John Guiver
In this post, we're going to look at how to use Infer.NET to streamline the conference reviewing process. The process in a typical computer science conference involves each submission being reviewed by several reviewers of differing expertise, attention to detail, and time available. Back in March 2009, Peter Flach and Mohammed Zaki, program committee co-chairs of the ACM SIGKDD 2009 conference, asked us if we could come up with a model that could be used to calibrate the ratings against these differences. They then planned to use this model to help streamline the process for selecting the final acceptances.
Based on some previous experience with these types of models, we were able to come up with a model very rapidly; the Infer.NET code took about a day, plus some additional time for data and result handling. I will just describe the main aspects of the model and the corresponding code in this blog (see the end of this post for how to get the full code). We calibrate reviewer scores using a generative probabilistic model which addresses variation in reviewer accuracy, and self-assessed differences in reviewer expertise level. The model relies on the fact that each paper has several reviewers, and each reviewer scores several papers. This enables us to disentangle (up to uncertainty) the reviewer's reviewing standards and the quality of the papers. Similar forms of model have been used for assessing NIPS reviews, for rating the skills of Xbox Live players, and also for the Research Assessment Exercise for UK computer science, 2008. A factor graph of the model is shown below; arrows are included to show the generative flow.
First a bit of notation. We assume that there are R reviews, S submissions, E expertise levels, and J reviewers (we use J for 'judges' to distinguish reviewers from reviews). These counts are shown in the top right hand corner of the corresponding plates (the rectangles in the factor graph). Recall that plates represent parts of the graphical model which are duplicated: if N is the size of a plate, there are N copies of all factors and variables within the plate. In Infer.NET, plates are represented by ranges:
Range s = new Range(S);
Range j = new Range(J);
Range r = new Range(R);
Range e = new Range(E);
There is one other range that our model uses and that is a range over T thresholds  more on that later. For each review r, there is a corresponding submission s[r], expertise level e[r], and reviewer j[r]. These dependent indices are shown on the appropriate edges in the factor graph to indicate the sparse connection topology between the review plate and the other plates. So, for example, variables/factors with index r in the review plate are only connected with factors/variables with index s[r] in the submission plate etc. This sparse information needs to be represented in the Infer.NET model, and we do this by creating integer variable arrays whose observed values are the corresponding mapping (the following code makes the assumption that expertise levels start at 1):
// Constant variable arrays
var sOf = Variable.Observed((from rev in reviews select submissionToIndex[rev.Submission]).ToArray(), r);
var jOf = Variable.Observed((from rev in reviews select reviewerToIndex[rev.Reviewer]).ToArray(), r);
var eOf = Variable.Observed((from rev in reviews select ((int)rev.Expertise - 1)).ToArray(), r);
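The index maps such as submissionToIndex are ordinary dictionaries assigning each distinct entity a dense 0-based index. As an aside, here is one way such a map might be built; this helper is illustrative only and not part of the published model code:

```csharp
using System.Collections.Generic;

static class IndexMaps
{
    // Assign each distinct key a dense 0-based index, in order of first appearance.
    public static Dictionary<T, int> Build<T>(IEnumerable<T> keys)
    {
        var map = new Dictionary<T, int>();
        foreach (var key in keys)
            if (!map.ContainsKey(key))
                map[key] = map.Count;
        return map;
    }
}
```

A call such as IndexMaps.Build(from rev in reviews select rev.Submission) would then produce a submissionToIndex-style mapping.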
Our generative model attempts to explain the observed data as random samples from a generative process. Let's start describing this process by focusing attention on the top right hand area of our factor graph in the S plate. Here we suppose that there is an array of variables q (represented in the code by quality) which represent the true underlying quality of the submission. This is unobserved, and is, in fact, the main thing we want to infer. We suppose that these qualities q derive from a broad Gaussian prior:
quality[s] = Variable.GaussianFromMeanAndPrecision(m_q, p_q).ForEach(s);
where the mean of the prior is set to be at the mid-range of the rating levels. The prior parameters are common to all the quality variables (i.e. they sit outside the plate), and so we need ForEach (here shown as an inline expression) to indicate they are shared across the s plate. Similarly, we suppose that there is a different amount of noise l (represented in the code by 'expertise') associated with different expertise levels. This noise will affect the observed quality of the review; in fact we would expect that high expertise would lead to more precise reviews; however, we don't enforce this in the model:
expertise[e] = Variable.GammaFromShapeAndRate(k_e, beta_e).ForEach(e);
k_e and beta_e are shape and rate parameters for the Gamma distribution and are discussed in the paper. Given the quality of the submission and the stated expertise of the reviewer, we derive another variable s which represents the latent score of the review:
score[r] = Variable.GaussianFromMeanAndPrecision(quality[sOf[r]], expertise[eOf[r]]);
Note here that we have used the index maps sOf and eOf that we defined earlier. This completes the right hand side of the factor graph.
The left hand side is a bit trickier. We suppose that the latent score s in the review plate is compared to a number of reviewer-dependent thresholds. The number of thresholds T is set to one less than the number of rating levels, and the thresholds delimit the ranges of scores corresponding to particular ratings. So a score that is less than the smallest threshold represents the lowest rating (strong reject), a score lying between the lowest and second lowest thresholds represents the second lowest rating (weak reject), and so on. We would like to learn these thresholds for each reviewer, to account for how generous or otherwise the reviewer is. A particular reviewer's set of thresholds will be a noisy version of a standard set of thresholds, where the amount of noise is determined by the accuracy a of the reviewer. The accuracy is a precision, much like the expertise precision we discussed earlier, and derives from a Gamma distribution with shape and rate parameters k_a and beta_a. The assumption here is that more unbiased reviewers will show less variation away from an ideal set of thresholds given by the nominal thresholds:
accuracy[j] = Variable.GammaFromShapeAndRate(k_a, beta_a).ForEach(j);
theta[j][t] = Variable.GaussianFromMeanAndPrecision(theta0[t], accuracy[j]);
The final step is to incorporate the observed rating for the particular review. To do this, we create a set of bool random variables from logical expressions comparing score with threshold:
observation[t][r] = score[r] > theta[jOf[r]][t];
We can now observe the truth or falsehood of these statements directly from the reviewer's rating:
// Observations  convert from recommendation to an array of bool
bool[][] obs = new bool[T][];
for (int i=0; i < T; i++)
obs[i] = (from rev in reviews select (i < (int)rev.Recommendation)).ToArray();
observation.ObservedValue = obs;
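This encoding is easy to check in isolation: for a 0-based rating k, exactly the first k comparisons come out true, so counting the true values recovers the rating. A small plain C# sketch (illustrative only, abstracting away the Recommendation enum):

```csharp
using System.Linq;

static class RatingEncoding
{
    // Encode a 0-based rating as T booleans: obs[i] is true iff i < rating,
    // mirroring the (i < (int)rev.Recommendation) test above.
    public static bool[] Encode(int rating, int numThresholds) =>
        Enumerable.Range(0, numThresholds).Select(i => i < rating).ToArray();

    // Recover the rating by counting how many thresholds were exceeded.
    public static int Decode(bool[] obs) => obs.Count(b => b);
}
```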
We can now run inference on this model in the standard way by creating an inference engine and calling Infer. The data is passed to the main Run method as an IEnumerable<Review>; so, for example, you could pass down a List<Review>. The Review class has the following constructor:
public Review(Reviewer reviewer, Submission submission, int recommendation, int expertise)
which you can call to build up your list as you parse your data file.
The KDD 2009 program committee chairs used this model to highlight areas where further screening of submissions was needed, and the model was successful in bringing to light biases in the reviews due to the variation in standards and expertise level of different reviewers. We would love to describe the results in more detail, but unfortunately we cannot due to the confidentiality of the data. One interesting side note is that our expectation, stated earlier, that high expertise should lead to more precise reviews is in fact confirmed by the model which gives precisions of 1.287 ("Informed outsider"), 1.462 ("Knowledgeable"), and 1.574 ("Expert").
Take a look at the full code (details below) and let us know your experiences in using this model.
Enjoy!
John Guiver and the Infer.NET team.
Full code for this model:
A fuller description of the model and its application can be found here, and full C# code can be viewed at http://research.microsoft.com/infernet/blogs/reviewer_model.aspx. This code includes classes for review (Review), reviewer (Reviewer) and submission (Submission), and a Utility class to ensure that mappings between these entities and model indices are handled safely and consistently. There are also enums for expertise level (Expertise) and rating level (Recommendation); if you want to try out this model, you should change these to match the levels in your data, though note that there is an assumption that levels start at 1 in both cases. Finally, the code includes a class to collect the results (ReviewerCalibrationResults), and, the focus of this blog, the main model class (ReviewerCalibration).