
Topics

Description | Pros and Cons | Applications | Details | Usage Hints | References | Acknowledgments


Description

Similar topic models have been developed at Berkeley and in Finland (Mark at OSTP).

Our topic model (also called the latent Dirichlet allocation model) does not use EM. Instead, it uses Markov chain Monte Carlo simulation with Gibbs sampling (see our paper for what that means).
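For reference, the quantity this Gibbs sampler draws from at each step, in the notation of the Griffiths and Steyvers PNAS paper (the n terms are count matrices that exclude the current token i, W is the lexicon size, and T is the number of topics), is roughly:

    P(z_i = j \mid z_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}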

This is a Gibbs sampler for the Latent Dirichlet Allocation generative model. It requires the Mersenne Twister random number generator code in cokus.c. To build the code on a Linux machine, put all files in a single directory and type "make bars".

INPUT:

Topics works over a term-document matrix that records how often each term in the lexicon appears in each document of your data set. Most cell entries are zero, so a sparse summary of this matrix is used to represent it.

When you determine the list of unique terms in a document set (this term set is also called a lexicon), be sure to delete all function words ("the", "a", "is", etc.). They will mess up the topics.
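A minimal Matlab sketch of this filtering step (the stoplist shown is only an illustration, and lexicon and X are assumed to be a cell array of terms and a sparse W x D term-document matrix, respectively):

    % Drop function words from the lexicon and the corresponding rows of the
    % term-document matrix before writing the Topics input file.
    stopwords = {'the', 'a', 'is', 'of', 'and', 'to', 'in'};   % incomplete example stoplist
    keep      = ~ismember(lower(lexicon), stopwords);
    lexicon   = lexicon(keep);
    X         = X(keep, :);   % word indices are implicitly renumbered by the row removal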

OPERATION AND OUTPUT:

The algorithm performs a "burn-in" of length set by the user, intended to allow the Markov Chain to reach its stationary distribution. Samples are then generated with a lag between them to reduce correlation. Once each sample is generated, summary information is saved out to a (potentially large) file sample{n}.dat, where {n} is the number of the sample, starting from 0. The code is currently written to save out the matrix of counts of the number of times a particular word occurs in a particular topic. This
matrix can be used to estimate P(w|z), the probability of a word under each topic. A Bayesian estimate consistent with the model is to add BETA to each entry in the matrix, and then normalize the columns. The estimates for each sample can then be averaged together to give an overall estimate.
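As a concrete illustration, here is a minimal Matlab sketch of that estimate (the file names and the BETA value are assumptions; BETA must match the value compiled into the sampler):

    % Estimate P(w|z): add BETA to the word-topic counts of each saved sample,
    % normalize the columns, and average the estimates over all samples.
    files   = dir('sample*.dat');
    BETA    = 0.01;                          % must match the sampler's BETA
    pwz_sum = 0;
    for n = 1:length(files)
        wp  = load(files(n).name);           % W x K matrix of word-topic counts
        est = wp + BETA;
        est = est ./ repmat(sum(est, 1), size(est, 1), 1);   % column-normalize -> P(w|z)
        pwz_sum = pwz_sum + est;
    end
    pwz = pwz_sum / length(files);           % averaged estimate of P(w|z)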

E.g., for the bars data, sample0.dat will contain 10 columns and 25 rows, with entries representing the number of times a particular word (row) was assigned to a particular topic (column). Plotting the results with the included Matlab code baranalysis.m will show the topics extracted by the model.

Notes: This code has not been optimized, and nobody has had time yet to write it in a fashion that scales naturally. This means that there is no dynamic memory allocation or any of the other things that would make the code easy to apply to new data sets. In particular, the output files tend to be large because they contain a great many zeros. The output format used here is mainly for illustrative purposes, and any output generated in this fashion should be converted into sparse matrices using Matlab or some other program for storage. A more efficient way of producing the same information is to write out the files in the same [words documents counts] format that is used for the input data, which can also be converted into a sparse matrix by Matlab.
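For example, a short Matlab sketch (the file names are illustrative) that keeps only the nonzero counts of one output sample and writes them back out as triplets:

    % Convert a dense word-topic count matrix to sparse triplet form for storage.
    wp = load('sample0.dat');                   % dense W x K word-topic count matrix
    [w, k, c] = find(sparse(wp));               % triplets of the nonzero entries
    dlmwrite('sample0_sparse.txt', [w k c], 'delimiter', ' ');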

The estimation procedure:

Look over the code to see if anything looks mysterious. There are two free parameters in the model that determine smoothing (ALPHA and BETA). The current values should work on most data sets, so there is not much to change in the code besides providing the right values for W, D, NMAX, and KMAX (the number of topics, the most important variable).

The output:

Remember that the ldacora.cpp program is a Gibbs sampler. You do not get a single estimate for the parameters (as in maximum likelihood) but a stream of samples that starts after BURNIN iterations, with LAG iterations in between. For most of our purposes, each single estimate is already quite good and no integration over samples is needed.

The program saves sparse matrices of how many word tokens were assigned to each topic (the wp matrix, saved in the wp*.txt files). Another file, ll*.txt, saves the log-likelihood values at each iteration. Ignore the dp files. Note that the displayed output also shows the log-likelihood values. These values are based on Bayesian model selection and will automatically penalize solutions with a large number of topics (see our PNAS paper).
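If you want to check how the chain is behaving, a small Matlab sketch along these lines can be used (the exact ll file name depends on the run, and one log-likelihood value per line is assumed):

    % Plot the log-likelihood trace to see whether the chain has stabilized.
    f  = dir('ll*.txt');
    ll = load(f(1).name);          % assumed: one log-likelihood value per iteration
    plot(ll);
    xlabel('iteration'); ylabel('log likelihood');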

Interpreting Topics:

When the estimation procedure has saved the first sample (after BURNIN iterations), you can use the Matlab program to save a nicely formatted text file of the topics (an example is provided in topics.txt). Note that this simulation was based on 100 topics and the file shows the ten highest-probability words in each topic. This sample was saved after 50 iterations. Better results are achieved when the Gibbs sampler runs for about 500 iterations (just keep running the code until the wp file is saved).
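A minimal Matlab sketch of this kind of topic listing (it assumes pwz is the W x T estimate of P(w|z) computed as described above, and vocab is a cell array holding the lexicon terms in the order of the word indices):

    % Print the ten highest-probability words for each topic.
    ntop = 10;
    for t = 1:size(pwz, 2)
        [p, idx] = sort(pwz(:, t), 'descend');
        fprintf('Topic %d:', t);
        fprintf(' %s', vocab{idx(1:ntop)});
        fprintf('\n');
    end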


Pros & Cons

 


Applications

 


Details

The current Topics code requires an input file that has a short header (described below) followed by three columns. The first column gives the word indices (ranging from 1 to the total number of unique words), the second column gives the document indices (ranging from 1 to the total number of documents), and the third column gives the frequency with which that "word" occurred in that "document".

Topics takes a file formatted as follows:

    keywords   documents                (describing what the two entities are)
    document index   word index   word (frequency) count for the document
    document index   word index   word (frequency) count for the document
    ...

Data input is a matrix of terms vs. documents (e.g., words vs. abstracts). An input file begins with three lines:
1) number of words (W)
2) number of documents (D)
3) number of word tokens (n)
followed by n lines with document, word, and count indices. For example, a line that reads
234 56 12
means that word #56 in the lexicon occurs 12 times in document #234.

Note: This representation is the same as that obtained when applying the find command to a sparse matrix in Matlab, and facilitates converting data back and forth.
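A minimal Matlab sketch of writing an input file from a sparse term-document matrix X (the file name is illustrative; the column order follows the document-word-count example above):

    % Write the three header lines (W, D, n) followed by one triplet per line.
    [w, d, c] = find(X);                              % X(w,d) = count of word w in document d
    fid = fopen('topicsinput.txt', 'w');
    fprintf(fid, '%d\n%d\n%d\n', size(X,1), size(X,2), length(c));
    fprintf(fid, '%d %d %d\n', [d w c]');             % document index, word index, count
    fclose(fid);
    % Going the other way: X = sparse(w, d, c) rebuilds the matrix from the triplets.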

Topics returns a hashmap that contains two matrices: one with counts of keywords by topic and the other with counts of documents by topic.


Usage Hints


References


Acknowledgments

The original Topics code was provided by Tom Griffiths and Mark Steyvers. Ben Ashpole translated the code into Java. Sriram Raghuraman integrated the code into the repository.

Information Visualization Cyberinfrastructure @ SLIS, Indiana University
Last Modified January 14, 2004