Latent Semantic Analysis
Latent Semantic Analysis (LSA) can be applied to induce and represent aspects of the meaning of words (Berry et al., 1995; Deerwester et al., 1990; Landauer & Dumais, 1997; Landauer et al., 1998). LSA is a variant of the vector space model that converts a representative sample of documents to a term-by-document matrix in which each cell indicates the frequency with which each term (rows) occurs in each document (columns). Thus a document becomes a column vector and can be compared with a user's query represented as a vector of the same dimension. In the case of digital libraries, terms may stand for words in abstracts or titles, author names, or cited references. Only author names, titles, abstracts, and cited references occurring in more than one document are included in the analysis. LSA extends the vector space model by modeling term-document relationships with a reduced approximation of the column and row space, computed by the singular value decomposition of the term-by-document matrix. This is explained in detail subsequently.

Singular Value Decomposition (SVD) is a form of factor analysis, or more properly, the mathematical generalization of which factor analysis is a special case (Berry et al., 1995). It constructs an n-dimensional abstract semantic space in which each original term and each original (and any new) document is represented as a vector. In SVD, a rectangular term-by-document matrix X is decomposed into the product of three other matrices W, S, and P' such that {X} = {W} {S} {P}' (see figure below). W is an orthonormal matrix whose rows correspond to the rows of X, but it has m columns corresponding to new, specially derived variables such that there is no correlation between any two columns; i.e., each is linearly independent of the others. P is likewise orthonormal; its transpose P' has columns corresponding to the columns of X and m rows composed of derived singular vectors. The third matrix S is an m by m diagonal matrix with non-zero entries (called singular values) only along its diagonal. A large singular value indicates a large effect of the corresponding dimension on the sum squared error of the approximation. The role of the singular values is to relate the scale of the factors in the other two matrices to each other such that, when the three components are matrix multiplied, the original matrix is reconstructed.

Following the decomposition by SVD, the n most important dimensions (those with the highest singular values in S) are selected. All other factors are omitted, i.e., the remaining singular values in the diagonal matrix are deleted along with the corresponding singular vectors of the other two matrices. The amount of dimensionality reduction, i.e., the choice of n, is critical and an open issue in the factor analytic literature. Ideally, n should be large enough to fit the real structure in the data, but small enough that noise, sampling errors, or unimportant details are not modeled (Deerwester et al., 1990). In practice, good retrieval and browsing performance serves as an operational criterion. The reduced dimensionality solution then generates a vector of n real values to represent each document. The reduced matrix ideally captures the important and reliable patterns underlying the data in X; it corresponds to a least-squares best approximation to the original matrix X.
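To make the decomposition and the truncation to n dimensions concrete, here is a minimal sketch in Python/NumPy. It is not part of the package described below; the tiny term-by-document matrix and the choice n = 2 are made up purely for illustration.

    import numpy as np

    # Toy term-by-document matrix X: rows = terms, columns = documents.
    # The counts below are invented purely for illustration.
    X = np.array([[2., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 2., 0., 1.],
                  [0., 0., 1., 2.],
                  [1., 0., 2., 1.]])

    # Full decomposition X = W S P' (NumPy returns S as a vector of singular values).
    W, s, Pt = np.linalg.svd(X, full_matrices=False)

    # Keep only the n largest singular values and the corresponding singular vectors.
    n = 2
    W_n, s_n, Pt_n = W[:, :n], s[:n], Pt[:n, :]

    # Rank-n reconstruction: the least-squares best rank-n approximation of X.
    X_n = W_n @ np.diag(s_n) @ Pt_n

    # Each document is now represented by n real values: a column of diag(s_n) @ Pt_n,
    # equivalently a row of P_n scaled by the singular values.
    doc_coords = (np.diag(s_n) @ Pt_n).T     # shape: (4 documents, n dimensions)
    print(doc_coords)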
Similarity matrix: The similarity between documents can be computed on the basis of the three matrices as follows. The dot product between two row vectors of the reduced matrix reflects the degree to which two documents have a similar pattern of terms. (Here X is taken in its document-by-term orientation, i.e., with documents as rows, matching the input matrix used by the software described below, so that the rows of W correspond to documents.) The matrix XX' is the square symmetric matrix containing all these document-document dot products. Since S is diagonal and W is orthonormal, the following holds: XX' = W S^2 W'. Thus the i, j cell of XX' can be obtained by taking the dot product between the i-th and j-th rows of the matrix WS. That is, if one considers the rows of WS as coordinates for documents, dot products between these points give the comparisons between documents.

Data Visualization: Different algorithms can be applied to display the resulting semantic networks, among them the Spring Embedding Algorithm and Pathfinder Network Scaling. Sometimes you may wish to display a certain partition of the data into clusters/categories. Use Matlab to cluster your data hierarchically and to select the best partition of the hierarchy according to the utility measure given below.

Hierarchical Clustering: Apply nearest-neighbor-based, agglomerative, hierarchical, unsupervised conceptual clustering to create a hierarchy of clusters grouping documents of similar semantic structure. Clustering starts with a set of singleton clusters, each containing a single document d_i ∈ D, i = 1, ..., N, where D is the entire set of documents and N the number of documents. The two most similar clusters over the entire set D are merged to form a new cluster that covers both. This merging step is repeated for the remaining clusters, N-1 times in total. A complete linkage algorithm is applied to determine the overall similarity of document clusters based on the document similarity matrix. Merging of document clusters continues until a single, all-inclusive cluster remains. At termination, a uniform, binary hierarchy of document clusters is produced. The figure below shows an example of how five documents may be clustered hierarchically; the cluster hierarchy is shown on the left-hand side and the corresponding partitions on the right-hand side. Note that each document has a similarity of one to itself.

[Figure: Hierarchical Clustering]

Selection of the Best Partition: The partition showing the highest within-cluster similarity and lowest between-cluster similarity is determined by means of a utility measure that contrasts the sum of within-cluster similarities wSim with the sum of between-cluster similarities bSim: utility = wSim / (wSim + bSim). The partition that shows the highest utility is selected for data visualization.
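The steps above, from the reduced factors to the selected partition, can be sketched as follows in Python/NumPy. This is only an illustrative re-implementation of the procedure just described (document-document similarities from the rows of WS, complete-linkage agglomerative merging, and utility-based selection), not the Matlab code referred to above; the arguments W_n and s_n are assumed to be reduced factors whose rows of W_n correspond to documents, as in the document-by-term orientation used here.

    import numpy as np

    def best_partition(W_n, s_n):
        """Cluster documents hierarchically and return the partition with the highest utility."""
        coords = W_n * s_n                      # rows of WS: one coordinate vector per document
        coords = coords / np.linalg.norm(coords, axis=1, keepdims=True)  # self-similarity = 1
        sim = coords @ coords.T                 # document-document similarity matrix
        n_docs = sim.shape[0]

        clusters = [[i] for i in range(n_docs)]         # start with singleton clusters
        partitions = [[list(c) for c in clusters]]      # keep every level of the hierarchy

        def linkage(a, b):
            # Complete linkage: similarity of two clusters = smallest pairwise document similarity.
            return min(sim[i, j] for i in a for j in b)

        while len(clusters) > 1:
            # Merge the two most similar clusters over the entire set.
            _, i, j = max((linkage(clusters[i], clusters[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
            partitions.append([list(c) for c in clusters])

        def utility(partition):
            # utility = wSim / (wSim + bSim) over all unordered document pairs.
            label = {d: k for k, cluster in enumerate(partition) for d in cluster}
            w_sim = b_sim = 0.0
            for i in range(n_docs):
                for j in range(i + 1, n_docs):
                    if label[i] == label[j]:
                        w_sim += sim[i, j]
                    else:
                        b_sim += sim[i, j]
            return w_sim / (w_sim + b_sim) if (w_sim + b_sim) > 0 else 0.0

        # Note: if all similarities are positive, the single all-inclusive cluster maximizes
        # this ratio trivially; in that case one may restrict the search to non-trivial partitions.
        return max(partitions, key=utility)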
The advantage of SVD lies in the fact that documents are not represented by individual terms but by a smaller number of independent "artificial values" that can be specified by any one of several terms or combinations thereof. In this way, relevant documents that do not contain the exact terms of a query can still be retrieved, because they lie close to the query in the reduced space. The resulting retrieval scheme allows one to order documents continuously by similarity to a query; a similarity threshold or a number of results can be specified depending on the user and task. SVD thereby overcomes two fundamental problems faced by traditional lexical matching indexing schemes: synonymy (variability in word choice: different words can be used to express the same concept, so query terms may not match the document descriptions) and polysemy (words can have multiple meanings, so a user's query may match irrelevant meanings and thus retrieve irrelevant documents) (Deerwester et al., 1990).
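As an illustration of ordering documents by similarity to a query in the reduced space, the sketch below projects the query's term vector onto the retained dimensions and compares it against the document coordinates (cf. the treatment of queries as pseudo-documents in Deerwester et al., 1990). It is a Python/NumPy sketch under assumed inputs, not code from this package; W_n, s_n, and Pt_n are taken from a term-by-document decomposition such as the one sketched in the Description, and the names are hypothetical.

    import numpy as np

    def rank_documents(q, W_n, s_n, Pt_n):
        """Order documents by similarity to a query in the reduced space.

        q    : term-frequency vector of the query (length = number of terms)
        W_n  : terms x n matrix of left singular vectors
        s_n  : the n largest singular values
        Pt_n : n x documents matrix of right singular vectors (P_n')
        """
        q_hat = q @ W_n                    # project the query onto the n retained dimensions
        doc_coords = Pt_n.T * s_n          # document coordinates: rows of P_n S_n

        # Dot products doc_coords @ q_hat approximate the query-document products of the
        # original term space (X'q ~ P_n S_n W_n' q); cosine normalization makes the
        # scores comparable across documents and queries.
        sims = (doc_coords @ q_hat) / (
            np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_hat) + 1e-12)

        order = np.argsort(-sims)          # continuous ordering by similarity
        return order, sims[order]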
The SVDPACK by M. Berry (1993; berry@cs.utk.edu) comprises four numerical (iterative) methods for computing the singular value decomposition of large sparse matrices using double precision. (Well, this was true some years ago ... it is much faster now!) The package allows one to decompose matrices on the order of 50,000 by 50,000 (e.g., 50,000 words and 50,000 contexts) into representations in 300 dimensions with about two to four hours of computation on a high-end Unix workstation with around 100 Megabytes of RAM. The computational complexity is O(3Dz), where z is the number of non-zero elements in the word (W) x context (C) matrix and D is the number of dimensions returned. The maximum matrix size one can compute is usually limited by the memory (RAM) requirement, which for the fastest of the methods in the Berry package is (10+D+q)N + (4+q)q, where N = W+C and q = min(N, 600), plus space for the W x C matrix.
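For orientation, the short sketch below simply evaluates the memory formula quoted above for the 50,000 x 50,000, 300-dimension example; the result is a count of floating-point values, and the actual byte footprint depends on the precision used and on the storage of the sparse matrix itself.

    # Evaluate the memory-requirement formula (10 + D + q)N + (4 + q)q,
    # with N = W + C and q = min(N, 600); the sizes below are example values.
    W, C, D = 50_000, 50_000, 300
    N = W + C
    q = min(N, 600)
    values = (10 + D + q) * N + (4 + q) * q
    print(f"{values:,} floating-point values, plus storage for the sparse W x C matrix")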
From the SVDPACK, the sparse SVD via the single-vector Lanczos algorithm will be used because of its comparatively low memory and processing requirements and its low to moderate accuracy. The document-by-term input matrix has to be represented in the compressed Harwell-Boeing storage format (Berry et al., 1993, Appendix B). The package and all Java files can be found on ella in '/home/www/ella/htdocs/classes/L697/code/lsa'. There are four separate steps involved in determining and visualizing the semantic relationships between documents:

1. Generate the compressed Harwell-Boeing storage format:
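The Harwell-Boeing format is, at its core, a compressed column storage (CCS) of the sparse matrix preceded by a fixed-width header; the exact header layout is specified in Berry et al. (1993), Appendix B. As a rough illustration only, the Python sketch below builds the three CCS arrays (column pointers, row indices, non-zero values) for a small made-up matrix and does not attempt to write the Harwell-Boeing header itself.

    import numpy as np

    # Toy document-by-term matrix (values invented for illustration).
    X = np.array([[2., 0., 1.],
                  [0., 1., 0.],
                  [1., 0., 3.]])

    # Compressed column storage: for each column, record the row index and value of
    # every non-zero entry; colptr[j] marks where column j starts (1-based indices,
    # following the Fortran-oriented Harwell-Boeing convention).
    colptr, rowind, values = [1], [], []
    for j in range(X.shape[1]):
        for i in range(X.shape[0]):
            if X[i, j] != 0.0:
                rowind.append(i + 1)
                values.append(X[i, j])
        colptr.append(len(values) + 1)

    print("colptr:", colptr)   # [1, 3, 4, 6]
    print("rowind:", rowind)   # [1, 3, 2, 1, 3]
    print("values:", values)   # [2.0, 1.0, 1.0, 1.0, 3.0]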
The las2 program reads and writes the following files:

    Program    Input files      Output files
    -------    -----------      ------------
    las2.c     lap2, matrix     lao2, lav2, lav2-data
In the parameter file lap2, <name> is the name of the data set; lanmax is an integer specifying the maximum number of Lanczos steps allowed; maxprs indicates the maximum number of singular triplets of A (eigenpairs of the equivalent matrix B) desired; endl and endr are the two end-points of an interval within which all unwanted eigenvalues of the particular matrix B lie; vectors contains the string TRUE or FALSE to indicate whether singular triplets are needed (TRUE) or only singular values (FALSE); kappa contains the relative accuracy of Ritz values acceptable as eigenvalues of the matrix B.
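For orientation, a parameter file along these lines might look as follows; the data set name and all numerical values are purely illustrative, and the exact layout expected by las2 should be checked against Berry et al. (1993):

    slis-courses 600 300 -1.0e-30 1.0e-30 TRUE 1.0e-6

(The fields are, in order: name, lanmax, maxprs, endl, endr, vectors, kappa.)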
4. Display the semantic document relationships using the Spring Embedding Algorithm or Pathfinder Network Scaling. A similarity matrix such as slis-courses.sim and the corresponding data file slis-courses.dat can be merged and converted into the corresponding XML file using SimConverter.java in '/home/www/ella/htdocs/classes/L697/code/toolkit/demo/'. Simply compile and run SimConverter and enter the Matrix File ('*.sim') first and the Source Data File ('*.dat') second. The result is named list.xml.
The original LSA code (SVDPACK) was provided by Michael Berry (berry@cs.utk.edu). Katy Börner implemented the parsers as well as the code to compute the document-by-document and term-by-term matrices. Sriram Raghuraman integrated the code into the XML toolkit.