How do you find similar or related records or documents within a collection of unstructured data or a text corpus? One classical solution to this problem is Latent Semantic Indexing (LSI). In this blog post we’ll walk you through the steps of applying the LSI technique. We will use a subset of public SEC 10-K filings by companies as the text corpus to analyze. We will use LSI to capture similarities between these companies based on the information in their 10-K filings.
LSI is a method grounded in linear algebra and Singular Value Decomposition (SVD). The “Cloud Numerics” library offers an implementation of SVD that works on distributed arrays. This distributed array support enables the LSI solution to be scaled out over a cluster of Windows Azure nodes.
This example focuses on the data computation and the examination of the correlated results. It does not discuss the one-time preparation to clean, tokenize, and stem each document in order to assemble the word counts. These pre-processing steps are well suited to a map/reduce framework, such as Hadoop. The preprocessed dictionaries of terms and term counts are available by way of the following Windows Azure storage containers:
The C# LSI example constructs and operates on a term-document matrix --an array where columns correspond to documents, rows correspond to individual terms, and the elements correspond to logarithmically scaled frequencies of terms in each document.
Running the Example
The source code and project file associated with this example can be downloaded from the “Cloud Numerics” download page on the Microsoft Connect site. If you have not already registered to access the “Cloud Numerics” lab download page on Microsoft Connect, register now. The source code example comes in the form of a Microsoft Visual Studio solution.
To run the example:
- Build and run the “AppConfigure” project that comes with the “Cloud Numerics” project to set up a Windows Azure deployment (the example works with the deployment of at least two nodes)
- Deploy the LSI application to Azure.
|You will need a SQL Azure storage account for writing application output (see Step 6). The program will create a publicly readable blob under this storage account for output.|
Walking Through the Steps Performed for the LSI
The example LSI solution performs the following steps.
Step 1: Read List of Documents
We begin the application by reading in the list of financial documents, stored in Windows Azure Blob storage. Because each document is stored as a blob within a single container, we can simply list the contents of that container.
Step 2: Read Global Dictionary
Next, we read in the global dictionary from a blob. The global dictionary contains unique terms for each document. This dictionary is relatively small --tens of thousands of key-value pairs. Therefore, no distributed IO is required.
Step 3: Read Local Term Counts and Form Term-Document Matrix
Once we have loaded the list of documents and the global dictionary, we are ready to form the distributed term-document matrix. The key-value pairs in the global dictionary define the mapping from the individual terms in each document to the rows in the global term-document matrix. To form the term-document matrix, we build a reader class that implements Microsoft.Numerics.Distributed.IO.IParallelReader interface. It consists of the following members:
- ComputeAssignment, assigns close to an equal number of documents to each rank for the distributed computation.
- ReadWorker reads term counts for each document and places them in the corresponding row of the term-document matrix. During computation, ReadWorkers operate in parallel to build the large matrix.
- DistributedDimension property defines the distribution of resulting term-document arrays. The output is distributed column-wise; so that each rank holds data for a subset of documents.
The following operations applies the reader and produces a distributed array termDoc that can then be passed into the LSI algorithm.
Step 4: Read Mapping from Term-Document Matrix Columns to Actual Document Names
To understand the results of the computation, we need to be able to map the columns of the term-document matrix back to actual documents. However, this is not a simple 1-to-1, 2-to-2, or n-to-n mapping; it depends on how the documents were preprocessed on different nodes. The correct mapping is read in using TermDocReaderAzure2. It produces an array that holds the index from columns to entries in the parsedNames array. This array is relatively small and is then copied into a local array.
Step 5: Apply SVD-Based Algorithm
To extract the most relevant concepts from the document set we first compute an SVD for the term-document matrix and then apply dimensionality reduction by zeroing all elements of the S matrix, except for those elements that correspond to singular value components between svdMin and svdMax. The result is a matrix SReduced * V that is used to build a correlation matrix between documents.
Next we build the correlation matrix. Note that the operations are performed in the Windows Azure cluster on distributed data using overloaded methods of ThinSvd and other functions.
Step 6: Analyze Correlation Matrix to Find Interesting Documents
The correlation matrix, representing correlations between documents, is much smaller than the original term-document data, and can be copied into a local array for further analysis. We select a row that corresponds to a document of interest, and set a correlation threshold. Then, we select documents that have a large correlation --above the threshold-- with the interesting document, and produce a smaller matrix that only holds the correlation coefficients between these documents. To map the correlation coefficients back to actual documents, we add document names as row and column headers.
This document matrix is saved as a string that can be written into a .csv file.
Step 7: Write Results to Azure Storage
We write the interesting part of correlation matrix --that is, the filtered correlations between companies found in the previous step-- into blob storage as a text file using the Windows Azure Storage Client APIs.
Replace the following:
with your own Azure Storage account key and name. From your storage account the result file can be fetched to your local workstation, for example, by using a web browser to open and save https://myAccountName.blob.core.windows.net/lsiresult/lsiresult as a .csv file, and then use Microsoft Excel to view the .csv file.