Searchable Sentiment Analysis Archive


TL;DR

You can find a searchable archive of my previously described sentiment analysis at https://inaugural.azurewebsites.net/. The website will not remain active indefinitely; I will take it down once the novelty has worn off. Let me know what you think. The full source code is available on GitHub.

The application lets you do free text search of all the inaugural addresses:

The results can be sorted by relevance, date, or sentiment. You can also view a specific inaugural address, and the paragraphs of the speech will be color coded according to sentiment:

The actual sentiment score of each paragraph is listed to the left.
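The exact colors and thresholds used on the site are not spelled out here, but the idea can be sketched as a simple mapping from score to color. A minimal illustration (the thresholds and color names below are my own guesses, not necessarily what the application uses):

```python
def sentiment_color(score: float) -> str:
    """Map a 0..1 sentiment score to a rough traffic-light color.

    The cutoffs are illustrative; the actual site may bucket scores
    differently or interpolate colors continuously.
    """
    if score < 0.4:
        return "red"      # negative
    if score < 0.6:
        return "yellow"   # roughly neutral
    return "green"        # positive

# Color each paragraph of a speech, given its sentiment scores.
scores = [0.5, 0.9999994039535522, 1.0]
colors = [sentiment_color(s) for s in scores]
```

Each paragraph's score would then drive the background color of that paragraph in the rendered page.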

Background

In a previous blog post, I described how to use Azure Cognitive Services text analytics to do a sentiment analysis of all US Presidential Inaugural Addresses. The analysis yielded a lot more data than I was able to include in that description, so I have made a searchable archive of the results. This blog post describes where to find this archive, where to find the raw data, and some of the implementation details. The application is intended to provide a way to access the data, but also to illustrate some key services in Azure that make it relatively easy to put such an application together. The application makes use of Azure Cognitive Services text analytics (for the sentiment analysis), Azure Cosmos DB (to store the documents), and Azure Search (to index and search them).

Data and Text Analytics

The details of how I obtained the original inaugural addresses were described in the previous blog post. I did modify the source code for that project slightly: it now has two different executables, one that extracts each of the inaugural addresses and writes them to json files, and one that reads these json files and performs the analysis. I did this separation to provide an opportunity to go through each address and make sure it had been extracted properly. A few stray symbols caused some paragraphs to split incorrectly in a few addresses; I edited those manually for this project. This changed the overall score for a few of the speeches slightly, so if you compare results with the previous post, you will find some discrepancies. The net result is that we now have a json file for each address that looks something like this:

{
    "SourceURI": "http://www.presidency.ucsb.edu/ws/index.php?pid=25800",
    "Speaker": "George Washington",
    "Category": null,
    "Date": "1789-04-30T00:00:00",
    "Paragraphs": [
        "Fellow-Citizens of the Senate and of the House of Representatives:",
        "Among the vicissitudes ...",
        "Such being the impressions ...",
        "By the article establishing ..."
    ]
}

We then pass that through the text analytics API and end up with a json file for each presidential address that looks like this:

{
    "Spiel": {
        "SourceURI": "http://www.presidency.ucsb.edu/ws/index.php?pid=25800",
        "Speaker": "George Washington",
        "Category": null,
        "Date": "1789-04-30T00:00:00",
        "Paragraphs": [
            "Fellow-Citizens of the Senate and of the House of Representatives:",
            "Among the vicissitudes ...",
            "Such being the impressions ...",
            "By the article establishing ..."
        ]
    },
    "SpielAnalytics": {
        "SummaryAnalytics": {
            "Words": 1427,
            "Characters": 8603,
            "Sentiment": 0.99597227813115874,
            "KeyPhrases": [
                "Senate",
                "Fellow-Citizens",
                "House of Representatives"
            ]
        },
        "ParaGraphAnalytics": [
            {
                "Words": 10,
                "Characters": 66,
                "Sentiment": 0.5,
                "KeyPhrases": [
                    "Senate",
                    "Fellow-Citizens",
                    "House of Representatives"
                ]
            },
            {
                "Words": 313,
                "Characters": 1819,
                "Sentiment": 0.99999940395355225,
                "KeyPhrases": [
                    "country",
                    "retreat",
                    "day"
                ]
            },
            {
                "Words": 306,
                "Characters": 1863,
                "Sentiment": 1.0,
                "KeyPhrases": [
                    "united government",
                    "United States",
                    "free government"
                ]
            }
        ]
    }
}

As you can see, I have also modified the analysis to include key phrases (which we don't use in this application, but maybe they can be useful in a future blog post). I have provided the raw address json files (before sentiment analysis) and also the documents including the text analysis. The latter package of json files is what was uploaded to the DocumentDB/CosmosDB for this example application. In this example, we are using Presidential inaugural addresses, but obviously these formats are generic and could be used to analyze any collection of public speeches or documents. If you try some other data, please let me know about it.
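To give a feel for working with these files, here is a small sketch of loading one of the analyzed json documents and pulling out the speaker, the overall sentiment, and the per-paragraph sentiment scores. The document below is a trimmed-down stand-in for one of the real files:

```python
import json

# A trimmed version of the analyzed document structure shown above.
doc_json = """
{
  "Spiel": {
    "Speaker": "George Washington",
    "Date": "1789-04-30T00:00:00",
    "Paragraphs": ["Fellow-Citizens of the Senate ...", "Among the vicissitudes ..."]
  },
  "SpielAnalytics": {
    "SummaryAnalytics": {"Sentiment": 0.9959722781311587},
    "ParaGraphAnalytics": [{"Sentiment": 0.5}, {"Sentiment": 0.9999994039535522}]
  }
}
"""

doc = json.loads(doc_json)
speaker = doc["Spiel"]["Speaker"]
overall = doc["SpielAnalytics"]["SummaryAnalytics"]["Sentiment"]
per_paragraph = [p["Sentiment"] for p in doc["SpielAnalytics"]["ParaGraphAnalytics"]]
```

Any collection of documents mapped into this shape could be fed through the same pipeline.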

Document Database and Azure Search

To store these speeches for the website, I simply dumped them into Cosmos DB, Azure's document (NoSQL) database. I set up the database through the Azure Portal, and I used the Azure Document DB Data Migration Tool to import all the analyzed json files into a collection. In the application, I access this database through the DocumentDBRepository class, which provides a simple interface. When the documents are retrieved, they are mapped to the SpielEntry class (defined in Spiel.cs). For an introduction to accessing Document DB from .NET, you can look at this tutorial.

The collection is a great way to store the documents and retrieve one of them for viewing, but it does not lend itself well to easy searching, especially not with a document structure such as the one outlined above. The text is separated into paragraphs (an array), which among other things makes it awkward (not impossible, just messy) to index and search directly in the Document DB. So in this application, I am using Azure Search to search the records. Azure Search is essentially a managed search service built on Elasticsearch, and it is easy to hook up to a SQL database, Document DB, or even a container full of blobs. Since this service may be less familiar, I will walk through how to set it up for the database of speeches that we are using.

First create a new Azure Search service:

I have selected the "Basic" pricing tier here, but for the size of this database, the free tier would actually be just fine. After the search service has been created, we need to import some data:

Select the Document DB data source:

And in the "Query" field, we specify a query that will "flatten" the multi-level data structure and pull out the fields that we would like to search. There are many options, but something like the query below will work:

SELECT c.id, c.Spiel.Speaker AS speaker, c.Spiel.Date AS spieldate,
    c.SpielAnalytics.SummaryAnalytics.KeyPhrases AS keyphrases,
    c.SpielAnalytics.SummaryAnalytics.Sentiment AS sentiment,
    c.Spiel.Paragraphs AS paragraphs,
    c._ts
FROM c WHERE c._ts >= @HighWaterMark ORDER BY c._ts
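To see what this flattening accomplishes, here is the same transformation expressed as a Python sketch against the nested document structure (the field names match the aliases in the SELECT; the sample values are made up):

```python
def flatten_for_search(doc: dict) -> dict:
    """Mimic the Cosmos DB indexer query: pull nested fields up into
    a flat record with the aliases used in the SELECT statement."""
    return {
        "id": doc["id"],
        "speaker": doc["Spiel"]["Speaker"],
        "spieldate": doc["Spiel"]["Date"],
        "keyphrases": doc["SpielAnalytics"]["SummaryAnalytics"]["KeyPhrases"],
        "sentiment": doc["SpielAnalytics"]["SummaryAnalytics"]["Sentiment"],
        "paragraphs": doc["Spiel"]["Paragraphs"],
    }

nested = {
    "id": "washington-1789",  # hypothetical document id
    "Spiel": {
        "Speaker": "George Washington",
        "Date": "1789-04-30T00:00:00",
        "Paragraphs": ["Fellow-Citizens of the Senate ..."],
    },
    "SpielAnalytics": {
        "SummaryAnalytics": {"Sentiment": 0.996, "KeyPhrases": ["Senate"]},
    },
}
flat = flatten_for_search(nested)
```

Each record the indexer sees is flat, so every field can be indexed directly.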

This will create a search table with the fields "id", "speaker", "spieldate", "keyphrases" (array), "sentiment", and "paragraphs" (array of strings). This table is what the Azure Search service will be indexing. The next step is to customize the target index; this is where one selects which fields will be searchable, retrievable, etc.:

 

The last step is to select how frequently to run the indexer and to start the import of the data.

 

Once the data has imported (which should be very fast for this small database), you can use the "Search Explorer" to try various search queries:

The "Search Explorer" is a great tool for trying different search queries and playing with faceting tools, etc. If your application accesses the search index directly through the REST API, it is also a great way to form queries for use in your application. In this application, I am using the .NET API in the C# code.
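If you were to go the REST route instead of the .NET API, a query is just an HTTP GET against the index. A sketch of building such a request URL (the service name, index name, and api-version here are illustrative assumptions, not the application's actual values):

```python
from urllib.parse import urlencode

# Hypothetical service and index names; the api-version shown may not
# match what a given application targets.
service = "myservice"
index = "speeches"
params = urlencode({
    "api-version": "2016-09-01",
    "search": "liberty",
    "$orderby": "sentiment desc",  # sort hits by sentiment, as the site allows
})
url = f"https://{service}.search.windows.net/indexes/{index}/docs?{params}"
# The actual request would also need an "api-key" header with a query key.
```

Queries worked out in the Search Explorer can be translated to this form almost verbatim.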

Conclusions

This example application illustrates that, by combining several Azure services, it is fairly straightforward to implement applications that combine text analytics, document databases, and sophisticated search engines. It takes relatively little code to tie these services together using the .NET libraries.

While this example focuses on inaugural addresses, the application itself would work for any collection of speeches, letters, books, etc. As long as they can be formatted in the json format outlined above, the code should work. Please have a look at the source code, and let me know what you think.

There is a lot more information in the data. As an example, key phrases have been extracted and they would form a nice basis for faceted navigation using Azure Search. If you come up with improvements or more elaborate analysis, please let me know or submit a pull request to have changes included.

 

