Observations from the Text Analytics Summit 2009

One of the hard parts about organizing a conference like the 5th annual Text Analytics Summit, held last week in Boston, must be selecting the industry case studies. Text analytics is a highly specialized but broad-reaching topic with applications in life sciences, financial services, legal, retail, government, media, and entertainment, to name a few. Any one of these industries could have filled the conference with interesting examples.

As it was, most of the case studies and vendor briefings at this conference were about Voice of the Customer or Market Intelligence. I suspect that some attendees might have preferred a little more variety in the cases presented. The absence of any government case studies, for example, was conspicuous, but understandable given the special nature of that domain. We’d all probably have needed security clearances to attend those sessions anyway. Overall, I appreciated the more commercial/consumer focus and felt that the conference organizers did a great job of finding representative examples and balancing the practical (vendor briefings and case studies) with the theoretical.

As a first-time attendee at the conference, I was interested in just getting the lay of the land in text analytics, but I was also interested to learn how people were answering the “what’s next” question. It came up several times over the two days during Q&A and panel sessions, and there were different takes, but I paid close attention to three in particular that resonated with my own observations through the lens of enterprise search.

Trend 1: ETL-like Tools

Ok, this is not really a trend in text analytics, but it is one in enterprise search that is informed by text and data analytics.

Many of the vendors at the conference demonstrated graphical tools designed to simplify the process of building text analysis “pipelines”. These tools look very much like the Extract, Transform, and Load (ETL) tools that have been around for many years in the data integration world. The difference is that the text analysis versions of these tools focus on operations for handling unstructured text. For example, named entity recognition is a common text analytics task: automatically recognizing and tagging things like person names, company names, and locations in text.
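To make that concrete, here is a toy sketch (in Python) of what a single entity-recognition stage in such a pipeline might do. The gazetteer lists are made up for illustration; real recognizers rely on trained statistical models or far larger dictionaries rather than simple lookups.

```python
import re

# Toy gazetteers standing in for trained entity models (illustrative only).
PERSONS = ["Sue Feldman", "Hadley Reynolds"]
COMPANIES = ["Acme", "IDC", "Facebook"]

def tag_entities(text):
    """Return (entity, type) pairs found in the text."""
    found = []
    for name in PERSONS:
        if re.search(re.escape(name), text):
            found.append((name, "PERSON"))
    for name in COMPANIES:
        if re.search(re.escape(name), text):
            found.append((name, "COMPANY"))
    return found

print(tag_entities("Sue Feldman of IDC presented the market report."))
# [('Sue Feldman', 'PERSON'), ('IDC', 'COMPANY')]
```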

This ETL “pattern” exists in enterprise search, as well, where information must be extracted from a source repository (e.g. an email archive), transformed into an enhanced, canonical representation (e.g. annotated XML), and loaded into a database or index for searching. The demand for graphical tools to manage the ETL process for search has not been as high as for text or data analysis. I think this is partly because, for search applications, it is usually a one-time setup process and not an iterative modeling exercise as it is with text analytics. It may also be because, historically, the operations performed on content before it is indexed for search have not been as sophisticated as the operations performed for in-depth text analytics.
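As a rough illustration of that pattern, here is a minimal Python sketch of the three stages feeding a search index. The email record, the XML envelope, and the toy inverted index are all invented for this example; a real system would use a connector framework, a proper enrichment pipeline, and a search engine’s indexing API.

```python
from xml.etree import ElementTree as ET

def extract(record):
    """Extract: pull raw text out of a source record (a made-up email here)."""
    return record["subject"] + " " + record["body"]

def transform(text):
    """Transform: wrap the text in a small, canonical XML envelope.
    A real pipeline would also run enrichment (entity extraction, etc.) here."""
    doc = ET.Element("document")
    ET.SubElement(doc, "text").text = text
    return ET.tostring(doc, encoding="unicode")

def load(doc_id, xml_doc, index):
    """Load: parse the XML, tokenize its text, and add it to a toy inverted index."""
    text = ET.fromstring(xml_doc).findtext("text")
    for token in text.lower().split():
        index.setdefault(token, set()).add(doc_id)

index = {}
email = {"subject": "Quarterly results", "body": "Revenue was up this quarter."}
load("msg-001", transform(extract(email)), index)
print(index["revenue"])  # {'msg-001'}
```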

This is changing. To start, extensible pipeline processing frameworks that incorporate advanced text analysis capabilities have become more common in enterprise search products. By now, most mainstream enterprise search platforms include entity extractors, for example. We are also seeing more ETL-like graphical consoles for managing content integration and analysis.

The adoption of these tools and techniques for enterprise search is motivated, in part, by a desire to more easily harness text analytics features that increase search precision and create richer search experiences. It’s also the case that, while text analytics shares a heritage more with information retrieval (search) than with business intelligence (BI), it includes technologies relevant to both and sits smack in the middle of the convergence between these two spaces. Sue Feldman and Hadley Reynolds of IDC reinforced this role of text analytics by describing it as a cornerstone of Unified Information Access during their Market Report at the conference. Given this, it shouldn’t be surprising to see that, as text analytic tools and concepts have found their way into BI applications, traditional BI tools and concepts, like ETL, are finding a place within enterprise search.

Trend 2: Empowering the End User

Another topic that popped up at various times during the conference was the challenge of delivering the richness of text analysis tools to users other than specially trained analysts. As with traditional BI tools, many text analysis tools assume a trained user or “analyst” capable of designing sophisticated workflows or advanced analytical models. One question posed to a speaker after he finished describing his text mining process was “when do you think you’ll be out of your job?” - meaning, when will the tools be so easy to use that your end users won’t need you to do their investigation for them?

I’m sure this exact question was asked at a conference of professional research librarians some 15-20 years ago - back when online search services and later Internet search engines were becoming easier and easier to use and obviating the need for “professional searchers”. The answer is likely the same, too. There will always be specialists and “power users”, but as the tools become easier to use, end users will become more empowered to do their own increasingly advanced analysis.

In practice, we are seeing more applications that combine conventional search with advanced text analytics in ways that bring a more powerful search experience to relatively unsophisticated end users. Silobreaker.com is a clever site that embeds the richness of text analytics within what is fundamentally a news search application. Unlike other news search sites, Silobreaker offers options and tools that help uncover interesting and potentially novel connections and patterns in the news. There are still some usability challenges with a consumer site like Silobreaker, but I like it as an example of ad hoc search converging with iterative knowledge discovery.

The trend toward empowering users with more than just a search box and a list of blue links also reaches into less “analytical” consumer applications. Two examples are www.oodle.com and www.globrix.com. Both sites show the power of applying analytics to both structured and textual data (classifieds in the case of Oodle, real estate postings in the case of Globrix) in what are otherwise fundamentally search applications.

Trend 3: Taking Sentiment Analysis to the Next Level

Sentiment analysis is the ability to recognize the mood, opinion, or intent of a writer by analyzing written text. It is sometimes called the “thumbs up, thumbs down” problem because the most common application is establishing whether a writer is positive or negative on a particular subject. In this form, it is often used to analyze written product reviews (see this example on Microsoft’s new Bing Web search).
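In its simplest “thumbs up, thumbs down” form, sentiment classification can be sketched with nothing more than opinion word lists. The lexicons below are invented for illustration; production systems use much larger lexicons or trained classifiers.

```python
# Tiny, made-up opinion lexicons; real systems use large lexicons or trained models.
POSITIVE = {"good", "great", "excellent", "love", "easy"}
NEGATIVE = {"bad", "poor", "awkward", "hate", "broken"}

def classify(review):
    """Return 'thumbs up', 'thumbs down', or 'neutral' for a piece of text."""
    words = [w.strip(".,!?") for w in review.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "thumbs up"
    if score < 0:
        return "thumbs down"
    return "neutral"

print(classify("Great camera, easy to use."))                # thumbs up
print(classify("Poor battery life and awkward controls."))   # thumbs down
```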

Sentiment was a much-discussed topic at the conference. This is not surprising given the focus on Voice of the Customer and Market Intelligence – two areas where accurately establishing the sentiment of customers and consumers toward products, services, and brands is highly desirable. One of the presenters at the conference was Roddy Lindsay from Facebook. I missed that session, but it doesn’t take much imagination to appreciate the possible applications for text analytics, and sentiment analysis in particular, given the information available on Facebook and other social networking platforms.

Every vendor present had something to show or say on the subject of sentiment analysis, but all the panelists in the vendor-only panel acknowledged the difficulties of increasing the precision of sentiment classification. Currently, the number tossed around is 80%. That is, a sentiment classifier will get it right about 80% of the time compared to human judgments. This number is higher in some applications - for example, when analyzing short, strongly opinionated product reviews. It is lower when analyzing longer pieces of text where just pinning down the subject can be difficult – like this blog post.
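For what it’s worth, that 80% figure is just classifier agreement with human-labeled judgments, which is simple to compute once you have the labels. A minimal sketch, with invented labels:

```python
def agreement(system_labels, human_labels):
    """Fraction of items where the classifier matches the human judgment."""
    matches = sum(s == h for s, h in zip(system_labels, human_labels))
    return matches / len(human_labels)

# Invented example: the classifier disagrees with the human on one of five reviews.
system = ["pos", "neg", "pos", "neg", "pos"]
human  = ["pos", "neg", "neg", "neg", "pos"]
print(agreement(system, human))  # 0.8
```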

Progress is being made, though. The first step has been a shift away from “document-level” sentiment to “topic-level” sentiment. This allows sentiment classification to be more accurate when confronting documents, like this post, that touch on and offer opinion on multiple topics. It also helps with more concrete problems like the ones represented in this sentence:

“Acme’s new P40 digital camera has a good viewer, but its controls are awkward.”

While it’s relatively easy for a human, it takes some heavy linguistic lifting for a machine to recognize that the sentiment of this opinion is directed not just at Acme or at the P40 digital camera, but specifically at the viewer (positive sentiment) and the controls (negative sentiment). It’s even trickier establishing what the word “its” refers to in the second part of the sentence. Is it the Acme P40 itself, or just the viewer?
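A crude way to see what topic-level sentiment buys you is a proximity heuristic: split the sentence into clauses and attach opinion words to the nearest product feature mentioned in the same clause. The word lists below are made up, and the approach deliberately dodges the pronoun-resolution problem; real systems rely on parsing and coreference resolution to do this properly.

```python
import re

# Made-up feature and opinion lexicons, for illustration only.
FEATURES = {"viewer", "controls", "battery"}
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"awkward", "poor", "bad"}

def topic_sentiment(sentence):
    """Attach polarity to the product feature mentioned in each clause."""
    results = {}
    for clause in re.split(r",|\bbut\b", sentence.lower()):
        words = [w.strip(".,!?'") for w in clause.split()]
        feature = next((w for w in words if w in FEATURES), None)
        if feature is None:
            continue
        if any(w in POSITIVE for w in words):
            results[feature] = "positive"
        elif any(w in NEGATIVE for w in words):
            results[feature] = "negative"
    return results

print(topic_sentiment(
    "Acme's new P40 digital camera has a good viewer, but its controls are awkward."))
# {'viewer': 'positive', 'controls': 'negative'}
```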

Sentiment is admittedly a niche topic, even within text analytics, but getting it right matters a lot for enterprise search applications in eCommerce (think product reviews), Market Intelligence (reputation tracking and competitive intelligence), eDiscovery, and Government Intelligence. One presenter suggested that all the remaining hard problems in sentiment analysis will be solved, at least academically, within a couple of years. It will be interesting to see how soon these improvements surface in real-life applications.

Nate