Underneath the layer of factual information and numerical data is a deeper, more personal Internet. The online world comprises vast data and information resources, and our search engines are adept at crawling through them and finding answers. But what if a user needs more than just a straightforward answer? What if the user needs insights from others’ personal experiences, opinions, or abstract ideas and philosophies?
Microsoft’s Senior Applied Researcher Manish Gupta recently partnered with Ankan Mullick, Prof. Pawan Goyal, and Prof. Niloy Ganguly from IIT Kharagpur to deploy artificial intelligence and machine learning to help us get more meaningful answers for social queries from the Internet. Here’s a closer look at the findings which were recently published in a white paper.
Seeking the intangible
Today, search engines crawl the Internet and extract answers based on a rigorous analysis of keywords and phrases more quickly and effectively than ever before. Questions such as “What’s the median income in London?”, “How many hours would it take to walk across the Great Wall of China?”, and “In which action movies did Tom Hanks star?” can all get an instantaneous answer.
Search engine algorithms are great at working with fact-based queries and providing structured answers. However, search engines are surprisingly ineffective at answering subjective and personal questions.
Queries based on human experiences and personal opinions are difficult for a standard search engine to comprehend. This means users cannot rely on the algorithm to provide meaningful and helpful answers for questions such as “How to make small talk with new friends,” “People’s favorite memories from school,” “How does it feel to immigrate to a new country?” or “The songs that defined the ’80s.”
While traditional search engines may struggle with such deeply human queries, there are online platforms specifically tailored for personal opinions and conversations: social media. Twitter, specifically, has become a forum for people to create sustained online conversations held together by a common hashtag. Hashtags, coupled with the 140-character limit per post, streamline the conversation and center it on a single theme. These themes tend to be deeply personal and human. With hashtags and social conversations, Twitter provides information that complements what can be accessed through traditional search engines. This is precisely why Microsoft researchers picked the platform for this study.
About the study
The purpose of the study was to extract meaningful information from social conversations to answer social list queries. To this end, our researchers collected around 4 million hashtags that were trending between January 2015 and June 2015. Out of these, around 67K multi-word hashtags referring to a conversational and personal theme were extracted using an SVM (Support Vector Machine) classifier. We call such hashtags “idioms”. Since social list names can be expressed using multiple words, social list hashtags are a subset of idioms. Hence, the first step was to train a classifier to distinguish idioms that are social list hashtags from those that are not. Once social list hashtags are identified, related tweets can be used to extract list items. List items for such social lists can be ranked using various factors like popularity and recency.
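To make the ranking idea concrete, here is a minimal sketch of how list items might be ordered by popularity and recency. The article does not give the actual scoring formula, so everything here is an assumption for illustration: the `ListItem` fields, the use of mention counts as a popularity proxy, and the exponential decay with a 30-day half-life.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ListItem:
    text: str            # candidate list item extracted from tweets
    mentions: int        # hypothetical popularity proxy: tweets mentioning the item
    last_seen: datetime  # hypothetical recency proxy: time of the latest mention


def score(item, now, half_life_days=30.0):
    """Popularity discounted by age (assumed formula, not the paper's)."""
    age_days = (now - item.last_seen).total_seconds() / 86400.0
    # The weight halves every `half_life_days` days since the last mention.
    return item.mentions * 0.5 ** (age_days / half_life_days)


def rank_items(items, now, half_life_days=30.0):
    """Return the items sorted from highest to lowest score."""
    return sorted(items, key=lambda it: score(it, now, half_life_days), reverse=True)


# Toy data: an older but popular item versus a fresh, less popular one.
now = datetime(2015, 6, 30, tzinfo=timezone.utc)
items = [
    ListItem("item A", mentions=100, last_seen=now - timedelta(days=60)),
    ListItem("item B", mentions=40, last_seen=now),
    ListItem("item C", mentions=10, last_seen=now - timedelta(days=30)),
]
for item in rank_items(items, now):
    print(item.text, round(score(item, now), 1))
```

A shorter half-life favors fresh items, while a longer one lets long-running favorites stay near the top. Which balance suits a given social list is a design choice the article leaves open.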
Datasets were created based on the length and popularity of the hashtags used as well as specific details extracted from the Twitter profiles of the users who tweeted. The intention was to manually annotate some of the idioms as social list hashtags versus those that are not, and use that data for machine learning. Once the model has been trained, it should be able to extract social list hashtags from a large pool of idioms with significant precision.
The raw dataset included nearly 200 million tweets and close to 85 million URLs. The dataset was pre-processed to segment hashtags and detect parts of speech. Social list hashtag detection from a set of idioms focused on three types of features: linguistic (use of numbers, hashtag length, and vocabulary ratios), search (coverage in the top 10 or 20 search results on a popular search engine), and Twitter (duration of popularity on the platform and distribution of co-occurring hashtags). The system was evaluated using 10-fold cross-validation on metrics such as precision, recall, and overall accuracy.
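As a rough illustration of the pre-processing and the linguistic features, the sketch below segments a hashtag into words and derives two of the features named above: use of numbers and hashtag length. It is not the study's pipeline. The tiny vocabulary, the fewest-words segmentation rule, and the feature names are all assumptions, and vocabulary ratios are left out because the article does not define them.

```python
def segment(hashtag, vocab):
    """Split a hashtag into the fewest vocabulary words, or return None.

    Uses dynamic programming over prefixes; `vocab` is a hypothetical
    set of known lowercase words.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)  # best[i] holds the fewest-word split of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]


def linguistic_features(hashtag, vocab):
    """Compute a few illustrative linguistic features for one hashtag."""
    text = hashtag.lstrip("#")
    words = segment(hashtag, vocab)
    return {
        "has_number": any(ch.isdigit() for ch in text),  # use of numbers
        "char_length": len(text),                        # length in characters
        "word_count": len(words) if words else 0,        # 0 if not segmentable
    }


# Hypothetical vocabulary, just large enough for the examples.
vocab = {"for", "ever", "forever", "alone", "childhood", "feels"}
print(segment("#foreveralone", vocab))
print(linguistic_features("#childhoodfeels", vocab))
```

In practice a segmenter would need a far larger vocabulary and some handling for unknown words, and the resulting feature dictionaries would be converted to numeric vectors before being passed to a classifier such as an SVM.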
Altogether, the high-recall classifier was able to work through the dataset and uncover around 67,000 idioms. These included deeply personal and human hashtags such as #foreveralone, #awkwardcompanynames, #childhoodfeels, and #africanproblems. These idioms were further condensed into social lists based on particular personal themes.
Factors such as the duration of hashtag popularity, related hashtags, and URLs were used to detect context and classify the social lists accurately.
The recall-optimized social list hashtag detection system demonstrated 75% precision and 95.3% recall. As expected, Twitter proved to be a treasure trove of valuable social information and opinions. The paper conclusively demonstrated that relevant social information and opinions can be classified as social lists by a high-recall classifier system. This algorithm forms the basis for a better search engine for social platforms.
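For readers less familiar with the two metrics, the short sketch below shows how precision and recall are computed from true and predicted labels. The labels are made-up toy data, not the study's annotations, and the function name is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # social lists correctly detected
    fp = sum(1 for t, p in pairs if not t and p)  # other idioms wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)  # social lists that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels: 1 means "social list hashtag", 0 means "other idiom".
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0]
print(precision_recall(y_true, y_pred))
```

In this toy case every true social list is found, so recall is perfect, but one of the four flagged hashtags is wrong, so precision is 75%. A recall-optimized system accepts that kind of trade-off: it prefers to miss very few social lists, at the cost of letting some false positives through.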
To sum up
Social media has helped augment the Internet with a layer of deeply social information, experiences, emotions and opinions. This layer of data can help inform users looking for subjective information and trusted opinions.
Traditional search engines struggle with subjective and opinionated information. The structured, keyword-oriented nature of conventional search doesn’t offer valuable insights based on actual experiences and opinions of others. While a search engine can effortlessly tell users the distance between Sydney and Auckland, it can’t help users benefit from others’ experience of learning a new language or getting married to a childhood friend.
Researchers at Microsoft worked with the IIT Kharagpur team to develop a system that can scour social networks and detect valuable insights from public conversations. Highly effective and precise, this system could form the basis for a deeper, more meaningful search engine.