Greetings! Today we’re happy to offer a guest post by Ross Smith, Director of Test, Windows Security, at Microsoft, and one of the authors of The Practical Guide to Defect Prevention (Microsoft Press, 2007):
Ross here. This month marks the 50th anniversary of the publication of Harry M. Markowitz’s classic book, Portfolio Selection. In this groundbreaking publication, for which he was awarded a Nobel Prize in 1990, Markowitz talks of the benefits of diversification and the mathematics behind risk-reward strategies.
Modern portfolio theory (MPT), based on Markowitz’s work, suggests that the return of an investment portfolio is maximized for any given level of risk by using asset classes with low correlations to one another. In other words, a diverse set of investments reduces risk and maximizes return. In the idealized case of a portfolio with two perfectly negatively correlated assets, when the value of asset #1 is falling, asset #2 is rising at the same rate. MPT also assumes an efficient market, that is, one in which all known information is reflected in the price of an investment. These factors contribute to an investor’s ability to create an “optimal portfolio” for his level of risk.
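To make the diversification argument concrete, here is a minimal sketch of the standard two-asset calculation. The function name and all of the numbers are hypothetical and chosen only for illustration.

```python
import math


def two_asset_portfolio(w1, mean1, mean2, sigma1, sigma2, rho):
    """Return (expected_return, standard_deviation) for a two-asset portfolio.

    w1 is the fraction invested in asset 1; the remainder goes to asset 2.
    rho is the correlation between the two assets' returns (-1 to 1).
    """
    w2 = 1.0 - w1
    expected_return = w1 * mean1 + w2 * mean2
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2.0 * w1 * w2 * sigma1 * sigma2 * rho
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return expected_return, math.sqrt(max(variance, 0.0))


# Hypothetical numbers: two assets with equal return and risk, split 50/50.
for rho in (1.0, 0.0, -1.0):
    ret, risk = two_asset_portfolio(0.5, 0.08, 0.08, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f}: return {ret:.3f}, risk {risk:.3f}")
```

With these inputs the expected return is the same in all three cases, while the risk shrinks as the correlation falls and effectively vanishes when the two assets move in exactly opposite directions.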
How does this apply to testing software? The effort we put forth in testing (or quality improvement) is our investment. Our return or investment yield is the number of defects discovered. Each of our techniques will yield a return of a certain number or percentage of defects. This is easily seen in the distribution of the “How Found” field of our defect-tracking database. In addition to the return of discovered defects, there is the risk of escaped defects: missed bugs that are found in the field. This is akin to investor loss.
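As a rough sketch of what that distribution looks like in practice, the snippet below tallies the share of defects credited to each technique. The record layout, the `how_found` key, and the sample records are assumptions made for illustration; a real defect-tracking database will have its own schema.

```python
from collections import Counter


def how_found_distribution(defects):
    """Return the fraction of defects credited to each discovery technique.

    defects is an iterable of dicts with a "how_found" key (hypothetical
    schema). An empty input yields an empty distribution.
    """
    counts = Counter(d["how_found"] for d in defects)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {technique: count / total for technique, count in counts.items()}


# Hypothetical export from a defect-tracking database.
sample_defects = [
    {"id": 1, "how_found": "code review"},
    {"id": 2, "how_found": "code review"},
    {"id": 3, "how_found": "ad hoc testing"},
    {"id": 4, "how_found": "customer"},
]

for technique, share in how_found_distribution(sample_defects).items():
    print(f"{technique}: {share:.0%}")
```

In this framing, the share credited to “customer” is the escape rate: the part of the portfolio’s risk that actually materialized.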
The evaluation of our testing strategy based on the MPT principles exposes a set of deficiencies and enables us to improve the return on our testing investment while minimizing the risk of escapes, the same way investors maximize the return on their portfolios while minimizing the risk of loss of principal. The range of optimal portfolios, according to Markowitz, is called the “efficient frontier” and is derived by evaluating each asset’s correlation with every other asset to determine the optimal allocation across all the asset classes. Once the efficient frontier has been determined for the asset classes being evaluated, the decision of which optimal portfolio to choose becomes a question of the level of risk tolerance.
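The idea can be sketched without the full optimization. Assuming we already have a risk and return estimate for each candidate mix, the frontier is simply the set of mixes that no other mix beats on both measures, and risk tolerance then picks a point on it. The mix names and numbers below are hypothetical, and this filter is a simplification of Markowitz’s method, not a replacement for it.

```python
def efficient_frontier(portfolios):
    """Keep only the non-dominated portfolios, ordered by increasing risk.

    portfolios is a list of (name, risk, expected_return) tuples. A portfolio
    is dropped if another one offers at least the same return for no more risk.
    """
    frontier = []
    best_return = float("-inf")
    # Sort by risk; for equal risk, look at the higher return first.
    for name, risk, ret in sorted(portfolios, key=lambda p: (p[1], -p[2])):
        if ret > best_return:
            frontier.append((name, risk, ret))
            best_return = ret
    return frontier


def pick_for_tolerance(frontier, max_risk):
    """Return the highest-return frontier portfolio within the risk limit."""
    eligible = [p for p in frontier if p[1] <= max_risk]
    return max(eligible, key=lambda p: p[2]) if eligible else None


# Hypothetical candidate mixes: (name, risk, expected return).
candidates = [
    ("mix A", 0.10, 0.05),
    ("mix B", 0.20, 0.04),  # dominated by mix A
    ("mix C", 0.20, 0.08),
    ("mix D", 0.30, 0.08),  # dominated by mix C
    ("mix E", 0.40, 0.12),
]

frontier = efficient_frontier(candidates)
choice = pick_for_tolerance(frontier, max_risk=0.25)
```

For a test team, “risk” might be an estimate of escaped defects and “return” the defects found per unit of effort; those mappings are our analogy rather than anything MPT prescribes.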
In other words, once the efficient frontier has been determined for our defect discovery techniques (“how found” in the tracking database), we can use our tolerance for risk (how many bugs found in the field are we willing to accept as a reasonable level of risk) to estimate which test strategies to invest in, and how much/frequently we should invest. A diversified approach minimizes our risk and maximizes our return. When the defect yield of “how found = test case development” starts to wane, it’s time for “how found = customer” or “how found = ad hoc testing.” We are governed by the principle that the second bug is harder (and more costly) to find than the first. Yield curves through a project cycle illustrate this effectively. This is common sense to any seasoned tester, but the numbers give us a formula to predict and dictate the timing of behavior change.
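One simple way to turn “starts to wane” into a number is to compare a technique’s recent defect yield with its own peak. The sketch below is only an illustration: the window size, the 50 percent threshold, and the weekly counts are hypothetical and would need tuning against real project data.

```python
def trailing_means(counts, window):
    """Average defects found over each run of `window` consecutive periods."""
    return [
        sum(counts[i - window:i]) / window
        for i in range(window, len(counts) + 1)
    ]


def yield_is_waning(counts, window=3, fraction=0.5):
    """True when the latest trailing mean drops below `fraction` of the peak.

    counts holds defects found per period (for example, per week) by one
    technique. With fewer than `window` periods there is nothing to judge.
    """
    means = trailing_means(counts, window)
    if not means:
        return False
    return means[-1] < fraction * max(means)


# Hypothetical weekly defect counts for "how found = test case development".
weekly_defects = [2, 8, 9, 7, 3, 2, 1]

if yield_is_waning(weekly_defects):
    print("Yield is waning; consider shifting effort to another technique.")
```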
The most important aspect of the diversified approach is to stay with the portfolio once it has been established, regardless of return. This takes a level of trust that we’re not used to at Microsoft and a belief that our techniques are good investments. Just as an investor might panic when a given investment fails miserably, we tend to over-react when we miss a certain type of bug. Just as a fund manager massages her investments to provide consistency, there are great defect prevention tools and techniques to improve our test strategies.
Game Theory and Human Computation
The relationship here is interesting. The year before winning the Nobel Prize, Harry Markowitz won the John von Neumann Theory Prize. From the Nobel Prize site:
“In 1989, I was awarded the Von Neumann Prize in Operations Research Theory by the Operations Research Society of America and The Institute of Management Sciences. They cited my works in the areas of portfolio theory, sparse matrix techniques and the SIMSCRIPT programming language.” John von Neumann was one of the leading mathematicians of his day and was instrumental in the development of game theory.
John von Neumann’s 1944 book, Theory of Games and Economic Behavior, helped set the stage for the use of math and game theory for Cold War predictions, stock market behavior, and TV advertising. He was the first to expand early mathematical analysis of probability and chance into game theory in the 1920s. His work was used by the military during World War II, and then later by the RAND Corporation to explore nuclear strategy. In the 1950s, John Nash, popularized in the film A Beautiful Mind, was an early contributor to game theory. His “Nash equilibrium” helps to evaluate player strategies in non-cooperative games. Game theory helps us to understand how and why people play games.
So, other than Markowitz winning the von Neumann award in 1989, how does MPT relate to defect prevention? The answer lies, seductively, in the use of crowd-sourcing and human computation: attracting the effort of “the crowd” to assist.
Wikipedia describes “crowdsourcing” as
“a neologism for the act of taking a task traditionally performed by an employee or contractor, and outsourcing it to an undefined, generally large group of people or community in the form of an open call. For example, the public may be invited to develop a new technology, carry out a design task (also known as community-based design and distributed participatory design), refine or carry out the steps of an algorithm (see Human-based computation), or help capture, systematize or analyze large amounts of data (see also citizen science).”
and “human computation” as
“Human-based computation is a computer science technique in which a computational process performs its function by outsourcing certain steps to humans. This approach leverages differences in abilities and alternative costs between humans and computer agents to achieve symbiotic human-computer interaction.”
So, if the problem set for defect detection lies in our ability to balance our portfolio of discovery techniques, how can we involve the “crowd” to balance our portfolio on a grander scale?
The answer lies in the use of “Productivity Games.” Productivity Games, as a sub-category of Serious Games, attract players to perform “real work,” tasks that humans are good at but computers currently are not. Although computers offer tremendous opportunities for automation and calculation, some tasks, such as analyzing images, have proven to be difficult and error-prone and, therefore, using computers can often lower the quality and usefulness of the results. For tasks such as this, human computation can be much more effective. Additionally, by framing the work task in the form of a game, we are able to quickly and effectively communicate the objective and achieve higher engagement from a community of employees as players of the game.
One of the all-time greatest examples of a Productivity Game is the ESP Game, developed by Luis von Ahn of Carnegie Mellon University (also well known for inventing the CAPTCHA), in which players help label images. In the ESP Game, two players work together to match text descriptions of images to earn points. The artifacts of game play are text-based (searchable) descriptions of images, which are not themselves searchable. More at http://www.gwap.com.
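The matching step is easy to picture in code. The following is a minimal sketch of the idea, not the ESP Game’s actual implementation: it assumes each player submits guesses in turn and that the first label both players have typed becomes the image’s description. The function name and sample guesses are hypothetical.

```python
def first_agreed_label(guesses_a, guesses_b):
    """Return the first label that both players have typed, or None.

    Guesses are processed in alternating order (player A, then player B),
    ignoring case and surrounding whitespace.
    """
    seen_a, seen_b = set(), set()
    for i in range(max(len(guesses_a), len(guesses_b))):
        if i < len(guesses_a):
            label = guesses_a[i].strip().lower()
            seen_a.add(label)
            if label in seen_b:
                return label
        if i < len(guesses_b):
            label = guesses_b[i].strip().lower()
            seen_b.add(label)
            if label in seen_a:
                return label
    return None


# Hypothetical guesses from two players looking at the same image.
label = first_agreed_label(
    ["dog", "grass", "puppy"],
    ["animal", "Puppy", "dog"],
)
```

The agreed label is the useful by-product: a human-verified, searchable description that neither player could have copied from the other.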
Following is a series of quotes and examples related to the importance, usefulness, and appeal of games.
As University of Minnesota researcher Brock Dubbels suggests, “Games provide the opportunity to experience something grand—flight simulators do not have the excitement that games do—games exaggerate and elevate action beyond normal experience to make them motivating and exciting. In World War 2, the likelihood of being in a dogfight was slim, but in the game ‘1942,’ you can find one around every corner. Games raise our level of expectation to the fantastic and our biochemical reward system pays out when we build expectation towards reward. Sometimes the reward leading up to the payout is greater than the reward at payout! A game structures interaction in ways that may not be available by default for special circumstances and projects. A game can also create bonds that hold people together through creating opportunities for relationships that one might not experience every day.”
Brook Mitchell, CEO of Snowfly, a company that makes game software for performance rewards, describes the manual labor used in a slot machine: “Pulling a lever on a slot machine is a very routine and repetitive task. If playing a slot machine were a job, it would be difficult to staff it at almost any reasonable wage. Yet these people were paying money to do it.”
Paul Herr, in the book Primal Management, concurs: “The neurobiologic revolution has, in turn, sparked a revolution in economics. Economists, working in close cooperation with neurobiologists, have designed brain-imaging experiments based upon game theory to explore the brain’s decision-making apparatus. These experiments indicate that all forms of reward, monetary or otherwise, depend upon feelings. When players in an economic game plan their monetary strategy, the dopamine reward system in the basal striatum—the same brain area that processes food, sexual, and drug-related rewards—lights up on the brain scans. These experiments indicate that there is only one reward metric for human beings—sensations of pleasure and pain emanating from the basal striatum. Neuroeconomic research is putting feelings and emotions where they belong—at the core of economic decision making.”
Even family advice columnist Ask Evelyn says that people learn patience and perseverance as they learn to wait their turns, wait for a particular card, or come back from a loss. They learn to finish the game, sticking it out to the end, whether they win or lose. And they learn to win or lose graciously. They learn to cooperate, be honest, play fair, evaluate situations, and use critical thinking and strategy. They also learn to make choices for which they must accept the consequences. Accepting the consequences of your choices—being responsible for them—is a vitally important life skill. Best of all, no one has to work at “teaching” all this. It happens naturally while you are having fun together.
Juan Barrientos, Development Officer with the Games for Learning Institute at NYU, describes how they are “studying what makes games fun and educationally effective. G4LI researchers use a variety of methods such as game play observation, interviews, and experiments in order to identify design patterns for effective educational games.”
Ken Perlin, Director of the Institute, adds, “The key question is how to reliably design fun and measurably effective learning games. The Games for Learning Institute places this question on a solid empirical scientific foundation by creating a wide variety of mini-games. As kids play these games, we measure the impact that various patterns of game design have on different kinds of learning outcomes.”
Institute co-Director Jan Plass further emphasizes that “in addition to conducting empirical research on design patterns for effective educational games, the mission of G4LI is to create a thriving research community on educational games. For example, we are building a game design architecture, based on XNA, that will be fully instrumented, and will therefore allow other researchers to collect play data for their own studies on games and learning.”
These same principles apply at work today and will increasingly apply as the next generation of employees learns and prepares for future employment. How do we use games to teach employees to deploy defect prevention techniques? Using mini-games to attract attention to a variety of techniques helps to distribute effort across the set of techniques as prescribed by Markowitz’s MPT.
Below are three examples of Productivity Games we’ve used at Microsoft to improve quality. See Chapter 5, “Using Games to Improve Productivity,” of The Practical Guide to Defect Prevention for the genesis of this approach and for some examples related to development of Windows Vista. In addition to the list below, there’s a lot of work going on in the Office Labs group to experiment with the use of games in Microsoft Office. The Office Labs Skill Tracker adds elements of game play into Office to motivate users to explore more of the applications, learn new features, and boost their productivity. Skill Tracker will be released in late 2009. Office Labs Program Manager Jennifer Michelstein, who is coordinating the Skill Tracker project, believes that “adding elements of game play to Office can motivate people to learn more features and boost their productivity, while having fun, competing, and feeling good about learning. A key variable is integrating the right level of fun in Office, so that game elements boost instead of reduce overall productivity.”
Code Review Game
Background: Code reviews are a cost-effective method for discovering defects, but they require training, rigor, and dedicated effort.
Problem: How do we encourage effort in code reviews, when, for individuals, techniques requiring less effort might be more attractive?
Solution: Code Review Game
The Windows Security Test team deployed a game designed around code reviews in March 2009. The belief was that work and fun need to co-exist to help keep employees motivated through the ebb and flow of the product cycle and that you must vary defect detection approaches to get the best results. The team had already organized many games, such as bug bashes, bug smashes, and self-hosting. In one of the brainstorming sessions, the team came up with the idea of organizing a Code Review Game:
1. We wanted to keep the game easy and simple to get the best results. We created 4 teams and captains for each team.
2. We asked the teams to choose their code and make sure it was not chosen by any other team. Each team earned points according to the following rules:
a. Every sev 1 code bug gets 10 points.
b. Every sev 2 code bug gets 5 points.
c. Every Doc/KB gets 3 points.
d. Every code bug for Win8 gets 2 points.
e. For participation every team gets default 4 points.
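The scoring rules are simple enough to capture in a few lines. This is a sketch rather than the tool the team actually used; the category keys are hypothetical names, and it assumes each finding is counted in exactly one category.

```python
# Points per finding, taken from rules a through d above.
POINTS = {
    "sev1_bug": 10,
    "sev2_bug": 5,
    "doc_kb": 3,
    "win8_bug": 2,
}
# Rule e: every team starts with points for participating.
PARTICIPATION_POINTS = 4


def team_score(findings):
    """Return a team's total score.

    findings maps a category name (a key of POINTS) to the number of
    findings the team logged in that category.
    """
    unknown = set(findings) - set(POINTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return PARTICIPATION_POINTS + sum(
        POINTS[category] * count for category, count in findings.items()
    )


# Hypothetical tallies for two teams.
scores = {
    "Team 1": team_score({"sev1_bug": 1, "sev2_bug": 2, "doc_kb": 1}),
    "Team 2": team_score({"win8_bug": 3}),
}
```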
Response and enthusiasm for the Code Review Game
The team rates this game approach as one of the most successful in the recent past. Each team created their own strategy to win the game. A few of the strategies to “win the game” were shared and they look a lot like solid techniques to find defects:
1. Identify the developers who are more prone to make errors and take their code to review to get maximum ROI.
2. After finding a code review bug, look for similar bugs in the rest of the code. If the code was written by the same developer, it’s highly likely that a similar bug will appear in other places or files as well.
3. Check all the APIs used by the developer against MSDN to see whether they are correctly documented.
4. Do the review in the first four hours of the day rather than later in the day, when the teams are already exhausted. Code review takes a lot out of individuals.
5. Divide the code into pieces so that each day people review around 500 lines of code.
6. Organize the game with a critical yet playful attitude, never targeting any individual developer. If you find a good piece of code, never forget to praise the developer for it. This way the developer feels the sincerity of the players and understands that the issues reported are genuine.
7. Clear the deck for code review week; get stuff done early so there is time to concentrate.
8. Start with the code review checklist. As new people enter the game or as new strategies are developed, update the checklist to raise the bar for competition.
PageHunt Game
Background: Understanding, measuring, and improving search relevance
Problem: Evaluating search relevance and findability of Web pages
Solution: PageHunt Game, available at http://PageHunt.msrlivelabs.com
The goal of the PageHunt Game is to improve the relevance of the search results and, particularly, to look at the areas where improvements can be made. Most work to evaluate and improve the relevance of Web search engines typically uses human relevance judgments or click-through data.
Both of these methods look at the problem of learning the mapping from queries to Web pages. They work when a page or result does get surfaced. But what if some pages rarely get surfaced? There are no ratings from the crowd, no click-through data, nothing at all. This game is designed to employ a different approach of learning about the mapping from Web pages to queries.
The hope is to use the data from the game to automatically extract query alterations for use in query refinement. For example, from data gathered in a pilot version, we learn that JLo is a sort-of equivalent to Jennifer Lopez, Capital city airport to Kentucky airport, and so on. So when someone searches for JLo, we can refine the query to look for Jennifer Lopez as well and improve our search results. We also hope to get additional metadata for pages (e.g., image-heavy, text-poor ones), identify ranking issues, etc.
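As a rough illustration of how such alterations might be mined, the sketch below pairs up queries that surfaced the same page and keeps the pairs that recur across pages. This is not the PageHunt team’s algorithm; the record format, the support threshold, and the example.com URLs are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_alterations(game_records, min_support=2):
    """Return query pairs that surfaced the same page on enough pages.

    game_records is an iterable of (page_url, query) tuples, one for each
    query that brought the page into the results during a game. The result
    maps an alphabetically ordered (query, query) pair to the number of
    distinct pages the pair shares.
    """
    queries_by_page = defaultdict(set)
    for page, query in game_records:
        queries_by_page[page].add(query.strip().lower())

    support = defaultdict(int)
    for queries in queries_by_page.values():
        for pair in combinations(sorted(queries), 2):
            support[pair] += 1

    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical game data.
records = [
    ("http://example.com/page1", "JLo"),
    ("http://example.com/page1", "Jennifer Lopez"),
    ("http://example.com/page2", "jlo"),
    ("http://example.com/page2", "jennifer lopez"),
    ("http://example.com/page2", "j lo"),
]

alterations = candidate_alterations(records)
```

Requiring a pair to appear on more than one page is a crude guard against coincidences; a production system would need something far more robust.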
The Language Quality Game
Background: Localization, translation, and linguistic quality require tremendous investment, effort, and talent. Usually, the best way is to hire local experts to complete a manual visual inspection of every translated string, dialog, and user interface element.
Problem: How to capture local and cultural nuances, reduce the expense and schedule time, and improve quality of localized releases by employing native language speakers.
Solution: Language Quality Game
The traditional business process uses specific language vendors to perform translation work and then a secondary vendor to assess the quality. The business challenge has been that, for some languages and locales, finding two independent vendors can be difficult and costly. To address this problem, the Language Quality Game was developed to encourage native speaking populations to do a final qualitative review of the Windows user interface and to help identify any remaining language issues. The goal was to ensure a high-quality language release. Using the diverse population of native language speakers within Microsoft has enabled the pre-release software to be validated in a fun and cost-effective way. The list of Windows languages can be found on MSDN.Microsoft.com.
Language Quality Game results (values not available): Total Screens Reviewed (Points Earned), Average Screens per Player, Top Player Screens Reviewed.
To learn more about the Language Quality Game, join Microsoft and others in the software quality field at the 27th Annual Pacific Northwest Software Quality Conference (http://www.pnsqc.org/). Joshua Williams, a senior test engineer from the Windows Defect Prevention Team, will be presenting a paper demonstrating the success of using games in testing software.
The use of Productivity Games at Microsoft stretches back several years. Early use of games to increase quality improvement efforts in Windows began in 2006. In August 2008, Forrester released a report on Serious Games (“It’s time to take games seriously,” by TJ Keitt and Paul Jackson) with the insightful prediction that “the strongest ROI and ultimate adoption will be in serious games that help workers do real work.”
As we warm up for the highly anticipated 50th anniversary of Portfolio Selection, we can recognize that the world is changing. Crowd-sourcing, social networks, and instant and real-time communication are all altering the way we work. However, our ability to focus has not increased at the same rate as our tools. How does the crowd focus its attention? How do we, as those interested in attracting effort from the crowd, retain the crowd’s attention span? Creative and collaborative play is the key. Productivity Games help individuals work together effectively and help focus our collective energy.
There is real potential here.
Productivity Games could be the Six Sigma of the 21st century.
To give a shout-out to others in the field, here’s a list of great thought leaders and references in this area:
· NYU Games for Learning Institute http://g4li.nyu.edu/
· Snowfly, Inc – www.snowfly.com
· IBM Innov8
· Video Games as Learning Tools – http://vgalt.com/
· Seriosity http://www.seriosity.com/
· Games at Work – upcoming book
· Changing the Game – book
· Google Guest Blog Post – Using Games to Improve Quality http://googletesting.blogspot.com/2008/06/productivity-games-using-games-to.html
· Games with a Purpose – www.gwap.com
· Fold It http://fold.it/portal/
· Pacific Northwest Software Quality Conference http://www.pnsqc.org/
· Office Labs http://www.officelabs.com
· The Edge Magazine – http://www.edge-online.com/blogs/changing-the-game
· NY Times Freakonomics blog – http://freakonomics.blogs.nytimes.com/2008/11/05/theres-free-labor-in-video-games/?scp=1&sq=changing%20the%20game&st=cse
· Employees play games, Microsoft Wins http://unlockthemysteries.com/employeevideogamesmicrosoftwins.aspx
· Behind the Scenes at Microsoft http://www.leaderperfect.com/articles/microsoft_trust.htm
· Changing the Game: How Video Games are Transforming the Future of Business – http://www.changingthegamebook.com – Ch8
· CNET interview link http://podcast-files.cnet.com/podcast/danielgameauthors.mp3?tag=mncol;txt
· Realtime Performance Webinar link
· Singapore Management University did an interview (about 4 mins in) http://www.forimmediaterelease.biz/index.php?/weblog/the_hobson_holtz_report_podcast_400_november_24_2008
· Serious Games Summit – 2008 presentation http://www.defectprevention.org/downloads/bug%20hunter.pdf
· Microsoft Research – Rethinking the ESP Game.
· And, of course, the Microsoft Press book that brought us all here: The Practical Guide to Defect Prevention. Buy it today <grin>.