Out of calibration

It’s calibration time at Microsoft. Time for managers to rank everyone in your peer group (same discipline, same career stage, same division) into five (and a half) ranges: the top 20 percent (and the top 5 percent within it), the next 20 percent, the middle 40 percent, the lower 13 percent, and the bottom 7 percent.
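To make the arithmetic concrete (my illustration, not HR’s tooling), here’s how those percentages translate into head counts for a hypothetical calibration group:

```python
# Hypothetical example: bucket sizes for a calibration group of 300 engineers.
# The top 5 percent is a subset of the top 20 percent -- the "half" range.
RANGES = {
    "top 20": 0.20,
    "next 20": 0.20,
    "middle 40": 0.40,
    "lower 13": 0.13,
    "bottom 7": 0.07,
}

def bucket_sizes(group_size):
    # Round each bucket; in a real meeting, managers juggle the remainders.
    return {name: round(group_size * pct) for name, pct in RANGES.items()}

sizes = bucket_sizes(300)
print(sizes)              # 60, 60, 120, 39, 21 -- totals 300
print(round(300 * 0.05))  # 15 people in the top 5 percent
```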

Calibration brings out the best in us—the best in our acidic, reproachful disdain. Engineers hate calibration because it’s not fair to great teams, in which everyone deserves high ratings, and because it discourages teamwork, since team members compete against each other for rewards. Managers hate calibration because it forces them to make hard choices, it punishes them for having a great team while rewarding their peers for having poor teams, and it creates uncomfortable conversations with their employees.

Well, I love calibration. That’s right, I love it! You weenies and whiners can go join some puritan, petite startup, while I count our billions and continue working with a top-notch staff. Hey, I’m huge on rewarding strong teams and teamwork. The fact that you think calibration discourages teamwork shows your ignorance. It’s time you got a clue.

Wisdom of crowds

There are teams and then there are divisions. They are not the same. A division has thousands of engineers. A team has between one and 12. You are calibrated against your peers in your division, not your team.

Yes, the few engineers on your team who are in your discipline and career stage are among the hundreds in your calibration group. So what? That’s rounding error. You’re not competing against your teammates—you’re being compared across your division.

“Yeah, but my boss says every team has to fit the curve!” Typically, group managers like their entire teams to fit the percentages as a starting point. I used to misunderstand this, thinking it applied within career stage and would persist in calibration. Now that I’ve been through many calibrations, I realize it’s like the initial guess in a root-finding algorithm. You’ve got to start somewhere, but it’s rarely where you end up. By starting with teams roughly meeting the percentages, you at least cover the common cases quickly. Nonetheless, managers still talk about everyone in the calibration group.
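To push that analogy a little further (the code is mine, not part of the process), a root-finder’s initial guess rarely survives to the end:

```python
# Newton's method: the initial guess just gets the iteration started;
# the final answer usually lands somewhere else entirely -- much like
# per-team percentages at the start of calibration.
def newton(f, df, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    return x

# Find sqrt(2) as the positive root of x^2 - 2, starting from a rough guess of 3.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=3.0)
print(root)  # ~1.4142135623..., nowhere near the starting guess
```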

Eric Aside

Employees often complain that the HR text descriptions of each rating range don’t match the actual definitions—the percentages. True, but those definitions are a handy guide for managers to determine a starting point for calibration.

What are you trying to say?

A common concern is, “Instead of rating teams against their results, we’re rating engineers against each other! Doesn’t that discourage teamwork?” No, it doesn’t. Remember, you’re compared against all the engineers in your discipline and career stage across your division—not just the few on your team. If you and your teammates perform better than others in your division because you collaborate well, then you and your teammates will rank higher.

Don’t get me wrong—managers can certainly use calibration to create a competitive environment within their teams, making them dysfunctional. But managers can create competitive, dysfunctional environments any number of ways. I discussed this at length in “Beyond comparison” (Chapter 9).

Calibration doesn’t discourage teamwork. However, calibration does have a message for employees—Microsoft pays for performance.

If you are not performing as well as other engineers in your discipline at your career stage, then you will not be paid as well as your peers. If you perform better than other engineers in your discipline at your career stage, then you will be paid extra—sometimes a great deal extra. True, “perform better” is subjective, and any subjective system can be abused, but managers are calibrated too. In the end, Microsoft seeks to reward sustained excellence.

We must prepare

How does the actual calibration meeting work? How do managers decide who is “performing better”? It starts with preparation. HR provides a standard spreadsheet that has tabs for every employee and a main table to capture the calibration results. Because it’s a spreadsheet, metrics are automatically calculated to help groups understand the distribution of performance.

The tab for each employee in the spreadsheet is filled out in advance of the calibration meeting. HR puts in the employee’s past review results and basic information (such as the employee’s name, level, discipline, and date of last promotion). The employee’s manager adds:

  • What the employee accomplished against and beyond his or her commitments.
  • How the employee accomplished those results. (Did the employee make friends or enemies along the way?)
  • Proven capability the employee has demonstrated in the past (for context).
  • Feedback the employee has received from peers.
  • A promotion indicator with explanation.
  • The recommended choice out of the five ranges.

Naturally, these tabs are easier to fill out after employees have submitted their own assessments and after the manager has received peer feedback. That’s why everyone is encouraged to fill out assessments and feedback at least a few days before calibration. Yes, I know that doesn’t always happen. (That’s another reason to talk to your manager regularly.)

Getting to know you

The actual calibration meeting is usually six to eight hours long (no joke). Typically, names of employees are put on 3×5 cards or Post-it notes. For each career stage, the cards of employees in that career stage are placed in one of the five ranges as a starting point.

One range at a time (typically highest first), managers talk about every employee in that range. In addition to describing what, how, proven capability, and feedback for each employee, the managers talk about why they feel those aspects align with the selected range. Other managers typically ask questions, particularly if the reasoning doesn’t align with their own. In cases where employees clearly don’t fit the initial choice of range, they are moved up or down accordingly, regardless of percentages.

Once everyone in a range is discussed, the number of employees in the range is compared to the target percentage. If there are too few in the range, managers discuss who might move up as they review the next range. If there are too many in the range, managers go back and further question the employees who seemed to fit least solidly in the range. This all continues until every range is discussed and the calibration model is complete.
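Loosely sketched in code (the data structures are mine; the real meeting runs on Post-it notes and discussion, and the “fit” score below is just a stand-in for manager consensus), the rebalancing works like this:

```python
# Rough sketch of range-by-range rebalancing, highest range first.
# Each employee is a (name, fit) pair, where "fit" stands in for the
# consensus formed during discussion.
def rebalance(ranges, targets):
    for i, target in enumerate(targets):
        # Discuss this range; order by how solidly each person fits it.
        ranges[i].sort(key=lambda e: e[1], reverse=True)
        while len(ranges[i]) > target and i + 1 < len(ranges):
            # Too many: the least-solid fit moves down a range.
            ranges[i + 1].append(ranges[i].pop())
        while len(ranges[i]) < target and i + 1 < len(ranges):
            # Too few: the strongest candidate below moves up.
            ranges[i + 1].sort(key=lambda e: e[1], reverse=True)
            ranges[i].append(ranges[i + 1].pop(0))
    return ranges

result = rebalance(
    [[("Ann", 0.9), ("Bob", 0.5)], [("Cal", 0.7)]],  # initial placement
    [1, 2],                                          # target head counts
)
print(result)  # Bob drops to the lower range; Cal outranks him there
```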

That’s a bit extreme

A naïve manager exclaims, “But I’ve worked hard to create a high-functioning team. They should all be in the top range!” Congratulations! Is there no room for improvement? I’ll bet there is.

Remember, you aren’t comparing teammates to each other—you are comparing employees to their peers across the division. Usually, you’ve got a mix of engineers that work well together. Reward each accordingly, and help every engineer become the best employee possible.

“But what about lame managers and lame teams? They all should suffer!” That’s harsh—calm down. But there are bad managers who hire bad people, yet present them as average or good people. These managers manipulate the system until they are caught.

Even the best of the worst managers get caught within a few years—usually faster. They are replaced or their teams are disbanded. No system is perfect, but ours does get things right, given time, and my experience is that the process has gotten better since Microsoft started focusing more on management excellence.

That’s not fair at all

The naïve manager laments, “But what do I tell a solid employee who was in the bottom 7 percent? He completed all his commitments.” Perhaps the commitments were too easy for his level—but what’s done is done. The employee is still in the bottom 7 percent. He is not getting a bonus, a raise, stock, or a promotion. Instead, he is getting a tough message about moving up or moving out.

“But he’s a solid employee. How is that fair?” You’ve got a solid employee who’s not nearly as good as other engineers in the same division, in the same discipline, at the same career stage. That means you can replace that employee and likely achieve better long-term performance. So, either your employee can substantially improve, or he can find another place to be successful. I describe this in detail in “The toughest job” (Chapter 9).

Eric Aside

It’s important to have appropriate commitments for your career stage—achievable yet challenging enough to meet expectations for growth. Some divisions calibrate commitments within career stages by having managers review the commitments of their staff with their leads and with their peers. You can read more about writing great commitments in “I’m deeply committed.”

I’m still here

Yes, Microsoft compares employees against other employees in similar roles at the same career stage. Microsoft pays for individual performance. But why not pay for team performance instead, or at least in addition?

Personally, I’d like an element of my pay to be based on my team’s performance. Perhaps it will someday. However, I wouldn’t want all my pay to be team-based. I work for Microsoft, not my team. When I switch teams, I’m still working for Microsoft. My pay needs to be at least partly based on my own performance, not my associates’.

I also believe Microsoft could better recognize and reward the wide range of personality types and skill sets needed to create a high-functioning team. We need to find more ways to ensure that teamwork is recognized when it enables as well as when it produces.

Even with its imperfections, Microsoft’s system of pay for performance sustains the top-notch engineering workforce that I have the pleasure and privilege to collaborate with every day. That quality wouldn’t exist if calibration didn’t force us to have hard conversations and value the best among us. I love it.

Eric Aside

The percentage ranges I mention at the start of the column are part of the new review model introduced this year. In my 16 years with the company, I’ve seen three review models:

  • 2.5 – 4.5/A – C: The model when I started in 1995. A 2.5 meant you were done—up or out within 6 months. A 3.0 was a warning—you’re off your game. A 3.5 was standard goodness. A 4.0 was great, and a 4.5 was wow! The stock ratings (A best, C worst) were hidden and unknown unless you were a group manager. Large divisions had a discipline calibration. All organizations had a cross-discipline calibration (in large divisions it followed the discipline calibration). The approach varied by vice president, and your rewards were determined separately from your ratings. Reviews happened twice a year.
  • Underperformed, Achieved, Exceeded/20-70-10:  The model we’ve had since 2006. Theoretically, the commitment ratings were not subject to percentages, but in actual practice they did conform to rough percentages in order to differentiate bonus amounts within a fixed budget. The contribution rankings were transparent and directly corresponded to percentages—top 20 percent, middle 70 percent, and bottom 10 percent. All divisions had a calibration based on discipline and career stage. Some organizations followed with a second cross-discipline calibration by career stage. The approach varied by vice president, and your rewards were determined separately from your ratings. Reviews happened once a year.
  • 1 – 5:  The new system starting this summer. A 1 means the top 20 percent, 2 is the next 20 percent, 3 is the middle 40 percent, 4 is the lower 13 percent, and 5 is the bottom 7 percent. There is no separate stock/contribution ranking. All divisions have a calibration based on discipline and career stage. The approach is dictated by HR (though I imagine different vice presidents will introduce small variations), and your rewards are directly determined by your rating (with the very top 5 percent receiving extra). Reviews happen once a year.

Personally, I like the new system. We’ll have to see how it works out, but initially I like the simplicity of a single rating, that it’s on a curve that better matches historic percentages, that there’s no cross-discipline calibration (it made no sense), and that there’s a standard approach across divisions. And I am overjoyed with the direct, fixed, and transparent tying of rewards to ratings.

I can think of three more improvements beyond the ones I mention in the column.

  • Have six ratings (a separate one for the top 5 percent). With everything else so transparent, why hide the top 5 percent?
  • Get credit toward your bottom 7 percent if you worked with HR to dismiss or implement a career change for struggling employees during the past review cycle. You can’t claim people who simply transferred or left—HR has to have been actively engaged regarding these folks who would have otherwise gotten a 5 rating. This would encourage managers to connect with HR and actively manage performance issues.
  • Do calibration twice a year, and share the midyear calibration rating (perhaps tied to a semi-annual bonus, as it used to be, or used just as a tracking number). As a manager, I hate that I can’t provide my employees an unambiguous message about their calibrated performance at midyear. I can tell people they are in jeopardy, but psychologically that’s not the same as giving them a number.
Comments (8)

  1. Andrew says:

    It doesn't matter how many times you say that forced distribution doesn't encourage anti-team behavior; the simple fact of the matter is that it does. It's almost impossible to overestimate the power of incentives. You can avoid the worst excesses by also ranking employees on how they went about their work, but I'll bet results trump method almost every time, and employees know that.

    The real assessment that the designers of this system have to make is whether the loss of productivity caused by anti-teamwork behaviors is worth the gain caused by competition for rewards. The current prevailing wisdom is that it is, at least for disciplines like engineering. I wouldn't recommend paying your sales force on a curve :-).

    It's interesting to me that your current system has a very distinct "bottom of the pile" rating. Psychology teaches us that people hate losing what they already think they have far more than they value gaining something new. In this case the '5' rating will be a very powerful negative motivator as people fight to stay out of it. In a 10-person team somebody is going to be a '5', and I predict you'll see worse anti-team behavior associated with avoiding a '5' than with getting a '1'.

  2. Anonymous says:

    I always wonder why force a fixed distribution. Let teams/divisions put their employees on a curve but the curves can vary. Some teams could have Seattle's yearly temperature curve (i.e. the variation is smaller) and some teams could have New York's yearly temperature curve (i.e. variation could be more pronounced). Budget allocated to divisions/org can be the same or better still can be proportionate to their own relative rating. Then it should be left to the division/org to distribute it based on the natural distribution of their employees.

    It is pounded home that people statistically fit the curve, but the reality is that the curve is forced down onto smaller and smaller populations (orgs, even teams), where it makes no sense when applied to a small number of people – and it does penalize strong teams.

  3. Gj says:

    I have been here for only 4 months, but I will be compared to peers who have been here for almost a year. Let's see how I will be calibrated.

  4. E says:

    I appreciate you writing a column that sheds a little light on calibration.

    A few complaints…

    The first is that you give the impression of fairness in the system. There are myriad ways that you can get screwed.

    The big problem I have is that you are portraying the system at its best, but as an employee ( and an IC particularly) you have to protect yourself against the system at its worst. Even if you like your management chain and think they are trying to help your career, you have to be very careful how much you help others because that can leave you with fewer things of note come calibration time than your peers have. Or you can just as easily get screwed if you share an organization with a larger group; the smaller team will almost always lose out in calibration.

    The second is about your endorsement of the overall approach, especially the bottom 7% of the system. We've all seen people at jobs who, for whatever reason, couldn't meet their expectations, and good organizations will deal with them. But if you force a distribution, you encourage organizations to instead keep that person around as the one who will fill that important spot in the distribution; if I get rid of that person, then I'm going to have to start delivering that hugely demotivating message (which, incidentally, is often forced on me from above) to the reports I'm depending on to get things accomplished. You suggest that I can replace them with a higher-performing employee, which a) doesn't help me a lot if the job market is competitive, and b) doesn't help during the time I have to look for and train that employee.

    Not to mention the fact that if you did this 7 years straight, you would lose the whole lower half of your distribution, and anybody new you brought in would be just as likely to push the average group level down as to bring it up.

    The fact is that there are important jobs at Microsoft that you can't put a high-performing employee in because that person is going to leave for something more interesting/exciting within a year.

  5. Some comments looking back on the ranking system and my 4 reviews at Microsoft. The first two years I got Exceeded and thought the system was working :-)  

    The third year I changed jobs 10 months into the period, and my experience provides a good example of Anon 10:17's comment about the curve being forced on smaller populations. My old manager was responsible for setting the rank. He told me that he only got to rank one person Exceeded for his team, and since I was no longer part of his team he was not going to "waste" it on me. So 10 months of exceeding-level work were disregarded because of forced ranking applied at the team level. I appreciate there is hair-splitting involved sometimes, but I think this is a clear example of how the ideal of ranking across divisions falls short in practice.

    Another comment is a vote in favor of Eric's proposal for a mid-year calibration. In my 3rd evaluation, I had a negative surprise about something that kept me from exceeding, with no prior feedback. The initial problem was that my direct manager did not participate in the initial calibration session, but was brought in after the fact to approve the results. He said that he was initially surprised that I had not been ranked Exceeded, but the mid-level manager pointed out that there was one area of my job where I had less than exceeded—so my manager bought into the ranking. Once I was told about this (some weeks later), I reminded my manager that we had agreed I was not going to concentrate on that area, and showed him where we had documented this in the mid-year write-up. By this time the calibration was set in stone, and I had no recourse to get any changes. If I had known (because of a mid-year calibration) that I was going to be calibrated on that criterion, I could have reminded them of the existing plan, or altered the plan to better reflect the expectations of mid-level management.

    My final comment is an observation about how the ranking system plays into the culture of the company. I had long heard that Microsoft had an "up or out" management style, and Eric's (surprisingly direct) arguments in favor of the ranking system are a clear example of that type of thinking. Longevity, continuity, and competence in a particular area are undervalued—if a particular solid employee is not "better" than an abstract ideal employee of the same level, then they should be replaced by a new person who might better live up to the abstract ideal. In some positions and industries I can buy into the notion that things are so dynamic that someone "sitting still" is wasteful. But applying this logic to all positions across the company means you are losing the solid, experienced base of your company. What you end up with is an employee population that moves around more, is more focused on the short term, is more socially aggressive, and often makes recurring mistakes due to a lack of depth in the product they are currently working on. I am not saying that the ranking system is the sole reason for this culture, but in Microsoft's case I think its style of ranking at the least reinforces that aspect of its culture. There are positive aspects to this practice, but I think it also contributes to the unusually high level of politics I saw while working at Microsoft.

  7. Justin Chase says:

    The problem with calibration, in my opinion, is that it promotes intra-team competition rather than inter-team competition. Competition is good, but teamwork is a force multiplier, and undermining it is a big productivity hit. So rather than forcing teammates to compete against each other, we should be promoting competition between teams. That way you get the best of both worlds. From a gamification standpoint, the current system leads to inefficiencies.

  8. Doctor says:

    So what's your take on the new system that does away with the overt curve-fitting requirement? The new system is being billed as something that 'fixes' the problems of the old; problems you've taken quite some effort here to prove don't exist.