How would you administer millions of machines?


With the arrival of “Internet search engines” like Google, Yahoo, or MSN, it is now common practice to administer pretty large clusters of machines. These days it is not uncommon to find clusters that comprise hundreds of thousands of computers. So it is not that hard to take the step from here to, say, clusters containing millions (or tens, if not hundreds, of millions) of machines.


But there is a little problem here – when you have clusters of this size, machine failures become noticeable. Especially when you use cheap hardware, which BTW makes sense given that you have to keep costs down with these gargantuan clusters. And this problem generates another problem – who will do the wiring, the replacements, and so on for a cluster of this size? Not to mention installing new computers.
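
To get a feel for why failures become part of the daily routine at this scale, here is a quick back-of-envelope sketch (the cluster size and failure rate are just assumptions, say one failure per machine every three years):

```python
# Rough estimate of daily machine failures in a very large cluster.
# Assumptions: one million machines, each failing independently about
# once every three years on average.

cluster_size = 1_000_000                     # machines in the cluster (assumed)
days_between_failures_per_machine = 3 * 365  # assumed mean time between failures

failures_per_day = cluster_size / days_between_failures_per_machine
print(f"Expected machine failures per day: {failures_per_day:.0f}")
# Roughly 900 machines per day need attention under these assumptions,
# which is why hardware replacement becomes a full-time, never-ending job.
```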


I am just wondering what a fully automated solution would look like. You would need an entire infrastructure to keep hundreds of thousands of computers in a single room (including the weird requirements for power, ventilation, cabling, etc.). And some sort of robotic arms to install new computers or to replace defective ones. And a robotic crane to move racks around, etc. And everything automatic, fully driven by software…

Comments (6)

  1. Rob says:

    You would also need hundreds of unemployed PC techs.

  2. Mark says:

    Why would you notice machine failures when you have a million+ clustered machines? AFAIK, Google doesn’t currently replace failed machines…they let them sit in the rack. Presumably, when a tech is in that area, they can replace it. But the loss of a commodity machine in a large cluster doesn’t even register as a blip.

    Of course, if you wanted to make this fully automated, the software is the easy part. Mapping addresses to rack locations is easy enough (a rough sketch follows at the end of this comment)…and then it’s just a matter of pulling the machine and replacing it. I’d suspect it’d be easier for a machine to replace the PC if it had a single push-style connector (like a hot-swap drive) rather than multiple cables.

    You wouldn’t need to move racks either…that’d be a logistical nightmare. A mobile, battery-powered robot would be much easier.

    Of course, when you can hire someone for < $10/hour to replace machines, it doesn’t make much sense to build a robot to do it. 😉
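
    A minimal sketch of the address-to-rack mapping idea above (the addresses, rack labels, and helper function are made up for illustration):

    ```python
    # Toy inventory: map each machine's address to its physical location.
    # In a real deployment this would live in an asset database, not a dict.
    inventory = {
        "10.0.1.17": {"rack": "A12", "slot": 3},
        "10.0.1.18": {"rack": "A12", "slot": 4},
        "10.0.2.51": {"rack": "B07", "slot": 1},
    }

    def replacement_ticket(address):
        """Turn a failed machine's address into a work order for a tech (or a robot)."""
        location = inventory[address]
        return f"Replace machine {address} in rack {location['rack']}, slot {location['slot']}"

    print(replacement_ticket("10.0.2.51"))
    # -> Replace machine 10.0.2.51 in rack B07, slot 1
    ```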

  3. Steve Hall says:

    HHuummm…. This is all starting to sound like the design for a Borg Cube…

    Seriously though, this is the reason why mainframe datacenters are sticking with more fault-tolerant solutions than commodity hardware: the overhead (people-time and floor-space) will soon outweigh any savings you think millions of commodity racks will get you… Scaling out also has its "MP overhead" (multi-processor overhead, see "Amdahl’s Law" at http://home.wlu.edu/~whaleyt/classes/parallel/topics/amdahl.html; a small worked example appears at the end of this comment) and interconnection nightmares that scaling up doesn’t have. (MP overhead for scaling out can be anywhere from a few percent to 10–30% of the processors performing both OS and application "overhead" such as message passing…) Basically, it’s far easier to maintain several scaled-up mainframes, each with thousands of users or capable of handling millions of transactions each day, than it is to manage thousands of small systems and the thousands of network interconnections between them.

    Now granted, these days a Unisys ES7000 mainframe is composed of up to 106 Xeons or Itanium 2s running Windows Server 2003 Datacenter…so even what is being called a mainframe is essentially a scaled-out, overly complex MP rather than a 4-CPU or 8-CPU System/390 running MVS. Even so, I’d rather have dozens of Unisys-style clusters than thousands of rack-mounted commodity machines. With the Unisys system, you’ll get better fault tolerance…and hopefully fewer "lost transactions" due to an unrecoverable CPU fault or memory fault. (Even some of the blade solutions are really nothing more than space-reduction packaging, with no real management improvement or MP-overhead reduction.)

    The same goes for disk drives: having racks and racks of RAID strings with no fault-tolerant RAID controller host software is a scaling-out "solution" that’ll work until drives start failing… Again, scaled-up solutions like EMC’s RAID storage subsystems are a better way to manage lotsa storage.

    I used to work at the world’s largest datacenter, where we had tens of thousands of disk drives (the old 14" removable 3330 platters). We had a staff of a half-dozen CDC and IBM FEs that did nothing but replace drive motors and drive brake-belts 24×7. (An average of about 6-12 drives failed each day… On a bad day, 2-3 dozen would fail…) Of course, this required that we have a staff of a half-dozen System Engineers that did nothing but restore disk volumes from weekly/nightly backups…24×7.

    If anything was like a Borg Cube, that scaled up and out datacenter certainly was! (Fortunately, I wasn’t in the SE group that ONLY did volume restores around the clock…) When you see disk farms that large which are connected to more than a dozen mainframes through the use of several dozen "T-bar" switches, you become VERY AWARE of the interconnection problems of large-scale datacenters…and it was almost frightening at times to realize how many thousands of time-sharing users were constantly logged on. (They also had one of the world’s largest real-time TP systems: a retail credit-sale verification system of hundreds of thousands of cash registers and POS credit terminals all leading into a single mainframe running CICS and TCAM. Each year on the day after Thanksgiving I had to go to work to babysit that machine…because I was the MVS, TCAM and CICS internals expert.)

    In today’s world, a failure of a dozen disk drives a day in a disk farm of thousands of drives or a dozen CPUs out of thousands would probably sink any company that relies on those "solutions" for real-time TP systems. (Google can get away with commodity hardware and lotsa system failures since they are NOT running a "mission critical" application…)

    The post-mortem results aren’t in yet on the Comair "computer outage", but initial reports are that it was a classic scaling out problem. (The hardware and software vendors are all in the "finger-pointing stage" of the post-mortem…) That problem will be an interesting one to analyze.
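
    A small worked example of the Amdahl’s Law effect mentioned above (the overhead percentages are just the ballpark figures from this comment, treated as the serial fraction):

    ```python
    # Amdahl's Law: speedup = 1 / (serial_fraction + (1 - serial_fraction) / n)
    # Here the serial fraction stands in for the "MP overhead" (OS work, message
    # passing) that does not parallelize.

    def speedup(serial_fraction, n_processors):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)

    for overhead in (0.05, 0.10, 0.30):      # 5%, 10%, 30% overhead (assumed)
        for n in (8, 1000, 100000):          # scaled-up box vs. scaled-out cluster
            print(f"overhead={overhead:.0%}, n={n}: speedup ~ {speedup(overhead, n):.1f}x")
    # With 10% overhead the speedup tops out near 10x no matter how many
    # processors you add; with 30% it tops out near 3x. The curve flattens fast.
    ```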

  4. Adi Oltean says:

    Interesting comments…

    On one hand, it is true that not every problem can be solved by a huge scale-out with commodity hardware. Correct – in many cases (especially when you need ACID transactions plus response times measured in fractions of a second) commodity hardware won’t really help you.

    In addition, many problems aren’t solvable with commodity hardware simply because the cost might be too high: the power to keep all these machines running, the space requirements, or even the hardware itself might cost much more than a single system, be it an ES7000 or an HP Superdome…

  5. Steve Hall says:

    Watching "60 Minutes" last night reminded me of an interesting example of scaling out that’s "taken on a life of its own": Google’s use of "so-called" commodity hardware. In one video clip, 60 Minutes showed a machine room with racks of servers…but they were all custom PC enclosures with the name "Google" not just stamped into the covers, but molded into the metal or glued on as a big piece of plastic. They sort of looked like Sun pizza-box-style client-side workstations. Then they pointed out that Google "has gotten so big, they make their own PCs to make sure they get better PCs than at the corner store".

    And then, looking at Google’s current job ads, one can see they’re taking these rack-mount servers pretty seriously: they’ve got a server hardware department that does mechanical design of the cases and parts qualification of all the adapter cards, etc., that go into the PCs. (See http://www.google.com/intl/en/jobs/eng/hw.html for details…)

    This is akin to my last company, where we had to do the same thing for a while (make our own PCs) in order to have a certain amount of control over the parts that went into them.

    Having to do your own hardware design for a rack-mount PC is the ugly downside of needing a large number of PCs that are all EXACTLY the SAME so that managing them becomes a LOT EASIER… The trade-off is that if you have to staff a hardware department in order to control the contents of the PCs, then the results are most definitely NOT "commodity PCs", as their cost will be much higher than what one could buy at Fry’s Electronics or from your local Taiwanese-owned PC shop.

    With Google’s current technique of staffing a hardware department, I’m curious what that department’s budget is and how much money they expect to save (i.e., operational costs) by making all servers conform to a common hardware spec. (From my past experience, I know what the disastrous results are when you do NOT control the hardware specs for PCs…my former company balked after making PCs for a while and just figured that a local PC shop could be trusted to produce the same PC consistently over a span of many years. That cheapness backfired, of course!)

    It’d also be interesting to hear what one of the individual Google servers costs to make in-house and what its specs are (to compare to off-the-shelf el-cheapo systems). I wonder if anyone at Google would care to brag about their efforts…