My wife, mother-in-law, father-in-law, and sister-in-law all grew up using Macs. I grew up using various unix machines. I have to say that I was excited, if for no other reason than family tech support, to see that OS 9 would be leaving our home and theirs, to be replaced with something sensible and UNIX based.
This weekend, that sentiment bit me, badly.
The sister-in-law is going to graphic design school, and naturally that requires a mac and the adobe creative suite. She was able to rustle up a powerbook (lombard, maybe ?) G3 at 400 or 500 MHz. It works well enough, and runs OS X and all her mandatory apps. I was impressed at how easily she was able to use it to do various computery tasks. Watching her actually work in Illustrator, drag PDFs to a network share (the other OS X machine in the house), and print from there (network printing wasn't set up) was pretty impressive.
Because my in-laws are so appreciative and generally wonderful, I don't mind doing occasional tech support for the mother and sister in law (the father in law is himself technically brilliant, but short on time). A week ago, the sister-in-law called with something of a dilemma.
Her powerbook had stopped booting. It would hard lock upon bootup, on the white-apple-logo screen with the spinning cursor (I love that Apple added a spinning cursor during boot; it reminds me of 20 year old Sun machines with their spinning ascii cursor). Her dad had already given the machine a once over: the standard apple tricks (pmu reset, pram zapping, etc) and some new special OS X ones (deleting kext cache! - do mac people actually know what that means? 🙂).
His conclusion - hardware issue. Uh oh. Sister-in-law doesn't need any money-expending situations, so a hardware issue would be bad. Her call to the various paid-apple people all involved proposals that she give them money in order for them to do something to her computer that her father had already tried doing.
I asked her to try a few more things over the phone. The machine oddly enough booted and ran correctly in safe mode (hold down SHIFT during bootup). The machine would hard lock in the same place trying to boot from CD (uh oh).
I asked her to boot single user (hold down Command-S during power-on). This drops sister-in-law into a single-user bash prompt, running as root. Frankly, I'm not sure this is what mac users want their debugging experience to be like. With old macs, you just kicked them or threw them away when they stopped working. Now that there's an actual operating system under the covers, people such as myself foolishly believe we can revive these machines with enough messing around.
Indeed, I told her I'd fix it for her when I saw her in person next (they're a 3hr drive from us).
I spent about 10 hours this weekend working over the powerbook. I now know an exhaustive amount about the OS X boot procedure. It's a somewhat unix style boot, with some weird apple stuff thrown in at the end.
When you boot single user, /etc/rc.boot is executed. All it really does is run rudimentary fscks and then drop you to sh. Good.
Multi-user boot runs /etc/rc.
Both scripts look at /etc/hostconfig.
The last line of /etc/rc is something called “SystemStarter”. Uh oh. Capital letters mean it's not unix, it's NeXT/Apple stuff. Missing are both the SysV rcN.d/Sxxfeaturename system and the “it's all in a big rc” BSD style approach.
No, SystemStarter examines /System/Library/StartupItems and /Library/StartupItems, looking for stuff to do.
I modified rc so that it didn't call SystemStarter, and instead gave me another sh prompt. Running SystemStarter manually (they did provide a non-daemon mode, a verbose mode, and a non-gui mode) showed that things were generally ok (except for a problem starting CrashReporter) until you tried to boot the GUI. SystemStarter starts LoginWindow as one of its last jobs, so it was clear that something inside SystemStarter->LoginWindow wasn't right.
Oddly enough, the binary version of SystemStarter was 177.3 on this mac. The other mac, also running 10.3.4 with latest patches, had version 177.2. How does that happen ? One of these machines has the wrong file version on it, clearly, but both feel they've got all the latest critical updates. Interesting.
I did some more reading about what safe mode did, since safe mode booted fine. Turns out that OS X retains Apple's ideas about “Extensions”. Some extensions are “bad”, and don't get loaded during safe boot. We've got the analogous concept in windows. In windows, we use the registry to define service/driver sets for the various configs (CurrentControlSet, etc). In OS X, the system extensions are filesystem structure based.
You've got /System/Library/Extensions, and an extension is foo.kext. foo.kext is a subdirectory, and in this directory you've got Info.plist, which is an XML or NeXT prop-list document describing the extension's dependencies and, as I eventually found out, which boot modes the extension is loaded under. You've also got MacOS/foo, the actual binary image of the loadable kernel module.
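For the curious, here is a rough sketch of pulling that boot-mode property (OSBundleRequired, which comes up again below) out of a single kext. It is in Python rather than the csh and grep I actually used, and the function name `bundle_required` is made up for illustration. It assumes the Info.plist is in XML (or binary) format; Python's plistlib does not read the old NeXT-style prop-lists. It also checks a `Contents/` subfolder, on the assumption that the kext follows the usual bundle layout.

```python
import plistlib
from pathlib import Path


def bundle_required(kext_dir):
    """Return the OSBundleRequired value of a kext bundle, or None if absent."""
    kext = Path(kext_dir)
    # Look at the bundle root as described above, and also under Contents/
    # (assumption: the usual bundle layout).
    for candidate in (kext / "Info.plist", kext / "Contents" / "Info.plist"):
        if candidate.is_file():
            with candidate.open("rb") as fh:
                info = plistlib.load(fh)  # XML or binary plists only
            return info.get("OSBundleRequired")
    return None  # no Info.plist found at all
```

Calling `bundle_required("/System/Library/Extensions/foo.kext")` would return the property's string value if the kext declares one, and `None` otherwise.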
Now, it turns out that kexts are loaded with kextload. Loaded kexts can be queried with kextstat. And there's a daemon which dynamically loads kexts as needed, called kextd.
Kextd is started in /etc/rc (the multi-user script). If safe boot is enabled, kextd -x is called, which tells kextd (as near as I can tell) to just pass -x to kextload. kextload -x means “only load safe-boot or root extensions”.
I decided I'd try and isolate the problem to kexts. I modified /etc/rc so that kextd -x was called regardless of whether we're safe booting or not.
After making this change, normal multi-user boot succeeded. The machine worked great, apart from having no sound, modem, firewire, etc 🙂
Clearly, there's an extension that gets loaded outside of the safe-boot set that hangs bootup somehow. But how to isolate it ?
On this laptop, there were 183 extensions in /System/Library/Extensions. 100+ were required for a safe boot.
OS X has the notion that an extension can be required for the root filesystem, and furthermore, can be required for certain types of root filesystems (i.e. local, network, or cd). The set of extensions actually loaded during a safe boot is any extension marked required for safe boot (in its Info.plist, in the OSBundleRequired property), and any extension which is needed for root or root-filesystem mounting.
It is impossible to discern this, by the way, without using csh and grep. Eventually I constructed some output files using csh, awk, grep, vi, and comm to figure out a list of which extensions were NOT used in safe mode.
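The same split can be sketched in a few lines of Python instead of csh, awk, grep and comm. This is not what I ran, just an illustration, and `split_by_safe_boot` is a made-up name. It assumes XML or binary Info.plist files, and it treats any kext that declares OSBundleRequired at all (safe boot or one of the root variants) as part of the safe-boot set. It does not chase dependencies between kexts.

```python
import plistlib
from pathlib import Path


def split_by_safe_boot(extensions_dir):
    """Split *.kext bundles into (safe_boot, other) lists of bundle names."""
    safe, other = [], []
    for kext in sorted(Path(extensions_dir).glob("*.kext")):
        required = None
        # Check the bundle root and Contents/ (assumed usual bundle layout).
        for candidate in (kext / "Info.plist", kext / "Contents" / "Info.plist"):
            if candidate.is_file():
                with candidate.open("rb") as fh:
                    required = plistlib.load(fh).get("OSBundleRequired")
                break
        # Assumption: any OSBundleRequired value puts the kext in the
        # safe-boot set; kexts without the property are the suspects.
        (safe if required else other).append(kext.name)
    return safe, other
```

The second list returned is the one that matters here: the extensions that a safe boot skips and a normal boot loads.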
In the old world of macs, people would drag extensions out of the extensions folder into Extensions (Disabled) and drag them back in until the box worked. They had then identified the misbehaving extension.
Doing the same thing with csh foreach() loops and mv is just as good.
I started by removing every non-required kext from the default kext directory, and then reverted the kextd -x change and did a normal multi-user boot.
Awesome. The machine booted, but all video was 4 shades of grey (I was pleased to see this bit of NeXTSTEP was still lurking under the covers).
Next I started adding in groups of kexts. I'd make a textfile saying which ones I wanted to add, then add them in a foreach loop, sync the disks, and reboot. If it booted cleanly, I'd add another batch.
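The batch step looks roughly like this when written in Python rather than as a csh foreach loop. It is a sketch only: `move_batch` and the directory arguments are names I've made up, and the reboot between batches is still done by hand.

```python
import os
import shutil
from pathlib import Path


def move_batch(names, src_dir, dst_dir):
    """Move the named kext bundles from src_dir to dst_dir; return those moved."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        bundle = src / name
        if bundle.exists():  # skip names already moved or mistyped
            shutil.move(str(bundle), str(dst / name))
            moved.append(name)
    if hasattr(os, "sync"):
        os.sync()  # flush to disk before rebooting, as in the manual procedure
    return moved
```

The list of names can come from the textfile, one kext per line, read with something like `Path("batch.txt").read_text().splitlines()` (the file name is hypothetical). Swapping the two directory arguments moves a batch back out again if the machine stops booting.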
Working with the mac in greyscale mode kind of stinks, so I wanted to sort that out in short order. However, I found that moving all of the ATI video kexts back in caused a non-boot condition. Interesting. I had noted a few messages coming from the ATIRage128.kext driver as it loaded on previous boots (it can load in text mode and not crash) and I also noticed that trying to unload it caused a hang. I removed the ATIRage128.kext driver and the box booted again, albeit with a grey screen again. Doh. I started trying to add more kexts and noticed a folder that wasn't a kext at all - AppleNDRV. In this folder were ATIRuntime bundles - neat. Adding that folder back to its original location gave me a booting mac complete with color.
After adding back all the other kexts, it would seem that the ATIRage128.kext module was responsible for the hang. Presumably, this also explains why the boot CD crashed, as I believe it's a stock driver. One also wonders how I'm getting color video with no kernel extension for the video hardware, but, whatever 🙂
The 10.3.4 update (which was applied a day or two before the box stopped working) included massive updates for ATI and NVidia chipsets. The checksum on the actual module matched that of another ATI equipped machine (which worked, by the way).
So, is it a hardware problem ? I dunno. When the 128 driver loads, it prints a kmessage saying it's unhappy about the firmware on the video board in the powerbook (a RageM3). Without the kext loadable, the box works perfectly, even for doing opaque drags and work in illustrator and photoshop. What was it doing in the first place ?
The point of all of this is - are mac users really supposed to be able to fix this themselves ? I lived and breathed unix all through high school and college, and it took me the better part of 10 hours to get this machine booting multi-user with working audio and so on. Will the next system update drop a new ATIRage128.kext into the default place, perhaps causing this problem again ?
It seems that the process of debugging boot failures ought to be a bit easier than this. I run my windows machines with /SOS and the other nice switches that give you textual boot progress, and OS X supports “verbose boot” (hold down Command-V during boot). But that was no good in this case, because the crash didn't occur until the video hardware tried to do something clever. Merely loading the kernel module for the ATIRage128 didn't cause the crash; only starting the window server with this module also loaded triggered the hang.
What are the odds that Apple's paid tech support would have resolved this without a format ? (or at all?)