Flying with the BSOP (Black Screen of Penguin)

Derek Snyder and I are flying from Hong Kong to Vancouver, Canada on Cathay Pacific: nice airline, great service, comfortable seats, and an in-flight entertainment system that crashes at the drop of a hat. The interesting thing is that the entertainment system is running Linux (see below, spot the penguin in the top left corner of the image).

During the 12-hour flight I was hoping to watch “March of the Penguins” but had to put up with watching “Reboot of the Penguins” instead – the flight attendants mentioned that this is not an isolated incident.

Reliability is key for embedded systems, which are often expected to run for days, weeks, months or even years without being shut down or rebooted; it’s only when things go wrong that we get to see what’s under the covers. Also interesting was the amount of time needed to reboot the Linux-based entertainment system: it was a good few minutes from reset to fully running.

What’s really interesting is how Linux fans tout the reliability and robustness of the operating system, and then we get to see things like this…

– Mike

Comments (30)

  1. Newtronic says:

    I saw the same thing on Virgin Atlantic.  I wish I had timed it, cause I think it was close to 10 minutes.

  2. tarball says:

    It is actually the application that has crashed, not the operating system.  Although, any fault like this in an embedded system is inexcusable!

  3. explorer.exe says:

    In the screenshot you can see that the component that failed was SVGALib.  

    If this had been a Microsoft app a comparable application flaw probably would have taken out the operating system, as that functionality is at the system level in Windows products.

    Can’t fault the author for trusting his instincts!

  4. mikehall says:

    Even if it is the application that crashed, the system was in a totally unusable state – if this device was living in a remote/isolated location it would now be unusable, and a technician would probably need to be sent out to fix or reboot it.

    Not a good situation to be in, and certainly doesn’t do anything for the "Linux is robust/reliable" myth.

    – Mike

  5. Michel Verhagen says:

    NZ Air is using a system running Windows CE. That system is just as slow as the other ones using Linux and stops working just as often, but that’s just because the choices made for the in-flight entertainment system are really poor:

    – using a serial connection to download the image (but later a streaming network connection for watching live CNN feeds is available!)

    – using a browser application for the navigation stuff (and probably implementing the browser application without being restricted by any knowledge)

    We’d love to design an in-flight system on Windows CE that doesn’t annoy the hell out of people by crashing and being dead slow. I know we can do much better than what’s available now!

    Michel Verhagen

    Embedded Fusion

  6. Vuemme says:

    I think that this fault shows that in an embedded system every component should be robust and reliable, since any fault could make the device unusable.

    I also note that whoever implemented that system simply doesn’t care about faults!

    What’s the point of putting up a message requesting to press the "enter" key on a (I suppose) keyboard-less device? And what can the user do with the console? I may try to debug the issue and rebuild svgalib… but it’s not the average usage for this kind of device!

    A good device should have logged the exception (and maybe allowed someone to download it and send it to your office after the plane has landed) and restarted itself!

    The user will be annoyed but, at least, he will not need to ask for support to be able to watch a movie!

    I have heard many people say something like: it’s a driver/application fault, the kernel is still perfectly alive and working (and not only about Linux), but from a user’s point of view that means absolutely nothing.

    It’s like having your car keys made of such poor material that they break each time you try to start your car, and having the mechanic tell you "it’s those stupid keys’ fault, the car is still perfectly able to start!".

    Doesn’t that sound silly?
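
    The log-and-restart behaviour suggested above (record the fault, then bring the application back up without asking the passenger for anything) can be sketched as a small supervisor process. This is a minimal Python sketch under stated assumptions, not what the in-flight system actually runs: `supervise`, `DEMO_COMMAND` and the restart limit are hypothetical names and values chosen for illustration.

```python
import logging
import subprocess
import sys
import time

# Hypothetical stand-in for the crashing UI process: a child that exits with
# a non-zero status every time it is started.
DEMO_COMMAND = [sys.executable, "-c", "import sys; sys.exit(3)"]


def supervise(command, max_restarts=3, backoff_seconds=0.0, log=None):
    """Run `command`, logging each failure and restarting it.

    Stops when the child exits cleanly or after `max_restarts` restarts.
    Returns the exit codes observed, one per run.
    """
    log = log or logging.getLogger("supervisor")
    exit_codes = []
    for attempt in range(1, max_restarts + 2):
        code = subprocess.run(command).returncode
        exit_codes.append(code)
        if code == 0:
            log.info("run %d exited cleanly", attempt)
            break
        # On POSIX a negative code means the child was killed by a signal,
        # e.g. -11 for a segmentation fault.
        log.error("run %d failed with exit code %d", attempt, code)
        if attempt <= max_restarts:
            time.sleep(backoff_seconds)
    return exit_codes
```

    Calling `supervise(DEMO_COMMAND, max_restarts=2)` runs the failing child three times and logs each failure. On a real device this job is often handed to the init system, which can respawn a service itself.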

  7. patbateman says:

    I was at MS Developer Days here in Germany 2 months ago. A co-worker of mine still uses the HTC Himalaya. While we were listening to a session he decided that he would like to check his mail.

    GPRS->VPN and connect to Exchange. He had not synced his device for days, so there was a large amount of mail, and he always downloads the attachments. But he ran out of memory on the device and that was the end for the OS.

    It was not responding to anything. Even a soft reset did not work. So he did a hard reset and lost all his data. The only meaningful app he could use after that was bubbles…

  8. I love how most of the people responding to this post are adamant that CE is worse still and that the situation on the plane wasn’t a Linux problem after all.  Isn’t it interesting to see how Linux advocates react to plain truths such as the one you present (that these supposed facts about stability, security, etc. are not necessarily always facts)?  Perhaps there is too much passion for a particular platform rather than a passion for computing in general.

    To those who constantly battle…stop, step back, and breathe.

  9. MYoung says:

    I am sure, it was a windows programmer… hehehe.

    I really like linux and I am for it. But you brought a good point. Love for computing.

    What makes me think is how an OS like GNU/Linux can even be comparable with one from a company that has really deep pockets.

    My experience is that Linux is much harder to set up, but it runs after that. Windows is easier, but as reliable as GNU/Linux (in some cases).

    How is that happening?


  10. Gursharan says:

    The signal received, which is shown towards the end, is a segmentation fault, which can be caused by:

    1. Faulty memory management by the OS, per se, making a location read-only where the app attempted to write.

    2. Faulty app development, that tried to write to a place not meant for it.

    3. Some memory overflow that "stole" the memory space for SVGAlib et al.

    So, either it’s a problem with the OS mem manager, or the application itself, or the right application on the wrong OS :D…

  11. Yan says:

    There are two system design issues with this.

    1. Verbose logging should not be shown to the end user. Blank screen or image should be used.

    2. Hardware watchdog should trigger a restart.
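
    The watchdog idea in point 2 works like this: the application must "kick" a timer at regular intervals, and if it stops doing so (because it has crashed or hung) the timer expires and forces a restart. A real hardware watchdog does this in silicon; the following is only a software model of the logic, a sketch in Python in which the `Watchdog` class, its method names and the injected clock are all assumptions made for illustration.

```python
import time


class Watchdog:
    """Software model of a watchdog timer.

    `on_expire` is called once if `kick()` has not been called within
    `timeout_seconds`. `clock` is injectable so the logic can be tested
    without waiting in real time.
    """

    def __init__(self, timeout_seconds, on_expire, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.on_expire = on_expire
        self.clock = clock
        self.last_kick = clock()
        self.fired = False

    def kick(self):
        """Called by a healthy application to show it is still alive."""
        self.last_kick = self.clock()

    def poll(self):
        """Check the timer; trigger the expiry action at most once."""
        elapsed = self.clock() - self.last_kick
        if not self.fired and elapsed > self.timeout_seconds:
            self.fired = True
            self.on_expire()
        return self.fired
```

    In a real device `on_expire` would be a hard reset performed by the hardware itself, so that it still happens when the software is too broken to ask for one.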


  12. mikehall says:

    I agree with both points – embedded systems, especially those built for consumers should completely hide the underlying technology. The user shouldn’t know there’s a ‘computing’ device living under the covers.

    But still, in this case the embedded Linux system was completely unreliable – I don’t care whether this was an application, driver or kernel – the end result was that I was sat in front of a screen that I couldn’t do anything with – the flight crew needed to hard-reset the unit to get it working again.

    – Mike

  13. EnviroTO says:

    I don’t think you can blame Linux for this one.  Whoever sold this device to Cathay Pacific is to blame.  The directory paths look out of whack for video drivers, and it bombed running engine.cram/airsurf, which isn’t an open source application.  Linux is at the root a text mode operating system with graphical components on top.  In this case it seems possible that airsurf delivers the graphical interface and it is what had the problem.

    The same application on Windows seems to have issues.

  14. mikehall says:

    Hmm, running on Windows 95… I don’t want this thread to degenerate into Windows vs. Linux – I’m simply pointing out that in this case the in flight system was running Linux and was completely unreliable (you cannot dispute the fact that the "system" had crashed and was in an unusable state) – when looking at the reliability of embedded systems you should consider the entire system as a whole.

    – Mike

  15. As an update to the "Flying with Unreliable Penguins" post I thought it would be useful to point you…

  16. EnviroTO says:

    I’m only responding to your comment: "What’s really interesting is how Linux fans tout the reliability and robustness of the operating system".

    It is unfair to characterise this as an operating system problem.  It is definitely a problem with the device as a whole but not an operating system problem.  If some third party writes a program on top of the operating system which provides the entire graphical interface and doesn’t configure the system to automatically reboot, that is a problem with the third party.  It’s not as if Linux can’t handle rebooting, and writing a cron job to monitor the health of the software on the system is quite simple.  If you wrote airsurf for Windows CE and don’t write the application to use the Watchdog Timer API properly or perhaps accidentally use the WDOG_NO_DFLT_ACTION option, it isn’t going to reboot, and that isn’t a WinCE problem, it’s a programmer problem.
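
    The cron-based health monitor mentioned above could look roughly like the following. This is a hedged Python sketch, not anything taken from the device: it assumes the application touches a heartbeat file while it is running, and `is_healthy`, `check_and_recover` and the `recover` callback are hypothetical names. Cron would simply run a script that calls `check_and_recover` every minute or so.

```python
import os
import time


def is_healthy(heartbeat_path, max_age_seconds, now=None):
    """Return True if the heartbeat file exists and was modified recently."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(heartbeat_path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return age <= max_age_seconds


def check_and_recover(heartbeat_path, max_age_seconds, recover, now=None):
    """Run `recover()` (e.g. restart the app or reboot) if the app looks dead.

    Returns True when the application was healthy and nothing was done.
    """
    if is_healthy(heartbeat_path, max_age_seconds, now=now):
        return True
    recover()
    return False
```

    What `recover` does is a design choice: restarting only the application is faster, while a full reboot also clears faults in drivers and other services.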

  17. EnviroTO says:

    One more point… an embedded device is the sum of its parts and if one part has a problem it seriously limits the usefulness of the whole device.  If the airsurf application has some flaw a reboot did not properly resolve or which occurs frequently the device is just as useless constantly rebooting.

  18. Vuemme says:

    I agree with your observation EnviroTO.

    I think that many people see the OS (Windows CE, Linux, VxWorks or anything else, including their own custom made and almost perfect OS) as the component that should guarantee stability and reliability for the whole system, and that’s simply not true.

    You should have a stable OS (and I think that both Windows CE and Linux are stable, though many people think that Windows CE is simply a "rebranded" Windows 95, as many of those comments imply…) but this will guarantee stability only for well-written applications! The only thing an OS can do is protect other applications from malfunctioning ones… but if the malfunctioning app is the main user interface of the system or something that is critical to the device’s main purpose, that doesn’t help much.

    I understand that an automatic reboot is not the solution, but it’s a way to make the problem less complicated for the end user and to lower the amount of work needed for on-board maintenance. If the flight attendants are continuously resetting the entertainment system they will have less time to care for people, which is their main job.

    Rebooting after a critical failure is a must (with detailed error logging) for any device that should run unattended or with untrained users (like passengers on a plane) and since the flight attendants are usually not able to diagnose the problem or fix it, rebooting automatically will simply save them some work.

    I hope that the faulty device saves some information about the fault and that this will allow someone to fix it someday!

    But I can’t understand why these kinds of problems cannot be detected during testing on the ground, instead of using passengers as beta testers and making their travel even more boring (exactly the opposite of what an in-flight entertainment system should do!).

    Another thing that happens often is seeing monitors at the terminals showing GPFs or other kinds of faults (I saw Linux terminals and also MS-DOS ones!). Fixing this kind of issue takes so few lines of code that I can’t understand why someone would leave them out of their systems. I can understand a hardware failure (disk not found!), but I can’t understand why someone doesn’t care about their own software faults, or at least try to hide them from the users!

    I hope that this is because the aviation industry puts all its effort into the stability of critical equipment, and that the team who tested the in-flight entertainment system is not the same one who tested the other kinds of systems on the same kind of airplanes!

  19. collaborateur VDU says:

    I’ve tried both of the OSes (Windows and Linux), and in many different versions (Win95, Win98, Win2000, WinXP, Mandrake 2006, Fedora Core 5, and Ubuntu 6).

    All I can say is : before spitting on Linux, you gotta try it at least one time.

    I’ll explain my point :

    Windows is great (from Windows 2000 onwards only, of course; before that there were only beta versions, and very expensive ones for beta versions). I’ve got one thing to complain about in WinXP: the network management, which is disastrous (long, long waits for Network Neighbourhood, and really often).

    Linux WAS not great. Today, some distributions are still not great, because they are too complicated. But some are really good: Ubuntu, Debian and Mandriva are some good examples. And the end user is not confronted with complicated configuration of the computer anymore.

    Windows is let down by its security management (with the WinXP firewall, viruses pass through ???). Viruses are only a myth on Linux, I’ve never seen one.

    My current Linux distribution is Ubuntu 6. Its boot time is 28 seconds (on a Centrino 1.6 GHz, 512 MB of RAM), and my office computer’s boot time is 22 seconds (WinXP, P4 2.8 GHz, 2048 MB of RAM). Yeah, really, Linux has a boot time issue…

    After two years of Linux usage, I’ve got nothing to complain about, and my current computer is recent (one year old).

    The BSOP you’ve captured is for sure from an old computer and an old distribution of Linux.

    Really, I must say, before talking about the "Black screen of the penguin", haven’t you tried to figure out why Windows’s is called "of death"?

    Sincere regards,

    A former Windows user

  20. mikehall says:

    collaborateur VDU,

    I consider the in flight system to be an "embedded system", this isn’t a general purpose computing device (I can’t install new applications or modify the behavior of the system), you quote 28 seconds for your linux distribution to boot – the Linux based in flight "Embedded System" took several minutes to be ready to use after reboot, not a great customer/consumer experience.

    Compare this to a Windows XP Embedded SP2 based system using HORM (Hibernate Once, Resume Many) where you can see boot times in the < 10 second range, or Windows CE where, with a custom shell I have boot times in a few seconds from cold boot.

    Again, reliability should be considered to be the system as a whole, not just the o/s, drivers, services or applications.

    – Mike

  21. EnviroTO says:

    There is no point talking about P4s, Centrinos, out of the box distributions, Windows XP, server or desktop OSs, etc. when we are talking about a stripped down Linux OS with all the overhead removed (hopefully the provider did this) which is running on something equivalent to or less than a 266MHz RISC microprocessor with 256MB of RAM.  There are many embedded Linux devices which boot in 2 or 3 seconds.  I’m not suggesting that this particular piece of hardware should come up in 3 seconds, since it is a networked device providing many terminals, but a GUI hiding the internals should be up almost instantly and the device should be usable in less than 20 seconds.  Lay the blame where it belongs… on the provider of the device.  Didn’t compile the OS optimally… device provider’s fault, created software that bombs too frequently… device provider’s fault, didn’t ensure routines to handle errors and monitor health of the device to trigger issue logging and reboot… device provider’s fault, didn’t have hardware scaled to meet the requirements of the OS as configured and software solution provided… device provider’s fault.  The Linux community and Microsoft can only be blamed for errors or inefficiencies in their own code, not the quality of an entire embedded device.

  22. Vuemme says:

    I think that what’s shameful is that this kind of system is considered ready to be used in the field.

    I know that no device is perfect and no software can be guaranteed to be bug free, but in this case the bugs seem too easy to reproduce to go unnoticed by even very basic testing (the fact that the flight attendants know how to reboot the device proves that!).

    I suppose that airlines spend a lot of money on this kind of system, and having something that runs so poorly does them a lot of damage.

  23. mikehall says:

    EnviroTO – "The Linux community and Microsoft can only be blamed for errors or inefficiencies in their own code, not the quality of an entire embedded device."

    I completely agree with this statement – all too often the blame for application/driver crashes is pointed at the o/s vendor. The O/S should provide mechanisms to trap these issues (watchdog timers, secure CRT, process isolation etc…) which the device provider should make use of.

    Seeing an o/s vendor/logo on a crash screen certainly doesn’t help with the perception of reliability of the underlying operating system.

    – Mike

  24. Mahesh says:


    You were lucky that only the entertainment system of Cathay was on Linux. Whether it was the application, the kernel or whatsoever, we would have missed your blogs if it had been the auto-pilot instead of the entertainment system.

  25. Riku Voipio says:

    This is truly embarrassing engineering on so many levels.

    I doubt the same engineers would manage to make a stable Windows CE entertainment box either, even if they had a "create in-flight entertainment system" wizard 😛

    Lets look at the mistakes we can see:

    1) Showing the system console to the end user is bad (as mentioned already)

    2) The crash is in SVGALIB. This is the equivalent of WinG for Linux. It’s broken by design, outdated and unportable. There is no point in using it in 2006 anymore.

    3) Did I mention unportable? Thus it must mean that the entertainment system is x86 based. Yay for power management and weight. Of course irrelevant for airlines but..

    4) The system is started by a shell script with "set -x". Such a setting has no place in a production system.

    5) set -x reveals a typo. "extort LD_LIBRARY_PATH" does nothing. Nevertheless setting LD_LIBRARY_PATH in a shell script only slows down the startup process, there are much better ways to do it.

    6) No automated recovery. "Press Enter to activate console"?? Truly WTF. System V init has provided service respawning before the time I started wetting my pants.

    And this was revealed by just one screenful 🙁

    Open source is no guarantee of high quality. In fact the quality of open source components varies a lot; svgalib happens to be at the crappy end.
