XML: Getting dirty with BMX

Years ago there was a time when I thought XML was evil.  That was back when the whole idea was pretty new to my ears.  The only introduction I’d had to it was a few examples being shown around by Adam Bosworth when he’d come to our group and try to convince us to redo ADO and SQL Server to support XML as a transport.  Of course, I thought it was a really bad idea at the time.  After all, we had just gone through a lot of pain enabling Unicode in the transport layer and we were getting a lot of flak for it.  No one liked that it took more bandwidth.


But Adam persisted, and I became notorious.  He would show up at a meeting trying to convince our dev manager or product unit manager and see me there along with everyone else.  “Oh, it’s you again,” he would snarl at me as he entered the room. 


Adam would remind us of all the benefits of XML, how it was unstructured but discoverable.  At the time I wondered how this was really any different than HL-7, but mostly I was skeptical about how we would benefit by replacing a binary format that we owned on both ends.  Still, I was an engineer, and engineers tend to try to solve problems.  So I took what Adam described as the strong points of XML and was hell-bent on developing an equivalent binary format that worked instead.


I put a lot of effort into trying to make certain that it could both capture hierarchy and be fully discoverable and even contain tags, but at the same time, if your data was highly regular then the additional bandwidth cost would be nominal.  When I presented this gem to Adam and his team it was immediately shot down.  Apparently, I didn’t get it.  XML was text.  Andrew Layman assured us that the W3C or someone was busily working on an official binary format, and that it was a heated debate.  We certainly shouldn’t waste any effort duplicating that or trying to invent our own proprietary form of XML.


So I put the spec aside and forgot about it.  Eventually I even lost the files due to a mishap, so no record was left that I had even attempted it except for the idea still floating around in my head.  Of course, a few years later, a strange twist of fate had me developing parts of the XML frameworks for .NET.  By that time, I had changed my mind.  I had forgotten about the bandwidth problem, as I just didn’t care about it anymore.  Pipes got fatter.  Size was not really an issue.


Yet, then came the long death-march, the stabilization period for version 1.0 of the frameworks.  We were fixing bugs like crazy and doing tons of perf work.  We met the goals we were shooting for, but I always thought we could do more.  At this time I started up an internal effort called ‘TenX’.  TenX was focused on taking performance to the next level.  The idea was to throw out any pre-conceived notions about how to solve the particular problems that plagued us.  The goal was to do miraculous things that would improve performance by an order of magnitude.  Through this effort I encouraged most of our top developers to think outside the box and find new solutions.  Maybe the solutions would only solve edge cases, but even edge cases are useful to someone. 


One idea I had was to improve the speed of the XML parser.  Sure, I made many attempts at re-writing the parser itself, and one day I’ll even share a faster parser with you.  But I also had this crazy idea that if the XML was already mostly parsed, it would scream.  You see, I knew that much of the cost of parsing came from doing various validations and fumbling over characters one by one.  If I could throw out most of that, maybe I could double or triple the speed of reading it in.  So I went back to the old binary XML idea.  If I could invent a format that was easier to parse than XML, it would be a big win.  


So I revived BMX and went to work making a prototype that I could put to the test.  What I came up with was very different than the original.  It is actually just a tokenized stream, because decoding tokens is extremely fast and why mess with the text itself?  The big news is that BMX-encoded files read faster than XML text, approximately 10 times faster!
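
To make the idea of a tokenized stream concrete, here is a minimal sketch in Python.  It is not the BMX format itself, which lives in the downloadable file; the token codes, the length-prefixed string layout, and the `encode` and `decode` names are all assumptions made up for illustration.  The point is only that the reader switches on a one-byte token and copies length-prefixed strings instead of scanning and validating characters one by one.  It makes no claim about speed.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Hypothetical one-byte token codes; NOT the actual BMX token set.
TOK_START, TOK_ATTR, TOK_TEXT, TOK_END = 1, 2, 3, 4


def _write_str(out, s):
    """Write a string as a 4-byte little-endian length plus UTF-8 bytes."""
    data = s.encode("utf-8")
    out.write(struct.pack("<I", len(data)))
    out.write(data)


def _read_str(buf):
    """Read a string written by _write_str."""
    (n,) = struct.unpack("<I", buf.read(4))
    return buf.read(n).decode("utf-8")


def encode(xml_text):
    """Parse XML text once and emit an illustrative token stream."""
    out = io.BytesIO()

    def walk(elem):
        out.write(bytes([TOK_START]))
        _write_str(out, elem.tag)
        for name, value in elem.attrib.items():
            out.write(bytes([TOK_ATTR]))
            _write_str(out, name)
            _write_str(out, value)
        if elem.text:
            out.write(bytes([TOK_TEXT]))
            _write_str(out, elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text that follows the child's end tag
                out.write(bytes([TOK_TEXT]))
                _write_str(out, child.tail)
        out.write(bytes([TOK_END]))

    walk(ET.fromstring(xml_text))
    return out.getvalue()


def decode(data):
    """Yield events by dispatching on token bytes; no character scanning."""
    buf = io.BytesIO(data)
    while True:
        tok = buf.read(1)
        if not tok:
            return
        code = tok[0]
        if code == TOK_START:
            yield ("start", _read_str(buf))
        elif code == TOK_ATTR:
            name = _read_str(buf)
            yield ("attr", name, _read_str(buf))
        elif code == TOK_TEXT:
            yield ("text", _read_str(buf))
        elif code == TOK_END:
            yield ("end",)
        else:
            raise ValueError(f"unknown token {code}")
```

Calling `list(decode(encode('<a x="1"><b>hi</b>t</a>')))` yields a flat sequence of start, attribute, text, and end events, much like what a pull-style reader would hand back.  A real format would likely also intern repeated names into a table so that regular data costs very little extra, but that is left out of this sketch.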


I’m happy to be able to share this with everyone.  Feel free to criticize my programming skills if you’d like.  There is extra stuff in there that I was experimenting with but not entirely germane.  There are also many functions that are stubbed off with not-implemented-yet exceptions.


Take a look at the code.  It's here all crammed into a single file for your downloading pleasure.  All the comments with foul language have been removed.





Comments (8)

  1. stu says:

    Your link to the file results in a "You are not authorized to view this page" message 😉

  2. Matt says:

    It’s working now. I moved it to Yahoo. 🙂

  3. T says:

    I have a problem with this statement:

    "I had forgotten about the bandwidth problem, as I just didn’t care about it anymore. Pipes got fatter. Size was not really an issue."

    Are developers really getting that lazy? These sorts of optimizations should be #1.

  4. Matt says:

    As it turns out, we never did change the format to XML so there was no issue. My statement was just meant to reflect the general sentiment of the XML community. Most XML fanatics are so focused on the benefits of text that cost issues are secondary.

  5. "Are developers really getting that lazy"

    I’ve always been that lazy…

  6. Gabe A says:

    Matt, this is a great idea and should be the pillar of Longhorn. I’m just now starting to learn XAML, but I do have experience with the MSXML parser and can’t help but feel that when I start developing apps utilizing XAML pictures, controls, etc., parsing time will become an issue sooner or later. Why not embrace the idea of an optimized binary linked to markup? Heck, the Longhorn OS could even automagically add/read this enhanced binary data to a CDATA section on the file itself.

    What do you think?


  7. Matt says:

    This would work as part of the OS as long as you shipped an editor for binary XML, as well as the APIs for reading/writing. You could even shim these guys into the normal ‘text’ stacks so you could auto-detect binary files and serve them up as fake text files so downlevel apps would be none the wiser.
