Wednesday, January 9, 2008

BOM threats got you down? Let's explore these pesky Unicode prefix codes

Captain: What happen ?
Mechanic: Somebody set up us the bomb.
Operator: We get signal.
Captain: What!
Operator: Main screen turn on.
Captain: It's you!!
CATS: How are you gentlemen!!
CATS: All your base are belong to us.

That is of course the famous intro from an old 16-bit console game... and if you find that topic more interesting than this, here's a link for your pleasure (The Presbyopian aims to please).

Still here? Good, maybe we can help each other. I want to feel like someone can benefit from the time I just wasted figuring this out and writing this blog post, and you want to know what the hell the BOM is and what it has to do with the weird bytes at the start of some of your text files.

I just recovered from a serious BOM attack on my code files, and the worst part was that the mad bomber terrorist was me. Unwittingly, of course... but that doesn't fix my code...

By now some of you have realized that BOM refers to the byte order mark: the single character U+FEFF, which can show up as the first few bytes of a Unicode text file (two bytes in UTF-16, three in UTF-8, four in UTF-32). The Unicode folks have a great FAQ written up on this beastie, if you know what to look for:

Unicode BOM FAQ

Finally, a decent answer to the age-old question: "Waiter, why are there extra bytes in my text file?"

The official explanation from the FAQ on what a BOM is goes like this:
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]
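To make that a little more concrete, here is a quick Python sketch (my own illustration, not something taken from the FAQ) that shows what U+FEFF turns into under each encoding form. The helper name bom_bytes is made up for this post.

```python
# Sketch: the bytes that U+FEFF becomes in each Unicode encoding form.
# bom_bytes is a hypothetical helper name, not a standard function.

def bom_bytes(encoding: str) -> bytes:
    # Use the explicit -le / -be codecs: Python's plain "utf-16" and
    # "utf-32" codecs prepend a BOM of their own when encoding.
    return "\ufeff".encode(encoding)

for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    # bytes.hex(" ") with a separator needs Python 3.8 or newer.
    print(f"{name:<10} {bom_bytes(name).hex(' ')}")

# utf-8      ef bb bf
# utf-16-le  ff fe
# utf-16-be  fe ff
# utf-32-le  ff fe 00 00
# utf-32-be  00 00 fe ff
```

Those byte patterns are the "weird bytes" you see when you open a BOM-prefixed file in a hex viewer.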
I'm sure some readers have not encountered this before - so let me just share a few Presbyopian notations in this regard:

  • - ASCII text files do not have this marker, so don't put it in unless you are saving the file in a Unicode format.
  • - The BOM is usually only inserted in text files to help the consumer determine the format of the file's text. You generally don't see it in binary files, as the format is usually already known to the consumer.
  • - You can't rely on a file having a BOM even if the file is written in Unicode. It is the "proper" thing to do for portability, but we all know how well good coding styles are enforced in the real world. :)
  • - Culprits on the Win32 platform tend to be Notepad and WordPad, which will keep the header if it was already present, or insert it if you save the file in a Unicode format.
  • - If you are not careful, you can end up breaking text parsing when you write out files during batch processing. Example: PowerShell writes Unicode (UTF-16) by default when it saves text, but cmd.exe writes text files in ANSI format (single-byte, not Unicode at all). This can lead to annoying problems if you then mix formats across the files you work with. I wrote this post because I just got through fixing exactly this problem: I had modified a number of header files in a PowerShell script and specified UTF-8 format... leading to a three-byte mystery prefix (EF BB BF) that was the BOM header.
  • - If the header is present then the FAQ above will show you how to parse it, but otherwise you have to look at the file yourself to figure out what the format is. As a simple example, if you see that every other byte is a zero then you know you're probably looking at UTF-16, but there is a reason the Unicode standard is as thick as a phonebook.
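The last two notes can be sketched in a few lines of Python. This is only a rough illustration under my own assumptions: the function names detect_bom and guess_utf16_without_bom, the 512-byte sample size, and the 0.9 threshold are all invented for this post, and the zero-byte trick only works for mostly-ASCII text.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE). Note the inherent ambiguity: FF FE 00 00
# could also be a UTF-16 LE BOM followed by a NUL character.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def guess_utf16_without_bom(data: bytes, threshold: float = 0.9):
    """Crude heuristic: mostly-ASCII UTF-16 has a zero in every other byte."""
    sample = data[:512]
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_zero = sum(1 for i in range(0, pairs * 2, 2) if sample[i] == 0)
    odd_zero = sum(1 for i in range(1, pairs * 2, 2) if sample[i] == 0)
    if odd_zero / pairs >= threshold and even_zero / pairs < 1 - threshold:
        return "utf-16-le"   # low byte first: 68 00 69 00
    if even_zero / pairs >= threshold and odd_zero / pairs < 1 - threshold:
        return "utf-16-be"   # high byte first: 00 68 00 69
    return None
```

For example, detect_bom(b"\xef\xbb\xbfhello") returns ("utf-8", 3), while plain ASCII bytes return (None, 0). Treat the heuristic as a hint only; text full of non-Latin characters will sail right past it.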
There are more fun quirks, but the gist of the issue is that you need to account for the BOM in your coding adventures now and then. This little blog entry should be all the ammo you need to handle it like a pro.
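For the specific mess I made (UTF-8 files that picked up a BOM), a cleanup along these lines would do the job. Again, this is a hedged sketch rather than the exact fix I ran: strip_utf8_bom and read_text_ignoring_bom are names I invented here, and the code assumes the files really are UTF-8.

```python
import codecs
from pathlib import Path

def strip_utf8_bom(path) -> bool:
    """Remove a leading UTF-8 BOM in place. Returns True if one was removed."""
    p = Path(path)
    raw = p.read_bytes()
    if not raw.startswith(codecs.BOM_UTF8):
        return False          # nothing to do, leave the file untouched
    p.write_bytes(raw[len(codecs.BOM_UTF8):])
    return True

def read_text_ignoring_bom(path) -> str:
    # Python's "utf-8-sig" codec skips a leading BOM if present and
    # otherwise decodes as ordinary UTF-8.
    return Path(path).read_text(encoding="utf-8-sig")
```

If you only need to read the files and not repair them, the second function alone is enough, since it tolerates files with or without the prefix.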

Later, youngsters...

-The Presbyopian Coder