Sunday, March 2, 2008

Cloud computing on the horizon?

More and more you see references to a big push into cloud computing. According to those who watch for the next big trends in computing, cloud computing is poised to revolutionize the entire internet.

To be sure, it has already had a significant impact inside corporations with large data-processing needs. SETI@home and, more recently, Folding@home have broadened public awareness of the possibilities, but a general computing platform, where CPU cycles are a virtualized resource for sale to the public the same way megabytes of disk space are, has yet to find its champion for the masses.

Perhaps Microsoft will change all that. Certainly Amazon's S3 and similar virtualization technologies have shown the potential market for such services, but it remains to be seen how well they can be sold to the average user, and how willing users might be to offer up their own computer cycles to the "cloud" in exchange for some impressive computing horsepower when they happen to need it. How much monetization can be done in this realm? And how powerfully will open source respond to try to ensure that the cloud with the most horsepower is free and open source instead of closed and for-profit?

Interesting times are ahead; of that, at least, there can be little doubt.

-Michael


Rough Type: Nicholas Carr's Blog: Rumor: Microsoft set for vast data-center push

Wednesday, January 9, 2008

BOM threats got you down? Let's explore these pesky Unicode prefix codes

Captain: What happen ?
Mechanic: Somebody set up us the bomb.
Operator: We get signal.
Captain: What!
Operator: Main screen turn on.
Captain: It's you!!
CATS: How are you gentlemen!!
CATS: All your base are belong to us.

That is of course the famous intro from an old 16-bit console game... and if you find that topic more interesting than this, here's a link for your pleasure (The Presbyopian aims to please).

Still here? Good, maybe we can help each other. I want to feel like someone can benefit from the time I just wasted figuring this out and writing this blog post, and you want to know what the hell the BOM is and what it has to do with the weird bytes at the start of some of your text files.

I just recovered from a serious BOM attack on my code files, and the worst part was that the mad bomber terrorist was me. Unwittingly, of course... but that doesn't fix my code...

By now some have realized that BOM refers to the byte order mark that can show up as the very first "character" in a Unicode file (two bytes in UTF-16, three bytes in UTF-8). The Unicode folks have a great FAQ written up on this beastie, if you know what to look for:

Unicode BOM FAQ

Finally, a decent answer to the age-old question: "Waiter, why are there extra bytes in my text file?"

The official explanation from the FAQ on what a BOM is goes like this:
A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]
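
To make that a little more concrete, here is a small Python sketch (my own illustration, not something taken from the FAQ) that prints the bytes U+FEFF turns into under the common Unicode encodings. The names `BOM`, `ENCODINGS`, and `bom_bytes` are made up for this example.

```python
# U+FEFF is a single character; the bytes it becomes depend on the encoding.
BOM = "\ufeff"

# Encoding names as understood by Python's built-in codecs.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Map each encoding name to the raw bytes of its BOM.
bom_bytes = {name: BOM.encode(name) for name in ENCODINGS}

for name, raw in bom_bytes.items():
    # bytes.hex() with a separator needs Python 3.8 or newer.
    print(f"{name:10} {raw.hex(' ')}")
```

The takeaway is that the same character is two bytes in UTF-16, three in UTF-8, and four in UTF-32, and that the two byte orders of UTF-16 and UTF-32 are mirror images of each other.
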
I'm sure some readers have not encountered this before - so let me just share a few Presbyopian notations in this regard:


  • ASCII text files do not have this marker, so don't put it in unless you are saving the file in a Unicode format.
  • The BOM is usually only inserted in text files to help the consumer determine the format of the file's text. You generally don't see it in binary files, as the format is usually already known to the consumer.
  • You can't rely on a file having a BOM even if the file is written in Unicode. It is the "proper" thing to do for portability, but we all know how well good coding styles are enforced in the real world. :)
  • Culprits on the Win32 platform tend to be Notepad and WordPad, which will insert the header if it was already present, or if you save the file out in Unicode format.
  • If you are not careful, you can end up breaking text parsing when you write out files during batch processing. Example: PowerShell uses Unicode as its default format, but cmd.exe writes text files in ANSI format (single-byte, not Unicode at all). This can lead to annoying problems if you are then mixing formats for the files you work with. I wrote this post because I just got through fixing this problem: I had modified a number of header files in a PowerShell script and specified UTF-8 format... leading to a three-byte mystery prefix (EF BB BF) that was the BOM header.
  • If the header is present then the FAQ above will show you how to parse it, but otherwise you have to look at the file yourself to figure out what the format is. If you see that every other byte is a zero then you know you're probably looking at Unicode (UTF-16, most likely), as a simple example, but there is a reason the Unicode standard is as thick as a phonebook.
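
For the curious, here is a rough Python sketch of the kind of check described in the notes above: look at the first few bytes, decide whether a BOM is present, and strip it if you don't want it. This is my own illustration under a few assumptions, not code from the FAQ. `detect_bom` and `strip_bom` are hypothetical helper names; the `codecs.BOM_*` constants are from Python's standard library.

```python
import codecs

# Check the longest signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
# Caveat: a UTF-16 LE file whose first real character is U+0000 would be
# misread as UTF-32 LE. That is rare in text files, but it is a guess.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def strip_bom(data: bytes) -> bytes:
    """Return the data without its leading BOM, if it has one."""
    _, length = detect_bom(data)
    return data[length:]


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"#pragma once\n"
    print(detect_bom(sample))
    print(strip_bom(sample))
```

To clean up a batch of files like my header files, you would read each one in binary mode, run it through something like `strip_bom`, and write the result back out. If you just want to read a UTF-8 file that may or may not have a BOM, Python's built-in "utf-8-sig" codec skips the marker for you.
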
There are more fun quirks, but the gist of the issue is that you need to account for the BOM in your coding adventures now and then. This little blog entry should be all the ammo you need to handle it like a pro.

Later, youngsters...

-The Presbyopian Coder