Writing a Unicode file via perl ...

06/07/2006

Several months ago someone filed bugs across
Windows Vista to make sure all performance monitoring .ini files were
Unicode, so the files could be properly localized ("translated") to
various languages (so we could have Korean, or Hindi descriptions). A noble goal to be sure.

For most people this was as easy as checking out the file for editing,
opening it in Notepad, doing a "Save As", and picking Unicode. ESE
however has A LOT of perf counters (esp. when you Squeaky Lobster a
machine - more on that later), so we use a perl script to generate
several parts of the performance monitoring files: the .ini, the .hxx,
and some fairly repetitive .cxx code that gets compiled into the ESE binaries ... I know some of you are saying,
"You can use perl on Windows?" ...

Anyway, I looked all over the internet and couldn't find help, even
when I scoped the search to Mr. Unicode's blog ... Then I posted to an
internal perl alias at Microsoft, and someone came to my rescue. Since he
said he didn't mind, and I couldn't find this on the internet (at least at
the time), I figured I'd post his comments ...

I included his comments wholesale in our perl code
(I had to read them twice to really grok how the ":raw"-style parts work
like a pipeline of text-conversion commands, but in reverse, so you read
them right to left):

#
# Some notes from someone smarter than me about Perl and Unicode ...
# ----
#
# Which encoding do you want to use? UTF16-LE is the standard on Windows (nearly
# all characters are encoded as 2 bytes), UTF8 is the standard everywhere else 
# (characters are variable length and all ASCII characters are a single byte).
#
# Here's what I've figured out after lots of experimentation. To get UTF16-LE 
# output you need to play a few games with perl...
#
#   open my $FH, ">:raw:encoding(UTF16-LE):crlf:utf8", "e:\\test.txt";
#   print $FH "\x{FEFF}";
#   print $FH "hello unicode world!\nThis is a test.\n";
#   close $FH;
# Reading the IO layers from right to left (the order that they will be applied 
# as they pass from perl to the file) ...
#
# Apply the :utf8 layer first. This doesn't do much except tell perl that we're 
# going to pass "characters" to this file handle instead of bytes so that it 
# doesn't give us "Wide character in print ..." warnings.
#
# Next, apply the :crlf layer as text goes from perl out to the file. This 
# transforms \n (0x0A) into \r\n (0x0D 0x0A) giving you DOS line endings. Perl 
# normally applies this by default on Windows but it would do it at the wrong 
# stage of the pipeline so we removed it (see below), this is where it ought to 
# be.
#
# Next apply the UTF16-LE (little endian) encoding. This takes the characters 
# and transforms them to that encoding. So 0x0A turns into 0x0A 0x00. Note that 
# if you just say 'UTF16' the default endianness is big endian which is 
# backwards from how Windows likes it. However, because we're explicitly 
# specifying the endianness perl will not write a BOM (byte order mark) at the 
# beginning of the file. We have to make up for that later.
#
# Finally, the :raw pseudo layer just removes the default (on Windows) :crlf 
# layer that transforms \n into \r\n for DOS style line endings. This is 
# necessary because otherwise it would be applied at the wrong place in the 
# pipeline. Without this the encoding layer would turn 0x0A into 0x0A 0x00 and 
# then the crlf layer would turn that into 0x0D 0x0A 0x00 and that's just goofy.
#
# Now that we've got the file opened with the right IO layers in place we can 
# almost write to it. First we need to manually write the BOM that will tell 
# readers of this file what endianness it is in. That's what the 
# print $FH "\x{FEFF}" does.
#
# Finally we can just print text out.
#
# If you want UTF8, I'm pretty sure it's a lot easier. This is also a lot 
# easier on unix. The CRLF ordering problem is definitely a bug, but the default 
# to big endian (and the ensuing games to get the BOM to output without a 
# warning) is by design. I'm pretty sure that none of the core perl maintainers 
# use perl on Windows (even though at least one keeps perl on VMS working...).
#

#
# Until Exchange decides it wants a Unicode eseperf.ini, we're going to generate
# the old ASCII one.  Also if Exchange wants one, it will have to update its
# version of Perl to understand the open modes we're using below.  Currently we
# get this error:
#   1>Unknown open() mode '>:raw:encoding(UTF16-LE):crlf:utf8' at .\perfdata.pl line 325,  line 6189.
#


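if ( $ESENT ) { #ifdef ESENT

    # Windows build: UTF-16LE output with CRLF line endings (this open mode fails on perl 5.6.1).
    open( INIFILE, ">:raw:encoding(UTF16-LE):crlf:utf8", "$INIFILE" ) || die "Cannot open $INIFILE: $!";
    print INIFILE "\x{FEFF}";  # print BOM (Byte Order Mark) for the unicode file

} else { #else

    # Exchange build: keep generating the old ASCII file.
    open( INIFILE, ">$INIFILE" ) || die "Cannot open $INIFILE: $!";

} #endif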
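
To see what the layer ordering actually puts on disk, here is a minimal sketch that writes the same short string through two layer stacks and dumps the resulting bytes. The helper name `bytes_written`, the temp file, and the sample text "hi\n" are made up for illustration and are not part of the ESE script. The sketch spells the encoding `UTF-16LE`, which is the canonical Encode name, rather than the `UTF16-LE` spelling used in the script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a BOM plus $text through the given PerlIO layer
# string, then read the file back as raw bytes and return them as hex.
sub bytes_written {
    my ($layers, $text) = @_;

    # Create a scratch file; it is removed when the program exits.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open(my $out, ">$layers", $path) or die "Cannot open $path: $!";
    print {$out} "\x{FEFF}", $text;    # BOM first, then the text
    close($out) or die "Cannot close $path: $!";

    # :raw on the read side so no layer rewrites the bytes we inspect.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;                          # slurp the whole file
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# :crlf sits above the encoding, so \r\n is inserted before encoding.
my $good = bytes_written(':raw:encoding(UTF-16LE):crlf:utf8', "hi\n");

# :crlf sits below the encoding, so it rewrites already-encoded bytes.
my $bad = bytes_written(':raw:crlf:encoding(UTF-16LE)', "hi\n");

print "crlf above the encoding: $good\n";
print "crlf below the encoding: $bad\n";
```

If the layers behave the way the comments above describe, the first line should end in 0D 00 0A 00 (a proper UTF-16LE CRLF) and the second in 0D 0A 00, which is the "goofy" sequence that the :raw trick is there to avoid.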

The ESENT branch worked like a charm: yay, a Unicode esentprf.ini! Well,
until I sync'd the code to
Exchange. Then it broke, and that is the source of the "if ( $ESENT )", which
is only defined when we build the ESE sources for Windows. I
should mention in closing that I know this code works for perl 5.8.7,
and I know it does not work for perl 5.6.1. I've heard perl's Unicode support got much better in 5.8 or so...
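
As an alternative to keying off a build flag, the open mode could be picked from the running perl's version. This is only a sketch, and it rests on an assumption: that the failure on 5.6.1 comes purely from the missing PerlIO layer support, which arrived in perl 5.8.0. The helper name `ini_open_mode` is hypothetical and does not appear in the ESE script.

```perl
use strict;
use warnings;

# Hypothetical helper: choose an open() mode from a numeric perl version
# such as $] (for example 5.008007 for perl 5.8.7).
sub ini_open_mode {
    my ($perl_version) = @_;

    # Assumption: anything from 5.8.0 on understands the layered mode;
    # older perls fall back to a plain write, i.e. the old ASCII file.
    return $perl_version >= 5.008
        ? '>:raw:encoding(UTF-16LE):crlf:utf8'
        : '>';
}

my $mode = ini_open_mode($]);
print "perl $] would open the .ini with mode '$mode'\n";
```

A caller would also need to print the BOM only when the layered mode was chosen, since the plain mode has no encoding layer to turn "\x{FEFF}" into the right bytes.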

Oh I guess that's code, so I'm required to say something like:

Use of included script samples is subject to the terms specified at
https://www.microsoft.com/info/cpyright.htm (I'm having a hard time
imagining how such a small snippet could be subject to that, but
whatever).

Oh here is what the BOM is, and more on the BOM.

Update 2006/08/20: Turned out Exchange wanted a Unicode
eseperf.ini after all, and has updated their version of perl, so, good
news, the NT and Ex code bases grow that much more similar.
