Writing a Unicode file via perl …


Several months ago someone filed bugs across
Windows Vista to make sure all performance monitoring .ini files were
Unicode, so the files could be properly localized (“translated”) to
various languages (so we could have, say, Korean or Hindi descriptions).  A noble goal to be sure.



For most people this was as easy as checking out the file for editing,
opening it in Notepad, doing a “Save As”, and picking Unicode.  ESE,
however, has a LOT of perf counters (esp. when you Squeaky Lobster a
machine – more on that later), so we use a perl script to generate
several parts of the performance monitoring files: the .ini, the .hxx,
and some fairly repetitive .cxx code that gets compiled into the ESE binaries … I know some of you are saying,
“You can use perl on Windows?” …



Anyway, I looked all over the internet and couldn’t find help, even
when I scoped the search to Mr. Unicode’s blog … then I posted to an internal
alias on perl at Microsoft, and someone came to my rescue.  Since he
said he didn’t mind, and since I couldn’t find it on the internet (at least at
the time), I figured I’d post his comments …



I included his comments wholesale in our perl code
(I had to read them twice to really grok how the “:raw”-type parts act
like a pipeline of text-converting commands, but in reverse, so you read
them right to left):


#
# Some notes from someone smarter than me about Perl and Unicode ...
# ----
#
# Which encoding do you want to use? UTF16-LE is the standard on Windows (nearly
# all characters are encoded as 2 bytes), UTF8 is the standard everywhere else
# (characters are variable length and all ASCII characters are a single byte).
#
# Here's what I've figured out after lots of experimentation. To get UTF16-LE
# output you need to play a few games with perl...
#
# open my $FH, ">:raw:encoding(UTF16-LE):crlf:utf8", "e:\\test.txt";
# print $FH "\x{FEFF}";
# print $FH "hello unicode world!\nThis is a test.\n";
# close $FH;
#
# Reading the IO layers from right to left (the order in which they are applied
# as text passes from perl to the file) ...
#
# Apply the :utf8 layer first. This doesn't do much except tell perl that we're
# going to pass "characters" to this file handle instead of bytes so that it
# doesn't give us "Wide character in print ..." warnings.
#
# Next, apply the :crlf layer as text goes from perl out to the file. This
# transforms \n (0x0A) into \r\n (0x0D 0x0A) giving you DOS line endings. Perl
# normally applies this by default on Windows but it would do it at the wrong
# stage of the pipeline so we removed it (see below), this is where it ought to
# be.
#
# Next apply the UTF16-LE (little endian) encoding. This takes the characters
# and transforms them to that encoding. So 0x0A turns into 0x0A 0x00. Note that
# if you just say 'UTF16' the default endianness is big endian which is
# backwards from how Windows likes it. However, because we're explicitly
# specifying the endianness perl will not write a BOM (byte order mark) at the
# beginning of the file. We have to make up for that later.
#
# Finally, the :raw pseudo layer just removes the default (on Windows) :crlf
# layer that transforms \n into \r\n for DOS style line endings. This is
# necessary because otherwise it would be applied at the wrong place in the
# pipeline. Without this the encoding layer would turn 0x0A into 0x0A 0x00 and
# then the crlf layer would turn that into 0x0D 0x0A 0x00 and that's just goofy.
#
# Now that we've got the file opened with the right IO layers in place we can
# almost write to it. First we need to manually write the BOM that will tell
# readers of this file what endianness it is in. That's what the
# print $FH "\x{FEFF}" does.
#
# Finally we can just print text out.
#
# If you want UTF8, I'm pretty sure it's a lot easier. Also, this is a lot
# easier on unix. The CRLF ordering problem is definitely a bug, but the default
# to big endian (and ensuing games to get the BOM to output without a warning)
# is by design. I'm pretty sure that none of the core perl maintainers use perl
# on Windows (even though at least one keeps perl on VMS working...).
#

#
# Until Exchange decides it wants a Unicode eseperf.ini, we're going to generate
# the old ASCII one. Also if Exchange wants one, it will have to update its
# version of Perl to understand the open modes we're using below. Currently we
# get this error:
# 1>Unknown open() mode '>:raw:encoding(UTF16-LE):crlf:utf8' at .\perfdata.pl line 325, line 6189.
#


if ( $ESENT ){ #ifdef ESENT

open( INIFILE, ">:raw:encoding(UTF16-LE):crlf:utf8", $INIFILE ) || die "Cannot open $INIFILE: $!";
print INIFILE "\x{FEFF}"; # print BOM (Byte Order Mark) for the unicode file

} else { #else

open( INIFILE, ">$INIFILE" ) || die "Cannot open $INIFILE: $!";

} #endif
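
For anyone who wants to see the layer ordering in action, here is a small self-contained sketch. It is not part of perfdata.pl: the helper name hex_bytes_written, the temp file, and the sample text are all made up for illustration, and I use the canonical "UTF-16LE" spelling of the encoding name. It writes a short string through two different layer stacks and dumps the resulting bytes, so you can compare the order described in the notes above against a stack where :crlf sits below the encoding layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original script): write $text through
# the given PerlIO layer string, then read the file back as raw bytes and
# return them as space-separated hex.
sub hex_bytes_written {
    my ($layers, $text) = @_;

    # Create a scratch file that is deleted when the program exits.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close($tmp_fh);

    open(my $out, $layers, $path) or die "Cannot open $path: $!";
    print {$out} $text;
    close($out) or die "Cannot close $path: $!";

    # Read back with :raw so no layer touches the bytes on the way in.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    my $bytes = do { local $/; <$in> };
    close($in);

    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Layer order from the notes above: BOM first, then the text.
my $good = hex_bytes_written('>:raw:encoding(UTF-16LE):crlf:utf8', "\x{FEFF}a\n");

# :crlf placed *below* the encoding layer, so it sees already-encoded bytes.
my $bad = hex_bytes_written('>:raw:crlf:encoding(UTF-16LE)', "a\n");

print "good: $good\n";
print "bad:  $bad\n";
```

If the layers behave the way the notes describe, the "good" line should read FF FE 61 00 0D 00 0A 00 (BOM, "a", CR, LF, each as two bytes), while the "bad" line should read 61 00 0D 0A 00, with a lone 0x0D wedged into the middle of the encoded newline.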



The code worked like a charm: yay, a Unicode esentprf.ini.  Well, it worked
until I sync’d the code to
Exchange; then it broke.  That is the source of the “if ( $ESENT )”, which
is only defined when we build the ESE sources for Windows.  I
should mention in closing that I know this code works for perl 5.8.7,
and I know it does not work for perl 5.6.1.  I’ve heard perl’s Unicode support got much better in 5.8 or so…
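
A different way to handle the old interpreter would be to key off the perl version instead of the build flavor. This is only a sketch of that idea, not what our script does: the helper name ini_open_mode is hypothetical, and the 5.008 cutoff is my assumption based on the two versions I tested (5.8.7 works, 5.6.1 does not) plus the fact that PerlIO layers arrived with perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical helper (not in perfdata.pl): choose the open() mode from
# the interpreter version. $] holds the version as a number, e.g.
# 5.008007 for perl 5.8.7 and 5.006001 for perl 5.6.1.
sub ini_open_mode {
    my ($perl_version) = @_;

    # Assumption: any perl from 5.8 on understands the layered mode.
    return $perl_version >= 5.008
        ? '>:raw:encoding(UTF-16LE):crlf:utf8'   # Unicode .ini
        : '>';                                   # plain old ASCII .ini
}

my $mode = ini_open_mode($]);
print "perl $] would use open mode '$mode'\n";
```

The caller would still need to print the BOM only when the layered mode was chosen, the same way the “if ( $ESENT )” branch does.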



Oh I guess that’s code, so I’m required to say something like:

Use of included script samples is subject to the terms specified at
http://www.microsoft.com/info/cpyright.htm (I’m having a hard time
imagining how such a small snippet could be subject to that, but
whatever).


Oh here is what the BOM is, and more on the BOM.


Update 2006/08/20: It turned out Exchange wanted a Unicode
eseperf.ini after all, and has updated their version of perl, so good
news: the NT – Ex code bases grow that much more similar.

Comments (20)

  1. Anonymous says:

    this saved me a pile of time today as i was getting lost in the perl unicode documentation.

    a fellow MSFTer

  2. Anonymous says:

    Echo the "saving a ton of time" comment.  Thanks for documenting what should be in the standard perl docs.

  3. Anonymous says:

    Well, for UTF 8, you can do this:

    open FH, ">:utf8", "file";

    and it works fine.  

    I just wanted to make sure that’s here so people don’t slog along the hard road unless they really need 16LE!

  4. Anonymous says:

    what about file names? open FH, ">:utf8", "file"; is storing my files fine in utf8 but the damn filenames are turned into gobbledegook if the file name is utf8. Anyone know how to solve that?

  5. Anonymous says:

    Was tearing out what remains of my hair for several hours. Found this article, problem solved in seconds. Wonderfully useful – Thank You!!!

  6. Anonymous says:

    Very useful, thanks a lot! I especially liked that the snippet has all those explanations!

  7. Anonymous says:

    That should be "UTF-16LE" not "UTF16-LE".  Surprised it worked as listed, "like a charm"!

  8. Anonymous says:

    Do you see any chance to open files with a name/path that can only be represented in unicode?

    The only way I found is using the Win32 APIs CreateFileW, but it is not compatible with the Perl open() api and therefore would require a major platform dependent rewrite of existing I/O code.

    Any clues would be much appreciated!

  11. Anonymous says:

    wow, thank you for this… my issue is over…

  12. Anonymous says:

    The perl code is executed in RHEL and generated the text file. In RHEL, in vi editor it is showing properly. But if i copy the same text file to Windows XP SP3 then it shows the junk characters for every '\n' like this:

    ਍吀栀椀猀 椀猀 愀 琀攀猀琀ഀ

    I copied the same text file in Windows XP sp2, it is showing properly.

    The perl code i used is:

    #!/usr/bin/perl

    use PerlIO::encoding;

    #open(FH, ">:raw:encoding(UTF16-LE):crlf:utf8","test.txt") or die "Could not open file for writing $!\n";

    print FH "\x{FEFF}";

    $i=0;

    while($i<100)
    {
           print FH "This is a test\n";
           $i++;
    }

    close FH;

    Can someone please throw light on this?

    thanks in Advance,

    Bhaskar

  13. Anonymous says:

    Not sure if you gave File::BOM a try too. 🙂

  14. Anonymous says:

    Thank you for making this so accessible.  You saved me a bunch of time, and I'm not even using any Microsoft technologies (I'm on FreeBSD).  I owe you a beer.

  15. Anonymous says:

    How can we create files with names that contain Unicode characters?