Converting a text file from one encoding to another


The .NET framework handles file encodings very nicely.  Not too long ago, I needed to convert files from one encoding to another for a library that didn’t handle the encoding of the original file.  Since someone on an internal alias asked about doing this a couple of weeks ago, I thought it would be useful to post it here.


The .NET runtime uses Unicode (UTF-16) as the encoding for all strings.  The StreamReader and StreamWriter classes in System.IO take an Encoding as a parameter.  So, to convert from one encoding to another, we just need to read the file contents into a string using the original encoding and then write the string out in the desired encoding.
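
To make the decode-then-encode idea concrete, here is a minimal in-memory sketch.  The class and method names (EncodingRoundTrip, Reencode) are made up for illustration and are not part of the converter in this post; it only assumes that the "iso-8859-1" encoding name is available on your runtime.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Hypothetical helper: decode the bytes into a .NET string using the
    // source encoding, then encode that string with the destination encoding.
    public static byte[] Reencode(byte[] input, Encoding sourceEncoding, Encoding destEncoding)
    {
        string text = sourceEncoding.GetString(input);
        return destEncoding.GetBytes(text);
    }

    public static void Main()
    {
        // "café" in ISO-8859-1: the "é" is the single byte 0xE9.
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };

        byte[] utf8 = Reencode(latin1, Encoding.GetEncoding("iso-8859-1"), Encoding.UTF8);

        // In UTF-8 the "é" becomes the two bytes 0xC3 0xA9.
        Console.WriteLine(BitConverter.ToString(utf8)); // prints 63-61-66-C3-A9
    }
}
```

The file-based version does the same thing, except that StreamReader and StreamWriter do the decoding and encoding for us as the data streams through.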


The Path class, also in System.IO, provides us with an easy way to create temporary files in the Windows temporary directory.  We can write the results to a temporary file so that if anything goes wrong, the destination file is not overwritten.  Also, it allows the conversion to work when the source and destination are the same file.
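
As a rough sketch of the temporary-file pattern on its own, with the encoding work stripped out: the names SafeWrite and WriteViaTempFile and the demo file name are invented for this example, and it assumes the destination directory already exists.

```csharp
using System;
using System.IO;

public static class SafeWrite
{
    // Hypothetical helper: write to a temporary file first and only replace
    // the destination once the write has completed successfully.
    public static void WriteViaTempFile(string destPath, string contents)
    {
        // GetTempFileName creates an empty, uniquely named file and returns its path.
        string tempName = Path.GetTempFileName();
        try
        {
            File.WriteAllText(tempName, contents);
            File.Delete(destPath);           // does nothing if destPath is missing
            File.Move(tempName, destPath);
        }
        finally
        {
            // Cleans up after a failure; does nothing once the move has succeeded.
            File.Delete(tempName);
        }
    }

    public static void Main()
    {
        string dest = Path.Combine(Path.GetTempPath(), "safe-write-demo.txt");
        WriteViaTempFile(dest, "hello");
        Console.WriteLine(File.ReadAllText(dest)); // prints hello
        File.Delete(dest);
    }
}
```

If the write throws partway through, the destination is never touched and the finally block removes the half-written temporary file.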


StreamReader allows us to read the source file in blocks so that we don’t have any size limitations on the files that we need to convert.
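
Here is a small sketch of block-wise copying in isolation, using a StringReader and a tiny buffer so the blocks are easy to see.  The BlockCopy class and CopyInBlocks method are illustrative names only; the loop has the same shape as the one a file converter would use with a much larger buffer.

```csharp
using System;
using System.IO;

public static class BlockCopy
{
    // Hypothetical helper: copies everything from reader to writer, at most
    // bufferSize characters at a time, and returns the number of blocks copied.
    public static int CopyInBlocks(TextReader reader, TextWriter writer, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        int blocks = 0;
        int charsRead;

        // ReadBlock returns 0 once the end of the input has been reached.
        while ((charsRead = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, charsRead);
            blocks++;
        }
        return blocks;
    }

    public static void Main()
    {
        StringWriter output = new StringWriter();
        int blocks = CopyInBlocks(new StringReader("0123456789"), output, 4);
        Console.WriteLine("{0} blocks: {1}", blocks, output); // prints 3 blocks: 0123456789
    }
}
```

Only one buffer's worth of characters is held in memory at a time, which is why the size of the file doesn't matter.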


The Main() method below is just a trivial wrapper to call ConvertFileEncoding(), since it wasn’t originally a standalone app.


// Example: convert test.cs test-conv.cs ascii utf-8

using System;
using System.IO;
using System.Text;

public class Convert
{
    public static void Main(String[] args)
    {
        // Print a simple usage statement if the number of arguments is incorrect.
        if (args.Length != 4)
        {
            Console.WriteLine("Usage: {0} inputFile outputFile inputEncoding outputEncoding",
                              Path.GetFileName(Environment.GetCommandLineArgs()[0]));
            Environment.Exit(1);
        }

        ConvertFileEncoding(args[0], args[1], Encoding.GetEncoding(args[2]),
                            Encoding.GetEncoding(args[3]));
    }

    /// <summary>
    /// Converts a file from one encoding to another.
    /// </summary>
    /// <param name="sourcePath">the file to convert</param>
    /// <param name="destPath">the destination for the converted file</param>
    /// <param name="sourceEncoding">the original file encoding</param>
    /// <param name="destEncoding">the encoding to which the contents should be converted</param>
    public static void ConvertFileEncoding(String sourcePath, String destPath,
                                           Encoding sourceEncoding, Encoding destEncoding)
    {
        // If the destination’s parent doesn’t exist, create it.
        String parent = Path.GetDirectoryName(Path.GetFullPath(destPath));
        if (!Directory.Exists(parent))
        {
            Directory.CreateDirectory(parent);
        }

        // If the source and destination encodings are the same, just copy the file.
        // Equals() compares the encodings by value, so two separately created
        // instances of the same encoding are also treated as equal.
        if (sourceEncoding.Equals(destEncoding))
        {
            File.Copy(sourcePath, destPath, true);
            return;
        }

        // Convert the file.
        String tempName = null;
        try
        {
            tempName = Path.GetTempFileName();
            using (StreamReader sr = new StreamReader(sourcePath, sourceEncoding, false))
            {
                using (StreamWriter sw = new StreamWriter(tempName, false, destEncoding))
                {
                    // Copy the contents one block of characters at a time.
                    int charsRead;
                    char[] buffer = new char[128 * 1024];
                    while ((charsRead = sr.ReadBlock(buffer, 0, buffer.Length)) > 0)
                    {
                        sw.Write(buffer, 0, charsRead);
                    }
                }
            }

            // Replace the destination only after the conversion has succeeded.
            File.Delete(destPath);
            File.Move(tempName, destPath);
        }
        finally
        {
            // tempName is still null if GetTempFileName() itself failed, and
            // File.Delete(null) would throw and hide the original exception.
            if (tempName != null)
            {
                File.Delete(tempName);
            }
        }
    }
}

Comments (3)

  1. asdf says:

    man… this post is somehow screwing up all the posts below yours on Weblogs@asp.net… please fix..

  2. Buck Hodges says:

    Sorry about that — apparently the tags in the doc comments appeared both in HTML-encoded and raw forms.

  3. Ramon Smits says:

    Nice post!

    I wrote almost the same code some months ago. I just supply a file mask (which could also be *.cs) and a recursive option.

    I made it because VS.NET 2003 has the weird behaviour of saving utf-8 files without a file signature. You *always* have to save them through the advanced save options. This way we sign all *.aspx pages so ASP.NET doesn’t screw up our output.