Extracting images using the .NET Packaging APIs

I posted yesterday about a simple tool you can write to extract images from .docx files.  That code relies on the package layout specific to Word documents: it expects certain parts to be in the package and certain relationships to exist between them.  We can write a more general tool that extracts all images from any package file simply by looping through all of the parts in the package, finding those whose content type is an image MIME type (one starting with "image/"), and copying them out.

The following code does exactly that.  Just put it into a .cs file and compile it; you'll need the .NET Framework 3.0, and since System.IO.Packaging lives in WindowsBase.dll, you'll need to reference that assembly when compiling.  Then you can drag and drop one or more package files onto the .exe, and it'll extract all of the images from each one into a new folder next to the original, named after the file with " Images" appended.

Since it works with any package, you can use this with .docx, .pptx, .xlsx, .xps, and so on.  And with the Microsoft XPS Document Writer printer driver built into Windows Vista, you can print other file formats, like .pdf, to .xps files, and then run those .xps files through this tool to extract images... pretty cool.

using System;
using System.IO;
using System.IO.Packaging;
using System.Text;

class ExtractPackagedImages
{
    static void Main(string[] paths)
    {
        foreach (string path in paths)
        {
            using (Package package = Package.Open(
                path, FileMode.Open, FileAccess.Read))
            {
                DirectoryInfo dir = Directory.CreateDirectory(path + " Images");
                foreach (PackagePart part in package.GetParts())
                {
                    if (part.ContentType.StartsWith("image/", StringComparison.OrdinalIgnoreCase))
                    {
                        string target = Path.Combine(
                            dir.FullName, CreateFilenameFromUri(part.Uri));
                        using (Stream source = part.GetStream(
                            FileMode.Open, FileAccess.Read))
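                        // File.Create truncates an existing file; File.OpenWrite would
                        // leave stale trailing bytes if an older, longer file existed.
                        using (Stream destination = File.Create(target))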
                        {
                            byte[] buffer = new byte[0x1000];
                            int read;
                            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
                            {
                                destination.Write(buffer, 0, read);
                            }
                        }
                        Console.WriteLine("Extracted {0}", target);
                    }
                }
            }
        }
        Console.WriteLine("Done");
    }

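    // Flattens a part URI such as /word/media/image1.png into a single file
    // name by replacing every character that is invalid in a file name
    // (including '/') with '_'.
    private static string CreateFilenameFromUri(Uri uri)
    {
        char[] invalidChars = Path.GetInvalidFileNameChars();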
        StringBuilder sb = new StringBuilder(uri.OriginalString.Length);
        foreach (char c in uri.OriginalString)
        {
            sb.Append(Array.IndexOf(invalidChars, c) < 0 ? c : '_');
        }
        return sb.ToString();
    }
}
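
If you want to see the content-type filter and the filename flattening in isolation, without a real document on hand, here is a minimal sketch that builds a tiny package in memory and runs the same checks against it.  The class and helper names (PackageImageDemo, ListImageFileNames, AddPart, Flatten) and the two part names are made up for illustration, and the parts hold a single placeholder byte rather than real image data.  I'm assuming the same WindowsBase.dll reference as above; on newer .NET versions, as far as I know, the same types come from the System.IO.Packaging NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text;

static class PackageImageDemo
{
    // Hypothetical helper: builds a two-part package in memory and returns
    // the flattened file names of the parts whose content type is an image.
    public static List<string> ListImageFileNames()
    {
        List<string> names = new List<string>();
        using (MemoryStream stream = new MemoryStream())
        using (Package package = Package.Open(
            stream, FileMode.Create, FileAccess.ReadWrite))
        {
            // Made-up part names, shaped like the ones Word uses.
            AddPart(package, "/word/document.xml", "application/xml");
            AddPart(package, "/word/media/image1.png", "image/png");

            foreach (PackagePart part in package.GetParts())
            {
                // Same test as the tool: keep only image/* content types.
                if (part.ContentType.StartsWith(
                    "image/", StringComparison.OrdinalIgnoreCase))
                {
                    names.Add(Flatten(part.Uri));
                }
            }
        }
        return names;
    }

    // Creates a part and writes one placeholder byte (not real image data).
    private static void AddPart(Package package, string partName, string contentType)
    {
        PackagePart part = package.CreatePart(
            new Uri(partName, UriKind.Relative), contentType);
        using (Stream s = part.GetStream())
        {
            s.WriteByte(0);
        }
    }

    // Same idea as CreateFilenameFromUri: replace characters that are not
    // valid in a file name (such as '/') with '_'.
    private static string Flatten(Uri uri)
    {
        char[] invalidChars = Path.GetInvalidFileNameChars();
        StringBuilder sb = new StringBuilder(uri.OriginalString.Length);
        foreach (char c in uri.OriginalString)
        {
            sb.Append(Array.IndexOf(invalidChars, c) < 0 ? c : '_');
        }
        return sb.ToString();
    }

    static void Main()
    {
        foreach (string name in ListImageFileNames())
        {
            Console.WriteLine(name);
        }
    }
}
```

Only the image part should make it through the filter, and its name comes out flattened, with '_' in place of each '/'.  That is why the extracted files all land side by side in one folder instead of recreating the package's internal directory structure.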

Enjoy!

-Stephen

ExtractPackagedImages.zip