Extracting images from .docx files

I mentioned previously how we frequently need to extract images from articles submitted to MSDN Magazine as .docx files.  Doing this manually is very straightforward: change the file's .docx extension to .zip, open the compressed folder in Windows Explorer, browse to the /word/media folder, and copy out the images.  That said, doing this over and over just begs for a tool, and luckily, writing one with the classes provided in the .NET Framework 3.0 takes only a few minutes (and downloading the already compiled tool I wrote is even quicker ;)

We've covered Office Open XML several times now in MSDN Magazine (Office Space: Building Office Open XML Files, Server-Side Generation of Word 2007 Docs, Setting Word Document Properties the Office 2007 Way) and will continue to in the future. These files can be accessed programmatically using the Packaging APIs in the .NET Framework 3.0, and that's exactly how our tool to extract images from .docx files works:

using System;
using System.IO;
using System.IO.Packaging;
using System.Collections.Generic;

class ExtractDocxImages
{
static void Main(string[] documentPaths)
{
foreach (string path in documentPaths)
{
using (Package package = Package.Open(
path, FileMode.Open, FileAccess.Read))
{
PackagePart docXmlPart = GetDocumentXmlPart(package);
if (docXmlPart == null) continue;

DirectoryInfo targetDir = Directory.CreateDirectory(
path + " Images");

Console.WriteLine("Extracting images from {0} to {1}...",
path, targetDir.FullName);
foreach (PackagePart imagePart in GetImageParts(docXmlPart))
{
string targetPath = Path.Combine(targetDir.FullName,
Path.GetFileName(imagePart.Uri.ToString()));
using (Stream source = imagePart.GetStream())
using (Stream destination = File.OpenWrite(targetPath))
{
CopyStream(source, destination);
}
Console.WriteLine("Extracted {0}", targetPath);
}
}
}

        Console.WriteLine("Done.");
}
...
}

The tool expects paths to .docx files to be passed to it on the command-line (which also means that we can just drag-and-drop .docx files onto the app in Windows Explorer). For each path, it uses the Package.Open static method to create a System.IO.Packaging.Package object that represents the .docx. With the Package in hand, it then finds the main document.xml file within the package using a static helper method:

private static PackagePart GetDocumentXmlPart(Package package)
{
foreach (PackageRelationship docRel in package.GetRelationshipsByType(
@"https://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"))
{
return package.GetPart(PackUriHelper.ResolvePartUri(
new Uri("/", UriKind.Relative), docRel.TargetUri));
}
return null;
}

With the System.IO.Packaging.PackagePart object representing the document.xml in hand, it looks at the relationships for this part to find all of the referenced images, again using a static helper method:

private static IEnumerable<PackagePart> GetImageParts(PackagePart docXmlPart)
{
foreach (PackageRelationship imageRel in
docXmlPart.GetRelationshipsByType(
@"https://schemas.openxmlformats.org/officeDocument/2006/relationships/image"))
{
yield return docXmlPart.Package.GetPart(PackUriHelper.ResolvePartUri(
new Uri(docXmlPart.Uri.ToString(), UriKind.Relative),
imageRel.TargetUri));
}
}

For each image's PackagePart, it simply grabs the associated Stream and copies it to a new file on disk for this image (in a directory created based on the name of the .docx file):

private static void CopyStream(Stream source, Stream destination)
{
byte[] buffer = new byte[0x1000];
int read;
while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
{
destination.Write(buffer, 0, read);
}
}

And that's all there is to it; a very simple, but very useful tool for extracting images from .docx files.

-Stephen