I’m working on a 3 (or 4) part tutorial right now that requires parsing PDF files. The code grew large enough that I decided to pull it out into its own post, which I can then use in the series (stay tuned).
There are several solutions for reading through various file formats. One of them is the IFilter interface, which was defined to help Windows do search indexing on files. There are lots of filter providers for various formats, including several from Microsoft. If you want to parse PDF files you’ll need to have a provider installed for that format as well. The FoxIt IFilter download page has a provider that, according to their website, is free for client use (my case).
In looking around for some sample code I found a few examples that did close to what I wanted but didn’t have a lot of luck finding a C# example. I’ve pulled together various pieces of code to create a basic implementation for my (simple) needs. You can find some interesting links here:
- MSDN Documentation for IFilter
- Codeplex IFilter sample code in C++ (under MS-PL)
- P/Invoke definitions for IFilter and related members from www.pinvoke.net
The sample contains a class library for parsing files and a console application that can be used to exercise the library against them. The code was built using a current internal build of VS2010 (stay tuned here for the beta notice), but the key code (FilterCode.cs) should work fine on previous versions of VS and the .NET Framework.
I’ve uploaded the solution to the MSDN code gallery here:
To use the sample, include FilterCode.cs in your project, create a new instance of FilterLibrary.FilterCode, and call the GetTextFromDocument method against the file you want to parse. If you have a filter installed for that document type, you will get back a StringBuilder with the text contents of the file.
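As a rough illustration of those steps, here is a minimal console sketch. It assumes FilterCode.cs is already part of the project, that FilterCode has a parameterless constructor, and that GetTextFromDocument takes the file path as a string; those details are my guesses at the shape of the call rather than something spelled out above, so adjust them to match the actual source.

```csharp
using System;
using System.Text;
using FilterLibrary; // namespace from FilterCode.cs in the sample

class Program
{
    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: FilterDemo <path-to-document>");
            return;
        }

        // Assumption: parameterless constructor.
        FilterCode filter = new FilterCode();

        // Assumption: the method accepts a file path string.
        // It only produces text if an IFilter provider is installed
        // for this document type (for example, a PDF filter).
        StringBuilder text = filter.GetTextFromDocument(args[0]);

        Console.WriteLine(text.ToString());
    }
}
```

Running this against a PDF with no PDF filter provider installed will not yield any text, so install a provider first if that is the format you care about.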