Sample: Parsing Content in C# Using IFilter

I’m working on a 3 (or 4) part tutorial right now that requires parsing PDF files.  The code got big enough that I decided to pull it out and turn it into a separate post that I can use in the series (stay tuned).

There are several solutions for reading through various file formats.  The IFilter interface was defined so that Windows search indexing can extract text from files, which makes it a good fit for this purpose.  There are lots of filter providers for various formats, including several from Microsoft.  If you want to parse PDF files you’ll need to have a PDF provider installed as well.  The FoxIt IFilter download page has a provider that, according to their website, is free for client use (my case).

In looking around for some sample code I found a few examples that did close to what I wanted, but I didn’t have a lot of luck finding a C# example.  I’ve pulled together various pieces of code to create a basic implementation for my (simple) needs.  You can find some interesting links here:

The sample contains a class library with the parsing code and a console application that can be used to exercise the library against files.  The code is built using a current internal build of VS2010 (stay tuned here for beta notice), but the key code (FilterCode.cs) should work fine on previous versions of VS and the .NET Framework.

I’ve uploaded the solution to the MSDN code gallery here:

To use the sample, include FilterCode.cs in your project, create a new instance of FilterLibrary.FilterCode, and call the GetTextFromDocument method against the file you want to parse.  If you have a filter installed for that document type, you will get back a StringBuilder with the text contents of the file.
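
As a rough sketch of that call (not the code from the sample itself), the helper below wraps an extractor in a file-exists check and converts the result to a string.  FilterUsageSketch and ExtractText are hypothetical names, and I’m assuming GetTextFromDocument takes a single file path string and returns a StringBuilder; check FilterCode.cs for the real signature.  The extractor is passed in as a delegate so the snippet compiles on its own, without FilterCode.cs.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the sample: names are for illustration only.
public static class FilterUsageSketch
{
    // Runs the supplied extractor over the file at 'path'.
    // Returns null when the file does not exist, and an empty string
    // when the extractor hands back no StringBuilder.
    public static string ExtractText(string path, Func<string, StringBuilder> extractor)
    {
        if (extractor == null)
            throw new ArgumentNullException("extractor");

        if (!File.Exists(path))
            return null;

        // ASSUMPTION: the extractor mirrors GetTextFromDocument(string path),
        // returning a StringBuilder with the text contents of the file.
        StringBuilder text = extractor(path);
        return text == null ? string.Empty : text.ToString();
    }
}
```

With FilterCode.cs included in the project, the call would look something like FilterUsageSketch.ExtractText(path, new FilterLibrary.FilterCode().GetTextFromDocument), again assuming that signature.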


Comments (7)

  1. Doug says:

    When I built I got an error on line 99 of FilterCode.cs that StringBuilder does not have a Clear() method. (VS2010 Beta 1)

  2. Jason Zander says:

    Sorry Doug, I inadvertently used a new API from Beta 2.  I just posted a new version that works against Beta 1.  The fix is simple:  construct the temporary StringBuilder inline rather than reuse the original.

  3. Greg says:

    Look at iTextSharp, a .NET PDF read/write library.  It’s a .NET port of the iText (Java) library.  It supports almost all of PDF except for JBIG2.

  4. Jasonz says:

    Greg – thanks for the pointer to iText.  In my case the IFilter solution will work against any underlying file format, so it keeps working if the document type is changed to DOC*, XPS, etc.

  5. Mr. Zander,

    When will we be getting VS 2010 Beta 2?  As a Microsoft Certified Partner, do we have any way of getting VS 2010 Beta 2 a bit earlier than the general public?

    We are required to build a WF, Silverlight 3.0, workflow/orchestration-based product, and instead of .NET Framework v3.5 we hope to take advantage of 2010 Beta 2 ASAP.


    Jason Huang

    Executive VP & Co-Founder

    enfoTech & Consulting Inc.

    (609) 896-9777 x108

  6. Jasonz says:

    Jason – We’re not ready to publish a release date for Beta 2 just yet, although we are making lots of great progress.  For projects with near term deliverables .NET Framework 3.5 is still the right target.  Stay tuned to this blog for more info…

  7. CraigD says:

    Hey Jason, I used IFilter to extract text from a few different file formats in this ‘basic’ ASP.NET search engine

    (although I eventually wrote a specific handler for PDF using iTextSharp).

    The IFilter stuff is based on this C# IFilter sample

    Not sure how it compares to yours…