Full Text Indexing Terabytes of Files with SQL Server and Cloud Storage

Author: Darko Sancanin, Nicholas Dritsas

Reviewers and contributors: Lubor Kollar, Stuart Ozer, Michael Thomassy

Business Case

There are currently over 50 million files (over 10 Terabytes of data) that the customer is migrating into a custom project management application (for international customers) that can be accessed via the application and can be searched upon. SQL Server 2008 full text indexing is used to index the content of these files which gives them rich searching capabilities within their application.

The initial solution was to store the files within SQL Server but maintaining Terabytes of distributed data proved to be unsustainable both from a performance, maintenance and storage cost point of view.

It was decided to move the files into cloud storage, this allowed to drastically reduce the hosting costs, reduce the load on the servers from storing and serving out this data, increase the redundancy plan by easily replicating the data across cloud instances and to increase the speed at which users can access their files by allowing them to select which region to serve the data from based on the users geographic location.

One of the major challenges faced with this approach was maintaining the full text index on these files while moving the files out of the local file system and into the cloud.

Technical Background

Originally the files were stored as part of the database using file stream pointers with an incremental full text index over these files and file metadata, such as file size and content encoding, stored in the database.

Moving these files out to the cloud meant that sql server had no way to access these files to index them, so it was decided to use the same underlying technology that SQL Server uses to index the files, IFilters. IFilters are a plug-in architecture implemented via COM components used by Windows applications to index files based on file format, they are used by Windows Desktop Search and SQL Server to extract the content from the files for indexing.

The process of storing a file in the application first starts with a user uploading the file to the servers, the software then loads the appropriate IFilters and extracts the text from the documents, concurrently create thumbnails for the file and upload the file and thumbnails to cloud storage. The text that was extracted from the file is then stored in the database along with metadata for the file and the URI to the file in cloud storage.

A slightly modified sample of the code that achieves this is shown below:

string fileExtension = Path.GetExtension(filePath);

var filterLoader = new FilterLoader();

var iFilter = filterLoader.LoadFilter(fileExtension);

var filterReader = new FilterReader(iFilter);

var fileTextContent = filterReader.ReadAllText();

var thumbnailCreatorFactory = new ThumbnailCreatorFactory();

var thumbnailCreator = thumbnailCreatorFactory.GetThumbnailCreator(fileExtension);

var thumbnail = thumbnailCreator.GetThumbnail(filePath);

SaveFileInformation(fileId, fileTextContent, thumbnail);

The path to the COM component that implements the IFilter interface is stored in the registry based on the file extension, the FilterLoader class traverses the registry to find the appropriate filter based on the file extension. The next step is to read the contents of the file using the methods of the IFilter interface.

To create the thumbnails we use a 3rd party component to generate thumbnails for the PDF documents, and for other document types we use the Windows API to get the shell to generate the thumbnails, finding the appropriate thumbnail creator class is abstracted behind the ThumbnailCreatorFactory. Then, once we have the text and thumbnail information. we save it to the database.

The schema for the table looks something like the following:

CREATE TABLE [dbo].[FileStorage](

[FileStorageId] [bigint] IDENTITY(1,1),

[FileName] [nvarchar](250) NOT NULL,

[FileExtension] [nvarchar](10) NULL,

[FileTextContent] [nvarchar](max) NULL,

[CloudPath] [nvarchar](max) NULL,

[CloudLargeThumbnailPath] [nvarchar](1000) NULL,

[CloudLargeThumbnailSize] [int] NULL,

[CloudSmallThumbnailPath] [nvarchar](1000) NULL,

[CloudSmallThumbnailSize] [int] NULL,

[Encoding] [nvarchar](max) NULL,

[Size] [bigint] NULL

)

An example of a subset of this data is attached.

A full text index is created over the FileTextContent field, which allows us to find particular files, display the file name and thumbnails to the user. If the user chooses to download the file, then the user is given a link to the cloud storage file location.

The contents of the file are still stored in the database so there is still some overhead with the size of the database, but we are seeing over 15 to 1 reduction in file storage necessary to store just the text from the documents in the database. This approach also gives the flexibility to increment or regenerate the full text index anytime that we want.

References

The IFilter interface on MSDN: https://msdn.microsoft.com/en-us/library/ms691105(VS.85).aspx

IFilter.org, which contains links to some IFilters that are not included with Windows, including the Office 2007 filters and PDF filters : https://www.ifilter.org

On MSDN code gallery there is an example of loading an IFilter and extracting the text using the filter in C# : https://code.msdn.microsoft.com/ifiltersample

table sample.gif