SQL Server 2012 FILETABLE for managing data in scientific, heterogeneous, idiosyncratic computing environments

The FILETABLE feature, available in SQL Server 2012, solves a problem that has bedeviled our research group for years. It is especially useful for people who combine SQL Server for data management with other tools (e.g., MATLAB, Python, R, C code) for data analysis and modeling.

Problem: Our work combines remote sensing with climate and snowmelt modeling. We generate lots of images and other files with spatial data of varying types: daily images from MODIS (Moderate-Resolution Imaging Spectroradiometer), information derived about snow properties from the images, then smoothed and interpolated across time to account for cloud cover, artifacts caused by different viewing angles, and occasional instrument noise. We combine these data with hourly estimates of climate variables – solar radiation, air temperature, etc – over a grid, and we segment analyses by drainage basins, stored as polygons or masks.

A single run of a program to analyze a year’s worth of data might open 2,000 files. Until now, to find what we need when we analyze them, we use an ad hoc data management strategy -- folders and file names, crawlers through the file system, generating metadata about the files that we put in a DBMS. The process works poorly. Folders and files are too cumbersome when you have lots of images, and the problem with the integration of the DBMS and the files is that they got out of sync, with orphans in both directions. The most frequent problem is that we generate new images and forget to update the DBMS.

The FILETABLE feature in SQL Server 2012 completely solves this problem.

You configure SQL Server in a way that automatically keeps track of files. You just have to: (i) write them into the right place on the file system, and (ii) embed enough metadata in the file name that you can parse it to build a View, which you can call from your analysis program, to open the file(s) you want. The FILETABLE is updated automatically, with a new row in the table for each file, and the row goes away if the file is deleted. The process is robust, you cannot modify the table itself in SQL Server.

For a scientist (like many) with a heterogeneous, idiosyncratic computing environment, this new feature is fabulous.