Sourcing XML Data in SSIS 2005

So picking up where I left off, I want to talk about my experience using the XML Source adapter in SSIS 2005 to load the data I extracted from SharePoint using the Data Retrieval service I blogged about in my last post.

I had gotten so used to using flat file formats like tab-delimited over the years that sometimes I just go that direction by default when choosing a file format for data transfer.  After a few years doing development on the Microsoft UDDI Services project, I was cured of my database-geek tendencies and embraced XML and Web Services.  But I also learned that not all tools are equal when dealing with XML.  Just because you have an XML file and a schema doesn't mean your tool can handle them.  Many times I've gone down the XML path only to realize that my tool has significant limitations when dealing with XML schemas more complex than the typical "one element per row" XML file beloved by XML / database integration demo wonks.

So I have to say that I was very pleasantly surprised by the functionality exposed in the XML Source Adapter in SSIS 2005.  When you are loading data into a data warehouse, you really don't want to have to reformat your load files.  You just want to open them, read them once, and load them.  My goal was to source the XML files and schema created by my SharePoint extract utility directly, without having to massage the files in any way, and I achieved that with one exception, which I'll talk about in a minute.

Sourcing an XML file in an SSIS 2005 data flow is as easy as dragging an XML Source Adapter onto the design surface and configuring it.  The first set of configuration options tells SSIS how to access the file.  You can choose any of the following options:

  • Hard code the path
  • Get the path from a package variable (the approach I used)
  • Get the XML from a package variable

My package actually constructs the path to each of the XML files I load dynamically, by detecting the directory used for the most recent extract operation performed by my SharePoint extract utility.  A little bit of script code at the beginning of my package calculates these paths.  The only complaint I had was that reading and writing package variables from script seemed a bit arcane: you have to manage your own locking, since there are concurrency issues surrounding the use of package variables.  I'm told the APIs for manipulating package variables in script are going to be simplified, but for now only the brute-force approach worked for me.

Here's a quick sample of what some of the code looks like for reading from and writing to package variables:

' Assumes "Imports System.IO" at the top of the script task
Dim vars As Variables
Dim extpath As String, extdir As String, fileName As String

' Verify extranet extract directory
Dts.VariableDispenser.LockOneForRead("ExtranetExtractDir", vars)
extpath = CStr(vars("ExtranetExtractDir").Value)
vars.Unlock()

If Not Directory.Exists(extpath) Then
   Dts.TaskResult = Dts.Results.Failure
   Exit Sub
End If

' Find the most recent extract directory; the extract utility names
' its directories so that sorting puts the newest one last
Dim paths As String() = Directory.GetDirectories(extpath)
If paths.Length = 0 Then
   Dts.TaskResult = Dts.Results.Failure
   Exit Sub
End If
Array.Sort(paths)
extdir = paths(paths.GetUpperBound(0))

' Configure variables for Application Profile
Dts.VariableDispenser.LockOneForRead("AppProfilesXmlFile", vars)
fileName = CStr(vars("AppProfilesXmlFile").Value)
vars.Unlock()

Dts.VariableDispenser.LockOneForWrite("AppProfilesXmlPath", vars)
vars("AppProfilesXmlPath").Value = Path.Combine(extdir, fileName)
vars.Unlock()

But before I could use the variable approach described above, I needed to use a hard-coded path to the file so that I could generate my own schema for sourcing the data.  It turns out that the XSD schema files I generated using SharePoint's Data Retrieval service were not going to work.  I was getting cryptic validation errors from SSIS when trying to use them to read my XML files.  To debug the problem, I grabbed my trusty copy of XML Spy and tried to manually validate one of my XML files against one of the XSD schemas generated by my utility.  Lo and behold, the file generated by the List Data Retrieval service could not be validated by the schema generated by the same service.
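If you want to reproduce this kind of validation check without XML Spy, a few lines of VB.NET against the .NET 2.0 validation support will do it.  This is just a minimal sketch; the file names are placeholders rather than the actual files from my utility:

Imports System
Imports System.Xml
Imports System.Xml.Schema

Module ValidateXml

    Sub Main()
        ' Hypothetical file names -- substitute your own extract file and schema
        Dim settings As New XmlReaderSettings()
        settings.ValidationType = ValidationType.Schema
        settings.Schemas.Add(Nothing, "AppProfiles.xsd")
        AddHandler settings.ValidationEventHandler, AddressOf OnValidationEvent

        ' Reading the document end to end triggers validation; with a
        ' handler attached, each violation is reported instead of thrown
        Using reader As XmlReader = XmlReader.Create("AppProfiles.xml", settings)
            While reader.Read()
            End While
        End Using
    End Sub

    Private Sub OnValidationEvent(ByVal sender As Object, ByVal e As ValidationEventArgs)
        Console.WriteLine("{0}: {1}", e.Severity, e.Message)
    End Sub

End Module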

After looking at the schema, I realized it contained a bunch of constraints that were irrelevant to what I needed.  I probably could have spent some time attempting to generate less restrictive schemas using my SharePoint extract utility, but then I noticed a little button in the XML Source adapter labeled "Generate XSD...".  Hmmm.......

It worked great!  After pointing to a physical file that I had already extracted, I used the XML Source Adapter to generate the schema and simply ignored the original schemas I had generated in SharePoint.  The SSIS schemas were tight, with no unnecessary constraints, and worked well for my application.  After generating the schema using SSIS, I reconfigured my source adapter to use my package variable for the source path and was able to proceed.
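For a sense of what these generated schemas look like: they are minimal, with every column coming out as an optional, unconstrained string.  The fragment below is illustrative only; the element names are made up rather than taken from my actual extract:

<xs:element name="AppProfile">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" name="Title" type="xs:string" />
      <xs:element minOccurs="0" name="Owner" type="xs:string" />
      <xs:element minOccurs="0" name="Comments" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:element>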

The XML Source Adapter needs a schema to use as metadata for mapping output columns from the source adapter.  After you supply a valid schema, just click the "Columns" option to configure which columns you want to include in the output result set.  IMPORTANT:  I recommend renaming the columns in the output result set to the final column names you intend to use in your target at this early stage.  This will save you a lot of time in complex data flows by sparing you from constantly remapping columns whose names do not match.  An alternative is to use the Copy Map transform, but in my opinion it's easier to do it right in the XML Source Adapter.

The last thing you need to think about is truncation.  The simple schemas generated by SSIS do not contain any length constraints, and so the XML Source adapter assumes a data type of "Unicode string [DT_WSTR]" and a default length of 255.  This is the equivalent of "nvarchar(255)" in SQL Server.  This assumption worked fine for me except in one case where I had a comments field that ran longer than 255 characters.  This wound up causing truncation errors when executing the data flow.

To get around this, I needed to right-click on the XML Source Adapter and select the "Show Advanced Editor..." option.  Using this dialog you can reconfigure the default data type and length mappings of both the input file and the output result set.  I bumped my comments field up to 4000 characters and I was good to go.  Incidentally, I asked how this configuration could leverage the new "nvarchar(max)" data type in SQL Server 2005.  The answer was that nvarchar(max) maps to a "text stream [DT_TEXT]" data type, which is handled like a blob in data flows, but the SSIS team was looking at ways to handle it like a normal string of indeterminate size.
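For what it's worth, there is also a schema-side way to avoid the truncation errors: as I understand it, the XML Source will size the output column from a maxLength facet if the schema supplies one.  Something like this hypothetical fragment (again, illustrative names) would have given me a 4000-character column without touching the Advanced Editor:

<xs:element minOccurs="0" name="Comments">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:maxLength value="4000" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>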

Now that we've covered sourcing data, I'll move on to discussing data transformation and scrubbing in my next post.