How to efficiently generate OpenXML Workbook from Excel files containing very large amount of data

Few days back, one of my colleague asked me for my help for a requirement to convert an Excel 97-2003 Workbook to an OpenXML based workbook. Although it may sound pretty easy, the catch was that the file to be converted had around millions of rows (you read it right, Millions!). If we go by the standard technique of looping through the cells of the source file, and generate the corresponding XML for the OpenXML file, it would take, if not ages, yet a substantial amount of time to complete.

So, the question was how to efficiently generate the OpenXML from an Excel workbook that contains huge amount of data?

In order to solve this, we decided to use the XSL Transform technique on our source XML’s data to generate the required content for OpenXML. In this post, I am going to show how we achieved this.

Here is the overall flow of what we are going to implement here:

Flow

Before we go into the details of this, let us briefly discuss what is XSL transform. I’ll not go into the details of XSLT, however the overall idea is that using XSLT (eXtensible Stylesheet Language Transformations) file we can transform an XML data to some other format, for e.g. XHTML; and in our case to XML based on a different schema.

For more details and explanation of XSLT, please see following links:

 

Extracting Data From Excel Workbook

Our first task is to extract the data contained in the Excel workbook into an XML file. We will achieve this by using XML mapping in Excel. Here is the snapshot of the workbook that I used for our example:

 ExcelBook

Here is the XSD file that I used to map the data in the above workbook; note that the contents of this file will depend on the structure of the source Excel workbook:

 <?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified" attributeFormDefault="unqualified" xmlns:xs="https://www.w3.org/2001/XMLSchema">
  <xs:element name ="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" name="data">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="object" type="xs:string"/>
              <xs:element name="price" type="xs:integer"/>
              <xs:element name="count" type="xs:short"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

After mapping this schema, we can extract the data from Excel by saving the current file as XML Data (*.xml) format. Here the XML that was generated from my example workbook using the above schema; let’s call it data.xml:

 <root>
    <data>
        <object>Table</object>
        <price>150</price>
        <count>1</count>
    </data>
    ...
</root>

After generating our data XML, next task is to prepare XSLT for transforming this data XML to our required format.

 

 

Preparing XSLT

In my XSLT, I have two templates that I apply on the generated Data XML, one at the root node, and the other for each data element and its children.

For each data node, we need to generate a row element as per Spreadsheet ML, and for each child element of data node we need to generate col elements. Each row element that we need to generate, must have an attribute named r which will contain the row number. Similarly for each col node we need to generate an attribute named r that needs to have value of the cell name which this column corresponds to.

For e.g. for first row in our data.xml our transformed row element should have an attribute r=”1” , and its first col element should have an attribute r=”A1” , second col element should have r=”B1” and so on.

Now that actual data value that is contained in a cell, can be maintained in following two ways according to Spreadsheet ML specifications:

  1. Keep each value in a separate file called sharedStrings.xml in the OpenXML package
  2. Keep them as inline strings as part of the sheet.xml

We will be using the second way to generate our XML. The second way also requires that each col element specifies the type attribute (t) with value as inlineStr. This tells Excel that it contains an inline string and Excel doesn’t need to search for the value from sharedStrings.xml. The actual content of the inline string is contained in the child element named is under the col node.

Here is the XSLT that I used for transforming our data XML to the required XML:

 <?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="https://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="root">
    <xsl:element name="sheetData">
      <xsl:apply-templates select="data"></xsl:apply-templates>
    </xsl:element>
  </xsl:template>

  <xsl:template match="data">
    <xsl:variable name="rowID">
      <xsl:number value="position()" format="1"/>
    </xsl:variable>
    <xsl:element name="row">
      <xsl:attribute name="r">
        <xsl:value-of select="$rowID"/>
      </xsl:attribute>

      <xsl:for-each select="*">
        <xsl:element name="c">
          <xsl:variable name="colID">
            <xsl:number value="position()" format="A"/>
          </xsl:variable>
          <xsl:attribute name="r">
            <xsl:value-of  select="concat(string($colID),string($rowID))"/>
          </xsl:attribute>
          <xsl:attribute name="t">
            <xsl:text>inlineStr</xsl:text>
          </xsl:attribute>
          <xsl:element name="is">
            <xsl:element name="t">
              <xsl:value-of select="."/>
            </xsl:element>
          </xsl:element>
        </xsl:element>
      </xsl:for-each>
      
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>

 

Generating transformed XML using the XSLT

Now that we have our XSLT ready, we need to transform our data XML into our desired format using this XLST. For this example, I used a C# based Windows Forms application to perform this task.

Here is the code snippet I used:

 XPathDocument xDoc = new XPathDocument("data.xml");
XslCompiledTransform xXslt = new XslCompiledTransform();
xXslt.Load("test.xslt");

XmlTextWriter xOutput = new XmlTextWriter("output.xml", null);

xXslt.Transform(xDoc, null, xOutput);

xOutput.Close();

 

Running this snippet, results in an XML in following format:

 <?xml version="1.0" encoding="utf-8"?>
<sheetData>
  <row r="1">
    <c r="A1" t="inlineStr">
      <is>
        <t>Table</t>
      </is>
    </c>
    <c r="B1" t="inlineStr">
      <is>
        <t>150</t>
      </is>
    </c>
    <c r="C1" t="inlineStr">
      <is>
        <t>1</t>
      </is>
    </c>
  </row>
  ...
  ...
</sheetData>

 

Generating XLSX from our transformed XML

This involves following steps:

  1. Generate a blank XLSX file in memory. I have used OpenXML 2.0 SDK for this task, and also referred to this post:

         How To generate a .xlsx using Open XML SDK without loading any .xml

  2. Insert the sheetData element from our transformed XML into the sheet.xml

Here is the code snippet I used:

 // Read the generated XML
StreamReader sr = File.OpenText("output.xml");
string strSheetData = sr.ReadToEnd();
 
using (SpreadsheetDocument doc = SpreadsheetDocument.Create("output.xlsx", DocumentFormat.OpenXml.SpreadsheetDocumentType.Workbook))
{
    WorkbookPart workbook = doc.AddWorkbookPart();
    WorksheetPart sheet = workbook.AddNewPart<WorksheetPart>();
    string sheetId = workbook.GetIdOfPart(sheet);
 
    // Create a blank XLSX file
    string XML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><workbook xmlns=""https://schemas.openxmlformats.org/spreadsheetml/2006/main"" xmlns:r=""https://schemas.openxmlformats.org/officeDocument/2006/relationships""><sheets><sheet name=""{1}"" sheetId=""1"" r:id=""{0}"" /></sheets></workbook>";
    XML = string.Format(XML, sheetId, "Sheet1");
    this.AddPartXml(workbook, XML);
 
    // Insert our sheetData element to the sheet1.xml
    XML = @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?><worksheet xmlns=""https://schemas.openxmlformats.org/spreadsheetml/2006/main"" >{0}</worksheet>";
    XML = string.Format(XML, strSheetData);
    this.AddPartXml(sheet, XML);
 
    doc.Close();
}
  protected void AddPartXml(OpenXmlPart part, string xml)
{
    using (Stream stream = part.GetStream())
    {
        byte[] buffer = (new UTF8Encoding()).GetBytes(xml);
        stream.Write(buffer, 0, buffer.Length);
    }
}

 

After running this code snippet, a file named output.xlsx is generated with our transformed XML inserted into its sheet.xml. Here is how it looked on my machine, when opened in Excel 2007:

Generated XLSX

 

As an example, I have just inserted my data in the sheet.xml. In a more realistic scenario, you might want to implement extra tasks like:

  • Pull data from a database as XML (instead of extracting it from an XLS file) and transform that to the required XML using XSLT.
  • Add header row to the sheet.xml in addition to the data
  • Define styles for the columns. You might have to modify your XSLT to generate columns with associated styles.
  • Format columns for different types of data.

You can also play around with XSLT to generate a full sheet.xml for your requirement, rather than generating just the sheetData element of sheet.xml.

With these I am ending this post. Please feel free to post your queries as comments, I will definitely try to answer them.