Parsing and Reading Large Excel Files with the Open XML SDK

I want to thank you guys for leaving blog post requests. Today's post will focus on one of your requests: parsing and reading large Excel files with the Open XML SDK. The Open XML SDK provides two approaches for parsing Open XML files: 1. using the SDK DOM, 2. using the SAX-like reading/writing capabilities. The SDK DOM makes it easy to query and parse Open XML files due to strongly typed classes. Unfortunately, the DOM approach requires loading entire Open XML parts in memory, which can sometimes cause Out of Memory exceptions when dealing with really large files. The SAX-like reading/writing capability of the SDK provides a faster and less memory intensive approach. Let's compare the two approaches.

If you want to jump straight into the code, feel free to download the solution here.

Scenario

Imagine a scenario where I work for a financial company that deals with very large Excel spreadsheets. These spreadsheets are updated daily by analysts and can easily grow to sizes exceeding hundreds of megabytes. At the end of the day a solution is used to read and extract relevant data from every spreadsheet.

Here is a screenshot of an example workbook with millions of cells of data:

image

Solution

The scenario described above is pretty simple and involves many of the general concepts described in a previous post on reading data from Excel files. The main difference in today's post is that I will show you the solution using both the DOM and SAX-like approaches. To accomplish the scenario we will need to take the following actions:

  1. Open the workbook using the Open XML SDK
  2. Get the main workbook part
  3. Get the appropriate worksheet
  4. Read every cell value in the sheet

Open XML SDK DOM Approach

The SDK DOM approach is pretty straight forward. The method below will read every cell value for each row contained within the sheet data.

Note that code below works for cells that contain numeric values. Cells that contain other types of content will need a different type of solution.

static void ReadExcelFileDOM(string filename) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First(); foreach (Row r in sheetData.Elements<Row>()) { foreach (Cell c in r.Elements<Cell>()) { string text = c.CellValue.Text; } } } }

The method above will definitely work for reading cell values within a worksheet. However, if the worksheet is quite large then the program's memory footprint will also be quite large. In fact, you are left at the mercy of the garbage collector, which may result in the program throwing an Out of Memory exception.

Open XML SDK SAX-Like Approach

If you have performance and/or memory concerns when developing an Open XML solution then you should look into using the OpenXmlReader and OpenXmlWriter. In today's post I will cover how to use the OpenXmlReader. I'll try to cover the OpenXmlWriter in a future post.

The method below will read every cell value for each row contained within the sheet data using the SAX-like approach.

Note that code below works for cells that contain numeric values. Cells that contain other types of content will need a different type of solution.

static void ReadExcelFileSAX(string filename) { using (SpreadsheetDocument myDoc = SpreadsheetDocument.Open(filename, true)) { WorkbookPart workbookPart = myDoc.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();   OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); string text; while (reader.Read()) { if (reader.ElementType == typeof(CellValue)) { text = reader.GetText(); } } } }

The method above takes advantage of the OpenXmlReader class, which allows you to read an entire part without loading the part in memory. The Create method has the following signatures:

  • public static OpenXmlReader Create(OpenXmlElement openXmlElement);
  • public static OpenXmlReader Create(OpenXmlPart openXmlPart);
  • public static OpenXmlReader Create(Stream partStream);
  • public static OpenXmlReader Create(OpenXmlElement openXmlElement, bool readMiscNodes);
  • public static OpenXmlReader Create(OpenXmlPart openXmlPart, bool readMiscNodes);
  • public static OpenXmlReader Create(Stream partStream, bool readMiscNodes);

As you can see, the Create method allows you to read either a part or an xml subtree given a specific element. There are methods that allow you to skip miscellaneous nodes that are not associated with the Open XML format. Using the OpenXmlReader class allows you to iterate the entire xml tree one node at a time.

Conclusion

Hopefully, this post gives you more insight into reading large Open XML files, while avoiding performance or Out of Memory Exceptions. If you guys have other Open XML SDK related post requests, let me know.

Thanks,

Zeyad Rajabi