DataSet versus plain objects

The other day, a question popped up on the forums about whether the DataSet, by virtue of not being able to “spill to disk” intelligently like a database engine can, was inappropriate for some specific scenario.

That got me thinking about how much memory the component consumes, one thing led to another, and so here is today’s post.

There are a few things I’d like to note up-front.

  • You still need to measure your own scenario. The approach I show here is simplistic, and other variables might alter the results for your scenario quite a bit: the specific structure of the data, any code that gets JIT’ed, whether you care about working set, private bytes, or something else, etc.
  • You need to figure out what you care about. The GC memory bytes are used below as an approximation for "overall memory pressure", but I hope they give a reasonable indication.
  • You need to consider what functionality tradeoffs you're making. Sure, DataSet might take up more memory overall than plain objects, but then again it also provides functionality to easily serialize the data, do change tracking, perform data binding, have indexes with alternative sorts, etc. (a few of these are sketched right after this list). Rebuilding these capabilities for your own object model might be a considerable undertaking.
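
For instance, here is a minimal, self-contained sketch of a few of those built-in capabilities; the table shape and sample values are mine, invented for the sketch, not taken from AdventureWorks.

using System.Data;

class DataSetFeatures {
  static void Main() {
    // A tiny table with a shape invented for this sketch.
    var table = new DataTable("Details");
    table.Columns.Add("OrderQty", typeof(short));
    table.Columns.Add("UnitPrice", typeof(decimal));
    table.Rows.Add((short)1, 9.99m);
    table.Rows.Add((short)3, 4.50m);
    table.AcceptChanges();                  // baseline: no pending changes

    table.Rows[0]["OrderQty"] = (short)5;   // edit a value...
    DataTable changed = table.GetChanges(); // ...and change tracking reports it

    // An alternative sort order without touching the underlying rows.
    var byPrice = new DataView(table) { Sort = "UnitPrice DESC" };

    table.WriteXml("details.xml");          // serialization in one call
  }
}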

Without further ado, here is the code I used.

First, some boilerplate code to set up the class and some constants to connect to a local instance of AdventureWorks on SQL Express.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;
 
class C {
  private const string ConnectionString =
    "Data Source=.\\SQLEXPRESS;" +
    "Initial Catalog=AdventureWorks;" +
    "Integrated Security=true";
  private const string CommandString =
    "SELECT * FROM Sales.SalesOrderDetail";
 
  public static void Main(string[] args) {
    if (args.Length == 0 || args[0] == "--dataset") {
      MeasureDataSet();
    } else if (args[0] == "--poco") {
      MeasurePlainObjects();
    } else {
      Console.WriteLine("Use --dataset or --poco as arguments");
    }
  }
...

Next, here is the DataSet implementation. We measure how many bytes are in use before creating and populating the DataSet, let the garbage collector reclaim the intermediate allocations from the construction work, and then measure again.

public static void MeasureDataSet()
{
  long totalMemoryBeforeWork;
  long totalMemoryAfterWork;
  DataSet dataset;
 
  GC.Collect();
  totalMemoryBeforeWork = GC.GetTotalMemory(true);
 
  dataset = new DataSet();
 
  using (var connection = new SqlConnection(ConnectionString))
  using (var adapter = new SqlDataAdapter())
  using (var command = new SqlCommand(CommandString, connection))
  {
    adapter.SelectCommand = command;
    adapter.Fill(dataset);
  }
 
  GC.Collect();
  totalMemoryAfterWork = GC.GetTotalMemory(true);
 
  GC.KeepAlive(dataset);
 
  WriteResults(
    "DataSet", 
    totalMemoryBeforeWork, totalMemoryAfterWork,
    dataset.Tables[0].Rows.Count);
}
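
As an aside, the collect/measure/work/collect/measure dance is identical in both methods; if you end up measuring more variations, it could be factored into a small helper along these lines (a sketch; the method and delegate names are mine):

// Sketch: run some work, return the bytes it leaves behind reachable.
// "work" returns the object graph whose retained size we approximate.
public static long MeasureAllocatedBytes(Func<object> work)
{
  GC.Collect();
  long before = GC.GetTotalMemory(true);

  object result = work();

  GC.Collect();
  long after = GC.GetTotalMemory(true);
  GC.KeepAlive(result);   // keep the graph alive until after the reading

  return after - before;
}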

We will contrast the DataSet with plain old CLR objects (POCO), so here is a class that can hold the row values, and a method that does work equivalent to the DataSet method above.

// See https://msdn.microsoft.com/en-us/library/ms124498(v=SQL.100).aspx
public class SalesDetail
{
  public int SalesOrderID { get; set; }
  public int SalesOrderDetailID { get; set; }
  public string CarrierTrackingNumber { get; set; }
  public short OrderQty { get; set; }
  public int ProductID { get; set; }
  public int SpecialOfferID { get; set; }
  public decimal UnitPrice { get; set; }
  public decimal UnitPriceDiscount { get; set; }
  public decimal LineTotal { get; set; }
  public Guid rowguid { get; set; }
  public DateTime ModifiedDate { get; set; }
}
 
public static void MeasurePlainObjects()
{
  long totalMemoryBeforeWork;
  long totalMemoryAfterWork;
  List<SalesDetail> details;
 
  GC.Collect();
  totalMemoryBeforeWork = GC.GetTotalMemory(true);
 
  details = new List<SalesDetail>();
 
  using (var connection = new SqlConnection(ConnectionString))
  using (var adapter = new SqlDataAdapter()) // unused here; presumably kept so both paths allocate the same ADO.NET plumbing
  using (var command = new SqlCommand(CommandString, connection))
  {
    connection.Open();
    using (var reader = command.ExecuteReader())
    {
      while (reader.Read()) {
        // Horribly inefficient but concise; a faster hand-written
        // mapping is sketched after this method.
        var detail = new SalesDetail();
        foreach (var prop in detail.GetType().GetProperties()) {
          object val = Convert.ChangeType(
            reader[prop.Name], prop.PropertyType);
          prop.SetValue(detail, val, null);
        }
        details.Add(detail);
      }
    }
  }
 
  details.TrimExcess();
 
  GC.Collect();
  totalMemoryAfterWork = GC.GetTotalMemory(true);
 
  GC.KeepAlive(details);
 
  WriteResults(
    "POCO", 
    totalMemoryBeforeWork, totalMemoryAfterWork,
    details.Count);
}
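
As the comment says, the reflection loop is there for brevity only. A more typical hand-written mapping looks up each column ordinal once and uses the reader's typed getters; here is a sketch for a few of the columns (the null check reflects that CarrierTrackingNumber is nullable in AdventureWorks; the rest of the columns follow the same pattern):

// Sketch: ordinal lookups and typed getters instead of reflection.
int salesOrderIdOrdinal = reader.GetOrdinal("SalesOrderID");
int orderQtyOrdinal = reader.GetOrdinal("OrderQty");
int trackingOrdinal = reader.GetOrdinal("CarrierTrackingNumber");

while (reader.Read()) {
  details.Add(new SalesDetail {
    SalesOrderID = reader.GetInt32(salesOrderIdOrdinal),
    OrderQty = reader.GetInt16(orderQtyOrdinal),
    CarrierTrackingNumber = reader.IsDBNull(trackingOrdinal)
      ? null : reader.GetString(trackingOrdinal)
  });
}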

Finally, we can simply write the results to the console.

public static void WriteResults(string method,
  long totalMemoryBeforeWork, long totalMemoryAfterWork,
  int recordCount)
{
  long delta = totalMemoryAfterWork - totalMemoryBeforeWork;
  Console.WriteLine(method + " (" + recordCount + " records):");
  Console.WriteLine("Bytes before work: " + totalMemoryBeforeWork);
  Console.WriteLine("Bytes after work:  " + totalMemoryAfterWork);
  Console.WriteLine("Delta:             " + delta);
}
 
}
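
To build the test program, I just compiled the single file from a Visual Studio command prompt (assuming it is saved as mem.cs):

csc /out:mem.exe mem.cs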

I ran each variation, DataSet and plain old CLR objects (POCO), and here are my results. Your results will probably vary depending on various factors, like the version of the framework you use, whether you run this on a 32-bit or 64-bit machine, etc.

C:\work\repro>mem.exe --poco
POCO (121317 records):
Bytes before work: 49160
Bytes after work:  15860472
Delta:             15811312

C:\work\repro>mem.exe --dataset
DataSet (121317 records):
Bytes before work: 49152
Bytes after work:  29234040
Delta:             29184888

So, there you go: about 29 MB for the DataSet version, and about 16 MB for the POCO version. These numbers probably include things like pooled connections, so take them as relative rather than absolute values. With other kinds of data, you might see very different profiles: there is a lot of boxing going on in the DataSet for all those value types in the sales rows, which gives POCO a pretty considerable advantage. If most of the memory were going to larger string fields, for example, the difference should be smaller.
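
To see where the boxing comes from, compare how each model hands back one of those value-type fields (a sketch; the local variable names are mine):

// A DataRow hands values out as System.Object, so every value-type
// access pays a box/unbox; the POCO keeps the short inline in the object.
DataRow row = dataset.Tables[0].Rows[0];
short qtyFromDataSet = (short)row["OrderQty"];   // unbox on the way out

SalesDetail first = details[0];
short qtyFromPoco = first.OrderQty;              // plain field access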

Next time, a common trick to squeeze out some more memory out of the POCO case.

Enjoy!