DataSet versus plain objects


The other day, a question popped up on the forums about whether the DataSet, by virtue of not being able to “spill to disk” intelligently like a database engine can, was inappropriate for some specific scenario.

That got me thinking about how much memory the component consumes, one thing led to another, and so here is today’s post.

There are a few things I'd like to note up front.

  • You still need to measure your own scenario. The approach I show here is somewhat simplistic, and other variables might alter the results for your scenario quite a bit: the specific structure of the data, any code that gets JIT-compiled, whether you care about working set, private bytes, or some other metric, etc.
  • You need to figure out what you care about. GC memory bytes are used below as an approximation of "overall memory pressure", but I hope they give a reasonable indication.
  • You need to consider what functionality tradeoffs you're making. Sure, the DataSet might take up more memory overall than plain objects, but then again it also provides functionality to easily serialize the data, do change tracking, perform data binding, maintain indexes with alternative sorts, etc. Rebuilding these capabilities for your own object model might be a considerable undertaking.
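As a small illustration of that last point, here is a sketch (my own, not part of the measurement code below) showing the change tracking and alternative-sort views you get for free from an in-memory DataTable, with no database involved:

```csharp
using System;
using System.Data;

class ChangeTrackingDemo
{
    static void Main()
    {
        var table = new DataTable("Orders");
        table.Columns.Add("OrderID", typeof(int));
        table.Columns.Add("Qty", typeof(short));
        table.Rows.Add(1, (short)3);
        table.AcceptChanges();             // mark the current state as "clean"

        table.Rows[0]["Qty"] = (short)5;   // this edit is tracked automatically

        Console.WriteLine(table.Rows[0].RowState);                           // Modified
        Console.WriteLine(table.Rows[0]["Qty", DataRowVersion.Original]);    // 3

        // An alternative sort order, without touching the underlying rows.
        var view = new DataView(table) { Sort = "Qty DESC" };
        Console.WriteLine(view.Count);                                       // 1
    }
}
```

With plain objects you would need to build the original-value bookkeeping and the sorted views yourself.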

Without further ado, here is the code I used.

First, some boilerplate code to set up the class and some constants to connect to a local instance of AdventureWorks on SQL Express.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

class C {
    private const string ConnectionString =
        "Data Source=.\\SQLEXPRESS;" +
        "Initial Catalog=AdventureWorks;" +
        "Integrated Security=true";
    private const string CommandString =
        "SELECT * FROM Sales.SalesOrderDetail";

    public static void Main(string[] args) {
        if (args.Length == 0 || args[0] == "--dataset") {
            MeasureDataSet();
        } else if (args[0] == "--poco") {
            MeasurePlainObjects();
        } else {
            Console.WriteLine("Use --dataset or --poco as arguments");
        }
    }
    ...

Next, this is how we implement the DataSet code. We measure how many bytes are in use before creating and populating the DataSet, force a collection of all the intermediate things that got allocated while doing the construction work, and then measure again.

public static void MeasureDataSet()
{
    long totalMemoryBeforeWork;
    long totalMemoryAfterWork;
    DataSet dataset;

    GC.Collect();
    totalMemoryBeforeWork = GC.GetTotalMemory(true);

    dataset = new DataSet();

    using (var connection = new SqlConnection(ConnectionString))
    using (var adapter = new SqlDataAdapter())
    using (var command = new SqlCommand(CommandString, connection))
    {
        adapter.SelectCommand = command;
        adapter.Fill(dataset);
    }

    GC.Collect();
    totalMemoryAfterWork = GC.GetTotalMemory(true);

    GC.KeepAlive(dataset);

    WriteResults(
        "DataSet",
        totalMemoryBeforeWork, totalMemoryAfterWork,
        dataset.Tables[0].Rows.Count);
}

We will contrast the DataSet with plain old CLR objects (POCO), so here is a class that can hold the row values, and a method that does work equivalent to the DataSet method above.

// See http://msdn.microsoft.com/en-us/library/ms124498(v=SQL.100).aspx
public class SalesDetail
{
    public int SalesOrderID { get; set; }
    public int SalesOrderDetailID { get; set; }
    public string CarrierTrackingNumber { get; set; }
    public short OrderQty { get; set; }
    public int ProductID { get; set; }
    public int SpecialOfferID { get; set; }
    public decimal UnitPrice { get; set; }
    public decimal UnitPriceDiscount { get; set; }
    public decimal LineTotal { get; set; }
    public Guid rowguid { get; set; }
    public DateTime ModifiedDate { get; set; }
}

public static void MeasurePlainObjects()
{
    long totalMemoryBeforeWork;
    long totalMemoryAfterWork;
    List<SalesDetail> details;

    GC.Collect();
    totalMemoryBeforeWork = GC.GetTotalMemory(true);

    details = new List<SalesDetail>();

    using (var connection = new SqlConnection(ConnectionString))
    using (var adapter = new SqlDataAdapter())
    using (var command = new SqlCommand(CommandString, connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read()) {
                // Horribly inefficient but concise.
                var detail = new SalesDetail();
                foreach (var prop in detail.GetType().GetProperties()) {
                    object val = Convert.ChangeType(
                        reader[prop.Name], prop.PropertyType);
                    prop.SetValue(detail, val, null);
                }
                details.Add(detail);
            }
        }
    }

    details.TrimExcess();

    GC.Collect();
    totalMemoryAfterWork = GC.GetTotalMemory(true);

    GC.KeepAlive(details);

    WriteResults(
        "POCO",
        totalMemoryBeforeWork, totalMemoryAfterWork,
        details.Count);
}

Finally, we can simply write the results to the console.

public static void WriteResults(string method,
    long totalMemoryBeforeWork, long totalMemoryAfterWork,
    int recordCount)
{
    long delta = totalMemoryAfterWork - totalMemoryBeforeWork;
    Console.WriteLine(method + " (" + recordCount + " records):");
    Console.WriteLine("Bytes before work: " + totalMemoryBeforeWork);
    Console.WriteLine("Bytes after work:  " + totalMemoryAfterWork);
    Console.WriteLine("Delta:             " + delta);
}

}

I ran each variation, DataSet and plain old CLR objects (POCO), and here are my results. Your results will probably vary depending on various factors, like the version of the framework you use, whether you run this on a 32-bit or 64-bit machine, etc.

C:\work\repro>mem.exe --poco
POCO (121317 records):
Bytes before work: 49160
Bytes after work:  15860472
Delta:             15811312

C:\work\repro>mem.exe --dataset
DataSet (121317 records):
Bytes before work: 49152
Bytes after work:  29234040
Delta:             29184888

So, there you go: about 29 MB for the DataSet version, and about 16 MB for the POCO version. These numbers probably include things like pooled connections, so take them as relative figures rather than absolute values. With other kinds of data, you might see very different profiles: there is a lot of boxing going on in the DataSet for all those value types in the sales rows, which gives POCO a pretty considerable advantage. If most of the memory were going to larger string fields, for example, the difference should be smaller.
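You can see the boxing effect without a database at all. The sketch below (my own illustration, using the same GC.GetTotalMemory technique as above) compares rows stored as object[], where every value type gets boxed much as a DataRow stores its values, against a typed class:

```csharp
using System;
using System.Collections.Generic;

class BoxingDemo
{
    class TypedRow { public int Id; public decimal Price; public DateTime Date; }

    // Run the given fill action and return the managed-heap delta it caused.
    static long Measure(Action fill)
    {
        GC.Collect();
        long before = GC.GetTotalMemory(true);
        fill();
        GC.Collect();
        return GC.GetTotalMemory(true) - before;
    }

    static void Main()
    {
        const int N = 100000;
        List<object[]> boxedRows = null;
        List<TypedRow> typedRows = null;

        long boxedBytes = Measure(() => {
            boxedRows = new List<object[]>(N);
            for (int i = 0; i < N; i++)
                // Each int, decimal, and DateTime is boxed into its own heap object.
                boxedRows.Add(new object[] { i, 1.23m, DateTime.Now });
        });

        long typedBytes = Measure(() => {
            typedRows = new List<TypedRow>(N);
            for (int i = 0; i < N; i++)
                // The value types live inline inside one object per row.
                typedRows.Add(new TypedRow { Id = i, Price = 1.23m, Date = DateTime.Now });
        });

        Console.WriteLine("object[] rows: " + boxedBytes);
        Console.WriteLine("typed rows:    " + typedBytes);

        GC.KeepAlive(boxedRows);
        GC.KeepAlive(typedRows);
    }
}
```

On my understanding of the heap layout, the boxed variant should come out noticeably larger, since each row pays for an array plus one box per value instead of a single object.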

Next time, a common trick to squeeze out some more memory out of the POCO case.

Enjoy!

Comments (2)

  1. LD says:

    Nice post. I'm a bit of a beginner and I want to ask something: if you call the GC on an object that has implemented Dispose, will the memory be freed, or can the GC decide that it can wait a little more and free the memory later?

    P.S. It is my first time reading the blog, so I started with the first post, back in Nov 2004, and I have one thing to say: GO GO ….

  2. LD, thanks for the encouragement 🙂

    Generally, there is no way to ask that a specific object be garbage collected. The only component in a managed application that frees managed memory is the garbage collector. What you *can* do is ask the garbage collector to run, which is something that's very rarely done as the GC usually knows best when it is a good time to do a collection. In this sample, I'm calling GC.Collect() because I want to look at memory before and after and I don't care about execution efficiency.

    The "Dispose" thing is something quite different. You typically implement IDispose and the Dispose method when you have a reference to something that the garbage collector doesn't know about, for example a network connection or a file handle. In that case, you may want to close those immediately because other programs may need to access the same resources for example. In this sample, the "using" statements implicitly call Dispose on the connection and the related objects becuase they may be holding a connection to the database "under the covers". Now, because the GC doesn't know about Dispose, the objects will still be collected whenever the GC decides (so to answer your question specifically: yes, the GC might decide to wait a bit longer).

    Finally (and hopefully this isn't getting too long), there is yet another thing that is usually done for cases like the file system or network connection. The problem is that these may rely on APIs that the garbage collector doesn't know anything about, and even if you forget to call Dispose on them, they really should be letting go of the file or connection. In that case, you can write a destructor for your object. That tells the garbage collector, "please call this method right before you reclaim the memory I'm in", and gives you one last chance to close the file or network handle.
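    The combination of Dispose and a destructor is usually written using the standard dispose pattern; here is a minimal sketch (the class and its handle field are just illustrative):

```csharp
using System;

class ResourceHolder : IDisposable
{
    private IntPtr handle;   // stand-in for an unmanaged resource
    private bool disposed;

    public bool IsDisposed { get { return disposed; } }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this); // cleanup done; the finalizer need not run
    }

    // Destructor (finalizer): the GC's last-chance cleanup if Dispose was forgotten.
    ~ResourceHolder()
    {
        Dispose(false);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposed) return;
        if (disposing)
        {
            // Also dispose any managed IDisposable members here.
        }
        // Release the unmanaged handle here.
        handle = IntPtr.Zero;
        disposed = true;
    }
}
```

    Note that Dispose is safe to call more than once, and that the destructor only releases the unmanaged part, since the managed members it references may already have been collected.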

    Actually, reading back on this, I think this may end up being a post on its own 🙂
