Atomize your strings to improve memory usage

In yesterday's post, I hinted at a method to improve memory usage in your applications. This trick can be applied anytime you have many strings in your application that have the same value but were allocated separately and thus each take up space of their own.

This is something that you may find whenever you're reading data from some external source into your application, where that data has no information to help you figure out whether it's a repeated instance or not. For example, when you read data from a database, typically character values will be read as different strings, even if they have the same value from one row to the next.

The process of taking a string and checking whether you already had one with the same value to reuse it is called atomizing a string. This has two nice properties.

  1. You end up using less memory to hold all the strings.
  2. You can compare the strings faster. Once both strings have been atomized through the same table, their values are equal if and only if they are the same reference, so a by-reference comparison is enough, and that is much faster. Even if you eventually mix in non-atomized strings, you'll still have cases where the by-reference check gives you a "quick yes" on equality comparison.
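
To make the second point concrete, here is a minimal sketch of the two kinds of comparison. The AtomComparison class and its SameAtom and SameValue helpers are hypothetical names made up for this illustration; they are not part of any library.

```csharp
using System;

static class AtomComparison
{
    // Only a valid equality test when both strings came out of the same atom table.
    public static bool SameAtom(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    // Safe for any mix of atomized and non-atomized strings:
    // try the cheap reference check first, then fall back to comparing values.
    public static bool SameValue(string a, string b)
    {
        if (object.ReferenceEquals(a, b)) return true; // the "quick yes"
        if (a == null || b == null) return false;
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    static void Main()
    {
        // Built at run time so each one is a separate allocation
        // (two identical literals would already share one instance).
        string a = new string(new[] { 'U', 'P', 'S' });
        string b = new string(new[] { 'U', 'P', 'S' });

        Console.WriteLine(SameAtom(a, b));  // False: equal values, separate allocations
        Console.WriteLine(SameValue(a, b)); // True: falls back to comparing characters
        Console.WriteLine(SameValue(a, a)); // True: answered by the reference check
    }
}
```

As far as I know, the built-in string equality already begins with a reference check of its own, so in practice you get the "quick yes" for free; the helper just spells out what is going on.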

By the way, XmlReader already applies this mechanism to things like element names, so we can borrow the System.Xml.NameTable class to do this work for us. We will use the Add method, which adds a string value if we haven't seen it before ("atomizing it") and otherwise returns the reference to an already-atomized string with the same value.
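
Before touching the real code, here is a small stand-alone sketch of how NameTable behaves. The NameTableDemo class, the AtomizePair helper, and the sample values are made up for illustration; Add and Get are the actual NameTable methods.

```csharp
using System;
using System.Xml;

static class NameTableDemo
{
    // Hypothetical helper: atomizes two strings through one table and
    // reports whether they end up as the same reference.
    public static bool AtomizePair(string first, string second)
    {
        var table = new NameTable();
        string atomFirst = table.Add(first);
        string atomSecond = table.Add(second);
        return object.ReferenceEquals(atomFirst, atomSecond);
    }

    static void Main()
    {
        // Built at run time so each one is a separate allocation.
        string first = new string(new[] { 'A', 'B', '1', '2' });
        string second = new string(new[] { 'A', 'B', '1', '2' });

        Console.WriteLine(object.ReferenceEquals(first, second)); // False
        Console.WriteLine(AtomizePair(first, second));            // True

        // Get looks up an atom without adding it; it returns null on a miss.
        var table = new NameTable();
        table.Add(first);
        Console.WriteLine(table.Get("AB12") != null); // True
        Console.WriteLine(table.Get("ZZ99") == null); // True
    }
}
```

Once every reference points at the atomized copy, the duplicate allocations are no longer reachable and the garbage collector can reclaim them.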

We can go back to the MeasurePlainObject method we wrote yesterday and touch it up like this:

 ...
details.TrimExcess();
 
// While we're at it, try having fewer CarrierTrackingNumber instances.
System.Xml.NameTable nt = new System.Xml.NameTable();
foreach (var detail in details)
{
  // NameTable.Add throws on a null argument, so pass nulls through unchanged.
  detail.CarrierTrackingNumber =
      (detail.CarrierTrackingNumber == null) ? null :
      nt.Add(detail.CarrierTrackingNumber);
}
 
GC.Collect();
totalMemoryAfterWork = GC.GetTotalMemory(true);
...

Now when I run this on my machine, I get the following values.

C:\work\repro>mem.exe --poco
POCO (121317 records):
Bytes before work: 49160
Bytes after work:  13575952
Delta:             13526792

Recall that without this, these were the values I had for the POCO case.

C:\work\repro>mem.exe --poco
POCO (121317 records):
Bytes before work: 49160
Bytes after work:  15860472
Delta:             15811312

That's a cool 2,284,520 bytes saved (roughly 14% of the original delta) for very little code. A few things to bear in mind.

  • You can go a lot further than this the more your string values are repeated. An extreme case is where you're actually storing enumeration values as strings.
  • This works in many contexts. Typically you need to consider who is creating the string, and if they're not doing atomization and the data lends itself well to it, you may be able to apply this trick. This could be data read from the network, from a text file, from a database, etc.
  • Don't overdo this. There is some work in building the name table and updating all the references; if you're not going to gain much, it might not be worth the effort. Also, if you're not going to use the memory for anything and it's about to get garbage collected anyway, you might as well not bother doing the atomizing and let the garbage collector clean everything up for you; you might even make things slower by spending time building that name table.
  • Measure. Ultimately this is a performance trick, and if you don't measure, you won't really be able to tell whether you're making things better or worse.
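
If you want to try the measuring part on data of your own, something along these lines may be a reasonable starting point. It is only a sketch: the AtomizeMeasurement class, its helpers, the "TRACK-" sample values, and the 100,000 / 100 sizes are all made up for illustration, and the numbers you get will depend on your runtime and machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Xml;

static class AtomizeMeasurement
{
    // Builds 'count' separately allocated strings drawn from 'distinct' values.
    public static List<string> MakeSample(int count, int distinct)
    {
        var list = new List<string>(count);
        for (int i = 0; i < count; i++)
        {
            // Concatenating at run time allocates a fresh string each time.
            list.Add("TRACK-" + (i % distinct).ToString("D6"));
        }
        return list;
    }

    // Replaces every non-null entry with its atomized equivalent.
    public static void Atomize(List<string> values)
    {
        var table = new NameTable();
        for (int i = 0; i < values.Count; i++)
        {
            if (values[i] != null) values[i] = table.Add(values[i]);
        }
    }

    static void Main()
    {
        List<string> values = MakeSample(100000, 100);

        long before = GC.GetTotalMemory(true);
        Stopwatch timer = Stopwatch.StartNew();
        Atomize(values);
        timer.Stop();
        long after = GC.GetTotalMemory(true);

        Console.WriteLine("Bytes before atomizing: " + before);
        Console.WriteLine("Bytes after atomizing:  " + after);
        Console.WriteLine("Bytes saved:            " + (before - after));
        Console.WriteLine("Time spent (ms):        " + timer.ElapsedMilliseconds);

        // Keep the list alive until both measurements have been taken.
        GC.KeepAlive(values);
    }
}
```

Having both the bytes saved and the time spent side by side makes it easier to judge whether the trade is worth it for your data.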

Enjoy!