OutOfMemoryExceptions while remoting very large datasets

When you have to pass an object back and forth between processes or application domains you have to serialize it into some type of stream that can be understood by both the client and the server.  

The larger and more complex the object gets, the more expensive it is to serialize, both CPU-wise and memory-wise, and if the object is big and complex enough you can easily run into out of memory exceptions during the actual serialization process... and that is exactly what happened to one of my customers...

They had to pass very large datasets back and forth between the UI layer and the data layer, and these datasets could easily get up to a couple of hundred MB in size.  When they passed the datasets back they would get OutOfMemoryExceptions in stacks like this one...  in other words, they would get OOMs while serializing the dataset to pass it back to the client...

0454f350 773442eb [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
0454f458 7964db64 System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f498 793ba2bb System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.Serialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f4c0 793b9cef System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(System.Object, System.Runtime.Remoting.Messaging.Header[], System.Runtime.Serialization.Formatters.Binary.__BinaryWriter, Boolean)
0454f500 793b9954 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object, System.Runtime.Remoting.Messaging.Header[], Boolean)
0454f524 6778c0b0 System.Runtime.Remoting.Channels.BinaryServerFormatterSink.SerializeResponse(System.Runtime.Remoting.Channels.IServerResponseChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f57c 6778bb0f System.Runtime.Remoting.Channels.BinaryServerFormatterSink.ProcessMessage(System.Runtime.Remoting.Channels.IServerChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders, System.IO.Stream, System.Runtime.Remoting.Messaging.IMessage ByRef, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f600 67785616 System.Runtime.Remoting.Channels.Tcp.TcpServerTransportSink.ServiceRequest(System.Object)
0454f660 67777732 System.Runtime.Remoting.Channels.SocketHandler.ProcessRequestNow()
0454f690 677762a2 System.Runtime.Remoting.Channels.RequestQueue.ProcessNextRequest(System.Runtime.Remoting.Channels.SocketHandler)
0454f694 67777693 System.Runtime.Remoting.Channels.SocketHandler.BeginReadMessageCallback(System.IAsyncResult)
0454f6c4 7a569ca9 System.Net.LazyAsyncResult.Complete(IntPtr)
0454f6fc 7a56a46e System.Net.ContextAwareResult.CompleteCallback(System.Object)
0454f704 79373ecd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0454f71c 7a56a436 System.Net.ContextAwareResult.Complete(IntPtr)
0454f734 7a569bed System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
0454f764 7a61062d System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f79c 79405534 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f93c 79e7c74b [GCFrame: 0454f93c]

My gut feeling was that they were SOL. I know that serialization is very memory expensive and that the resulting serialized XML strings can get enormous, so I wasn't very surprised, especially knowing how large their datasets were.

I am not a data access guru, but I have seen this type of issue enough times that I knew what the recommendation should be.

1. Re-think the architecture... what are you using these datasets for? Who will be browsing through 100s of MBs of data anyway?  (And this still holds true: in most cases where this much data is involved, only a very small part of it is needed, and if that is the case then only that small piece of the data should be handled, i.e. filter out what you need and leave the rest.)

2. Re-consider passing this data through remoting/webservices/out-of-proc session state or whatever it might be.  Once you start serializing and deserializing this amount of data you are treading on thin ice when it comes to the scalability of your application, both performance-wise and memory-wise.  Again, this still holds true: if the dataset itself is 100 MB you will only be able to have a handful of concurrent requests before you run out of memory for the datasets alone.

3. If you really really really need this much data and this architecture, you need to start thinking about moving to 64-bit, but even there you need to make sure that you have enough RAM and disk space to back the memory you're using, and you still need to be careful, because the more memory you use, the longer it will take to perform full garbage collections.

We discussed a couple of options, like bringing back partial datasets or chunking the data up, but most of them were a no-go.
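For what it's worth, the two ideas (bringing back a filtered, partial dataset, and chunking the data into pages) can be sketched roughly like this. This is only a minimal sketch and not what the customer had: PartialDataSketch, FilterRows and GetPage are hypothetical names, and it assumes the full table is already loaded in the data layer.

```csharp
using System;
using System.Data;

static class PartialDataSketch
{
    // Hypothetical helper: copy only the rows that match a filter expression
    // (for example "ID < 100") into a new, much smaller DataSet.
    public static DataSet FilterRows(DataTable source, string filter)
    {
        DataView view = new DataView(source);
        view.RowFilter = filter;

        DataSet result = new DataSet();
        result.Tables.Add(view.ToTable());   // ToTable copies the rows visible in the view
        return result;
    }

    // Hypothetical helper: copy at most 'count' rows starting at 'offset',
    // so the caller can fetch the data one page at a time.
    // Assumes offset >= 0; an offset past the end gives an empty page.
    public static DataSet GetPage(DataTable source, int offset, int count)
    {
        DataTable page = source.Clone();     // schema only, no rows
        int end = Math.Min(offset + count, source.Rows.Count);
        for (int i = offset; i < end; i++)
        {
            page.ImportRow(source.Rows[i]);
        }

        DataSet result = new DataSet();
        result.Tables.Add(page);
        return result;
    }

    static void Main()
    {
        DataTable users = new DataTable("Users");
        users.Columns.Add("ID", typeof(int));
        users.Columns.Add("UserName", typeof(string));
        for (int i = 0; i < 1000; i++)
        {
            users.Rows.Add(i, "user" + i);
        }

        Console.WriteLine(FilterRows(users, "ID < 100").Tables[0].Rows.Count);  // 100
        Console.WriteLine(GetPage(users, 950, 100).Tables[0].Rows.Count);       // 50 (last, partial page)
    }
}
```

Either way the point is the same: the less you hand to the serializer per call, the smaller the intermediate allocations get.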

Debugging

I created a very small remoting sample with just one method that returns a very large dataset (you can find the code for the sample at the bottom of this post), just to see how much memory we were actually using for the serialization (the dataset itself was 102 MB).

I attached to the remoting server with windbg and loaded up sos (.loadby sos mscorwks), and then I set a breakpoint on mscorwks!WKS::gc_heap::allocate_large_object so that I could record the size of the allocation (?@ebx) and the stack (!clrstack) every time we allocated a large object  (I figured this was enough for a rough estimate)

0:004> x mscorwks!WKS*allocate_large*
79ef212d mscorwks!WKS::gc_heap::allocate_large_object = <no type information>
0:004> bp 79ef212d "?@ebx;!clrstack;g"

Lo and behold, the last attempted allocation before the OOM was a whopping 1 142 400 418 bytes (~1 GB!!!! for a 100 MB dataset)

Evaluate expression: 1142400418 = 4417a5a2

OS Thread Id: 0x128c (4)
ESP       EIP    
0454f350 79ef212d [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
...

When you try to allocate an object like that, it needs to be allocated in one contiguous chunk.  Since it is larger than the size of an LOH (large object heap) segment, the GC will try to create a segment the size of the object, and in my case I just didn't have 1 GB of free virtual address space in one contiguous chunk, so the allocation failed with an OOM.
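As a side note, you can see which allocations count as large objects without attaching a debugger. The sketch below assumes the usual threshold of 85,000 bytes for large objects and relies on GC.GetGeneration reporting large objects as generation 2; LohSketch and GenerationOf are hypothetical names used only for illustration.

```csharp
using System;

static class LohSketch
{
    // Allocates a byte array of the given size and returns the GC generation
    // the runtime reports for it right after allocation.
    public static int GenerationOf(int byteCount)
    {
        byte[] buffer = new byte[byteCount];
        return GC.GetGeneration(buffer);
    }

    static void Main()
    {
        // A small array is allocated in the normal heap, typically generation 0.
        Console.WriteLine("1 000 bytes:   gen " + GenerationOf(1000));

        // An array above the large object threshold lands on the LOH,
        // which is reported as generation 2.
        Console.WriteLine("100 000 bytes: gen " + GenerationOf(100000));
    }
}
```

Every one of those large objects needs its own contiguous block, which is why a single 1 GB string buffer is so much harder to satisfy than many small allocations adding up to the same amount.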

Fine, what did I learn from this?  Well, I just confirmed what I already knew, that serialization is very expensive.  In fact, in my case I had to allocate 1 GB to serialize 100 MB, so a factor of 10, and that is not even all...  had I succeeded in allocating this, I would still have had to allocate some more intermediate strings in the neighborhood of a couple of hundred MB, so all in all it seemed like an insurmountable task to serialize a dataset this big.
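To put numbers on that, here is the arithmetic as a small sketch. The two inputs are the value printed by ?@ebx and the 102 MB dataset size from the sample; AllocationMath and its members are hypothetical names.

```csharp
using System;
using System.Globalization;

static class AllocationMath
{
    const double BytesPerMB = 1024.0 * 1024.0;

    // Size of the failed allocation, parsed from the hex value the debugger printed.
    public static long FailedAllocationBytes()
    {
        return Convert.ToInt64("4417a5a2", 16);
    }

    public static double ToMB(long bytes)
    {
        return bytes / BytesPerMB;
    }

    static void Main()
    {
        long bytes = FailedAllocationBytes();
        double allocationMB = ToMB(bytes);
        double dataSetMB = 102.0;   // size of the dataset in the sample

        Console.WriteLine(bytes);                                                               // 1142400418
        Console.WriteLine(allocationMB.ToString("F0", CultureInfo.InvariantCulture) + " MB");   // 1089 MB
        Console.WriteLine((allocationMB / dataSetMB).ToString("F1", CultureInfo.InvariantCulture) + "x"); // 10.7x
    }
}
```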

 

Solutions

I mentioned a few earlier, which basically boil down to: don't serialize datasets this big, and if you must, then go to 64-bit.

I remembered though, that on 1.1 there was an article that had some suggestions on how to optimize the serialization by creating dataset surrogates, i.e. wrapper classes that performed their own serialization rather than using the standard one that remoting uses.  https://support.microsoft.com/kb/829740

I knew things had changed in 2.0 so that article was no longer applicable, but I didn't really know what had changed, so I went on an internet search and found this article, which turned out to explain a lot of good stuff about serialization of datasets.

https://msdn.microsoft.com/en-us/magazine/cc163911.aspx

The article suggests that you should change the serialization method if you need to remote very large datasets.  I did this by adding one single line to the remoting server, before returning the dataset

ds.RemotingFormat = SerializationFormat.Binary;

Then I re-ran the test and didn't get the OOM.  Not only that, but when I ran it through the debugger with the same breakpoint, instead of the 1 GB allocation I ended up with 5 * 240 k allocations and one 225 k allocation used for the serialization (not counting any non-large objects).  That is roughly 1.4 MB of large object allocations instead of more than 1 GB, so memory-wise an improvement of close to three orders of magnitude for one extra line in your code. That's a little bit hard to beat :)
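If you want a quick feel for the difference without setting up remoting and a debugger, something like the sketch below compares the size of the serialized payload for the two formats. It is a minimal sketch under a few assumptions: it targets the .NET Framework (BinaryFormatter is obsolete in newer .NET versions), it uses a MemoryStream in place of the remoting channel, RemotingFormatSketch, BuildDataSet and SerializedSize are hypothetical names, and it only measures the final payload size, not the intermediate allocations I looked at in the debugger.

```csharp
using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class RemotingFormatSketch
{
    // Builds a small table shaped like the one in the sample code.
    public static DataSet BuildDataSet(int rows)
    {
        DataTable dt = new DataTable("Users");
        dt.Columns.Add("ID", typeof(int));
        dt.Columns.Add("FirstName", typeof(string));
        dt.Columns.Add("LastName", typeof(string));
        for (int i = 0; i < rows; i++)
        {
            dt.Rows.Add(i, "Jane", "Doe");
        }

        DataSet ds = new DataSet();
        ds.Tables.Add(dt);
        return ds;
    }

    // Serializes the dataset with BinaryFormatter using the given
    // RemotingFormat and returns the number of bytes written.
    public static long SerializedSize(DataSet ds, SerializationFormat format)
    {
        ds.RemotingFormat = format;
        using (MemoryStream stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, ds);
            return stream.Length;
        }
    }

    static void Main()
    {
        DataSet ds = BuildDataSet(10000);
        Console.WriteLine("Xml:    " + SerializedSize(ds, SerializationFormat.Xml) + " bytes");
        Console.WriteLine("Binary: " + SerializedSize(ds, SerializationFormat.Binary) + " bytes");
    }
}
```

The exact numbers depend on your data, and for very small datasets the binary format may not come out ahead, so measure with something that looks like your real tables.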

 

Have a good one,

Tess

 

 

 

Sample code used for this post

Server:

using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;

namespace MyServer
{
    class Program
    {
        static void Main(string[] args)
        {
            MyServer();
        }

        static void MyServer()
        {
            Console.WriteLine("Remoting Server started...");
            TcpChannel tcpChannel = new TcpChannel(1234);
            ChannelServices.RegisterChannel(tcpChannel, false);

            Type commonInterfaceType = Type.GetType("MyServer.DataLayer");
            RemotingConfiguration.RegisterWellKnownServiceType(commonInterfaceType, "DataLayerService", WellKnownObjectMode.SingleCall);

            Console.WriteLine("Press ENTER to quit");
            Console.ReadLine();
        }
    }

    public interface DataLayerInterface
    {
        DataSet GetDS(int rows);
    }

    public class DataLayer : MarshalByRefObject, DataLayerInterface
    {
        public DataSet GetDS(int rows)
        {
            // build a table of sample user rows to return to the client
            DataTable dt = new DataTable();
            DataRow dr;
            DataColumn dc;

            dc = new DataColumn("ID", typeof(Int32));
            dc.Unique = true;
            dt.Columns.Add(dc);

            dt.Columns.Add(new DataColumn("FirstName", typeof(string)));
            dt.Columns.Add(new DataColumn("LastName", typeof(string)));
            dt.Columns.Add(new DataColumn("UserName", typeof(string)));
            dt.Columns.Add(new DataColumn("IsUserAMemberOfTheAdministratorsGroup", typeof(string)));

            DataSet ds = new DataSet();
            ds.Tables.Add(dt);

            for (int i = 0; i < rows; i++)
            {
                dr = dt.NewRow();
                dr["ID"] = i;
                dr["FirstName"] = "Jane";
                dr["LastName"] = "Doe";
                dr["UserName"] = "jd";
                dr["IsUserAMemberOfTheAdministratorsGroup"] = "No";
                dt.Rows.Add(dr);
            }

            ds.RemotingFormat = SerializationFormat.Binary;      //<-- this line makes a world of difference
            return ds;
        }

    }
}

Client:

using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;
using MyServer;

namespace Client
{
    class Program
    {
        static void Main(string[] args)
        {
            TcpChannel tcpChannel = new TcpChannel();
            ChannelServices.RegisterChannel(tcpChannel, false);

            Type requiredType = typeof(DataLayerInterface);
            DataLayerInterface remoteObject = (DataLayerInterface)Activator.GetObject(requiredType, "tcp://localhost:1234/DataLayerService");

            DataSet ds = remoteObject.GetDS(600000);
            Console.WriteLine("Number of rows in ds: " + ds.Tables[0].Rows.Count.ToString());

            Console.ReadLine();
        }
    }
}