Loading the dictionary, part 6: Taking advantage of our memory allocation pattern


After our latest round of optimization, the 100ms barrier teased us, just milliseconds away. Profiling the resulting program reveals that 60% of the CPU is spent in operator new. Is there anything we can do about that?

Indeed, we can. Notice that the memory allocation pattern for the strings in our dictionary is quite special: Once a string is allocated into the dictionary, it is never modified or freed while the dictionary is in use. When the dictionary is freed, all the strings are deleted at once. This means that we can design an allocator tailored to this usage pattern.

I don't know whether there is a standard name for this thing, so I'm just going to call it a StringPool. A string pool has the following characteristics:

  • Once you allocate a string, you can't modify or free it as long as the pool remains in existence.

  • If you destroy the string pool, all the strings in it are destroyed.

We implement it by using the same type of fast allocator that the CLR uses: A single pointer. [25 May 2005: The blog server software corrupts the diagram, sorry.]

allocated free
p

To allocate memory, we just increment p by the number of bytes we need. If we run out of memory, we just allocate a new block, point p to its start, and carve the memory out of the new block. Destroying the pool consists of freeing all the blocks.

Note also that this memory arrangement has very good locality. Instead of scattering the strings all over the heap, they are collected into one location. Furthermore, they are stored in memory in exactly the order we're going to access them, which means no wasted page faults or cache lines. (Well, you don't know that's the order we're going to access them, but it's true. This is one of those "performance-guided designs" I mentioned a little while ago.)

class StringPool
{
public:
 StringPool();
 ~StringPool();
 LPWSTR AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd);

private:
 union HEADER {
  struct {
   HEADER* m_phdrPrev;
   SIZE_T  m_cb;
  };
  WCHAR alignment;
 };
 enum { MIN_CBCHUNK = 32000,
        MAX_CHARALLOC = 1024*1024 };

private:
 WCHAR*  m_pchNext;   // first available byte
 WCHAR*  m_pchLimit;  // one past last available byte
 HEADER* m_phdrCur;   // current block
 DWORD   m_dwGranularity;
}; // colorization fixed 25 May

Each block of memory we allocate begins with a StringPool::HEADER structure, which we use to maintain a linked list of blocks as well as providing enough information for us to free the block when we're done.

Exercise: Why is HEADER a union containing a structure rather than just being a structure? What is the significance of the alignment member?

inline RoundUp(DWORD cb, DWORD units)
{
    return ((cb + units - 1) / units) * units;
}

StringPool::StringPool()
 : m_pchNext(NULL), m_pchLimit(NULL), m_phdrCur(NULL)
{
 SYSTEM_INFO si;
 GetSystemInfo(&si);
 m_dwGranularity = RoundUp(sizeof(HEADER) + MIN_CBCHUNK,
                           si.dwAllocationGranularity);
}

At construction, we compute the size of our chunks. We base it on the system allocation granularity, choosing the next multiple of the system allocation granularity that is at least sizeof(HEADER) + MIN_CBCHUNK in size. Since a chunk is supposed to be a comfortably large block of memory, we need to enforce a minimum chunk size to avoid having an enormous number of tiny chunks if we happen to be running on a machine with a very fine allocation granularity.

LPWSTR StringPool::AllocString(const WCHAR* pszBegin, const WCHAR* pszEnd)
{
 size_t cch = pszEnd - pszBegin + 1;
 LPWSTR psz = m_pchNext;
 if (m_pchNext + cch <= m_pchLimit) {
  m_pchNext += cch;
  lstrcpynW(psz, pszBegin, cch);
  return psz;
 }

 if (cch > MAX_CHARALLOC) goto OOM;
 DWORD cbAlloc = RoundUp(cch * sizeof(WCHAR) + sizeof(HEADER),
                                                          m_dwGranularity);
 BYTE* pbNext = reinterpret_cast<BYTE*>(
                  VirtualAlloc(NULL, cbAlloc, MEM_COMMIT, PAGE_READWRITE));
 if (!pbNext) {
OOM:
  static std::bad_alloc OOM;
  throw(OOM);
 }

 m_pchLimit = reinterpret_cast<WCHAR*>(pbNext + cbAlloc);
 HEADER* phdrCur = reinterpret_cast<HEADER*>(pbNext);
 phdrCur->m_phdrPrev = m_phdrCur;
 phdrCur->m_cb = cbAlloc;
 m_phdrCur = phdrCur;
 m_pchNext = reinterpret_cast<WCHAR*>(phdrCur + 1);

 return AllocString(pszBegin, pszEnd);
}

To allocate a string, we first try to carve it out of the remainder of the current chunk. This nearly always succeeds.

If the string doesn't fit in the chunk, we allocate a new chunk based on our allocation granularity. To avoid integer overflow in the computation of the desired chunk size, we check against a fixed "maximum allocation" and go stright to the out-of-memory handler if it's too big.

Once we have a new chunk, we link it into our list of HEADERs and abandon the old chunk. (Yes, this wastes some memory, but in our usage pattern, it's not much, and trying to squeeze out those last few bytes isn't worth the added complexity.) Once that's done, we try to allocate again; this second time will certainly succeed since we made sure the new chunk was big enough. (And any decent compiler will detect this as a tail recursion and turn it into a "goto".)

There is subtlety here. Notice that we do not update m_pchNext until after we're sure we either satisfied the allocation or allocated a new chunk. This ensures that our member variables are stable at the points where exceptions can be thrown. Writing exception-safe code is hard, and seeing the difference between code that is and isn't exception safe is often quite difficult.

StringPool::~StringPool()
{
 HEADER* phdr = m_phdrCur;
 while (phdr) {
  HEADER hdr = *phdr;
  VirtualFree(hdr.m_phdrPrev, hdr.m_cb, MEM_RELEASE);
  phdr = hdr.m_phdrPrev;
 }
}

To destroy the string pool, we walk our list of chunks and free each one. Note the importance of copying the HEADER out of the chunk before we free it!

Using this string pool requires only small changes to the rest of our program.

struct DictionaryEntry
{
 bool Parse(const WCHAR *begin, const WCHAR *end, StringPool& pool);
 // void Destruct() {
 //  delete[] m_pszTrad;
 //  delete[] m_pszSimp;
 //  delete[] m_pszPinyin;
 //  delete[] m_pszEnglish;
 // }
 LPWSTR m_pszTrad;
 LPWSTR m_pszSimp;
 LPWSTR m_pszPinyin;
 LPWSTR m_pszEnglish;
};

class Dictionary
{
public:
 Dictionary();
 // ~Dictionary();
 int Length() { return v.size(); }
private:
 vector v;
 StringPool m_pool;
};

// Dictionary::~Dictionary()
// {
//  for (vector<DictionaryEntry>::iterator i = v.begin();
//       i != v.end(); i++) {
//   i->Destruct();
//  }
// }

We no longer need to free the strings in the DictionaryEntry manually, so the Destruct method and the Dictionary destructor can go.

bool DictionaryEntry::Parse(
   const WCHAR *begin, const WCHAR *end,
   StringPool& pool)
{
 const WCHAR* pch = std::find(begin, end, L' ');
 if (pch >= end) return false;
 m_pszTrad = pool.AllocString(begin, pch);
 begin = std::find(pch, end, L'[') + 1;
 if (begin >= end) return false;
 pch = std::find(begin, end, L']');
 if (pch >= end) return false;
 m_pszPinyin = pool.AllocString(begin, pch);
 begin = std::find(pch, end, L'/') + 1;
 if (begin >= end) return false;
 for (pch = end; *--pch != L'/'; ) { }
 if (begin >= pch) return false;
 m_pszEnglish = pool.AllocString(begin, pch);
 return true;
}

Dictionary::Dictionary()
{
 ...
    if (de.Parse(buf, buf + cchResult, m_pool)) {
 ...
}

And finally, we pass our string pool to DictionaryEntry::Parse so it knows where to get memory for its strings from.

With these changes, the dictionary loads in 70ms (or 80ms if you include the time it takes to destroy the dictionary). This is 70% faster than the previous version, and over three times as fast if you include the destruction time.

And now that we've reached our 100ms goal, it's a good time to stop. We've gotten the running time of dictionary loading down from an uncomfortable 2080ms to a peppier 70ms, a nearly 30-fold improvement, by repeatedly profiling and focusing on where the most time is being spent. (I have some more simple tricks that shave a few more milliseconds off the startup time. Perhaps I'll bring them into play if other changes to startup push us over the 100ms boundary. As things stand, the largest CPU consumers are MultiByteToWideChar and lstrcpynW, so that's where I would focus next.)

That's the end of the first stage. The next stage will be displaying the dictionary in an owner-data listview, but you'll have to wait until next month.

Comments (25)
  1. Pete says:

    You missed a closing tag somewhere – the whole thing is blue.

  2. Tim Smith says:

    I’ve used this technique for symbol tables. It makes it very easy to remove symbols from the table in bulk by moving the end of the symbol table back (and the hash entry table).

  3. Dan McCarty says:

    This might be a dumb question, but why is MIN_CBCHUNK set to 32,000 instead of 32K?

  4. Ben Hutchings says:

    Maybe you should decide whether the type of a memory size is DWORD, SIZE_T or size_t. Then use that as the return type and argument type of RoundUp rather than relying on the "implicit int" rule which was removed from C++ about, oh, 15 years ago.

  5. Jon Asbury says:

    "I don’t know whether there is a standard name for this thing"

    Wouldn’t it be called a stack. In this case, you are always pushing a never popping. When you want to pop all, you simply reset the stack pointer.

    Many years ago the version of Pascal I used supported this method of allocation in the language. I can’t remember all the keywords, but it had a way of marking the stack and then releasing all the allocations back to the mark.

  6. autist0r says:

    All this work has already be done… Simply use HeapCreate() && HeapAlloc() !

  7. mfx says:

    This storage scheme is an instance of an "arena"-based allocator.

    I first encountered it in Fraser and Hansons book "A retargetable C compiler", in which the implemention of the lcc compiler is described in detail.

  8. EricLippert says:

    Indeed, this is an excellent technique for implementing programming languages. The MSFT script engines use this technique extensively.

    Building a parse tree, for example, requires many many small allocations all of which will be freed at exactly the same time: when the code generator is done walking the parse tree.

  9. Ulric says:

    Is there an advatage to using reinterpret_cast<type> instead of the good old standard cast (type) ?

  10. Mike says:

    Ulric asked:

    "Is there an advatage to using reinterpret_cast<type> instead of the good old standard cast (type) ?"

    Yes.

    The advantage is that you have clearly marked "What I’m noing now is likely ugly! Be aware!". Old-style C casts are easily missed, making it seem like something common, when in fact casting in a strongly typed language is something exceptional and should be marked so.

  11. Well, it’s time for me to surrender.&amp;nbsp; Sort of :)

    Raymond pulls out all the stops in his sixth version….

  12. Michael Bacarella says:

    "Exercise: Why is HEADER a union containing a structure rather than just being a structure? What is the significance of the alignment member?"

    I didn’t follow the code very closely because C++ makes my head hurt, but I’m assuming the reason is alignment issues.

    The CPU is really picky about the way you ask for values stored at an address. If you ask it to look up any address that isn’t cleanly divisible by some specific number, it throws an exception. Compilers hide this for you, and more low level stuff, like malloc(), have system-dependent information built into the library that go far from preventing this from happening.

    When you implement your own memory allocation scheme, you get exposed to the ugly underbelly of CPU memory addressing.

    The union trick gets the compiler to round up to a certain CPU happy multiple, so the CPU doesn’t end up with an address that makes it choke.

  13. I’m curious as to why you did not use placement new / delete to implement the pool allocator? That way you wouldn’t have dramatically different syntax (stuff like pool.AllocString() for example).

  14. Why not do the following?

    1. Preprocess the file into a stream of (chinesepinyinenglish)* (in Unicode)

    2. Read the file into memory

    3. Append an extra just in case

    4. Loop through the data, count number of dictionary entries

    5. Allocate precise number of dictionary entries

    6. Loop through the data, populate the pointers in the dictionary entries

    Actually, even that can be improved.

    1. Preprocess the file into a stream of (chinesepinyinenglish)* (leave in Big5)

    2. Make the file a resource and compile it into the executable

    3. Load the resource into memory

    4. Allocate the precise number of dictionary entries (since you preprocessed it, you probably know this)

    5. Loop through the data, populate the dictionary entries with pointers

    Note that leaving it as Big5 here will substantially reduce your working set. I’ll argue that the extra calls to MultiByteToWideChar in your GUI are worth the smaller dataset of leaving the data (which is mostly 7-bit anyway) in Big5.

    Plus, making the dictionary part of the executable makes the program easier to distribute.

    In summary, this removes the overhead of

    (a) MultiByteToWideChar

    (b) String allocations

    (c) Parsing

    Sounds like it would be pretty fast to me.

  15. asdf says:

    To clarify, when I say cast back when talking about reinterpret_cast I mean reinterpret_cast back, you cannot reinterpret_cast to one type and static_cast back to the other and expect that to be work in general.

  16. asdf says:

    What you did is called an arena allocator (I also first encountered it in Hansen’s excellent book). A pool allocator for the most part is a freelist of fixed sized objects. I’ve seen that claim before about the CLR allocator and I don’t believe it’s that simple given that it has to support pinned objects.

    You can throw OOM with "throw std::bad_alloc()" instead of creating a static object. You should stop using DWORD/int when you mean size_t, ptrdiff_t, container::size_type, or container::distance_type. Please *DO NOT* use a windows header definition of those types or their bastardized all caps version, they’ve historically botched the actual type by typedeffing it to a type that just happens to be the same size. Not so important in C (except when creating a pointer to it) but this is extremely important in C++’s very subtle overload resolution (VC has a stupid quirk where int[op]long returns int instead of long, now you know where it comes from). Stop going past the end of the array and stop assuming relational operators work when you’re not comparing elements of an array. Stop reinterpret_casting to/from void pointer, static_cast is the thing you want. My guideline on reinterpret_cast is that there are only 2 places where it can be used: 1. casting to/from a pointer to an integer big enough to hold it so you can do non-portable, implementation defined things to it 2. casting a pointer to a suitably aligned pointer and back. The second case is where you can reinterpret_cast to void *, but ONLY if you cast back to the original type. This style of code has been implementation defined when C correctly replaced all questionable instances of char with void:

    foo *f;

    unsigned char *uc;

    uc = (unsigned char*)f;

    // aka uc = reinterpret_cast<unsigned char*>(f);

    it’s always portable and correct to do this:

    uc = (unsigned char*)(void*)f;

    // aka uc = static_cast<unsigned char*>(static_cast<void*>(f));

    You may think I’m being anal (and I am), but compilers are increasingly becoming standards compliant in terms of supported and unsupported behavior so I think it’s much better to learn what C/C++ actually guarantees over what compiler switches your current compiler guarantees.

  17. Frederik Slijkerman says:

    The goto is unnecessary — you can just replace it by throw(OOM) which is about the same number of characters, and put the OOM declaration a bit further up.

  18. Mihai says:

    </font>

    Just tried to close the blue font tag :-)

  19. Anonymous says:

    </font>

    > Just tried to close the blue font tag :-)

    It worked! It worked! It just took a week to work.

  20. DannyG says:

    First, big thanks for the very interesting series.

    Just one thing – I wonder, when you originally wrote the dictionary application, did you actually follow all the steps you’ve shown here? What I mean is, did you start with the pure STL version, and then slightly moved on to what we see now or, for example, the memory-mapped file was there from the very beginning, and you just wrote the STL stream version for the blog’s sake?

    Sorry, I didn’t ask this earlier but [Raymond was currently on vacation].

  21. Filling a listview with tens of thousands of items.

  22. Owner-data listviews let you take over data management from the listview.

  23. Finally we start searching.

  24. Belated answers to exercises and other questions.

  25. Well, it’s time for me to surrender. Sort of :) Raymond pulls out all the stops in his sixth version

Comments are closed.