Loading the dictionary, part 3: Breaking the text into lines


Even after moving the character conversion out of the getline function, profiling reveals that getline is still taking nearly 50% of our CPU. The fastest code is code that isn't there, so let's get rid of getline altogether. Oh wait, we still need to break the file into lines. But maybe we can break the file into lines faster than getline did.

#include <algorithm>

class MappedTextFile
{
public:
 MappedTextFile(LPCTSTR pszFile);
 ~MappedTextFile();

 const CHAR *Buffer() const { return m_p; }
 DWORD Length() const { return m_cb; }

private:
 PCHAR   m_p;
 DWORD   m_cb;
 HANDLE  m_hf;
 HANDLE  m_hfm;
};

MappedTextFile::MappedTextFile(LPCTSTR pszFile)
    : m_hfm(NULL), m_p(NULL), m_cb(0)
{
 m_hf = CreateFile(pszFile, GENERIC_READ, FILE_SHARE_READ,
                   NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
 if (m_hf != INVALID_HANDLE_VALUE) {
  DWORD cb = GetFileSize(m_hf, NULL);
  m_hfm = CreateFileMapping(m_hf, NULL, PAGE_READONLY, 0, 0, NULL);
  if (m_hfm != NULL) {
   m_p = reinterpret_cast<PCHAR>
                 (MapViewOfFile(m_hfm, FILE_MAP_READ, 0, 0, cb));
   if (m_p) {
    m_cb = cb;
   }
  }
 }
}

MappedTextFile::~MappedTextFile()
{
 if (m_p) UnmapViewOfFile(m_p);
 if (m_hfm) CloseHandle(m_hfm);
 if (m_hf != INVALID_HANDLE_VALUE) CloseHandle(m_hf);
}

This very simple class babysits a read-only memory-mapped file. (Yes, there is a bit of oddness with files greater than 4GB, but let's ignore that for now, since it's a distraction from our main point.)

Now that the file is memory-mapped, we can just scan it directly.

Dictionary::Dictionary()
{
 MappedTextFile mtf(TEXT("cedict.b5"));
 typedef std::codecvt<wchar_t, char, mbstate_t> widecvt;
 std::locale l(".950");
 const widecvt& cvt = _USE(l, widecvt); // use_facet<widecvt>(l);
 const CHAR* pchBuf = mtf.Buffer();
 const CHAR* pchEnd = pchBuf + mtf.Length();
 while (pchBuf < pchEnd) {
  const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');
  if (*pchBuf != '#') {
   size_t cchBuf = pchEOL - pchBuf;
   wchar_t* buf = new wchar_t[cchBuf];
    mbstate_t state = {0};
    const char* nextsrc;
   wchar_t* nextto;
   if (cvt.in(state, pchBuf, pchEOL, nextsrc,
                   buf, buf + cchBuf, nextto) == widecvt::ok) {
    wstring line(buf, nextto - buf);
    DictionaryEntry de;
    if (de.Parse(line)) {
     v.push_back(de);
    }
   }
   delete[] buf;
  }
  pchBuf = pchEOL + 1;
 }
}

We simply scan the memory-mapped file for a '\n' character, which marks the end of the line. That gives us the location and length of the line, which we use to convert it to Unicode and continue our parsing.

Exercise: Why don't we have to worry about the carriage return that comes before the linefeed?

Exercise: Why don't we have to worry about possibly reading past the end of the file when we check *pchBuf != '#'?

With this change, the program now loads the dictionary in 480ms (or 550ms if you include the time it takes to destroy the dictionary). That's over twice as fast as the previous version.

But it's still not fast enough. A half-second delay between hitting Enter and getting the visual feedback is still unsatisfying. We can do better.

[Raymond is currently on vacation; this message was pre-recorded.]

Comments (28)
  1. Davy says:

    Exercise: Why don’t we have to worry about possibly reading past the end of the file when we check *pchBuf != ‘#’?

    because that check is done at:

    while (pchBuf < pchEnd)

  2. stefang says:

    Hmm

    Something is strange here.

    When I run your version 2 and 3 I get the following times (constructor only) :

    version 2: 570 ms

    version 3: 430 ms

    So, on my system version 3 is only 25% faster.

    This is far from over twice as fast.

    Are you sure that you did not change anything else between the two versions?

    Perhaps you changed to a single-threaded library, or you changed some optimization setting?

    What compiler are you using?

    What does your compiler command line look like?

    My profiling of your version 2 indicates that the biggest problem is still in the character conversion.

    40% is spent in do_in, only 12% in getline

    I am using VS 2003 with a single-threaded library with /O2 optimization

    If I dig a little deeper in the profiling of do_in, I find that it is still calling MultiByteToWideChar once for each character in the file.

    A much better solution is to call MultiByteToWideChar manually for each line instead of using codecvt.

  3. ricom says:

    Raymond’s on vacation but maybe I can offer a consolation reply. As I recall he did the time testing on an older machine of his at home. It’s entirely possible that different microarchitectures have different costs.

    I have a copy of exactly what Raymond built and I’ll be posting another analysis on my own blog later today and we’ll see if my measurements (on modern hardware) are more in line with what you see.

    I suspect he built it with /O2 and that’s all but I’m not sure it would matter much because virtually all the cost (to-date) has been in the libraries.

  4. srs says:

    What are you guys using to profile the code? Is there some tool in vs.net that I’m not aware of?

  5. Alex says:

    What tool do you use to profile your code?

    Are there any good free (or open source) profilers out there and what’s the best commercial one?

  6. The fun continues as today we look at Raymond’s third improvement.

    Raymond starts using some pretty…

  7. Tim Smith says:

    This program hits a lot of different elements of the computer: disk drives, memory, and CPU. Each different hardware configuration will have bottlenecks and performance issues manifest in different areas.

  8. Dean Harding says:

    Perhaps the codecvt class in VS.NET 2005 does the whole string at once, while the one in VS.NET 2003 does it one character at a time? I think Raymond’s using 2005 (from his last post anyway).

  9. Matt Barry says:

    I have a question. Is it possible for a control character such as ‘\n’ to appear in the trail byte of a DBCS character? If the answer is yes, will std::find safely ignore these bytes?

  10. Berlin Brown says:

    I have a side question though. Why aren’t there newline or any extra features in notepad.exe. I know a lot of us use notepad for the most basic of files. I was wondering why there isn’t support of different encodings and such?

  11. Lucas says:

    Berlin Brown:

    Because (as answered in another post) notepad is a simple wrapper around the standard windows Edit control (and wordpad a wrapper around the RichEdit control).

  12. I’m wondering how long it will take before the entire C++ runtime library is gone from this example :)

  13. while (pchBuf < pchEnd) {

    const CHAR* pchEOL = std::find(pchBuf, pchEnd, '\n');

    pchBuf = pchEOL + 1;

    }

    The last line had better end with ‘\n’. Otherwise you’ll get extremely poor performance.

  14. Ben Hutchings says:

    I don’t like the way the MappedTextFile constructor silently fails. I think you should be using exceptions to report errors. Much of the standard library can throw exceptions anyway, so you have no hope of avoiding them.

    indranilbanerjee wrote: "The last line had better end with ‘\n’. Otherwise you’ll get extremely poor performance"

    If std::find() doesn’t find the target value, it returns the "last" iterator (pchEnd). This does mean that "pchBuf = pchEOL + 1;" could potentially put pchBuf beyond the end of the buffer, which in general has undefined behaviour, but by using memory-mapped files Raymond has already stepped outside portable C++.

  15. Nikos,

    File mapping IS the disk cache on NT. The disk cache on NT works by mapping the file into memory and then reading from the mapped section.

  16. In your constructor’s intializer list, the order of initialization should match the order of declaration for your members.

  17. Dean Harding says:

    File mapping IS the disk cache on NT. The disk cache on NT works by mapping the file into memory and then reading from the mapped section.

    But I don’t see, then, how a memory-mapped file can be much faster than just doing sequential reads? Unless every call to ReadFile results in a switch to kernel-mode and back again…

  18. nikos says:

    my personal experience from mapped files is at best a mixed bag

    i’m using a read-only filemap like raymond’s to search for text in files (ok a bit more complex since i’m not mapping all the file at once caring for huge files, win9x etc, but basically the same)

    many times i feel that straight CreateFile/ReadFile is faster. My suspicions are:

    [*] the overhead associated with file mapping is significant — especially so for small files

    [*] somehow file mapping misses the disk cache. So whereas repeated searches the simple way (ReadFile) are practically instantaneous from the 2nd time on (the file is read from the cache rather than HD), searches based on file mapping take more or less the same time, be it the first or 10th scan

    any feedback on performance? It isn’t inconceivable I’m doing something wrong, but the read loop is so simple so…

    thanks

  19. cjm says:

    Dean,

    Every call to ReadFile goes to kernel mode. File mappings only need to go to kernel mode (through a page fault) if a page that has been touched does not have the data it should have on it.

  20. Dean Harding says:

    Ah, makes sense then. Thanks!

  21. indranil banerjee says:

    I assume Raymond has a good reason for using the multithreaded C runtime library.

    Looking at the analysis that ricom has done over at http://blogs.msdn.com/ricom/, Locking plays a major part in the poor performance.

  22. stefang says:

    Nikos:

    You must be doing something wrong. I have never seen reading from mapped files be slower than traditional ReadFile.

    Can you produce an example program that demonstrates what you are saying?

  23. srs says:

    Would reading a file from start to finish (just once, then closing it) be faster with a mapped file, or would it be the same as using ReadFile?

  24. nikos says:

    Larry

    I’m not trying to dispute you or anything. But I’m definite that repeated reads of a file while mapped don’t get any faster, whereas normal reads get *much* faster from the 2nd time onwards. This is from experience.

    It’s also different in 2000/XP: the former gets slightly faster even in mapping reads, whereas XP is a total dog in this respect :)

  25. nikos says:

    > I have never seen reading from mapped files
    > be slower than traditional ReadFile

    i didn’t say that; i said that subsequent reads get faster with readfile but not with mapped files

    > Can you produce an example program

    > that demonstrates what you are saying

    you asked for it :)

    the ReadFile method is found in my old file manager, 2xExplorer:

    http://netez.com/2xExplorer

    the new and supposedly improved memory mapped file version is within xplorer2:

    http://zabkat.com

    both programs have a Mark menu with a command to "search for text in files" <ctrl+G>

    try it in a large folder with lots of files (like the platform SDK headers dir) looking for some text that doesn’t exist (so all files are read completely)

    you’ll find that in repeated searches for the same text, the old 2xExplorer is faster. Could be that mapped files are $#!+e or that i’ve foobar’d something somewhere :)

  26. stefang says:

    Well, I have tried both your programs, and I see no noticeable difference between the two.

    Both programs behave just as you might expect:

    The first time you run a search it is very slow because it is actually reading from the disk.

    The second time, it is very fast and the disk is not accessed at all.

    This is exactly as it is supposed to work.

    I am running on Windows XP SP2 with a 1.2 GHz Pentium 4 and 700 MB memory.


Comments are closed.
