Excursions in composition: Adding rewind support to a sequential stream


Here's a problem "inspired by actual events":

I have a sequential stream that is the response to a request I sent to a web site. The format of the stream is rather messy; it comes with a variable-length header that describes what type of data is being returned. I want to read that header and then hand the stream to an appropriate handler. But the handlers expect to be given the stream in its entirety, including the bytes that I have already read. Since this is a sequential stream, I can't change the seek position. How can I "unread" the data and give the handlers what they want?

Right now, I'm just closing the stream I get and issuing a second request. I give that second stream to the handler that I determined by parsing the first stream. Of course, this makes the rather unsafe assumption that the server will send the same data stream back the second time, and I'm issuing twice as many requests as I really need.

I tried reading the entire stream into memory and creating a new stream on that, but the data stream can be very big, and I feel bad allocating all that memory just so I can "unread" a few dozen bytes.

All the customer wants here is to be able to "unread" some bytes from a stream. This is a perfect job for our composite sequential stream. The first stream consists of the bytes we want to push back, and the second stream is the rest of the stream. Here's an explanation in pictures. The pink stream is the original stream returned from the web site, and the green stream remembers the bytes we want to push back.

Initial state:

  ABCDEFGHIJK...

After reading a few bytes:

 ABCD EFGHIJK...

After reading a few more bytes:

 ABCDEFG HIJK...

Finally, we have determined the handler, but the handler expects to start reading from the "A", so we will take these two streams and concatenate them so they look like one stream again:

 ABCDEFG HIJK...

Okay, let's do it. All we have to worry about is filling the green stream with the bytes that we read out of the pink stream; we're going to use CConcatStream to do the composition at the end.

class CRewindStream : public CROSequentialStreamBase
{
public:
 CRewindStream(ISequentialStream *pstm);
 ISequentialStream *Rewind();

 // *** ISequentialStream ***
 STDMETHODIMP Read(void *pv, ULONG cb, ULONG *pcbRead);

protected:
 ~CRewindStream();

 bool m_fRewound;
 IStream *m_pstm1;
 ISequentialStream *m_pstm2;
};

CRewindStream::CRewindStream(ISequentialStream *pstm)
 : m_fRewound(false), m_pstm2(pstm)
{
 CreateStreamOnHGlobal(NULL, TRUE, &m_pstm1);
 m_pstm2->AddRef();
}

CRewindStream::~CRewindStream()
{
 if (m_pstm1) m_pstm1->Release();
 m_pstm2->Release();
}

HRESULT CRewindStream::Read(void *pv, ULONG cb, ULONG *pcbRead)
{
 ULONG cbRead = 0;
 HRESULT hr;
 if (m_fRewound) {
  hr = E_FAIL;
 } else if (!m_pstm1) {
  hr = E_OUTOFMEMORY;
 } else {
  hr = m_pstm2->Read(pv, cb, &cbRead);
  if (SUCCEEDED(hr)) {
   hr = m_pstm1->Write(pv, cbRead, NULL);
  }
 }
 if (pcbRead) *pcbRead = cbRead;
 return hr;
}

ISequentialStream *CRewindStream::Rewind()
{
 if (!m_pstm1 || m_fRewound) return NULL;

 m_fRewound = true;
 const LARGE_INTEGER li0 = { 0, 0 };
 m_pstm1->Seek(li0, STREAM_SEEK_SET, NULL);
 return new CConcatStream(m_pstm1, m_pstm2);
}

Our RewindStream takes a sequential stream and creates a memory stream via CreateStreamOnHGlobal to remember the parts that we read so we can stick them back when it's time to rewind.

When reading from the stream, we read from the sequential stream and append the result to the memory stream before returning it. In this manner, each byte read from the head of the pink stream gets appended to the end of the green stream.

When it's time to rewind, we seek the memory stream back to the beginning and create a concatenated stream out of the memory stream and the unread portion of the sequential stream.

Here's a simple program that illustrates our new rewindable sequential stream:

int __cdecl _tmain(int argc, TCHAR **argv)
{
 CoInitialize(NULL);
 IStream *pstmFile;
 if (SUCCEEDED(SHCreateStreamOnFile(argv[1], STGM_READ,
                                    &pstmFile))) {
  CRewindStream *pstmRewind = new CRewindStream(pstmFile);
  if (pstmRewind) {
   char ch;
   ULONG cb;
   while (SUCCEEDED(pstmRewind->Read(&ch, 1, &cb)) && cb) {
    printf("Header: '%c'\n", ch);
    if (ch == ' ') {
     ISequentialStream *pstmRewound = pstmRewind->Rewind();
     if (pstmRewound) {
      PrintStream(pstmRewound);
      pstmRewound->Release();
     }
     break;
    }
   }
   pstmRewind->Release();
  }
  pstmFile->Release();
 }
 CoUninitialize();
 return 0;
}

For illustration purposes, let's assume that the header ends when we find a space. We create a stream on the file whose name is passed on the command line and put it inside a CRewindStream. We then read from that stream a byte at a time until we find that space. (To prove that we're really rewinding, we also print each byte from the header as we encounter it.) Once we find the space, we ask the CRewindStream to create a new stream that represents the original stream rewound back to the beginning. We pass that stream to the PrintStream helper function from last time, to prove that the resulting stream really is the entire file contents starting from the beginning.

Notice that we consumed only as much memory as necessary to remember the parts of the stream that needed to be replayed. Even if the file were a megabyte in size, we only need to remember the bytes that we read up until we decided that we were finished reading the header.

Now, this isn't the most beautiful implementation of a rewindable stream, but it was convenient for expository purposes. I leave you to make as many prettifications as meet your aesthetic requirements.

Exercise: Discuss what would be needed in order to support rewinding more than once and the performance consequences thereof.

Comments (15)
  1. BryanK says:

    Oops, I need to proofread better.  In the second paragraph, last sentence:

    s/local variable/member variable/

  2. JP says:

    BryanK, the problem there would be once your "content handler" starts processing the whole multi-megabyte stream, you’re still Writing that onto your memory stream, so you would need another function to flag that we’re done Rewinding.

  3. Angstrom says:

    Isn’t this the sort of thing you’re supposed to use the Content-Type header for?

  4. Tim says:

    @Angstrom:

    Without wanting to put words in Raymond’s mouth, his blog entries are often of a particular scope and (even when inspired by actual events, as in this case) intended to be pedagogical.

    In other words, he’s generally showing you either a feature you might not know about, and/or how to use that feature correctly.

    In other words, your Content-Type question is tangentially relevant at best.  Imagine this same article, but it’s not about a stream coming from a website/http, but from somewhere else.

    (I’m not having a go, just pointing out the facts :-))

  5. manuelg says:

    Implemented this in Python recently, posted in the Python Cookbook

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/502304

    I needed to "pushback" onto the "stream", and also test if the stream is empty, without side-effects (i.e. without eating a character of the stream during the test).

    Python is slow, but you can squeeze some efficiency back from Python with minimal pain, because Python accepts an "iterator" (Python’s name for a stream) in most places where Python accepts a list.  If Python sees a "next" method, it is happy, you don’t have to do any "Java-esque" Object Oriented interface push-ups.

    Implementation follows:

    ‘class pushback_wrapper(object):

    ‘    

    ‘    def init(self, it):

    ‘        self.it = it

    ‘        self.pushed_back = []

    ‘        

    ‘    def iter(self):

    ‘        return self

    ‘    

    ‘    def nonzero(self):

    ‘        if self.pushed_back:

    ‘            return True

    ‘        try:

    ‘            self.pushed_back.insert(0, self.it.next())

    ‘        except StopIteration:

    ‘            return False

    ‘        else:

    ‘            return True

    ‘    

    ‘    def next(self):

    ‘        try:

    ‘            return self.pushed_back.pop()

    ‘        except IndexError:

    ‘            return self.it.next()

    ‘        

    ‘    def pushback(self, item):

    ‘        self.pushed_back.append(item)

  6. scott says:

    Is this problem "inspired by" Internet Explorer?  I feel like I’ve been on the server side of "closing the stream I get and issuing a second request", having to deal with the effects of that "solution".

  7. ac says:

    OT, but I loved the hanselminutes interview

  8. BryanK says:

    Haven’t thought about this much, but I’d guess that the easiest way to rewind more than once is to make a new CRewindStream whose constructor is passed the result of the Rewind call on the original CRewindStream.  Of course this ends up creating a new stream every time you need to rewind a stream, and I’m sure there are more efficient ways to do it.  But it might be workable, depending on the specific requirements.

    Actually, it’d probably be better to have Rewind affect the current stream instead of returning a new stream, if you have to rewind more than once.  Make the CRewindStream read from m_pstm2 when it’s first created, but have a member boolean that you can set to make it read from m_pstm1 instead.  Then, in the Rewind call, set that local variable and do a Seek to the beginning of m_pstm1 (so reads come from the start of it, not the end).

    Read would have to check this member boolean, and read from the correct sub-stream.  Read would also have to swap from m_pstm1 forward to m_pstm2 if it gets to the end of m_pstm1 (just like the concatenation stream from yesterday did), and in that case it’d also have to reset the member variable.  Finally, the interface would of course change; you’d set up a stream, Read() from it N times, then call Rewind() and Read() again (N more times).  And then of course you can Rewind() again, and Read() again N *more* times.  Etc., etc.

    This has the advantage of not re-duplicating the header data if there are multiple Rewind calls (e.g. a stream that rewinds four times won’t store four copies of the first few bytes, where my first design would have, since the memory-streams weren’t shared).  But it’s more complicated, too; it’s a merging of the concatenation stream from yesterday and the rewindable stream from today.

  9. BryanK says:

    Dang, you’re right.  You’d end up with two copies of the entire stream, which is likely to be more memory than N copies of the first M bytes (for my first solution: N rewinds and M bytes read on average on each of them).  Plus reads would take twice as long.  Hmm…  I’m not sure how to fix that short of adding the new function to flag the end of rewinding, as you say.

    I know — you could read everything out of the ISequentialStream and write it into a memory stream, Release the ISequentialStream, and use the memory stream instead!  Then it’s Seek()able, and therefore can be rewound, and you only have one copy of the data!  :-P

    (Yes, yes, I know — the ISequentialStream’s contents may not all be in memory at once.  E.g., network-type streams would pull data from a socket as it’s requested.)

    I think C++ will let you create another overloaded CRewindStream constructor whose parameter is a concatenation stream (instead of the ISequentialStream interface), right?  If so, you could make that constructor, and expose some of the internals of the concatenation stream to the rewind stream.  Then the rewind stream could use those internals to store each byte read from the final underlying stream only once, when its constructor gets a concatenation stream.  (Of course this requires only one non-concatenated stream to begin with; if your first stream is a concatenated one whose first sub-stream doesn’t have the exposed buffer, then it will break pretty horribly.)

    And this is an even more-complicated solution than the one above that stores stuff twice (exposing class internals to that extent is Really Bad).  Probably the only way that will work reliably and not copy the entire stream is to create a manual tree of CRewindStream wrappers.

    Actually, that’s not as bad as I thought at first — you’ll only have one previous memory stream hanging around, not all of them.  The reason is, the first Rewind() should be followed by a Release() on the CRewindStream that Rewind was called on (since it shouldn’t be used again), which will release m_pstm1 and m_pstm2.  The stream returned from Rewind will still have a reference, so they won’t go away.  But after you call Rewind again, you’ll Release the second CRewindStream, which will Release the concatenation stream, and the memory stream from the first CRewindStream *will* go away at that point.

    You could even make the concatenation stream release its first stream once it Reads to the end of it (since the base streams are sequential, the stream interface is useless after you read to its end), which could release the memory stream even earlier, depending on the usage.

    Unless I’m forgetting something else.  ;-)

  10. meh says:

    Yes, the interview was very nice.

  11. Igor says:

    Someone here is doing what they shouldn’t most likely due to the poor design so this exercise is pretty pointless.

    Handler shouldn’t need the information which has already been consumed in order to determine the handler.

    Parser shouldn’t consume information which is vital to the handler.

    Either way, it is completely broken.

    [Then I guess the HTML META tag is completely broken. So too are the various image file formats. -Raymond]
  12. BryanK says:

    HTML META tag — yeah, I’d say that is pretty broken.  But then, I’d say that a lot of HTML is pretty broken, too.  ;-)

    The META tag is the only way to override broken web server configurations that don’t specify a character set or don’t specify the correct content-type; the proper fix is to use a web server that does specify the proper character set or content-type for your files.  (And for shared hosting, that would be a web server that allows these values to be changed per-file, e.g. by using another control file per-host, or separate settings per-host that can be changed by the people being hosted.)

    Image file formats: ideally, you’d be able to use the content type (or some other out-of-band method) to decide on your handler, but if the image is sitting on a local disk, you don’t have a content type available (on most FSes, anyway).  Without that, you’re right, something like this is needed.

  13. Igor says:

    Raymond said:

    “Then I guess the HTML META tag is completely broken. So too are the various image file formats.”

    There, you said it and I agree. They are broken and this is a fix but from the wrong side. Like trying to plug the pipe with your finger instead of closing the valve.

    That is why most software suck — because it is built as an ugly hack around a broken or completely flawed specifications.

    [You can solve the problem your way (refusing to write this code and demanding instead that the HTML and image format specifications be fixed); I’ll solve it my way. We’ll see who’s finished first. -Raymond]
  14. Igor says:

    BryanK:

    You see, this is the heart of the problem:

    You either:

    1. Pass the image completely to the handler and let it determine if it can handle it

    or:

    1. Determine correct image type and pass the rest to the handler which does not waste time performing the same checks again but instead just decodes it

    What Raymond has done here is a workaround for two pieces of code (a parser and a handler) which were never meant to work together. How can you tell that? Because one part of their work is overlapping. What we end up is a waste of resources in the name of "reusable" code.

Comments are closed.

Skip to main content