Programmatic access to closed captioning data in Media Center

One of the folks I talked to at CES 2006 last week asked me how to use the Media Center extensibility APIs to access the closed captioning (teletext) stream from a TV broadcast. At a high level, he wanted to be able to monitor some television channels, store the teletext data in a database, and then search that data later on. I asked around a little bit when I got back to the office this week and found that we don't have any officially supported means of accessing teletext data with our extensibility APIs.

However, I also found that Stephen Toub has written an interesting white paper, a blog post, and some excellent sample code that parses and exposes all of the closed captioning data from an NTSC, non-high-definition DVR-MS file. Here are links to the things he has written about the DVR-MS file format and how to parse it:

  • Fun with DVR-MS white paper - discusses the DVR-MS file format, introduces DirectShow, and shows how to use DirectShow to work with DVR-MS files; you should read this first to get an overview of DVR-MS files and the data they contain in addition to the video stream
  • DVR-MS closed captioning parser - blog post (which is as in-depth and well written as a white paper) that introduces a managed library Stephen wrote to parse closed captioning data
  • Sample code for closed captioning parser

Using and extending the techniques that Stephen describes in these articles should allow you to parse a closed captioning (teletext) stream from a recorded TV show. Once you have parsed the data, you can store it in a database or manipulate it in other ways as needed for your scenarios. For inspiration, check out the list of cool ideas at the bottom of Stephen's blog post...
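
To make the "store it in a database and search it later" part a little more concrete, here is a rough sketch of what that step might look like. It is written in Python with SQLite purely for illustration, and it does not use Stephen's library or any Media Center API. It assumes the captions have already been parsed out of the recording into (offset in seconds, text) pairs by some other means. The table layout and the function names open_caption_store, store_captions and search_captions are hypothetical names I made up for this example.

```python
import sqlite3


def open_caption_store(path=":memory:"):
    """Open (or create) a SQLite database with a hypothetical captions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captions ("
        " channel TEXT NOT NULL,"
        " offset_seconds REAL NOT NULL,"
        " text TEXT NOT NULL)"
    )
    return conn


def store_captions(conn, channel, captions):
    """Store already-parsed captions, given as (offset_seconds, text) pairs."""
    conn.executemany(
        "INSERT INTO captions (channel, offset_seconds, text) VALUES (?, ?, ?)",
        [(channel, offset, text) for offset, text in captions],
    )
    conn.commit()


def search_captions(conn, phrase):
    """Return (channel, offset_seconds, text) rows whose text contains phrase.

    SQLite's LIKE is case-insensitive for ASCII by default. The LIKE
    wildcards % and _ are escaped so the phrase is matched literally.
    """
    escaped = (
        phrase.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    cursor = conn.execute(
        "SELECT channel, offset_seconds, text FROM captions"
        " WHERE text LIKE ? ESCAPE '\\'"
        " ORDER BY channel, offset_seconds",
        (f"%{escaped}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Made-up sample data standing in for the output of a caption parser.
    conn = open_caption_store()
    store_captions(conn, "Channel 4", [(1.5, "WEATHER TONIGHT: RAIN")])
    for row in search_captions(conn, "rain"):
        print(row)
```

A plain substring match is the simplest thing that could work here. If you end up monitoring several channels for long stretches, a full-text index would probably be a better fit than LIKE, but that is a separate design decision.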

<update date="2/8/2006"> Fixed the link to the sample code for the closed captioning parser </update>