Well, the answer may not be what you expect.
I used Expression Encoder 2 (specifically the new video overlay feature and marker metadata), Expression Blend 2.5, Visual Studio 2008 and Silverlight 2 Beta 1 to build the application.
The output from Expression Encoder 2 is a single video with the two streams combined, plus markers indicating when the picture-in-picture (PIP) was focused on me.
I then uploaded the video to Microsoft's streaming servers. You can view it in its raw form using Windows Media Player. This technique vastly simplifies the sync issues that can arise when using two separate streams, and it would work for multiple cameras at live events as well.
I then used a single media element, clipping and two video brushes to produce the resulting application.
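To give a rough idea of the logic behind this, here is a minimal sketch in Python (not the Silverlight source itself). It assumes the two feeds sit side by side in the combined frame and that each marker names the feed that should be in the main view; the frame sizes, marker times and names such as `layout_at` are all made up for illustration. In Silverlight you would react to marker events on the media element rather than look up a time, but the idea is the same: one video, two crop regions, and the markers decide which region goes where.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A crop region inside the combined video frame, in pixels."""
    x: int
    y: int
    width: int
    height: int


# Assumption: the combined frame holds the two feeds side by side.
LEFT_FEED = Rect(0, 0, 640, 480)
RIGHT_FEED = Rect(640, 0, 640, 480)

# Assumption: hypothetical markers as (time in seconds, feed to show in the main view).
MARKERS = [(0.0, "left"), (12.5, "right"), (40.0, "left")]


def layout_at(t, markers=MARKERS):
    """Return which crop region is the main view and which is the PIP at time t."""
    times = [time for time, _ in markers]
    # Find the most recent marker at or before t; fall back to the first marker.
    index = max(bisect_right(times, t) - 1, 0)
    focus = markers[index][1]
    if focus == "left":
        return {"main": LEFT_FEED, "pip": RIGHT_FEED}
    return {"main": RIGHT_FEED, "pip": LEFT_FEED}


if __name__ == "__main__":
    for t in (0.0, 20.0, 45.0):
        print(t, layout_at(t))
```

Because both views are cut from the same frames of the same stream, they cannot drift out of sync with each other, which is the whole point of combining the streams at encode time.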
One video stream plus a 134 KB XAP file is all it took to produce the final effect.
Given the interest, I thought I would post my source code and show you the Silverlight 2 version of my application hosted online (I could have used Silverlight 1).
I used the free Silverlight Streaming service to host the application, and you can view the finished product either online as a web page or as a Windows application. I didn't put in any UI to show buffering, so please be patient while the video starts. I also added code to switch to full screen when the user clicks on the main video feed.
P.S. For the Vimeo and Facebook versions, I used HyperCam screen capture software to record the screen while the Silverlight application was running!