Fun with Iterators and state machines

I recently was given a piece of C# code with statements like:

yield return value;

and

yield break;

This was the beginning of my descent into the loopy world of C# 2.0 iterators. It took me awhile to wrap my head around them, and when I tried to explain them to other team members I got looks of total confusion. (I admit it, some of us don't keep the C# 2.0 specification on the night stand. :-) If you're totally familiar with iterators, you can skip this blog entry. If not, plunge in with me. Also, I'm not trying to give a comprehensive overview of C# iterators. (For that, see the MSDN article) Rather, I'm focused on the cool compiler and run time magic that makes them possible.

<<Disclaimer: This code is based on the Whidbey Beta 1 version of C#, and various details may change before release.>>

Consider the following program:

//
// Iterator1.cs - Requires C# 2.0 or later
//

using System;
using System.Collections.Generic;

class SomeContainer
{
public IEnumerable < string > MyIterator()
{
yield return "Foo";

yield return "Bar";

yield return "Baz";
}
}

class foo
{
static void  Main()
{
SomeContainer container = new SomeContainer();

foreach( string s in container.MyIterator() )
Console.WriteLine( s );
}
}

Compiling the program ( "csc iterator1.cs" ) and running it gives these results:

Foo

Bar

Baz

If you try to mentally trace the flow of control, you might notice something odd. What the heck does a "yield return" statement do? Looking at the MyIterator method, it seems like there's two possibilities:

1) The CPU executes only part of MyIterator() at any given time. At points, control comes back to the method, and it executes another chunk.

or

2) The entire MyIterator method runs to completion, and all the possible return values are put into a temporary collection.

It's easy enough to test the first hypothesis. Let's add some WriteLine calls to MyIterator and see when they appear:

//
// Iterator2.cs - Requires C# 2.0 or later
//

using System;
using System.Collections.Generic;

class SomeContainer
{
public IEnumerable < string > MyIterator()
{
Console.WriteLine( "Entering MyIterator" );

yield return "Foo";
Console.WriteLine( "After Foo" );

yield return "Bar";
Console.WriteLine( "After Bar" );

yield return "Baz";
Console.WriteLine( "After Baz" );
}
}

class foo
{
static void  Main()
{
SomeContainer container = new SomeContainer();

foreach( string s in container.MyIterator() )
Console.WriteLine( s );
}
}

Compiling the new program ( "csc iterator2.cs" ) and running it gives these results:

Entering MyIterator

Foo

After Foo

Bar

After Bar

Baz

After Baz

Hmm... Based on these results, it seems that the execution path definitely bounces back and forth between the Main() method and the MyIterator() method. What the heck is going on here?

Since I work on the Visual Studio team, I could have asked a few questions, but some times it's more fun to figure things out on your own. Luckily, with a tool like ILDASM, it's easy enough to discern what's happening.

First, let's recompile with Iterator2.cs with optimizations ( "CSC /O+ iterator2.cs" ) and see what ILDasm shows us:

___[MOD] iterator2.exe
| M A N I F E S T
|___[CLS] SomeContainer
| | .class private auto ansi beforefieldinit
| |___[CLS] <MyIterator>d__0
| | | .class nested private auto ansi sealed beforefieldinit
| | | implements class [mscorlib]System.Collections.Generic.'IEnumerable`1'<string>
| | | implements [mscorlib]System.Collections.IEnumerable
| | | implements class [mscorlib]System.Collections.Generic.'IEnumerator`1'<string>
| | | implements [mscorlib]System.Collections.IEnumerator
| | | implements [mscorlib]System.IDisposable
| | | .custom instance void [mscorlib]System.Runtime.CompilerServices.CompilerGeneratedAttribute::.ctor() = ( 01 00 00 00 ) ...
| | |___[FLD] <>1__state : private int32
| | |___[FLD] <>2__current : private string
| | |___[FLD] <>4__this : public class SomeContainer
| | |___[MET] .ctor : void(int32)
| | |___[MET] MoveNext : bool()
| | | System.Collections.Generic.IEnumerable<System.String>.GetEnumerator : class [mscorlib]System.Collections.Generic.'IEnumerator`1'<string>()
| | | System.Collections.Generic.IEnumerator<System.String>.get_Current : string()
| | |___[MET] System.Collections.IEnumerable.GetEnumerator : class [mscorlib]System.Collections.IEnumerator()
| | |___[MET] System.Collections.IEnumerator.Reset : void()
| | |___[MET] System.Collections.IEnumerator.get_Current : object()
| | |___[MET] System.IDisposable.Dispose : void()
| | |___[PTY] 'System.Collections.Generic.IEnumerator<System.String>.Current' : instance string()
| | |___[PTY] System.Collections.IEnumerator.Current : instance object()
| |
| |___[MET] .ctor : void()
| | MyIterator : class [mscorlib]System.Collections.Generic.'IEnumerable`1'<string>()
|
|___[CLS] foo
| | .class private auto ansi beforefieldinit
| |___[MET] .ctor : void()
| |___[STM] Main : void()
|

For our very simple SomeContainer class, the C# compiler sure generated a lot of methods and fields. Particularly are the variables <>1__state and <>2__current. They sound suspiciously like parts of a state machine. A state machine could explain how just portions of a method execute.

Let's dig a little deeper. Expand the MoveNext() method and examine the disassembly (which I've added a few extra lines to in order to increase readability):

.method private hidebysig newslot virtual final
instance bool MoveNext() cil managed
{
.override [mscorlib]System.Collections.IEnumerator::MoveNext
.override method instance bool class [mscorlib]System.Collections.Generic.'IEnumerator`1'<string>::MoveNext()
// Code size 164 (0xa4)
.maxstack 2
.locals init (int32 V_0)
IL_0000: ldarg.0
IL_0001: ldfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_0006: stloc.0
IL_0007: ldloc.0
IL_0008: switch (
IL_0022,
IL_0047,
IL_006c,
IL_0091)
IL_001d: br IL_00a2

IL_0022: ldarg.0
IL_0023: ldc.i4.m1
IL_0024: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_0029: ldstr "Entering MyIterator"
IL_002e: call void [mscorlib]System.Console::WriteLine(string)
IL_0033: ldarg.0
IL_0034: ldstr "Foo"
IL_0039: stfld string SomeContainer/'<MyIterator>d__0'::'<>2__current'
IL_003e: ldarg.0
IL_003f: ldc.i4.1
IL_0040: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_0045: ldc.i4.1
IL_0046: ret

IL_0047: ldarg.0
IL_0048: ldc.i4.m1
IL_0049: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_004e: ldstr "After Foo"
IL_0053: call void [mscorlib]System.Console::WriteLine(string)
IL_0058: ldarg.0
IL_0059: ldstr "Bar"
IL_005e: stfld string SomeContainer/'<MyIterator>d__0'::'<>2__current'
IL_0063: ldarg.0
IL_0064: ldc.i4.2
IL_0065: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_006a: ldc.i4.1
IL_006b: ret

IL_006c: ldarg.0
IL_006d: ldc.i4.m1
IL_006e: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_0073: ldstr "After Bar"
IL_0078: call void [mscorlib]System.Console::WriteLine(string)
IL_007d: ldarg.0
IL_007e: ldstr "Baz"
IL_0083: stfld string SomeContainer/'<MyIterator>d__0'::'<>2__current'
IL_0088: ldarg.0
IL_0089: ldc.i4.3
IL_008a: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_008f: ldc.i4.1
IL_0090: ret

IL_0091: ldarg.0
IL_0092: ldc.i4.m1
IL_0093: stfld int32 SomeContainer/'<MyIterator>d__0'::'<>1__state'
IL_0098: ldstr "After Baz"
IL_009d: call void [mscorlib]System.Console::WriteLine(string)
IL_00a2: ldc.i4.0
IL_00a3: ret
} // end of method '<MyIterator>d__0'::MoveNext

The listing breaks up into five distinct blocks. Some of the key highlights include:

· IL_0000 - IL_001D: Load the value of the <>1__state field, and branch to one of five different blocks in the code. This is consistent with what a state machine will do.

· IL_0022 - IL_0024: Store the value "-1" into <>1__state.

· IL_0029 - IL_002E: Print "Entering MyIterator"

· IL_0033 - IL_0039: Store the string value "Foo" into the <>2__current field.

· IL_003E - IL_0040: Store the value "1" into <>1__state.

· IL_0045 - IL_0046: Make the method exit with a return value of 1.

Looking at another block (starting at IL_0047) we see a similar sequence of instructions. The most key difference is at IL_0063 - IL_0065, where the <>1__state is set to '2", rather than "1". In yet another block it's set to "3". In short, the <>1__state field is a numeric value that keeps track of which state the MoveNext() method is in. Each time MoveNext() is called, control begins at the top of the code, and branches to the appropriate handler block for that state. At the end of that block, <>1__state is updated to the next state.

What I've shown here is a rather simple example of C# iterators and how they're supported by rather sophisticated run time support which takes your code and creates a state machine around it.

If you think you've got iterators down, try wrapping your head around this slightly more complicated example:

//
// Iterator3.cs - Requires C# 2.0 or later
//

using System;
using System.Collections.Generic;

class SomeContainer
{
public IEnumerable < string > MyIterator()
{
Console.WriteLine( "Entering MyIterator" );

for ( int i = 0; i < 10; i ++ )
{
yield return i.ToString();
}
}
}

class foo
{
static void  Main()
{
SomeContainer container = new SomeContainer();

foreach( string s in container.MyIterator() )
Console.WriteLine( s );
}
}

Compile it with "csc iterator3.cs" and run the executable. Pretty nifty, eh?

Next time, I'll show a very cool way of using C#'s built in state machine support to do idle time processing. For instance, the Visual Studio IDE has a variety of code that needs to execute while the IDE is idle. Using C# iterators, it's fairly easy to accomplish this.