C# 2.0: Fast string comparison with string-interning


<Also read the later posting on the same topic>


Frequently in code we do string comparisons which is culture-agnostic. While we should follow the string comparison guidelines, there is a much faster way of getting it done using string interning.


As a sample lets take method that accepts a string and does some action based on the string.

static void ExecuteCommand(string command)

{

if (command == “START”)

Console.WriteLine(“Starting Build…”);

else if (command == “STOP”)

Console.WriteLine(“Stopping Build…”);

else if (command == “DELETE”)

Console.WriteLine(“Deleting Build…”);

else

Console.WriteLine(“Invalid command…”);

}


The problem with this code is that this actually does a full string comparison using string.Equals(command, “START”, StringComparison.Ordinal); which results in iterating through each of the bytes of the two strings and comparing them. In case the strings are long and there are a lot of strings to compare, this becomes a slow process. However if we are sure that the string command is an interned string we can make this much faster. Using the following code

static void ExecuteCommand(string command)

{

if (Object.ReferenceEquals(command,“START”))

Console.WriteLine(“Starting Build…”);

else if (Object.ReferenceEquals(command, “STOP”))

Console.WriteLine(“Stopping Build…”);

else if (Object.ReferenceEquals(command, “DELETE”))

Console.WriteLine(“Deleting Build…”);

else

Console.WriteLine(“Invalid command…”);

}


This uses just the reference comparison (memory address comparison) and is much faster. However, the catch is that the command has to be an interned string. In case the command is not a literal string and is either generated or accepted from the user as a command line argument then this will not be an interned string and the comparisons will always fail. However we can intern it using the following

string command = string.Intern(args[0].ToUpperInvariant());

ExecuteCommand(command);


Here we have taken care of both the case handling as well as ensured that the command is interned. Another generic way of doing this could be

string internedCmd = string.IsInterned(command) ?? string.Intern(command);


Here IsInterned returns the interned string if the string is interned of else null. Intern interns the string and returns a reference to it. Interestingly this expression also uses the new ?? C# 2.0 operator.

Comments (9)

  1. I forgot to add that as always code for clarity and easy maintainability. Profile to see perf-bottlenecks and only then use optmizations as these….

  2. Anonymous says:

    I wonder how fast the IsInterned check is. If it’s fast it would have been really nice for Microsoft to add code like this into the BCL:

    public static operator ==(string a, string b) {

    string aint = a.IsInterned;

    string bint = b.IsInterned;

    if (aint != null && bint != null) return object.ReferenceEquals(aint, bint);

    // … rest of the existing implementation

    }

  3. Anonymous says:

    Isn’t the whole performance idea lost the moment you call string.intern and ToUpper() ?

  4. This works mainly for interned string.

    Call to string.intern does have a performance hit. This’ll work only if you need to intern the string say once and use comparisons multiple times with long strings. As with most optimization you need to profile and see if this is making a good difference to your perf and then use it.

  5. Anonymous says:

    I knew that intern has a perf hit; I was wondering whether IsInterned has the same perf hit or whether that could be done cheaply. If IsInterned is cheap, it would be possible to use it inside the standard equality operator for System.String so that interned strings compare fast with "==" while non-interned strings fall back to the slower path.

    If IsInterned is expensive of course that trick doesn’t work…

  6. Anonymous says:

    I was getting excited till I reached this line :

    However, the catch is that the command has to be an interned string.

  7. IsInterned would be too expensive to call from within String.Equals, so that not going to work.

    On loading your assembly all the literal strings are interned and which could be 100’s of string. This is load time perf-hit which is OK. However, the explicit Intern call is run-time. However, if its done once say in Main for the command string and used for comparison 100’s of times then the gain is significant….

  8. Manip says:

    I don’t like this coding style… The reason being that if a maintainer changed a line of code somewhere else in the project (const/compile.string to std.string) then it could break at this point, and every other point when these "performance improvements" have been made.

    Worse still it would be VERY difficult to track down unless you knew what to look for… :-/

  9. Anonymous says:

    There is a cost when you call IsInterned as well.  The implementation of the interning table is a HashTable so every insert and lookup requires the calculation of a hash value.  On the other hand a string comparator will return false as soon as two bytes disagree.  

    Use normal string comparison unless you are in the position of being to hash things once before doing a very large number of comparisons on them.