“Linkifying” a String

I needed a string “Linkify” function for a web site that I am working on: a function that takes a string, parses it for URLs, and replaces those URLs with HTML anchors. The end result is that when the string is sent to a web client, the user can click through to each URL.

I started out thinking that it could be done with some simple string manipulation, but quickly realized that finding the end of a URL was not trivial. It then occurred to me, from my experience with Perl, that Regular Expressions would solve the problem. Regular expressions are exceedingly powerful, but with their power comes complexity.

The .NET Framework has a namespace dedicated to Regular Expressions: System.Text.RegularExpressions. The Regex class has a constructor that takes a pattern and options as parameters. The Regex.Matches(string) method takes a string as input and returns a MatchCollection, which holds one Match object for every match found, each with its own collection of groups and captures. For my purposes I was able to avoid drilling down deep into those collections, but it’s good to know that a Match can hand back the individual groups that a pattern captures.
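To make those classes a little more concrete, here is a minimal sketch of the constructor, Matches, and the groups inside each Match. The class name, the helper, and the "name=number" pattern are made up for this illustration and have nothing to do with the URL pattern that follows.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchCollectionDemo
{
    // Hypothetical helper: describes each "name=number" pair found in the input.
    public static List<string> Describe(string input)
    {
        // Two groups: the name before the equals sign and the digits after it.
        Regex regex = new Regex(@"(\w+)=(\d+)", RegexOptions.IgnoreCase);
        MatchCollection mc = regex.Matches(input);

        List<string> lines = new List<string>();
        foreach (Match m in mc)
        {
            // m.Index is where the match starts, m.Value is the whole match,
            // and m.Groups[1] / m.Groups[2] are the two parenthesized parts.
            lines.Add(m.Index + ": " + m.Value + " -> " + m.Groups[1].Value + ", " + m.Groups[2].Value);
        }
        return lines;
    }
}
```

Calling MatchCollectionDemo.Describe("width=10 height=20") returns one line per match, so the caller never has to touch the MatchCollection directly.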

The trickiest part for me was recalling how to construct a pattern for this type of query. RegEx pattern syntax is a language of its own, and it takes me time to wrap my head around it every time I need to work with Regular Expressions, which is not all that often.

So how did I overcome my cerebrally challenged state? While searching on Live.com (incidentally, in Windows Vista, by default, you can hit the Windows key, type your search words, press the down arrow, and press Enter), I came across a free third-party tool, Regular Expression Designer by Rad Software (this tool is not endorsed by Microsoft), that helped me quickly construct and try out the pattern that I needed. It also has a handy quick reference that shows the meaning of the esoteric symbols used in RegEx patterns.

Construction of the pattern

Anything within parentheses in a RegEx pattern is called a “group.” In this application we want three groups: the first is the protocol, the second is the domain, and the third is the path. For the protocol group we want to find an occurrence of http or ftp; the | operator indicates an Or. That group is followed by the literal ://. The domain group is one or more characters (+) that are not a forward slash, a carriage return, a line feed, a quote mark, or a close paren: [^/\r\n\")]. Note that some of these are escaped with preceding backslashes, because the pattern is written as a C# string literal. The path group begins with a forward slash, followed by zero or more characters (*) that are not a carriage return, a line feed, a quote mark, or a close paren: /[^\r\n\")]*. Lastly, the ? after the path group makes that group optional, so a URL with no path still matches. Finding *all* matches in the string is the job of the Regex.Matches method, not of the pattern.

Putting it all together we arrive at this: "(http|ftp)://([^/\r\n\")]+)(/[^\r\n\")]*)?"
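A quick way to see what the three groups capture is to run the pattern against a sample string. The sketch below does that; the class name, the helper, and the sample URLs are assumptions for illustration. It also includes a second pattern that is not part of the original design: as far as I can tell, the pattern above does not exclude spaces and does not accept https, so the variant is one possible way to tighten it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlPatternDemo
{
    // The pattern from the article. \r, \n and \" are C# string escapes.
    public static readonly Regex ArticlePattern =
        new Regex("(http|ftp)://([^/\r\n\")]+)(/[^\r\n\")]*)?", RegexOptions.IgnoreCase);

    // Assumed variant, not from the article: also accepts https and stops at whitespace.
    public static readonly Regex StricterPattern =
        new Regex("(https?|ftp)://([^/\\s\")]+)(/[^\\s\")]*)?", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns protocol, domain and path of the first URL found,
    // or null if the input contains no match.
    public static string[] SplitFirstUrl(Regex regex, string input)
    {
        Match m = regex.Match(input);
        if (!m.Success)
        {
            return null;
        }
        // Group 0 would be the whole match; groups 1 to 3 are the parenthesized parts.
        // A group that did not take part in the match has an empty Value.
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

With the article's pattern, "(see http://www.example.com/docs/index.html)" splits into http, www.example.com, and /docs/index.html, because the close paren ends the path. With "Visit http://www.example.com today", however, the domain group comes back as "www.example.com today", since a space is not one of the excluded characters. The stricter variant stops at the space and also matches https URLs.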

We could have given the groups friendly names, as in "(?<protocol>http|ftp)://(?<domain>[^/\r\n\")]+)(?<path>/[^\r\n\")]*)?", if we had wanted to access the individual values by name for some other purpose. Without assigned names, the groups are numbered by default in order of occurrence within the pattern, starting with 1; group 0 always holds the entire match.
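As a small sketch of what access by name looks like, the following pulls just the domain out of the first URL in a string. The class name and the helper are hypothetical; the pattern is the named-group form of the one above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Same pattern as before, with each group given a friendly name.
    private static readonly Regex NamedUrlRegex = new Regex(
        "(?<protocol>http|ftp)://(?<domain>[^/\r\n\")]+)(?<path>/[^\r\n\")]*)?",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the domain of the first URL in the input,
    // or null if the input contains no URL.
    public static string FirstDomain(string input)
    {
        Match m = NamedUrlRegex.Match(input);
        // The Groups collection can be indexed by name as well as by number.
        return m.Success ? m.Groups["domain"].Value : null;
    }
}
```

The names mostly buy readability: m.Groups["domain"] says what it is, where m.Groups[2] has to be counted out against the pattern.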

The following method wraps it all up:

using System.Text.RegularExpressions;

// Globals is the utility class that holds both methods.
public static class Globals
{
     /// <summary>
     /// This method takes a string and looks for URLs within it.
     /// If a URL is found it makes an HTML anchor out of it and embeds it
     /// in the output string. This method assumes that we are not passing
     /// in HTML. This method would have to be revised to support that.
     /// </summary>
     /// <param name="input">Plain text that may contain http or ftp URLs.</param>
     /// <returns>Linkified string</returns>
     public static string LinkifyString(string input)
     {
          Regex regex = new Regex("(http|ftp)://([^/\r\n\")]+)(/[^\r\n\")]*)?", RegexOptions.IgnoreCase);

          // Replace each match where it was found. Looping over the matches and calling
          // input.Replace(url, link) would rewrite every occurrence of a repeated URL on
          // each pass, wrapping URLs that already sit inside a new anchor a second time.
          return regex.Replace(input, m => CreateAnchorTargetBlank(m.Value, m.Value));
     }

     public static string CreateAnchorTargetBlank(string href, string description)
     {
          return string.Format("<a href=\"{0}\" target=\"_blank\">{1}</a>", href, description);
     }
}

MSDN has some RegEx examples that may be worth taking a look at.

Be advised that this LinkifyString function will corrupt HTML if you pass an HTML string to it, since we are not checking for anchor elements surrounding the URLs. To handle that, one would have to extend the pattern so that it ignores matches inside anchors, e.g. <a href="https://www.foo.com">https://www.foo.com</a>.
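One way that idea might be approached is sketched below. It is a rough sketch and not a revision of LinkifyString: the class name, the method name, and the simplified URL pattern (which also accepts https and stops at whitespace and angle brackets) are all assumptions. The trick is to let the pattern match existing anchors and other tags first and hand them back unchanged, so that only URLs outside of markup are wrapped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlAwareLinkifier
{
    // Alternatives are tried in order at each position:
    //   1. a complete anchor element, which is left untouched
    //   2. any other tag, so URLs inside attributes are left untouched
    //   3. a bare URL, captured in the group named "url"
    private static readonly Regex HtmlAwareRegex = new Regex(
        "<a\\b[^>]*>.*?</a>|<[^>]+>|(?<url>(?:https?|ftp)://[^\\s\"<>)]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: linkifies bare URLs and skips anything already inside markup.
    public static string Linkify(string input)
    {
        return HtmlAwareRegex.Replace(input, m =>
            m.Groups["url"].Success
                ? string.Format("<a href=\"{0}\" target=\"_blank\">{0}</a>", m.Value)
                : m.Value); // an existing anchor or tag: return it as-is
    }
}
```

A regular expression is not an HTML parser, so this will still trip over things like comments, script blocks, or malformed tags; for anything beyond simple fragments a real parser would be the safer choice.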

-Hans Hugli