Regular Expressions via Active Patterns


Edit 2/8/2010: Updating code samples for recent language changes. 


 


In my last post I introduced Active Patterns. Sure, they seem neat, but what do you actually do with them? This post covers how to leverage Active Patterns to make your code simpler and easier to understand, and in particular how you can use Regular Expressions in Active Patterns.


 


Our first pair of examples shows how to simplify the code required to extract groups from regular expressions. If you have ever needed to do a lot of string parsing, you probably know how tedious Regular Expressions can be. With Active Patterns we can make the matching not only straightforward but also a natural fit for the ‘match and extract’ workflow.


 


We will define two Active Patterns. The first is a Parameterized Active Pattern that, given a regex and an input string, returns the list of captured group values. The second takes that list and returns exactly three group values as a tuple. We then use them to parse a simple DateTime string and the version number from the F# Interactive window.


 


Note how the second Active Pattern, Match3, calls into the first using ‘match (|Match|_|) pat inp with’.


 



open System
open System.Text.RegularExpressions

(* Returns the captured groups of the first match, if the pattern matches. *)
let (|Match|_|) (pat:string) (inp:string) =
    let m = Regex.Match(inp, pat)
    // Note the List.tail: group 0 is always the entirety of the matched string,
    // so we drop it and keep only the captured groups.
    if m.Success
    then Some (List.tail [ for g in m.Groups -> g.Value ])
    else None


 


(* Expects the match to find exactly three groups, and returns them. *)
let (|Match3|_|) (pat:string) (inp:string) =
    match (|Match|_|) pat inp with
    | Some (fst :: snd :: trd :: []) -> Some (fst, snd, trd)
    | Some [] -> failwith "Match3 succeeded, but no groups found. Use '(.*)' to capture groups."
    | Some _ -> failwith "Match3 succeeded, but did not find exactly three groups."
    | None -> None


 


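// ----------------------------------------------------------------------------

// DateTime.Now.ToString() = "2/22/2008 3:48:13 PM" (en-US; other cultures format dates differently)
let month, day, year =
    match DateTime.Now.ToString() with
    // \d+ keeps the third group from also swallowing the time that follows the year.
    | Match3 @"(\d+)/(\d+)/(\d+)" (a,b,c) -> (a,b,c)
    | _ -> failwith "Match Not Found."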


 


let fsiStartupText = @"
MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved
F# Version 2.0.0.0, compiling for .NET Framework Version v2.0.50727"

let major, minor, dot =
    match fsiStartupText with
    // Regex.Match searches the whole input, so no leading or trailing wildcard is needed.
    | Match3 @"F# Version (\d+)\.(\d+)\.(\d+)" (a,b,c) -> (a,b,c)
    | _ -> failwith "Match not found."
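For comparison, here is a rough sketch of the same version-number extraction written against the Regex API directly, without an Active Pattern. The helper name parseVersionDirect and the fixed sample string are assumptions made for illustration; they are not part of the code above.

```fsharp
open System.Text.RegularExpressions

// Fixed sample input so the result does not depend on the machine running it.
let sampleText = "F# Version 2.0.0.0, compiling for .NET Framework Version v2.0.50727"

// Hypothetical helper: the same extraction written against the Regex API directly.
let parseVersionDirect (text : string) =
    let m = Regex.Match(text, @"F# Version (\d+)\.(\d+)\.(\d+)")
    if m.Success then
        // Groups.[0] is the whole match; the captures start at index 1.
        Some (m.Groups.[1].Value, m.Groups.[2].Value, m.Groups.[3].Value)
    else
        None

let directResult = parseVersionDirect sampleText
```

Every caller of a helper like this has to know about Match objects and group indices, or each regex needs its own wrapper function. The Active Pattern version keeps that bookkeeping in one place.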


 


The code seems simple enough, but notice how the Active Pattern removes any need to deal with the System.Text.RegularExpressions namespace directly. You simply match the input string against the regex string and get a tuple of values back.


A More Complex Example


We can take this concept one step further. Consider the task of extracting all the URLs from a web page. To do this we will use two more Active Patterns: one to convert an HTML blob into a list of link matches (the values in a Regex MatchCollection), and a second to normalize relative URL paths (“/foo.aspx” => “http://…/foo.aspx”).


 


(* Ensures that the input string starts with the given prefix. *)
let (|EnsurePrefix|) (prefix:string) (str:string) =
    if not (str.StartsWith(prefix))
    then prefix + str
    else str


 


(* Returns the text of every match, if the pattern matches at least once. *)
let (|Matches|_|) (pat:string) (inp:string) =
    let m = Regex.Matches(inp, pat)
    // Unlike the group list in Match, a MatchCollection has no "whole string"
    // entry at index 0, so every element is kept.
    if m.Count > 0
    then Some [ for g in m -> g.Value ]
    else None


 


(* Returns the single captured group of the given regular expression. *)
let (|Match1|_|) (pat:string) (inp:string) =
    match (|Match|_|) pat inp with
    | Some (fst :: []) -> Some (fst)
    | Some [] -> failwith "Match1 succeeded, but no groups found. Use '(.*)' to capture groups."
    | Some _  -> failwith "Match1 succeeded, but did not find exactly one group."
    | None -> None


 


// ----------------------------------------------------------------------------

open System.Net
open System.IO

// Returns the HTML from the designated URL
let http (url : string) =
    let req = WebRequest.Create(url)
    // 'use' is equivalent to 'using' in C# for an IDisposable
    use resp = req.GetResponse()
    use stream = resp.GetResponseStream()
    use reader = new StreamReader(stream)
    reader.ReadToEnd()


      


// Get all URLs from an HTML blob
let getOutgoingUrls html =
    // Regex for links: captures whatever sits between the quotes of an href attribute
    let linkPat = "href=\"([^\"]*)\""
    match html with
    // The matches are the full strings which our Regex matched, e.g. href="/foo.aspx".
    // We re-apply the pattern to each one to pull out just the captured URL,
    // without the 'href=' part or the surrounding quotes.
    | Matches linkPat urls
        -> urls |> List.map (fun url -> match url with
                                        | Match1 linkPat justUrl -> justUrl
                                        | _ -> failwith "Unexpected URL format.")
    | _ -> []


 


// Maps relative URLs to their fully-qualified path; absolute URLs are left alone
let normalizeRelativeUrls root urls =
    urls |> List.map (fun (url : string) ->
        if url.StartsWith("/") then (|EnsurePrefix|) root url else url)


 


// ----------------------------------------------------------------------------

let blogUrl = "http://blogs.msdn.com/chrsmith"
let blogHtml = http blogUrl

printfn "Printing links from %s..." blogUrl

blogHtml
    |> getOutgoingUrls
    |> normalizeRelativeUrls blogUrl
    |> List.iter (fun url -> printfn "%s" url)
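If you want to try the idea without hitting the network, the sketch below runs a similar pipeline over an inline HTML fragment. It is a variation rather than the code above: the names Hrefs and ResolvedAgainst and the sample HTML are made up for illustration, the links are captured in a single pass by reading group 1 of each match, and relative paths are resolved with System.Uri instead of string concatenation.

```fsharp
open System
open System.Text.RegularExpressions

// Hypothetical single-case active pattern: always succeeds and yields the
// captured href values (an empty list when there are no links).
let (|Hrefs|) (html : string) =
    [ for m in Regex.Matches(html, "href=\"([^\"]*)\"") -> m.Groups.[1].Value ]

// Hypothetical single-case active pattern: resolves url against root with
// System.Uri rather than string concatenation.
let (|ResolvedAgainst|) (root : string) (url : string) =
    let resolved = Uri(Uri(root), url)
    resolved.AbsoluteUri

// Made-up HTML fragment, used here instead of a live download.
let sampleHtml =
    "<a href=\"/foo.aspx\">Foo</a> <a href=\"http://example.com/bar.html\">Bar</a>"

let sampleLinks =
    match sampleHtml with
    | Hrefs urls ->
        urls |> List.map (fun url ->
            match url with
            | ResolvedAgainst "http://blogs.msdn.com/chrsmith" full -> full)
```

Note that Uri resolution treats “/foo.aspx” as relative to the host, so it yields http://blogs.msdn.com/foo.aspx rather than a path under /chrsmith, and an absolute link such as http://example.com/bar.html passes through unchanged. The Uri constructor throws on malformed input, so real pages would likely need some error handling around it.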


 


As you can see, Active Patterns are a powerful addition to F# and definitely something to keep in mind when you find yourself writing a lot of repetitive code.

Comments (4)

  1. In response to an earlier post a reader wrote: Sent From: namin Subject: (|EnsurePrefix|) — why is it

  2. Ben says:

    I’m trying to get a grip on active patterns, but in this example I’m having a hard time grasping how their use has a benefit over functions.  In particular, it seems to me that the (|Match1|) and the (|EnsurePrefix|) patterns could have just as easily been written as functions and it wouldn’t change anything.  

    What have I missed?

  3. ChrSmith says:

    Great question. I’ve addressed the difference between single-case active patterns and function calls in:

    http://blogs.msdn.com/chrsmith/archive/2008/03/27/single-case-active-patterns-vs-function-calls.aspx

    If that doesn’t clear things up let me know.

    Cheers!