Regular Expressions via Active Patterns

Edit 2/8/2010: Updating code samples for recent language changes.  

 

In my last post I introduced Active Patterns. Sure, they seem neat, but what do you actually do with them? This post covers how to leverage Active Patterns to make your code simpler and easier to understand, in particular by using Regular Expressions inside Active Patterns.

 

Our first pair of examples will show how to simplify the code required to extract groups from regular expressions. If you have ever needed to do a lot of string parsing, you probably know how tedious Regular Expressions can be. But with Active Patterns we can make the matching not only straightforward but also a natural fit for the ‘match and extract’ workflow.

 

We will define two Active Patterns. The first is a Parameterized Active Pattern that returns the match groups given a regex and an input string. The second takes those match groups and returns three sub-group values. The examples that follow use them to parse a simple DateTime string and the version number from the F# Interactive window.

 

Note how the second Active Pattern, Match3, calls into the first using ‘match (|Match|_|) pat inp with’.

open System

open System.Text.RegularExpressions

  

(* Returns the first match group if applicable. *)

let (|Match|_|) (pat:string) (inp:string) =

    let m = Regex.Match(inp, pat)

    // Note the List.tail, since the first group is always the entirety of the matched string.

    if m.Success

    then Some (List.tail [ for g in m.Groups -> g.Value ])

    else None

(* Expects the match to find exactly three groups, and returns them. *)

let (|Match3|_|) (pat:string) (inp:string) =

    match (|Match|_|) pat inp with

    | Some (fst :: snd :: trd :: []) -> Some (fst, snd, trd)

    | Some [] -> failwith "Match3 succeeded, but no groups found. Use '(.*)' to capture groups"

    | Some _ -> failwith "Match3 succeeded, but did not find exactly three matches."

    | None -> None

// ----------------------------------------------------------------------------

// In the en-US culture, DateTime.Now.ToString() = "2/22/2008 3:48:13 PM" (other cultures format dates differently)

let month, day, year =

    match DateTime.Now.ToString() with

    // (\d+) stops at the first non-digit; a greedy (.*) would also swallow the time into the year.

    | Match3 "(\d+)/(\d+)/(\d+).*" (a,b,c) -> (a,b,c)

    | _ -> failwith "Match Not Found."

let fsiStartupText = @"

MSR F# Interactive, (c) Microsoft Corporation, All Rights Reserved

F# Version 2.0.0.0, compiling for .NET Framework Version v2.0.50727";

  

let major, minor, dot =

    match fsiStartupText with

    | Match3 ".*F# Version (\d+)\.(\d+)\.(\d+).*" (a,b,c) -> (a,b,c)

    | _ -> failwith "Match not found."

 

A More Complex Example

The code seems simple enough, but notice how the Active Pattern removes any need to deal with the System.Text.RegularExpressions namespace at the call site. You simply match the input string against the regex string and get a tuple of values back.
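
To illustrate the ‘match and extract’ idea on its own, here is a minimal sketch that uses a Match-style pattern directly with list patterns. The describe function and the "key=value" input format are hypothetical and not part of the original samples; the Match pattern is re-declared here (using group indices instead of List.tail) only so the snippet runs by itself.

```fsharp
open System.Text.RegularExpressions

// Same idea as the Match pattern in this post: return every capture group
// except group 0, which is always the entire matched string.
let (|Match|_|) (pat: string) (inp: string) =
    let m = Regex.Match(inp, pat)
    if m.Success
    then Some [ for i in 1 .. m.Groups.Count - 1 -> m.Groups.[i].Value ]
    else None

// Hypothetical helper: classify a "key=value" line.
// Each case names the regex and binds the captured groups in one step.
let describe (line: string) =
    match line with
    | Match @"^(\w+)=(\d+)$" [ key; value ] -> sprintf "%s is the number %s" key value
    | Match @"^(\w+)=(.*)$" [ key; value ] -> sprintf "%s is the text '%s'" key value
    | _ -> "not a key=value line"

printfn "%s" (describe "port=8080")
printfn "%s" (describe "host=example.org")
printfn "%s" (describe "nonsense")
```

The cases are tried top to bottom, so the more specific numeric pattern has to come before the general one.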

 

We can take this concept one step further with a more complex example. Consider the task of extracting all the URLs from a webpage. To do this we will use two Active Patterns: one to convert an HTML blob into a list of URLs (taken from the Regex MatchCollection) and a second to normalize relative URL paths (“/foo.aspx” => “https://.../foo.aspx”).

 

(* Ensures that the input string starts with the given prefix. *)

let (|EnsurePrefix|) (prefix:string) (str:string) =

    if not (str.StartsWith(prefix))

    then prefix + str

    else str

(* Returns the text of every match if applicable. *)

let (|Matches|_|) (pat:string) (inp:string) =

    let m = Regex.Matches(inp, pat)

    // No List.tail here: each item is a separate match, so dropping the head would lose the first URL.

    if m.Count > 0

    then Some [ for g in m -> g.Value ]

    else None

(* Expects the match to find exactly one group, and returns it. *)

let (|Match1|_|) (pat:string) (inp:string) =

    match (|Match|_|) pat inp with

    | Some (fst :: []) -> Some (fst)

    | Some [] -> failwith "Match1 succeeded, but no groups found. Use '(.*)' to capture groups."

    | Some _ -> failwith "Match1 succeeded, but did not find exactly one group."

    | None -> None

// ----------------------------------------------------------------------------

open System.Net

open System.IO

// Returns the HTML from the designated URL

let http (url : string) =

    let req = WebRequest.Create(url)

    // 'use' is equivalent to ‘using’ in C# for an IDisposable

    use resp = req.GetResponse()

    let stream = resp.GetResponseStream()

    use reader = new StreamReader(stream)

    let html = reader.ReadToEnd()

    html

      

// Get all URLs from an HTML blob

let getOutgoingUrls html =

    // Regex for URLs

    let linkPat = "href=\"([^\"]*)\""

    match html with

    // Each item in urls is the full text of one match, e.g. href="/foo.aspx".

    // Re-apply linkPat through Match1 to pull out just the URL between the quotes.

    | Matches linkPat urls

        -> urls |> List.map (fun url -> match (|Match1|_|) linkPat url with

                                        | Some(justUrl) -> justUrl

                                        | _ -> failwith "Unexpected URL format.")

    | _ -> []

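// Maps relative URLs to their fully-qualified path.

// Caveat: EnsurePrefix prepends root to any URL that does not already start with it, including absolute links to other sites.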

let normalizeRelativeUrls root urls =

    urls |> List.map (fun url -> (|EnsurePrefix|) root url)

// ----------------------------------------------------------------------------

let blogUrl = "https://blogs.msdn.com/chrsmith"

let blogHtml = http blogUrl

printfn "Printing links from %s..." blogUrl

blogHtml

    |> getOutgoingUrls

    |> normalizeRelativeUrls blogUrl

    |> List.iter (fun url -> printfn "%s" url)
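
Since the pipeline above needs network access, a minimal offline sketch can help check the extraction step in isolation. The sampleHtml string is hypothetical, and the patterns are simplified, self-contained variants written for this sketch (Match1 here returns None instead of failing), not the exact definitions used above.

```fsharp
open System.Text.RegularExpressions

// Simplified variant: returns the full text of every match, keeping all of them.
let (|Matches|_|) (pat: string) (inp: string) =
    let ms = Regex.Matches(inp, pat)
    if ms.Count > 0
    then Some [ for i in 0 .. ms.Count - 1 -> ms.[i].Value ]
    else None

// Simplified variant: returns the single capture group of the first match.
let (|Match1|_|) (pat: string) (inp: string) =
    let m = Regex.Match(inp, pat)
    if m.Success && m.Groups.Count = 2
    then Some m.Groups.[1].Value
    else None

// Collects the quoted value of every href attribute in the input.
let getOutgoingUrls html =
    let linkPat = "href=\"([^\"]*)\""
    match html with
    | Matches linkPat hrefs ->
        hrefs
        |> List.choose (fun href ->
            match href with
            | Match1 linkPat url -> Some url
            | _ -> None)
    | _ -> []

// Hypothetical sample page with two relative links.
let sampleHtml = "<a href=\"/about.aspx\">About</a> <a href=\"/archive.aspx\">Archive</a>"

let urls = getOutgoingUrls sampleHtml
printfn "%A" urls
```

Running the extraction against a fixed string like this makes it easy to confirm that every link is returned and that the surrounding href="..." text is stripped before any normalization happens.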

As you can see, Active Patterns are a powerful addition to F# and definitely something to keep in mind when you find yourself writing a lot of repetitive code.