Rethinking findstr with F# and Powershell

As a software engineer, I am frequently searching my projects’ source trees for various code snippets, or searching a collection of log files for a particular message, or some other type of text searching activity.  The traditional Windows utility for such things is findstr, but alas, it leaves much to be desired.  Among other things, the regex support is kind of funky and it’s a pain to parse and process the output in batch script.

Windows Powershell makes things easier with the cmdlet Select-String, but this approach is usually not as speedy as trusty findstr.

The natural solution?  Implement our own utility in F#!  In this blog we will develop a Windows Powershell cmdlet in F# which offers both the features and the performance that we need, and we will do it in fewer than 200 lines of code.

These are the parameters we will be able to specify:

  • The pattern to search for
  • A file name or extension filter to control which files are considered in the search
  • Whether  to search recursively from the current directory
  • What file encoding to use
  • Whether to do a case-sensitive or case-insensitive search
  • Whether to do a verbatim plain-text search, rather than a regex search

And these are the pieces of data we will be able to process from the output

  • The file path
  • The line number
  • The full line from the file
  • The substring which matched
  • Regex match groups

Some basic usage samples we will be enabling:

 # find all public class declarations in .cs files in the current directory
PS C:\src> Search-File 'public class' *.cs

 # find all failure messages from .fs and .fsx files recursively
PS C:\src> Search-File 'failwith "(?<message>[^"]*)"' *.fs,*.fsx -Recurse

If you aren’t familiar with Powershell, it’s the next-generation (current-generation, really) command-line shell and scripting framework for automating Windows tasks and IT workflows.  Powershell has shipped in-the-box as a default Windows component since Windows 7, and has since attracted a large following of scripters, administrators, and developers.  Powershell v3 is built into Windows 8 and Windows Server 2012.

The primary type of utility in Powershell is not an executable, but a cmdlet.  Powershell is built entirely on .NET, and every scriptable entity is a full-fledged .NET object.  Users have full access to types, properties, methods, events, etc, all from script.  Cmdlets, too, are implemented as .NET classes, inheriting from System.Management.Automation.Cmdlet (or PSCmdlet).

Following standard Powershell naming conventions, we will call our cmdlet “Search-File.”  Cmdlet parameters are implemented as public properties on the cmdlet class and tagged with the [<Parameter>] attribute.  Our cmdlet class thus looks like this (taking advantage of the new F# 3.0 syntax for auto-properties):

 namespace FSUtils

open System.Management.Automation
open System.Text
open System.Text.RegularExpressions

/// Grep-like cmdlet Search-File
[<Cmdlet("Search", "File")>]
type SearchFileCmdlet() =
    inherit PSCmdlet()

    /// Regex pattern used in the search.
    [<Parameter(Mandatory = true, Position = 0)>]
    [<ValidateNotNullOrEmpty>]
    member val Pattern : string = null with get, set
    
    /// Array of filename wildcards.
    [<Parameter(Position = 1)>]
    [<ValidateNotNull>]
    member val Include = [|"*"|] with get,set
    
    /// Whether or not to recurse from the current directory.
    [<Parameter>]
    member val Recurse : SwitchParameter = SwitchParameter(false) with get, set

    /// Endcoding to use when reading the files.
    [<Parameter>]
    member val Encoding = Encoding.ASCII with get, set
    
    /// Toggle for case-sensitive search.
    [<Parameter>]
    member val CaseSensitive : SwitchParameter = SwitchParameter(false) with get, set

    /// Do not use regex, just do a verbatim string search.
    [<Parameter>]
    member val SimpleMatch : SwitchParameter = SwitchParameter(false) with get, set
    
    /// Called once per object coming from the pipeline.
    override this.ProcessRecord() =
        // check if simple match is possible, even if not specified explicitly
        let simple = this.SimpleMatch.IsPresent ||
                     (fun s -> s = Regex.Escape(s)) (Regex.Replace(this.Pattern, "\s", ""))
                     
        let searcher = FileSearcher(this.Pattern,
                                    this.SessionState.Path.CurrentFileSystemLocation.Path,
                                    this.Include,
                                    this.Recurse.IsPresent,
                                    this.CaseSensitive.IsPresent,
                                    simple,
                                    this.Encoding)
        searcher.Search()
        |> Seq.iter(fun item -> this.WriteObject(item))

Although this tool will be invoked from the command line, we don’t need to do any argument parsing ourselves.  All argument processing will be handled by the Powershell runtime.  When ProcessRecord is called, we can assume all applicable [<Parameter>] properties have been set according to user input.

The cmdlet class is just a simple Powershell interop wrapper around the real workhorse, the FileSearcher class, which we can write in somewhat more idiomatic F# code.  For the objects which are actually returned by the cmdlet, we define the LineMatch class, which exposes all of the properties we are interested in.

To meet our performance goals, we parallelize the entire workflow with F# async and Tasks.  The enumeration of files will be done in an async block, and each discovered file will then be processed in its own dedicated Task.  As matching lines of text are found, LineMatch objects are constructed and dumped into a BlockingCollection, which handles synchronization for us.  The elements of the BlockingCollection are meanwhile streamed back to the user on the calling thread, so results appear as fast as they are found.

Here’s the code:

 namespace FSUtils

 open System
open System.Collections.Concurrent
open System.IO
open System.Management.Automation
open System.Text
open System.Text.RegularExpressions
open System.Threading.Tasks
 // holds info about lines of files matching the provided pattern
type LineMatch(filePath : string,
               currentPath : string,
               lineNumber : int,
               line : string,
               subStr : string,
               mo : Match option) = 

    member val Path = filePath
    member val RelativePath = filePath.Remove(0, currentPath.Length + 1)
    member val LineNumber = lineNumber
    member val Line = line
    member val Match = subStr

    // .[] access to match groups
    member this.Item 
        with get(i : int) =
            match mo with
            | Some(m) when m.Groups.[i].Success -> m.Groups.[i].Value
            | _ -> null

    member this.Item
        with get(i : string) = 
            match mo with
            | Some(m) when m.Groups.[i].Success -> m.Groups.[i].Value
            | _ -> null

 /// Class for doing regex searches of files on disk.
type FileSearcher(pattern : string, 
                  startingDirectory : string,
                  includePatterns : string[],
                  recurse : bool,
                  caseSensitive : bool,
                  simpleMatch : bool,
                  encoding : Encoding) =

    /// Function used to match individual lines of text
    let doMatch =
        match (simpleMatch, caseSensitive) with
        // full regex match, the slowest option
        | (false, _) ->
            let r = Regex(pattern, RegexOptions.Compiled |||
                     if caseSensitive then RegexOptions.None else RegexOptions.IgnoreCase)
            fun line ->
                match r.Match(line) with
                | m when m.Success -> Some(m.Value, Some(m))
                | _ -> None
        // case-sensitive match with String.Contains, the fastest option
        | (true, true) ->
            fun line ->
                match line.Contains(pattern) with
                | true -> Some(pattern, None)
                | false -> None
        // case-insensitive match with String.IndexOf, the second fastest opton
        | (true, false) ->
            fun line ->
                match line.IndexOf(pattern, StringComparison.OrdinalIgnoreCase) with
                | i when i >= 0 -> Some(line.Substring(i, pattern.Length), None)
                | _ -> None

    /// Returns all files in the specified directory matching
    ///    one or more of the wildcards in "includePatterns."
    let GetIncludedFiles dir = 
        includePatterns |> Seq.collect (fun p -> Directory.EnumerateFiles(dir, p))
    
    /// Pattern which matches a directory path with either
    ///   Contents(<files in the directory>, <directories in the directory>)
    ///   AccessDenied if unable to obtain directory contents
    let (|Contents|AccessDenied|) dir = 
            try
                Contents (GetIncludedFiles dir, 
                    if recurse then Directory.EnumerateDirectories(dir)
                    else Seq.empty
                )
            with | :? UnauthorizedAccessException -> AccessDenied
   
    /// Enumerates all accessible files in or under the specified directory.
    let rec GetFiles dir = 
        seq {
            match dir with
                | Contents(files, directories) ->
                    yield! files
                    yield! directories
                            |> Seq.collect GetFiles
                | AccessDenied -> ()            
        }

    /// Scans the specified file for lines matching the specified pattern and
    ///    inserts them into the blocking collection.
    let CollectLineMatches (collection : BlockingCollection<LineMatch>) file  =
        try
            File.ReadAllLines(file, encoding)
            |> Array.iteri (fun i line ->           
                match doMatch line with
                    | Some(str, mtch) ->
                        collection.Add(
                         LineMatch(file, startingDirectory, i + 1, line, str, mtch))
                    | None -> ()
              )
        with | :? IOException -> () // if we have issues accessing the file, just ignore
    
    /// Initiates the search for matching file content,
    ///   returning an enumerable of matching lines.
    /// Note that the search is executed in parallel and
    ///   thus the order of results is not guaranteed.
    member this.Search () =
        let bc = new BlockingCollection<LineMatch>()

        async {
            let tasks = 
                GetFiles startingDirectory
                |> Seq.map (fun file -> 
                    Task.Factory.StartNew(fun () -> CollectLineMatches bc file) )
                |> Seq.toArray
            
            if tasks.Length = 0 then
                bc.CompleteAdding()
            else
                Task.Factory.ContinueWhenAll(tasks, fun _ -> bc.CompleteAdding())
                |> ignore
        } |> Async.Start   
        
        bc.GetConsumingEnumerable()

The F# code is concise and very readable.  Language features such as active patterns, sequence expressions, and matching keep the code fairly tidy compared to equivalent C#.  And of course, we can test the different bits in F# Interactive as we code.

Developers already familiar with F# might be curious why Tasks were used for processing files, rather than F# async expressions.  We needed to fire off file processing jobs immediately as each file path was enumerated, rather than waiting until all files were enumerated, since this could potentially take a long time.  The Task Parallel Library provides great APIs for exactly this kind of parallel processing.  The F# async API, on the other hand, provides an experience geared more toward enabling asynchronous processing, especially the usage of Begin/End .NET APIs.  The difference between parallel and asynchronous is subtle, but significant.  Suffice it to say, Tasks were simply the better tool for the job in this case.  Some good discussion on exactly this topic can be found here.

Compiling these classes into FSUtils.dll, we can now consume them from Powershell command line or script by calling the cmdlet Import-Module.

 PS C:\src> Import-Module .\FSUtils.dll

The output from our “failwith” sample search is a collection of LineMatch objects which look something like this:

 Path         : C:\src\fs3sample\SampleProviders\Shared\ProvidedTypes-0.2.fs
RelativePath : SampleProviders\Shared\ProvidedTypes-0.2.fs
LineNumber   : 924
Line         :                       | _ –> failwith "at most one constructor allowed"
Match        : failwith "at most one constructor allowed"

The output can be condensed to 1 line per match by either piping to Format-Table, or by defining a format file which specifies exactly how to display objects of the LineMatch type.

Comparing performance against standard Powershell cmdlets, traditional findstr, and GNU grep, we do very well. Search-File was a bit faster than findstr and grep when searching the F# team source tree for both plaintext and regex patterns. All three handily beat Select-String.

small

 

Where we really see Search-File pull ahead is when chewing through a larger set of files, in this case the C# source tree for another project. The benefits of parallelization become more pronounced in this case.

large

 

There we have it!  In under 200 lines of F#, a highly usable and very speedy file search utility.  I look forward to combining the strengths of F# and Powershell again soon.

Download the full source code here.

Lincoln Atkinson

Visual Studio F# Test Team

PS: The code provided here will compile against .NET 4+.  Powershell v3 supports .NET 4 by default, but in order to import this to Powershell v2 (e.g. on a Windows 7 machine), you will first need to take some manual steps.