WideFinder–Naive F# Implementation


Jomo Fisher–Here’s an interesting problem that some people are having fun with. Don Box posted a naive implementation in C# so I thought I’d post the equivalent in F#: 



#light

open System.Text.RegularExpressions
open System.IO
open System.Text

// Matches requests for article pages and captures the date/title part of the path.
let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

// Lazily yields the file one line at a time; the reader is disposed when enumeration ends.
let seqRead fileName =
    seq { use reader = new StreamReader(File.OpenRead(fileName))
          while not reader.EndOfStream do
              yield reader.ReadLine() }

// Keeps only the matching lines and counts how often each matched text occurs.
let query fileName =
    seqRead fileName
    |> Seq.map (fun line -> regex.Match(line))
    |> Seq.filter (fun regMatch -> regMatch.Success)
    |> Seq.map (fun regMatch -> regMatch.Value)
    |> Seq.countBy (fun url -> url)


And here’s the code to call it:   



for result in query @"file.txt" do
    let url, count = result
    // Print each matched URL with its hit count.
    printfn "%s: %d" url count


One nice thing is that F#’s interactive window has a #time;; option that shows wall-clock time and CPU time. Here is the result from running the code above on a 256 MB file I concatenated together (I couldn’t find the one Don was using):



Real: 00:00:06.899, CPU: 00:00:04.165, GC gen0: 416, gen1: 1, gen2: 0


It looks like most of the time is spent on CPU, so there should be ample opportunity to parallelize. One thing to note: I think the interactive window is unoptimized. When I just compile and run the code, I get times in the sub-5-second range. My machine is a 4-way 2.4 GHz Core Duo.
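As a rough illustration of where that parallelism could come from, here is a minimal sketch that runs the regex matching across cores with `Array.Parallel.choose` and then counts the results. It is not a measured or tuned solution. The names `countMatchesParallel` and `queryParallel` are hypothetical, and the sketch assumes the whole file fits in memory, which the streaming version above does not require.

```fsharp
open System.IO
open System.Text.RegularExpressions

// Same pattern as the sequential version; Regex matching is safe to call from several threads.
let regex = Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

// Hypothetical helper: match the lines in parallel, then count the matched text sequentially.
let countMatchesParallel (lines : string[]) =
    lines
    |> Array.Parallel.choose (fun line ->
        let m = regex.Match(line)
        if m.Success then Some m.Value else None)
    |> Array.countBy id

// Hypothetical entry point. Assumption: the file is small enough to load all at once.
let queryParallel (fileName : string) =
    File.ReadAllLines fileName |> countMatchesParallel
```

Only the matching step is parallel here; reading the file and counting stay sequential, so any speedup depends on how much of the run time the regex actually accounts for.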



This posting is provided “AS IS” with no warranties, and confers no rights.

Comments (4)

  1. ASPInsiders says:

    In my new ongoing quest to read source code to be a better developer , I now present the ninth in an

  2. JHugard says:

    Don’t forget to add "take the top 10" and "print to stdout":

    #light

    open System.Text.RegularExpressions
    open System.IO
    open System.Text

    let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

    let seqRead fileName =
       seq { use reader = new StreamReader(File.OpenRead(fileName))
             while not reader.EndOfStream do
                 yield reader.ReadLine() }

    let query fileName =
       seqRead fileName
       |> Seq.map (fun line -> regex.Match(line))
       |> Seq.filter (fun regMatch -> regMatch.Success)
       |> Seq.map (fun regMatch -> regMatch.Value)
       |> Seq.countBy (fun url -> url)
       // Sort by descending count so the first 10 are the top 10.
       |> Seq.sortBy (fun (_, count) -> -count)
       |> Seq.take 10

    for result in query @"file.txt" do
       let url, count = result
       printfn "%A - %A" url count

  3. Tony Nassar says:

    I enjoy your blog, and it’s helped inspire me to learn F#. Since it’s hard to introduce it into production code (my colleagues, and the build machine, would have to have F# installed), I’m using it for one-off scripts. Wow, it’s strange to be using a REPL again! Anyway, I have to munge through text files, and would recommend Seq.generate_using for that purpose:

    let lines = Seq.generate_using (fun () -> File.OpenText(@"solveBatch.Cplex.txt"))
                                   (fun (stream : StreamReader) -> match stream.ReadLine() with | null -> None | line -> Some line);;

    That took me about 30 minutes to get right. It would have been faster to cut and paste, but in the end I learned something.
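
As far as I know, `Seq.generate_using` is not available in later F# releases. The same open-then-generate idea can be sketched with a sequence expression and `Seq.unfold`. This is an illustration under that assumption, not the commenter's code, and the helper name `readLines` is hypothetical.

```fsharp
open System.IO

// Hypothetical helper: lazily yields the lines of a text file.
// The reader is opened when enumeration starts and disposed when it ends.
let readLines (path : string) : seq<string> =
    seq {
        use reader = File.OpenText path
        // Same generator shape as in the comment: None at end of file, otherwise the next line.
        let next () =
            match reader.ReadLine() with
            | null -> None
            | line -> Some (line, ())
        yield! Seq.unfold next ()
    }
```

On .NET 4 and later, `File.ReadLines` gives a lazily read sequence of lines directly, so a helper like this is mainly useful for seeing how the pieces fit together.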