Exploring data with F# type providers


Guest post by Thomas Denny, Microsoft Student Partner at the University of Oxford

About Me

Hi, I am currently studying Computer Science at the University of Oxford.

I am the president (2017-18) and was the secretary (2016-17) of the Oxford University Computer Society, and a member of the Oxford University History and Cross Country societies. I also lead a group of Microsoft Student Partners at the university to run talks and hackathons.

F#

F# is an incredibly flexible language, and among its many benefits is the ability to use type providers to access and manipulate data from external sources. A type provider generates .NET types at compile time from a sample of the data or its schema, so you do not need to declare those types in code yourself; this facility is not dissimilar to Lisp's macro features. In F# you might use a type provider in place of code generation, e.g. for writing wrapper types for a database schema. In this article we use a web page to generate a type that we then use to extract data from other similar pages, and then we look at how to extract data from a CSV file.

Getting started

So long as you have F# and NuGet installed you can follow this guide using any editor, but you can make your experience a little easier by also installing Visual Studio Code and the Ionide F# plugin. This plugin has several useful features, but the most useful are its IntelliSense and type annotations features, which are even available for types created by a type provider!

[Image: Visual Studio Code]

Once you're set up, you'll need to install the F# Data package from NuGet:

PM> Install-Package FSharp.Data -Version 2.3.3

Wikipedia tables

Parsing and consuming data from HTML is traditionally a heavy task requiring a large amount of code; often a task as simple as extracting the column names of a table will require dozens of lines of code.

We're going to take a look at a simple problem: each year the cast and crew members of a film will often win several different awards (e.g. Academy Award, Golden Globe), and we would like to find the names of the cast or crew members that won the most awards for that particular film.

We'll begin with the accolades received by Spotlight, the Best Picture winner at the 2016 Oscars. The results are presented in a table like this:

[Image: Example table]

First, we need to use the HTML type provider to create a new type based on this page. Create a new file called awards.fsx (an F# script):

#r "FSharp.Data.2.3.3/lib/net40/FSharp.Data.dll"
open FSharp.Data

type AccoladeData = HtmlProvider<"https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)">

Next, we request the data for that specific page:

let spotlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)")

spotlightData is an object of type AccoladeData, which has the properties Html, Tables, and Lists - these are standard across all types created by the HTML type provider. However, the members available on each of these properties vary with the page that the type was generated from. In our case, the Tables property has an Accolades property, which contains the table data from the page. If you use the Ionide plugin with Visual Studio Code, as described above, you can see this in the IntelliSense suggestions:

[Image: IntelliSense suggestions]

Collecting the results together can be done in a few lines of F#. We need to do the following:

  • Filter out any results that were not wins
  • Group results by the winner
  • Count the number of wins for each winner
  • Sort the winners by number of wins

This can be done as a simple F# function that takes the accolade table as an argument:

let awardNumbers (data: AccoladeData) =
    data.Tables.Accolades.Rows
    |> Seq.filter (fun row -> row.Result = "Won")
    |> Seq.groupBy (fun row -> row.``Recipient(s) and nominee(s)``)
    |> Seq.map (fun (person, awards) -> (person, Seq.length awards))
    |> Seq.sortByDescending (fun (person, count) -> count)

Each table row also has a type constructed by the type provider, with a property for each column (e.g. the result, the recipient, etc.). Finally, we can print the results:

for (person, count) in awardNumbers spotlightData do
    printfn "%s,%d" person count
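
If you want to try the filter/group/count/sort steps without a network connection, the same pipeline can be sketched over in-memory data. The AwardRow record, the sample rows, and the countWins function below are made up for illustration and are not part of F# Data; the sketch also uses Seq.countBy as a shorthand for grouping and then taking the length of each group.

```fsharp
// Hypothetical stand-in for a row of the accolades table (not a provided type).
type AwardRow = { Recipient: string; Result: string }

// Made-up sample data, purely for illustration.
let sampleRows =
    [ { Recipient = "Alice"; Result = "Won" }
      { Recipient = "Bob";   Result = "Nominated" }
      { Recipient = "Alice"; Result = "Won" }
      { Recipient = "Carol"; Result = "Won" }
      { Recipient = "Alice"; Result = "Nominated" } ]

let countWins (rows: seq<AwardRow>) =
    rows
    // Keep only the wins
    |> Seq.filter (fun row -> row.Result = "Won")
    // Group by recipient and count each group in one step
    |> Seq.countBy (fun row -> row.Recipient)
    // Most wins first
    |> Seq.sortByDescending snd
    |> Seq.toList

for (person, count) in countWins sampleRows do
    printfn "%s,%d" person count
```

With a type provider the only difference is where the rows come from: the record type is generated from the page instead of being written by hand.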

This works for a single page, but what about other pages with the same table of data? Provided the other page has an Accolades table with the same columns, we can print the same results for another film simply by changing the URL that we load from:

let moonlightData = AccoladeData.Load("https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)")
for (person, count) in awardNumbers moonlightData do
    printfn "%s,%d" person count

Finally, we can load the data for several films in parallel and then print the results for each film:

let urls = [
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Spotlight_(film)"
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_Moonlight_(2016_film)"
    "https://en.wikipedia.org/wiki/List_of_accolades_received_by_La_La_Land_(film)"
]

let allMovies =
    urls
    |> Seq.map AccoladeData.AsyncLoad
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.map awardNumbers

for movie in allMovies do
    for (p,c) in movie do
        printfn "%s,%d" p c
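
The loop above prints the counts without saying which film they belong to. Because Async.Parallel returns its results in the same order as its inputs, the results can be paired back up with the inputs. The following is a minimal sketch of that idea using a simulated load (fakeLoad is a hypothetical placeholder that just sleeps and returns the length of the name) rather than a real web request:

```fsharp
// Hypothetical stand-in for AccoladeData.AsyncLoad: waits briefly,
// then returns a value derived from its input.
let fakeLoad (name: string) = async {
    do! Async.Sleep 100
    return name.Length }

let names = [ "Spotlight"; "Moonlight"; "La La Land" ]

let results =
    names
    |> List.map fakeLoad
    |> Async.Parallel          // run all the loads concurrently
    |> Async.RunSynchronously  // block until every load has finished

// Results come back in input order, so zipping restores the labels.
let labelled = Array.zip (List.toArray names) results

for (name, value) in labelled do
    printfn "%s: %d" name value
```

The same zip could be applied to urls and allMovies in the script above to print a heading for each film.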

Extracting data from CSVs

The F# Data package also provides a type provider for CSV files. As with the HTML provider, the column names are available as properties on each row. Here's a simple example that extracts data from the British Government's list of MOT testing stations:

let [<Literal>] MOTUrl =
  "https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/613984/active-mot-testing-stations.csv"
// No need to specifically declare a type from the type provider if we are
// loading from one source
let data = new CsvProvider<MOTUrl>()

let stationsPerArea =
  data.Rows
  // Once again, column headers are the properties
  |> Seq.groupBy (fun row -> row.``VTS Address Line 4``)
  |> Seq.map (fun (location, rows) -> (location, Seq.length rows))
  |> Seq.sortBy (fun (location, count) -> count)

for (area, count) in stationsPerArea do printfn "%s,%d" area count
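
The CSV provider also infers a type for each column from the sample data. The sketch below illustrates this with a small inline sample instead of a URL; the People type and the sample rows are made up for illustration, and it assumes the same FSharp.Data 2.3.3 reference path used earlier in this article and that the provider accepts the sample as inline text.

```fsharp
#r "FSharp.Data.2.3.3/lib/net40/FSharp.Data.dll"
open FSharp.Data

// Inline sample (made-up data): the provider should infer
// Name as a string column and Age as an integer column.
type People = CsvProvider<"Name,Age\nAlice,30\nBob,25">

// Parse different data that follows the same schema.
let people = People.Parse("Name,Age\nCarol,41\nDave,19")

let adults =
    people.Rows
    // row.Age is an int here, so no manual conversion is needed
    |> Seq.filter (fun row -> row.Age >= 21)
    |> Seq.map (fun row -> row.Name)
    |> Seq.toList

for name in adults do printfn "%s" name
```

Because Age is typed as an integer, comparing it against a string would be rejected by the compiler rather than failing when the script runs.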

Conclusion

This is just a small glimpse of what you can do with F# type providers - the F# Data package also includes type providers for JSON and XML files, for example.
