Introduction to Data Science using F# and Azure Notebooks

Guest post by Nathan Lucaussy, Microsoft Student Partner at Oxford University.

image_thumb

Introduction to Data Science using F# and Azure Notebooks - Part 1: Functional Programming Basics via Plotting and Genetic Algorithms

Hello! I hope you'll enjoy the following blog post - it details the particular kind of algorithm that got me so excited about studying Computer Science in the first place! I'm Nathan, a second-year student reading Computer Science and Philosophy at the University of Oxford. I'm mainly interested in Machine Learning, Algorithms and Concrete AI Safety (ensuring AI algorithms do as they're told) - but in my spare time I enjoy playing the guitar and travelling. Reach me on LinkedIn: https://www.linkedin.com/in/nathan-lucaussy-60a16816a/ In this blog post we'll be getting used to F#'s functional style before deploying it for some data analysis in Azure Notebooks: our main task will be modelling the temperature of London over 3 years. We'll start off with plotting a time series and cleaning the dataset. This will introduce us to

  • the functional concept of Higher-Order Functions
  • F#'s type providers
  • the XPlot charting package
  • F#'s package dependency manager.

In the second half of this post, we will look at devising a Genetic Algorithm for fitting a sinusoidal curve to temperature data. By devising this regression algorithm, we'll be doing some Data Science, but we'll also be looking at important F# features:

  • functional recursive functions
  • in-detail pattern matching
  • wholemeal programming

We will do all of this using Azure Notebooks (in fact, in this very notebook!). Azure Notebooks is a free, online platform in the Microsoft Cloud providing an interactive development environment in F# but also Python and R. Its interactivity is particularly useful for data analysis, allowing instant visualisation of results.

A/ Cleaning the data and plotting the temperatures in XPlot.Plotly

The most widely used library for data science in F# is FsLab. It provides a number of packages, of which we will use:

  • FSharp.Data - gives access to data expressed in structured file formats, in this case a .csv file
  • XPlot.Plotly - builds interactive charts for data visualisation

Azure Notebooks natively supports Paket (the dependency manager for .NET's - and by extension F#'s - package repository NuGet). Follow the steps below to load the required packages directly from the NuGet repository:

1) #load "Paket.fsx" enables Paket within the Azure Notebooks environment.

2) Paket.Dependencies.Install """ ... """ This adds dependencies from the Nuget repository

3) Paket.Package ["FsLab"] generates dependencies for the downloaded packages

4) #load "Paket.Generated.Refs.fsx" to perform the actual referencing

Finally, we use the "open" keyword to open a namespace to the Notebook environment - much like Python's import.

 #load "Paket.fsx"  
 Paket.Dependencies.Install """  
 frameworks: net45 
 source https: //nuget.org/api/v2  
 nuget FSharp.Data 
 nuget XPlot.Plotly 
 """  
 Paket.Package["XPlot.Plotly"  
       "FSharp.Data"]
 #load "XPlot.Plotly.Paket.fsx"
 #load "XPlot.Plotly.fsx"
 #load "Paket.Generated.Refs.fsx"  
 open System 
 open FSharp.Data 
 open XPlot.Plotly    

A.1 Preparing and cleaning the data

We now need to load in the weather data from the file Condition_Sunrise.csv. This is the data we will want to perform our analytics on. This is where F# really shines - F# offers type providers : an extremely efficient way to parse data and metadata from structured sources to an F#-intelligible schema. We make use of the CSV type provider.

The following type declaration:

  • creates a schema
  • infers types automatically from the CSV columns
  • retrieves column names from the first header row.

1. type Weather = CsvProvider < "/home/nbuser/library/Condition_Sunrise.csv" >  

F# specific: We are introduced to another of F#'s functional features: let bindings. In imperative languages, variables are bound to a memory address in which a value is placed - this may then be changed with another value. In F# let bindings bind an identifier with a value or function - they are immutable. Related to this idea is the fact that when used functionally F# treats instructions as expressions. That there is no real notion of state in a functional program makes it so much simpler to reason about program semantics.

We may now load the data from our CSV file - this is done by loading into an instance of the type given by the type provider:

 let my_data = Weather.Load("/home/nbuser/library/Condition_Sunrise.csv")  

The object created is a schema which may be converted to an iterable sequence by calling the function Rows

F# specific: Notice the |> infix operator (pronounced pipe forward). It passes its first argument as an argument to it's second argument, a another function. This operator makes a huge difference with regards to readability of code, especially when performing data transforms on arrays.

Azure Notebooks' interactivity means we can read the first row of our CSV data. Immediately, we observe that some of the data will be of no use to us - we only need the data in the first two columns: DateTime and Temp.

 let first_row = my_data.Rows | >; Seq.head 
 first_row  
 Out: ("December 12, 2012 at 07:07AM", 29, "Partly Cloudy", 46, 30)  

To select the required data we use an array comprehension, creating pairs of elements comprising of only the date and temperature.

 let data_array = [ |  
     for row in my_data.Rows - >; (row.DateTime, row.Temp) |  
 ]  

We can lighten the array even further. Because the first column gives time at sunrise and second column represents the temperature at sunrise, we remove the time from each string of time.

To do this, we split the partition the string before and after the keyword 'at', and take the initial portion of the split string, using the Array.head function.

 let removeTimeFromDateString(str: string) = str.Split([ | " at " | ], StringSplitOptions.None) | >; Array.head  

F# specific: We now have an array of tuples all of which have a string as a first element. But this string represents a date! How do we parse it as machine-understandable time? Here F#'s .NET Framework integration proves very useful: the DateTime package provides a function Parse that correctly parses our date format. ToOADate then converts DateTime objects into a numerical date format, to which we subtract the first date to make numbers more manageable i.e. starting from 0.

F# specific: In functional languages, Higher-Order Functions are prevalent. These are functions that either take in functions as arguments or return functions given arguments (or both). We have already met one: |> (pipe forward) . Note below the use of Array.map - it applies a function to every single element of an array.

_F# specific _ Lambdas, or anonymous functions, are often used in functional languages. They act like regular functions except they are unnamed - the syntax for defining lambdas in F# is: fun x -> 2*x, as an example. Below an anonymous function is used as argument to Array.map

 let pruned_array = Array.map(fun(x: string, y: int) - >; ((x | > removeTimeFromDateString | > System.DateTime.Parse).ToOADate() - 41255.0, y)) data_array  
 let date_values = pruned_array | >; Array.map fst 
 let temp_values = pruned_array | >; Array.map snd  

A.2 Plotting temperatures as a function of time using XPlot.Plotly

Since XPlot.Plotly's namespace is open to our environment, we can now create XPlot objects. We choose to create a Scatter object (corresponding to the data organisation of a scatter-plot) because we have more than 1000 data points and a histogram, for example, would hinder clarity.

Note that the x and y series are passed in as arrays.

 let trace1 = Scatter(x = (pruned_array | >; Array.map fst), y = (pruned_array | > Array.map snd), name = "Temperatures")  

Azure Notebooks allows for wonderful inline plotting:

 trace1 | > Chart.Plot  

image_thumb[3]

B/ Using the Genetic Algorithm to fit a sine curve line to a periodic phenomenon: temperature

With these basics in place, we can start the regression process. Ultimately, we aim to provide a line of best fit giving temp_values as a function of date_values

B.1 The Genetic Algorithm

The genetic algorithm is used to solve optimisation problems by mimicking the process of evolution. Given a candidate solution - an Individual's particular characteristics (usually called Traits), we may generate a Population of Individuals each differing slightly from the source Individual in its Traits through randomnness. Having devised a way of ranking individuals (Fitness) in a Population, we perform a crossover of the best Individuals with the rest of the Population, adding in some randomness to escape local efficiency maxima. Each generation improves on the previous one, hence approximating an optimal solution.

For a graphical illustration of the Genetic Algorithm process, the following video, where a genetic algorithm learns to walk, may be of interest: https://youtu.be/xcIBoPuNIiw

· In our case, the individuals are four-tuples of Double values, for which we create a type Individual: they correspond to the values (a,b,c,d) in the family of the functions of the form a×(sin(b×x+c))+d. Such tuples capture all possible sinusoidal functions.

· We devise additional types : a Population will be a list of Individuals, Parents a pair of Individuals

When devising this large piece of code, which will ultimately run as a single function, we will use a design heuristic called wholemeal programming: never once will we look at the individual data contained within the data arrays or the list of individuals. Instead, we will repeatedly apply functions to the whole of the population. This kind of programming is distinctly functional.

 type Individual = double * double * double * double 
 type Population = Individual list 
 type Parents = Individual * Individual  

Crucial to the genetic algorithm is a function that inserts randomness at various stages of the process - the following adds or removes up to 10% of a Double value:

 let addTenPercRandom(random_gen: Random)(x: double): double = x * ((double(random_gen.Next(-100, 100)) / 1000.) + 1.)  

We also need a higher-order function that applies a function _f_ to every element of our 4-tuple individuals. You might recognise this as a map, and indeed it is the natural map on the tuple structure.

 let tupleMap(f: Double - > Double)(w, x, y, z) = (f w, f x, f y, f z)  

B.2 Building the initial population

Because Genetic Algorithms are prone to getting stuck at local maxima, it is often useful to introduce a guess for the starting individual. We build our first population's generation around this individual.

  • The function makeIndividual creates a single individual with some randomness around a guess individual

F# Specific: When writing programs in a functional style, we aim to avoid using loops (indeed, in purely functional languages like Haskell, it is very hard, near impossible, to do so). The functional alternative is using recursion. The let rec keyword instructs the compiler that this is a recursive function. We pass a parameter count which is decreased at each iteration, until we reach a base case, 0.

F# Specific: To handle the base case and recursive cases differently, we use another distinctively functional feature: pattern matching. The match keyword compares the value of size with 0 or any other integer, defaulting to the empty list in the 0 case and building the list recursively when non-zero using the :: (cons) operator - it appends a first element to a list.

 let makeIndividual(random_gen: Random)(guess_tuple: Individual): Individual = 
 (tupleMap(addTenPercRandom random_gen) guess_tuple) 
 let rec makePopulation(random_gen: Random)(size: int)(guess_tuple: Individual): Population = 
         match size with 
         | 0 - >; List.empty 
         | n - >; 
 (makeIndividual random_gen guess_tuple):: makePopulation random_gen(size - 1) guess_tuple  

B.3 Evaluating the fitness of an individual

The second ingredient of the Genetic Algorithm is a function that evaluates the performance of an individual at the given task - called a fitness function.

In our case, it consists in approximating the temperature values - for this we create a function which calculates the image of the sine function for a given date and given individual. This is the purpose of findSinValue.

Fitness will be a measure of statistical squares: the sum of the squares of the differences between the simulated value and the actual temperature value, for each value in the temperature value array.

We define the type Result as a record of the Individual and the Fitness of that given individual - so that for clarity they are paired up.

The function simulate is defined in a very functional way:

  • findSinValue for a given individual is mapped to every value in the date array
  • the anonymous binary function (fun x y -> ((x - double y)**2.)) computes the squares
  • the higher-order function Array.map2 (analogous to Haskell's ZipWith) applies a binary function to values of two arrays in index order. It applies the anonymous function above to elements from the array of simulated sine values and the temperature values in turn.
  • Finally Array.sums sum the squares
 type Result = {  
        Fitness: Double;IndividualTested: Individual  
 }  
 let findSinValue(a, b, c, d)(x_val: Double): Double = a * (sin((b * x_val) + c)) + d  
 let simulate(individualTested: Individual): Result =  
     let chiSquared = (Array.map2(fun x y - >; ((x - double y) * * 2.))(date_values | > (Array.map(findSinValue individualTested))) temp_values) | > Array.sum {  
             Fitness = chiSquared;  
             IndividualTested = individualTested  
         }  

B.4 Evolving the next generation

Given a previously generated population, how do obtain a new, improved generation?

We devise a mechanism for crossing-over two individuals' traits. For balance, each parent gives half of the traits - there are thus 6 ways to arrange the traits. The merge function gives one of these ways depending on the number passed to it as argument.

The crossOver function selects a random way to merge parents' traits by passing one of six random numbers to the merge function.

We can now crossover individuals. For a whole population, we first extract the top-ranking half of the population:

  • sorting individual by fitness
  • taking the top half
  • extracting the individuals from the Result record

This is done by composing the three functions List.sortBy, List.take, List.map.

From the top half we extract the best two individuals:

  • they are immediately added to the next generation so as not to loose top performing individuals from each generation
  • the rest of the population is crossed-over with both the top individual and the second best, using the higher-order function map.
  • this newly-formed portion is then mutated using a mutation function, in our case 10% randomness.

This process yields a new generation that is at least as good as the previous one.

 1.    let rng = Random()  
  
 1.    let merge n(a, b, c, d)(a ',b', c ',d') = 
 2.        match n with 
 3.        | 0 - >; (a, b, c ',d') 
 4.        | 1 - >; (a ',b', c, d) 
 5.        | 2 - >; (a ',b,c', d) 
 6.        | 3 - >; (a, b ',c,d') 
 7.        | 4 - >; (a ',b,c,d') 
 8.        | 5 - >; (a, b ',c', d) 
 9.        | _ - >; raise(System.ArgumentException("There are only six cases!")
 10.        ) 
 11.    
 1.    let crossOver(parents: Parents): Individual =  
 2.        let randomCrossingOrder = rng.Next(6) 
 3.        merge randomCrossingOrder(parents | >; fst)(parents | > snd) 
 4.    
 5.    let generateNextPopulation(mutatePopulation: Population - >; Population)(crossOver: Parents - > Individual)(results_generation: Result list): Population =  
 6.            let best_individuals = results_generation 
 7.            |>; List.sortBy(fun result - > result.Fitness) 
 8.            //sort list in order of ascending mean squares  
 9.            |>; List.take((results_generation.Length + 2) / 2) 
 10.            //take the best half of the generation  
 11.            |>; List.map(fun result - > result.IndividualTested) 
 12.            //retrieve indivudals from elements of type Result record  
 13.            best_individuals 
 14.            |>; function | head::second::tail - > 
 15.                (head::second::
 16.        mutatePopulation
 17.            ([
 18.                tail | >; List.map(fun individual - > crossOver(head, individual));
 19.                tail | >; List.map(fun individual - > crossOver(second, individual))
 20.             ] | >; List.concat)) 
 21.                       | _ - >; 
 22.        raise(System.ArgumentException("Population not large enough to crossover!"))  
  

B.5 Repeating the simulation over 100 generations

Before running the evolution 100 times, we need to set starting parameters:

· As population size, we choose to have 1000 individuals so that there is sufficient variety of traits, including outliers which will avoid local maxima.

· As starting guess, we use basic mathematical properties to find start values by inspection: a is the amplitude of the periodical pattern, b is 2π÷period2π÷period(since this is a yearly phenomenon, we estimate the period to be approximately 365 days), c is the phase shift and d is the vertical shift.

· the repeatSimulation function is a natural candidate for recursion, taking the new generation's population each time. We obtain fitness for each individual using the previously-defined simulate function. We find the best individual via minimum squares-sum at each generation; for verbosity purposes this is printed each time. Finally, when we reach the last generation, the best individual is returned.

Notice how this last function uses only previously defined functions to manipulate data without ever looking at it. This is 'wholemeal programming'.

 1.    let starting_guess = (30., 0.015, (-20.), 50.) let starting_size = 1000  
 2.    let starting_population = makePopulation rng starting_size starting_guess  
 3.    let mutation(pop: Population): Population = pop | >; List.map(tupleMap(addTenPercRandom rng)) 
 4.    let rec repeatSimulation max round_number pop = 
 5.        match round_number with 
 6.        | count when count = max - >; pop | > List.head 
 7.        | count - >;  
 8.            let generation_results = pop | > List.map simulate  
 9.            let best = generation_results | >; List.minBy(fun result - > result.Fitness) 
 10.            printfn "Best fitness in this generation: %A"  
 11.            best.Fitness  
 12.        let new_generation = generation_results 
 13.            | >; generateNextPopulation mutation crossOver new_generation 
 14.            | >; repeatSimulation max(count + 1)  
  
 1.    repeatSimulation 100 0 starting_population  yields: 
 Best fitness in first generation: 155707.3092
 Best fitness in 100th generation: 92311.57632

Conclusion: Results

Result Individual: ( 20.61520275 , 0.01711181459 ,− 21.17986916 , 47.20517328 )

We obtain a close solution: the sum of squares is 92048.62, which translates to a Root-Mean-Square of approximately 7.5. Given how the data has high variance with respect to a least-mean-squares fit of a sinusoidal curve, this is a very good result.

To illustrate the fit, let's go back to the original graph. We can compute the values for our sinusoidal model and overlay it on the Plotly Chart:

image_thumb[6]

It is safe to say we have observed some of F#'s most tremendous capabilities, notably through it's interoperability with the .NET Framework, Type Providers (FSharp.Data - but they also give access to R and Python packages) and strong type system. These features make it particularly suited for data analysis. Note that the Genetic Algorithm is a fun way to learn about functional programming because it is simple to understand; however it is not very efficient. Watch this space for presentations of more efficient Machine Learning algorithms in F# using Azure Notebooks!

Try this in Azure Notebooks

Azure Notebook Version:

https://notebooks.azure.com/anon-ioqeiw/libraries/FSharpAzureDataScience/html/Intro_to_FSharp_AzureNotebooks.ipynb