DAWG-fight, Optimizing text search in F# II

Some of you may have seen Dr. Brian McNamara’s post refuting my credentials as an evil genius. Lies! All lies propagated by the liberal media! Evidently, he thinks that his superior implementation of a trie, which is faster and uses much less memory, is better than mine. If being an evil genius means having smaller, faster solutions then sure, he has out-eviled me. But seriously, come on. Being an evil genius is more than code efficacy, true evil comes from the passion to crush the will of men!

Even if Dr. McNamara’s code is better, I assure you I’ll have the last laugh. Muahhaha

Striking the balance

F# is a hybrid language and to truly harness its power for nefarious ends you need to embrace both function and object-oriented programming. Let’s review my trie implementation:

 type TrieNode =
    | Node of IDictionary<char, TrieNode> * bool
    | Eow
    
    member this.IsLegalWord = 
        match this with
        | Node(_, true)
        | Eow -> true
        | _   -> false

 /// seq<char list> -> TrieNode 
let rec generateTrieNode words =

    let hasEowNode = Seq.exists (fun word -> word = []) words

    if Seq.isEmpty words then
        Eow    
    else
        let nextWordFragments =
            words   
            |> Seq.filter (fun word -> word <> [])
            |> Seq.groupBy (fun word -> Seq.head word)
            |> Seq.map (fun (firstLetter, words) -> (firstLetter, words |> Seq.map List.tail))
            |> Seq.map(fun (firstLetter, rest) -> (firstLetter, generateTrieNode rest))
            |> Seq.fold
                (fun acc (letter, trieNode) ->
                    (Char.ToUpper(letter), trieNode) :: (Char.ToLower(letter), trieNode) :: acc
                )
                []

        Node(dict nextWordFragments, hasEowNode)

 let searchStep (nodeInTrieTree : TrieNode) (tilesInChain : ITile list) =

    if nodeInTrieTree.IsLegalWord then
        foundWords.Add(getWordText tilesInChain)

    let currentTile = List.head tilesInChain

    match nodeInTrieTree with
    | Eow -> None
    | Node(possibleLetterSteps, _) 
        ->  let possibleSteps =
                currentTile.Neighbors
                |> Seq.filter 
                    (fun neighborTile -> 
                        not (List.exists 
                                (fun (prevTile : ITile) -> prevTile.Location = neighborTile.Location) 
                                tilesInChain
                            )
                    )
                |> Seq.filter (fun neighborTile -> possibleLetterSteps.ContainsKey(neighborTile.Letter))
                |> Seq.map 
                    (fun neighborTile ->
                        (   possibleLetterSteps.[neighborTile.Letter],
                            neighborTile :: tilesInChain   )
                    )
                    
            if not (Seq.isEmpty possibleSteps) then Some(possibleSteps)
            else                                    None

Not to toot my own horn here, but this is very functional and very elegant. With little use of mutation, data just flows from one point into the next. You don’t need an army of Singularity AI bots to tell you that the code looks sexy.

Now let’s look at Brian’s implementation and see how they differ. First, how Dr. McNamara represented his trie tree:

 [<AllowNullLiteral>]
type TrieNode() =
    let nextChar = Array.create 26 null
    let mutable isEndOfWord = false
    member this.IsEndOfWord = isEndOfWord
    member this.Item with get(c:char) = nextChar.[int c - int 'A']
                     and private set(c:char) v = nextChar.[int c - int 'A'] <- v
    member this.Add(word) = 
        assert(word |> Seq.forall IsLetter)
        this.Add(word,0)
    member private this.Add(word:string,i) =
        if i = word.Length then
            isEndOfWord <- true
        else 
            match this.[word.[i]] with
            | null -> let newNode = new TrieNode()
                      newNode.Add(word,i+1)
                      this.[word.[i]] <- newNode
            | node -> node.Add(word,i+1)

The problem with my purely functional approach, is that when you are initializing a recursive discriminated union you have to have the entire tree laid out in memory before it gets initialized. Because once that root node has been set, you can’t update it. The second implementation however just kept an array of child nodes. This allows for quick indexing and makes it much easier to ‘fill out’ the tree. There isn’t any need to have some six-stage pipeline, Brian just created new nodes as necessary.

I was even more impressed by how Brian actually went about searching the Bookworm board for words. Each step I had to check that the word chain wasn’t ‘doubling back’ and using the same letter tile twice. Brian however gets around this by simply ‘zeroing out’ the tile once it is used and then restoring it later. I’ve added comments to point this out.

 let rec Find(col, row, t:TrieNode) = seq {
    let c = board.[col,row]
    if c <> ' ' then
        board.[col,row] <- ' '  // REMOVE TILE, SINCE IT'S BEEN USED

         match t.[c] with
        | null -> ()
        | node -> 
            if node.IsEndOfWord then
                yield [c]
            for suf in Find(col-1, row-1, node) do yield c::suf
            for suf in Find(col-1, row+1, node) do yield c::suf        
            for suf in Find(col  , row-2, node) do yield c::suf        
            for suf in Find(col  , row+2, node) do yield c::suf        
            for suf in Find(col+1, row-1, node) do yield c::suf
            for suf in Find(col+1, row+1, node) do yield c::suf        
        board.[col,row] <- c    // RESTORE TILE, so other recursive calls can use.
     }

Inside Programming F# I wrote that at times imperative solutions can be much faster than functional. While some tech reviewers gave me crap for saying that, Brian’s implementation definitely strikes a good balance between functional and imperative. Kudos.

So, for Round 1 I’ll admit defeat. But I absolutely am not going to give up my (hard-earned) title of evil genius!

Destroying that MOFO where he stands

Fortunately I have a trick up my sleeve. To review, the following image is of a trie or prefix tree. (Stolen from wikipedia in an evil fashion.)

File:Trie example.svg

But consider the words “button” and “mutton”, or more appropriately, “super-mega-evilgenius” (me) and “barely-an-evilgenius” (Brian). In both cases the words share a common suffix, –utton or –evilgenius. When represented in a trie tree that would mean a huge waste of memory, because the node for ‘B’ would have child nodes for ‘u->t->t->o->n’ and the letter ‘M’ would have child nodes for ‘u->t->t->o->n’. When in actuality, they could both share the same suffix nodes. That is, both ‘B’ and ‘M’ can go to ‘u->t->t->o->n’. This sharing of suffixes is just like how prefix trees share prefixes, and both “Button” and “Buttons” share ‘b->u->t->t->o->n’.

This property is exactly what gives rise to the DAWG or directed-acyclic-word-graph. (Again, evilly stolen from Wikipedia.) In the image, the words “TOP”, “TOPS”, “TAP”, and “TAPS” are present. On the left is the trie tree and on the right the DAWG which reuses common suffixes.

File:Trie-dawg.svg

The advantage of a DAWG is that it uses far less memory, since you are no longer duplicating shared suffixes. However, as one would expect, they are a bit more tricky to implement than a Trie.

Creating a DAWG in F#

The simplest way to create a DAWG is to simply create a trie-tree, and then unify common suffixes between tree nodes. (That is, if two nodes can add “-S” and form a complete word, we can have both child pointers point to the same TrieNode object.) To get started, let’s blatantly steal the wannabe-evil-genius-Dr.-McNarama’s trie implementation and add a few new properties.

 type TrieNode() =
    let nextChar = Array.create 26 null
    let mutable isEndOfWord = false
     // ...
     member this.Children   = nextChar
    member this.IsLeafNode = Array.forall (fun child -> child = null) nextChar
    member this.NumChildren = 
        Array.fold
            (fun acc child -> if child <> null then acc + 1 else acc)
            0
            nextChar
    static member CompareNodes(lhs : TrieNode, rhs : TrieNode) =
        if lhs = null && rhs = null then 
            true
        elif (lhs = null && rhs <> null) || (lhs <> null && rhs = null) then
            false
        elif lhs.IsLeafNode && rhs.IsLeafNode then
            (lhs.IsEndOfWord = rhs.IsEndOfWord)
        elif lhs.IsEndOfWord <> rhs.IsEndOfWord then
            false
        else
            // For children, we can use a fast compare since we know that if they shared a common
            // suffix they would have been 'merged' already. (Going from the bottom of the tree up.)
            [| 0 .. 25 |] |> Array.forall(fun i -> (lhs.Children.[i] = rhs.Children.[i]))

Exposing the child node array will make it easy for us to rewire the tree later, and we will use the CompareNodes function to identify if two suffixes are identical. (More on that later.)

Rather than try to explain the algorithm outright, let’s just walk through the code. The first step is to associate each trie node with its height in the tree. Height being the number of letters away it is from a leaf EOW node. In order to do the DAWG conversion efficiently, we want to unify nodes going from the bottom of the tree up. So knowing where each node stands is an important first step. (Note that the nodeHeightDict dictionary is filled up as a side effect of calling walkTree, this is evil imperative programming at its worst!)

 // Walk the tree and add associate each node with its height
let nodeHeightDict = new Dictionary<TrieNode, int>()

let rec walkTree (node : TrieNode) =
    if node.IsLeafNode then
        nodeHeightDict.Add(node, 0)
        0
    else
        let depth =
            node.Children
            |> Array.filter (fun child -> child <> null)
            |> Array.map walkTree
            |> Array.max
        nodeHeightDict.Add(node, depth + 1)
        depth + 1

let maxTreeHeight = walkTree trieRoot

Once we have associated each node with a height in the tree, then group nodes into buckets based on height. So all nodes with height 0 should be terminal EOW nodes. All nodes with height 1 should only have children which are EOW nodes, and so on.

 // Next, fill a separate dictionary mapping each height to a set of nodes. This
// allows us to efficiently 'merge' suffixes going from bottom up.
let heightNodesDict = new Dictionary<int, HashSet<TrieNode>>()
for i = 0 to 16 do
    heightNodesDict.[i] <- new HashSet<_>()

Seq.iter 
    (fun node -> let height = nodeHeightDict.[node]
                 ignore(heightNodesDict.[height].Add(node)))
    nodeHeightDict.Keys

Now things start to get interesting. We need to store common tree suffixes for reuse. Rather than searching the suffix list at every node, we will index it by height of the tree and number of child nodes. (That way we can easily differentiate a suffix that leads to a word by adding “-ED” from a suffix that leads to a word by adding “-ING”.)

 // Now add unique suffix nodes to our 'blessed' list, which
// should all be unique. We will then walk the tree bottom-up
// setting child nodes to the blessed/unique suffixes.
//
// We store our blessed suffix nodes in a (height * NumChildren) dictionary 
// for fast lookup if a suffix exists.
let suffixNodes = new Dictionary<int * int, HashSet<TrieNode>>()
for treeDepth = 0 to 16 do 
    for numChildren = 0 to 26 do
        suffixNodes.[(treeDepth, numChildren)] <- new HashSet<_>()

// All nodes at height zero are EOW nodes (EOWNode and IsLeafNode are true)
// So add exactly one EOW node to the blessed suffixNodes dictionary. All
// nodes at height 0 can be replaced with the one EOW node.
suffixNodes.[0, 0].Add(Seq.head heightNodesDict.[0]) |> ignore

Finally, we walk the nodes in the tree and rewire the child nodes with blessed suffixes. Again, we work from the bottom of the tree up for efficiency reasons. We don’t need to start at heightOfNode 0, since all of those nodes are identical (EOW) and will be replaced by suffixNodes.[0, 0] anyways.

We can compare child nodes using the TrieNode.CompareNodes methods. This checks if two child nodes contain essentially the same information.

 for heightOfNode = 1 to 16 do   

    printfn "DAWGifying child nodes at height %2d. [%s]" heightOfNode (System.DateTime.Now.ToString("hh:mm:ss"))

 
    for node in heightNodesDict.[heightOfNode] do
    
        for childIdx = 0 to 25 do
            
            let childNode = node.Children.[childIdx]
            if childNode <> null then
            
                let validSuffixes = 
                    let childNodeHeight = nodeHeightDict.[childNode]
                    suffixNodes.[childNodeHeight, childNode.NumChildren]
                
                let foundValidSuffix = 
                    Seq.tryFind 
                        (fun validSuffix -> TrieNode.CompareNodes(childNode, validSuffix))
                        validSuffixes

                // If this suffix has never been encountered, then add it to our list.
                // Otherwise, reuse the blessed suffix.
                if Option.isNone foundValidSuffix then
                    validSuffixes.Add(childNode) |> ignore
                else
                    node.Children.[childIdx] <- Option.get foundValidSuffix

Of course, once we go through this procedure we still have the same TrieNode for the root of the tree. How can we validate that our DAWG has fewer nodes? We can validate this by simply walking the tree and adding the hash code of each node to a set. (Note we are relying on the default implementation of GetHashCode.)

 let rec getUniqueNodes (node : TrieNode)  =
    
    let theSet = Set.singleton (node.GetHashCode())

    let children'sNodes =
        node.Children
        |> Array.filter (fun child -> child <> null)
        |> Array.map getUniqueNodes
        |> Array.fold
            (fun acc uniqueNodes -> Set.union acc uniqueNodes)
            Set.empty
        
    Set.union theSet children'sNodes

And, to drive the point home how much better our implementation is than that not-so-evil-genius Dr. McNamara we can use the following test function:

 printfn "Building trie..."
let regularTrie = new TrieNode()
for w in words do
    regularTrie.Add(w)
printfn "Trie: %d nodes" <| (getUniqueNodes regularTrie).Count

printfn "Building trie for DAWG conversion..."
let dawg = new TrieNode()
for w in words do
    dawg.Add(w)
convertTrieToDawg dawg

printfn "DAWG : %d nodes" <| (getUniqueNodes dawg).Count
    
printfn "Testing..."
for w in words do
    assert IsWord(w, regularTrie)
    assert IsWord(w, dawg)

That’s it! I’ll admit that my implementation for converting a trie tree to a DAWG is a little slow, taking several minutes, but the results are worth it. The original trie tree required 393,960 nodes while the DAWG only needs 54,822. I’m not sure what I’ll do with all the memory savings. Perhaps use it to prepare for my next evil plan, which may or may not involve orbiting death lasers and the GPS coordinates of my enemy’s condo.

image

So good-genius Dr. McNamara, how do you like them apples ?