Trie-ing to optimize text search in F#

Any evil genius knows that if you want to stay on top you need to exercise your mind. Most evil geniuses stay strong by lifting weights with their telekinetic abilities, however I prefer video games. PopCap.com’s Bookworm is one mental exercise I am particularly fond of.

Bookworm is simple. You have  7x7 hexagonal grid of letters and your goal is to create words by forming chains of adjacent tiles. Once you pick the word, the tiles are remove from the board and new ones drop down to fill the void.

The following screen shot shows a five letter word being selected. Obviously it is harder to find longer words, and thus they are worth more points.

image

Eventually you start getting Green, Gold, and Diamond tiles worth bonus points. However, if you make too many short words, typically those with only three-letters, you get flaming tiles. Each round a flaming tile exists on the screen it will burn through the tile below it. Once a flaming tile reaches the bottom of the screen it’s game over, man.

image

Mental exercise is important to keep any evil genius, well, a genius. However, a true evil genius would try to cheat. And that’s exactly what I intend to do in this blog post: write a program to find the longest word on a given Bookworm board.

Hexgrid Data Structure

In order to cheat at any game, first you need to write code to represent the game state. Which in the case of Bookworm is the hexagonal grid.  Creating our grid data structure isn’t going to be super interesting, but bear with me as it is required in order to do word searches later.

We’ll start by creating an interface to represent each tile on the board, which has a letter, set of neighbors, and position. Then we create a class for the actual letter grid.

 /// Representing a tile on the board
[<AllowNullLiteral>]
type ITile =
    abstract Letter    : char
    abstract Neighbors : seq<ITile>
    abstract Location  : int * int

/// Data structure for a hex grid of letters
type LetterGrid(tileColumns : string[]) = 
    
    // Store the grid of letters in an array. Note that it is jagged. In 
    // the flash version of Bookworm even columns have 8-tiles instead of 7.
    let mutable m_columns : char[][] = null
    
    // Cache of ITiles corresponding to tiles on the grid
    let mutable m_tileCache = Map.empty

Armed with these simple types, we can go about populating our LetterGrid class’ ITile cache. The following F# snippet shows using F# object expressions to create new instances of ITile. Note the recursive calls to getTile and then filtering out any tiles which are off the board. We can represent off-board tiles by simply having getTile return null. However, this requires some additional work.

By default any types defined in F# cannot be null, even interfaces like ITile. However, we can override this behavior by simply adding the [<AllowNullLiteral>] attribute.

 /// Get an instance of an ITile from the given index. (Generating it if necessary.)
let rec getTile colIdx rowIdx =
     if not <| isValidLocation colIdx rowIdx then
        null
    else
        
        if Map.containsKey (colIdx, rowIdx) m_tileCache then
            m_tileCache.[(colIdx, rowIdx)]
        else
            let newTile = 
                {
                    new ITile with
                        member this.Letter = getTileLetter colIdx rowIdx
                        member this.Neighbors =
                        
                            let neighbors =
                                
                                if (colIdx + 1) % 2 = 1 then
                                    // Odd column
                                    seq {
                                        yield getTile (colIdx - 1) (rowIdx + 0)  // NW
                                        yield getTile (colIdx + 0) (rowIdx - 1)  // N
                                        yield getTile (colIdx + 1) (rowIdx + 0)  // NE
                                        yield getTile (colIdx + 1) (rowIdx + 1)  // SE
                                        yield getTile (colIdx + 0) (rowIdx + 1)  // S
                                        yield getTile (colIdx - 1) (rowIdx + 1)  // SW

                                    }
                                else
                                    // Even column
                                    seq {
                                        yield getTile (colIdx - 1) (rowIdx - 1)  // NW
                                        yield getTile (colIdx + 0) (rowIdx - 1)  // N
                                        yield getTile (colIdx + 1) (rowIdx - 1)  // NE
                                        yield getTile (colIdx + 1) (rowIdx + 0)  // SE
                                        yield getTile (colIdx + 0) (rowIdx + 1)  // S
                                        yield getTile (colIdx - 1) (rowIdx + 0)  // SW
                                    }
                                    
                            // Filter out null (neighbors off the board)
                            neighbors |> Seq.filter ( (<>) null )
                    
                        member this.Location = (colIdx, rowIdx)
                }
        
            m_tileCache <- Map.add (colIdx, rowIdx) newTile m_tileCache
            newTile

Crappy Word Detection

Great! Now that we have a way to represent the hex grid, all we need to do is start searching for words. First, we need a dictionary. Floating around on the internet(s) you can find a copy of the Scrabble Tournament Word List (TWL). The TWL contains all legal Scrabble words, however Bookworm only honors a subset of them. (For example, most curse words are not recognized.)

Once we have a word list, perhaps represented as a Set<string> we just need to start trying out all combinations of word chains and seeing if they are valid words, right? Wrong! That is a terrible idea and here is why:

Bookworm accepts words between 3 and 12-letters long. The board starts with 52 tiles, and each tile has six neighbors. Ignoring the fact that a tile cannot be used twice in the same word chain, we get:

clip_image002

which WolframAlpha tells me is 22,638,535,920, and that is just word chains! Once we are done permuting every single word chain we then have to look it up in our dictionary. Which would be O(log n), where n is the number of words. O(log n) is quite fast, but multiplied by 22 billion is still way too big.

What we need is a way to efficiently prune our search space. If you ever have a word prefixed with “AKVWOP”, and you know it isn’t a word, then it doesn’t make sense to continue looking in the vague hope of finding “AKVWOPS”, “AWKVWOPED”, “AWKVWOPING”, etc. 

The Trie Data Structure

Fortunately, there is a classic data structure for doing this called a trie. (Pronounced ‘try’, as in I was trie-ing to be clever when I titled this post.) It is also referred to as a prefix tree. The data structure can be explained quite well with the following picture: 

image

 

You start with the root node, and the edge to any of its children requires a letter. For example ‘t’. The child of the root node associated with ‘t’, has a child using the letter ‘o’ and another child associated with the letter ‘e’. Since there is no child associated with the letter ‘q’, you can infer that there is no leaf nodes prefix with the string “tq”.

If we constructed a trie for the entire contents of the dictionary, we would then have an efficient way to determine if a given substring prefixes any valid word. (By simply checking if such a node exists in the tree or not.)

The code for implementing this in F# is straightforward. Each trie node stores a dictionary potentially mapping a letter to a child node, or the it is at a terminal node. Meaning that the prefix is a valid word, and there are no further words with that prefix.

 type TrieNode =
    // Letters go to other nodes, flag if it is a word node
    | Node of IDictionary<char, TrieNode> * bool
    // You are at the end of a legal word
    | Eow
    
    member this.IsLegalWord = 
        match this with
        | Node(_, true)
        | Eow -> true
        | _   -> false

To generate our tree, we use the following function which takes a sequence of words, represented as F# lists of characters, and converts it into a TrieNode. I’ve added a glut of comments to help explain things. The only bit of black magic is the use of the dict function.

dict works by taking a seq<’k, ‘v> and produces an IDictionary<’k, ‘v>.

 /// seq<char list> -> TrieNode 
let rec generateTrieNode words =

    // Does this represent an EOW node for anything?
    let hasEowNode = Seq.exists (fun word -> word = []) words

    if Seq.isEmpty words then
        Eow    
    else
        (* Now group all words by their first letter.
        [ "cat"; "cats"; "dogs" ] 
        =>
        seq {
            ('c', seq { "at"; "ats" })
            ('d', seq { "ogs" })
        }
        *)
        let nextWordFragments =
            // [""; "a"; "act"; "acts"; "actor"; ... ]
            words   
            // ["a"; "act"; "acts"; "actor"; ... ]
            |> Seq.filter (fun word -> word <> [])
            // seq { ("a", seq { "act"; "acts"; "actor"; ... }); ... }
            |> Seq.groupBy (fun word -> Seq.head word)
            // seq { ("a", seq { "ct"; "cts"; "ctor"; ... }); ... }
            |> Seq.map (fun (firstLetter, words) -> (firstLetter, words |> Seq.map List.tail))
            // seq { ("a", seq ("c", seq { ... } ...  }
            |> Seq.map(fun (firstLetter, rest) -> (firstLetter, generateTrieNode rest))
            // seq { ("a", seq ...); ("A", seq ...) ... }
            |> Seq.fold
                (fun acc (letter, trieNode) ->
                    (Char.ToUpper(letter), trieNode) :: (Char.ToLower(letter), trieNode) :: acc
                )
                []

        Node(dict nextWordFragments, hasEowNode)

Once you have the trie tree in memory, you can check if a string is a valid word or not by simply walking down the tree.

 let isWord word tree =
    
    let letters = word |> Seq.toList
    
    let rec isWord currentNode letters =
        match letters, currentNode with
        // When you are out of letters, check if you are on
        // an end-of-word node
        | [], Eow
        | [], Node(_, true) 
            -> true

        // Otherwise, walk down the tree...
        | nextLetter :: rest, Node(childNodes, _)
            ->  let hasNode, childNode = childNodes.TryGetValue(nextLetter)
                if hasNode then
                    isWord childNode rest 
                else
                    false

        | _ -> false

    isWord tree letters 

Searching

Searching through potential word chains can be parallelized trivially, but I’ll save that for a future blog post. For now we will use the following approach:

  • Start with each tile on the board and put them into a stack of ‘search states’
  • While the search stack is not empty do
    • Pop the search stack
    • Check that state to see if there are any legal words from that position (adjacent tiles leading to valid word prefixes)
    • Push any legal, subsequent steps to our search stack

Here is the F# code. Note that each state is represented by TrieNode * ITile list. This represents the current word prefix node in the trie tree and the list of tiles forming the word chain. We then seed that search stack with word chains starting at every tile in the board.

 let statesToExplore = new Stack<TrieNode * ITile list>()

// Start with searching all tiles
gameGrid.AllTiles 
|> Seq.iter (fun tile -> 
                let firstStepDownTrieTree = 
                    match trieRoot with
                    | Node(nextSteps, _) -> nextSteps.[tile.Letter]
                    | _ -> failwith "Shouldn't get here"
                statesToExplore.Push(firstStepDownTrieTree, [tile]))

// Now explore the entire search space
while statesToExplore.Count > 0 do
    let trieNode, prevLetters = statesToExplore.Pop()
    
    match searchStep trieNode prevLetters with
    | None -> ()
    | Some(nextSearchStates) ->
        // Add possible states to explore to our stack
        Seq.iter (statesToExplore.Push) nextSearchStates

All we need is a function to take some given search state – current node in the trie tree and tiles used so far – and then check all of the neighboring tiles if they can continue down the trie tree towards valid words.

 /// Given a spot in the Trie tree, a game tile, and a list of previously used tiles used
/// to form a word, return Some(seq< all possible steps>). If no legal words are possible
/// from this state, return None.
let searchStep (nodeInTrieTree : TrieNode) (tilesInChain : ITile list) =

    // Are we at a legal word in the Trie tree?
    if nodeInTrieTree.IsLegalWord then
        foundWords.Add(getWordText tilesInChain)

    let currentTile = List.head tilesInChain

    match nodeInTrieTree with
    // Cannot make any valid words from this point, no purpose in stepping
    | Eow -> None
    // If you can form legal words beond where we are at in the tree, continue
    // searching.
    | Node(possibleLetterSteps, _) 
        ->  let possibleSteps =
                currentTile.Neighbors
                // Filter out neighbors already used for the word chain
                |> Seq.filter 
                    (fun neighborTile -> 
                        not (List.exists 
                                (fun (prevTile : ITile) -> prevTile.Location = neighborTile.Location) 
                                tilesInChain
                            )
                    )
                // Filter out neighbors which couldn't lead to avalid word
                |> Seq.filter
                    (fun neighborTile -> possibleLetterSteps.ContainsKey(neighborTile.Letter))
                // Now map them to the next 'states' to search
                |> Seq.map 
                    (fun neighborTile ->
                        (   // Node in trie tree
                            possibleLetterSteps.[neighborTile.Letter],
                            // Tiles used in the word chain
                            neighborTile :: tilesInChain
                        )
                    )
                    
            if not (Seq.isEmpty possibleSteps) then
                Some(possibleSteps)
            else
                None

And that’s it! We can use this efficient word search cheat to our heart’s content. Attached to this post is the source code.

I used BookwormBot9000 to spot this gem on my iPhone version of the game. But that begs the question what the hell is a madrilene?

photo

Disclaimer

All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose. So in other words, if cheating at word games leads your brain to atrophy and turn to mush, don’t blame me or Microsoft.

BookWormBot9K v1.0.zip