While building the back end to host our “Rock, Paper, Scissors in the cloud” game, we faced a situation of where/how to store the log files for the games that are played. In my last post, I explained a bit about the idea; in the game, log files are essential at tuning your bot to play effectively. Just to give a quick example of what the top of a log file might look like:
In this match, I (bhitney) was playing a house team (HouseTeam4) … each match is made up of potentially thousands of games, with one game per line. From the game’s perspective, we only care about the outcome of the entire match, not the individual games within the match – but we need to store the log for the user.
There’s no right or wrong answer for storing data – but like everything else, understanding the pros and cons is the key.
We immediately ruled out Azure Tables, simply because the entity size is too big. But what if we stored each game (each line of the log) in an Azure Table? After all, Azure Tables shine at large, unstructured data. This would be ideal because we could ask specific questions of the data – such as, “show me all games where…”. Additionally, size is really not a problem we’d face – tables can scale to TBs.
But, storing individual games isn’t a realistic option. The number of matches played for a 100 player match 4,950. Each match has around 2,000 games, so that means we’d be looking at 9,900,000 rows per round. At a few hundred milliseconds per insert, it would take almost a month to insert that kind of info. Even if we could get latency to a blazing 10ms, it would still take over a day to insert that amount of data. Cost wise, it wouldn’t be too bad: about $10 per round for the transaction costs.
Blob storage is a good choice as a file repository. Latency-wise, we’d still be looking at 15 minutes per round. We almost went this route, but since we’re using SQL Azure anyway for players/bots, it seemed excessive to insert metadata into SQL Azure and then the log files into Blob Storage. If we were playing with tens of thousands of people, that kind of scalability would be really important. But what about Azure Drives? We ruled drives out because we wanted the flexibility of multiple concurrent writers.
Storing binary data in a database (even if that binary data is a text file) typically falls under the “guilty until proven innocent” rule. Meaning: assume it’s a bad idea. Still, though, this is the option we decided to pursue. By using gzip compression on the text, the resulting binary was quite small and didn’t add significant overhead to the original query used to insert the match results to begin with. Additionally, the connection pooling makes those base inserts incredibly fast – much, much faster that blob/table storage.
One other side benefit to this approach is that we can serve the GZip stream without decompressing it. This saves processing power on the web server, and also takes a 100-200k log file to typically less than 10k, saving a great deal of latency and bandwidth costs.
Here’s a simple way to take some text (in our case, the log file) and get a byte array of the compressed data. This can then be inserted into a varbinary(max) (or deprecated image column) in a SQL database:
And to get that string back:
Naturally, the downside of this approach is that if a browser doesn’t accept GZip encoding, we don’t handle that gracefully. Fortunately it’s not 1993 anymore, so that’s not a major concern.