Looking back over my blog here, I don't think I have ever done a book review. I spent the last couple of weeks working through a book which I intend to use for testing and figured this would make a good topic for an entry here.

The book is Data Analysis with Open Source Tools by Philipp K. Janert. Here's a link to the book for sale at Barnes and Noble (I have no affiliation with them).

I had been looking for books on data analysis for a few weeks and, based on what was available, I had divided them into 2 categories. First were the "For Dummies" types of books. Normally, I love these books. When I worked technical support for Windows 95, I estimated that about 20% of the questions I answered could have been answered with the Windows 95 for Dummies book by itself. And Gookin's original DOS for Dummies was truly great. But in this case, these books repeated simple statistical analysis techniques that I already knew, so they were not right for me. At the other end were books like (I am making this title up) Stochastic Analysis of Bayesian Networks using Prolog in Bioinformatics for Proteomics. These books are so specialized that they left my meager bachelor's degree in math behind from page 1.

Then I noticed Janert's book and picked it up. It has a few introductory chapters that don't focus so much on how to compute statistics but instead give examples of how to use those stats (the ones in my mental category from the Dummies books) to generate different graphs. He goes into detail for each graph type and explains how and why you would want to use a certain type of plot - such as a log scale plot - and how to read each plot. Just as important, he explains why a certain plot might not be good for certain types of data. And on a factual level, when he explains that Pareto/Zipf/Power Law functions are met with consternation, he nailed it. I spent a couple of days working through this for a data point I was trying to isolate.
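To give a feel for why a log scale plot helps with power law data, here is a small Python sketch. It is not taken from the book: the function name `fit_power_law` and the sample numbers are made up for illustration. The idea is that data following y = C * x^a falls on a straight line once you take the log of both axes, so an ordinary least-squares line through the logged points recovers the exponent.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = constant * x**exponent by least squares in log-log space.

    Hypothetical helper for illustration; assumes all xs and ys are positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the best-fit line through the logged points is the exponent.
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    exponent = sxy / sxx
    # The intercept in log space is log(constant).
    constant = math.exp(mean_y - exponent * mean_x)
    return exponent, constant

# Made-up sample data that follows y = 100 * x**-2 exactly.
xs = [1, 2, 4, 8, 16, 32]
ys = [100.0 * x ** -2 for x in xs]

exponent, constant = fit_power_law(xs, ys)
print(exponent, constant)
```

With real, noisy data the points will only roughly follow a line, and a fit like this is just a first look rather than proof that the data follows a power law.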

After this introductory material, he goes into detail on the math needed to prep data for different types of presentation. Again, he takes a hands-on approach and gives the theory along with practical examples of how to make use of each theorem. He has examples of where things can go wrong as well and explains how you might be able to notice this as you analyze your own data.

Finally, he covers predictive analysis and gives some solid techniques for using data analysis to make confident guesses about future behavior. Again, he gives practical examples from the industry and covers both cases in which the analysis goes well and cases in which the analysis goes poorly. His examples are easy enough to follow and are clearly explained.

The "with Open Source Tools" part of the title is kind of interesting. He covers mostly Python and R examples, but also Matlab, which is not open source. If anything, I wish there had been a few more code samples. Although each chapter covers multiple techniques of analysis, each chapter ends with only one worked example using one of the open source tools. I can't help but wonder if the publisher threw that "open source" phrase on the book for wider appeal. Still, the code samples are a good extension of each chapter and left me wanting more.

Overall, this book hits the sweet spot between the too basic and too advanced books I had been seeing. After two weeks of reading, my copy is already dog-eared and has many little bookmarks scattered through it, which tells me I am making use of it. I might even get a few copies for other testers on the team.

Questions, comments, concerns and criticisms always welcome,

John