Book Review (Book 7) – Think Stats

This is a continuation of the books I challenged myself to read to help my career - one a month, for a year. You can read my first book review here. The book I chose for December 2011 was: Think Stats, by Allen B. Downey.

Why I chose this Book:

I originally chose another book for this month, but changed to this one after a difference in focus (sort of) in my technical career. That brings up a couple of interesting points right away. The first is that it’s OK to change a list - remember that the purpose of reading these books is to gain information that gets you closer to your professional goals. When you develop your list, you have a certain amount of knowledge, and as you read more, experience more, and are exposed to more, you get different information. When that happens, adapt.

The second point is that your goal itself may change. I am focusing on “Big Data” this year, and with the changes we’ve made in Windows and SQL Azure at Microsoft, this fits neatly with both my own professional goals and those of the company I work for. Actually, my goals in technology haven’t changed in the 27+ years I’ve worked in IT, in roles from electronics, programming, consulting, management and architecture to my current technical role here at Microsoft. I think that it has always been about data - everything in IT is an interface to data. And I have always wanted to be at the center of that. Data Science involves not just the sourcing, administration and movement of data, but also applying scientific (with an emphasis on mathematical) disciplines to get at the meaning the situation needs.

So that brings me to this choice. My friend Jeremiah Peschka found this resource for a role I am VERY interested in - the “Data Scientist”. It’s a combination of high-end mathematics, Data Analysis and Big Data. The resource is a series of books from O’Reilly for that very title. You can find that here.

Personally, I find the grouping of books a little cobbled together. They are all fine books, but I’m not certain how they lead you through the sequence of knowledge required for the topic. That’s a post for another day. Within that series of books is the one I’m reviewing today. I started (since there is no implied order in the books) with the “Data Analysis” book, but it seemed to start in the middle of some topics I needed to research, so I switched to reading this one, and chose it as my December book.

Another note here - December is a tough month. Since so many people take vacation time during this month, most of my clients try to get as much work in before the Holidays as possible. Since they are all doing that at once, it makes for a lot of overtime. Also, I travel to see family, which of course puts me out of pocket for a while myself. So staying on track with the books - especially one that demands heavy use of computing, math and focus - is hard. It’s tough to maintain your goals all of the time, but keeping in mind why you do this is the important thing. It will keep you on track.

What I learned:

This book focuses more on what the title says - it’s more about being mindful of the way you use statistics than the statistics themselves. It’s assumed you know not only the basics of statistics (I used these free lessons as a refresher, along with some of my old stats books) but also how they are used. The author doesn’t stop to explain a lot of the stats he uses, but periodically he does show why a given formula works the way it does. This is very useful, and helps with understanding the point of using one method over another. He also does a great job of using statistics to verify other statistics.

Although it should be obvious, the meaning of the data is essential. We think about this when we deal with the result of data processing, but not necessarily when we work with the sources. For instance - as the author explains central tendency, smoothing and so on using statistical methods, he introduces some numbers and asks you to guess the central number from the set. Dutifully you work out the answer, but in time he reveals that it’s a series of numbers on a die - which of course can only be whole numbers. The point is that you’re so focused on getting the right answer that you don’t define what the real problem is first.
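To make that point concrete, here is a minimal sketch in Python (the book's language). The numbers are my own assumption - one roll of each face of a six-sided die - and not the set the author actually uses. It shows that a dutifully computed mean can be a value the die could never show.

```python
import statistics

# Assumed data for illustration: one of each face of a six-sided die.
# This is NOT the data set from the book.
rolls = [1, 2, 3, 4, 5, 6]

# The "central number" most of us would compute first.
mean_roll = statistics.mean(rolls)

# A die can only land on one of its faces, so check whether the mean
# is even a possible outcome.
mean_is_possible = mean_roll in set(rolls)

print("mean:", mean_roll)
print("mean is a possible roll:", mean_is_possible)
```

The arithmetic is correct, but the answer only makes sense once you know what the numbers represent.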

Another great tool - and a fascinating study that I need to look into further - is the fact that you can often make at least educated inferences from data in ways you might not imagine. For instance, he talks about the example of a series of train cars, numbered sequentially. You see a train car numbered “60” - can you guess with any certainty how many train cars the company has? Fascinating stuff.
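Here is a rough sketch of one way to reason about the train car question in Python. It is not necessarily how the author works it out. I am assuming a uniform prior (any fleet size from 1 to 1000 is equally plausible before you look, and the 1000 cap is my own arbitrary choice) and that every car in the fleet is equally likely to be the one you see. The function name is made up for this example.

```python
def posterior_fleet_size(observed, max_size):
    """Posterior over fleet size after seeing one car number.

    Assumptions (mine, for illustration): a uniform prior on
    1..max_size, and each car equally likely to be observed.
    """
    sizes = range(1, max_size + 1)
    # Chance of seeing this number in a fleet of n cars:
    # 1/n if the fleet is at least that big, otherwise impossible.
    weights = [1.0 / n if n >= observed else 0.0 for n in sizes]
    total = sum(weights)
    return {n: w / total for n, w in zip(sizes, weights)}


posterior = posterior_fleet_size(observed=60, max_size=1000)

# Single most probable fleet size under these assumptions.
most_likely = max(posterior, key=posterior.get)

# Average fleet size weighted by posterior probability.
posterior_mean = sum(n * p for n, p in posterior.items())

# A simple non-Bayesian rule of thumb for one observation: 2m - 1.
simple_estimate = 2 * 60 - 1

print("most likely size:", most_likely)
print("posterior mean:", round(posterior_mean))
print("simple estimate:", simple_estimate)
```

The answers differ a lot depending on which summary you pick and on the assumed upper bound, which is really the point: you can make an educated guess, but it depends heavily on what you assume going in.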

He includes a glossary at the end of each chapter. I found this a great approach for summarizing the information in one place, and really helpful in making sure I understood everything before moving on. I didn’t always, so I had to re-read parts of the book and freshen up my stats knowledge along the way as well.

He uses Python as the language of choice - which I found a bit unusual. Most of the stats profession uses something more like the R language, which I’ve also started learning, and one of the other books in this series includes R as a primary subject. Because the author uses Python, he includes references to a series of libraries you add into it to work through the examples. Python certainly is a Data Scientist’s tool, just normally not for statistics. The author uses great examples and assignments, but doesn’t really follow up on those. I guess I’d rather see those introduced earlier in the chapter and explained better. He tends to jump around a bit, and his references are to Wikipedia, which isn’t always as reliable or thorough as it could be. But these are small quibbles. It’s a good book, and I learned a lot reading it. In fact, I have lots of concepts to unpack based on what I read.

Comments (1)

  1. RK says:

    "Python certainly is a Data Scientist’s tool, just normally not for statistics."

    Correction here — It is definitely used for solving problems that are statistical in nature. For Big datasets, R is particularly painful to work with. It is just that statisticians tend to use R rather than Python.

    Thanks for the review …
