Hello Dear Reader! Continuing our focus on real data: last year Josh Luedeman (@JoshLuedeman) and I were trying to figure out this exact problem. How could we find real, free, and very cool data? Enter RetroSheet.org.
RetroSheet is a team of volunteers who have managed to salvage the play-by-play data for every game of baseball from 1922 - 2016*. Josh and I saw this data set as a great way to use some new tools and learn more about the teams and the game we love. We used our data skills to ETL out all of the box scores, cleanse the data, and review it for accuracy. Then we decided to focus on a subset of the data covering the modern era of baseball. For this we selected all games played between 1990 and 2016.
Now that we had the data, what we needed was a question. Eventually we settled on this one: what is the best day of the week to see your team play? My favorite team is the Chicago Cubs, and has been my whole life. Last year was a good year to be a Cubs fan: they broke a 108-year drought and won the World Series. This picture is from my trip up to Wrigley Field to see Game 5 of the World Series last year.
Enough exposition, let's look at the data! This is an interactive embedded Power BI report, so please pick your favorite team and look at the results! I'll keep typing below.
My predicament is that I have a family of 5 that I would want to take to a game. Attending an in-person baseball game can get expensive. Winning and losing in sports can depend on a lot of different factors. I wanted to keep this simple. Could I find the best day of the week to see my team play?
Historic Data Analysis based on:
- Day of the week a game is played on
- If it is a home or away game
- Does my team have a winning or losing record
- What opponent are they facing
I felt this would give me a good basis to determine if it was worth it to make an expensive trip to take the family to a game. I started out my experiment in R. I found very quickly that a trend emerged when I looked at only the days of the week and wins and losses. This is a plot I created in RStudio.
In this we can see that the day of the week the Cubs win the most on is within a grouping of Saturday, Friday, and Sunday. Monday looks low on the list. When I look at the day they lose the least, that happens to be Monday.
| Day of the Week | Wins | Losses | TotalGames |
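A tally like the one behind this table can be sketched in a few lines. This is a Python stand-in, not the R and SQL we actually used. The game rows, the tuple layout, and the `tally_by_day` and `win_pct` names are all made up for illustration; they are not real Cubs results or the RetroSheet file format.

```python
from collections import defaultdict
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Hypothetical rows: (game date, runs scored, runs allowed).
# Invented data, used only to exercise the logic below.
SAMPLE_GAMES = [
    (date(2016, 4, 4), 9, 0),   # Monday, win
    (date(2016, 4, 8), 2, 3),   # Friday, loss
    (date(2016, 4, 9), 4, 2),   # Saturday, win
    (date(2016, 4, 10), 1, 7),  # Sunday, loss
    (date(2016, 4, 11), 5, 3),  # Monday, win
    (date(2016, 4, 15), 6, 1),  # Friday, win
]


def tally_by_day(games):
    """Count wins, losses, and total games for each day of the week."""
    tally = defaultdict(lambda: {"wins": 0, "losses": 0, "total": 0})
    for game_date, runs_for, runs_against in games:
        day = DAYS[game_date.weekday()]  # weekday(): Monday is 0
        row = tally[day]
        row["total"] += 1
        if runs_for > runs_against:
            row["wins"] += 1
        elif runs_for < runs_against:
            row["losses"] += 1
        # Equal scores (ties) count toward the total only.
    return dict(tally)


def win_pct(row):
    """Wins divided by decided games, or None if there are none."""
    decided = row["wins"] + row["losses"]
    return row["wins"] / decided if decided else None


if __name__ == "__main__":
    for day, row in tally_by_day(SAMPLE_GAMES).items():
        print(day, row, win_pct(row))
```

Looking at win percentage next to the raw counts matters because the days do not all have the same number of games, so the day with the most wins and the day with the fewest losses can be different days.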
After looking at the data, it is clear the most likely day for me to go see the Cubs and have them win is Monday! I interpreted a winning record to mean a record of .500 or greater. At home on a Monday when the Cubs have a winning record, I have a 60% chance of seeing them win. If I'm going to take the family all the way up to Chicago to see a game, I should do it on a Monday.
There are a lot of cool things waiting to be discovered in this data. For instance, did you know that if you are a New York Yankees fan and the Yankees have a losing record, they have a better chance of winning on a Thursday than they do if they have a winning record? We'll talk about how we built this in a future blog post. We used Azure Storage, PolyBase, SQL Server with advanced analytics, SQL Azure DB, R, RStudio, and Power BI to develop this post. Thank you for reading!
How about you, Dear Reader? Do you have a favorite team? Did you notice any interesting trends? Sound off below!
*There are some games where data may be missing, and there may be some errors in the data set. Even so, for many people RetroSheet.org is the go-to place for baseball data. This data and anything we put forward may have a +/- confidence value of 1 standard deviation. This Sample Code is provided for the purpose of illustration only and is not intended to be used in a production environment. Please don't use this information for anything that could cost you money. This data is for fun, and should not be used to interpret future events or outcomes.