Thursday, October 25, 2012

The Toughest Part ...

...finding the data.

One reason that my posts have been getting sparser is the difficulty of finding the data that I'm looking for.  Even in this age of "big data", finding the data of interest is not trivial (in fact, it may not exist at all).  Here are two examples that support my point.

First, even when the data does exist, it can take a long time to get it all collected.  For example, in my post explaining why Roger Federer should never hit a second serve, I collected the data by going through match statistics for over 30 matches on the Wimbledon website and recording the numbers in Excel.  As you can imagine, this gets very boring very quickly.

Second, sometimes it is unreasonably difficult to find the correct data, even when you are sure that it exists somewhere.  For example, I was interested in writing a post about whether NFL players should attempt a kick-off return when catching the ball in the end zone, or whether they should take a touch-back and start on the 20-yard line.  After about an hour of searching on Google, I came up dry.  The closest thing that I could find was the average kick-off return and the number of touch-backs per game, but neither tells us where the player caught the ball (in the end zone or not).  Additionally, using these numbers would still require me to aggregate the data by hand across all games.

If you ever find a "ready" dataset that could help answer an interesting question, send me an email and I'll be happy to take a shot at analyzing the data and turning it into a post.
