No Sweat Statistics

Thursday, October 25, 2012

The Toughest Part ...

...finding the data.

One reason that my posts have been getting sparser is the difficulty of finding the data that I'm looking for. Even in this age of "big data", finding the data of interest is not trivial (in fact, it may not exist). Here are two examples supporting my point.

First, even when the data does exist, it can take a long time to get it all collected. For example, in my post explaining why Roger Federer should never hit a second serve, I collected the data by going through match statistics for over 30 matches on the Wimbledon website and recording the numbers in Excel. As you can imagine, this gets very boring very quickly.

Second, sometimes it is unreasonably difficult to find the correct data, even when you are sure that it exists somewhere. For example, I was interested in writing a post about whether NFL players should attempt a kick-off return when catching the ball in the end zone, or if they should take a touch-back and start on the 20 yard line. After about an hour of searching on Google, I came up dry. The closest thing that I could find is the average kick-off return and number of touch-backs per game, but it doesn't tell us where the player caught the ball (in the end zone or not). Additionally, this would still require me to aggregate the data by hand across all games.

If you ever find a "ready" dataset that could help answer an interesting question, send me an email and I'll be happy to take a shot at analyzing the data and turning it into a post.

Friday, October 12, 2012

Why the NFL is supporting the wrong cancer research

Unless you live at the bottom of the ocean, you have probably noticed that the NFL is showing support for breast cancer research by having the players and officials wear pink accessories (sounds a little girly when I say it like that). As a cancer researcher, I think it's great that the league with the most exposure in the USA is joining the American Cancer Society in its fight to end cancer. However, the NFL is making a huge mistake by choosing to support breast cancer research over prostate cancer research. Here are a few statistics from the American Cancer Society that may surprise most people:

1. Approximately 1 out of every 6 men will develop prostate cancer in his lifetime. In contrast, 1 out of every 8 women and fewer than 1 out of every 1,000 men will develop breast cancer in her/his lifetime.

2. There will be an estimated 241,740 patients diagnosed with prostate cancer in 2012, compared to 229,060 new cases of breast cancer (<1% of those cases occurring in males).

3. Treatments are not as effective for breast cancer as for prostate cancer, and this is reflected in the 5-year survival rates: 99% of prostate cancer patients will survive 5 years, compared to only 89% of breast cancer patients. However, until the 1990s, males with prostate cancer had a higher 5-year mortality rate than females with breast cancer.

4. It is estimated that 39,510 women and 410 men will die of breast cancer in 2012. An estimated 28,170 men will die of prostate cancer this year. That is, over 65 times as many men will die of prostate cancer as of breast cancer.

5. Prostate cancer is the second most deadly cancer type for males, behind only lung cancer.

6. African American males are 1.6 times more likely to develop prostate cancer and 2.5 times more likely to die from it than white males.
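
The ratios implied by the list above can be cross-checked with a few lines of arithmetic. This is only a rough sketch: the inputs are the figures quoted in this post, and the variable names are my own. Note that the lifetime breast cancer risk for men is quoted as "fewer than 1 in 1,000", so the risk ratio computed here is a lower bound.

```python
# Figures quoted in the list above (American Cancer Society estimates for 2012).
prostate_deaths_men = 28_170
breast_deaths_men = 410
breast_deaths_women = 39_510
prostate_cases = 241_740
breast_cases = 229_060

# Lifetime risk for men as quoted: 1 in 6 for prostate cancer and
# fewer than 1 in 1,000 for breast cancer (so 1/1000 is an upper bound).
prostate_risk_men = 1 / 6
breast_risk_men_upper = 1 / 1000

# How many men die of prostate cancer for each man who dies of breast cancer.
death_ratio_men = prostate_deaths_men / breast_deaths_men

# Lower bound on how much more likely a man is to develop prostate cancer.
risk_ratio_men = prostate_risk_men / breast_risk_men_upper

# New prostate cancer cases per new breast cancer case (both sexes).
case_ratio = prostate_cases / breast_cases

# Total breast cancer deaths (women plus men) per prostate cancer death.
overall_death_ratio = (breast_deaths_women + breast_deaths_men) / prostate_deaths_men

print(f"Male deaths, prostate vs. breast: {death_ratio_men:.1f}x")
print(f"Male lifetime risk, prostate vs. breast: at least {risk_ratio_men:.0f}x")
print(f"New cases, prostate vs. breast: {case_ratio:.2f}x")
print(f"All deaths, breast vs. prostate: {overall_death_ratio:.2f}x")
```

The death ratio of roughly 69 agrees with the "over 65 times" statement in item 4, and the risk ratio of at least 167 is in line with the "150 times" figure used further down.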

Do these numbers surprise you? While breast cancer is a more deadly disease than prostate cancer, all of the support for breast cancer month and "wearing pink" makes it seem like the disparity between the two diseases is much larger. Considering that there is not a single female player in the NFL, the league is going out of its way to promote research for a cancer that its players are 150 times less likely to develop than prostate cancer (supporting evidence: allowing players to wear pink shoes and towels but fining them $5,000 for wearing a red undershirt). Additionally, with the NFL consisting of a large proportion of African American males, you would think it would support research for a disease with significant racial disparities.

If you happen to meet Roger Goodell on the street and point these facts out to him, he will mention that the NFL is committed to promoting prostate health, and he is technically correct. However, I don't see the NFL encouraging players to wear blue during September. A bit hypocritical, don't you think?

Saturday, September 8, 2012

How accurate are football preseason polls?

I'm a few weeks late on this post, but I think it's still worth blogging about. Every year, the media (especially ESPN) makes a huge deal about college preseason football polls. With no games yet played, these polls are little more than speculation. I wanted to look into how accurate these polls are at choosing that season's national champion. In this post, I will be using exclusively the AP poll preseason and final results, which can make a difference in years before the BCS, when there could be multiple national champs depending on the poll used.

This first plot shows the final ranking of preseason top 5 teams since 1990. The bar furthest to the right shows the teams that were in the top 5 preseason poll but finished the year unranked.

A few interesting notes:
1. 14 of the past 22 national champions were ranked in the preseason top 5 (I think the last winner outside the top 5 was Auburn led by then-unknown Cam Newton).
2. More national champs were ranked preseason #2 than preseason #1. This is good news for Alabama who started this season ranked #2 behind USC (but who jumped to #1 after their first win).
3. Looking only at the preseason #1 teams (blue bars), they are more likely to finish the season ranked 3rd than any other rank. Also, no preseason #1 has finished worse than #16 in the final polls.

Next, I wanted to look at whether teams ranked higher in the preseason poll tended to be ranked higher at the end of the season. A simple way to do this is to look at the median finish of the top 5 preseason teams.

**Median Final Ranking since 1990**
Preseason Rank	Median Final Rank
1	3
2	5
3	6.5
4	9.5
5	8
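
For anyone who wants to reproduce a table like this, here is a minimal sketch of the median calculation. The raw poll data is not included in this post, so the sample records below are made-up placeholders, not real results. The function name, the record format, and the choice to count an unranked finish as 26 (one spot below the 25 ranked places in the AP poll) are all assumptions for illustration, not necessarily how the numbers above were produced.

```python
from collections import defaultdict
from statistics import median

# Assumption: an unranked finish is counted as one place below the 25 ranked spots.
UNRANKED = 26


def median_final_rank(records):
    """Return {preseason_rank: median final rank}.

    `records` is an iterable of (preseason_rank, final_rank) pairs,
    where final_rank is None if the team finished the season unranked.
    """
    finishes = defaultdict(list)
    for preseason_rank, final_rank in records:
        finishes[preseason_rank].append(
            UNRANKED if final_rank is None else final_rank
        )
    return {rank: median(vals) for rank, vals in sorted(finishes.items())}


# Made-up placeholder records, NOT real poll results.
sample = [
    (1, 3), (1, 1), (1, 7),
    (2, 5), (2, None), (2, 2), (2, 9),
]

result = median_final_rank(sample)
print(result)
```

Because the median only depends on the ordering of the finishes, the exact number assigned to unranked teams does not matter as long as it is worse than 25 and fewer than half of the teams at a given preseason rank finished unranked.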

So although the national champions are not always ranked preseason #1, the top preseason teams in general finish higher in the standings than the other preseason teams. The exception is that teams ranked #5 tend to finish the season ranked better than the #4 preseason team. There could be some bias causing this result, as there needs to be some tie-breaker between teams with the same final season record, and this may be influenced by the preseason rankings.

Finally, I wanted to see how the final rankings of the previous season influence the preseason polls of the next season. For example, do teams who finish the year #1 tend to be the top ranked preseason team the following year (even though 1/4 of the team likely graduated)?

This plot shows that the preseason top 5 teams tended to finish the previous season ranked highly. We see that, since 1990, 9 of the 23 teams finishing the previous season #1 were the top ranked preseason team. This is a somewhat questionable strategy, as only 2 teams have repeated as national champs since 1990: Nebraska in 1994-95 (who was not ranked preseason #1 in 1995) and USC in 2003-04.

In summary, while the top preseason team usually does not win the national championship, it typically finishes the season ranked better than any other preseason team.