No Sweat Statistics: 2012

Sunday, November 25, 2012

Moneyball: Grinnell College Style

One of the big sports stories this past week was Jack Taylor of Grinnell College scoring 138 points in a single basketball game. When discussing this story with my family and friends, the conversation always seems to steer towards the "ethical-ness" (or sportsmanship) of the coach's strategy: essentially going for steals using full-court defensive pressure and giving up easy 2's in order to take lots of open 3-pointers at the other end (Taylor was 27-for-71 from the 3-point line and 52-for-108 overall).

It turns out that the Grinnell coach's strategy is based on a student's project done in the early 1990's that pointed out statistical patterns that were keys to winning. These include:

Attempt at least 25 more shots than the opponent
Take at least 94 shots per game
Take at least half of all shots as 3-point attempts
Rebound at least 33% of the team's missed shots
Force the opponent into at least 32 turnovers
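
As a rough sketch, the five keys above can be written as a checklist applied to one game's team totals. The class name, field names and the sample numbers are hypothetical illustrations, not something taken from the student project or from a real box score.

```python
from dataclasses import dataclass


@dataclass
class GameTotals:
    """Team totals for one game (field names are illustrative assumptions)."""
    shots: int                 # field-goal attempts by the team
    three_point_attempts: int  # how many of those attempts were 3-pointers
    missed_shots: int          # attempts that did not go in
    offensive_rebounds: int    # the team's own misses that it rebounded
    opponent_shots: int        # field-goal attempts by the opponent
    opponent_turnovers: int    # turnovers forced


def keys_met(g: GameTotals) -> dict:
    """Return, for each of the five keys, whether this game satisfied it."""
    return {
        "25+ more shots than opponent": g.shots - g.opponent_shots >= 25,
        "94+ shots": g.shots >= 94,
        "half of shots are 3s": 2 * g.three_point_attempts >= g.shots,
        "33%+ of misses rebounded": g.offensive_rebounds >= 0.33 * g.missed_shots,
        "32+ opponent turnovers": g.opponent_turnovers >= 32,
    }


# Made-up totals, only to show how the checklist is used.
example = GameTotals(shots=100, three_point_attempts=55, missed_shots=60,
                     offensive_rebounds=21, opponent_shots=70,
                     opponent_turnovers=30)
for key, ok in keys_met(example).items():
    print(f"{key}: {'yes' if ok else 'no'}")
```

With these made-up totals, every key is met except the turnover one.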

More recently, a pair of students at Grinnell re-examined these keys and came to a slightly different conclusion (namely, that turnover differential is more important than number of opponent turnovers). You can access the paper here, about half of which is statistical, but the other half should be easily understood by all. While the analysis is nice, by far the coolest thing about the paper is that the students interviewed the head coach, David M. Arseneault, got his feedback on their work, and included his comments in their paper. While Grinnell is only a DIII school, I think it's great that a head college basketball coach is not only willing to meet with and discuss a statistics project with undergrads, but has built his strategy based on previous students' work (no mention of whether he has incorporated these latest suggestions into his strategy).

Congrats to Jack Taylor and Coach Arseneault! Only time will tell if all of the publicity surrounding this game will prompt other coaches to adopt his strategy, or at least devise their own strategies based on statistics (with bonus points for using undergrads to help).

(Thanks to Simply Statistics for the link to the student paper described here.)

Thursday, October 25, 2012

The Toughest Part ...

...finding the data.

One reason that my posts have been getting sparser is the difficulty in finding the data that I'm looking for. Even in this age of "big data", finding the data of interest is not trivial (in fact, it may not exist). Here are 2 examples supporting my point.

First, even when the data does exist, it can take a long time to get it all collected. For example, in my post explaining why Roger Federer should never hit a second serve, I collected the data by going through match statistics for over 30 matches on the Wimbledon website and recording the numbers in Excel. As you can imagine, this gets very boring very quickly.

Second, sometimes it is unreasonably difficult to find the correct data, even when you are sure that it exists somewhere. For example, I was interested in writing a post about whether NFL players should attempt a kick-off return when catching the ball in the end zone, or if they should take a touch-back and start on the 20 yard line. After about an hour of searching on Google, I came up dry. The closest thing that I could find is the average kick-off return and number of touch-backs per game, but it doesn't tell us where the player caught the ball (in the end zone or not). Additionally, this would still require me to aggregate the data by hand across all games.

If you ever find a "ready" dataset that could help answer an interesting question, send me an email and I'll be happy to take a shot at analyzing the data and turning it into a post.

Friday, October 12, 2012

Why the NFL is supporting the wrong cancer research

Unless you live at the bottom of the ocean, you have probably noticed that the NFL is showing support for breast cancer research by having the players and officials wear pink accessories (sounds a little girly when I say it like that). As a cancer researcher, I think it's great that the league with the most exposure in the USA is joining the American Cancer Society in its fight to end cancer. However, the NFL is making a huge mistake by choosing to support breast cancer research over prostate cancer. Here are a few statistics from the American Cancer Society that may surprise most people:

1. Approximately 1 out of every 6 men will develop prostate cancer in his lifetime. In contrast, 1 out of every 8 women and less than 1 out of every 1,000 men will develop breast cancer in her/his lifetime.

2. There will be an estimated 241,740 patients diagnosed with prostate cancer in 2012, compared to 229,060 new cases of breast cancer (<1% of those cases occurring in males).

3. Treatments are not as effective for breast cancer as prostate cancer, and this is reflected in the 5 year survival rates: 99% of prostate cancer patients will survive 5 years, compared to only 89% for breast cancer. However, until the 1990s, males with prostate cancer had a higher 5-year mortality rate than females with breast cancer.

4. It is estimated that 39,510 women and 410 men will die of breast cancer in 2012. An estimated 28,170 men will die of prostate cancer this year. That is, over 65 times more men will die of prostate cancer than breast cancer.

5. Prostate cancer is the second most deadly cancer type for males, behind only lung cancer.

6. African American males are 1.6 times more likely to develop prostate cancer and 2.5 times more likely to die from it than white males.
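
The ratios behind points 2 and 4 are easy to check. This is just a quick sketch of the arithmetic using the 2012 estimates quoted in the list above; the variable names are my own.

```python
# 2012 estimates quoted in the list above.
prostate_cases = 241_740
breast_cases = 229_060
prostate_deaths = 28_170
breast_deaths_women = 39_510
breast_deaths_men = 410

# Among men only: prostate cancer deaths per breast cancer death.
ratio_men = prostate_deaths / breast_deaths_men

# Across everyone: breast cancer deaths per prostate cancer death.
ratio_deaths = (breast_deaths_women + breast_deaths_men) / prostate_deaths

# New diagnoses: prostate cancer cases per breast cancer case.
ratio_cases = prostate_cases / breast_cases

print(f"Prostate vs. breast cancer deaths in men: {ratio_men:.1f} to 1")
print(f"Breast vs. prostate cancer deaths overall: {ratio_deaths:.2f} to 1")
print(f"Prostate vs. breast cancer new cases: {ratio_cases:.2f} to 1")
```

The first ratio comes out a little under 69, which is where the "over 65 times" in point 4 comes from.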

Do these numbers surprise you? While breast cancer is a more deadly disease than prostate cancer, all of the support for breast cancer month and "wearing pink" makes it seem like the disparity between the two diseases is much larger. Considering that there is not a single female player in the NFL, the league is going out of its way to promote research for a cancer that its players are 150 times less likely to develop than prostate cancer (supporting evidence: allowing players to wear pink shoes and towels but fining them $5,000 for wearing a red undershirt). Additionally, with the NFL consisting of a large proportion of African American males, you would think it would support research for a disease with significant racial disparities.

If you happen to meet Roger Goodell on the street and point these facts out to him, he will mention that the NFL is committed to promoting prostate health, and he is technically correct. However, I don't see the NFL encouraging players to wear blue during September. A bit hypocritical, don't you think?

Saturday, September 8, 2012

How accurate are football preseason polls?

I'm a few weeks late on this post, but I think it's still worth blogging about. Every year, the media (especially ESPN) makes such a huge deal about college preseason football polls. Without any games yet to be played, these polls are little more than speculation. I wanted to look into how accurate these polls are at choosing that season's national champion. In this post, I will be using exclusively the AP poll preseason and final results, which can make a difference in years before the BCS when there could be multiple national champs based on the poll used.

This first plot shows the final ranking of preseason top 5 teams since 1990. The bar furthest to the right shows the teams that were in the top 5 preseason poll but finished the year unranked.

A few interesting notes:
1. 14 of the past 22 national champions were ranked in the preseason top 5 (I think the last winner outside the top 5 was Auburn led by then-unknown Cam Newton).
2. More national champs were ranked preseason #2 than preseason #1. This is good news for Alabama who started this season ranked #2 behind USC (but who jumped to #1 after their first win).
3. Looking only at the preseason #1 teams (blue bars), they are more likely to finish the season ranked 3rd than any other rank. Also, no preseason #1 has finished worse than #16 in the final polls.

Next, I wanted to look at whether teams ranked higher in the preseason poll tended to be ranked higher at the end of the season. A simple way to do this is to look at the median finish of the top 5 preseason teams.

**Median Final Ranking since 1990**
Preseason Rank	Median Final Rank
1	3
2	5
3	6.5
4	9.5
5	8

So although the national champions are not always ranked preseason #1, the top preseason teams in general finish higher in the standings than the other preseason teams. The exception is that teams ranked #5 tend to finish the season ranked better than the #4 preseason team. There could be some bias causing this result, as ties between teams with the same final season record have to be broken somehow, and this may be influenced by the preseason rankings.

Finally, I wanted to see how the final rankings of the previous season influence the preseason polls of the next season. For example, do teams who finish the year #1 tend to be the top ranked preseason team the following year (even though 1/4 of the team likely graduated)?

This plot shows that the preseason top 5 teams tended to finish the previous season ranked highly. We see that, since 1990, 9 of the 23 teams finishing the previous season #1 were the top ranked preseason team. This is a somewhat questionable strategy, as only 2 teams have repeated as national champs since 1990: Nebraska in 1994-95 (who was not ranked preseason #1 in 1995) and USC in 2003-04.

In summary, while the top preseason team more often than not does not win the national championship, on average they finish the season ranked better than any other preseason team.

Thursday, August 16, 2012

Swimming and the Fast Suit Aftermath

At the end of 2009, the governing body of competitive swimming, FINA, banned the use of high-tech full-body fast suits (try saying that 5 times!). See here for a summary. As proof of the influence of fast suits, all but 2 world records (both men's and women's) were broken in either 2008 or 2009! The general consensus was that these world records would be untouchable for a long time, and for the most part, that has been true. However, 8 world records were broken at the 2012 Olympics. I am now going to answer the question, "Were the winning Olympic times significantly slower than the world records?"

First, let's look at a box plot of the difference between the world record and the winning Olympic time (a value less than zero denotes that the world record was broken). Note: 2 men's world records were set in 2011, so I am comparing to these current records rather than pre-2010 records.

While the times are generally above zero (slower than WR time), the boxplot whiskers do extend below zero. There is one clear outlier for the men, and this occurred when the 1500m free WR was broken by 3 seconds. Most of the variation is due to events being different distances (50, 100, 200, 400, 800 and 1500m). To account for this difference, I have normalized all times to 100m (multiply 50m time differences by 2, divide 200m time differences by 2, etc.). The normalized times are reported in the following box plots. Now the times are much less variable and there are no clear outliers.
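
The normalization described above is just a rescaling by distance. Here is a minimal sketch; the function name is my own, and apart from the 1500m case the sample differences are made up for illustration.

```python
def normalize_to_100m(diff_seconds: float, distance_m: int) -> float:
    """Scale a (winning time - world record) difference to a per-100m value."""
    return diff_seconds * 100 / distance_m


# A 50m difference is doubled, a 200m difference is halved, and so on.
print(normalize_to_100m(0.3, 50))     # made-up 50m difference
print(normalize_to_100m(1.0, 200))    # made-up 200m difference
print(normalize_to_100m(-3.0, 1500))  # 1500m free WR broken by about 3 seconds
```

On this scale the 1500m outlier shrinks to about two tenths of a second per 100m, which is why it no longer stands out.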

To officially answer our question of whether times were significantly slower without fast suits, I performed a t-test for mean difference. Our null hypothesis is:

H0: no difference between average world record time and winning Olympic time.

Leaving out the details, we obtain p-values of 0.18 for the men and 0.26 for the women. Since these p-values are large (> typical cutoff of 0.05), we fail to reject the null hypothesis. We can conclude that there is no statistically significant evidence that the winning swimming times in the 2012 Olympics were slower than the world records. I also repeated the calculations after removing the 3 relays from the analysis and arrived at the same conclusion.
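
For readers who want the left-out details, here is a minimal sketch of the test statistic for a t-test on mean difference, assuming the input is the list of normalized (winning time - world record) differences. The function name is my own and the numbers in the example are placeholders, not the Olympic data.

```python
import math
import statistics


def t_statistic(diffs):
    """t statistic and degrees of freedom for H0: the mean difference is zero."""
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(n)), n - 1


# Placeholder differences in seconds per 100m, NOT the real results.
placeholder_diffs = [0.4, -0.1, 0.3, 0.2, -0.2]
t, df = t_statistic(placeholder_diffs)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

The p-value is then read off a t distribution with n - 1 degrees of freedom. If SciPy is installed, `scipy.stats.ttest_1samp(diffs, 0)` should give the same statistic along with a two-sided p-value.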

We cannot tell from this analysis whether the fast suits had a smaller influence on times than originally thought, or whether swimmers are just training harder and getting stronger (I tend to believe the latter). It's also too early to tell if any of these records will be thought of as unbreakable (example: Phelps's 2008 Olympic performance + fast suit = some really fast world records). But I think we can safely conclude that, unlike the steroid era in baseball, world records set in the fast suit era will not require an asterisk.

Monday, August 13, 2012

How Usain Bolt can rival Michael Phelps

OK, so this post isn't exactly statistics related, but the Olympic coverage comparing Usain Bolt to Michael Phelps is ridiculous. First, let me say that without a doubt, Bolt is the fastest man alive. His performances are the highlight of track and field at the Olympics. But winning back-to-back gold medals in the 100 and 200 is nowhere near Phelps's 22 medals (18 gold) over 3 Olympics. Here are 3 ways, in my opinion, for Bolt to end his career on the same page as Phelps.

1. Compete in at least 4 Olympic games. Phelps competed as a 15 year old at Sydney, swimming in the 200 fly. Combined with Athens, Beijing and London, Phelps swam in 4 Olympics. Bolt is only half way there with 2 Olympics.

2. Win both the 100 and 200 at Rio 2016. Phelps became the first swimmer to (twice) win gold in the same event in 3 consecutive Olympics (100 fly, 200 IM), while just missing out on three-peating with the 200 fly.

3. Add additional events. Every commentator who says that Bolt does not have as many opportunities to race as Phelps should be fired on the spot. Here are other reasonable events for him to race.

400 m: This is just running two 200s in a row.
4 x 400 relay: This is the most reasonable race for him to add. Phelps swims the 4x100 free relay (turning in the 2nd fastest split this Olympics), yet he has never swum the 100 free as an individual event. Bolt doesn't need to be the fastest 400 runner to win a medal, but be part of the fastest team.
110 m hurdles: Yes, this involves hurdles, but he's tall enough to make it over the hurdles and would definitely be the fastest pure runner in the race.
Long jump: Jesse Owens won 4 gold medals in the 1936 Olympics including the 100, 200, 4x100 and long jump. Bolt has never had a single Olympics as successful as Owens's performance.
High jump and triple jump: see above.

I just listed 6 additional events for Bolt to possibly compete in. Yes, many of the events would take him out of his comfort zone, but winning these off events is what distinguishes legends from greats. If he added the 4x400 relay with another individual event and won golds in those events, I would then start to think of Bolt as competing on equal footing with Phelps.

And please don't argue that the schedule wouldn't work. Phelps (and Lochte, Franklin, etc.) won gold medals within an hour of swimming in another final or semi-final. I've yet to see a top Olympic track athlete push themselves and race multiple finals/semi races in the same day. So Bolt could be innovative in this manner too.

Finally, Ranomi Kromowidjojo, a female swimmer from the Netherlands, won gold in both the 50 and 100 free and silver in the 4x100 free relay this Olympics. If she wins gold in all 3 events next Olympics and sets a few records in the process, will she be considered the greatest female swimmer ever? NO. But isn't her event schedule comparable to Bolt's (minus the relay)? YES!

Saturday, July 28, 2012

Why Roger Federer (and the entire women's tour) should never hit a second serve

In the 2012 Wimbledon Men's Championship match, Roger Federer beat Andy Murray to win his record-tying 7th Wimbledon title and record 17th major overall. But it was a rough start for Federer, as he lost the first set, which included losing his first 7 second serve points (he won 3/11 overall in the first set). Federer was having a much easier time with his first serve (as do most players), winning 65% of the points when he made his first serve. This got me thinking ... does it ever make strategic sense for a player to never hit a slower second serve but always hit a first serve? The answer may surprise you (although I guess I give it away in the title).

I went through several matches for both men and women at this year's Wimbledon. Let's use the men's final as an example. Over the whole match, Federer made his first serve 68.7% of the time, winning 75.6% of those first serve points. Multiplying these two numbers together, we see that 51.9% of the time Federer made his first serve and won the point (conversely, 48.1% of the time he either made his first serve and lost the point or had to hit a second serve). Federer won 48.8% of the points when he hit a second serve. This shows that Federer would have been better off going for his first serve even when hitting a second serve, as he would be expected to win an additional 3.1% (51.9 - 48.8) of second serve points. Federer hit 41 second serves, so he would expect to win an additional (41)*(3.1%) = 1.3 points in the match had he not hit his regular second serve. This doesn't sound like much, especially since Federer won the match, but it could translate to winning an additional game.
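
The calculation above can be wrapped in a small function so it is easy to repeat for other players. This is a sketch with names of my own choosing, and it makes the same assumption as the text: a first serve hit as a second serve goes in, and wins the point, at the same rates as a regular first serve.

```python
def expected_extra_points(p_first_in, p_win_first_in, p_win_second, second_serves):
    """Expected points gained by hitting a first serve in place of each second serve."""
    # The risky second serve wins the point only if it lands in AND the point is won;
    # a miss is a double fault and loses the point.
    p_win_risky = p_first_in * p_win_first_in
    return (p_win_risky - p_win_second) * second_serves


# Federer's numbers from the men's final, as quoted above.
gain = expected_extra_points(p_first_in=0.687, p_win_first_in=0.756,
                             p_win_second=0.488, second_serves=41)
print(f"Expected extra points: {gain:.1f}")
```

A positive result means the player would have been better off never hitting a regular second serve; a negative result means the regular second serve was the right call.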

One drawback of only hitting first serves is an increased number of double faults. Because Federer made 68.7% of his first serves, we would expect him to miss his first serve 31.3% of the time. If we consider two serves as independent events, the probability that he would miss two in a row and double fault is (31.3%)(31.3%) = 9.8%, resulting in an expected 12.8 double faults (he served 131 points in the match), much higher than the 3 double faults he actually served. But he would have also been expected to serve an additional 3.75 aces (work not shown, but trust me). So this turns out to be a net decrease of (12.8 expected double faults) - (3 actual df's) - (3.75 additional aces) = 6.05 points. How can this make sense if I just said that we expect Federer to win a net of 1.3 additional points? When he does make his first serve, he wins such a high percentage of those points compared to second serve points (75.6% to 48.8%) that it more than makes up for the double faults. In other words, the risk of only hitting first serves pays off for Roger Federer.
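
Here is the double-fault arithmetic as a short sketch, under the same independence assumption as above. The variable names are mine, and the 3.75 extra aces is taken from the text as given since that work is not shown.

```python
p_first_in = 0.687        # share of first serves that landed in
service_points = 131      # points Federer served in the match
actual_double_faults = 3
extra_aces = 3.75         # figure quoted in the text; derivation not shown there

# Two misses in a row, treating the two serves as independent events.
p_double_fault = (1 - p_first_in) ** 2
expected_double_faults = p_double_fault * service_points

net_points_lost = expected_double_faults - actual_double_faults - extra_aces

print(f"P(double fault) = {p_double_fault:.3f}")
print(f"Expected double faults = {expected_double_faults:.1f}")
print(f"Net points lost to double faults = {net_points_lost:.2f}")
```

Carrying full precision gives a net of about 6.08 instead of 6.05; the small gap is only because the text rounds the expected double faults to 12.8 before subtracting.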

I decided to look at all men's matches starting with the 4th round of Wimbledon. Of the 30 matches (or 60 players, some counted more than once), 10 of the players would have expected to benefit by only hitting first serves. Roger Federer was the only player to show up twice on the list (Finals and QF matches). Of these 10, 7 lost the match (Federer won both matches and Djokovic won his QF). In 3 of Andy Murray's 4 matches that I looked at, his opponent would have been better off only hitting first serves (Federer, Tsonga and Cilic), showing how great of a returner Murray is (the other opponent was Ferrer, who doesn't have a big first serve). The largest expected gain was for Tsonga, who could have expected to win an additional 7.7 points (2 games) against Murray in the semis.

I wanted to look a little more closely at how this strategy may impact the match. My original thought was that a player probably loses a set when he is serving poorly, so only hitting first serves would result in a lot more double faults and a net loss of points. I was surprised to see that this was not the case. Looking at the QF through the Finals, players lost 26 sets. In 10 of these 26 sets, the player who lost the set would have expected to win more points by serving only first serves. This definitely supports players only hitting first serves when they lose the set, but I guess this doesn't help much retrospectively.

Looking at sets won, for 19 of 26 sets, the winner would have expected to lose more points by only hitting first serves. However, only 10 of those 19 would be expected to lose more than 2 points. Interestingly, Federer could have won more points in 3 of the 9 sets he won by only hitting first serves and never would have expected to lose an additional 2.5 points in the 9 sets that he won. This shows how great his first serve was working during the Championships.

The players are back at Wimbledon for the next 2 weeks, but this time trying to win an Olympic medal. My recommendation would be for Federer to only hit first serves, but he is probably the only male player I would recommend this to regardless of opponent. Plus, this may have the added benefit of screwing up the opponent's game plan. I would also suggest that all of Andy Murray's opponents employ this strategy. But as I said before, no coach will tell their player to use this strategy. My guess is that this is due to aesthetic reasons - no coach wants to see his player double fault in the double digits even if he would win more points in the long run.

I also performed the same analysis for the women. To keep things short, every women's player should only hit first serves. Of the final 7 matches (14 players) that I looked at, only 5 players would not have benefited from this strategy. In fact, 4 of those 5 players who would not have benefited lost the match anyway (the only exception being Serena Williams in the SF). On average, the women would expect to win an additional 1.1 points per match. Compare this to the men, who on average would expect to lose an additional 2.5 points.

Will any players employ this strategy at the Olympics? I'm guessing not.

Sunday, July 15, 2012

Wimbledon 2012 wrap-up

Sorry for the long delay in posting - I'm still alive, but have been extremely busy with the new job. I have also decided to spend more time on "novel" posts where I analyze the data myself to answer new questions, as these seem to get a much better response from readers. But these posts take more time to write, so I will not be posting as often. I am finishing up a new analysis with some data from Wimbledon, and I hope to blog about this soon. For now, I want to give a quick update on my Wimbledon picks (you can read about my picks here).

First the good news: I correctly picked Serena Williams. The bad news: I picked Novak Djokovic, but he lost to eventual champion Roger Federer. So I went 1/2, or 3/4 when combining my 2 correct French Open Picks. When looking at the ESPN expert picks over these past 2 grand slams, only Chris Evert has correctly picked all 4 champions.

Let's hope I can keep up the steam going into the US Open in August/September.

Sunday, July 1, 2012

Are Free Agents Worth the Money? (Plot of the Week 9)

This week I created a few plots to look at the effect of major MLB and NBA free agent players switching teams.

First, I wanted to see if players tended to move to teams with a better chance at winning a championship (think LeBron James), or if they moved to worse teams that would pay them more money (Albert Pujols who left after winning a championship with the Cardinals). The first plot shows the regular-season team winning percentage for the three years before the player switched teams (old team) and three years after switching to a new team.

Note: Since some players recently switched teams (LeBron, Bosh, Pujols, Fielder), we don't have a full 3 years of data to look at.

Teams that win championships are denoted by small squares. LeBron and Bosh are the only players to win a championship within 3 years of switching teams. Pujols was the only player to switch teams after winning a championship within the past 3 years.
For this small sample size, it looks like basketball players (dashed lines) that switched teams moved to better teams, whereas baseball players received a lot of money to move to teams that had roughly equivalent records.

Next, let's look at the winning percentage of teams before and after signing a big free agent (i.e., the Heat before and after LeBron). The thought is that spending all of this money to sign a marquee player will help the team win more.

Basketball players have an immediate effect on increasing the win percentage.
Baseball teams seem to have a worse record in the year immediately after signing a big free agent. None of the baseball players brought a championship to their new team within 3 years.
This shows that it is worth trying to sign big name basketball free agents. However, baseball free agents don't seem to be worth the big bucks in terms of immediately winning more games.
One could argue that it makes economic sense to sign big name free agents (increased ticket sales, marketing, etc.) even if they don't increase winning percentage.

Finally, I wanted to look at the winning percentage of the teams that lose a free agent (i.e., the Cavaliers before and after LeBron).

Again, we see that basketball stars have a huge effect on their team. Every team had a much lower winning percentage after losing a key free agent. But the teams seem to recover within a few years, presumably because they are able to rebuild quickly through the draft.
Baseball teams do not suffer the same losses after losing a key free agent. In fact, some teams have a huge increase in win percentage immediately after losing the player. This includes the 2005 Chicago White Sox, who won the World Series the year after Magglio Ordonez left for the Tigers.
The effects of losing a key player may have a longer-lasting impact, with win percentage decreasing 2 and 3 years after losing a free agent.

Overall conclusion: Baseball players don't have as large an impact on win percentage as basketball players. Thus, they probably don't deserve the huge contracts they are earning.

One caveat about this analysis is that the sample size is very small. I have limited the analysis to players who signed huge free agent contracts with another team, and excluded players who were traded. I would love to add more players to this analysis, so leave a comment if you can think of a player who you would like to see included.

Saturday, June 30, 2012

Wimbledon Picks 2012 (Explanation)

With the first week of Wimbledon now a thing of the past, I figured I should actually give my reasoning for my Wimbledon picks.

First, I went with my gut (over data because I didn't have the time) when picking the men's and women's sleepers - players outside the top 10 to go the furthest. I picked John Isner and Venus Williams, who both lost in the first round. Guess I shouldn't listen to my gut feeling when making picks anymore.

So let's get to the picks that I actually put some thought into. I only looked at one stat when choosing my women's champion. Here are the results of the Williams sisters (both singles and doubles) at Wimbledon since 2000.

**Williams Sisters Wimbledon Results**
Year	Doubles Result	Venus Singles	Serena Singles
2000	WON	WON	SF
2001	3RD	WON	QF
2002	WON	RUP	WON
2003	3RD	RUP	WON
2004	-	2RD	RUP
2005	-	WON	3RD
2006	-	3RD	-
2007	2RD	WON	QF
2008	WON	WON	RUP
2009	WON	RUP	WON
2010	QF	QF	WON
2011	-	4RD	4RD

Since 2000, every year that the Williams sisters have played doubles at Wimbledon (highlighted), one of them also wins the singles title (8/8, with both Williams reaching the Final 4 times). Notice that it doesn't matter how they perform in the doubles, as long as they are playing. The 4 years that the Williams did not play doubles, often due to one sister being injured, only once did one of them win the singles title (Venus in 2005).

Serena and Venus entered to play doubles this year (they are currently in the 2nd round). Looking at past results, the probability that a Williams wins the singles title given they play doubles is 1. So all I had to do was choose between Serena or Venus winning. I went with Serena because she is the higher ranked and Venus is coming back from injuries. As of this post, Serena has won her first 3 singles matches and is still alive.
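
The "probability of 1" above is just a conditional frequency read off the table. Here is a minimal sketch that recomputes it, with the rows copied from the table and variable names of my own.

```python
# (year, doubles result, Venus singles, Serena singles), copied from the table above.
# "-" in the doubles column means the sisters did not play doubles that year.
results = [
    (2000, "WON", "WON", "SF"),
    (2001, "3RD", "WON", "QF"),
    (2002, "WON", "RUP", "WON"),
    (2003, "3RD", "RUP", "WON"),
    (2004, "-", "2RD", "RUP"),
    (2005, "-", "WON", "3RD"),
    (2006, "-", "3RD", "-"),
    (2007, "2RD", "WON", "QF"),
    (2008, "WON", "WON", "RUP"),
    (2009, "WON", "RUP", "WON"),
    (2010, "QF", "QF", "WON"),
    (2011, "-", "4RD", "4RD"),
]


def title_rate(rows):
    """Share of years in which either sister won the singles title."""
    return sum("WON" in (venus, serena) for _, _, venus, serena in rows) / len(rows)


played = [r for r in results if r[1] != "-"]
skipped = [r for r in results if r[1] == "-"]

print(f"Played doubles: {len(played)} years, singles title rate {title_rate(played):.2f}")
print(f"No doubles: {len(skipped)} years, singles title rate {title_rate(skipped):.2f}")
```

Keep in mind that this is an observed frequency over only 8 years, so "probability 1" describes the past results and not a guarantee.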

For the men's, I looked at the winner's seed and previous year's performance since 2003 (Federer's first Wimbledon win).

**Men's Wimbledon Champions**
Year	Winner	Seed	Previous Result
2003	Roger Federer	4	1st
2004	Roger Federer	1	WON
2005	Roger Federer	1	WON
2006	Roger Federer	1	WON
2007	Roger Federer	1	WON
2008	Rafael Nadal	2	RUP
2009	Roger Federer	2*	RUP
2010	Rafael Nadal	2	-
2011	Novak Djokovic	2	SF

* Federer was the 2 seed, but he was technically the top seed after Nadal withdrew as the number 1 seed with injury before the 2009 tournament began.

Because Federer, Nadal and Djokovic have won 28 of the last 29 major titles, I have eliminated the rest of the field. I eliminated Federer because it has been one of the top 2 seeds to win every year since 2004. I went with Djokovic over Nadal because the defending champion has won 6 of the last 9 years (if I consider Federer the defending champ in 2009 since Nadal withdrew, and Nadal the defending champ in 2010 since he had not defended his title**). Also, Federer and Nadal each defended their first Wimbledon title by winning the following year that they played. This gives me belief that Djokovic will continue this trend and win his second straight Wimbledon title.

My men's pick is looking even better after Nadal lost in the second round. (Still, I'll be cheering for Roger Federer to win the title as he is my favorite player.)

** Under this argument, both Federer and Nadal would have been considered defending champions in 2010, so you could change the fraction to 6/10 defending champs winning.

Sunday, June 24, 2012

MIT Sloan Sports Analytics Conference

I came across a great article today (via Simply Statistics) describing the largest sports analytics (i.e. statistics) conference in the world, held at MIT. A few points that I found most interesting:

The guy that they highlight (Kirk Goldsberry) created the plot of the week last week.
They mentioned that they had a panel on tennis analytics for the first time this year, including Pete Sampras and Roger Federer's coach Paul Annacone and former player Todd Martin. Two great quotes:

On why tennis analytics is lacking: "There's no shared service," says Martin. "This isn't a team sport with a $500,000 budget for analytics."

"[Analytics] should be a strength of our game. Tennis is a game of patterns." - Craig O'Shannessy

This conference could be worth a vacation next year. Might depend on whether I have one or two great blog posts to report on.

Plot of the Week 8

I've been dreaming up some cool plots to show off on my blog, but as I've mentioned before, things are a little crazy for me right now. So this week I am borrowing another neat visualization that I found on the web.

http://thepowerrank.com/visual/NCAA_Tournament_Predictions

This is an interactive plot: put your mouse over your favorite team to see their probability of winning it all (based on this group's predictive model). Or put your mouse over a specific game to see the probability that each team will win that game. This visualization relies on Java, so if it's not working for you, blame your web browser!

Saturday, June 23, 2012

Wimbledon 2012 predictions

Wimbledon starts on Monday and I wanted to announce my picks (I was 2/2 for the French Open).

I am going to follow ESPN's format for their picks (which you can see here):

Men's winner: Novak Djokovic

Men's sleeper (someone outside of the top 10 to go the furthest): John Isner

Men's toughest road (top 10 player to lose earliest): Janko Tipsarevic

Women's winner: Serena Williams

Women's sleeper: Venus Williams

Women's toughest road: Samantha Stosur

Things are very busy for me, so I do not have the time to justify my picks right now - but I hope to later in the week.

Tuesday, June 19, 2012

Plot of the Week 7

Sorry for the delay in posting. I started my new postdoc position yesterday and am still trying to get settled in. Here is an interesting chart showing the NBA players with the lowest shooting percentage in different areas of the court. Enjoy!

http://courtvisionanalytics.com/behold-the-worst-shooters-of-2012/

Friday, June 15, 2012

Pissing off an NFL player

Chicago Bears cornerback Charles Tillman does not like your pro-Packers math!

None of my homework or exam questions ever elicited this type of response.
Question on her next hw assignment: If 100 NFL players attempted this stat problem, how many would answer correctly?

Monday, June 11, 2012

French Open Picks: Follow Up

A few weeks ago, I made my picks for the 2012 French Open champions (Part 1 and Part 2). Using my statistical wizardry (ok, more like logic and process of elimination), I correctly picked Rafael Nadal as the men's champion and Maria Sharapova as the women's.

Compare this to the ESPN expert picks: 11 of the 12 correctly picked Nadal, but only Chris Evert picked Sharapova. Maybe if I keep this up for Wimbledon and the US Open, ESPN will start knocking on my door.

Sunday, June 10, 2012

Plot of the Week 6

Yesterday, Union Rags won the 2012 Belmont Stakes. For this week's edition of Plot of the Week, I wanted to look at the winning times at the Belmont Stakes (since 1926, when the current distance of 1.5 miles was established).

The red circles designate that the betting favorite won, and the blue +'s show the horses that won the Triple Crown in that year. The green and magenta lines show the average and median winning time, respectively.

Some interesting takeaways:

The record time is 144 seconds, run by Secretariat in 1973. No horse has gotten within 2 seconds of this winning time.
There is an overall downward trend in time, although winning times seem to have leveled off in the past few decades.
There were 2 horses that won the Triple Crown but were not favorites at the Belmont Stakes.
The past 3 Belmont Stakes winners have had slower than average times.
The betting favorite has not fared well lately - only 7 of the favorites have won since 1980.

Now let's look at the betting odds of the winning horse.

Obviously, all of the favorites that won had low betting odds.
The 2 horses that won the Triple Crown but were not the betting favorites still had very low betting odds.
There have been many more long-shot winners (> 20:1 odds) in the past 15 years (5) than all previous years (3). I don't know if this has to do with the format of the race (more horses allowed to run) or there is greater disparity in horse racing today.

Did I miss anything?

Friday, June 8, 2012

Check My Stats

Anthony Davis rocking a "Check My Stats" t-shirt. Image borrowed from following link.

Robinson not buying Davis hype

1. Where can I get this shirt?

2. I'm thinking that if Robinson wears a shirt reading "Numbers Don't Lie", another NBA prospect should wear one saying "Torture Numbers and They'll Confess to Anything", a quote from Gregg Easterbrook. Maybe someone who had an underwhelming college season - Harrison Barnes?

Tuesday, May 29, 2012

Djokovic and Federer: Semifinal Attraction

For the past 20 grand slam tournaments (5 years, including this French Open), Federer, Nadal and Djokovic have been ranked/seeded in the top 3. Nadal has been either the #1 or #2 seed in all of these tournaments, meaning that either Federer or Djokovic has been the #3 seed. A majority of the time, Murray has been the #4 seed, but there are some exceptions (Soderling, del Potro, Ferrer, Davydenko, Roddick).

Nadal Getting Lucky with Grand Slam Draws?

In tennis, the placing of seeded players in the draw includes some randomness. That is, half the time the #1 seed will draw the #3 seed in the semi-finals, and the other half of the time will draw the #4 seed. This is different from basketball and other tournaments, where the semis will always consist of #1 vs #4 and #2 vs #3 (barring upsets). This means that Federer and Djokovic would expect to be in the same half of the draw (i.e., scheduled to meet in the semis) in 10 of these 20 Grand Slams. In actuality, Federer and Djokovic have been placed in the same half of the draw 15 times (actually playing 7 times in the semis, with Djokovic leading 4-3). This means that Nadal has been scheduled to play the #4 seed in 15 of the last 20 Grand Slams, giving him the easier path to the final. In fact, 5 of the 7 majors won by Nadal in this time span were won when Federer and Djokovic were both on the other side of the draw (only defeating both players in the 2008 French Open). Does this supply evidence that the draw is rigged in favor of Nadal, thus increasing his chances of winning?

This problem can be rephrased in terms of flipping a coin. If we flip a coin 20 times, how likely is it that we see exactly 15 heads (i.e., Fed and Djokovic are on the same half of the draw 15 times)? Coin flips are independent, so we can easily calculate the probability of exactly 15 heads to be 0.0148, or about a 1.5% chance. This is very unlikely, but not completely unexpected.

To better illustrate the point, I ran 10,000 simulations where I flip a coin 20 times and record the number of heads, shown in the following graph. These simulations show that we expect to see at least 15 heads about 2% of the time (proportion to the right of the red line). This matches the theoretical probability of at least 15 heads, about 2.1%, and is slightly higher than the 1.5% chance of exactly 15 heads.
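The coin-flip calculation and the simulation can be reproduced in a few lines. This is a minimal sketch, assuming a fair coin and independent draws; the function names, the seed and the number of trials are my own choices for illustration and are not the code behind the plot.

```python
import random
from math import comb

N_FLIPS = 20     # Grand Slams in the 5-year window
THRESHOLD = 15   # times Federer and Djokovic shared a half of the draw


def prob_exactly(k, n=N_FLIPS):
    """Exact P(exactly k heads in n fair flips)."""
    return comb(n, k) / 2 ** n


def prob_at_least(k, n=N_FLIPS):
    """Exact P(at least k heads in n fair flips)."""
    return sum(prob_exactly(j, n) for j in range(k, n + 1))


def simulate_at_least(k, n=N_FLIPS, trials=10_000, seed=2012):
    """Monte Carlo estimate of P(at least k heads in n fair flips)."""
    rng = random.Random(seed)  # arbitrary fixed seed, only for repeatability
    hits = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        hits += heads >= k
    return hits / trials


if __name__ == "__main__":
    print(f"P(exactly 15)  = {prob_exactly(THRESHOLD):.4f}")
    print(f"P(at least 15) = {prob_at_least(THRESHOLD):.4f}")
    print(f"simulated      = {simulate_at_least(THRESHOLD):.4f}")
```

The exact values are about 0.0148 for exactly 15 heads and about 0.0207 for at least 15 heads. The simulated proportion estimates the second quantity and will wobble around it from run to run.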

Nice to See You Again...

Another interesting pattern is that Federer and Djokovic were placed in the same half of the draw for 7 straight Grand Slams (2008 Wimbledon - 2010 Australian Open). How likely is this occurrence? That is, how likely is it to see 7 heads in a row when flipping a coin 20 times? The theoretical calculation is more complicated, so I again ran 10,000 simulations of 20 flips and recorded the maximum number of heads to appear in a row.

The probability of seeing a run of 7 heads (i.e., 7 heads in a row) or more is 0.1112, or about 11%. This means that if you flip a coin 20 times, you will see a run of at least 7 heads 11% of the time, which is probably much more common than you might expect.

There was another stretch of 6 grand slams where Fed and Djokovic were again on the same side of the draw (2010 Wimbledon - 2011 US Open; combined with the previous run of 7, Federer and Djokovic were on the same half of the draw 13 out of 14 grand slams!). What's the probability that the second longest run when flipping a coin 20 times is at least 6? It's about a 1% chance, which is much less common than seeing a run of 7.
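For anyone who wants to check the run calculations, here is a rough sketch. It assumes a fair coin; the function names, the seed and the choice to track runs of heads are mine, so treat it as one possible reading of the simulation rather than the code I actually ran. The longest-run probability does not need simulation: it can be computed exactly by tracking the length of the current run.

```python
import random


def prob_head_run(n, k):
    """Exact P(some run of at least k heads occurs in n fair flips)."""
    # state[j] = P(no run of k heads so far AND the current run has length j)
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n):
        nxt = [0.0] * k
        for j, p in enumerate(state):
            nxt[0] += p / 2           # tails: the current run resets
            if j + 1 < k:
                nxt[j + 1] += p / 2   # heads: the run grows but stays below k
            # otherwise the run reaches k and that probability leaves the table
        state = nxt
    return 1.0 - sum(state)


def prob_run_either_face(n, k):
    """Exact P(some run of at least k identical flips), for k >= 2."""
    # k equal flips in a row <=> k - 1 consecutive "same as previous" steps,
    # and each of the n - 1 steps is "same" with probability 1/2.
    return prob_head_run(n - 1, k - 1)


def head_runs(flips):
    """Lengths of the maximal runs of heads (truthy values) in a sequence."""
    runs, current = [], 0
    for f in flips:
        if f:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def simulate_second_longest(n=20, k=6, trials=10_000, seed=2012):
    """Monte Carlo estimate of P(second-longest run of heads >= k)."""
    rng = random.Random(seed)  # arbitrary fixed seed, only for repeatability
    hits = 0
    for _ in range(trials):
        flips = [rng.random() < 0.5 for _ in range(n)]
        runs = sorted(head_runs(flips), reverse=True)
        hits += len(runs) >= 2 and runs[1] >= k
    return hits / trials
```

With 20 flips, `prob_head_run(20, 7)` comes out to about 0.058 and `prob_run_either_face(20, 7)` to about 0.115. The simulated 11% quoted above is closer to the second number, so it may correspond to a run of either face (same half or opposite halves) rather than heads only; that is my guess, not something I have verified against the original simulation.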

Will it Ever End?

While a run of 7 grand slams in a row is not very uncommon, having a second run of 6 or being placed on the same side of the draw 15 out of 20 tournaments is not very likely. I do not actually believe that the draw is rigged, but the frequency of Fed and Djokovic meeting is only expected to occur about 1% of the time.

The Law of Large Numbers tells us that, eventually, we expect Djokovic and Federer to be on opposite sides of the draw 50% of the time. However, as neither player will be playing infinitely many more grand slams, this does not impact the probability at the next Grand Slam. That is, assuming the rankings stay the same, there is still a 50% chance that Federer and Djokovic will be placed on the same half of the draw at Wimbledon, the next Grand Slam tournament.

Sunday, May 27, 2012

Plot of the Week 5

Last week, David Epstein (from Sports Illustrated) reported that, contrary to popular belief, recent studies have shown that NFL players actually have a longer life expectancy than non-NFL players. You can find his original article here. The printed SI article included some graphics, but I thought that the data could be visualized better. So I am presenting the same data as in the article but in a more effective way.

The first plot shows that fewer NFL players have died than would be expected in the general US male population (*when looking at men with similar age and race to the NFL players in the study), 334 vs. 625, a 47% decrease. The different colors represent different death causes.

The second plot looks at the expected and actual death rates for 3 special death categories: suicide, heart disease and cancer. In each of these categories, NFL players have a lower death rate than the US male population (including suicide, 9 vs 22, a 59% decrease!!!).

The original academic paper that reported these numbers and results can be found here. I only glanced through the paper for a minute, but the statistical methods seemed reasonably sound to me. Even though this data was collected in 2007, these numbers seem to suggest that anecdotal evidence (i.e., the media) is responsible for exaggerating the link between playing in the NFL and suicide.

Friday, May 25, 2012

French Open Bracket Challenge

The French Open draws came out today. For the past few years, I have been filling out grand slam brackets online with my family for bragging rights. Normally we all make awful picks, but one of my parents manages to crush me. This year, I thought that I would open up our family pool to all of my blog followers. To enter your brackets, follow the links, then create an account and you're ready to go! Note that the entries must be received by Sunday May 27 at 5 AM.

Men's bracket:
http://www.tourneytopia.com/RacquetBracketFrenchOpenWTA/nosweatstats/Default.aspx

Women's bracket:
http://www.tourneytopia.com/RacquetBracketFrenchOpenWTA/nosweatstats2/Default.aspx

If you are having any difficulty, leave a comment and I will try to help you figure it out. You can also enter the main pool, which will allow you to win some prizes if you finish first. Winners will get a shout-out in a future post.

Don't forget to check out my picks for the winners.

Sunday, May 20, 2012

Plot of the Week 4

Here are 2 neat plots from Skeptical Sports Analysis. Follow the link to read the whole article for an in-depth comparison of LeBron and Larry Bird. The short story is that LeBron had his best year in 2008 in Cleveland, and Larry Bird never had a losing season.

Saturday, May 19, 2012

College Football Playoff

It is looking more likely by the day that there will soon be a 4-team playoff to determine the national champion for college football. One of the details that is still being determined is how the 4 playoff teams will be selected, described in this SI article. There are 2 interesting issues raised by this article that I want to address.

1. Computer programs are currently used in the BCS system to determine which 2 teams will play for the national title. The problem is that "computer programmers ... refuse to reveal the formulas that determine their rankings". One of the current hot topics in statistics is reproducibility of research - when you get your work published in a journal, you need to describe your methods so that anybody else reading the article can reproduce the results. If you don't reveal your method, then other researchers are not able to accurately evaluate your work and your conclusions become suspect. How can we trust in a ranking where we don't know the input variables and model? This is especially relevant in this context, as there is no way to evaluate the rankings (i.e., there is no "true" ranking that we can compare model performance with).
If the authors are worried about others stealing their work, all they have to do is file for a patent/copyright. If that is not the issue, I am left to believe that they are worried that others will improve upon their model and obtain "better" results. That doesn't instill much confidence. And because no one knows the model, how can we be sure that the model isn't tweaked each week for someone's favorite team to move up the rankings? (I guess the BCS is overseeing things to make sure that this isn't the case, but who knows).

2. The article supports evaluating and selecting playoff teams only after the entire regular season is finished, rather than ranking teams after each week: "Because committee members ... would evaluate the entire body of work, schools will be more apt to schedule quality out-of-conference opponents." What a novel idea ... NOT! One of the major rules in designing experiments is that you cannot modify your experiment half way through because you do not like the preliminary results (exception: cancelling a clinical trial resulting in many deaths). You have to wait until you have all of the data before testing a hypothesis. Games are played to determine the best team on the field, so let's collect all of the results before trying to determine the final rankings. Otherwise, the preseason polls are very likely to be the tiebreaker between evenly matched teams, which has absolutely nothing to do with performance on the field.
For example, imagine that all of the preseason top 5 teams go undefeated. After each week, none of the teams have a reason to slide down the rankings because they all won. Similarly, it is tough for the team ranked #5 to jump ahead of the other teams that also did not lose. From week to week, voters tend to assume that the previous ranking is truth, and so need exceptional evidence to move one team above another. This also requires the voters to admit that their previous ranking was incorrect, and who wants to admit they are wrong?

With this post, I am officially throwing my hat into the ring of choosing the 4 college football playoff teams. I promise to wait until the end of the season, when all of the data has been collected, to make my selections (even if this means not watching ESPN during the fall when they update their playoff teams every hour). If I use a model to select the teams, I will be fully transparent, allowing everybody access to the methods I used, so that my results can be replicated. Because some may not agree with choices related to my model, I would also be prepared to defend my model by evaluating its performance using previous years' data or using some other metric. Finally, because I am a new PhD graduate, I would be willing to accept a discounted salary compared to other BCS executives (I'm thinking about $100k would be fair for the month I would be working each year).

Thursday, May 17, 2012

French Open 2012 - Part 2

In today's post, I am continuing my picks for the French Open champions. You can read the first part here.

First, let's look at each champion's best previous performance at the French Open:

**French Open Champions**
| Previous Best FO Result | # of Men's Champions | # of Women's Champions |
| --- | --- | --- |
| Winner | 7 | 3 |
| Runner-up | 2 | 3 |
| Semifinals | 0 | 2 |
| Quarterfinals | 1 (Costa) | 2 |
| 4R | 1 (Gaudio) | 1 (Li Na) |
| 2R | 0 | 1 (Myskina) |
| Did Not Play | 1* | 0 |

*Rafael Nadal won the first French Open he played, in 2005.

Of the men that I have not yet eliminated, Federer and Nadal have both won the French Open, Murray made the semifinals, and Tsonga has never been past the 4th round. Because all of the champions since 2000 (except Gaudio, and Nadal in his 2005 debut) had previously been to at least the quarterfinals, I am eliminating Tsonga. This leaves me with:

~~Novak Djokovic~~ (#1 seed but not defending champ, AO Winner)
Roger Federer
Rafael Nadal
Andy Murray
~~Jo-Wilfried Tsonga~~ (never previously to QF)

For the women, 10 of the past 12 winners had previously been to the quarterfinals. Using this as a guideline, I am eliminating Radwanska and Kvitova (neither has been past the 4th round). Note that Li Na had never been to the quarterfinals until she won the French Open last year. This leaves:

~~Victoria Azarenka~~ (#1 seed but not defending champ)
Maria Sharapova
~~Agnieszka Radwanska~~ (never previously to QF)
~~Petra Kvitova~~ (never previously to QF)
~~Samantha Stosur~~ (1R)
Serena Williams
~~Marion Bartoli~~ (3R)
Caroline Wozniacki
Li Na
~~Vera Zvonareva~~ (3R)

Next, let's look at the number of clay court warm-up tournaments won by the eventual champion in his/her winning year. The bar plots below show the number of warm-up tournaments won by men (top) and women (bottom).

The 2 men who did not win a warm-up tournament were both ranked outside of the top 20 when they won (Costa, Gaudio). Because I am only considering men ranked in the top 5, I will require that my pick wins at least one warm-up tournament. Federer and Nadal have both won clay court warm-ups this year, but Murray did not (he just lost in the final warm-up of the year). This leaves me to choose between Federer and Nadal.

For the women, 4 of the 12 did not win warm-ups. However, 3 of these 4 were finalists at the Australian Open (Henin, Ivanovic, Li) in the same year. So I will require that my pick either has won a warm-up or was a finalist at this year's Australian Open (Sharapova). This eliminates Wozniacki and Li (assuming she does not win this week), leaving Sharapova (AO finalist) and Williams (2 warm-up wins).

Now it is time to pick my favorite to win this year's French Open.

Federer vs. Nadal: Nadal has only lost once at the French Open, winning 6 of the last 7 championships. Federer is on a hot streak of his own, going 47-3 since last year's US Open and winning 7 of his last 10 tournaments. Federer has never beaten Nadal at the French Open, winning his only championship when Nadal was eliminated by Soderling. Because Federer has not won a grand slam in over 2 years and Nadal has only lost once at the French Open, I am going to pick Rafael Nadal.

Sharapova vs. Serena Williams: Serena has yet to lose a clay court match this year, winning 2 tournaments and still playing in another. Given that she also won the French Open in 2002, I am assuming that she will be the overwhelming favorite on the women's side this year. However, Serena was in a similar position going into last year's US Open, then lost in the Finals to Stosur, a player who had never won a grand slam and whom Serena had beaten a few weeks previously. Sharapova is the number 2 player in the world, reaching the finals of Wimbledon and the Australian Open. However, in both finals she lost to players who had never been in a grand slam final before. So it's safe to say that both players have a lot on the line at the French Open. Serena has had some bad losses in the past at the French Open (although she was out with injury last year), and last year Sharapova made the semifinals. Because of better results at the French Open in the past few years and the consistent grand slam performances in the past year, I am going with the upset and picking Maria Sharapova.

I realize that I made my conclusions using a lot of subjective opinion. If I have the time, I hope to build a model to predict this year's champions and compare to the picks that I just made.

Tuesday, May 15, 2012

French Open 2012 - Part 1

It's already the middle of May, which means that the French Open is right around the corner. In this week's posts, I will make my predictions for the men's and women's 2012 French Open Champions. I will be looking at the results since 2000 to determine the player who I think is most likely to win. Let's start by looking at the seed for the winner:

**French Open Champions**
| Year | Men's Champion | Seed | Women's Champion | Seed |
| --- | --- | --- | --- | --- |
| 2000 | Gustavo Kuerten | 5 | Mary Pierce | 6 |
| 2001 | Gustavo Kuerten | 1 | Jennifer Capriati | 3 |
| 2002 | Albert Costa | 22 | Serena Williams | 3 |
| 2003 | Juan Carlos Ferrero | 3 | Justine Henin | 4 |
| 2004 | Gaston Gaudio | 44 | Anastasia Myskina | 6 |
| 2005 | Rafael Nadal | 5 | Justine Henin | 10 |
| 2006 | Rafael Nadal | 2 | Justine Henin | 5 |
| 2007 | Rafael Nadal | 2 | Justine Henin | 1 |
| 2008 | Rafael Nadal | 2 | Ana Ivanovic | 2 |
| 2009 | Roger Federer | 2 | Svetlana Kuznetsova | 7 |
| 2010 | Rafael Nadal | 2 | Francesca Schiavone | 17 |
| 2011 | Rafael Nadal | 1 | Li Na | 6 |

10 of the past 12 men's champions have been seeded in the top 5. Using this as our cutoff, this leaves the following players as possible winners this year (by current world ranking):

Novak Djokovic
Roger Federer
Rafael Nadal
Andy Murray
Jo-Wilfried Tsonga

Djokovic, Federer, and Nadal have won 27 of the past 28 Grand Slam tournaments (del Potro won the 2009 US Open), but we won't eliminate Murray and Tsonga quite yet.

11 of the past 12 women's champions were seeded in the top 10. Using the current world rankings, this leaves the following players:

Victoria Azarenka
Maria Sharapova
Agnieszka Radwanska
Petra Kvitova
Samantha Stosur
Serena Williams
Marion Bartoli
Caroline Wozniacki
Li Na
Vera Zvonareva

Also note that Kuerten, Nadal, and Henin are the only players since 2000 to have won the French Open as the number 1 seed. Each time, they were the defending champion. Since it looks like Djokovic and Azarenka will receive the number 1 seeds and neither won the French Open last year, we will eliminate them from our list.
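The seed counts quoted in this post can be checked directly against the table. Here is a small sketch with the seeds copied from the table above; the variable and function names are just for illustration.

```python
# Seeds of the French Open champions, 2000-2011, in table order.
YEARS = list(range(2000, 2012))
MEN_SEEDS = [5, 1, 22, 3, 44, 5, 2, 2, 2, 2, 2, 1]
WOMEN_SEEDS = [6, 3, 3, 4, 6, 10, 5, 1, 2, 7, 17, 6]


def count_within(seeds, cutoff):
    """Number of champions seeded at or above the cutoff (1 is the top seed)."""
    return sum(seed <= cutoff for seed in seeds)


def years_won_as_top_seed(seeds):
    """Years in which the champion was the number 1 seed."""
    return [year for year, seed in zip(YEARS, seeds) if seed == 1]


if __name__ == "__main__":
    print("Men in top 5:   ", count_within(MEN_SEEDS, 5), "of", len(MEN_SEEDS))
    print("Women in top 10:", count_within(WOMEN_SEEDS, 10), "of", len(WOMEN_SEEDS))
    print("Men's #1 seeds:  ", years_won_as_top_seed(MEN_SEEDS))
    print("Women's #1 seeds:", years_won_as_top_seed(WOMEN_SEEDS))
```

Changing the cutoff is an easy way to see how sensitive the candidate list is to where the line is drawn.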

Now let's look at each champion's performance in that year's Australian Open (the most recent Grand Slam).

**French Open Champions**
| Year | Men's Champion | AO Result | Women's Champion | AO Result |
| --- | --- | --- | --- | --- |
| 2000 | Gustavo Kuerten | 1R | Mary Pierce | 4R |
| 2001 | Gustavo Kuerten | 2R | Jennifer Capriati | Win |
| 2002 | Albert Costa | 4R | Serena Williams | DNP |
| 2003 | Juan Carlos Ferrero | QF | Justine Henin | SF |
| 2004 | Gaston Gaudio | 2R | Anastasia Myskina | QF |
| 2005 | Rafael Nadal | 4R | Justine Henin | DNP |
| 2006 | Rafael Nadal | DNP | Justine Henin | F |
| 2007 | Rafael Nadal | QF | Justine Henin | DNP |
| 2008 | Rafael Nadal | SF | Ana Ivanovic | F |
| 2009 | Roger Federer | F | Svetlana Kuznetsova | QF |
| 2010 | Rafael Nadal | QF | Francesca Schiavone | 4R |
| 2011 | Rafael Nadal | QF | Li Na | F |

None of the past 12 men's champions had won the Australian Open in the same year. Assuming this same pattern holds, this eliminates the 2012 Australian Open Champion, Novak Djokovic (who we had already eliminated). Our updated men's list looks like:

~~Novak Djokovic~~ (#1 seed but not defending champ, AO Winner)
Roger Federer
Rafael Nadal
Andy Murray
Jo-Wilfried Tsonga

Each of the past 12 women's champions either did not play (DNP, occurred 3 times in the past 10 years) or reached at least the 4th round if they did play. This leaves us with the following women:

~~Victoria Azarenka~~ (#1 seed but not defending champ)
Maria Sharapova
Agnieszka Radwanska
Petra Kvitova
~~Samantha Stosur~~ (1R)
Serena Williams
~~Marion Bartoli~~ (3R)
Caroline Wozniacki
Li Na
~~Vera Zvonareva~~ (3R)

The next post will look at how important previously winning a Grand Slam or winning a clay court warm-up tournament is.

Sunday, May 13, 2012

Plot of the Week 3

Tyler Zeller: Recent Graduate and 2012 ACC player of the year

This weekend I officially graduated and earned my PhD from UNC. In honor of graduation weekend, I want to look at an interesting plot of another new UNC graduate, Tyler Zeller. This shows the trend of Points Per Game (PPG) over all 4 of Zeller's seasons. Thanks to StatSheet.com for the interactive plot.

College Basketball Stats

A few notes:

Zeller was out most of his freshman year with an injury, thus the constant decline in PPG.
He improved in PPG (along with most of the other stats) each of his 4 seasons.
Compared to 2010-2011, Zeller averaged fewer points in the first half of this past season, but really picked up his scoring once the ACC season began in January.
I love that this is an interactive plot, highlighting the curves and showing specific values when you mouse over it. I need to learn to make some of these cool plots.
Congrats to Tyler Zeller, my friends from the UNC Department of Statistics and Operations Research that graduated with me, and all of the other UNC students who graduated this weekend! Go Heels!

Wednesday, May 9, 2012

Simplest NBA Playoff Model

The Simplest Playoff Model You’ll Never Beat

There's an ESPN competition where many well-respected sports statisticians try to predict the NBA playoffs. This article was written by the winner of last year's competition. He also goes into detail explaining the difficulties of predicting this year's playoffs.

Here is an excerpt where he explains a very simple playoff model:

So case in point, I came up with this 2-step method for picking NBA Champions:

If there are any teams within 5 games of the best record that have won a title within the past 5 years, pick the most recent.

Otherwise, pick the team with the best record.

Following this method, you would correctly pick the eventual NBA Champion in 64.3% of years since the league moved to a 16-team playoff in 1984 (I call this my “5-by-5” model).

Of course, thinking back, it seems like picking the winner is sometimes easy, as the league often has an obvious “best team” that is extremely unlikely to ever lose a 7 game series. So perhaps the better question to ask is: How much do you gain by including the championship test in step 1?

The answer is: a lot. Over the same period, the team with the league’s best record has won only 10/28 championships, or ~35%. So the 5-by-5 model almost doubles your hit rate.

Great example of how simple models can result in great prediction accuracy.
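To show just how simple the 2-step rule is, here is a rough sketch of it in code. This is my own reading of the excerpt, not the author's implementation: I am assuming "within 5 games" means the usual games-behind calculation, that "past 5 years" means the title season is at most 5 seasons back, and that ties are ignored. The `Team` class and every name in it are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Team:
    name: str
    wins: int
    losses: int
    last_title: Optional[int] = None  # season of the most recent title, if any


def games_behind(leader, team):
    """Standard games-behind figure of `team` relative to `leader`."""
    return ((leader.wins - team.wins) + (team.losses - leader.losses)) / 2


def pick_champion(teams, season, max_games_back=5, title_window=5):
    """Apply the 2-step '5-by-5' rule as described in the excerpt."""
    best = max(teams, key=lambda t: t.wins - t.losses)  # best record
    # Step 1: recent champions that are close enough to the best record.
    recent = [
        t for t in teams
        if t.last_title is not None
        and season - t.last_title <= title_window
        and games_behind(best, t) <= max_games_back
    ]
    if recent:
        return max(recent, key=lambda t: t.last_title)  # most recent title
    # Step 2: otherwise take the team with the best record.
    return best
```

As a usage example with invented records: if team A is 60-22 with no title, team B is 57-25 with a title two seasons ago, and team C is 50-32 with a title last season, the rule picks B, because C is too far behind the best record to qualify.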

Monday, May 7, 2012

Plot of the Week 2

Here is a plot showing the top offensive performances by year for MLB, divided into the American League (AL) and National League (NL), and NBA. For baseball, I plot the home run leaders for each year since 1947. For basketball, I plot the scoring title winner's points per game. Because the NBA is played over 2 calendar years, I plot the year that the season started in (i.e., Kevin Durant just won the scoring title for 2011-2012, which is plotted as 2011).

A few quick notes on this graph:

There is a lot of variability - good luck trying to predict the number of home runs the leader will have after this season.
1961 was a great year for individual performances:

Roger Maris hit 61 home runs
Wilt Chamberlain averaged 50.4 PPG

There is a clear rise in the number of home runs between 1995 and 2008-ish, representing the steroid age. There is no increase in scoring for the NBA during this time frame.

Now's the time to sound really smart and comment about what I may have missed.

Thursday, May 3, 2012

The Mystery Behind NFL Suicides

If you haven't heard, former NFL player Junior Seau was found dead in a possible suicide. This event has the media up in arms again about the link between concussion, depression and suicide in football players. I'm sure that it's only a matter of time before Congress steps in to voice their opinion*.

While pro athletes live glamorous lives while playing, many retired athletes struggle with depression after falling out of the limelight and other personal struggles (bankruptcy, family issues, not knowing what to do when retiring at age 30, etc), regardless of sport. I wanted to look at the data for suicide numbers between the NFL (a high contact sport) and other non-contact professional sports, hoping to find a higher rate among NFL players than other athletes.

However, for all of the press that the NFL and suicide are getting right now, I challenge you to find the number of former NFL players who have committed suicide - I can't find it! Here's what I did find:

Cricket is known to have the highest suicide rate in professional sports, with over 150 known cases in the 20th century. See the link here for a good explanation of possible reasons for this.
As of 2005, there have been 76 suicides of former MLB players.
All we have for NFL suicide numbers are anecdotal stories, which play towards the heart but not the mind of statisticians.

I realize that researchers have shown a link between brain injuries sustained playing in the NFL and depression, which has at times led to suicide. But without the data, we can't conclude that former NFL players are any more likely to commit suicide than other athletes, such as MLB players or cricketers. Extra credit to anyone who can help me find this data.

*Showing how effective their MLB steroid hearings were, Roger Clemens is currently in court for possibly lying about possibly taking steroids and Ryan Braun, who last year failed a drug test and got off on a technicality, is on the field making millions playing in the MLB.

Sunday, April 29, 2012

Plot of the Week 1

Today is the first installment of Plot of the Week. I am planning on posting a sports-related plot every week. Some weeks I will interpret the plot; other weeks I hope that you will help post what you find interesting about the plot.

This is a scatterplot of athletes' win percentage in Majors vs. win percentage at other non-Major tournaments.

I have plotted male golfers (Tiger Woods pre-scandal, current Tiger Woods, Phil Mickelson, Jack Nicklaus) and female golfers (Yani Tseng, Annika Sorenstam) in black and male tennis players (Pete Sampras, Roger Federer, Rafael Nadal, Novak Djokovic) and female tennis players (Venus and Serena Williams, Martina Navratilova, Chris Evert, Steffi Graf) in red.
Annika Sorenstam and Novak Djokovic are tough to read due to overplotting in the middle of the plot.
The solid line represents the 45-degree line (y = x) and the dashed line is the least squares regression line.

Here are the most interesting features in this graph (to me, at least):

Phil Mickelson does not win tournaments (Majors and non-Majors) as frequently as the other athletes.
Graf won much more often than anyone else - over 50% of non-Major tournaments and 40% of Major tournaments. From this view, it seems that Graf is clearly the best female tennis player, with Evert and Navratilova not too far behind.
Athletes close to the solid line win Majors at the same frequency as non-Majors. This seems to represent the set of athletes who are not intimidated by the pressure at the Majors.
Yani Tseng is the only athlete much farther above the line - she seems to over-perform at the Majors. It is still very early in her career, so I would not be surprised if she moves closer to the diagonal line as her career progresses.
The athletes below the line tend to under-perform at the Majors compared to other tournaments. But this set contains some of the best athletes ever, so this is not quite a fair assessment (Venus has 7 majors and Evert and Navratilova 18 each). Maybe it is more fair to say that they over-perform at non-Majors.
Federer and Nadal have very similar winning percentages. Since Federer is 5 years older than Nadal, it will be interesting to see if Nadal can keep this same pace for the next 5 years to catch up to Federer in terms of tournaments won.

Do you notice anything else interesting that I missed?

Wednesday, April 25, 2012

SFS Swimming Conspiracy? Part 3

This is the final part of my trilogy on referee bias against the St. Francis swim team at the state meet. You can read the first two parts here and here.

Is Greg celebrating another district win, or about to smash the trophy on someone's head after getting DQ'ed at states?

In the previous post, I proved that SFS being DQ'ed 5 out of 61 relays cannot be explained by random chance. So if this large number of DQ's cannot be explained by random chance, is there anything else besides referee bias that can explain this event? Here are some possibilities:

SFS swimmers are more prone to false starts. As mentioned previously, the 5 DQ's were all blamed on different swimmers. In fact, only 2 swimmers were on more than one DQ'ed relay (and they were only on 2 DQ'ed relays). Unless the murky water in the St. Francis pool is causing the swimmers' muscle fibers to twitch too quickly, I don't think the swimmers can explain the difference. But I will back this up with numbers in a minute.
While the swimmers have changed over the past 11 years, the coach, Keith Kennedy, has not. Is Keith to blame for these DQ's, or is he the muse of Lady (Bad) Luck? In my 4 years being coached by him, he never instructed us to false start. In fact, if anything, all of these DQ's have caused him to stress the importance of not pushing the starts. So I'm not buying this explanation either.

You may be thinking that I am too close to the situation to blame Keith or the swimmers. So let me finish with one last analysis that will put the "!" on this debate. In order to qualify for the state tournament, every relay must place high enough at the district tournament to advance. The OHSAA website also shows the Northwest Ohio District swimming results for 8 of the last 9 years (2006 is missing: apparently they are still waiting for one of the other districts to finish swimming before posting results). The coach and, for the most part, swimmers on the relay do not change from districts to states.

For the available data, 15 of the 529 non-SFS relay swims (2.8%) have been DQ'ed at the district meet. This is about twice as frequent as for the perennial contenders at the state meet (1.3%), which should be expected, as the quality of swimmers at districts is not as high as that of the swimmers and relays at states. Still, this shows that officials are DQ'ing more relays at the district meet than the state meet. SFS has not been DQ'ed at districts in the past 11 years for a total of 33 relays (even though I don't have the district results for all years, had a relay been DQ'ed at districts, it would not have qualified for the state meet, which I have data for). The probability that a team is not disqualified in 33 relays at the district meet, if all relays are swum independently, is

(1 - 0.028)³³ = 0.392

Let's do the same analysis, but using the DQ frequency of SFS at the state meet. That is, the probability that a team that is DQ'ed 8.2% of the time at states would go 33 straight relays without getting DQ'ed at districts is

(1 - 0.082)³³ = 0.059

In other words, it is not unusual for a given school to go 11 years without a DQ at districts (probability of about 40%). But schools that are DQ'ed as often as SFS at states would expect to be DQ'ed at least once at districts 94% of the time.
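Both probabilities come from the same one-line formula, so they are easy to recompute with other DQ rates. Here is a minimal sketch, assuming every relay swim is independent with a constant DQ rate; the function name is my own.

```python
def prob_no_dq(dq_rate, n_relays):
    """P(no DQ in n_relays independent swims, each DQ'ed with rate dq_rate)."""
    return (1.0 - dq_rate) ** n_relays


if __name__ == "__main__":
    n = 33  # SFS relays at districts over 11 years
    for label, rate in [("district rate (2.8%)", 0.028),
                        ("SFS state rate (8.2%)", 0.082)]:
        p_clean = prob_no_dq(rate, n)
        print(f"{label}: P(no DQ) = {p_clean:.3f}, "
              f"P(at least one DQ) = {1 - p_clean:.3f}")
```

The two results come out near 0.4 and 0.06, and the chance of at least one DQ is simply one minus each value.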

Now, one can argue that because the competition is weaker at districts than at states, the swimmers play the relay starts safe at districts but push them at states. But I would also expect this to hold for all of the perennial contenders, so this doesn't explain the difference in DQ's between SFS and other perennial contenders at the state meet. One could also argue that there is a referee bias for the perennial contenders at districts. Why? The officials may want the district to have a good showing at the state meet, so it would hurt the district to DQ one of the top relays even if there really was a false start. However, it is tough to quantify this bias in favor of SFS at the district meet.

This look at the district data should show that the large number of DQ's is most likely not fully explained by the coach or swimmers, as you would expect to see similar DQ patterns at the district meet. This three-part series conclusively shows that it is extremely unlikely that there is no referee bias against the SFS swim team at the state meet.

If you enjoyed this series, help out my ego and leave a comment or subscribe to follow my blog.