Tuesday, May 29, 2012

Djokovic and Federer: Semifinal Attraction

For the past 20 grand slam tournaments (5 years, including this French Open), Federer, Nadal and Djokovic have been ranked/seeded in the top 3.  Nadal has been either the #1 or #2 seed in all of these tournaments, meaning that either Federer or Djokovic has been the #3 seed.  A majority of the time, Murray has been the #4 seed, but there are some exceptions (Soderling, del Potro, Ferrer, Davydenko, Roddick).

Nadal Getting Lucky with Grand Slam Draws?

In tennis, the placing of seeded players in the draw includes some randomness.  That is, half the time the #1 seed will draw the #3 seed in the semi-finals, and the other half of the time will draw the #4 seed.  This is different from basketball and other tournaments, where the semis will always consist of #1 vs #4 and #2 vs #3 (barring upsets).  This means that Federer and Djokovic would expect to be in the same half of the draw (i.e., scheduled to meet in the semis) in 10 of these 20 Grand Slams.  In actuality, Federer and Djokovic have been placed in the same half of the draw 15 times (actually playing 7 times in the semis, with Djokovic leading 4-3).  This means that Nadal has been scheduled to play the #4 seed in 15 of the last 20 Grand Slams, giving him the easier path to the final.  In fact, 5 of the 7 majors won by Nadal in this time span were won when Federer and Djokovic were both on the other side of the draw (only defeating both players in the 2008 French Open).  Does this supply evidence that the draw is rigged in favor of Nadal, thus increasing his chances of winning?

This problem can be rephrased in terms of flipping a coin.  If we flip a coin 20 times, how likely is it that we see 15 heads (i.e., Fed and Djokovic are on the same half of the draw)?  Coin flips are independent, so we can easily calculate the probability of exactly 15 heads to be 0.0148, or about a 1.5% chance; the probability of at least 15 heads is 0.0207, or about 2.1%.  This is very unlikely, but not completely unexpected.
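As a minimal sketch of the calculation above, the binomial probabilities can be computed directly.  The function names here are my own, and the fair coin (p = 0.5) is the same assumption made in the text.

```python
from math import comb

def prob_exactly(k, n=20, p=0.5):
    """Binomial probability of exactly k heads in n independent flips."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n=20, p=0.5):
    """Probability of k or more heads in n independent flips."""
    return sum(prob_exactly(j, n, p) for j in range(k, n + 1))

print(round(prob_exactly(15), 4))   # 0.0148
print(round(prob_at_least(15), 4))  # 0.0207
```

The "exactly 15" value is the 1.5% quoted above; the "at least 15" value is the tail probability that a simulation counting 15 or more heads would estimate.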

To better illustrate the point, I ran 10,000 simulations where I flip a coin 20 times and record the number of heads, shown in the following graph.  These simulations show that we expect to see at least 15 heads about 2% of the time (proportion to the right of the red line), which agrees with the theoretical probability of at least 15 heads (0.0207) and is close to the 1.5% chance of exactly 15.
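A simulation along these lines could look like the sketch below.  This is not the original code behind the graph; the function name and the seed are assumptions chosen only to make the run repeatable.

```python
import random

def simulate_head_counts(n_sims=10_000, n_flips=20, seed=2012):
    """Return the number of heads in each of n_sims runs of n_flips fair flips."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    return [sum(rng.random() < 0.5 for _ in range(n_flips))
            for _ in range(n_sims)]

counts = simulate_head_counts()
share = sum(c >= 15 for c in counts) / len(counts)
print(f"share of simulations with at least 15 heads: {share:.4f}")
```

With 10,000 simulations the estimate should land near 2%, give or take a few tenths of a percentage point from run to run.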

Nice to See You Again...

Another interesting pattern is that Federer and Djokovic were placed in the same half of the draw for 7 straight Grand Slams (2008 Wimbledon - 2010 Australian Open).  How likely is this occurrence?  That is, how likely is it to see 7 heads in a row when flipping a coin 20 times?  The theoretical calculation is more complicated, so I again ran 10,000 simulations of 20 flips and recorded the maximum number of heads to appear in a row.
The probability of seeing a run of 7 or more identical flips in a row is 0.1112, or about 11% (this counts a run of either heads or tails; for heads alone, the exact probability is about 0.058, or roughly 6%).  This means that if you flip a coin 20 times, you will see a run of at least 7 identical flips about 11% of the time, which is probably much more common than you might expect.
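The run probability does not strictly need simulation.  Here is a rough sketch of an exact count, assuming a fair coin; the function name and the either_face switch are mine, not part of the original analysis.  It counts the sequences that contain no long run and subtracts from 1.

```python
def prob_run_at_least(run_len, n_flips=20, either_face=False):
    """Exact P(longest run >= run_len) in n_flips fair flips (n_flips >= 1).

    Counts runs of heads only, unless either_face is True.
    """
    if run_len <= 0:
        return 1.0
    if either_face:
        # A run of run_len identical flips is run_len - 1 consecutive
        # "same as the previous flip" events, which are themselves fair flips.
        return prob_run_at_least(run_len - 1, n_flips - 1)
    # no_run[m] = number of length-m sequences with no run of run_len heads
    no_run = []
    for m in range(n_flips + 1):
        if m < run_len:
            no_run.append(2 ** m)
        else:
            # condition on the last tail, followed by 0..run_len-1 heads
            no_run.append(sum(no_run[m - k] for k in range(1, run_len + 1)))
    return 1 - no_run[n_flips] / 2 ** n_flips

print(round(prob_run_at_least(7), 3))                    # 0.058
print(round(prob_run_at_least(7, either_face=True), 3))  # 0.115
```

Under these assumptions, the simulated 0.1112 sits close to the either-face value, while a run of heads specifically is about half as likely.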

There was another stretch of 6 grand slams where Fed and Djokovic were again on the same side of the draw (2010 Wimbledon - 2011 US Open; combined with the previous run of 7, Federer and Djokovic were on the same half of the draw 13 out of 14 grand slams!).  What's the probability that the second longest run when flipping a coin 20 times is at least 6?  It's about a 1% chance, which is much less common than seeing a run of 7.
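For the second-longest run, simulation is the easier route.  The sketch below is one way to set it up, not the original code; the names, the seed, and the either_face option are assumptions.  Whether runs of tails are counted changes the answer noticeably, so both versions are printed.

```python
import random
from itertools import groupby

def run_lengths(flips, either_face=False):
    """Lengths of maximal runs, longest first (heads only unless either_face)."""
    runs = [len(list(group)) for face, group in groupby(flips)
            if either_face or face == "H"]
    return sorted(runs, reverse=True)

def share_second_run_at_least(k, n_flips=20, n_sims=10_000,
                              either_face=False, seed=2012):
    """Share of simulations whose second-longest run is at least k."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    hits = 0
    for _ in range(n_sims):
        flips = [rng.choice("HT") for _ in range(n_flips)]
        runs = run_lengths(flips, either_face)
        if len(runs) >= 2 and runs[1] >= k:
            hits += 1
    return hits / n_sims

print(share_second_run_at_least(6))                    # heads only
print(share_second_run_at_least(6, either_face=True))  # heads or tails
```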

Will it Ever End?

While a run of 7 grand slams in a row is not very uncommon, having a second run of 6 or being placed on the same side of the draw in 15 out of 20 tournaments is not very likely.  I do not actually believe that the draw is rigged, but a pattern like the one Fed and Djokovic have seen is only expected to occur about 1% to 2% of the time.

The Law of Large Numbers tells us that, eventually, we expect Djokovic and Federer to be on opposite sides of the draw 50% of the time.  However, as neither player will be playing infinitely many more grand slams, this does not impact the probability at the next Grand Slam.  That is, assuming the rankings stay the same, there is still a 50% chance that Federer and Djokovic will be placed on the same half of the draw at Wimbledon, the next Grand Slam tournament.

Sunday, May 27, 2012

Plot of the Week 5

Last week, David Epstein (from Sports Illustrated) reported that, contrary to popular belief, recent studies have shown that NFL players actually have a longer life expectancy than non-NFL players.  You can find his original article here.  The printed SI article included some graphics, but I thought that the data could be visualized better.  So I am presenting the same data as in the article but in a more effective way.

The first plot shows that fewer NFL players have died than would be expected in the general US male population (when looking at men of similar age and race to the NFL players in the study): 334 vs. 625, a 47% decrease.  The different colors represent different causes of death.

The second plot looks at the expected and actual death rates for 3 special death categories: suicide, heart disease and cancer.  In each of these categories, NFL players have a lower death rate than the US male population (including suicide, 9 vs 22, a 59% decrease!!!).
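The percentage decreases can be recomputed from the observed and expected counts quoted above.  A small sketch, with a helper name of my own:

```python
def percent_decrease(expected, observed):
    """Percentage by which observed falls short of expected."""
    return 100 * (expected - observed) / expected

print(round(percent_decrease(625, 334), 1))  # 46.6 (all deaths)
print(round(percent_decrease(22, 9), 1))     # 59.1 (suicides)
```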


The original academic paper that reported these numbers and results can be found here.  I only glanced through the paper for a minute, but the statistical methods seemed reasonably sound to me.  Even though this data was collected in 2007, these numbers seem to suggest that anecdotal evidence (i.e., the media) is responsible for exaggerating the link between playing in the NFL and suicide.

Friday, May 25, 2012

French Open Bracket Challenge

The French Open draws came out today.  For the past few years, I have been filling out grand slam brackets online with my family for bragging rights.  Normally we all make awful picks, but one of my parents manages to crush me.  This year, I thought that I would open up our family pool to all of my blog followers.  To enter your brackets, follow the links, then create an account and you're ready to go! Note that entries must be received by Sunday, May 27 at 5 AM.

Men's bracket:
http://www.tourneytopia.com/RacquetBracketFrenchOpenWTA/nosweatstats/Default.aspx

Women's bracket:
http://www.tourneytopia.com/RacquetBracketFrenchOpenWTA/nosweatstats2/Default.aspx

If you are having any difficulty, leave a comment and I will try to help you figure it out.  You can also enter the main pool, which will allow you to win some prizes if you finish first.  Winners will get a shout-out in a future post.

Don't forget to check out my picks for the winners.

Sunday, May 20, 2012

Plot of the Week 4

Here are 2 neat plots from Skeptical Sports Analysis.  Follow the link to read the whole article for an in-depth comparison of LeBron and Larry Bird.  The short story is that LeBron had his best year in 2008 in Cleveland, and Larry Bird never had a losing season.



Saturday, May 19, 2012

College Football Playoff


It is looking more likely by the day that there will soon be a 4-team playoff to determine the national champion for college football.  One of the details that is still being determined is how the 4 playoff teams will be selected, described in this SI article.  There are 2 interesting issues raised by this article that I want to address.

1. Computer programs are currently used in the BCS system to determine which 2 teams will play for the national title.  The problem is that "computer programmers ... refuse to reveal the formulas that determine their rankings".  One of the current hot topics in statistics is reproducibility of research - when you get your work published in a journal, you need to describe your methods so that anybody else reading the article can reproduce the results.  If you don't reveal your method, then other researchers are not able to accurately evaluate your work and your conclusions become suspect.  How can we trust in a ranking where we don't know the input variables and model?  This is especially relevant in this context, as there is no way to evaluate the rankings (i.e., there is no "true" ranking that we can compare model performance with).
   If the authors are worried about others stealing their work, all they have to do is file for a patent/copyright.  If that is not the issue, I am left to believe that they are worried that others will improve upon their model and obtain "better" results. That doesn't instill much confidence.  And because no one knows the model, how can we be sure that the model isn't tweaked each week for someone's favorite team to move up the rankings? (I guess the BCS is overseeing things to make sure that this isn't the case, but who knows).

2. The article supports evaluating and selecting playoff teams only after the entire regular season is finished, rather than ranking teams after each week: "Because committee members ... would evaluate the entire body of work, schools will be more apt to schedule quality out-of-conference opponents."  What a novel idea ...  NOT!  One of the major rules in designing experiments is that you cannot modify your experiment halfway through because you do not like the preliminary results (exception: cancelling a clinical trial resulting in many deaths).   You have to wait until you have all of the data before testing a hypothesis.  Games are played to determine the best team on the field, so let's collect all of the results before trying to determine the final rankings.  Otherwise, the preseason polls are very likely to be the tiebreaker between evenly matched teams, which has absolutely nothing to do with performance on the field.
   For example, imagine that all of the preseason top 5 teams go undefeated.  After each week, none of the teams has a reason to slide down the rankings because they all won.  Similarly, it is tough for the team ranked #5 to jump ahead of the other teams that also did not lose.  From week to week, voters tend to assume that the previous ranking is truth, and so need exceptional evidence to move one team above another.  This also requires the voters to admit that their previous ranking was incorrect, and who wants to admit they are wrong?

With this post, I am officially throwing my hat into the ring of choosing the 4 college football playoff teams.  I promise to wait until the end of the season, when all of the data has been collected, to make my selections (even if this means not watching ESPN during the fall when they update their playoff teams every hour).  If I use a model to select the teams, I will be fully transparent, allowing everybody access to the methods I used, so that my results can be replicated.  Because some may not agree with choices related to my model, I would also be prepared to defend my model by evaluating its performance using previous years' data or using some other metric.  Finally, because I am a new PhD graduate, I would be willing to accept a discounted salary compared to other BCS executives (I'm thinking about $100k would be fair for the month I would be working each year).

Thursday, May 17, 2012

French Open 2012 - Part 2

In today's post, I am continuing my picks for the French Open champions.  You can read the first part here.

First, let's look at each champion's best previous performance at the French Open:

French Open Champions
Previous Best FO Result    # of Men's Champions    # of Women's Champions
Winner                     7                       3
Runner-up                  2                       3
Semifinals                 0                       2
Quarterfinals              1 (Costa)               2
4R                         1 (Gaudio)              1 (Li Na)
2R                         0                       1 (Myskina)
Did Not Play               1*                      0

*Rafael Nadal won the first French Open he played, in 2005.  

Of the men that I have not yet eliminated, Federer and Nadal have both won the French Open, Murray has made the semifinals, and Tsonga has never been past the 4th round.  Because all of the champions since 2000 (except Gaudio, and Nadal in his 2005 debut) had previously been to at least the quarterfinals, I am eliminating Tsonga.  This leaves me with:
  1. Novak Djokovic (eliminated: #1 seed but not defending champ, AO Winner)
  2. Roger Federer
  3. Rafael Nadal
  4. Andy Murray
  5. Jo-Wilfried Tsonga (eliminated: never previously to QF)
For the women, 10 of the past 12 winners had previously been to the quarterfinals.  Using this as a guideline, I am eliminating Radwanska and Kvitova (neither has been past the 4th round).  Note that Li Na had never been to the quarterfinals before she won the French Open last year.  This leaves:
  1. Victoria Azarenka (eliminated: #1 seed but not defending champ)
  2. Maria Sharapova
  3. Agnieszka Radwanska  (eliminated: never previously to QF)
  4. Petra Kvitova  (eliminated: never previously to QF)
  5. Samantha Stosur (eliminated: 1R at AO)
  6. Serena Williams
  7. Marion Bartoli (eliminated: 3R at AO)
  8. Caroline Wozniacki
  9. Li Na
  10. Vera Zvonareva (eliminated: 3R at AO)
Next, let's look at the number of clay court warm-up tournaments won by the eventual champion in his/her winning year.  The plots below show bar plots of the number of warm-up tournaments won by men (top) and women (bottom).
The 2 men who did not win a warm-up tournament were both ranked outside of the top 20 when they won (Costa, Gaudio).  Because I am only considering men ranked in the top 5, I will require that my pick wins at least one warm-up tournament.  Federer and Nadal have both won clay court warm-ups this year, but Murray did not (he just lost in the final warm-up of the year).  This leaves me to choose between Federer and Nadal.

For the women, 4 of the 12 did not win warm-ups.  However, 3 of these 4 were finalists at the Australian Open (Henin, Ivanovic, Li) in the same year.  So I will require that my pick either has won a warm-up or was a finalist at this year's Australian Open (Sharapova).  This eliminates Wozniacki and Li (assuming she does not win this week), leaving Sharapova (AO finalist) and Williams (2 warm-up wins).

Now it is time to pick my favorite to win this year's French Open.

Federer vs. Nadal: Nadal has only lost once at the French Open, winning 6 of the last 7 championships.  Federer is on a hot streak of his own, going 47-3 since last year's US Open and winning 7 of his last 10 tournaments.  Federer has never beaten Nadal at the French Open, winning his only championship when Nadal was eliminated by Soderling.  Because Federer has not won a grand slam in over 2 years and Nadal has only lost once at the French Open, I am going to pick Rafael Nadal.

Sharapova vs. Serena Williams: Serena has yet to lose a clay court match this year, winning 2 tournaments and still playing in another.  Given that she also won the French Open in 2002, I am assuming that she will be the overwhelming favorite on the women's side this year.  However, Serena was in a similar position going into last year's US Open, then lost in the final to Stosur, a player who had never won a grand slam and whom Serena had beaten a few weeks previously.  Sharapova is the number 2 player in the world, reaching the finals of Wimbledon and the Australian Open.  However, in both finals she lost to players who had never been in a grand slam final before.  So it's safe to say that both players have a lot on the line at the French Open.  Serena has had some bad losses in the past at the French Open (although she was out with injury last year), and last year Sharapova made the semifinals.  Because of better results at the French Open in the past few years and the consistent grand slam performances in the past year, I am going with the upset and picking Maria Sharapova.

I realize that I made my conclusions using a lot of subjective opinion.  If I have the time, I hope to build a model to predict this year's champions and compare to the picks that I just made.

Tuesday, May 15, 2012

French Open 2012 - Part 1

It's already the middle of May, which means that the French Open is right around the corner.  In this week's posts, I will make my predictions for the men's and women's 2012 French Open Champions.  I will be looking at the results since 2000 to determine the player who I think is most likely to win.  Let's start by looking at the seed for the winner:

French Open Champions
Year    Men's Champion         Seed    Women's Champion       Seed
2000    Gustavo Kuerten        5       Mary Pierce            6
2001    Gustavo Kuerten        1       Jennifer Capriati      3
2002    Albert Costa           22      Serena Williams        3
2003    Juan Carlos Ferrero    3       Justine Henin          4
2004    Gaston Gaudio          44      Anastasia Myskina      6
2005    Rafael Nadal           5       Justine Henin          10
2006    Rafael Nadal           2       Justine Henin          5
2007    Rafael Nadal           2       Justine Henin          1
2008    Rafael Nadal           2       Ana Ivanovic           2
2009    Roger Federer          2       Svetlana Kuznetsova    7
2010    Rafael Nadal           2       Francesca Schiavone    17
2011    Rafael Nadal           1       Li Na                  6

10 of the past 12 men's champions have been seeded in the top 5.  Using this as our cutoff, this leaves the following players as possible winners this year (by current world ranking):
  1. Novak Djokovic
  2. Roger Federer
  3. Rafael Nadal
  4. Andy Murray
  5. Jo-Wilfried Tsonga
Djokovic, Federer, and Nadal have won 27 of the past 28 Grand Slam tournaments (del Potro won the 2009 US Open), but we won't eliminate Murray and Tsonga quite yet.

11 of the past 12 women's champions were seeded in the top 10.  Using the current world rankings, this leaves the following players:
  1. Victoria Azarenka
  2. Maria Sharapova
  3. Agnieszka Radwanska
  4. Petra Kvitova
  5. Samantha Stosur
  6. Serena Williams
  7. Marion Bartoli
  8. Caroline Wozniacki
  9. Li Na
  10. Vera Zvonareva
Also note that Kuerten, Nadal, and Henin are the only players since 2000 to have won the French Open as the number 1 seed.  Each time, they were the defending champion.  Since it looks like Djokovic and Azarenka will receive the number 1 seeds and neither won the French Open last year, we will eliminate them from our list.
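The seed cutoffs used above can be checked directly against the table of champions.  A quick sketch, with the seeds typed in from the table (2000 through 2011, in order) and variable names of my own:

```python
# Seeds of the French Open champions, 2000-2011, copied from the table above
mens_seeds = [5, 1, 22, 3, 44, 5, 2, 2, 2, 2, 2, 1]
womens_seeds = [6, 3, 3, 4, 6, 10, 5, 1, 2, 7, 17, 6]

men_top5 = sum(seed <= 5 for seed in mens_seeds)
women_top10 = sum(seed <= 10 for seed in womens_seeds)

print(f"men's champions seeded in the top 5: {men_top5} of {len(mens_seeds)}")
print(f"women's champions seeded in the top 10: {women_top10} of {len(womens_seeds)}")
```

This reproduces the 10 of 12 and 11 of 12 counts used as cutoffs.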

Now let's look at each champion's performance in that year's Australian Open (the most recent Grand Slam).
French Open Champions
Year    Men's Champion         AO Result    Women's Champion       AO Result
2000    Gustavo Kuerten        1R           Mary Pierce            4R
2001    Gustavo Kuerten        2R           Jennifer Capriati      Win
2002    Albert Costa           4R           Serena Williams        DNP
2003    Juan Carlos Ferrero    QF           Justine Henin          SF
2004    Gaston Gaudio          2R           Anastasia Myskina      QF
2005    Rafael Nadal           4R           Justine Henin          DNP
2006    Rafael Nadal           DNP          Justine Henin          F
2007    Rafael Nadal           QF           Justine Henin          DNP
2008    Rafael Nadal           SF           Ana Ivanovic           F
2009    Roger Federer          F            Svetlana Kuznetsova    QF
2010    Rafael Nadal           QF           Francesca Schiavone    4R
2011    Rafael Nadal           QF           Li Na                  F

None of the past 12 men's champions had won the Australian Open in the same year.  Assuming this same pattern holds, this eliminates the 2012 Australian Open Champion, Novak Djokovic (whom we had already eliminated). Our updated men's list looks like:
  1. Novak Djokovic (eliminated: #1 seed but not defending champ, AO Winner)
  2. Roger Federer
  3. Rafael Nadal
  4. Andy Murray
  5. Jo-Wilfried Tsonga
Each of the past 12 women's champions either did not play (DNP, occurred 3 times in the past 10 years) or reached at least the 4th round if they did play.  This leaves us with the following women:
  1. Victoria Azarenka (eliminated: #1 seed but not defending champ)
  2. Maria Sharapova
  3. Agnieszka Radwanska
  4. Petra Kvitova
  5. Samantha Stosur (eliminated: 1R at AO)
  6. Serena Williams
  7. Marion Bartoli (eliminated: 3R at AO)
  8. Caroline Wozniacki
  9. Li Na
  10. Vera Zvonareva (eliminated: 3R at AO)
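The two Australian Open rules can be written as simple checks against the table.  This is only a sketch: the results are typed in from the table (2000 through 2011, in order), the round ordering and function name are my own, and the 2012 results are limited to the three players whose rounds are noted in the list above.

```python
# Rounds from earliest exit to title, used to compare results
ROUNDS = ["1R", "2R", "3R", "4R", "QF", "SF", "F", "Win"]

# Same-year Australian Open results of the French Open champions, 2000-2011
mens_ao = ["1R", "2R", "4R", "QF", "2R", "4R", "DNP", "QF", "SF", "F", "QF", "QF"]
womens_ao = ["4R", "Win", "DNP", "SF", "QF", "DNP", "F", "DNP", "F", "QF", "4R", "F"]

def passes_womens_rule(result):
    """Did not play, or reached at least the 4th round."""
    return result == "DNP" or ROUNDS.index(result) >= ROUNDS.index("4R")

print(sum(r == "Win" for r in mens_ao))               # 0
print(all(passes_womens_rule(r) for r in womens_ao))  # True

# 2012 Australian Open results noted in the list above
ao_2012 = {"Samantha Stosur": "1R", "Marion Bartoli": "3R", "Vera Zvonareva": "3R"}
eliminated = [name for name, r in ao_2012.items() if not passes_womens_rule(r)]
print(eliminated)  # ['Samantha Stosur', 'Marion Bartoli', 'Vera Zvonareva']
```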
The next post will look at how important previously winning a Grand Slam or winning a clay court warm-up tournament is.

Sunday, May 13, 2012

Plot of the Week 3

Tyler Zeller: Recent Graduate and 2012 ACC player of the year
This weekend I officially graduated and earned my PhD from UNC.  In honor of graduation weekend, I want to look at an interesting plot of another new UNC graduate, Tyler Zeller.  This shows the trend of Points Per Game (PPG) over all 4 of Zeller's seasons.  Thanks to StatSheet.com for the interactive plot.


A few notes:

  1. Zeller was out most of his freshman year with an injury, thus the constant decline in PPG.
  2. He improved in PPG (along with most of the other stats) each of his 4 seasons.
  3. Compared to 2010-2011, Zeller averaged fewer points in the first half of this past season, but really picked up his scoring once the ACC season began in January.
  4. I love that this is an interactive plot, highlighting the curves and showing specific values when you mouse over it.  I need to learn to make some of these cool plots.
  5. Congrats to Tyler Zeller, my friends from the UNC Department of Statistics and Operations Research that graduated with me, and all of the other UNC students who graduated this weekend! Go Heels!

Wednesday, May 9, 2012

Simplest NBA Playoff Model


The Simplest Playoff Model You’ll Never Beat

There's an ESPN competition where many well-respected sports statisticians try to predict the NBA playoffs.  This article was written by the winner of last year's competition.  He also goes into detail explaining the difficulties of predicting this year's playoffs.

Here is an excerpt where he explains a very simple playoff model:
So case in point, I came up with this 2-step method for picking NBA Champions:  
  1. If there are any teams within 5 games of the best record that have won a title within the past 5 years, pick the most recent.
  2. Otherwise, pick the team with the best record. 
Following this method, you would correctly pick the eventual NBA Champion in 64.3% of years since the league moved to a 16-team playoff in 1984 (I call this my “5-by-5” model).
Of course, thinking back, it seems like picking the winner is sometimes easy, as the league often has an obvious “best team” that is extremely unlikely to ever lose a 7 game series.  So perhaps the better question to ask is: How much do you gain by including the championship test in step 1? 
The answer is: a lot. Over the same period, the team with the league’s best record has won only 10/28 championships, or ~35%. So the 5-by-5 model almost doubles your hit rate.
Great example of how simple models can result in great prediction accuracy.
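To show just how simple the 2-step rule is, here is a rough sketch of it in code.  This is my reading of the excerpt, not the article author's implementation: the Team structure, the function name, and measuring "within 5 games" as a difference in wins are all assumptions, and the teams in the usage example are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Team:
    name: str
    wins: int
    last_title: Optional[int]  # year of most recent championship, or None

def pick_champion(teams, season_year, games_window=5, title_window=5):
    """Pick a champion using the 2-step "5-by-5" rule described above."""
    best_wins = max(t.wins for t in teams)
    # Step 1: recent champions within games_window wins of the best record
    recent_champs = [
        t for t in teams
        if best_wins - t.wins <= games_window
        and t.last_title is not None
        and season_year - t.last_title <= title_window
    ]
    if recent_champs:
        return max(recent_champs, key=lambda t: t.last_title)
    # Step 2: otherwise, the team with the best record
    return max(teams, key=lambda t: t.wins)

# Hypothetical teams, for illustration only
teams = [Team("A", 60, None), Team("B", 57, 2010),
         Team("C", 56, 2011), Team("D", 50, 2012)]
print(pick_champion(teams, season_year=2013).name)  # C
```

Team D is the most recent champion but sits 10 wins behind the best record, so step 1 picks C, the most recent champion among the teams close to the top.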

Monday, May 7, 2012

Plot of the Week 2

Here is a plot showing the top offensive performances by year for MLB, divided into the American League (AL) and National League (NL), and NBA.  For baseball, I plot the home run leaders for each year since 1947.  For basketball, I plot the scoring title winner's points per game.  Because the NBA season is played over 2 calendar years, I plot the year that the season started in (e.g., Kevin Durant just won the scoring title for 2011-2012, which is plotted as 2011).

A few quick notes on this graph:
  • There is a lot of variability - good luck trying to predict the number of home runs the leader will have after this season.
  • 1961 was a great year for individual performances:
    • Roger Maris hit 61 home runs
    • Wilt Chamberlain averaged 50.4 PPG
  • There is a clear rise in the number of home runs between 1995 and 2008-ish, representing the steroid age.  There is no increase in scoring for the NBA during this time frame.
Now's the time to sound really smart and comment about what I may have missed.

Thursday, May 3, 2012

The Mystery Behind NFL Suicides

If you haven't heard, former NFL player Junior Seau was found dead in a possible suicide.  This event has the media up in arms again about the link between concussions, depression and suicide in football players.  I'm sure that it's only a matter of time before Congress steps in to voice their opinion*.

Pro athletes live glamorous lives while playing, but many retired athletes, regardless of sport, struggle with depression after falling out of the limelight and with other personal problems (bankruptcy, family issues, not knowing what to do when retiring at age 30, etc.).  I wanted to compare suicide numbers between the NFL (a high-contact sport) and non-contact professional sports, expecting to find a higher rate among NFL players than other athletes.

However, for all of the press that the NFL and suicide are getting right now, I challenge you to find the number of former NFL players who have committed suicide - I can't find it!  Here's what I did find:

  • Cricket is known to have the highest suicide rate in professional sports, with over 150 known cases in the 20th century.  See the link here for a good explanation of possible reasons for this.
  • As of 2005, there have been 76 suicides of former MLB players.
  • All we have for NFL suicide numbers are anecdotal stories, which play towards the heart but not the mind of statisticians.  

I realize that researchers have shown a link between brain injuries sustained playing in the NFL and depression, which has at times led to suicide.  But without the data, we can't conclude that former NFL players are any more likely to commit suicide than other athletes, such as MLB players or cricketers.  Extra credit to anyone who can help me find this data.


*Showing how effective their MLB steroid hearings were, Roger Clemens is currently in court for possibly lying about possibly taking steroids, and Ryan Braun, who last year failed a drug test and got off on a technicality, is on the field making millions playing in MLB.