## Vasaloppet 2014 – Data analysis of a XC ski race

Vasaloppet is the largest cross-country ski race in the world, with 15,800 skiers. From Wikipedia: “Vasaloppet (literally, The Vasa race) is an annual long distance (90 km) cross-country ski race (ski marathon) held on the first Sunday of March in northwestern Dalarna, Sweden between the village of Sälen and town of Mora. It is the oldest, the longest, and the biggest (in terms of participants) cross-country ski race in the world.[1] The race was first run in 1922, inspired by a run by King Gustav Vasa in 1520.”

In road races such as marathons, all finishers receive a medal. In Vasaloppet, you only get a medal if you finish within 150% of the winning time.

To give you an idea of what that means, take the 2013 Chicago Marathon. The winning time for men was 2:03:45. For a medal, you would have needed to finish in 3:05:37 (150% of the winning time), and in that race this would have put you in the top 6% (place 1237 of 21488) of the men’s division. Running a marathon close to 3 hours is really hard and involves serious training and time commitment. Men and women each have their own medal time, based on the winning male and the winning female time respectively.
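The arithmetic behind the Chicago comparison can be sketched in a few lines of Python. The helper names (`to_seconds`, `to_hms`, `medal_cutoff`) are made up for this illustration, and dropping the leftover half second is an assumption chosen so that the result matches the 3:05:37 quoted above.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'h:mm:ss' string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Format whole seconds as 'h:mm:ss'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def medal_cutoff(winning_time: str) -> str:
    """Return 150% of the winning time; any half second is dropped (assumption)."""
    return to_hms(to_seconds(winning_time) * 3 // 2)


print(medal_cutoff("2:03:45"))   # 3:05:37
print(f"{1237 / 21488:.1%}")     # 5.8%, i.e. roughly the top 6%
```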

So, what this means is that getting a medal in Vasaloppet is a big deal.

A friend of mine, Kalle, participated in Vasaloppet this year. His time of 6:47:03 did not beat the medal time of 6:21:50, but he wasn’t that far off either. So I figured I would do some analysis to see what was going on. Vasaloppet is 90 km, or 56 miles. There are 8 legs of about 9 – 15 km each. Each leg ends in a different town, akin to the Boston Marathon, which goes from Hopkinton to Ashland to a few other towns and finishes in Boston.

I downloaded the results for Vasaloppet, which include splits for each leg, for Kalle’s class (men, 55-59). There are 832 skiers in that class. During the pre-processing step, the pace (min/km) was calculated for each skier on each leg and used in all subsequent results. I could simply have used the splits, but I wanted to see the actual pace to get a better feel for how fast they ski.
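The pre-processing step boils down to dividing the time spent on a leg by the length of that leg. Here is a minimal Python sketch, assuming each split is the time spent on that leg alone. The function name and the two example legs are hypothetical and are not taken from the race data.

```python
def pace_min_per_km(leg_seconds: float, leg_km: float) -> float:
    """Pace in minutes per kilometre for one leg."""
    return leg_seconds / 60 / leg_km


# Hypothetical (distance_km, leg_seconds) pairs, not real race data
legs = [(11.0, 3300), (13.0, 3120)]

paces = [round(pace_min_per_km(seconds, km), 2) for km, seconds in legs]
print(paces)  # [5.0, 4.0]
```

If the downloaded splits are cumulative checkpoint times instead, the leg time would first have to be obtained by subtracting the previous checkpoint time.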

A few goals with the analysis:

1. For each leg, I wanted to create a box plot to see which legs were slower and which were faster.
2. I wanted to overlay Kalle’s time to see how he did on a per leg basis compared to his competition.
3. To see how Kalle fared compared to medal time, I also overlaid the medal time.
4. This was a good opportunity to look at violin plots, which I haven’t used in the past.
5. In addition, as a secondary plot, I wanted a heat map that uses color to show the differences between legs as well as between skiers.

Using R and the ggplot2 graphing library, it was very easy to create the graph (let me know if anyone is interested in the raw data and the code). A few observations:

1. There are actually two plots as I overlaid the box plot on top of the violin plot.
2. The first leg, from the start to Smågan, has a 2 km uphill which causes somewhat zoo-like conditions. That is why the spread from the fastest (~3.5 min/km) to the slowest (~13.8 min/km) is so large. The remaining 7 legs are more normal.
3. Legs 2, 3 and 5 seem faster than the other legs, which makes sense as the terrain is more conducive to faster skiing (read: more downhill)
4. Kalle’s splits are in red and the medal splits are in green
5. Kalle is doing well: in all legs, except the first leg where he got caught in the zoo, Kalle is either faster or almost faster than 75% of the skiers in his class.  On leg 3, he beats the medal pace.
6. Based on the plot, it is easy to see that in order to win a medal, your time needs to be clearly better than 75% of all skiers. Note though that the winning time is based on first overall (not first in class) and in this case, we are only looking at one of the classes.
7. Based on the violin plots, we can see that early in the race, the violin plots are more pear shaped. As the race goes on and people become more tired, the widest part of each plot moves upwards (the majority of the skiers are getting slower), giving the last plot the body of a bodybuilder.
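For readers who want to check the "faster than 75% of the skiers" reading without a plot, here is a rough Python sketch of the numbers a box plot summarizes for one leg. The pace list is hypothetical (only its minimum and maximum echo the leg 1 range mentioned above), and the quartile method is an assumption; ggplot2 may compute its box hinges slightly differently.

```python
from statistics import quantiles


def box_stats(paces):
    """Return (min, Q1, median, Q3, max) for one leg's paces in min/km."""
    q1, median, q3 = quantiles(paces, n=4, method="inclusive")
    return min(paces), q1, median, q3, max(paces)


def share_slower(paces, pace):
    """Fraction of skiers with a slower (numerically higher) pace than `pace`."""
    return sum(p > pace for p in paces) / len(paces)


# Hypothetical paces for one leg, not real race data
leg_paces = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 13.8]

print(box_stats(leg_paces))                     # (3.5, 4.5, 5.5, 6.5, 13.8)
print(f"{share_slower(leg_paces, 4.2):.0%}")    # 78%
```

Because a lower pace means faster skiing, being faster than 75% of the class corresponds to a pace below the first quartile, i.e. below the bottom edge of the box.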

Vasaloppet (men 55)

The next plot is a standard heat map, also based on R and the ggplot2 library. The plot is small but high resolution because there are 832 skiers vertically. For better details, download the png and zoom in.

The plot shows the faster skiers at the top and the slower skiers (based on finishing time) at the bottom. Blue means a faster pace and red means a slower pace.
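The row ordering of the heat map can be sketched as follows in Python: one row per skier sorted by finishing time, one column per leg, each cell holding the pace. The record layout, the function name, and the three example skiers are hypothetical and only illustrate the idea; the actual plot was made with R and ggplot2.

```python
# Hypothetical records: (label, finish_seconds, per-leg paces in min/km)
skiers = [
    ("C", 30000, [9.0, 5.5, 5.2]),
    ("A", 20000, [4.0, 3.6, 3.5]),
    ("B", 25000, [6.0, 4.5, 4.4]),
]


def heatmap_rows(records):
    """Order skiers by finishing time, fastest first, and split labels from paces."""
    ordered = sorted(records, key=lambda record: record[1])
    labels = [record[0] for record in ordered]
    matrix = [record[2] for record in ordered]
    return labels, matrix


labels, matrix = heatmap_rows(skiers)
print(labels)     # ['A', 'B', 'C']
print(matrix[0])  # [4.0, 3.6, 3.5]
```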

Two observations stand out. First, yes, leg 1 was a zoo, and you can see the 10 – 14 min/km paces in red. Second, as mentioned above, legs 2, 3 and 5 are clearly the faster legs, and leg 5 is perhaps the fastest.

Let me know if you want the code or the data.