I recently picked up running again. Like a lot of other data enthusiasts, I find it interesting to measure my performance, develop a baseline and then start competing with myself. Having the computer-synthesized voice of RunKeeper in my ears from time to time adds some gamification to the otherwise very monotonous exercise that running is.
On my (almost daily) 6.5 km run, I recently started testing the magical 5 min/km pace wall that a lot of recreational runners are fighting with. Fueled by much excitement, and the memory of conquering the Dam-tot-Damloop in 2017, I began wondering whether I should aim a bit higher and start training for a ‘real’ marathon. Before making any commitments, I decided to play around with some data to see how I would match up against other participants.
More or less all of the big marathons have their timing data available online, so which one should I use? Well, I really like Berlin and would not mind going there again, so why not the Berlin Marathon?
Questions and objectives
In my experience, a good analysis always starts out with some initial questions. As you dive deeper into the data set, you quickly get inspired and new questions get added to the list. That’s totally cool, but I always strive to at least have a direction.
Here’s what I would like to get visualized and answered:
- Plot the distribution of finishing times, for all the participants.
- What is the minimal pace required to be part of the top 10%?
- What is the average (mean) pace for all the participants?
- What is the average age of the participants?
- Which age class is performing the best?
- Who is best at endurance running (on average), men or women?
- And what is the difference?
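To make the pace questions a bit more concrete, here is a minimal Python sketch of how they could be computed. It assumes finishing times arrive as `HH:MM:SS` strings; the helper names (`parse_time`, `pace_min_per_km`, `format_pace`, `top_fraction_cutoff`) and the sample times are made up for illustration and are not taken from the actual Berlin Marathon data or the scraper.

```python
import statistics

MARATHON_KM = 42.195  # official marathon distance in kilometres


def parse_time(hms: str) -> int:
    """Convert an 'HH:MM:SS' finishing time into total seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def pace_min_per_km(total_seconds: float, distance_km: float = MARATHON_KM) -> float:
    """Average pace in minutes per kilometre for a given finishing time."""
    return total_seconds / 60 / distance_km


def format_pace(pace: float) -> str:
    """Render a pace in min/km as 'M:SS', rounded to the nearest second."""
    total = round(pace * 60)
    return f"{total // 60}:{total % 60:02d}"


def top_fraction_cutoff(times: list, fraction: float = 0.10) -> float:
    """Slowest finishing time that still falls within the fastest `fraction`."""
    ordered = sorted(times)
    count = max(1, int(len(ordered) * fraction))
    return ordered[count - 1]


# Hypothetical sample, NOT real race results.
sample = ["02:58:10", "03:15:42", "03:30:58", "03:47:05", "04:02:33",
          "04:10:19", "04:25:51", "04:41:07", "05:03:44", "05:30:12"]
seconds = [parse_time(t) for t in sample]

cutoff_pace = pace_min_per_km(top_fraction_cutoff(seconds))
mean_pace = pace_min_per_km(statistics.mean(seconds))
print("Pace needed for the top 10%:", format_pace(cutoff_pace), "min/km")
print("Mean pace:", format_pace(mean_pace), "min/km")
```

Working in seconds and converting to a pace only at the end keeps the arithmetic simple. The age class and gender questions would follow the same pattern, grouping the times before averaging.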
Getting the data
# Explain the scraper and link to GitHub.
# Turn the Jupyter Notebook into a nice WP text.