
Learning from the data

I’m contemplating a course project based on “knowledge discovery” in marathon chip data. Not published results with 5k splits, but the raw, direct-from-the-mats chip data, the stuff the timing company uses to re-run the previous year’s race as a system test.

Knowledge discovery (usually called “machine learning” at the University, but also sometimes known as “data mining”) is an interesting field, because it implies the idea that there are patterns in data which are too subtle for us to see. One of the major tasks is classification, often used in medical applications to distinguish a set of symptoms as ill vs. not ill.

That’s not a simple task for marathon data; what are the classifications? Did the athlete beat their seed time? Did they finish? It might be intriguing simply to see if a program could predict, based only on chip data, the gender of the athlete wearing that chip.

The profusion of data with a high level of variance is a big problem for this hypothetical analysis, but another one is the mentality. We know there’s a huge number of variables in play, and at some point we discard the possibility that we could ever make sense of it all. But one of the strengths of machine learning is that the software decides which variables are actually relevant, and which are just noise.

It’s also approaching the problem of identifying which data are representative and which are outliers; our gut instinct is to suggest that we’re all outliers, but that’s clearly not the case, or there wouldn’t be thousands of runners crossing the line every hour.

So if you stop worrying about whether the answer can actually be found—that’s a question to be answered later—and just think about questions you might ask, what would you look for in marathon data?


Comments

It might be interesting to look at gender vs. late race pace degradation in the marathon. Similarly age vs. pace degradation. I have some ideas about what you’d find. Also, it might be interesting to look at first-timers vs. experienced marathoners.

Is there a cluster of finishes around obvious round number target times like 4 hours?

Do male runners try harder to finish just in front of, rather than just behind, an elite female runner?

Ditto for well-known “name” runners.

What is the effect of runners in fancy dress? Is there a huge cluster who are too embarrassed to finish behind Tinkerbelle or Goofy?

One hypothesis I’d like to test is whether people who finish just over three hours (or some other salient if arbitrary goal) run positive splits more often. That is to say, more people running 3:04 as 1:29/1:35 than 1:32/1:32.
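A rough sketch of how that could be checked, assuming each runner's record boils down to a half-marathon split and a finish time in seconds. The function names and the five records are made up for illustration; real chip data would need cleaning first.

```python
def hms(hours, minutes, seconds=0):
    """Convert a clock time to seconds."""
    return hours * 3600 + minutes * 60 + seconds

def positive_split_rate(runners, low, high):
    """Fraction of runners finishing in [low, high) seconds whose second
    half was slower than their first half. Returns None for an empty band."""
    in_band = [(half, finish) for half, finish in runners if low <= finish < high]
    if not in_band:
        return None
    positive = sum(1 for half, finish in in_band if finish - half > half)
    return positive / len(in_band)

# Made-up records: (half-marathon split, finish time), both in seconds.
runners = [
    (hms(1, 29), hms(3, 4)),   # 1:29 / 1:35, positive split
    (hms(1, 32), hms(3, 4)),   # 1:32 / 1:32, even split
    (hms(1, 28), hms(3, 2)),   # 1:28 / 1:34, positive split
    (hms(1, 29), hms(2, 57)),  # 1:29 / 1:28, negative split
    (hms(1, 27), hms(2, 56)),  # 1:27 / 1:29, positive split
]

just_over = positive_split_rate(runners, hms(3, 0), hms(3, 5))
just_under = positive_split_rate(runners, hms(2, 55), hms(3, 0))
print(f"just over 3:00: {just_over:.2f}, just under 3:00: {just_under:.2f}")
```

The hypothesis would predict a noticeably higher rate in the just-over band than in the just-under band. With real data, the band width (five minutes here) is a judgment call worth varying.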

I seem to remember some NCAA XC results (maybe from appleraceberryjam) that included a “clumping” analysis, or some such measurement. I, perhaps misinterpreting their intent, took this to be an attempt to analyze how much of a propensity runners have to stick with other runners. With your marathon data that includes mile splits for runners as well as finish times, you could look at the extent to which runners run together in packs, groups or pairs during the race and compare that to the finish. Of course, the more basic function of the data would be to see how much runners stick together, period. Are runners clumping more than an average bell curve of their times would suggest that they ought to? In XC results I’ve looked at graphically, most races follow a pretty standard bell curve - a few super fast runners, a few slow runners, and the bulk of runners in the middle. Even with small samples (<100 runners) the curve is pretty uniform. If there was clumping around the hour marks in marathons, it would show up quite readily I imagine - one could do that in Excel in 5 minutes with your marathon data, no need for any fancy coding, so I imagine you, P, have more complex analyses intended.

What’s more interesting to me is how runners improve (or not) over time, and the relative difficulties of different courses or different races. Taking runners who have run the same race in multiple years, you can see how that runner, and runners on average, improve or worsen their times from year to year. Obviously variables like the weather will affect the aggregate data to a large degree. But that’s a good way to isolate improvement in certain runners. If 1,000 runners ran Race X in 2005 when the weather was cool, and again in 2006 when the weather was hot, and on average the runners slowed down 3% from 2005 to 2006, but a subgroup of 100 runners averaged a 1% improvement, then you go find those runners and see what they did: maybe they drank more water than average, maybe they had a particular training technique that other runners did not adopt.
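As a minimal sketch of that year-to-year comparison, assume each year's results are a mapping from some stable runner identifier to a finish time in seconds. The identifiers, times and function names below are invented; matching the same person across years is the hard part and is not shown.

```python
from statistics import mean

def percent_changes(year_a, year_b):
    """Per-runner percent change in finish time from year_a to year_b,
    restricted to runners who appear in both years."""
    return {
        runner: 100.0 * (year_b[runner] - year_a[runner]) / year_a[runner]
        for runner in year_a.keys() & year_b.keys()
    }

def against_the_trend(changes):
    """Return the field's average change and the runners who got faster
    even though the field as a whole slowed down."""
    field_avg = mean(changes.values())
    if field_avg <= 0:
        return field_avg, []
    return field_avg, sorted(r for r, c in changes.items() if c < 0)

# Invented finish times in seconds, keyed by a hypothetical runner id.
times_2005 = {"a": 12000, "b": 15000, "c": 10000, "d": 14000}
times_2006 = {"a": 12600, "b": 15750, "c": 9900, "e": 13000}

changes = percent_changes(times_2005, times_2006)
field_avg, improvers = against_the_trend(changes)
print(f"field slowed by {field_avg:.1f}% on average; improved anyway: {improvers}")
```

Runners "d" and "e" drop out because they only ran one of the two years. The runners returned in `improvers` are the ones worth tracking down and asking what they did differently.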

Similarly, one can compare the degree of difficulty at different courses of the same distance (New York v. Boston, for example) by taking relative times from the same runners from multiple years and averaging the variation.

I don’t imagine the latter two are what you have in mind - they would require developing a database of runners (most of that data is available on coolrunning and other websites online), but someone with coding ability ought to be able to write something that parses the hytek results HTML output that all these races use, extracts the runners’ names, any other identifying data and their times from each race, and adds them to the database. Then you could design an interface that allows the user to compare the average runner (or an individual runner - people would love it!) from race to race, year to year, and thus get a hugely useful average difficulty factor for any given race in any given year compared to any other given race in any year that’s ever been recorded digitally.

If I were a computer science grad student, that’s what I’d want to work on. As for the raw chip data, I imagine you could analyze the difference in bell curves between coed, male-only and female-only races to detect gender differences in pacing and clumping. My hypothesis would be that females tend to clump more, particularly at the beginning of a race, and that males clump less, but do so more at the end of a race than do women. Inexperienced runners will clump more at the end, less at the beginning than experienced runners, who will conversely clump more at the beginning as they run in packs, and less later on, as they drop off the pack, although perhaps this applies only to elite runners or the leader group in any given race, rather than to ‘experienced’ runners.

I also conjecture that on average ‘elite’ athletes run positive splits more than the average runner, as they try to hang on to a pack that is running at or faster than their goal pace, only to fall off the back 6, 13 or 20 miles in.

Depending on how many split times you have, you could analyze the effects of terrain changes (uphills/down hills, different surfaces) on average runner pace. You can see how well average runners maintain a consistent pace in different races, how good inexperienced vs. experienced athletes are at pacing. Again, the more races over more years you have of this data, the more analysis you can do. And for this stuff, you don’t even need to isolate individual runners.
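One simple way to put a number on "consistent pace", sketched under the assumption that each runner's chip reads can be reduced to (distance in km, elapsed seconds) pairs at each timing mat. The mat positions and times below are invented, and the coefficient of variation is just one possible measure.

```python
from statistics import mean, pstdev

def segment_paces(mats):
    """mats: list of (distance_km, elapsed_seconds), sorted by distance and
    starting at (0, 0). Returns seconds per km between consecutive mats."""
    return [
        (t1 - t0) / (d1 - d0)
        for (d0, t0), (d1, t1) in zip(mats, mats[1:])
    ]

def pacing_variability(mats):
    """Coefficient of variation of segment paces: 0.0 means perfectly even
    pacing, larger values mean more variation between segments."""
    paces = segment_paces(mats)
    return pstdev(paces) / mean(paces)

# Invented mat reads for two runners: (km, elapsed seconds).
steady = [(0, 0), (10, 3000), (20, 6000), (30, 9000), (40, 12000)]
fader = [(0, 0), (10, 2800), (20, 5700), (30, 8900), (40, 12500)]

print(segment_paces(fader))
print(pacing_variability(steady), pacing_variability(fader))
```

This treats every segment equally, which is fine when the mats are evenly spaced. With uneven spacing, or to study terrain, it would make more sense to compare each segment's pace across the whole field than to boil a runner down to one number.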

Anyway, there’s a lot of potential for developing a tool that would be immensely useful to coaches, runners, meet directors and sports doctors, let alone track geeks. You should definitely do this instead of some professor’s lame RA position.

Between Eric’s and Tillerman’s questions, I should probably share this fraction of a response I got while emailing around for leads in my literature review:

“Something I discovered when analyzing finishing times from the London Marathon and a South African marathon 15+ years ago was the effect of ‘barriers’, such as the ‘three hour barrier’ on the fine detail of the distribution of finishing time. With runners talking about trying to break various barriers, one would think that many runners would try and just fail, so that there would be a number of runners just slower than each round number time point. The results show the opposite—there tended to be a larger number in the minute or two just under the ‘barrier’ and then relatively fewer finishers in the first few minutes afterwards. My guess is that runners realize a few miles out that they are not going to make their target and then slow down.”
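The comparison described in that email can be sketched as a count of finishers in a window on either side of the barrier. The finish times below are invented, and the two-minute window is an assumption taken loosely from the "minute or two" in the quote.

```python
def barrier_counts(finish_times, barrier, window):
    """Count finishers in the `window` seconds just under the barrier and
    in the `window` seconds starting at the barrier."""
    under = sum(1 for t in finish_times if barrier - window <= t < barrier)
    over = sum(1 for t in finish_times if barrier <= t < barrier + window)
    return under, over

# Invented finish times in seconds around the three-hour mark (10800 s).
finish_times = [10600, 10700, 10750, 10790, 10799, 10800, 10850, 11000]

under, over = barrier_counts(finish_times, barrier=3 * 3600, window=120)
print(f"{under} just under three hours, {over} just over")
```

On its own the pair of counts doesn't prove much. A fairer test would run the same comparison at a few arbitrary times away from any round number, to see how lopsided neighboring windows are without a barrier between them.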
