Learning from the data
I’m contemplating a course project based on “knowledge discovery” in marathon chip data. Not published results with 5k splits, but the raw, direct-from-the-mats chip data, the stuff the timing company uses to re-run the previous year’s race as a system test.
Knowledge discovery (usually called “machine learning” at the University, but also sometimes known as “data mining”) is an interesting field, because it implies that there are patterns in data which are too subtle for us to see. One of the major tasks is classification, often used in medical applications to sort a set of symptoms into ill vs. not ill.
That’s not a simple task for marathon data; what are the classifications? Did the athlete beat their seed time? Did they finish? It might be intriguing simply to see if a program could predict, based only on chip data, the gender of the athlete wearing that chip.
The profusion of high-variance data is one big problem for this hypothetical analysis, but another is mentality. We know there’s a huge number of variables in play, and at some point we dismiss the possibility that we could ever make sense of it all. But one of the strengths of machine learning is that the software decides which variables are actually relevant, and which are just noise.
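To make that concrete, here’s a toy sketch of the gender-prediction idea using scikit-learn. Everything here is hypothetical: the split times are synthetic (not real chip data), the 0/1 labels are a stand-in for gender, and the group-level pace gap is an invented assumption just to give the classifier something to find. The point is the last line: a random forest reports, via `feature_importances_`, which inputs it actually leaned on, which is exactly the “software decides what’s relevant” property.

```python
# Toy sketch: can a classifier separate two synthetic groups of runners
# from their 5k splits alone? All data here is fabricated for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Eight 5k splits per runner, in seconds. Each runner has a base pace,
# a tendency to fade over the second half, and per-split noise.
base = rng.normal(1500, 120, size=(n, 1))
fade = np.linspace(0, 1, 8) * rng.normal(60, 30, size=(n, 1))
labels = rng.integers(0, 2, size=n)          # 0/1 stand-in for gender
offset = labels[:, None] * 150               # assumed group-level pace gap
splits = base + fade + offset + rng.normal(0, 40, size=(n, 8))

X_train, X_test, y_train, y_test = train_test_split(
    splits, labels, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
# The forest's own estimate of which splits carried signal vs. noise:
print("split importances:", clf.feature_importances_.round(3))
```

The accuracy here is unremarkable by design; the interesting part is that nobody told the model which splits mattered, and it assigns importances anyway. Real chip data would swap the synthetic matrix for actual mat crossings.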
Machine learning also approaches the problem of identifying which data are representative and which are outliers; our gut instinct is to insist that we’re all outliers, but that’s clearly not the case, or there wouldn’t be thousands of runners crossing the line every hour.
So if you stop worrying about whether the answer can actually be found—that’s a question to be answered later—and just think about questions you might ask, what would you look for in marathon data?