Navel-Gazing: Rankings and Methods
SMQ's guest spot last week on Every Game Counts, detailing his issues with Ohio State as the presumptive, no-doubt Number One and his "resume-based" ranking method, led to a couple further questions in comments from Peter Bean of Burnt Orange Nation:
2. Is it even possible to avoid some amount of power polling? Let's take this to the next level. I ask you because, well, because you're thoughtful enough to sort through it: What if, as an example, Tennessee had tanked after their whipping of Cal. But Cal still improved as much as they have since week one. Does it affect Cal's resume? Should it?
Cal's loss to Tennessee is on the resume regardless of what happens to Tennessee - SMQ looks at the process as game-by-game, and each one counts on its own merits (demerits). So if Tennessee had tanked, Cal would definitely suffer for that in judging the "value" of that game. But Tennessee's win, too, would have less value if Cal had tanked.
On the first question, SMQ's preseason poll was not a "power poll," but a projection of how each team would finish in the final poll in January (the difference being that schedule was a huge factor in SMQ's poll, where it wouldn't be considered in a preseason power poll). But this ballot was also thrown out after the first week of games in favor of the "resume" for the rest of the season.
The most important measure of any poll or ballot is its internal consistency. On that note, SMQ came up with hairy, confusing definitions of several possible methods of ranking teams that he imagines encompass most voters:
Power Poll, or "Holistic"
The apparently preferred method, which asks simply, "Who's better?" or "Who would beat who on a neutral field?" or something like that. No measurables, just a human brain sorting information as it sees fit - a kind of almost metaphysical effort to determine the "essence" of a team in its current incarnation. If you're a voter and haven't given much thought to your overriding method, this is almost definitely what you're doing.
Strengths: Simple, direct, and the most flexible, because it's based the most on perception and opinion. Can incorporate both a "resume" and a "futures" element that takes into account where a team has come from and where it's going compared to another, similar team; i.e., if two teams look like they're in the same spot at the same point in the season, like undefeated West Virginia and undefeated Rutgers, for example, a notion of "strength" can take into account not only WVU's more successful past, but also its likely more successful future as the conference schedule stiffens towards the end of the season. If a voter looks at its remaining slate and says, "Rutgers is going to fall," the Knights will remain below a similar team, like maybe Boise State, which hasn't necessarily been more impressive on the field but has clearer skies ahead.
Drawbacks: Haphazard. There are really no internal rules to dictate consistency, which is a bitch when perception does not reflect reality, and an overemphasis is placed on a team's history (meaning past seasons) rather than its present. Ratings on "strength" are abstract, almost by definition non-quantifiable, and easily wrecked by idiosyncrasies in the illogical infinite regress of who beat who - in 2005, for instance, a victory chain can be drawn to show how Division III Averett University could have beaten Ohio State, which is proof (the chain, that is, which can be drawn to and from any team in any division) that merely beating a team is not a pure indicator of "strength." So other very malleable notions like "talent" must be brought into the picture to determine a prospective ten-win team from an eight-game winner. It's a real instinctual, gut-feeling guessing game up here, when one of the first rules of the process should be that your eyes and gut are not always reliable sources. Also leads to the dreaded "drop-em-when-they-lose" syndrome, which is excessively loyal to preconceived notions and pretty much just unfairly stubborn.
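The victory-chain point can be made concrete with a small search. The sketch below is illustrative only: the results list is invented, and `find_victory_chain` is a hypothetical helper, not something any poll actually uses. It runs a breadth-first search over "X beat Y" results and returns the shortest chain linking one team to another.

```python
from collections import deque

def find_victory_chain(results, start, target):
    """Return the shortest list of teams linking start to target through wins, or None."""
    # Map each winner to the list of teams it beat.
    beaten_by = {}
    for winner, loser in results:
        beaten_by.setdefault(winner, []).append(loser)

    queue = deque([[start]])
    seen = {start}
    while queue:
        chain = queue.popleft()
        if chain[-1] == target:
            return chain
        for loser in beaten_by.get(chain[-1], []):
            if loser not in seen:
                seen.add(loser)
                queue.append(chain + [loser])
    return None

# Made-up results as (winner, loser) pairs; these are not real games.
results = [("Small U", "Mid State"), ("Mid State", "Big Tech"), ("Big Tech", "Giant U")]
chain = find_victory_chain(results, "Small U", "Giant U")
print(" beat ".join(chain))
```

With a full season of results loaded, a chain like this exists between almost any two teams, which is why a single win says so little about "strength."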
Resume
A method that attempts to rank based strictly on the measurable: if each team had a resume for this season and this season only, and its name at the top was blacked out, how would the voter rank those resumes? Takes into account only games played to date this season - these are folks who always complain about polls that come out and distort reality before October. SMQ's preferred method all year, and seemingly the default method for most end-of-season rankings.
Strengths: Consistency. Attempts to use "evidence" rather than perception or past history to eliminate abstraction, and treats every team equally and entirely as a team - doesn't give any boosts or demerits to teams based on the recent past or personnel. For example, Tennessee's opening win over Cal was deemed the most impressive of the week, and the Vols were number one in SMQ's poll in Week Two. If Boise State defeats a I-AA team in its opener by two touchdowns more than Georgia defeats a I-AA team, as was the case the first week of this season, the "Resume" voter would rank Boise higher in the second week even if he believed Georgia was the "better" team, because there's no way to measure UGA's perceived superiority - it's just an abstract notion based on past teams, not the current reality. When Michigan State was an impressive 3-0, the "Power Poll" voter might have said "I don't believe in the Spartans, they always fall apart," and stayed away from MSU, but the "Resume" voter, even if he believed in an imminent collapse, would criticize and reward based solely on those three games, and deal with the meltdown only when it came (which, of course, it did in the fourth game). All that's considered is what's happened on the field to date, which is all that can be measured, and which is all anyone will have to go on in the final ranking in January, when it counts.
Drawbacks: "Attempts" is the very key word above. Even if a voter is using a statistical method (see below), subjectivity and abstraction creep in when considering how much credit or punishment is deserved for a particular win, especially early in the season, or, on the same lines, how to account statistically for strength of schedule. It's OK that the same win or loss on a resume changes in value as the season goes along according to changes in perception about a particular opponent, but that's still dreaded perception, which is what Peter was getting at in his second question. Early this year, in trying to come up with a way to account for strength of schedule on various resumes, SMQ started making a list that assigned a basically arbitrary value to each team as part of a group of similarly-valued teams, until it dawned on him to ask, "If this is what I actually think of these teams, why don't I just use this list?"
Futures
At the other pole, it's the mock stock approach - explicitly embraced most weeks by Orson and ripped off at least once by Gameday - of "buying" and "selling" (or "holding") teams based on where they're going to end up at year's end. These are the people who have West Virginia at two, or, weirder, one, based on the Mountaineers' softy schedule. It's not about what you've done, or how "good" you are - it's only about where you wind up.
Strengths: Ruthless pragmatism. The "Futures" voter probably didn't get carried away with Florida because of the minefield it had ahead of it, and is probably a lot less excited by Southern Cal with California, Oregon and Notre Dame awaiting than the Trojan-loving computers are. On the other end, Arkansas' stock shot up like a rocket with its remaining schedule after it beat Auburn; big money's going down on either Texas or Nebraska (especially if it's Texas) after this weekend, because it's pretty much clear sailing for the winner right into the Big XII Championship.
Drawbacks: Highly speculative by definition. Rewards soft scheduling, and creates bubbles around teams prepared to devour the empty calories in delicious cupcakes. Instills a hollow, frontrunning mentality.

Past results are no guarantee of future returns
Statistical (Faux "objective")
Like the "Resume" method, eliminates speculation and abstractions like perception and previous history, but takes it to the extreme by running cold, hard numbers to reach a conclusion bordering as closely as possible on scientific fact. The much-maligned computer guys.
Strengths: Able to process huge amounts of relevant information that puny human brains could never consider alone, and reach subsequently enlightening conclusions. When SMQ raged against the machines Monday, frequent commenter and resident stat guru Paul Kislanko argued "the only thing worse than using computers is using the human polls," and said by the end of the season, when teams are more connected by common opponents and opponents of opponents, etc., results like six I-AA teams ranked ahead of No. 63 Miami of Florida would be eliminated. So, clearly, they're not beholden to flawed human perceptions and biases, either - you know, an acrobatic, game-winning 20-yard catch that earns a kid an impressive highlight and all-conference honors is just another 20-yard catch in the books. Stupid mortals!
Drawbacks: Puny human brains are telling the computers what factors to consider and how much to consider them to reach said conclusions. SMQ, as one who's tried to devise his own low-tech, purely stat or other number-based projections, didn't say "faux objective" for nothing: the formula for input itself has all kinds of built-in biases that can be rigged (intentionally, for you conspiracy theorists, but more likely unintentionally) to favor certain types of teams. It doesn't matter what the formula is - unless, that is, it's something exceedingly simple like pure winning percentage, in which case it can't account for the all-important strength of schedule variances. Strength of schedule itself is the biggest stick in the craw here, because it skews the relevance of every other possible number, and is the most difficult element to measure by numbers alone; many computer rankings, like Jeff Sagarin's, for instance, use "Record vs. Top 10" and "Record vs. Top 30," but this seems more than a little "Chicken or Egg?" If the rankings haven't been generated yet, how can you tell who's in the top 10 or top 30? After those numbers are figured in, and the top 10 and top 30 change, do the inputs to those categories change again to reflect the difference? And do they change again after that? And again, ad infinitum? The "finish line" to such changes is subjective. There's also the huge problem of grouping at the margins (No. 11 is grouped with No. 29 rather than No. 10, for example), which brings us back to the arbitrary nature of such decisions.
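The "chicken or egg" regress has a standard numerical answer: start from a guess, recompute, and stop when the numbers stop moving. The sketch below is not Sagarin's formula, which the post does not describe. It is a made-up rating in which half of a team's value is its own winning percentage and half is the average rating of its opponents. The `tol` argument plays the role of the "finish line," and the records and schedules are invented.

```python
def iterate_ratings(win_pct, opponents, tol=1e-9, max_rounds=1000):
    """Repeat the schedule-strength update until no rating moves by more than tol."""
    ratings = dict(win_pct)  # first guess: plain winning percentage
    for rounds in range(1, max_rounds + 1):
        updated = {}
        for team, pct in win_pct.items():
            # Average of the opponents' current ratings (assumed weighting: 50/50).
            sched = sum(ratings[o] for o in opponents[team]) / len(opponents[team])
            updated[team] = 0.5 * pct + 0.5 * sched
        change = max(abs(updated[t] - ratings[t]) for t in ratings)
        ratings = updated
        if change < tol:
            return ratings, rounds
    return ratings, max_rounds

# Invented three-team round robin: A is 2-0, B is 1-1, C is 0-2.
win_pct = {"A": 1.0, "B": 0.5, "C": 0.0}
opponents = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
ratings, rounds = iterate_ratings(win_pct, opponents)
print(sorted(ratings, key=ratings.get, reverse=True), rounds)
```

In this toy case the ratings settle near 0.7, 0.5 and 0.3, so the loop does end. Where it ends still depends on the chosen tolerance and the 50/50 weighting, both of which a human picked.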
This will probably be elaborated on later - SMQ is intrigued by the notion of constructing four polls, one based on each method (or more, if there are more valid methods), and coming up with a final ballot based on an average of each one. He's not going to do that halfway through the season (he doesn't spend nearly enough time with the one method he uses now), but it's an interesting thought.
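A rough sketch of what that averaged ballot could look like, assuming each method produces a full ordering of the same teams. The four ballots below are invented placeholders, and breaking ties alphabetically is an arbitrary choice, not something from the post.

```python
def composite_ballot(ballots):
    """Order teams by their average position across several ballots (1 = top)."""
    teams = ballots[0]
    average = {
        team: sum(ballot.index(team) + 1 for ballot in ballots) / len(ballots)
        for team in teams
    }
    # Sort by average position; ties fall back to alphabetical order.
    return sorted(teams, key=lambda team: (average[team], team))

# Placeholder ballots, one per method; the orderings are made up.
power = ["Team A", "Team B", "Team C", "Team D"]
resume = ["Team B", "Team A", "Team D", "Team C"]
futures = ["Team C", "Team A", "Team D", "Team B"]
statistical = ["Team A", "Team B", "Team D", "Team C"]

final = composite_ballot([power, resume, futures, statistical])
print(final)
```

Note that Team C tops one ballot and is last on two others, yet lands third overall: averaging smooths out exactly the kind of disagreement that makes the four methods interesting.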
Comments
Interesting
Take, as a related example, baseball statistical forecasting. Set aside, for the time being, the fact that the sport is far, far more suited to statistical analysis. When the baseball forecaster tries to predict the batting average or ERA of a given player, he starts with a set of inputs, then tweaks the input data over and over - often trying things that he's biased to think should count more - until the predictive accuracy reaches an acceptable level.
The follow-up question, then, is how much predictive accuracy should count. Is the most desirable poll one that is best at predicting winners? Even at the expense of resume? Predictive accuracy should matter, right? But so does resume.
Your idea of a combo-poll, incorporating elements of each, is interesting.
Good stuff all around, as always.
by PB @ BON on
Oct 17, 2006 1:48 PM EDT
Not baseball!
And in both baseball and other sports, the "tweaks" to a rating system are not usually due to any bias on the part of the analyst. What you do is find a set of factors that has the highest r-squared value when they correlate predicted to actual winning percentage.
Of the three main college sports, football is actually the most predictable. But that's a topic for another day.
by JPK on
Oct 17, 2006 2:01 PM EDT
You misunderstood me
by PB @ BON on
Oct 17, 2006 2:30 PM EDT
Rephrase
In baseball, the predictive algorithms are built the same way that a guy like Jeff Sagarin builds his predictive algorithm. A human tweaks the inputs until he is satisfied with the predictive accuracy of the formula, which he uses to rank teams.
The question, then, is what value should a voter place in predictive accuracy of his rankings? It's not black and white by any stretch.
by PB @ BON on
Oct 17, 2006 2:34 PM EDT
No they don't
Human analysts (at least those who know what they're doing) do NOT "tweak the outputs" of their rating systems. They hypothesize a set of relevant inputs, measure their correlation to expected winning percentage, and if it's not acceptable they try a different set of inputs and/or a different combination of them.
Once they have something that is acceptable, it's permanent. If I change one of my algorithms, it's a new algorithm. But it would be waaaay too much work to take the output and manually adjust it to some subjective ranking.
by JPK on
Oct 17, 2006 2:53 PM EDT
Fair
The question, yet to be addressed, is how we should value predictive accuracy. Let's say, JPK, that you create an algorithm that predicts the winners of college football games with 95% accuracy each week after you input each of the previous weeks' data. That's awfully good, but let's say that it's got Florida at #34, Tennessee at #20, and California at #2. That doesn't sit well with most of us. Why? Most of us value things like (see above) resume, and so forth, too.
So, then, what? What's the right balance? Is there a compromise possibility? Are they mutually exclusive? These are questions worth asking.
by PB @ BON on
Oct 17, 2006 5:04 PM EDT
Now you're getting there
My own "neural-network based" ranking isn't useful at all to order teams by what anyone (even other computer rankings) would consider a "#n is better than #(n+1)" list, but that isn't what it was designed to do. If you understand it (and I'm not sure even I really do) though, and do some manipulations based upon it, it does very well at normalizing scoring offense and scoring defense, which is its purpose.
See my comment on a consistency-checker, though. It turns out that if a rating is self-consistent with respect to my PA() algorithm, then no matter how flaky the ranking looks, its PA() rank will be closer to any other self-consistent system's PA() rank, and if you take enough systems that have different weights for the inputs and use their PA() ranks you can come pretty close to that 95 percent number.
Actually, those of us who have studied multiple rating systems over multiple sports have independently come to the conclusion that 95 percent is impossible. Somewhere between one of seven and one of five games are going to have a result that NO rating system could've predicted.
Repeat all together now "that's why they play the games" (those of you who instead chanted "on any given day" get credit too...)
by JPK on
Oct 17, 2006 5:36 PM EDT
'That's why they play the games'
by SMQ on
Oct 17, 2006 6:40 PM EDT
The one question I'm still struggling with
If I understand you correctly, SMQ, you still treat it as a really bad loss, and Cal gets zapped. Is that right?
I guess I'd agree that you do get the benefit of consistency, and there's something to be said for that principle alone. Yet it may not feel right.
Just an odd, difficult concept. I've enjoyed your and JPK's thoughts on this.
by PB @ BON on
Oct 17, 2006 7:20 PM EDT
Ditto...
The problem was that this seemed less like actual experimenting and more like tweaking the system to fit what I already thought to begin with. Shouldn't the system be worked out intricately enough and with enough foresight to begin with that the inputs validate the results, rather than vice versa?
by SMQ on
Oct 17, 2006 3:17 PM EDT
Preeee-cisely
probability higher-rated team wins = (ISR difference) * .02 + .5
He measured the correlation of that to the actual winning percentage and found an r-squared value of .9.
So he never actually had to look at the ranking order at all. Later, after many years' worth of data, he found the correlation could be improved by both changing the algorithm to include a HFA factor AND adjusting the formula to include it.
That's not "tweaking", it's measuring and improving.
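For anyone who wants to see the formula above in action, here is a minimal sketch. The slope and intercept come straight from the formula; the clamp to the 0-to-1 range is an added assumption, since the comment does not say what happens at very large rating gaps. The `r_squared` helper is the ordinary squared correlation coefficient, and the "observed" percentages fed to it are invented, so the printed value says nothing about real ISR performance.

```python
def win_probability(isr_difference):
    """Linear estimate from the comment: 0.02 per rating point, starting from 0.5."""
    p = isr_difference * 0.02 + 0.5
    return max(0.0, min(1.0, p))  # clamp is an added assumption, not in the formula

def r_squared(predicted, actual):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov * cov / (var_p * var_a)

# Invented rating gaps and invented observed winning percentages for each gap.
gaps = [0, 5, 10, 15, 20]
predicted = [win_probability(g) for g in gaps]
observed = [0.52, 0.58, 0.71, 0.78, 0.93]
print(predicted)
print(round(r_squared(predicted, observed), 3))
```

Nothing in this check looks at where any team is ranked; it only compares predicted winning percentages with the ones that actually happened.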
by JPK on
Oct 17, 2006 4:10 PM EDT
Same question applies
The question, though, is not a mathematical one. We're all on the same page with regards to how these algorithms are created, how the input is tweaked, and how the new outputs are measured. The critical question I'm interested in is a philosophical one.
Folks like Jeff Sagarin, and presumably yourself - given your statistical acumen - may be satisfied with a set of rankings based on the most accurate algorithm created. Others, though, may say, "Wait a second. The computer doesn't notice that all five of Team A's losses were, subjectively, impressive. I want to credit that somehow." Or whatever subjective evaluation we may want to bring to the table.
And I want to probe SMQ on that philosophical question. What's the balance? Should there be one? Is there a compromise? Is the whole matter necessarily imperfect?
by PB @ BON on
Oct 17, 2006 5:26 PM EDT
More fundamentally
Is it to rank the teams according to which would win on a neutral field?
Or, is it this hazy, hard-to-define mishmash of subjective evaluation, resume-rewarding, and objective data analysis?
Or something else?
It's a bitch of a question.
by PB @ BON on
Oct 17, 2006 5:29 PM EDT
Fundamental answer
The purpose of a computer rating system is to provide a simpler view of a very complex set of data that I can use as one input in forming my judgement of comparative quality of a team.
You have to know how to use 'em, though.
For instance, Sagarin's "predictor" is designed so that the difference between two teams' ratings should be the same as the score differential were the teams to play. The algorithm minimizes the error in the rating as applied to historical games. "Minimizes" cannot be the same as "eliminates."
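To show what "minimizes the error" means, here is a toy least-squares fit. It is not Sagarin's actual algorithm, which the comment does not spell out; the games and margins are invented, and plain gradient descent is just one simple way to do the minimizing. Each rating is nudged until the rating gaps match the historical margins as closely as possible.

```python
def fit_ratings(games, steps=2000, learning_rate=0.1):
    """Fit ratings so that (winner rating - loser rating) approximates each margin."""
    teams = {team for winner, loser, _ in games for team in (winner, loser)}
    ratings = {team: 0.0 for team in teams}
    for _ in range(steps):
        gradient = {team: 0.0 for team in teams}
        for winner, loser, margin in games:
            # How far the current rating gap is from the actual margin.
            error = ratings[winner] - ratings[loser] - margin
            gradient[winner] += 2 * error
            gradient[loser] -= 2 * error
        for team in teams:
            ratings[team] -= learning_rate * gradient[team]
    return ratings

# Invented results: (winner, loser, margin of victory).
games = [("A", "B", 7), ("B", "C", 3), ("A", "C", 14)]
ratings = fit_ratings(games)
for winner, loser, margin in games:
    predicted = ratings[winner] - ratings[loser]
    print(winner, loser, margin, round(predicted, 2))
```

Because the three margins here cannot all be matched at once (7 + 3 is not 14), some error is left over in every game, which is the "minimizes, not eliminates" point.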
We need tools like this because there's no way for a human to do similar summaries. As of now, just looking at 1-A vs 1-A, there've been 350 games, and to compare each team to every other team (7021 comparisons) you have to process 532,496 connections. I couldn't even count those without my laptop.
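The 7021 figure matches the number of distinct pairs that can be formed from 119 teams. The 119 is inferred from that count rather than stated in the comment, so treat it as an assumption. A quick check:

```python
from math import comb

teams = 119  # inferred: the team count that yields 7021 pairs
pairs = comb(teams, 2)  # same as teams * (teams - 1) // 2
print(pairs)  # 7021
```

The 532,496 connections figure depends on how paths through the schedule are counted, which the comment doesn't define, so it is not reproduced here.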
Where human judgement comes in is that my program can summarize some aspect of each and every one of those connections into a single number and present it to me. How I use that summary is up to me - and how I use it depends upon me understanding what it is about all those paths that is being summarized. It's a different list if that's "who won or lost" vs "who won and what the MOV was" vs "who won, what the MOV was, and where the game was played" vs "who won, what the MOV was, where the game was, and what the passing efficiency of the two teams in the game was", vs...
well, you get the picture.
The trick is knowing what is important to which rating system. Once you do, then that ranking becomes just input to the unfathomable and un-reproducible analog process the human brain goes through in forming its own opinion.
The human error is to assume that the ordinal rank of teams by that metric means anything. The first step on the path to enlightenment is to accept the notion that there's no ordered ranking of teams such that every team ranked higher is "better" than all teams ranked below it. There are just many possible orderings, each of which tells us something about the team as it is viewed by a certain rating system.
by JPK on
Oct 17, 2006 6:05 PM EDT
Suggestion for a consistency check
The answer, for anyone who cares, is in the duplication of games in the OWP and OOWP as the RPI defines those.
As a part of that study I developed an analytical tool that I called the "Performance Against" algorithm.
This system evidently doesn't support all the html tags I used in its definition, so see my original article and ignore references to earlier articles.
The relevant part is
..the basic "performance against" algorithm can be used as a tool to analyze any method, using the value (not the ranking) from any other method as a replacement for OWP.
So, what you can do is compare rankings from a rating system to rankings by the PA() mapping.
The PA() can be applied to any rating system whose values are all positive. For systems such as Massey's where the team can be assigned a negative value, the ratings must be mapped into an interval that maintains the order but for which all values are ≥ 0.
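One simple order-preserving mapping is to subtract the lowest rating from every team, so the lowest team lands at exactly 0 and everyone else stays above it. This is only a sketch of one possible mapping, not necessarily the one used in the original article, and the sample ratings are invented rather than real Massey values.

```python
def shift_to_non_negative(ratings):
    """Subtract the minimum rating so every value is >= 0 and the order is unchanged."""
    lowest = min(ratings.values())
    return {team: value - lowest for team, value in ratings.items()}

# Invented ratings that include negative values.
raw = {"Team A": 1.75, "Team B": 0.25, "Team C": -0.5, "Team D": -1.25}
shifted = shift_to_non_negative(raw)
print(shifted)

# The ordering of teams is the same before and after the shift.
assert sorted(raw, key=raw.get) == sorted(shifted, key=shifted.get)
```

A shift keeps the gaps between teams intact. Whether PA() also cares about the ratios between values, which a shift does change, is not something the excerpt above settles.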
by JPK on
Oct 17, 2006 4:48 PM EDT


