Championship Chessmetrics Analysis by Jeff Sonas
Jeff Sonas of http://www.chessmetrics.com/ writes.
(Version 2)
INTRODUCTION
We are cursed to live in "interesting times" in the chess
world. We have two different organizations sponsoring their own versions of the
World Championship, and the top-rated player in the world wants no part of
either championship. Countless proposals and unification plans have been
suggested, and rejected, and there is no end in sight.
I am a relatively weak chess player, but a very strong
computer programmer and statistician. Most of all, I am a big fan of chess, and
I want to help. I have little to contribute in the arenas of business plans,
organizational details, or negotiations, but nevertheless I do have something
quite useful to offer. I have developed some very sophisticated statistical
tools that enable me to objectively explore various chess topics, and in recent
weeks I have devoted considerable time to analyzing thousands of different
world championship formats. I would like to share the results of that analysis.
I am not affiliated with any chess organization, and I have
no particular agenda to promote. What I do have, instead, is the distinct
impression that people are making important decisions about the world
championship, without adequate information. One possible explanation is that
the decision-makers are simply unaware that it is often possible to use
statistics to draw reasonably sound conclusions about some of these topics. Or
maybe they don't even care about objective truth, and simply want to promote
their own agendas or improve their own situations. I'm going to adopt the role
of the optimist here, and assume that many people would love to have an
objective analysis of the various options available for the world chess
championship, but that nobody ever thought to ask for one. Well, here's your
analysis...
Based on my calculations, I can now tell you whether one
world championship format is "objectively better" than another one, and I can
explain why. If you describe a typical world championship format to me, I can
tell you, with reasonably good accuracy, the average percentage chance of the
strongest player in the world winning the championship cycle. I call that
percentage the "effectiveness" of a world championship format.
For instance, it turns out that the 128-player FIDE World
Championship has an "effectiveness" of 38%, which means that 38% of the time,
it will be won by the strongest player in the world (assuming no boycotts). In
other words, five times out of eight the strongest player will fail to win the
tournament. The Einstein Group's world championship cycle (which will debut in
July in Dortmund, Germany) has a much better effectiveness of 50%, which still
means that the best player will be champion only half of the time. By
comparison, a slightly modified version of Yasser Seirawan's "Fresh Start"
proposal is extremely effective, at 67%. In fact, none of the 13,000 formats
under consideration managed to break the 70% barrier, so Yasser's proposal is
almost maximally effective.
Through statistical analysis combined with random
simulation, I have analyzed 13,000 different world championship formats in
great detail, including Swiss tournaments, knockout tournaments, long matches,
short matches, round-robin tournaments of various types, qualifier tournaments,
and much more. I have tried to include all of the formats which have been used
historically or which are currently under consideration, as well as many
experimental formats. Out of those 13,000 formats, the FIDE World Championship
format is ranked #12,671, which means that it is in the bottom 5% in
effectiveness. Although the Einstein Group format is clearly better, a 50%
effectiveness is still not very good: it ranks #10,945 on my list. The modified
Seirawan proposal, by comparison, is way up at #345.
After that introduction, you might be chomping at the bit
to learn what format is #1 on my list. However, I'm not going to tell you just
yet, because "effectiveness" is not the only important factor. Without giving
those other factors their due consideration, it doesn't make sense to talk yet
about what is "best" or even "objectively best".
THE FOUR IDEAL CHARACTERISTICS OF A WORLD CHAMPIONSHIP
In evaluating various world championship formats, I believe
there are four important characteristics to consider. I want to introduce a
little bit of terminology here, in an attempt to make all this easier to talk
about. An ideal world championship format would be "practical", "effective",
"inclusive", and "unbiased". Let me briefly cover what I mean with each of
those four words.
(1) "Practical" - The top players must be willing to
participate, the sponsors must be willing to sponsor the tournaments and/or
matches, and the playing sites must be available. Thus, World Championship
formats that include relatively shorter events, or just one event, would be
more "practical" than multi-stage formats or formats with very long matches or
tournaments. And, of course, World Championship formats with greater prize
money will also be more attractive to the players, although there are other
important considerations for most players.
(2) "Effective" - The overall purpose of the World
Championship is to allow the strongest player (whoever that may be) to
demonstrate their superiority by winning the championship. For instance, World
Championship formats with inadequate length or inefficient structure will
frequently be won by weaker players, whereas more effective formats would
provide that strongest player sufficient maneuvering space (even if they lose a
game or two) to demonstrate their superiority by winning the championship.
(3) "Inclusive" - It is easy to tell which players
have been the most successful in the recent past; just consult the rating list.
However, ratings are known to be somewhat inaccurate as measures of players'
actual strength, and it is quite conceivable that the strongest player is not
actually the highest-rated player. Thus it is typically a good idea to include
several players in the World Championship cycle, to give more people an
opportunity to demonstrate their ability. However, the tricky part is that many
super-inclusive formats, such as the FIDE championships, are extremely
ineffective at determining the strongest player. Nevertheless, it is still
possible (though challenging) to be both "inclusive" and "effective"
simultaneously.
(4) "Unbiased" - Traditionally, specific players in
the world championship cycle have been given certain advantages, due to their
past accomplishments. For instance, the defending champion might be seeded
directly into the final match, or a recent semifinalist might automatically
qualify as a Candidate without needing to play in an Interzonal. Other
advantages have included "draw odds", and the champion's right to an automatic
rematch, and first-round byes for high-rated players (as in the earlier
100-player FIDE knockout tournaments). These "biases" are often perceived as
being unfair to everyone else, and should be avoided when possible. However, a
"bias" is not inherently bad; it is simply an advantage granted to a particular
player. It can be one way to make an event more "effective" without having to
make it impractically long.
In all fairness to the FIDE and Einstein Group approaches,
they do have their important advantages. The FIDE approach is extremely
inclusive and unbiased, and reasonably practical (as long as there is
sufficient funding for such an event). The Einstein Group's format is not
particularly inclusive, though it has the large practical advantage that it
bears some resemblance to the traditional way the championship has been run,
and thus its winner might indeed be more accepted by the public, as a
legitimate champion, than the FIDE champion often has been.
THE FIDE CHAMPIONSHIPS
The FIDE championship format takes a mere 22 days of play
to reduce 128 competitors down to one champion. It is very inclusive, and has
no biases in favor of any specific participant. For comparison, I identified 72
other formats that are 22 days or shorter, and also have no biases. Out of
these possibilities, the FIDE format (38% effectiveness) is right in the
middle, ranked 37th out of 73. Most options are in the 30%-40% range, and only
one format managed to finish above 50%. If FIDE were to invite just the eight
top-rated players to its knockout tournament, with two rounds of 6-game matches
and then a 10-game final (22 playing days), it would have a 52% chance to be
won by the strongest player in the world, slightly better than the Einstein
Group approach. Another good unbiased and practical option would be to have
four simultaneous single-round-robin tournaments with 10 players each (9
playing days), with the four winners advancing to two rounds of knockout
matches (4-game semifinal matches and then an 8-game final match). That
approach would be significantly more inclusive and only slightly less effective
(46% effectiveness) than the eight-player knockout.
When there are no biases introduced (i.e., nobody gets
automatically seeded into any later stage, and everyone is treated equally), a
knockout event seems to be far better than a Swiss. For instance, taking the
top two or four finishers from a 13-round Swiss tournament and then playing
short matches between those top finishers turns out to be very ineffective,
with an effectiveness often lower than 20%. However, as you will see in a little while,
a format based on a Swiss qualifier can actually be considerably more effective
than a comparable format with a knockout qualifier. This discovery greatly
surprised me, and I will go into more detail further down, when I discuss the
Fresh Start proposal. However, first let's finish talking about the FIDE and
Einstein Group approaches.
The major criticism of the FIDE championship, of course, is
that the individual matches are too short. A single loss can mean almost
certain elimination. Everyone loses a game now and then, so it seems an overly
drastic punishment to be eliminated because you happened to have a minus score
over the span of two games. The 2002 tournament made a half-hearted attempt to
address this by lengthening the final match from 6 games to 8 games. As I
mentioned before the tournament, that is hardly much of an improvement (it
raised the effectiveness by 0.2%). It would have been better (39%
effectiveness) to use those extra two days to make the quarterfinal round 4
games long, instead, although of course even better would be a 4-game
quarterfinal AND an 8-game final (41% effectiveness).
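The cost of short matches can be illustrated with a small exact calculation. The sketch below is not the simulation behind the figures in this article; it assumes independent games with fixed, purely hypothetical win and draw probabilities for a slightly stronger player, and the function name is invented for the illustration.

```python
def match_outcome_probs(n_games, p_win, p_draw):
    """Return (P(A wins match), P(match is tied), P(A loses match)).

    Assumption: every game is independent with the same probabilities,
    ignoring colour, fatigue, and match-strategy effects.
    """
    p_loss = 1.0 - p_win - p_draw
    # dist[lead] = probability that A's wins minus A's losses equals lead
    dist = {0: 1.0}
    for _ in range(n_games):
        nxt = {}
        for lead, prob in dist.items():
            for step, p in ((1, p_win), (0, p_draw), (-1, p_loss)):
                nxt[lead + step] = nxt.get(lead + step, 0.0) + prob * p
        dist = nxt
    win = sum(p for lead, p in dist.items() if lead > 0)
    tie = dist.get(0, 0.0)
    loss = sum(p for lead, p in dist.items() if lead < 0)
    return win, tie, loss


# Hypothetical per-game numbers: 25% wins, 55% draws, 20% losses.
for n in (2, 4, 8, 20):
    win, tie, loss = match_outcome_probs(n, p_win=0.25, p_draw=0.55)
    print(f"{n:2d} games: win {win:.3f}  tie {tie:.3f}  lose {loss:.3f}")
```

Under these assumed numbers, the stronger player's margin over the weaker one (win probability minus loss probability) grows as the match gets longer, which is the effect the lengthened rounds are meant to capture.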
Another obvious option would be to change all of the 2-game
matches into 4-game matches. Of course, this would have the unfortunate result
of adding at least 10 days to the length of the event if it stayed a 128-player
tournament. To compensate, the number of players could be reduced from 128 down
to 64. Thus with 4-game matches throughout, leading to an 8-game final (32
playing days), the effectiveness would rise to 42%.
Unsurprisingly, the knockout tournament would become more
and more effective, as we make it less and less inclusive and lengthen various
rounds. If we were to halve the number of players again, a reasonably inclusive
knockout tournament (32 players) could still be held, with four-game matches
throughout, leaving room for either an 8-game final match (43% effectiveness)
or a longer 14-game match (44% effectiveness). With sixteen players, the
effectiveness could be improved to 46% by 4-game matches and a 14-game final.
Finally, as I already mentioned, the most effective unbiased tournament would
be an eight-player knockout tournament with six-game quarterfinal and semifinal
matches and a ten-game final, for an overall effectiveness of 52%.
THE EINSTEIN GROUP CHAMPIONSHIPS
Now let us turn to the Einstein Group championship format.
This is an amazing attempt to compress an entire Candidates Cycle and World
Championship match into a mere 30 days of play. The format has come under
severe criticism because the round-robin preliminaries and the subsequent two
rounds of four-game matches are perilously short. In its current state, the
only significant bias involved is that the defending champion gets to play in
the final. So, I considered all of my formats lasting 30 or fewer playing days,
with the single bias that the champion is seeded into the final automatically
(assuming rapid tiebreaks throughout). There were 208 different formats, and
the Einstein Group approach (50% effectiveness) ranked 148th, placing it in the
bottom third.
The most effective approach (62% effectiveness), within
these constraints, would be to only invite the four top-rated players (other
than the champion). They would then play two rounds of six-game knockout
matches to get from four players down to one, and the winner would play the
defending champion in an 18-game match. Even just a 14-game match would still
have a 61% effectiveness, better than any other approach (given the
constraints). If it were necessary to include eight candidates, plus the
champion (as is the case in Dortmund), the 30 days would be better spent in
three rounds of 4-game knockout matches, followed by an 18-game match against
the defending champion (60% effectiveness).
If it were desirable to be even more inclusive (for
instance so that a "wildcard" local participant like Christopher Lutz could be
chosen, without impacting the odds too significantly), you could have two
simultaneous 10-player single-round-robins, where the two winners play each
other in a 4-game match, and the winner plays the defending champion in a
16-game match (56% effectiveness). Or you could even go the super-inclusive
route, with a 196-player 13-round Swiss like Yasser Seirawan suggests. The two
top finishers could play each other in a 4-game match, and the winner
challenges the defending champion in an 8-game match. That would only last 25
days, and would still have an effectiveness of 55%. All of these options are
significantly more effective than the actual format chosen by the Einstein
Group, while still lasting no more than 30 playing days.
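The 30-day budget for the formats above can be checked with simple arithmetic. The sketch below assumes one game per playing day and no separate tiebreak or rest days (an assumption, though it reproduces the 25 days quoted for the Swiss option); the helper name and the format labels are invented for the illustration.

```python
def playing_days(stages):
    """Total playing days for a cycle, given the length in days of each stage."""
    return sum(stages)


# Stage lengths as described in the text. A 10-player single round-robin
# takes 9 rounds; a 4-player double round-robin takes 6 rounds.
formats = {
    "Dortmund: 4-player double RRs, two 4-game rounds, 16-game final": [6, 4, 4, 16],
    "4 candidates: two 6-game rounds, 18-game final": [6, 6, 18],
    "8 candidates: three 4-game rounds, 18-game final": [4, 4, 4, 18],
    "two 10-player RRs, 4-game match, 16-game final": [9, 4, 16],
    "13-round Swiss, 4-game match, 8-game final": [13, 4, 8],
}

for name, stages in formats.items():
    print(f"{playing_days(stages):2d} days  {name}")
```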
Of course, none of those options resemble the format that
will actually happen in Dortmund. Are there less significant changes that would
still greatly improve the effectiveness? Absolutely. For instance, the pair of
4-game knockout matches is hazardous. Even in a four-game match, it is very
difficult to recover from a loss. How about getting rid of one of those
matches? Instead of picking the top two players from each preliminary
round-robin, you could just pick the top finisher from each round-robin. Then a
single four-game match between those two winners, followed by the same 16-game
final against the defending champion, would make the event four days shorter,
and it would raise the effectiveness from 50% to 56%. It would be even better
(60% effectiveness) to make use of the whole 30 days by playing matches of 10
and then 14 games (rather than 4 and then 16).
Finally, there was a way to be even more effective, within
the 30-day constraint, although it did involve introducing another bias into
the world championship cycle. It always helps the effectiveness of a format if
you allow the highest-rated player to automatically bypass the qualifier event.
For instance, you could have the highest-rated player compete in a 10-game
match against the winner of a 4-player double-round-robin, and the winner would
challenge the defending champion in a 14-game match. That format would have a 64%
effectiveness, and it seems likely that Garry Kasparov would have been more
amenable to that option, although of course I have no idea what went on with
the negotiations. I should point out that all of these numbers assume that
nobody declines an invitation. With none of Kasparov, Viswanathan Anand, and
Ruslan Ponomariov participating, the effectiveness of the Einstein Group
approach, in this particular cycle, will of course be way lower than 50%. It
will probably be more like 20% or 25%, since it is reasonably likely that the
best player in the world is either Kasparov, Anand, or Ponomariov, and there is
less than a 50-50 chance that the best player in the world is actually
participating in the championship cycle at all.
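The 20% to 25% estimate follows from multiplying two probabilities: the chance that the strongest player takes part at all, and the chance that the format then lets that player win. The sketch below assumes the second figure stays at 50% even with a depleted field, which is a simplification, and the participation probabilities are illustrative values just under 50-50.

```python
def adjusted_effectiveness(p_strongest_participates, effectiveness_if_present):
    """Chance that the strongest player wins a cycle they might not even enter."""
    return p_strongest_participates * effectiveness_if_present


# Illustrative participation probabilities, not measured values.
for p in (0.40, 0.45, 0.50):
    print(f"P(participates) = {p:.2f} -> {adjusted_effectiveness(p, 0.50):.3f}")
```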
WHERE THESE NUMBERS COME FROM
I don't expect you to blindly accept all of these numbers.
If you're still paying attention by this point, you might be wondering whether
I'm just making up the numbers to serve my own purposes, or if I actually
calculated them somehow. I don't want to bog you down with all of the gory
details, but here is a brief summary of what I did.
I didn't want my conclusions to be skewed by any special
characteristics of the current rating list, such as an unusually large gap
between #2 and #3, or between #3 and #4. So I decided that my calculations
would be based upon a "representative rating list", rather than an actual one.
I did some analysis of rating list trends over the past few decades, and came
up with a way to randomly simulate millions of "typical" rating lists. Thus
sometimes there is a huge gap between #1 and #2, and sometimes it's very
crowded at the top, with no clear leader. Sometimes the champion isn't even the
top-rated player.
However, it is also important to acknowledge that ratings
are inaccurate. They are merely estimates of players' true strengths, and those
estimates have errors associated with them (a standard deviation of about 50,
if you're interested). Somebody might have a rating of 2700, but their true
strength could easily be 2580 or 2780. So, for each random rating list, I had
to simulate a "true strength" for each player. The one player with the highest
"true strength" is that elusive "strongest player in the world", whom we are
trying to identify through the use of an effective world championship format.
Thus sometimes the "strongest player in the world" might not be the world
champion or the top-rated player; they might even be rated #8 or #10 or #20 in
the world, though it's unlikely. That is why it is important to be inclusive
with your world championship cycle; if you just use the top two or three
players, you might easily leave out the strongest player.
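The "true strength" step can be sketched in a few lines. This is an illustration only: it assumes normally distributed rating errors with a standard deviation of 50, uses a made-up rating list, and the function names are invented here rather than taken from the actual analysis.

```python
import random


def simulate_true_strengths(ratings, sd=50.0, rng=random):
    """Draw one hypothetical true strength per player around their rating."""
    return [rng.gauss(r, sd) for r in ratings]


def top_rated_is_strongest_rate(ratings, trials=10_000, sd=50.0, seed=0):
    """Fraction of simulated lists where the top-rated player is truly strongest."""
    rng = random.Random(seed)
    top_rated = max(range(len(ratings)), key=lambda i: ratings[i])
    hits = 0
    for _ in range(trials):
        strengths = simulate_true_strengths(ratings, sd, rng)
        strongest = max(range(len(strengths)), key=lambda i: strengths[i])
        hits += strongest == top_rated
    return hits / trials


# Hypothetical rating list, not taken from any real list.
ratings = [2800, 2770, 2750, 2740, 2730, 2720, 2710, 2700, 2690, 2680]
print(top_rated_is_strongest_rate(ratings))
```

With a crowded list like this one, the top-rated player is the strongest in only a fraction of the simulated lists, which is the argument for inclusiveness made above.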
Armed with the ratings and true strengths of everyone on a
simulated rating list, I could then proceed to simulate a world championship
cycle. I tried various types of qualifier formats, different numbers of
simultaneous qualifying tournaments, allowing the top-rated one or two players
to bypass the qualifier, different ways of resolving tied matches, and/or
allowing the champion to enter the cycle at various stages. The breakthrough
was my realization that all popular world championship formats could in fact be
expressed as an "Interzonal" qualifier followed by a series of knockout
matches. This allowed me to tackle the problem systematically, rather than just
trying a few options which I thought might be "ideal". For instance, the FIDE
championships were treated as eight different qualifier tournaments (each of
which was a 16-player knockout event won by a single player) and then a series
of knockout matches among the final eight players. The Einstein Group
championships were treated as two simultaneous qualifier tournaments (each of
which was a 4-player double-round-robin tournament that qualified two players),
and then there were three rounds of knockout matches, with the champion
entering the cycle in the third and final round of knockouts. And so on. For
each simulated championship cycle, I could see whether the "strongest player"
actually won, and over an average of many thousands of iterations for each
format, that would tell me the "effectiveness" of each world championship
format.
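The overall procedure can be sketched as a small Monte Carlo loop. This is a heavily simplified stand-in, not the program behind the published figures. It assumes a fixed hypothetical rating list instead of randomly generated ones, the standard Elo logistic formula for game outcomes, no draws or colour advantage, and a single extra game as the tiebreak. All function names are invented for the illustration.

```python
import random


def expected_score(strength_a, strength_b):
    """Standard Elo logistic expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((strength_b - strength_a) / 400.0))


def play_match(a, b, strengths, n_games, rng):
    """Return the index of the match winner (decisive games only, by assumption)."""
    p = expected_score(strengths[a], strengths[b])
    wins_a = sum(rng.random() < p for _ in range(n_games))
    wins_b = n_games - wins_a
    if wins_a == wins_b:
        # One extra game stands in for a rapid tiebreak.
        return a if rng.random() < p else b
    return a if wins_a > wins_b else b


def run_knockout(seeds, strengths, match_lengths, rng):
    """Seeded knockout: first plays last, second plays second-last, and so on."""
    players = list(seeds)
    for n_games in match_lengths:
        half = len(players) // 2
        players = [
            play_match(players[i], players[-1 - i], strengths, n_games, rng)
            for i in range(half)
        ]
    return players[0]


def effectiveness(ratings, match_lengths, trials=5000, sd=50.0, seed=0):
    """Fraction of simulated cycles won by the player with the highest true strength."""
    rng = random.Random(seed)
    seeds = sorted(range(len(ratings)), key=lambda i: -ratings[i])
    hits = 0
    for _ in range(trials):
        strengths = [rng.gauss(r, sd) for r in ratings]
        strongest = max(range(len(strengths)), key=lambda i: strengths[i])
        hits += run_knockout(seeds, strengths, match_lengths, rng) == strongest
    return hits / trials


ratings = [2800, 2770, 2750, 2740, 2730, 2720, 2710, 2700]  # hypothetical
print(effectiveness(ratings, [6, 6, 10]))
```

Because of these simplifications, the number printed should not be expected to match the percentages quoted in this article; the point is only the structure of the calculation: simulate strengths, play out the cycle, and count how often the strongest player wins.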
YASSER SEIRAWAN'S "FRESH START" PROPOSAL
I have to admit that I expected my analysis to reveal a
searing criticism of Yasser Seirawan's "Fresh Start" proposal, with its Swiss
qualifier. Swiss tournaments are generally perceived to be very ineffective,
especially compared to knockout tournaments of comparable size. I expected that
I would have to conclude that "it's all well and nice to play three rounds of
long matches at the end of your world championship cycle, but what good is that
when the majority of Candidates were chosen in a lottery?"
I was even advised to save myself the effort of trying to
program Swiss tournaments in my simulations, since they were obviously so
ineffective. A very prominent arbiter told me, "You do not need that for your
simulation. It is perfectly obvious, if you want to obtain a winner who has the
highest rating prior to the event, then the current FIDE knockout system is
best." However, I really wanted to compare the FIDE and Einstein approaches
against Yasser's proposal (which is based upon a Swiss qualifier), so I
ultimately decided to include the Swiss qualifiers in my analysis.
Well, guess what? Out of the 13,000 world championship
formats I evaluated, number TWO on the list, with an effectiveness of 69.4%,
was the following structure: The world champion and the two highest-rated
players (other than the world champion) bypass the qualifier and automatically
become Candidates. They are joined by the top five finishers from a 196-player
13-round Swiss. Those eight players then play three rounds of knockout matches
(16-game quarterfinal, 20-game semifinal, and 20-game final).
Does that sound familiar? It's almost exactly what Yasser
Seirawan suggests for the next world championship cycle. He actually suggests a
10-game quarterfinal, a 14-game semifinal, and a 20-game final, and that
shorter format (67% effectiveness) shows up at #181 on my list (still in the
top 2% of formats). And there are details in his proposal about tiebreaks that
were not included in my overall analysis (though I do cover them further down);
I assumed rapid tiebreaks everywhere for the eight-player candidate cycles,
since otherwise the calculations would have taken months to run all the
possibilities! And Yasser doesn't actually say that it should be the two
highest-rated players who bypass the qualifier; he specifically names Garry
Kasparov and Ruslan Ponomariov as the two players.
The number one format on my list, with an effectiveness of
69.5%, was actually very similar to number two. In this scenario, only the top
finisher from that same Swiss tournament qualifies, to play the #1-rated player
in a 20-game match. The winner then plays the defending world champion in a
20-game match for the title. That is the single most effective world
championship that I could find, but unfortunately it includes two biases: the
world champion gets automatically seeded into the final round, AND the
top-rated player doesn't have to play in the Swiss. Yasser's proposal would be
somewhat less biased, as it is less of an advantage to be an "automatic
Candidate" when there are eight Candidates rather than two, and of course in
his proposal the defending world champion does not get automatically seeded
into the final match.
Since we're on the topic, I should point out that the #3
format on my list has actually been tried, sort of, in the world championship.
In 1959 Mikhail Tal won an eight-player quadruple-round-robin tournament in
Yugoslavia, allowing him to play a 24-game match against the defending
champion. In 1962 Tigran Petrosian won an identical format in Curacao. And that
same format is #3 on my list, with an effectiveness of 69.3%, although it says
that the winner of the round-robin should face the top-rated player rather than
the defending champion. Thus if the defending champion was not the top-rated
player, the champion would have to play in (and win) the round-robin tournament
for the opportunity to play a championship match against the top-rated player.
Also, it's not strictly like the 1959 and 1962 Candidates tournaments, because
back then the eight players came from Interzonals, whereas this format
recommends just taking the players from the top of the rating list. Presumably
the bias in favor of the top-rated player is too much to make this format
acceptable, although it is clearly very effective.
Of course, there is no real difference between 68.3% and
68.5%. The point is not so much that nine of the top twelve formats happened to
have Swiss qualifiers. The real dazzler is that a Swiss qualifier can with any
seriousness be called "optimal". Conventional wisdom tells us that knockout
tournaments are more effective than Swiss tournaments of comparable length. It
says that knockout tournaments work better, because the strongest players are
in control of their own destiny, and nobody can finish ahead of you unless you
are actually knocked out by someone. By contrast, in a Swiss you might do well
but someone else might happen to do even better.
Why is conventional wisdom wrong? Well, I have two possible
explanations. One has to do with information theory. In a multi-stage event
such as a knockout tournament, it only matters if you make it to the next
stage, whether that be from a 2-0 whitewash or a 3-3 standoff where somebody
advances from a sudden-death game. After each round, the slate is wiped clean
and all remaining players start with the same score. Obviously, that means
discarding a considerable amount of information about how players have been
performing. When the whole point is to identify the strongest player, it seems
unwise to discard so much information. By contrast, in a Swiss tournament, your
total score reflects the whole of your performance in the event. Of course,
this "additional information" has to be balanced against the fact that players
face different levels of opposition in a Swiss tournament, so a score of +2
might sometimes be more impressive than a score of +4. But there are obviously
ways to address that by optimizing the pairings and/or scoring method, though
that lies outside the scope of my analysis... for now.
To understand my other explanation, consider an alteration
to Yasser's proposal. Rather than a large Swiss which generates five
Candidates, you could instead have five different simultaneous 16-player
knockout tournaments (2-game matches throughout), where the winner of each
knockout tournament becomes a Candidate. That approach would be good (62%) but
not as good as the Swiss approach (67%). With the knockout approach, you are
basically splitting your field into five subgroups, and deciding to take the
single top-performing player from each subgroup. If the strongest player in the
world happens to be playing in the same subgroup as another player who is
almost as strong, then it becomes reasonably likely (in the knockout approach)
that the strongest player would lose a two-game match to the slightly weaker
player. You can't qualify both players and resolve their differences later in a
long match, since you are required to take exactly one player from each
subgroup (i.e., the one who wins each knockout tournament). The numbers (62%
vs. 67%) suggest that it would work much better to have all of the players
intermingled in one big tournament, so the five strongest performances can
advance, independent of who would have been in which subgroup.
However, the Swiss tournament is not some magical solution
that should be used anywhere; it is very easy to use it poorly. The Swiss only
works well if the highest-rated players bypass it and automatically become
Candidates. Thus the Swiss is best viewed as a super-inclusive way to sort
through the rabble and find the rare player who is extremely under-rated
(literally) and actually very strong. If we already know that a player is very
strong (the defending champion, or one of the two top-rated players in the
world), it is far better to allow them to bypass a Swiss where they might
potentially lose a couple of games and fail to qualify. For instance, if you
had everyone (including the defending champion) play in the Swiss, and picked
the top eight finishers as your candidates, then the effectiveness would only
be 17%. If you automatically qualified the defending champion, but the other
seven qualifiers had to come from the Swiss, the effectiveness would only be
53%, barely better than the Dortmund style. The most important thing is to
include at least the highest-rated player automatically, along with the
defending champion. If the two automatic qualifiers are the defending champion
and the (remaining) highest-rated player, the effectiveness jumps up to 64%.
And as we've seen already, if the second-rated player is also allowed to bypass
the qualifier, the effectiveness is a nearly-ideal 67%.
Another interesting question is whether the qualifier
tournament becomes more effective if you make it more inclusive. We have seen
earlier, in the discussion about the FIDE format, that a knockout loses
effectiveness significantly when you double the number of players. In the case
of a Swiss, however, the inclusion of extra players actually helps, rather than
hurts, the effectiveness. For instance, if you modify the Seirawan proposal to
only include 64 players, the effectiveness is 61%, but doubling the field of
players, for a total of 128, raises the effectiveness to 65%, and tripling the
field (to Yasser's suggested 196-player level) leads to the best effectiveness,
the 67% already mentioned. Presumably this is because the weaker players don't
get in the way as much in a Swiss, after the first round or two.
In a 128-player knockout, you have a large number of
players who clearly are not the strongest players in the tournament, but who
can have a huge impact on the outcome through the chance elimination of a top
seed. We almost saw the extreme example of that in Moscow, where a single loss
to the bottom seed just about resulted in the first-round elimination of #1
seed Viswanathan Anand. On the other hand, by having such an inclusive field in
the large Swiss, you give yourself the possibility of identifying an extremely
underrated player who actually deserves to play in the Candidate section.
If you're trying to get a feel for what level of player
would typically finish in the top five in the 196-player Swiss tournament, I
can tell you that an average set of five qualifiers would have ratings ranging
from 2600 to 2780. A very strong set of five qualifiers (which would happen one
time out of every ten) might be something like: Michael Adams, Alexei Shirov,
Peter Leko, Alexander Morozevich, and Judit Polgar. A much weaker set of five
qualifiers (which would also happen one time out of every ten) would be
like: Viswanathan Anand, Zoltan Almasi, Konstantin Sakaev, Giorgi Giorgadze,
and Xie Jun. On average, out of the five top Swiss finishers, there would be
two or three players rated above 2700, and two or three players rated below
2700. Once every 25 or 30 tournaments, all five qualifiers would be rated below
2700, and once every 40 or 45 tournaments, all five qualifiers would be rated
above 2700. About 45% of the time, at least one qualifier would be a sub-2600
player.
RAPID TIEBREAKS
One controversial issue is whether rapid games are a good
way to break ties. This only matters, of course, if a tie actually occurs, so
it is a more significant factor when there are short events (such as the FIDE
championships or the Dortmund qualifier), and it wouldn't matter as much in the
Seirawan proposal (though of course it still could happen). There is a general
perception that rapid and blitz games are more "random" than classical games.
This is undoubtedly true, since time trouble always introduces an element of
randomness into the outcome of a game. However, I recently analyzed the results
of several thousand games played at various time controls over the past few
years, and (statistically speaking) this issue doesn't seem to be a
particularly significant one. The higher-rated player still manages about the
expected percentage score, whether the game is played at classical, rapid, or
blitz controls. Here is a picture to illustrate what I am talking about.
In this graph, we see the well-known trend that as the
white player's rating advantage gets bigger and bigger, White tends to score a
higher and higher percentage. If the two players have the same rating, then
White scores 55%. If White has a rating advantage of 200 points, then White
would score almost 70%. The blue line represents this relationship at classical
time controls.
Now look at the red line, which represents rapid games. If
rapid time controls really did make the game a lot more random, then the
higher-rated player would tend to score closer to 50% than predicted, with
either color. That means we would see the red line being flatter, more
horizontal, than the blue line. This is true to a certain degree, especially on
the right side of the graph, in those scenarios where White has a large rating
advantage. This means that rapid games do indeed turn out more randomly when
White is the big favorite; White is not able to score as high a percentage as
the ratings would suggest. For instance, with a +300 rating point advantage,
White would score 75% in classical games but only 72% in rapid games. However,
when Black is the favorite by more than 100 rating points (the left side of the
graph), the rapid results are exactly the same as classical. Thus, when
outrated by 300 points, White scores an identical 33% whether it be classical
or rapid. So, the conclusion to be drawn is that the advantage of the white
pieces is not as large in rapid games as in classical games, especially when
White is the higher-rated player. But the higher-rated player should do just
about as well in rapid as in classical. Perhaps the real "randomness" comes
from the fact that rapid matches are typically only two games long, rather than
four or six.
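To make that comparison concrete, here is a small Python sketch that
uses only the four percentages quoted above (White's score at +300 and
at -300, classical and rapid). The function names are mine, chosen for
illustration; the arithmetic simply averages the favorite's score with
White and with Black.

```python
# White's percentage score, keyed by White's rating minus Black's rating.
# These four numbers are the ones quoted in the text; nothing else is assumed.
white_score = {
    "classical": {-300: 33, 300: 75},
    "rapid": {-300: 33, 300: 72},
}


def favorite_score(control, gap):
    """Average score of the higher-rated player over both colors."""
    as_white = white_score[control][gap]         # favorite has White
    as_black = 100 - white_score[control][-gap]  # favorite has Black
    return (as_white + as_black) / 2


def white_edge(control, gap):
    """How much better the favorite scores with White than with Black."""
    as_white = white_score[control][gap]
    as_black = 100 - white_score[control][-gap]
    return as_white - as_black


for control in ("classical", "rapid"):
    print(control, favorite_score(control, 300), white_edge(control, 300))
```

With a 300-point gap, the favorite averages 71% at classical controls
and 69.5% in rapid, while the value of having White drops from 8
percentage points to 5. Most of the difference between the two curves
is the smaller first-move advantage, not a weaker favorite.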
The blitz data (the white line on my graph) is a little
more suspect, because there are fewer results available to analyze. However,
there is no compelling evidence that blitz games are "more random" than rapid
or even classical games; the white line is not any more horizontal than the
blue line. You can see a distinctive bend in the middle of the white line,
suggesting that the advantage of the white pieces is magnified when the two
players are of similar strength. For instance, when the two players have the
same rating, White scores 58% in blitz but only 55% in classical. As I just
mentioned, the advantage of the white pieces is not as large in rapid chess as
it is in classical chess, so in rapid games, when the players have identical
ratings, White only manages to score 53%. But again, I see no real evidence
that the faster time controls are diminishing or obscuring the rating
difference between the two players in blitz. Thus it seems that rapid games, or
even blitz games if need be, are a reasonably effective way to resolve ties.
Now, it is certainly true that we see a lot more decisive
results in the faster time controls, particularly in blitz. What do I mean by
"a lot more"? Well, switching the time controls from classical to rapid has
about the same effect (on the frequency of draws) as changing one of the
players from Peter Leko to either Veselin Topalov or Alexei Shirov, or changing
the opening from a 1.d4 game to a Sicilian Dragon. Further, switching the time
controls from classical to blitz has about the same effect (on the frequency
of draws) as changing a Peter Leko-Anatoly Karpov matchup into an Alexander
Morozevich-Alexei Fedorov matchup, or changing a Petroff's Defense into a
King's Gambit. This will indeed make the results slightly more random, which
(as I said) could be addressed by making the rapid tiebreaks longer. I hate to
sound like a broken record, but I should again point out that this exact
approach (using 4-game matches if a rapid tiebreak becomes necessary) was
already suggested by Yasser Seirawan in his "Fresh Start" proposal.
For instance, let's take a very simple unbiased case, where
two simultaneous 10-player single-round-robin tournaments are held, and the
winners play each other in a title match. First let's consider the case where
the final match is only six games long. If a drawn match is to be resolved by
the spin of a roulette wheel, the effectiveness of this format is 37.3%.
Obviously, it would be better to actually play games to resolve the tie, since
the stronger player would have a better-than-even chance to win the tiebreak.
So if we use the rapid-blitz progression as in the FIDE championships, the
effectiveness goes up to 39.2%. Since blitz games are more random, if we simply
played a long set of 2-game rapid matches, it would be slightly better (39.3%).
Finally, Yasser's suggestion of a rapid match which would be four games long
(rather than two), is the most effective tiebreak method (39.5% effectiveness).
You can see from those numbers that the tiebreak method
doesn't matter too much, even for a mere six-game match; the effectiveness
ranged from 37.3% (random) to 39.5% (4-game rapid match). Of course, as the
match length is increased, the tiebreak method becomes less and less of a
factor; for a 16-game match, the random option has an effectiveness of 41.1%
and the other options are all 41.6% or 41.7%. And for a 24-game match, the
random tiebreak has an effectiveness of 42.1% and all other options are tied at
42.3%. A drawn match is just too unlikely.
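The reason a longer rapid tiebreak helps can be sketched in a few lines
of Python. This is not the "effectiveness" calculation reported above,
which covers the entire format; it only looks at the tiebreak itself.
The per-game probabilities (35% win, 40% draw, 25% loss for the
stronger player) are hypothetical values picked for illustration, and a
tiebreak that ends level is assumed to be settled by a coin flip.

```python
from collections import defaultdict


def margin_distribution(n_games, p_win, p_draw, p_loss):
    """Exact distribution of (wins - losses) for the stronger player."""
    dist = {0: 1.0}
    for _ in range(n_games):
        nxt = defaultdict(float)
        for margin, prob in dist.items():
            nxt[margin + 1] += prob * p_win
            nxt[margin] += prob * p_draw
            nxt[margin - 1] += prob * p_loss
        dist = dict(nxt)
    return dist


def stronger_player_advances(n_games, p_win=0.35, p_draw=0.40, p_loss=0.25):
    """Chance the stronger player wins an n-game tiebreak.

    The default probabilities are hypothetical, not taken from the essay.
    A level tiebreak is assumed to be decided by a coin flip.
    """
    dist = margin_distribution(n_games, p_win, p_draw, p_loss)
    ahead = sum(prob for margin, prob in dist.items() if margin > 0)
    tied = dist.get(0, 0.0)
    return ahead + 0.5 * tied


for n in (0, 2, 4):
    print(n, round(stronger_player_advances(n), 4))
```

A zero-game tiebreak is the roulette wheel (exactly 50%). Under these
assumed probabilities, each extra pair of games moves the stronger
player's chances further above 50%, which is the same direction as the
37.3% / 39.3% / 39.5% progression above.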
However, sometimes this issue does not even arise.
Specifically, if one of the players has been granted "draw odds" in a
particular match, that player is automatically declared the winner in the event
of a drawn match. Usually, the defending champion is granted draw odds in their
match, and this is obviously a key part of Yasser's proposal, since it
acknowledges two champions, and there are also the curious provisions about
"inheriting" draw odds if you overcome them in your quarterfinal match.
Generally, draw odds are not a good way to resolve ties. They are better than a
roulette wheel (since on average the defending champion will be stronger than
the challenger), but slightly less effective than any other tiebreak method.
The main benefit of draw odds is that they provide an incentive for a defending
champion to actually participate in a world championship cycle, since the draw
odds are a bias that favors the defending champion.
Everything that I have said to this point applies to chess
world championships in general. The conclusions would have been identical a
decade ago, or fifteen years in the future, even with a completely different
set of top players. However, at this point I must leave off my attempts to be
"generic", because there is one final issue I want to cover, which must be
handled "specifically". I want to discuss the topic of who would be favored by
the various biases in the "Fresh Start" proposal, and in order to do that we
must start talking about "Vladimir Kramnik" and "Garry Kasparov" and "Ruslan
Ponomariov", rather than just "the defending champion" or "the highest-rated
player".
WHO IS FAVORED BY THE FRESH START PROPOSAL?
The "Fresh Start" proposal has an interesting set of
biases. Kramnik, Kasparov, and Ponomariov are all "rewarded" by being allowed
to bypass the qualifier, but each in turn is "punished" by the fact that the
other two players are also bypassing the qualifier. Ponomariov would presumably
be happy to avoid the qualifier, but sad that Kasparov and Kramnik (probably
his two strongest potential opponents) were guaranteed to qualify. Further, as
champions of their respective organizations, Kramnik and Ponomariov are
additionally granted another bias: draw odds in their quarterfinal and
semifinal matches. Finally, Kasparov is "punished" by the fact that he will
have to overcome draw odds in his semifinal match, whoever the opponent. So
clearly Kramnik and Ponomariov would benefit from the match structure, and
Kasparov would probably not benefit, but how big of a deal is this? What are
the magnitudes of each player's advantages and disadvantages? This is an
extremely important question, perhaps THE most important question about the
relative merits of Yasser's proposal.
First of all, let's once again draw an important
distinction between the meaning of "highest-rated player" and "strongest
player". Ratings are inexact, and so the player with the highest rating might
not actually be the strongest player. There is no way to exactly measure who
the strongest player is; all we can do is talk about the "likelihood" that each
player really is the strongest in the world. The rating list tells us (with
great accuracy) who has been most successful recently, and gives us some idea
of who will do best in the near future, but we should always remember that no
rating difference is ever 100% conclusive; you have to deal with probabilities
rather than absolutes.
By the way, I want to applaud the decision of the Einstein
Group to use an average of the FIDE and Professional ratings for the
invitations and seedings in their Dortmund qualifier. I had already mentioned a
year ago that a simple average of the two ratings did an excellent job of
masking the limitations of each individual one, so I think it was a great
decision. To keep things consistent, I have done the same thing in the
following analysis (using the April 1st 2002 rating lists), although I had to
add 50 points to each Professional rating to make the numbers similar to the
FIDE ratings. With these ratings, we can apply some simple statistics and
calculate each player's likelihood of being the strongest in the world.
Unsurprisingly, it's probably either Garry Kasparov or
Vladimir Kramnik. Kasparov (average FIDE/Prof rating 2842) has a 49% chance of
being the strongest player, whereas Kramnik (2827) has a 34% chance. Veselin
Topalov (2758), Ruslan Ponomariov (2751), and Viswanathan Anand (2751) each
have about a 3% chance, and the rest of the world (2740 and below) has a
combined 8% chance. In a perfect world championship format, whenever Kasparov
was indeed the strongest player, he would win the championship. And likewise
for Kramnik. Thus, in a perfect format, Kasparov would have a 49% chance
overall to win the championship, and Kramnik would have a 34% chance, and so
on.
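For readers who want a feel for how such a calculation might work, here
is a rough Python sketch. It is not the method behind the numbers
above. It assumes that each player's true strength is normally
distributed around the averaged rating with a standard deviation of 30
points (a hypothetical value), and it ignores everyone outside the five
players named, so it will not reproduce the 49% and 34% figures.

```python
import random

# Averaged FIDE/Professional ratings quoted in the text.
ratings = {
    "Kasparov": 2842,
    "Kramnik": 2827,
    "Topalov": 2758,
    "Ponomariov": 2751,
    "Anand": 2751,
}


def chance_of_being_strongest(ratings, sigma=30.0, trials=20000, seed=1):
    """Estimate how often each player has the highest sampled strength.

    sigma is a hypothetical rating uncertainty, not a figure from the essay.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in ratings}
    for _ in range(trials):
        # Draw one plausible "true strength" per player, then find the best.
        sampled = {name: rng.gauss(r, sigma) for name, r in ratings.items()}
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / trials for name, count in wins.items()}


chances = chance_of_being_strongest(ratings)
for name, chance in chances.items():
    print(f"{name}: {chance:.1%}")
```

Even this crude model shows the key point: a 15-point rating lead makes
Kasparov the likeliest candidate, but nowhere near a certainty. A
larger assumed sigma spreads the probability more evenly across the
field.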
However, the "perfect world championship" is only a myth.
We've already seen (above) that no known world championship format is even 70%
effective, so even in the best case, a third of the time the championship will
be won by somebody who is not the strongest player. We have to keep the matches
down to a reasonable and practical length, and sometimes that just isn't long
enough for the strongest player to demonstrate their superiority over another
very strong player.
I have spent several hours analyzing the statistical effect
of draw odds, and I can state very confidently that the actual selection of
Candidates is far more important than the question of who gets draw odds in a
10-game (or longer) match. For instance, even if there were no draw odds,
Kasparov and Kramnik would still be "punished" by the fact that they have to
play fairly short matches against players who are certainly weaker, but
nevertheless have some chance to eliminate them. I just told you
that we can be 83% sure that either Kasparov or Kramnik is the strongest player
in the world, but even after they bypassed the qualifier, there would still be
more than a 25% chance that someone else would actually win the championship.
Ruslan Ponomariov is clearly the beneficiary of the most
significant biases in the "Fresh Start" proposal. Although his combined rating
of 2751 puts Ponomariov in a virtual tie for fourth in the world with
Viswanathan Anand, he still has less than a 3% chance of actually being the
strongest player in the world. Nevertheless, Ponomariov would have a 10.4%
chance to actually win the championship. It turns out that if Ponomariov's
rating were actually 2783 (rather than 2751), then the numbers would claim that
Ponomariov did in fact have a 10.4% chance of being the strongest player. Thus
we can say that the specific Fresh Start proposal "awards" Ponomariov 32 rating
points, in effect.
This is a very large bias in favor of Ponomariov. To try
and put that bias in more concrete terms, let's envision a fantasy scenario
where Kasparov and Kramnik are the only two players who bypass the qualifier,
so Ponomariov has to finish in the top six in the Swiss qualifier like anyone
else. However, in this fantasy, Ponomariov gets a special advantage (in the
Swiss and in the final rounds of matches): he receives the white pieces
in five games out of every six, instead of one game out of every two. According to
my calculations, that fantasy scenario gives Ponomariov about the same
advantage that the actual Fresh Start proposal gives him. Is that an unfair
advantage? Or is it commensurate with his position as FIDE World Champion? That
is for someone else to decide, I suppose.
It would be tempting to say that +32 rating points is way
too many to "award" Ponomariov, and that he should be granted an automatic
place but not given draw odds. Well, that doesn't really help very much,
because the lion's share of his advantage lies in his automatic Candidate
status. Here is how the various biases are measured by my technique:
(1) Being an automatic qualifier for the three
rounds of matches (10/14/20 games): Kasparov -14 rating points, Kramnik -7
rating points, and Ponomariov +22 rating points.
(2) Draw odds given to Kramnik and Ponomariov in the
quarterfinal: Kramnik +4 rating points, Ponomariov +6 rating points.
(3) Draw odds given to Kramnik and Ponomariov in the
semifinal: Kramnik +4 rating points, Ponomariov +4 rating points.
(4) Any player who eliminates Kramnik or Ponomariov
in the quarterfinal, inherits draw odds for the semifinal: Kasparov -2 rating
points.
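The four items above can be totalled per player with a short Python
tally. The labels are mine, and treating the components as simply
additive is an assumption, although Ponomariov's total does come out to
the +32 points quoted earlier.

```python
# Rating-point biases from items (1) to (4), as listed in the text.
biases = {
    "automatic qualification": {"Kasparov": -14, "Kramnik": -7, "Ponomariov": 22},
    "quarterfinal draw odds": {"Kramnik": 4, "Ponomariov": 6},
    "semifinal draw odds": {"Kramnik": 4, "Ponomariov": 4},
    "inherited draw odds": {"Kasparov": -2},
}

# Assumption: the components add up linearly for each player.
totals = {}
for component in biases.values():
    for player, points in component.items():
        totals[player] = totals.get(player, 0) + points

for player, points in totals.items():
    print(f"{player}: {points:+d} rating points")
```

On this tally Kasparov nets -16 points, Kramnik +1, and Ponomariov +32.
Of Ponomariov's 32 points, 22 come from automatic qualification and
only 10 from draw odds.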
Interestingly enough, this collection of small advantages
for Kramnik, and small disadvantages for Kasparov, is sufficient to make
Kramnik the statistical favorite if the Fresh Start proposal were to actually
happen. Kramnik would have a 38% chance to win the championship, Kasparov would
have a 36% chance to win the championship, and (as I've already said)
Ponomariov would have just over a 10% chance to win the championship.
Nevertheless, that is only because Kasparov and Kramnik are already so close
together. In the bigger picture, this draw odds issue does not seem to merit
the attention it gets. A +4 rating point advantage, across the entire world
championship cycle, is less important statistically than the total advantage
you would get from your opponent blundering a pawn in one single game, sometime
during the cycle. Probably this is more of a prestige issue than anything else,
or perhaps there is a huge psychological issue I am ignoring with my statistics
(like the feeling that you are battling uphill from the start, if the other
person has draw odds).
CONCLUSION
As I said way back at the beginning, I have no particular
agenda to promote. However, I have had to re-examine many of my assumptions
about chess, as a result of this analysis, and I hope that will happen for you
as well. Among other things, I now have a much greater respect for Swiss
tournaments than before, along with a greater respect for Yasser Seirawan's
judgment and intuition about what makes a good tournament format! Perhaps some
deeply-held beliefs about the "randomness" of rapid chess will also be
challenged as a result of my analysis, but possibly that is too much to expect.
Likewise for the "draw odds" debate, I suppose...
This essay is the culmination of many, many late-night
hours of effort. However, I hope that it will prove to be a beginning, rather
than an end. There are many problems with the current state of the chess world,
and statistics will never be the only answer to any of them. Statistics are
merely a tool, a source of information, to assist people in finding a better
answer to some of their problems. There has been so much debate, and yet so
little objective exploration of the facts, and so I hope that this will be the
beginning of a new effort, a new kind of debate. I invite you to send me e-mail
at jeff@chessmetrics.com, and if there is enough interest perhaps I will
publish a follow-up analysis which incorporates feedback from all of you.
I would like to conclude with a quote from baseball analyst
Bill James: "It has always been my experience that if you can present a good
argument and back up what you are saying, there are people who will be
persuaded. It is sometimes possible to change the tenor of the debate by
injecting information into the discussion." I hope, very much, that he is
correct.
Thank you for taking the time to read this.
Jeff Sonas
You can contact Jeff Sonas at:
jeff@chessmetrics.com
The views expressed here do not necessarily reflect those of
TWIC, Chess & Bridge Ltd or the London Chess Center.