ScidBase

ScidBase is a high-quality reference database of unannotated master-level games in the Scid database format. It is only available on CD-ROM, for a small donation to cover media/postage costs and the time spent collating it.

ScidBase facts

Here are some ScidBase statistics as of 29 Jan 2004.

  • 1.03 million games
  • Mean year: 1990 (Year range: 1834 to 2004)
  • Mean Elo rating: 2350
  • Both players rated 2600+: more than 19,000 games*
  • Both players rated 2500+: more than 92,000 games*
  • Both players rated 2400+: more than 260,000 games*
  • Both players rated 2300+: more than 430,000 games*
[*] The Scid spellcheck file gives virtual ratings to strong historical players from before ratings were established in 1970, so these figures include games between players such as Capablanca and Alekhine.

ScidBase is the cumulative result of many hundreds of hours of work:

  • downloading game files
  • converting various (CA, CBF, CBH, PGN) formats to Scid
  • cleaning messed-up tags (jumbled Site/Event data, etc)
  • standardizing player/event/site names (using the Scid spellcheck feature, purpose-written scripts and lots of manual cleaning)
  • sorting out ambiguous names where possible (sometimes this is impossible; just try separating the two Andrei Sokolovs correctly!)
  • finding and removing doubles, taking care to avoid "false" doubles
  • weeding out games by low-rated players, internet blitz rubbish, computer blitz rubbish, weak regional youth championships, etc.

What's in, what's out

Many games available on the Internet are simply not "strong" enough (played by strong enough players, in serious conditions, at a serious time control) for a database like ScidBase. Some blitz games are fine (such as playoffs for a serious championship event) but too many will degrade the quality of chess represented.

Most historical games are included, but many odds and exhibition games have been left out. I don't really care how often Morphy beat patzers with knight odds. For tournaments in the modern era, the general rule is that a tournament should have at least a few strong titled players and not too many complete unknowns.

There are relatively few computer vs computer games, except for important events such as organised championships. "Comp-comp" games are easy to generate but tend to pollute the openings information of a general database. Correspondence games are also mostly restricted to world championships and the like, since the quality of chess in correspondence varies widely.

For recent years, a good rule of thumb is that if it is good enough for TWIC (The Week in Chess), it's good enough for ScidBase. It's hard to set a fixed rule, because even youth tournaments like national or regional under-12 or under-10 championships can have strong titled players these days.

One grey area is large Swiss events which have some grandmasters, IMs, etc. but also many weak players. ScidBase generally includes all known games in such events, avoiding "censorship" based on rating, although in many cases only the games involving at least one "strong" player are available anyway.

Most databases contain many "empty" games (with no or very few moves, usually starting 1.a4 or 1.h4). These are only retained to complete crosstables, and are easy enough to remove with a Header search in Scid if you so wish. Empty games in tournaments where many games are already missing have usually been removed.
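Outside Scid, the same idea can be sketched in a few lines of Python. This is only an illustration, not how the Scid Header search works: the game records, the `is_empty_game` helper and the `max_plies` cut-off are all assumptions made up for the example.

```python
def is_empty_game(moves, max_plies=2):
    """Return True if a game has no moves or only a token few.

    `moves` is a list of SAN strings, one per ply. `max_plies` is an
    assumed cut-off for illustration, not a value taken from ScidBase.
    """
    return len(moves) <= max_plies


# Hypothetical game records; real data would come from a PGN or Scid export.
games = [
    {"white": "Player A", "black": "Player B", "moves": []},
    {"white": "Player C", "black": "Player D", "moves": ["a4"]},
    {"white": "Player E", "black": "Player F",
     "moves": ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6"]},
]

# Keep only the games that contain real play.
kept = [g for g in games if not is_empty_game(g["moves"])]
print(len(kept))  # prints 1
```

Whether to drop such games depends on whether you want complete crosstables, so a filter like this is best applied to a copy of the database.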

Naming convention

One of the most important attributes of a well-cleaned database (apart from having very few duplicate games) is consistent, standard naming of players, events and sites. Here is a brief explanation of some of the naming conventions used in ScidBase.

Player Names

Full names (as they appear in the latest FIDE rating list) are used in most cases. If a player has more than one known given name, the second and subsequent given names are usually abbreviated to initials. Wherever possible, all games by one player share the same name for that player. Insufficient name information often makes this impossible, and there are still some ambiguous shortened names in the database.

Country codes are ignored, except to disambiguate conflicting names. This is important in the post-USSR era when many players have represented more than one country. Title (GM/IM/FM etc) codes are also removed (since they are often wrong anyway), as are club and region appendages. You may want to know who played for Solingen in a collection devoted to Bundesliga games, but this is a general database.

Players who have had a name change (usually women who have married) have the current name retrospectively applied. This may be a little confusing but avoids a player having games under two different names, and the need to determine when the name actually changed.

Dates

Wherever possible, a full date (e.g. 2003.12.31) is used. If only the month a tournament started in is known, that is used for all games in the tournament: "1998.04" for example, even if some games were played in May.

One hard, fixed rule is that every game considered to be in a tournament must share the same "EventDate" tag (which is as much information about the date of the first game as is known). This is essential for grouping games into tournaments.
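A check for this rule might look like the following Python sketch. It assumes each game is a plain dictionary of header tags and that a tournament is identified by its Event, Site and the year of its EventDate; both the record layout and the `find_eventdate_conflicts` helper are hypothetical, not part of Scid.

```python
from collections import defaultdict


def find_eventdate_conflicts(games):
    """Return tournaments whose games disagree on the EventDate tag.

    Games are grouped by (Event, Site, year of EventDate), which is an
    assumed definition of "one tournament" for this illustration.
    """
    seen = defaultdict(set)
    for game in games:
        year = game["EventDate"][:4]
        seen[(game["Event"], game["Site"], year)].add(game["EventDate"])
    # Only groups with more than one distinct EventDate break the rule.
    return {key: sorted(dates) for key, dates in seen.items() if len(dates) > 1}


# Hypothetical header data for three games.
games = [
    {"Event": "5. Open", "Site": "London ENG", "EventDate": "1998.04"},
    {"Event": "5. Open", "Site": "London ENG", "EventDate": "1998.04.12"},
    {"Event": "London Ch", "Site": "London ENG", "EventDate": "1998.04"},
]

conflicts = find_eventdate_conflicts(games)
print(conflicts)
```

Here the two "5. Open" games would be reported, because one carries only the month and the other a full date, so they would not group into a single tournament.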

Site Names

Almost all sites end with a three-letter country code. Common English spelling of city names is generally used. Sites where only the country is known have just the country code, e.g. "FRA". (This is most commonly used for team championships which can take place in many cities over several months.)

Where a location has changed countries (e.g. GER/FRG/GDR, Yugoslavia and the ex-Soviet republics) the current country is used. It's "Kiev UKR" (Ukraine) even when the USSR championship was played there.

Tournaments played in multiple sites usually have a single site name, such as "London ENG / Leningrad RUS" for the 1986 World Championship. This rule is less clearly followed for long-running events such as a national team championship, which can reasonably be interpreted as a single event (in which case it should have a single site name) or multiple separate events.

Internet games have the country code "INT", e.g. "Internet Chess Club INT" or just the generic "Internet INT". Correspondence games have a site of "Corr", possibly followed by a country code if played within a single nation.
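The site conventions above lend themselves to a simple mechanical check. The Python sketch below extracts the trailing three-letter code from a site name; the `trailing_country_code` helper is a made-up name, and the pattern only tests the shape of the code (three capital letters), not whether it is a real country code.

```python
import re

# Three capital letters at the very end, preceded by a space or the start
# of the string. This is an assumed pattern for illustration only.
_TRAILING_CODE = re.compile(r"(?:^|\s)([A-Z]{3})$")


def trailing_country_code(site):
    """Return the three-letter code ending a site name, or None."""
    match = _TRAILING_CODE.search(site.strip())
    return match.group(1) if match else None


for site in ["Kiev UKR", "FRA", "London ENG / Leningrad RUS",
             "Internet INT", "Corr"]:
    print(site, "->", trailing_country_code(site))
```

A site such as "Corr" has no code and returns None, which matches the convention that a country code for correspondence games is optional.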

Event Names

Event name standardization is probably the toughest part of database cleaning. There are so many different ambiguous conventions and abbreviations. For ScidBase, I have tried to produce a fairly verbose standard format and apply it as consistently as possible, while still having event names that are readable English. Here are just a few of the problems faced:

  • Inconsistent ordering of common words like Open, Women, Junior, etc.
  • Inconsistent abbreviation and capitalization of common words.
  • "M" can be "match" or "men"; "W" can mean "women" or "world"; "b" can mean "blitz" or "boys"; "g" can mean "game" or "girls"; "f" can mean "final" or "female".
  • Roman numerals are way overused: for classes (III for C grade), months (III for March), FIDE categories, and annual counters (III for the 3rd occurrence of an event).
  • Inconsistent use of Sponsor names.
  • Indication of the class of time control (blitz, rapid, g/30, etc) in the Event name is a mess in most databases.

ScidBase event naming conventions include:

  • Few abbreviations. Women, World, Open, Team and Match are spelled in full. Championship is left as Ch and International as It, since these are very common conventions.
  • Country codes where practical: FRA instead of France or French. One exception: I chose "USSR" over "URS" because it's only one more letter and easier to read.
  • Occurrence numbers are at the start of the event, followed by a period. Suffixes like "st" (1st), "nd" (2nd) and "rd" (3rd) are avoided. Example: "53. USA Ch" for the 53rd United States championship.
  • Category numbers such as "... (cat. 4)" are avoided, except where necessary to disambiguate two tournaments which would otherwise share the same event and site in the same year.
  • Information from the site field is generally not duplicated, except for a national or city championship. So you may see the event "London Ch" but "5. London Open" would just be "5. Open".
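As an illustration of the occurrence-number convention, here is a small Python sketch that rewrites a leading ordinal such as "53rd" into the "53." form. The `normalize_occurrence` helper and the input strings are assumptions for the example; this is not one of the scripts actually used to build ScidBase, and it handles only this one rule.

```python
import re

# A leading number followed by an English ordinal suffix, an optional
# period and any spaces. Only the start of the event name is examined.
_ORDINAL = re.compile(r"^(\d+)(?:st|nd|rd|th)\b\.?\s*", re.IGNORECASE)


def normalize_occurrence(event):
    """Rewrite a leading '53rd ' as '53. '; leave other names unchanged."""
    return _ORDINAL.sub(r"\1. ", event.strip(), count=1).rstrip()


print(normalize_occurrence("53rd USA Ch"))  # prints "53. USA Ch"
print(normalize_occurrence("53. USA Ch"))   # already in the target form
print(normalize_occurrence("London Ch"))    # no occurrence number at all
```

Names that already follow the convention pass through untouched, so the function can be applied repeatedly without changing the result.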

Summary

As you can see, ScidBase took a lot of work. I hope you will support it by donating to receive a copy of it on CD-ROM.


© 2004 Shane Hudson
Page updated: 29 Jan 2004