ELO ratings and color dependent expected score

pdemp · 10-06-2019 2:32pm #1

I've been researching the Elo rating system for something at work, and thought I'd look into implementations in chess as that's where it started off.

I used ~450,000 of the latest rated games from TWIC and the standard expected score of white [1/(1+10^(-(elo_white-elo_black)/400))] does a poor job of predicting the average outcome for white. However, 1/(1+10^(-(elo_white-elo_black + 40 )/500)) is an excellent predictor of the expected score [see below].

So if white is 40 points lower rated than black then the expected score is 0.5.

I tried to test with the ICU games database, but there's just not enough games with ratings to test for sure.

So should colour be taken into account in expected score calculation [and therefore rating calculation]?
Or, given that alternating colours reduces the impact somewhat, should the first move advantages be ignored for the sake of simplicity?
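
For anyone who wants to try the same check on their own data, here is a minimal sketch of how the comparison could be set up. The function names, the 50-point bin width and the (elo_white, elo_black, white_score) input layout are my own assumptions, not pdemp's actual code; only the two formulas are taken from the post above.

```python
from collections import defaultdict


def expected_standard(rd):
    """Standard Elo expected score for White; rd = elo_white - elo_black."""
    return 1.0 / (1.0 + 10.0 ** (-rd / 400.0))


def expected_colour(rd, shift=40.0, scale=500.0):
    """Colour-adjusted fit quoted in the post (shift of 40, scale of 500)."""
    return 1.0 / (1.0 + 10.0 ** (-(rd + shift) / scale))


def calibration_table(games, bin_width=50):
    """Bin games by rating difference and compare observed and predicted scores.

    games: iterable of (elo_white, elo_black, white_score) with white_score
    equal to 1, 0.5 or 0. This layout is hypothetical.
    Returns {bin_centre: (n_games, observed, standard, colour_adjusted)}.
    """
    bins = defaultdict(list)
    for elo_white, elo_black, white_score in games:
        rd = elo_white - elo_black
        centre = bin_width * round(rd / bin_width)
        bins[centre].append(white_score)

    table = {}
    for centre in sorted(bins):
        scores = bins[centre]
        observed = sum(scores) / len(scores)
        table[centre] = (
            len(scores),
            observed,
            expected_standard(centre),
            expected_colour(centre),
        )
    return table
```

With enough games per bin, the observed column can be compared directly against the two predicted columns. Note that expected_colour(-40) is exactly 0.5, which is the point made in the post.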

zeitnot · 10-06-2019 3:08pm

pdemp wrote: »

I used ~450,000 of the latest rated games from TWIC and the standard expected score of white [1/(1+10^(-(elo_white-elo_black)/400))] does a poor job of predicting the average outcome for white. However, 1/(1+10^(-(elo_white-elo_black + 40 )/500)) is an excellent predictor of the expected score [see below].

(The image didn't show for me.)

pdemp wrote: »

So if white is 40 points lower rated than black then the expected score is 0.5.

A few years back Jeff Sonas proposed simplifying the expectancy tables to a straight line between -460 and +390 for White, i.e., 35 point advantage to White. So this is consistent.

pdemp wrote: »

So should colour be taken into account in expected score calculation [and therefore rating calculation]?

Depends on what the goals are. More complicated models give more accurate results. The Deloitte/FIDE Chess Rating Challenge had some very elaborate schemes. For a rating scheme, it seems to me that simpler is better.

One question: in going through the TWIC data, did you calculate draw probability? This might be useful for some purposes.

Kilmokey · 10-06-2019 4:41pm

Paul

Looking at the individual history tab for players on the LCU website, colour changes the score. With three players picked at random, one did better with white, one did better with black and the other was about even. Have fun.

Tim Harding · 10-06-2019 4:55pm

When using TWIC data, did you exclude the blitz and rapid games?

If you didn't exclude at least the rapid then the dataset is worthless for discussions of classical ratings.

cdeb · 10-06-2019 5:18pm

I've consistently scored 7-8% better with black for some reason (something like 53% v 46%).

I'm in favour of a bonus for me!

But if whites and blacks are 50/50 (and my own database of 800 games is exactly 50/50), then it should all balance out, no? If so, is there a need to further complicate things?

Tim Harding · 10-06-2019 7:28pm

cdeb wrote: »

I've consistently scored 7-8% better with black for some reason (something like 53% v 46%).

I'm in favour of a bonus for me!

But if whites and blacks are 50/50 (and my own database of 800 games is exactly 50/50), then it should all balance out, no? If so, is there a need to further complicate things?

CORRECTION: In my previous post, I meant to say the blitz (instead of rapid), but I think for ratings you should only consider events of the same type.

I would imagine the faster the games, the less the colour matters?

pdemp · 10-06-2019 10:10pm

zeitnot wrote: »

(The image didn't show for me.)

Fixed I hope. Not sure if I've permission to put up images.

zeitnot wrote: »

A few years back Jeff Sonas proposed simplifying the expectancy tables to a straight line between -460 and +390 for White, i.e., 35 point advantage to White. So this is consistent.

That's roughly what I got as the average advantage, but it depended strongly on the relative rating [if you can see the graph you can see it ranges from around +100 to -5 in the zone +/- 300].

zeitnot wrote: »

Depends on what the goals are. More complicated models give more accurate results. The Deloitte/FIDE Chess Rating Challenge had some very elaborate schemes. For a rating scheme, it seems to me that simpler is better.

Yeah, for my application I'm reviewing a few ranking systems, some used in chess, and some are complex on an initial look. "As simple as possible, but no simpler" in chess should really mean taking colour into account. Ignoring it seems counter-intuitive.

zeitnot wrote: »

One question: in going through the TWIC data, did you calculate draw probability? This might be useful for some purposes.

I can run that tomorrow for separate win, draw, loss probabilities for white.

Tim Harding wrote: »

When using TWIC data, did you exclude the blitz and rapid games?

If you didn't exclude at least the rapid then the dataset is worthless for discussions of classical ratings.

I hadn't included event details as I was only interested in testing whether Elo was self-consistent [and it appears it is only if colour is a parameter], but I'll rerun tomorrow. I think blitz and rapid each make up around 5% of the total games, so their impact should be minimal, or should reinforce the argument for a colour-specific term, since, as you mention, colour should be less of a factor in such games.

pdemp · 10-06-2019 10:34pm

Kilmokey wrote: »

Paul

Looking at the individual history tab for players on the LCU website, colour changes the score. With three players picked at random, one did better with white, one did better with black and the other was about even. Have fun.

A few hundred v 450,000

Pattern wasn't clear with entire ICU games database, too much noise with 11,000 games. I'm not going to calculate the chances of picking three players at random and getting that outcome. I swear.

cdeb wrote: »

I've consistently scored 7-8% better with black for some reason (something like 53% v 46%).

I'm in favour of a bonus for me!

But what's your performance rating difference between the colours, rather than just results? We've two players in the club who are >100 points better as black than white, but there are always exceptions.
And remember it's not a bonus for you, it's a bonus for your opponents, who won't lose as much once you have your true rating.

cdeb wrote: »

But if whites and blacks are 50/50 (and my own database of 800 games is exactly 50/50), then it should all balance out, no? If so, is there a need to further complicate things?

If certain other conditions are met then it will balance out. I suppose I'd disagree that it complicates things, I see it as a simplification. With the colour parameter shift in the distribution, the expected scores in rating calculations match the expected scores from real games.

cdeb · 10-06-2019 11:35pm

pdemp wrote: »

But what's your performance rating difference between the colours, rather than just results.

Fair point actually. I don't know. I would assume that the rating of my opponents would balance out as well over a relatively large sample of games. But I'm not going back working it out!

pdemp wrote: »

I suppose I'd disagree that it complicates things, I see it as a simplification.

Well I suppose if it's an extra factor to take into account in calculating ratings (i.e. two sets of expected scores, one for white and one for black), it can't really be a simplification. It could be a refinement alright.

There's a lot of data on the ratings site, with colours all noted. I wonder would it be possible to create a duplicate and play around with alternative expected scores to see if it changed anything? My suspicion is very little would change - if I gain 2 points in a tournament because of it, then I would go into the next tournament 2 points higher and the same performance would see me gain a fraction less. I still think it would all just balance out.

mikhail · 11-06-2019 4:39am

Here's Paul's plot, for anyone struggling to access it.

pdemp · 11-06-2019 11:39am

cdeb wrote: »

Well I suppose if it's an extra factor to take into account in calculating ratings (i.e. two sets of expected scores, one for white and one for black), it can't really be a simplification. It could be a refinement alright.

Yes describing it as a simplification is wrong. I'll rephrase and go with the exclusion of colour being an over-simplification, since it is inconsistent with the data. Pretty much along the same reasoning as switching from normal to logistic in the first place.

cdeb wrote: »

There's a lot of data on the ratings site, with colours all noted. I wonder would it be possible to create a duplicate and play around with alternative expected scores to see if it changed anything? My suspicion is very little would change - if I gain 2 points in a tournament because of it, then I would go into the next tournament 2 points higher and the same performance would see me gain a fraction less. I still think it would all just balance out.

You'd hope for most players it wouldn't make much of a difference, especially active tournament players. But suppose someone is ~250 points higher rated than their opponents and has 5 games with each colour. Excluding the colour effect, their expected score is 8.08. However, taking colour into account, the expected score is 7.58. So with a K-factor of 24, they are losing out on 12 rating points over just those 10 games, no matter what their actual score.
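
The ten-game example above can be checked with a short sketch. The helper names are illustrative assumptions; the formulas are the standard Elo expectation and the colour-adjusted fit from the first post, with the Black expectation taken as one minus White's expectation for the opponent.

```python
def expected_standard(rd):
    """Standard Elo expected score; rd = own rating - opponent rating."""
    return 1.0 / (1.0 + 10.0 ** (-rd / 400.0))


def expected_colour_white(rd):
    """Colour-adjusted expected score for White; rd = elo_white - elo_black."""
    return 1.0 / (1.0 + 10.0 ** (-(rd + 40.0) / 500.0))


def expected_total(gap, games_per_colour, colour_aware):
    """Expected total for a player rated `gap` points above every opponent."""
    if not colour_aware:
        return 2 * games_per_colour * expected_standard(gap)
    as_white = expected_colour_white(gap)
    # As Black, the opponent has White and is `gap` points lower rated.
    as_black = 1.0 - expected_colour_white(-gap)
    return games_per_colour * (as_white + as_black)


K_FACTOR = 24
standard_total = expected_total(250, 5, colour_aware=False)
colour_total = expected_total(250, 5, colour_aware=True)
rating_points = K_FACTOR * (standard_total - colour_total)

print(round(standard_total, 2), round(colour_total, 2), round(rating_points, 1))
```

The difference between the two totals, multiplied by K, is the roughly 12-point gap described in the post. It is independent of the actual results because only the expected score changes.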

pdemp · 11-06-2019 11:41am

Tim Harding wrote: »

When using TWIC data, did you exclude the blitz and rapid games?

If you didn't exclude at least the rapid then the dataset is worthless for discussions of classical ratings.

Removing the blitz and rapid made no noticeable difference. And I've not enough rapid or blitz games yet to come to any firm conclusions about effect of colour in those events.

pdemp · 11-06-2019 11:52am

zeitnot wrote: »

One question: in going through the TWIC data, did you calculate draw probability? This might be useful for some purposes.

Not sure if anyone can see the attachments [thanks for posting the first one mikhail], but two plots, one with prob of draw v (elo white - elo black) and the other with the ratio of wins to draws v (elo white - elo black).

Probability of draw seems to follow a Cauchy distribution shifted by ~30 rating points.

Ratio of wins to draws is pretty much fixed when elo_white < elo_black - 60 at 0.5. After that it grows exponentially roughly following 0.5*10^( (elo white - elo black + 60)/400)

zeitnot · 11-06-2019 6:59pm

pdemp wrote: »

Not sure if anyone can see the attachments

I can see all the attachments now. Yes, the one attached to the first post seems to be a very close fit.

pdemp wrote: »

two plots, one with prob of draw v (elo white - elo black) and the other with the ratio of wins to draws v (elo white - elo black).

Probability of draw seems to follow a cauchy distribution shifted by ~30 rating points.

Ratio of wins to draws is pretty much fixed when elo_white < elo_black - 60 at 0.5. After that it grows exponentially roughly following 0.5*10^( (elo white - elo black + 60)/400)

Very interesting, thanks. I am puzzled by the second plot here, though. All the others seem to suggest that playing White is worth 30-40 rating points, and once that is accounted for, everything else is symmetric. But the second plot here shows that a much lower-rated White scores half his points via draws, whereas a much lower-rated Black scores almost all his points via wins. That's quite unexpected.

pdemp · 11-06-2019 10:21pm

zeitnot wrote: »

Very interesting, thanks. I am puzzled by the second plot here, though. All the others seem to suggest that playing White is worth 30-40 rating points, and once that is accounted for, everything else is symmetric. But the second plot here shows that a much lower-rated White scores half his points via draws, whereas a much lower-rated Black scores almost all his points via wins. That's quite unexpected.

The axis isn't well labelled. It should say White Win Prob / Draw Prob. For black the graph is similar but with black win:draw ratio of ~0.4 (so black is more dependent on draws for elo black < elo white - 20 than white which is intuitive), and then the same exponential growth rate for elo black > elo white - 20.

zeitnot · 11-06-2019 10:51pm

pdemp wrote: »

The axis isn't well labelled. It should say White Win Prob / Draw Prob. For black the graph is similar but with black win:draw ratio of ~0.4 (so black is more dependent on draws for elo black < elo white - 20 than white which is intuitive), and then the same exponential growth rate for elo black > elo white - 20.

Thanks. That makes sense.

The results are interesting because they half-fit the usual model of draws, i.e., that the players play twice (win or lose, no draw possible), and an overall 'draw' corresponds to a 1-1 result. So if p = prob of white win in these half-games, we have overall prob of white win p^2, prob of draw 2pq, prob of black win q^2. (And expectancy table for rating purposes should give p.) White win prob / draw prob = p / 2q, and I think the plot matches this to the right. But not to the left, where we would expect the ratio to go to 0 instead of levelling out at 0.5.
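
A minimal sketch of the two-half-games model just described, with hypothetical function names. It shows that the expected score comes out as p, and that the win:draw ratio p/(2q) heads to 0 rather than levelling out as p shrinks.

```python
def half_game_model(p):
    """Return (white_win, draw, black_win) when each half-game is won by
    White with probability p and a 1-1 split counts as a draw."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def win_draw_ratio(p):
    """White win probability divided by draw probability: p^2 / (2pq) = p / (2q)."""
    return p / (2.0 * (1.0 - p))


for p in (0.05, 0.25, 0.5, 0.75):
    white_win, draw, black_win = half_game_model(p)
    expected = white_win + 0.5 * draw  # equals p, since p^2 + pq = p(p + q)
    print(p, round(expected, 4), round(win_draw_ratio(p), 4))
```

At p = 0.5 the ratio is exactly 0.5, and it keeps falling below that as p decreases, which is where the model departs from the plot.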

pdemp · 12-06-2019 7:50am

zeitnot wrote: »

The results are interesting because they half-fit the usual model of draws, i.e., that the players play twice (win or lose, no draw possible), and an overall 'draw' corresponds to a 1-1 result. So if p = prob of white win in these half-games, we have overall prob of white win p^2, prob of draw 2pq, prob of black win q^2. (And expectancy table for rating purposes should give p.) White win prob / draw prob = p / 2q, and I think the plot matches this to the right. But not to the left, where we would expect the ratio to go to 0 instead of levelling out at 0.5.

Taking that approach 1.3*(p/2q)^1.25 fits the right hand side. The issue seems to be that 2p(1-p) overestimates the number of draws apart from near elo_white = elo_black - [30,60].

While not 100% rigorous the following formulae estimate the draw and win probabilities very well for white with rd = elo_white - elo_black:
Expected score: 1/ (1 + 10^(-(rd+40)/500))
Draw: 375/(250*pi*(1+((rd+30)/250)^2))
White Win (derived from above two):
1/ (1 + 10^(-(rd+40)/500)) - 0.5*375/(250*pi*(1+((rd+30)/250)^2))
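
The three formulae above translate directly into code. This is a sketch with assumed function names; black_win is not stated in the post and is derived here as whatever probability is left over.

```python
import math


def expected_score(rd):
    """Colour-adjusted expected score for White; rd = elo_white - elo_black."""
    return 1.0 / (1.0 + 10.0 ** (-(rd + 40.0) / 500.0))


def draw_probability(rd):
    """Scaled Cauchy-shaped fit for the draw probability, centred at rd = -30."""
    return 375.0 / (250.0 * math.pi * (1.0 + ((rd + 30.0) / 250.0) ** 2))


def white_win(rd):
    """Derived from expected = win + 0.5 * draw."""
    return expected_score(rd) - 0.5 * draw_probability(rd)


def black_win(rd):
    """Remainder after White wins and draws (an assumption, not in the post)."""
    return 1.0 - white_win(rd) - draw_probability(rd)


for rd in (-200, -30, 0, 200):
    print(rd, round(white_win(rd), 3), round(draw_probability(rd), 3),
          round(black_win(rd), 3))
```

These are empirical fits rather than a proper probability model, so they are only meaningful over the range that was fitted. At a gap of about -1000, for instance, the derived win term goes slightly negative.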

sodacat11 · 12-06-2019 11:42am

I think that it very much depends on the playing style of a player and the openings they use as to how they score with white or black. Most of the highest rated scalps I've taken are with Black but I also lose more games to higher rated as Black than I do as White. I do much better against lower rated when I'm White.
I think that often it is easier to play against a better player as Black because they are usually the ones dictating the course of the game and the Black player is just responding to threats with a limited (or forced) number of replies. The White player on the other hand has to formulate plans and try to break down the black defence so often has a much greater choice of moves and thereby more chance of picking the wrong one.
Overall, I don't think colours should be taken into consideration for rating purposes.

pdemp · 17-06-2019 2:30pm

cdeb wrote: »

There's a lot of data on the ratings site, with colours all noted. I wonder would it be possible to create a duplicate and play around with alternative expected scores to see if it changed anything? My suspicion is very little would change - if I gain 2 points in a tournament because of it, then I would go into the next tournament 2 points higher and the same performance would see me gain a fraction less. I still think it would all just balance out.

I did a few more calcs and it doesn't balance out unless you're consistently playing opponents of your own strength. I couldn't find the colour data on the ratings page, even for my own games. Is this something only the rating officer can access?

cdeb · 17-06-2019 3:01pm

Colour data is there in the background - you can see shadings in any tournament standings list that you click into.

zeitnot · 17-06-2019 3:40pm

pdemp wrote: »

I did a few more calcs and it doesn't balance out unless you're consistently playing opponents of your own strength. I couldn't find the colour data on the ratings page, even for my own games. Is this something only the rating officer can access?

The TWIC data that you started with is not a random cross-section of all games played, but instead skews high. For example 4NCL division 1 is included, but lower division games from 4NCL are not.

I remember reading somewhere that colour advantage was rating dependent: expectation around 0.57 for White at higher levels, but close to 0.5 at the very lowest levels. (I have no reference, and it was a good while ago.) If that's the case, it would give a good reason to stick with the simpler formula.

pdemp · 17-06-2019 4:27pm

cdeb wrote: »

Colour data is there in the background - you can see shadings in any tournament standings list that you click into.

Thanks. class "B". I never noticed the shading intent before.

zeitnot wrote: »

The TWIC data that you started with is not a random cross-section of all games played, but instead skews high. For example 4NCL division 1 is included, but lower division games from 4NCL are not.

I remember reading somewhere that colour advantage was rating dependent: expectation around 0.57 for White at higher levels, but close to 0.5 at the very lowest levels. (I have no reference, and it was a good while ago.) If that's the case, it would give a good reason to stick with the simpler formula.

Yes, the TWIC data is dominated by 2300 - 2500 games. I tested it in bands of 100 from 1000 up to 2800, and the colour-adjusted formula was a better estimate of the results for all ranges, especially for 1200+.

The colour issue put a spanner in the works for my work application, in which order of events matters in ranking scenarios, so hoping if I can understand it in the relatively simple case of chess results I can then apply it consistently in other areas.

pdemp · 17-06-2019 4:37pm

That said in chess it won't make much of a difference in practice.

e.g. suppose there are only two players left in the world with true ratings of 1400 and 1900. With colour taken into account their ratings won't change over a large number of games, provided they play roughly the same number of games with each colour. If colour is ignored then their ratings will shift to 1450 and 1850. So minimal impact in chess, more of an interesting curiosity.
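
The two-player example can be reproduced with a small calculation. This sketch assumes that the "true" results follow the colour-adjusted fit, that colours are split evenly, and that the colour-blind ratings settle where the standard formula matches the average score actually achieved. Function names are illustrative.

```python
import math


def expected_colour_white(rd):
    """Colour-adjusted expected score for White; rd = elo_white - elo_black."""
    return 1.0 / (1.0 + 10.0 ** (-(rd + 40.0) / 500.0))


def average_score_stronger(true_gap):
    """Average score of the stronger player over an even split of colours."""
    as_white = expected_colour_white(true_gap)
    as_black = 1.0 - expected_colour_white(-true_gap)
    return 0.5 * (as_white + as_black)


def equilibrium_gap(true_gap):
    """Rating gap at which the standard formula predicts that average score.

    Inverts s = 1 / (1 + 10^(-d/400)) to d = 400 * log10(s / (1 - s)).
    """
    s = average_score_stronger(true_gap)
    return 400.0 * math.log10(s / (1.0 - s))


gap = equilibrium_gap(500)
midpoint = (1400 + 1900) / 2
print(round(midpoint - gap / 2), round(midpoint + gap / 2))
```

Under these assumptions the gap settles a little under 400 points, which is consistent with the ratings drifting to about 1450 and 1850.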
