eXtremeGammon's chi-squared tests on FIBS dice

Zorba · April 15, 2010, 08:19:05 PM

Here's what I get from eXtremeGammon's chi-squared tests after 4895 dice rolls from FIBS for me and 4884 for my opponent. I don't know how to interpret this. All I know is that FIBS dice, so far, at no point scored all "good" or "excellent" for these tests. For comparison, I have fewer Dailygammon matches entered (1450 rolls me, 1448 opponent), but I get 6 times "Excellent".

See screenshot.

Just thought this was interesting, especially since I've just learned from another FIBSter with similar results. I have no clue what kind of conclusion one could draw from this (if any at all).

boomslang · April 15, 2010, 11:43:48 PM

The first three Chi2 tests check whether the rolls received by you, by your opponent, and by both of you combined follow the expected distribution over all 21 possible results (six doubles with prob. 1/36 each and 15 non-doubles with prob. 1/18 each). For each possible roll, the occurrences are counted and compared with the expected number of occurrences (basically, the differences are squared, normalized and added). This results in an X2 value whose distribution is known.
Large values indicate that there were (some) big differences between observed and expected counts; small values indicate very small differences. Since the rolls are random, big differences AND small differences are suspicious. Your X2 of 11.87 is small but not 'too' small: assuming random dice, there is an 8% chance to get an even smaller value. Your opponent's 37.07 is actually quite big: there's only a 1.15% chance of getting an even larger value, which suggests your opponent's dice did not follow the expected distribution.
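To make the calculation above concrete, here is a minimal Python sketch. The function names are made up for illustration and this is not eXtremeGammon's actual code. It assumes 20 degrees of freedom (21 possible rolls minus one); because that number is even, the chi-squared tail probability has a simple closed form that needs no statistics library.

```python
import math
from itertools import combinations_with_replacement

# All 21 distinct rolls: 6 doubles (prob. 1/36) and 15 non-doubles (prob. 1/18).
ROLLS = list(combinations_with_replacement(range(1, 7), 2))
PROBS = {r: (1 / 36 if r[0] == r[1] else 1 / 18) for r in ROLLS}

def chi2_statistic(counts):
    """Sum of (observed - expected)^2 / expected over the 21 possible rolls."""
    n = sum(counts.values())
    return sum((counts.get(r, 0) - n * p) ** 2 / (n * p) for r, p in PROBS.items())

def chi2_upper_tail(x, df):
    """P(X2 > x), using the closed form that holds for an even df."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# The two X2 values quoted above, with df = 21 - 1 = 20 (an assumption).
p_smaller = 1 - chi2_upper_tail(11.87, 20)  # chance of an even smaller X2
p_larger = chi2_upper_tail(37.07, 20)       # chance of an even larger X2
print(f"{p_smaller:.3f} {p_larger:.4f}")
```

Feeding `chi2_statistic` a dictionary of counts keyed by `(low die, high die)` gives the X2 value for any set of recorded rolls.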

The second three tests are similar, but they make statements about the randomness of doubles during races. In this case there are only 7 cases: six different doubles (prob. 1/36 each) and one non-double (prob. 30/36). According to this table, only the combined race double test looks a bit suspicious.
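A sketch of the seven-cell version, under the same caveats: the counts below are invented for illustration, the names are hypothetical, and 6 degrees of freedom (7 cells minus one) is an assumption about how the test is set up.

```python
import math

# Expected probabilities: six specific doubles plus one pooled "non-double" cell.
RACE_PROBS = {f"{d}{d}": 1 / 36 for d in range(1, 7)}
RACE_PROBS["non-double"] = 30 / 36

def race_chi2(counts):
    """X2 over the seven cells of the race doubles test."""
    n = sum(counts.values())
    return sum((counts.get(c, 0) - n * p) ** 2 / (n * p) for c, p in RACE_PROBS.items())

def upper_tail_df6(x):
    """P(X2 > x) for 6 degrees of freedom (closed form for even df)."""
    h = x / 2
    return math.exp(-h) * (1 + h + h * h / 2)

# Hypothetical race sample of 720 rolls with a few too many 6-6.
counts = {"11": 20, "22": 20, "33": 20, "44": 20, "55": 20, "66": 35, "non-double": 585}
x2 = race_chi2(counts)
print(x2, upper_tail_df6(x2))
```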

As eXtremeGammon sets the p-value thresholds at 5% and 95%, even with truly random dice you will see 'suspicious' results in 10% of the cases. Getting 'Excellent' too often is also not good!
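The 10% figure can be checked with a small simulation. This is only a sketch: Python's pseudo-random generator stands in for fair dice, the batch sizes and seed are arbitrary, and the 5%/95% thresholds are taken from the description above.

```python
import math
import random

def chi2_upper_tail_df20(x):
    """P(X2 > x) for 20 degrees of freedom (closed form for even df)."""
    h = x / 2
    return math.exp(-h) * sum(h ** k / math.factorial(k) for k in range(10))

def roll_chi2(n_rolls, rng):
    """X2 over the 21 distinct rolls for n_rolls simulated fair rolls."""
    counts = {}
    for _ in range(n_rolls):
        a, b = rng.randint(1, 6), rng.randint(1, 6)
        key = (min(a, b), max(a, b))
        counts[key] = counts.get(key, 0) + 1
    x2 = 0.0
    for i in range(1, 7):
        for j in range(i, 7):
            expected = n_rolls / (36 if i == j else 18)
            x2 += (counts.get((i, j), 0) - expected) ** 2 / expected
    return x2

def flagged_fraction(n_tests=500, n_rolls=1000, seed=1):
    """Share of fair-dice batches whose p-value falls below 5% or above 95%."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_tests):
        p = chi2_upper_tail_df20(roll_chi2(n_rolls, rng))
        if p < 0.05 or p > 0.95:
            flagged += 1
    return flagged / n_tests

fraction = flagged_fraction()
print(fraction)
```

With only 500 batches the printed share will wobble by a few percentage points around one in ten.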

dorbel · April 16, 2010, 07:29:06 AM

No mathematician I, but aren't these sample sizes far too small to be meaningful? I would have thought that millions of rolls would be required for any sort of proper test.

Including the first roll of any game in your survey will distort results as Fibs can't give you a doublet there.

It occurs to me that including the last roll of a race may also be misleading. In a two roll ending, a doublet will actually be the last roll, whereas a singleton won't!

Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.

diane · April 16, 2010, 09:05:09 AM

Quote from: dorbel on April 16, 2010, 07:29:06 AM
Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.

I am not volunteering for that

I suppose 100 of us could do 10000 rolls...ugh..1000 of us do 1000 rolls...er - are there good places to steal large numbers of 'real rolls' from?

dorbel · April 16, 2010, 11:15:17 AM

More to the point, how good do the dice have to be? If you can't predict what the next roll is going to be, or even make a better than random guess, and if the number of doubles is more or less proportional, then what else do you need? Comparing your dice and opponent's dice is also meaningless, as on FIBS both sides are rolling with the equivalent of the same pair of dice! If at the end of a million trials or something, one roll has come up more than any other, so what? Now if you could say which number that was going to be before the trial started, then that would be interesting!

Zorba · April 16, 2010, 11:37:11 AM

Thanks boomslang, that's a clear explanation! Remains the question of sample size, as dorbel wrote. I don't know how that would work out for these X^2 tests.

In general, how many rolls are needed to say something meaningful depends on what you want to test and what kind of confidence you are looking for. If you roll dice 25 times and you get 25 doubles, that is meaningful, because the "discrepancy" is relatively huge. If I suspect a pair of dice is not quite precision dice because sixes come up 17.00% of the time on each die rather than 16.67%, I need a lot more than 25 rolls to get some meaningful evidence for that hypothesis.
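A rough back-of-the-envelope version of both examples, as a sketch only: it uses a normal approximation and a hypothetical `rolls_needed` helper, and it only asks when the gap exceeds about two standard errors, which is a weaker requirement than a full power calculation.

```python
import math

def prob_all_doubles(n):
    """Chance that n fair rolls of two dice are all doubles."""
    return (1 / 6) ** n

def rolls_needed(p_fair, p_suspected, z=1.96):
    """Rough number of single-die throws before the gap between p_fair and
    p_suspected exceeds z standard errors (normal approximation)."""
    delta = abs(p_suspected - p_fair)
    return math.ceil((z * math.sqrt(p_fair * (1 - p_fair)) / delta) ** 2)

print(prob_all_doubles(25))        # astronomically small
print(rolls_needed(1 / 6, 0.17))   # tens of thousands of throws
```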

About the first roll: I'd guess the programmers at XG have taken care of this before compiling the data. Including last rolls is not a problem; the easiest way to see this is to realize that the dice don't know which roll of the game it is.

One could roll precision dice by hand and do this X^2 test as a control, but for random dice, the kind of outcomes you're expected to see are already mathematically defined. So it would be more like testing if the precision dice rolled are actually likely to be random.

As to the final question, "how good do the dice have to be?", I'd say you'd like them to behave randomly. Anything else brings up the question of the cause of the discrepancy. As backgammon often leads to positions where even a single dice roll can have a huge impact on the outcome of a game or match, even slight discrepancies in the overall distribution could potentially (but not necessarily) be highly influential on the outcomes of games and matches.

pck · April 16, 2010, 07:58:10 PM

Quote from: Zorba on April 16, 2010, 11:37:11 AM
Thanks boomslang, that's a clear explanation! Remains the question of sample size, as dorbel wrote. I don't know how that would work out for these X^2 tests.

gumpi ran the same XG tests over roughly 250,000 dice rolls from around 2200 FIBS .mat files I sent him. Here are the results.

What puzzles me is how it is possible to get good results for the player's as well as the opp's distribution, but a bad one for their combined distribution.

boomslang · April 16, 2010, 09:24:10 PM

Quote from: dorbel on April 16, 2010, 07:29:06 AM

No mathematician I, but aren't these sample sizes far too small to be meaningful? I would have thought that millions of rolls would be required for any sort of proper test.

Including the first roll of any game in your survey will distort results as Fibs can't give you a doublet there.

It occurs to me that including the last roll of a race may also be misleading. In a two roll ending, a doublet will actually be the last roll, whereas a singleton won't!

Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.

As a rule of thumb, Chi2 tests are considered valid if the expected count in each of the cells is larger than 5. With Zorba's 4800+ rolls, the expected number of occurrences for each possible roll is larger than 133 (for each double) and larger than 266 (for each non-double), so that doesn't cause a problem.
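The arithmetic behind that rule of thumb, as a short sketch with hypothetical helper names:

```python
from fractions import Fraction

def expected_counts(n_rolls):
    """Expected occurrences of one specific double and one specific non-double."""
    return {"double": Fraction(n_rolls, 36), "non-double": Fraction(n_rolls, 18)}

def min_rolls_for_rule(threshold=5):
    """Rolls at which the rarest cell (a specific double, prob. 1/36)
    reaches the threshold expected count."""
    return threshold * 36

for n in (4884, 4895):
    counts = expected_counts(n)
    print(n, float(counts["double"]), float(counts["non-double"]))
print(min_rolls_for_rule())
```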

Opening rolls and last rolls during a bear off are indeed special cases. (I once made a table of 47000 last rolls in games that didn't end with a resignation and almost 30% of them were doubles.) Using these rolls in the test will, in the long run, result in high X2's and therefore deflated p-values.

Quote from: Zorba on April 16, 2010, 11:37:11 AM
[...] Including last rolls is not a problem; the easiest way to see this is to realize that the dice don't know which roll of the game it is.

It is a problem. Suppose you have 4 stones left on your 1 point, and your opponent has just started bearing off. With a probability of 11/36 the last roll in that match will be a double: it will only be a non-double if you roll two non-doubles in a row! (Assuming the opponent doesn't resign.)
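The 11/36 figure can be reproduced exactly. This sketch assumes a simplified bear-off in which all remaining checkers sit on the ace point, a non-double takes off 2 and a double takes off 4, and the opponent never finishes first; the function name is made up.

```python
from fractions import Fraction

P_DOUBLE = Fraction(1, 6)

def last_roll_double_prob(checkers_on_ace):
    """Chance that the roll bearing off the final checker is a double."""
    if checkers_on_ace <= 2:
        return P_DOUBLE  # any roll finishes, so the last roll is an ordinary roll
    after_non_double = last_roll_double_prob(checkers_on_ace - 2)
    if checkers_on_ace <= 4:
        after_double = Fraction(1)  # a double finishes immediately
    else:
        after_double = last_roll_double_prob(checkers_on_ace - 4)
    return P_DOUBLE * after_double + (1 - P_DOUBLE) * after_non_double

print(last_roll_double_prob(4))  # 1 - (5/6)^2
```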

pck · April 16, 2010, 09:42:07 PM

Quote from: boomslang on April 16, 2010, 09:24:10 PM
It is a problem. Suppose you have 4 stones left on your 1 point, and your opponent has just started bearing off. With a probability of 11/36 the last roll in that match will be a double: it will only be a non-double if you roll two non-doubles in a row! (Assuming the opponent doesn't resign.)

But from this it doesn't follow that there are more doubles than 1 in 6 rolled on average in bg matches. (*)

Dorbel is right. Inim fell prey to this illusion a while ago as well. The above argument ignores the fact that bg games have no fixed number of rolls.

Imagine an evenly distributed dice stream. If (*) were true you shouldn't be able to play bg with it. But of course you can.

boomslang · April 16, 2010, 10:02:09 PM

Quote from: pck on April 16, 2010, 07:58:10 PM
gumpi ran the same XG tests over roughly 250,000 dice rolls from around 2200 FIBS .mat files I sent him. Here are the results.

What puzzles me is how it is possible to get good results for the player's as well as the opp's distribution, but a bad one for their combined distribution.

Looking at the screenshots, it seems the dice for pck and the dice for all his opponents were exactly the same: same overall stats, same race stats, same number of rolls, same double ratio, same average pip, entering ratio etc.

Something clearly went wrong while running the tests and/or importing the .mat-files.

boomslang · April 16, 2010, 10:16:17 PM

Quote from: pck on April 16, 2010, 09:42:07 PM

But from this it doesn't follow that there are more doubles than 1 in 6 rolled on average in bg matches. (*)

It does mean that there are more doubles than 1 in 6 during race situations (since the last roll has a higher than 1/6 chance of being a double and all remaining rolls have a 1/6 chance of being a double), so for the last three tests it matters.

pck · April 16, 2010, 10:21:25 PM

Quote from: boomslang on April 16, 2010, 10:02:09 PM
Looking at the screenshots, it seems the dice for pck and the dice for all his opponents were exactly the same: same overall stats, same race stats, same number of rolls, same double ratio, same average pip, entering ratio etc.

Something clearly went wrong while running the tests and/or importing the .mat-files.

Thanks, I think you're right. This would indeed explain how the players' individual distributions can be good and the combined one bad - it is very unlikely for the exact same deviations from the expected values to happen to two players.
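One way to see the effect: if the second data set is an exact copy of the first, every observed and expected count in the combined table doubles, and so does the X2 value. The counts below are invented purely for illustration and use the seven-cell doubles layout.

```python
def chi2(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: six doubles plus one pooled non-double cell (720 rolls).
expected = [20, 20, 20, 20, 20, 20, 600]
player = [26, 15, 22, 14, 25, 18, 600]

# If the opponent's data is an exact copy, the combined table is simply doubled.
combined = [2 * o for o in player]
combined_expected = [2 * e for e in expected]

x2_single = chi2(player, expected)
x2_combined = chi2(combined, combined_expected)
print(x2_single, x2_combined)
```

With 6 degrees of freedom the 5% critical value is about 12.59, so in this made-up case each player's table looks unremarkable while the doubled table crosses the line.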

pck · April 16, 2010, 10:39:02 PM

Quote from: boomslang on April 16, 2010, 10:16:17 PM
It does mean that there are more doubles than 1 in 6 during race situations (since the last roll has a higher than 1/6 chance of being a double and all remaining rolls have a 1/6 chance of being a double), so for the last three tests it matters.

Not sure about this. The same argument as above applies: Use an evenly distributed dice stream which doesn't favour doubles to play only races. It's quite possible to do that, but shouldn't be if it were true that in races more doubles than 1 in 6 are rolled.

My guess is that in some or all of those situations which have a >1/6 chance for a double on the last roll, some previous roll cannot have been a double (since otherwise the >1/6 last roll situation would have been prevented from being created).

Zorba · April 16, 2010, 11:19:45 PM

Heh, this is a funny problem that I indeed discussed with inim once.

While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double, as overall the probability should still be 1/6 over all rolls. The obvious reason is that the dice don't know about the board situation and just pop up doubles 1/6 of the time. It's just that doubles more often cause a roll to become the last roll of a match.
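A toy simulation of this claim, as a sketch only: a single player bears off alone, a non-double removes 2 checkers and a double removes 4, and Python's pseudo-random generator stands in for fair dice. Real games are more complicated, so this illustrates the mechanism rather than measuring FIBS.

```python
import random

def simulate_races(n_games=20000, checkers=15, seed=7):
    """Return (double share among last rolls, double share among all rolls)."""
    rng = random.Random(seed)
    last_doubles = total_doubles = total_rolls = 0
    for _ in range(n_games):
        left = checkers
        while left > 0:
            is_double = rng.randint(1, 6) == rng.randint(1, 6)
            left -= 4 if is_double else 2
            total_rolls += 1
            total_doubles += is_double
            if left <= 0 and is_double:
                last_doubles += 1  # this roll ended the game and was a double
    return last_doubles / n_games, total_doubles / total_rolls

last_share, overall_share = simulate_races()
print(last_share, overall_share)
```

In this toy model the share of doubles among last rolls comes out well above 1/6, while the share over all rolls pooled together stays close to 1/6.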

If you just look at the doubles fraction in a race situation, it's less clear what's going on. The last roll of a race will be a double more than 1/6 of the time. But how about the next-to-last roll? The probability that one was a double is lower than 1/6, by the same line of reasoning (had it been a double, it might've often been the last roll). However, now that we're not looking at all rolls anymore but just the race doubles, it's unclear to me if that would influence the overall probability. If at all, I don't think it would be by much. Most of the "last roll double p>1/6" is compensated for by the "next-to-last roll double p<1/6" I'd think and the rest may all be compensated for by going one or two rolls further back.

Another complicating factor with counting "race doubles" might be that players often disengage and turn a position into a race when they roll a double. It's not even clear if these doubles are counted as race doubles: it's a race after the move, but it wasn't before the move. To avoid bias, these rolls should not be counted.

boomslang · April 17, 2010, 12:29:27 AM

Quote from: pck on April 16, 2010, 10:39:02 PM

My guess is that in some or all of those situations which have a >1/6 chance for a double on the last roll, some previous roll cannot have been a double

Good dice (precision or generated by an ideal RNG) have no memory, so each roll is independent of every other roll, and that includes 'some previous' roll as well...

Quote from: Zorba on April 16, 2010, 11:19:45 PM

While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double

No, see above.

Quote from: Zorba on April 16, 2010, 11:19:45 PM
The last roll of a race will be a double more than 1/6 of the time. But how about the next-to-last roll? The probability that one was a double is lower than 1/6, by the same line of reasoning [...]
Most of the "last roll double p>1/6" is compensated for by the "next-to-last roll double p<1/6" I'd think and the rest may all be compensated for by going one or two rolls further back

Again, due to the lack of memory property they all are completely independent.

The thing is that 'opening doubles' are rolled, but not recorded in your .mat-files. That's why they will not show up in your XG stats and the stats are flawed. If you play against GnuBG, it WILL show you opening doubles, just as in real life ;-). If I am not mistaken, you see opening doubles in FIBS as well if you have the game text toggle on. Write them down with pen and paper, add them to your stats and all is well!

pck · April 17, 2010, 01:11:00 AM

Quote from: Zorba on April 16, 2010, 11:19:45 PM
While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double, as overall the probability should still be 1/6 over all rolls. The obvious reason is that the dice don't know about the board situation and just pop up doubles 1/6 of the time. It's just that doubles more often cause a roll to become the last roll of a match.

Yes, the occurrence of certain last roll situations with >1/6 double probability depends on previous rolls. So we cannot assign those situations a probability p > 1/6 for a double and at the same time assign 1/6 for all other rolls. I agree exactly the same is true for races.

Quote from: Zorba on April 16, 2010, 11:19:45 PM
However, now that we're not looking at all rolls anymore but just the race doubles, it's unclear to me if that would influence the overall probability.

If the starting points of races are distributed randomly within the sequence S of all rolls of all matches, the partial sample P of S we get by looking only at races should be a randomly selected subset of S. In a randomly selected subset of S we can expect to find the same distribution of doubles as we expect to find in S.

pck · April 17, 2010, 01:51:03 AM

Quote from: boomslang on April 17, 2010, 12:29:27 AM
Good dice (precision or generated by an ideal RNG) have no memory, so each roll is independent of every other roll, and that includes 'some previous' roll as well...
[...]
No, see above.
[...]
Again, due to the lack of memory property they all are completely independent.

The thing is that 'opening doubles' are rolled, but not recorded in your .mat-files. That's why they will not show up in your XG stats and the stats are flawed. If you play against GnuBG, it WILL show you opening doubles, just as in real life ;-). If I am not mistaken, you see opening doubles in FIBS as well if you have the game text toggle on. Write them down with pen and paper, add them to your stats and all is well!

I completely agree on the opening doubles: the rolls that determine who moves first must be included in the stats, otherwise they will be flawed.

With last roll doubles, the situation is different. The no memory/independence argument doesn't apply here, because you choose to look at situations with a > 1/6 probability. That is, you single out a certain class of situations according to a rule (last roll double with p > 1/6). For this class of situations you can no longer assume independence from earlier rolls since it was not selected randomly.

To see this more clearly, consider a simple game where two players take turns rolling one die. Whoever rolls 6 first, wins. So all games will end in a 6. The last roll probability for 6 will therefore be 100%. But it would obviously be wrong to conclude that due to the "independence of the rolls", the probability of getting a 6 in the rolls before the last one is 1/6. That probability is of course zero. We're looking at the game from a "backwards" perspective, from the vantage point of its last roll which we know to be a 6. This knowledge allows us to draw conclusions about the probabilities of the previous rolls.
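The "first to roll a 6" game is easy to simulate. This is a minimal sketch with made-up names, using Python's pseudo-random generator as the die. Which player rolls does not matter for the counts, so the turns are not modelled.

```python
import random

def simulate_six_game(n_games=20000, seed=3):
    """Play many games that end on the first 6 and tally where the 6s fall."""
    rng = random.Random(seed)
    last_is_six = earlier_sixes = earlier_rolls = total_rolls = 0
    for _ in range(n_games):
        rolls = []
        while not rolls or rolls[-1] != 6:
            rolls.append(rng.randint(1, 6))
        last_is_six += rolls[-1] == 6
        earlier_sixes += rolls[:-1].count(6)
        earlier_rolls += len(rolls) - 1
        total_rolls += len(rolls)
    return {
        "last_roll_six_share": last_is_six / n_games,
        "earlier_roll_six_share": earlier_sixes / earlier_rolls,
        # Exactly one 6 per game, so the pooled share is games / rolls.
        "overall_six_share": n_games / total_rolls,
    }

result = simulate_six_game()
print(result)
```

Looking backwards, the last roll is always a 6 and the earlier rolls never are, yet the die itself still produces a 6 about once in six rolls overall.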

The independence argument applies only from a "forward-looking" vantage point, that is, from a view which has no knowledge yet about what is going to happen in the game. Switching back to backgammon, we may say that a double will occur with probability 1/6 only from such a perspective. But looking backwards from a last-roll-double-with-p > 1/6-situation means we do have a certain amount of knowledge about what happened. We are no longer in a general, entirely undetermined situation. We do have some information which allows us to arrive at certain conclusions about previous rolls. Hence the backwards-looking double-probability for those rolls will not be 1/6 anymore.

Assigning p > 1/6 double probabilities to certain last roll situations and 1/6 to all other rolls illegitimately mixes the forward and backward view and produces erroneous results. (You're mixing probabilities from two different probability spaces modelled according to different states of knowledge.)

dorbel · April 17, 2010, 04:54:48 AM

Nicely put pck.

boomslang · April 17, 2010, 11:55:22 AM

Quote from: pck on April 17, 2010, 01:11:00 AM
Yes, the occurrence of certain last roll situations with >1/6 double probability [ ...]

I don't know what you mean by an 'occurrence of certain last roll situations', so I cannot comment on this.

Quote from: pck on April 17, 2010, 01:11:00 AM
If the starting points of races are distributed randomly within the sequence S of all rolls of all matches, the partial sample P of S we get by looking only at races should be a randomly selected subset of S. In a randomly selected subset of S we can expect to find the same distribution of doubles as we expect to find in S.

1. "If the starting points of races are distributed randomly..." We know that this is not true: the starting point of race situations cannot lie in the first 8 or so rolls
2. P is not a random subset of S: it is the tail of sequence S, so it for sure will include the last roll (and because of 1. it will for sure not include the opening roll).

More than 1/6th of P will be doubles.

Zorba · April 17, 2010, 12:38:18 PM

Conditional probabilities are a b#t#h...

Introducing "last roll" as a special class of race dice creates conditional probabilities; the rolls are then no longer independent events.

One cannot just single out the last roll of a race where doubles have p>1/6, and think that by doing so, the other rolls in the race are still independent events. Consider your own example:

4 checkers on the acepoint. What is the probability of rolling doubles on:

1. my last roll
2. my next-to-last roll

FIBS Board backgammon forum
