eXtremeGammon's chi-squared tests on FIBS dice

Started by Zorba, April 15, 2010, 08:19:05 PM


Zorba

Here's what I get from eXtremeGammon's chi-squared tests after 4895 dice rolls from FIBS for me and 4884 for my opponent. I don't know how to interpret this; all I know is that the FIBS dice have, so far, at no point scored all "good" or "excellent" on these tests. For comparison, I have fewer Dailygammon matches entered (1450 rolls for me, 1448 for my opponent), but there I get "Excellent" six times.

See screenshot.

Just thought this was interesting, especially since I've just heard from another FIBSter with similar results. I have no clue what kind of conclusion one could draw from this (if any at all).

The fascist's feelings of insecurity run so deep that he desperately needs a classification of some things as successful or superior and other things as failed or inferior. This also underlies the fascist's embracement of concepts like mental illness and IQ tests.  - R.J.V.

Luck is my main skill

boomslang

The first three Chi2 tests check whether the rolls you, your opponent, and the two of you combined received were evenly distributed over all possible results (six doubles with prob. 1/36 and 15 non-doubles with prob. 1/18 each). The occurrences of each possible roll are counted and compared with the expected number of occurrences (basically the differences are squared, normalized and added). This results in an X2 value whose distribution is known.
Large values indicate that there were (some) big differences between observed and expected counts; small values indicate very small differences. Since the rolls are random, big differences AND small differences are suspicious. Your X2 of 11.87 is small but not 'too' small: assuming random dice, there is an 8% chance of getting an even smaller value. Your opponents' 37.07 is actually quite big: there is only a 1.15% chance of getting an even larger value, indicating your opponents' dice were not evenly distributed.

The second three tests are similar, but they make statements about the randomness of doubles during races. In this case there are only 7 cases: six different doubles (prob. 1/36 each) and one non-double (prob. 30/36). According to this table, only the combined race double test looks a bit suspicious.

As eXtremeGammon sets the p-values at 5% and 95%, even with true random dice, you will see 'suspicious' results in 10% of the cases. Too often 'Excellent' is also not good!
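The test described above can be sketched in a few lines of Python. This is a rough sketch of the idea, not XG's actual code; 31.41 is the standard 5% critical value of the chi-squared distribution with 20 degrees of freedom (21 roll categories minus one):

```python
import random

def chi_squared_stat(rolls):
    """X2 statistic over the 21 distinct backgammon rolls:
    6 doubles (prob. 1/36 each) and 15 non-doubles (prob. 1/18 each,
    since e.g. 3-1 and 1-3 are the same roll)."""
    counts = {}
    for d1, d2 in rolls:
        key = (min(d1, d2), max(d1, d2))   # normalize 3-1 and 1-3
        counts[key] = counts.get(key, 0) + 1
    n = len(rolls)
    x2 = 0.0
    for a in range(1, 7):
        for b in range(a, 7):
            p = 1 / 36 if a == b else 1 / 18
            expected = n * p
            x2 += (counts.get((a, b), 0) - expected) ** 2 / expected
    return x2

# Sanity check: a sample with exactly the expected counts scores X2 = 0
# (each double once, each non-double twice, in 36 rolls).
perfect = [(a, a) for a in range(1, 7)]
for a in range(1, 7):
    for b in range(a + 1, 7):
        perfect += [(a, b), (a, b)]
perfect_x2 = chi_squared_stat(perfect)

# A simulated session of 4895 fair rolls, like Zorba's sample size.
random.seed(42)
session = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(4895)]
session_x2 = chi_squared_stat(session)
print(perfect_x2, session_x2)
```

Note that the "perfect" sample scores exactly zero, which, as noted above, would itself be suspicious: too often 'Excellent' is also not good.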

dorbel

No mathematician I, but aren't these sample sizes far too small to be meaningful? I would have thought that millions of rolls would be required for any sort of proper test.

Including the first roll of any game in your survey will distort results as Fibs can't give you a doublet there.

It occurs to me that including the last roll of a race may also be misleading. In a two roll ending, a doublet will actually be the last roll, whereas a singleton won't!

Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.



diane

Quote from: dorbel on April 16, 2010, 07:29:06 AM
Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.

I am not volunteering for that  :blink: :blink:

I suppose 100 of us could do 10000 rolls...ugh..1000 of us do 1000 rolls...er - are there good places to steal large numbers of 'real rolls' from?   :laugh: :laugh:
Never give up on the things that make you smile

dorbel

More to the point, how good do the dice have to be? If you can't predict what the next roll is going to be, or even make a better-than-random guess, and the number of doubles is more or less proportional, then what else do you need? Comparing your dice and your opponent's dice is also meaningless, as on fibs both sides are rolling with the equivalent of the same pair of dice! If at the end of a million trials or something, one roll has come up more than any other, so what? Now if you could say which number that was going to be before the trial started, then that would be interesting!

Zorba

Thanks boomslang, that's a clear explanation! That leaves the question of sample size, as dorbel wrote. I don't know how that would work out for these X^2 tests.

In general, how many rolls are needed to say something meaningful depends on what you want to test and what kind of confidence you are looking for. If you roll dice 25 times and you get 25 doubles, that is meaningful, because the "discrepancy" is relatively huge. If I suspect a pair of dice is not quite precision dice because sixes come up 17.00% of the time on each die rather than 16.67%, I need a lot more than 25 rolls to get some meaningful evidence for that hypothesis.
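The two examples above can be made concrete with the standard normal-approximation sample-size formula for a proportion test. This is a back-of-the-envelope sketch; 1.96 and 0.84 are the usual z-values for 5% significance and 80% power:

```python
import math

def rolls_needed(p0, p1, z_alpha=1.96, z_beta=0.84):
    """Rough sample size needed to detect a shift in one face's
    probability from p0 to p1 at ~5% significance with ~80% power
    (normal approximation for a one-sample proportion test)."""
    delta = abs(p1 - p0)
    s = z_alpha * math.sqrt(p0 * (1 - p0)) + z_beta * math.sqrt(p1 * (1 - p1))
    return math.ceil((s / delta) ** 2)

# Detecting sixes at 17.00% instead of 16.67% takes on the order of
# a hundred thousand rolls -- far more than the 25 rolls that suffice
# when the discrepancy is as huge as 25 doubles in 25 rolls.
n = rolls_needed(1 / 6, 0.17)
print(n)
```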

About the first roll: I'd guess the programmers at XG have taken care of this before compiling the data. Including last rolls is not a problem; the easiest way to see this is to realize that the dice don't know which roll of the game it is.

One could roll precision dice by hand and do this X^2 test as a control, but for random dice, the kind of outcomes you're expected to see are already mathematically defined. So it would be more like testing if the precision dice rolled are actually likely to be random.

As to the final question, "how good do the dice have to be?", I'd say you'd like them to behave randomly. Anything else brings up the question of the cause of the discrepancy. As backgammon often leads to positions where even a single dice roll can have a huge impact on the outcome of a game or match, even slight discrepancies in a distribution overall, could potentially (but not necessarily) be highly influential on the outcomes of games and matches.

pck

Quote from: Zorba on April 16, 2010, 11:37:11 AM
Thanks boomslang, that's a clear explanation! That leaves the question of sample size, as dorbel wrote. I don't know how that would work out for these X^2 tests.

gumpi ran the same XG tests over roughly 250.000 dice rolls from around 2200 FIBS .mat files I sent him. Here are the results.

What puzzles me is how it is possible to get good results for the player's as well as the opp's distribution, but a bad one for their combined distribution.

boomslang


Quote from: dorbel on April 16, 2010, 07:29:06 AM

No mathematician I, but aren't these sample sizes far too small to be meaningful? I would have thought that millions of rolls would be required for any sort of proper test.


Including the first roll of any game in your survey will distort results as Fibs can't give you a doublet there.

It occurs to me that including the last roll of a race may also be misleading. In a two roll ending, a doublet will actually be the last roll, whereas a singleton won't!

Perhaps it would be instructive to roll a pair of precision dice as a control alongside your observations.


As a rule of thumb, Chi2 tests are considered valid if the expected count in each of the cells is larger than 5.  With Zorba's 4800+ rolls, the expected number of occurrences for each possible roll is larger than 133 (for the doubles) and larger than 266 (for each non-double), so that doesn't cause a problem.

Opening rolls and last rolls during a bear off are indeed special cases.  (I once made a table of 47000 last rolls in games that didn't end with a resignation and almost 30% of them were doubles.)  Using these rolls in the test will, in the long run, result in high X2's and therefore deflated p-values.



Quote from: Zorba on April 16, 2010, 11:37:11 AM
[...] Including last rolls is not a problem; the easiest way to see this is to realize that the dice don't know which roll of the game it is.

It is a problem. Suppose you have 4 stones left on your 1 point, and your opponent has just started bearing off.  With a probability of 11/36 the last roll in that match will be a double: it will only be a non-double if you roll two non-doubles in a row! (Assuming the opponent doesn't resign.)
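The position described here is easy to simulate. A quick sketch with fair pseudo-dice, assuming a double bears off all four checkers and a non-double bears off two:

```python
import random

def last_roll_is_double(rng):
    """boomslang's position: 4 checkers left on the ace point.
    A double bears off all four; a non-double bears off two."""
    if rng.randint(1, 6) == rng.randint(1, 6):
        return True   # first roll is a double and ends the game
    # Otherwise a second roll finishes either way; was it a double?
    return rng.randint(1, 6) == rng.randint(1, 6)

rng = random.Random(1)
trials = 200_000
freq = sum(last_roll_is_double(rng) for _ in range(trials)) / trials
print(freq)   # should hover near 11/36 = 1/6 + (5/6)(1/6), about 0.306
```

Even though every individual roll is fair, the last-roll double frequency comes out near 11/36, exactly as argued above.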




pck

Quote from: boomslang on April 16, 2010, 09:24:10 PM
It is a problem. Suppose you have 4 stones left on your 1 point, and your opponent has just started bearing off.  With a probability of 11/36 the last roll in that match will be a double: it will only be a non-double if you roll two non-doubles in a row! (Assuming the opponent doesn't resign.)

But from this it doesn't follow that there are more doubles than 1 in 6 rolled on average in bg matches. (*)

Dorbel is right. Inim fell prey to this illusion a while ago as well. The above argument ignores the fact that bg games have no fixed number of rolls.

Imagine an evenly distributed dice stream. If (*) were true you shouldn't be able to play bg with it. But of course you can.

boomslang

Quote from: pck on April 16, 2010, 07:58:10 PM
gumpi ran the same XG tests over roughly 250.000 dice rolls from around 2200 FIBS .mat files I sent him. Here are the results.

What puzzles me is how it is possible to get good results for the player's as well as the opp's distribution, but a bad one for their combined distribution.


Looking at the screenshots, it seems the dice for pck and the dice for all his opponents were exactly the same: same overall stats, same race stats, same number of rolls, same double ratio, same average pip, entering ratio etc.

Something clearly went wrong while running the tests and/or importing the .mat-files.

boomslang

Quote from: pck on April 16, 2010, 09:42:07 PM

But from this it doesn't follow that there are more doubles than 1 in 6 rolled on average in bg matches. (*)


It does mean that there are more doubles than 1 in 6 during race situations (since the last roll has a higher than 1/6 chance of being a double and all remaining rolls have a 1/6 chance of being a double), so for the last three tests it matters.

pck

#11
Quote from: boomslang on April 16, 2010, 10:02:09 PM
Looking at the screenshots, it seems the dice for pck and the dice for all his opponents were exactly the same: same overall stats, same race stats, same number of rolls, same double ratio, same average pip, entering ratio etc.

Something clearly went wrong while running the tests and/or importing the .mat-files.

Thanks, I think you're right. This would indeed explain how the players' individual distributions can be good and the combined one bad - it is very unlikely for the exact same deviations from the expected values to happen to two players.

pck

#12
Quote from: boomslang on April 16, 2010, 10:16:17 PM
It does mean that there are more doubles than 1 in 6 during race situations (since the last roll has a higher than 1/6 chance of being a double and all remaining rolls have a 1/6 chance of being a double), so for the last three tests it matters.

Not sure about this. The same argument as above applies: use an evenly distributed dice stream which doesn't favour doubles to play only races. It's quite possible to do that, but it shouldn't be if it were true that more doubles than 1 in 6 are rolled in races.

My guess is that in some or all of those situations which have a >1/6 chance for a double on the last roll, some previous roll cannot have been a double (since otherwise the >1/6 last roll situation would have been prevented from being created).

Zorba

Heh, this is a funny problem that I indeed discussed with inim once.

While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double, as overall the probability should still be 1/6 over all rolls. The obvious reason is that the dice don't know about the board situation and just pop up doubles 1/6 of the time. It's just that doubles more often cause a roll to become the last roll of a match.

If you just look at the doubles fraction in a race situation, it's less clear what's going on. The last roll of a race will be a double more than 1/6 of the time. But how about the next-to-last roll? The probability that one was a double is lower than 1/6, by the same line of reasoning (had it been a double, it might've often been the last roll). However, now that we're not looking at all rolls anymore but just the race doubles, it's unclear to me if that would influence the overall probability. If at all, I don't think it would be by much. Most of the "last roll double p>1/6" is compensated for by the "next-to-last roll double p<1/6" I'd think and the rest may all be compensated for by going one or two rolls further back.

Another complicating factor with counting "race doubles", might be that players often disengage and turn a position into a race, when they roll a double. It's not even clear if these doubles are counted as race doubles: it's a race after the move, but it wasn't before the move. To avoid bias, these rolls should not be counted.

boomslang

Quote from: pck on April 16, 2010, 10:39:02 PM

My guess is that in some or all of those situations which have a >1/6 chance for a double on the last roll, some previous roll cannot have been a double

Good dice (precision or generated by an ideal RNG) have no memory, so each roll is independent of any other roll, and that includes 'some previous roll' as well...

Quote from: Zorba on April 16, 2010, 11:19:45 PM

While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double


No, see above.

Quote from: Zorba on April 16, 2010, 11:19:45 PM
The last roll of a race will be a double more than 1/6 of the time. But how about the next-to-last roll? The probability that one was a double is lower than 1/6, by the same line of reasoning [...]
Most of the "last roll double p>1/6" is compensated for by the "next-to-last roll double p<1/6" I'd think and the rest may all be compensated for by going one or two rolls further back


Again, due to the lack of memory property they all are completely independent.

The thing is that 'opening doubles' are rolled, but not recorded in your .mat-files. That's why they will not show up in your XG stats and the stats are flawed.  If you play against GnuBG, it WILL show you opening doubles, just as in real life ;-).  If I am not mistaken, you see opening doubles on FIBS as well if you have the game text toggle on.  Write them down with pen and paper, add them to your stats and all is well!


pck

Quote from: Zorba on April 16, 2010, 11:19:45 PM
While it is true that the last roll of a game (finished to completion) has a higher chance than 1/6 to be a double, from this it necessarily follows that other rolls in a bg game have a lower chance than 1/6 at being a double, as overall the probability should still be 1/6 over all rolls. The obvious reason is that the dice don't know about the board situation and just pop up doubles 1/6 of the time. It's just that doubles more often cause a roll to become the last roll of a match.

Yes, the occurrence of certain last roll situations with >1/6 double probability depends on previous rolls. So we cannot assign those situations a probability p > 1/6 for a double and at the same time assign 1/6 for all other rolls. I agree exactly the same is true for races.

Quote from: Zorba on April 16, 2010, 11:19:45 PM
However, now that we're not looking at all rolls anymore but just the race doubles, it's unclear to me if that would influence the overall probability.

If the starting points of races are distributed randomly within the sequence S of all rolls of all matches, the partial sample P of S we get by looking only at races should be a randomly selected subset of S. In a randomly selected subset of S we can expect to find the same distribution of doubles as we expect to find in S.

pck

#16
Quote from: boomslang on April 17, 2010, 12:29:27 AM
Good dice (precision or generated by an ideal RNG) have no memory so they are independent of any other roll so that includes 'some previous' roll aswell...
[...]
No, see above.
[...]
Again, due to the lack of memory property they all are completely independent.

The thing is that 'opening doubles' are rolled, but not recorded in your .mat-files. That's why they will not show up in your XG stats and the stats are flawed.  If you play against GnuBG, it WILL show you opening doubles, just as in real life ;-).  If I am not mistaken, you see opening doubles on FIBS as well if you have the game text toggle on.  Write them down with pen and paper, add them to your stats and all is well!

I completely agree on the opening doubles: the rolls to determine who moves first must be included in the stats, otherwise they will be flawed.

With last roll doubles, the situation is different. The no memory/independence argument doesn't apply here, because you choose to look at situations with a > 1/6 probability. That is, you single out a certain class of situations according to a rule (last roll double with p > 1/6). For this class of situations you can no longer assume independence from earlier rolls since it was not selected randomly.

To see this more clearly, consider a simple game where two players take turns rolling one die. Whoever rolls 6 first, wins. So all games will end in a 6. The last roll probability for 6 will therefore be 100%. But it would obviously be wrong to conclude that due to the "independence of the rolls", the probability of getting a 6 in the rolls before the last one is 1/6. That probability is of course zero. We're looking at the game from a "backwards" perspective, from the vantage point of its last roll which we know to be a 6. This knowledge allows us to draw conclusions about the probabilities of the previous rolls.

The independence argument applies only from a "forward-looking" vantage point, that is, from a view which has no knowledge yet about what is going to happen in the game. Switching back to backgammon, we may say that a double will occur with probability 1/6 only from such a perspective. But looking backwards from a last-roll-double-with-p > 1/6-situation means we do have a certain amount of knowledge about what happened. We are no longer in a general, entirely undetermined situation. We do have some information which allows us to arrive at certain conclusions about previous rolls. Hence the backwards-looking double-probability for those rolls will not be 1/6 anymore.

Assigning p > 1/6 double probabilities to certain last roll situations and 1/6 to all other rolls illegitimately mixes the forward and backward view and produces erroneous results. (You're mixing probabilities from two different probability spaces modelled according to different states of knowledge.)
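The toy game above can be played out against a single uniform dice stream (a quick simulation sketch):

```python
import random

rng = random.Random(7)
games = 50_000
total_rolls = 0
sixes = 0

for _ in range(games):
    while True:
        roll = rng.randint(1, 6)
        total_rolls += 1
        sixes += (roll == 6)
        if roll == 6:      # whoever rolls 6 first wins: game over
            break

# Each game contains exactly one 6 -- its last roll -- so the
# backwards-looking "last roll is a 6" rate is 100%. Yet the fraction
# of 6s in the consumed stream is just games/total_rolls, and games
# last 6 rolls on average, so the stream itself stays unbiased.
overall = sixes / total_rolls
print(overall)   # near 1/6
```

Every last roll is a 6 by the game's rule, yet the stream as a whole contains 6s at the ordinary 1-in-6 rate: the two perspectives answer different questions.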

boomslang

Quote from: pck on April 17, 2010, 01:11:00 AM
Yes, the occurrence of certain last roll situations with >1/6 double probability [...]

I don't know what you mean by an 'occurrence of certain last roll situations', so I cannot comment on this.

Quote from: pck on April 17, 2010, 01:11:00 AM
If the starting points of races are distributed randomly within the sequence S of all rolls of all matches, the partial sample P of S we get by looking only at races should be a randomly selected subset of S. In a randomly selected subset of S we can expect to find the same distribution of doubles as we expect to find in S.

1. "If the starting points of races are distributed randomly..." We know that this is not true: the starting point of race situations cannot lie in the first 8 or so rolls
2. P is not a random subset of S: it is the tail of sequence S, so it for sure will include the last roll (and because of 1. it will for sure not include the opening roll).

More than 1/6th of P will be doubles.

Zorba

Conditional probabilities are a b#t#h...

Introducing "last roll" as a special class of race dice creates conditional probabilities; those rolls are no longer independent events.

One cannot just single out the last roll of a race where doubles have p>1/6, and think that by doing so, the other rolls in the race are still independent events. Consider your own example:

4 checkers on the acepoint. What is the probability of rolling doubles on:

1. my last roll
2. my next-to-last roll

pck

Quote from: boomslang on April 17, 2010, 11:55:22 AM
I don't know what you mean by an 'occurrence of certain last roll situations', so I cannot comment on this.

I meant situations like the one you gave as an example (4 checkers on acepoint, probability for a last roll double > 1/6).

Quote from: boomslang on April 17, 2010, 11:55:22 AM
1. "If the starting points of races are distributed randomly..." We know that this is not true: the starting point of race situations cannot lie in the first 8 or so rolls
2. P is not a random subset of S: it is the tail of sequence S, so it for sure will include the last roll (and because of 1. it will for sure not include the opening roll).

More than 1/6th of P will be doubles.

1. I didn't say "distributed evenly", but "distributed randomly", which I admit is confusing (see 2. below for a hopefully clearer explanation).

2. Note that S was supposed to be an infinite sequence of rolls used to play an infinite number of matches, not just one match. Hence what I called P is an infinite collection of tails in your sense. It is a subset of S, the collection of all rolls.

Instead of using my misleading phrase "P is a random subset of S", consider this:

Imagine two infinite dicestreams, both unbiased towards doubles. With the first stream we play all pre-race parts of all our matches. As soon as we have a race situation, we switch to the dice from the second stream. After a game/match we switch back to stream #1, and so on. We get a resulting, double-unbiased dicestream S over all our rolls of all our matches which is clearly also double-unbiased in pre-race as well as in in-race situations (= P) alone. If races were double-biased, this construction shouldn't be possible. But obviously it is.

The fact that it is so trivially possible to do this lends further credence to my claim that the idea of double-biased races is due to a confusion/mixing of two perspectives.
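This construction is easy to check with boomslang's simplified race (four checkers on the ace point; a sketch with fair pseudo-dice, assuming a double bears off four and a non-double bears off two):

```python
import random

rng = random.Random(3)
games = 100_000
race_rolls = race_doubles = last_doubles = 0

for _ in range(games):
    checkers = 4
    while checkers > 0:
        is_double = rng.randint(1, 6) == rng.randint(1, 6)
        race_rolls += 1
        race_doubles += is_double
        checkers -= 4 if is_double else 2   # double bears off 4, non-double 2
        if checkers <= 0:
            last_doubles += is_double       # record whether the *last* roll was a double

print(race_doubles / race_rolls)   # near 1/6: the race rolls as a whole are unbiased
print(last_doubles / games)        # near 11/36: only the last roll is "biased"
```

The last roll alone is a double about 11/36 of the time, yet the fraction of doubles over all race rolls stays at 1/6: singling out the last roll and the whole-sample view really are two different questions.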

ah_clem

If I may be so bold as to weigh in with a few observations...

1) An even distribution of rolls across all 36 possible values is hardly proof that the dice are random - consider a dice generator that merely lists the 36 rolls in order, iterates through the list and starts over at the beginning when it gets to roll #37.  This generator would have a perfect distribution, but it's far from random.

2) Any sample from actual games is subject to sample bias. My understanding is that FIBS has a single dice generator that is shared across all ongoing matches - it merely gives the next roll to whoever asks for it.  This stream is what we'd like to test for randomness, but all we have at the moment is samples from the stream,  and as others have noted those samples are biased: doubles can't be the first move, doubles are more likely to occur at the end of games (both via bearoff, and by prompting a cube next roll), big numbers like 6-5 are also more likely to occur at the end of games for the same reason, small rolls like 2-1 will be the last roll relatively infrequently.

In short, the rolls that show up in a particular game are not a fair sample of the main dice stream.  pck explains this pretty well above.

3) To answer dorbel's  question "how good do the dice have to be?", I'd say that pretty much any pseudorandom generator would be good enough, even the lowly rand() in the C library.  The problem with mediocre pseudorandom generators is that they tend to repeat patterns after a few hundred thousand or so rolls, but since the only thing exposed to the user is an erratic sample of the stream, it would be difficult to impossible to detect any patterns just by looking at the dice in your own matches.  Of course, it's not much harder to use a better pseudorandom generator than rand() (e.g. the Mersenne Twister), so there's not much excuse for failing to upgrade, even though it doesn't really matter all that much.

Anybody know what algorithm FIBS uses?

4) I prefer to spend my time and energy on reducing my own errors rather than worrying about the dice. YMMV.


pck

#22
Quote from: ah_clem on April 17, 2010, 04:21:36 PM
1) An even distribution of rolls across all 36 possible values is hardly proof that the dice are random - consider a dice generator that merely lists the 36 rolls in order, iterates through the list and starts over at the beginning when it gets to roll #37.  This generator would have a perfect distribution, but it's far from random.

This distribution would indeed pass the first part of the fibs "dice test", but not the second, "Distribution of runs of n identical rolls".

Quote from: ah_clem on April 17, 2010, 04:21:36 PM
In short, the rolls that show up in a particular game are not a fair sample of the main dice stream.  pck explains this pretty well above.

If one game does not represent the whole of the dicestream accurately, this would be due to sample size, not because of certain rolls occurring "unnaturally" frequently at the end of games. I actually argued against that. What I (and dorbel and Zorba) said was that more last roll doubles than 1 in 6 (in the average game) do not imply more doubles than 1 in 6 overall. The exception is the case of first-roll doubles, where doubles are systematically excluded from gameplay. This can (and must) be fixed by including the rolls which decide who gets the first game-roll.

In the discussion above, we did not touch the problem of sample size. That is a different can of worms.

What I tried to explain in #16 was that one can look at the same statistical problem in various ways, and must take good care to construct one's theoretical description of the problem according to the perspective chosen (and also not change perspectives later).

boomslang

Quote from: ah_clem on April 17, 2010, 04:21:36 PM
1) An even distribution of rolls across all 36 possible values is hardly proof that the dice are random - consider a dice generator that merely lists the 36 rolls in order, iterates through the list and starts over at the beginning when it gets to roll #37.  This generator would have a perfect distribution, but it's far from random.
Quote from: pck on April 17, 2010, 05:15:28 PM
This distribution would indeed pass the first part of the fibs "dice test", but not the second, "Distribution of runs of n identical rolls".

It will also fail on XG's Chi2 test as there is no variability in the data (resulting in X2 values equal to zero).
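ah_clem's cyclic generator takes only a few lines, and shows both failure modes at once (a sketch):

```python
# All 36 ordered rolls, emitted in a fixed repeating cycle.
cycle = [(a, b) for a in range(1, 7) for b in range(1, 7)]
rolls = [cycle[i % 36] for i in range(36 * 100)]   # exactly 100 full cycles

counts = {}
for r in rolls:
    counts[r] = counts.get(r, 0) + 1

# The distribution is perfect: every ordered roll appears exactly 100
# times, so every (observed - expected)^2 term is zero and X2 = 0 --
# "too good", with no variability at all. And the stream is completely
# predictable: roll i+1 is always cycle[(i + 1) % 36].
x2 = sum((c - 100) ** 2 / 100 for c in counts.values())
print(x2)   # 0.0
```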

Quote from: pck
What puzzles me is how it is possible to get good results for the player's as well as the opp's distribution, but a bad one for their combined distribution.

This is possible if the player's and the opponent's dice distribution had the same kind of bias, for example one die would have one side that is not straight. The two separate tests might not reveal the (small) anomaly because the sample size is not large enough; put both groups together and you have doubled the sample size, which then could be big enough to show an effect.

pck

#24
Quote from: boomslang on April 18, 2010, 01:27:45 AM
This is possible if the player's and the opponent's dice distribution had the same kind of bias, for example one die would have one side that is not straight. The two separate tests might not reveal the (small) anomaly because the sample size is not large enough; put both groups together and you have doubled the sample size, which then could be big enough to show an effect.

Good point. I believe this is the proper explanation for the first three results of XG_test.jpg in #6. By mistake, the same set R of 240.000 rolls entered the X2 test twice. R by itself has an X2 (25.48) which is well within acceptable bounds (P = 18.367%, the probability that an unbiased distribution produces an even larger X2 than the one observed). But the combined distribution's X2, being exactly twice as large as R's because it duplicates all of R's deviations from the expected values, is a near-impossibility (P = 0.016%).
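The doubling holds in general: duplicating a sample doubles every observed and expected count, so each (O-E)^2/E term becomes (2O-2E)^2/(2E) = 2(O-E)^2/E. A quick check with toy numbers (not the actual XG data):

```python
def chi2(observed, expected):
    """Plain X2 statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy 6-cell counts with some deviation from their expectation.
observed = [95, 102, 110, 98, 97, 98]
expected = [100.0] * 6

single = chi2(observed, expected)
# Duplicate the sample: both observed and expected counts double.
doubled = chi2([2 * o for o in observed], [2 * e for e in expected])
print(single, doubled)   # doubled is exactly twice single
```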