Bots are perfect

blitzxz · July 06, 2008, 02:58:37 AM

There was some discussion in another topic that bots can't play massive backgames. Someone was asking for proof, so here it is: the most incredible mistake I have ever seen gnubg make.

Money session: I'm on roll (black) and have a triple shot. This looks like a redouble... but gnu disagrees. It beavers me! And the cube is skyrocketing...

Could someone try the position in Snowie too?

2-ply evaluation

Cube analysis
2-ply cubeless equity +0,098
0,613 0,106 0,002 - 0,387 0,214 0,022
Cubeful equities:
1. No double +0,318
2. Double, pass +1,000 ( +0,682)
3. Double, take -0,067 ( -0,385)
Proper cube action: No redouble, beaver (36,1%)

2-ply mini rollout

Cube analysis
Rollout cubeless equity +0,431

Cubeful equities:
1. Double, take +0,875
2. Double, pass +1,000 ( +0,125)
3. No double +0,853 ( -0,022)
Proper cube action: Redouble, take
Rollout details:
Player You owns 2-cube:
0,711 0,159 0,009 - 0,289 0,144 0,015 CL +0,431 CF +0,853
[0,011 0,008 0,005 - 0,011 0,009 0,003 CL 0,030 CF 0,051]
Player gnubg owns 4-cube:
0,747 0,177 0,021 - 0,253 0,125 0,015 CL +1,103 CF +0,875
[0,011 0,011 0,006 - 0,011 0,008 0,004 CL 0,060 CF 0,090]
Full cubeful rollout with var.redn.
100 games, Mersenne Twister dice gen. with seed 828094824 and quasi-random dice
Play: 2-ply cubeful prune [world class]
keep the first 0 0-ply moves and up to 8 more moves within equity 0,12
Skip pruning for 1-ply moves.
Cube: 2-ply cubeful prune [world class]
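
For readers who want to see how the "Proper cube action" lines follow from the three cubeful equities, here is a minimal sketch. It is not gnubg's actual code: the function name `cube_action` is made up for illustration, it assumes money play with all equities given from the doubler's point of view, and it uses a simplified beaver rule (the opponent beavers whenever the doubler's double/take equity is negative).

```python
def cube_action(no_double, double_take, double_pass=1.0):
    """Rough cube recommendation from three cubeful equities.

    Hypothetical helper, not part of gnubg. All equities are from the
    doubler's point of view, as in the analysis output above.
    """
    # The opponent answers a double with whatever is worse for the doubler.
    opponent_takes = double_take < double_pass
    equity_after_double = min(double_take, double_pass)

    if equity_after_double > no_double:
        return "double, take" if opponent_takes else "double, pass"
    if not opponent_takes:
        # Holding the cube is worth more than cashing the game.
        return "too good to double, pass"
    # Doubling loses equity. Simplifying assumption: if the doubler would be
    # the underdog after double/take, the opponent beavers.
    return "no double, beaver" if double_take < 0 else "no double, take"


# 2-ply evaluation from the post: ND +0.318, D/T -0.067
print(cube_action(0.318, -0.067))
# Mini rollout from the post: ND +0.853, D/T +0.875
print(cube_action(0.853, 0.875))
```

With the 2-ply numbers this gives "no double, beaver" and with the rollout numbers "double, take", so the two analyses disagree about the cube only because their equities differ.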

inim · July 06, 2008, 11:48:12 PM

Some remarks in no particular order.

Gnubg Backgame problems: Joseph's paper summarizing his training of the gnubg NN documents the weaknesses well. He even broke the problems down into categories and calculated the average error for each. All of the categories below may lead to blunders in backgames and elsewhere, but the emphasis in Joseph's work is on the effect on backgames. See http://pages.quicksilver.net.nz/pepe/ngb/index.html, chapter "The Crashed Net". Categories with known problems include:
- roll-prime - Side on move has a 6 prime.
- trapped - Side on move is trapped behind a 6 prime.
- contain - Opponent is ahead in the race, and should be contained. Side on move has at least 6 checkers in front of one of the opponent's checkers.
- escape - Opposite of contain. Side is ahead in the race and will win if he escapes.
- scramble - Side on move is well ahead in the race, and wants to break contact without getting hit.
- must-hit - Opposite of scramble. Opponent is ahead in the race, and side (lacking the numbers to contain) needs to hit in order to stay alive.
- other - anything else.
Relevance to FIBS: This is FIBSboard, not gnubg.org. I would think your findings should be posted to either the gnubg bug tracker or their mailing list; this is not the right place. I am not aware of any gnubg developer monitoring this board, so your work is more or less lost, and the chance of some expert (= gnubg developer) answering is infinitesimally small.
Form of reporting: Please add the gnubg match-id and position-id to any position you wanna see discussed. It's very cumbersome to enter it again from the screenshot using the gnubg edit mode. The screenshot is a nice-to-have, the IDs IMO a must.
Limited value of gnubg rollouts: A gnubg rollout is better than a gnubg NN eval, given the smoothing influence of the larger number of experiments. But it still uses the same NN. So very likely an NN which fails to deliver a realistic equity for a position in the NN eval will also blunder in its rollout. Thus rollouts are of limited value when they use the same NN that is being analyzed. The best thing would be to use snowie or bgblitz to validate positions in which gnubg blunders. The other thing is that 100 rollout games seem too few to me to deliver relevant data; the random error is still pretty high. Use a few thousand at minimum for serious analysis.
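
To put a rough number on the "100 games is too few" point, here is a small sketch. It assumes the bracketed figures in the rollout output above are standard errors (CL 0,030 after 100 games) and that the error shrinks with the square root of the number of games, as for independent trials; the function name is made up for illustration. It only covers the random error, not the systematic error of the NN itself.

```python
import math


def projected_standard_error(se_now, games_now, games_target):
    """Scale a rollout standard error to a different number of games.

    Hypothetical helper; assumes independent games, so the standard error
    is proportional to 1 / sqrt(number of games).
    """
    return se_now * math.sqrt(games_now / games_target)


# Cubeless standard error reported above: 0.030 after 100 games.
for games in (100, 1296, 5184):
    se = projected_standard_error(0.030, 100, games)
    print(f"{games:5d} games: standard error about {se:.4f}")
```

Under these assumptions the error halves each time the number of games is quadrupled, which is why a few thousand games give a much tighter result than 100.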
Cube decisions / Derived values: Cube decisions (and other derived values such as EMG/MWC, luck, n-ply for n > 0, etc. pp.) are just derived values from the initial NN 0-ply evaluation of a position's cubeless equity (and the 5 win/gammon/backgammon probabilities). Given a wrong equity, the derived formula will report nonsense following the well known sh**-in-sh**-out principle. So if you analyze problems, only cubeless equity returned by the NN and the rollout are relevant as this is where the genuine error happens.
Significance / Delta: The delta in cubeless equity between the NN eval (+0.098) and the rollout (+0.431) is HUGE (0.333, so > 0.3!!!). So you surely found a very significant example. Please make sure to file it with the gnubg guys. The larger the delta between NN and rollout, the more interesting a particular position is.
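
The cubeless equities quoted above can be reproduced from the six probabilities in blitzxz's output. A minimal sketch, assuming the usual convention that the gammon figures already include backgammons (the function name is made up for illustration):

```python
def cubeless_equity(win, win_g, win_bg, lose, lose_g, lose_bg):
    """Cubeless money equity from win/gammon/backgammon probabilities.

    Hypothetical helper. Assumes gammon probabilities include backgammons,
    so each gammon adds one extra point and each backgammon one more.
    """
    return (win - lose) + (win_g - lose_g) + (win_bg - lose_bg)


# Probabilities copied from the 2-ply evaluation and the mini rollout above.
nn_eval = cubeless_equity(0.613, 0.106, 0.002, 0.387, 0.214, 0.022)
rollout = cubeless_equity(0.711, 0.159, 0.009, 0.289, 0.144, 0.015)

print(f"NN eval : {nn_eval:+.3f}")
print(f"Rollout : {rollout:+.3f}")
print(f"Delta   : {rollout - nn_eval:+.3f}")
```

This matches the +0,098 and +0,431 reported by gnubg, giving a delta of about 0.333.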
Usefulness of reporting single positions: Unfortunately, while single positions are good for pointing out weaknesses, they are of little help in improving the gnubg NN. A learning set typically consists of hundreds of thousands of positions. So if you wanna help the developers, it would be great if you could contribute as many positions as possible, probably as a text file in a format negotiated with them. Key parameters would be Pos-ID, Match-ID, Rollout-Equity and NN-Equity, and the derived delta of the latter two. The most valuable (yet hardest to come by) param would be the equities of a rollout or NN eval with ANOTHER NN.
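
As a sketch of what such a text file could look like, the snippet below writes one CSV line per position. The column names and the layout are assumptions for illustration only; the real format would have to be agreed with the gnubg developers, and the IDs shown are placeholders, not real gnubg IDs.

```python
import csv
import io

# Hypothetical column layout; not an agreed gnubg format.
FIELDS = ["position_id", "match_id", "nn_equity", "rollout_equity", "delta"]


def write_report(rows, out):
    """Write position records as CSV, deriving delta = rollout - NN eval."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        record = dict(row)
        record["delta"] = round(row["rollout_equity"] - row["nn_equity"], 3)
        writer.writerow(record)


# Placeholder IDs; the equities are the cubeless values from this thread.
positions = [
    {
        "position_id": "<position-id>",
        "match_id": "<match-id>",
        "nn_equity": 0.098,
        "rollout_equity": 0.431,
    },
]

buffer = io.StringIO()
write_report(positions, buffer)
print(buffer.getvalue())
```

Sorting such a file by the delta column would put the most interesting positions first.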

FrankBerger · July 08, 2008, 06:07:23 PM

Quote from: blitzxz on July 06, 2008, 02:58:37 AM
There was some discussion in another topic that bots can't play massive backgames. Someone was asking for proof, so here it is: the most incredible mistake I have ever seen gnubg make.

Well, some people really regard bots as perfect, especially two of them, but this is obviously not the case.

Your position shows that GnuBG has problems with extreme backgammon positions. This is a well-known fact (at least to some people). If you dig very deep you can find a backgame on r.g.b. where GnuBG plays close to random (it takes a little work to find; IIRC it was posted by "bucko" around 2004), and I guess the most incredible mistake is probably in that game.

What I dislike about your posting is the connotation "bots = GnuBG" and the term "proof". You've shown that one bot misplayed one particular position. Although I agree that this is not an accident due to a broken installation but a systematic error, one should remember that

- GnuBG plays less extreme backgames very reasonably
- this does not prove at all that all bots misplay such situations

At least by my standards, this is not a proof.

ciao
Frank

blitzxz · July 09, 2008, 05:17:00 AM

You're right. I shouldn't have said bots but gnubg. I tried this position and several others with snowie4 and it was playing clearly better than gnubg. How would bgblitz perform in massive backgame positions? That would also be a very interesting test. But this position is certainly a proof. It's a proof that gnubg is not perfect. You only need one error to prove that. But everybody knew that already, right? Maybe, but sometimes I have a feeling that not everybody remembers it. And every time I have said that gnubg has problems with these kinds of positions, there are at least a dozen people saying that this is not the case. Although I agree that it is a well-known fact among the developers at least (so I have no reason to send it to the gnubg mailing list).

FrankBerger · July 09, 2008, 10:45:49 AM

I haven't tested it systematically (lack of time), but BGBlitz seems to behave pretty robustly in unusual positions. In this position BGB evaluates it as D/T. The rollout deviates, but to a much lesser extent.

And I absolutely agree with you that bot evaluations are often taken for granted, ignoring that we don't yet have the perfect bot. Each bot has its weaknesses and sometimes fails in astonishingly simple positions (e.g. recently a simple bearoff at r.g.b where S4 (and BGB too) fails). There is still a way to go, and anyone really serious should cross-check.

Unfortunately the developer scene is much less vibrant than in other games. I don't really have an idea why this is. I personally find it a real challenge; I only regret that I have only my spare time for BGB. But compare it with other brain games.... how many AIs do we have? Compare this with Chess, Shogi, Go ...... It might be that people loving BG are more money oriented than in other games, and from the material point of view it's depressing.

Or it might be that, unlike in chess, there is no established standard program to host a bare-bones AI (like Winboard, Fritz(?) and others). BGBlitz is able to host different AIs, but obviously no one has taken the opportunity. Developing an AI is much, much, much less work than coming up with a full-blown BG program. The latter is a huge amount of work (at least 5-10 man-years), so that might explain why no new contender has come up, but programming a bare-bones AI is a matter of months....

ciao
Frank

FIBS Board backgammon forum
