Just because GnuBG 0-ply plays both sides does not mean it plays both sides equally well in individual games. Only in the long run will the two sides show equal skill levels. So the outcome of a particular match is not completely luck-dependent: one side may have given up much more equity in errors than the other.
There is one case where gnu (or any bot) should play both sides equally well: if the analysis and player plies are set to the same level, gnu should not find that either player has made any mistake. And indeed it doesn't: both ERs are always 0. Hence all equity change in the match should be attributed to luck, and the luck-adjusted result (LAR) should always be zero.
However, the phenomenon blitzxz and I described persists when ply_analysis = ply_players.
Furthermore, GnuBG's luck evaluations, just like its error evaluations, are not perfect, so they are another source of inaccuracy, especially noticeable in the short run.
I assume ply_analysis = ply_players. Analyses may not be perfect, but the hard question is why gnu cannot make the numbers consistent (as opposed to correct) within any of its n-ply worlds, however flawed those worlds may be in an absolute sense. Why is it not possible, with both ERs at 0 and skill out of the picture, to add up all the equity changes produced by the dice rolls and get a total change of +50% for the winner and -50% for the loser?
This question is reinforced by the Zare articles you linked to:
Final − Initial = Net Luck + Net Skill in http://www.bkgm.com/articles/Zare/HedgingTowardSkill.html
(A formula like this, which is based on the concept that what is not skill is precisely luck, was actually the reason I posed my question in the first place.)
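To make the accounting concrete, here is a toy sketch in Python. It has nothing to do with GnuBG's internals; the per-move numbers are made up, and by construction every equity change splits into luck plus error, as in Zare's formula. It just shows that under the zero-error condition discussed above, the luck-adjusted result must come out to exactly the initial 50%:

```python
# Hypothetical per-move records for a match, from the winner's perspective:
# (equity_change, luck, error), with change == luck + error by construction.
# With both ERs at 0, every error term is 0.
moves = [(+0.12, +0.12, 0.0),
         (-0.08, -0.08, 0.0),
         (+0.46, +0.46, 0.0)]

initial = 0.5                                    # match-winning chance at the start
final = initial + sum(dc for dc, _, _ in moves)  # 1.0: this player won
net_luck = sum(l for _, l, _ in moves)
net_skill = sum(e for _, _, e in moves)

# Zare: Final - Initial = Net Luck + Net Skill
assert abs((final - initial) - (net_luck + net_skill)) < 1e-12

luck_adjusted = final - net_luck
print(round(luck_adjusted, 10))  # 0.5: with zero errors, LAR is exactly the initial 50%
```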
BTW, for practical purposes: GnuBG defaults to 0-ply for its luck calculations. Luck calculations can be considered harder than normal evaluations, since 21 different dice rolls and their best plays have to be considered, so it is no surprise that 0-ply luck analysis can give rather inaccurate results. Use GnuBG's command line and the command boomslang mentioned to increase the ply level of GnuBG's luck analysis.
With different methods/ply-levels for luck and move evaluations, it is clear that discrepancies can occur (see below).
Another interesting thing to consider here is that an n-ply luck analysis is closer to an (n+1)-ply error analysis than to an n-ply error analysis, because of the 21 different rolls that have to be analyzed. The same goes for running time: a 2-ply luck analysis is about as slow as a 3-ply error analysis.
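That cost argument can be sketched in code. The sketch below is my own illustration, not GnuBG source: `evaluate` and `best_play` are hypothetical stand-ins for the bot's evaluator, and the toy stubs at the end exist only so the snippet runs. The key point is the loop over the 21 rolls: computing the luck of one roll at n-ply forces 21 best-play searches, roughly the work of one (n+1)-ply error evaluation.

```python
ROLLS = [(i, j) for i in range(1, 7) for j in range(i, 7)]  # 21 distinct rolls

def weight(roll):
    """Of the 36 equally likely ordered rolls, non-doubles occur twice."""
    return 1 if roll[0] == roll[1] else 2

def luck(position, roll, ply, evaluate, best_play):
    """Luck of `roll`: equity reachable with its best play, minus the
    probability-weighted average of that quantity over all 21 rolls.
    Note the 21 best-play searches hidden in the sum: this is why n-ply
    luck costs roughly as much as an (n+1)-ply error evaluation."""
    def eq(r):
        return evaluate(best_play(position, r, ply), ply)
    expectation = sum(weight(r) * eq(r) for r in ROLLS) / 36
    return eq(roll) - expectation

# Toy stand-ins, only so the sketch runs: "equity" is just the pip total.
def toy_eval(pos, ply):
    return pos

def toy_best(pos, roll, ply):
    return pos + roll[0] + roll[1]

print(luck(0.0, (6, 6), 0, toy_eval, toy_best))  # 5.0: 6-6 beats the average roll (7 pips) by 5
```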
I ran a few tests. First I set luckanalysis and everything else to 2-ply (which took much longer than 3-ply analysis). The discrepancies still showed up but seemed smaller than before (around 3-6% as opposed to 7-15%). I could run only very few test matches, though, because evaluation took so long (Intel Core 2 Duo 7300, 2.6 GHz, gnu set to use both cores). Next I tried luckanalysis at 1-ply and everything else at 2-ply (your n -> n+1 suggestion). In my first trial match, the LAR discrepancy was almost zero. But my celebratory mood subsided when, in further trials, it went back to 5-7%, and in one particular 1-point match I got an LAR as big as 23%.
So in the experiments the problem keeps showing up.
The conceptual question is still open. Luck is that which isn't skill. So why have separate, independent luck and skill analyses? Why not run only a skill analysis (= best-move analysis) and report luck as whatever equity change is not due to skill? (Or vice versa.) Then the numbers would be consistent by construction. Using two different evaluation modes breaks formulas such as Zare's and creates confusion. If there are conceptual reasons for doing it this way, we have not begun to touch them yet.
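A sketch of what I mean, with made-up numbers. Suppose each move is recorded as (equity before the roll, equity after the best play, equity after the actual play); skill per move is then actual minus best, and luck is defined residually as whatever remains of the equity change. With that definition, Zare's identity holds exactly, whatever the evaluator's ply level or flaws:

```python
def consistent_accounting(initial, moves):
    """moves: list of (eq_before_roll, eq_after_best_play, eq_after_actual_play),
    all equities from the same player's perspective."""
    net_skill = sum(actual - best for _, best, actual in moves)
    final = initial + sum(actual - before for before, _, actual in moves)
    net_luck = (final - initial) - net_skill  # luck := what isn't skill
    return final, net_luck, net_skill

# Made-up 1-point match in which the player erred once (second move):
match = [(0.50, 0.62, 0.62),
         (0.62, 0.58, 0.55),   # reached 0.55 when 0.58 was available
         (0.55, 1.00, 1.00)]
final, net_luck, net_skill = consistent_accounting(0.50, match)

# Final - Initial == Net Luck + Net Skill, exactly, by construction.
assert abs((final - 0.50) - (net_luck + net_skill)) < 1e-12
```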