- Gnome Stew - https://gnomestew.com -

A Case Study In Dice Stat Tests Part 2: Review of Honest Dice Analysis

In last week’s article we looked at a general approach to doing statistical analysis on dice. This week we’re going to look at the analysis that was done for the Kickstarter “Honest Dice | Precision Machined Metal Dice You Can Trust” [1] and see how it stacks up against this ideal. As is the case with many things, there isn’t one right way to do a statistical test, analysis, and presentation, but there are quite a few wrong ways to do it. I’m just going to go down the general structure from last week, with notes where applicable.

I will leave you with one final bit of evidence that Honest Dice are in fact good quality dice. The calculated results show that the Honest Die of each die type had the highest p-value in its group. Let’s assume for a moment that all dice of a given type have the same distribution. If x dice all have exactly the same distribution, it seems intuitive that when we run a goodness-of-fit test on each die, each die has an equal chance (1/x) of producing the highest p-value. Thus, if there is no difference among the d20s, the chance of the Honest Die getting the highest p-value would be 1/4; for the d6s it would be 1/2; for the d4s it would be 1/3.

Since we assume the dice within each type are identical, we can also safely assume the results across die types are independent. If these assumptions hold, the probability of the Honest Die getting the highest p-value in all three die types is the product of the individual probabilities: 1/4 * 1/2 * 1/3 = 1/24 ≈ .042. Thus, given H0: all dice are the same, the probability of observing the collection of Honest Die p-values we observed is .042. This is lower than a reasonable .05 threshold for significance (.05 is the usual default, and it makes sense to use it here). Thus evidence exists to reject H0: that all the dice of each type share a distribution.
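The arithmetic above is simple enough to check by hand, but a short sketch makes the reasoning explicit. The group sizes below (4 d20s, 2 d6s, 3 d4s) come from the article; everything else is just the multiplication rule for independent events.

```python
from fractions import Fraction

# Number of dice tested in each group, per the article:
# 4 d20s, 2 d6s, 3 d4s.
group_sizes = {"d20": 4, "d6": 2, "d4": 3}

# Under H0 (all dice in a group share one distribution), each die in a
# group of size n is equally likely to produce the highest goodness-of-fit
# p-value, so one specific die "wins" its group with probability 1/n.
per_group = {die: Fraction(1, n) for die, n in group_sizes.items()}

# Assuming independence across die types, multiply the group probabilities
# to get the chance of winning all three groups at once.
p_combined = Fraction(1, 1)
for p in per_group.values():
    p_combined *= p

print(p_combined)         # exact probability as a fraction
print(float(p_combined))  # decimal form, compared against the .05 threshold
```

Using exact fractions avoids any rounding question: the result is exactly 1/24, which is about .042 and falls below the .05 significance threshold.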

Last week was part 1: General Approach [3].

Next week is part 3: Suggested Analysis [4].