- Gnome Stew - https://gnomestew.com -

A Case Study In Dice Stat Tests Part 3: Suggested Analysis

Over the last few weeks we’ve been looking at the analysis done for the Kickstarter for “Honest Dice | Precision Machined Metal Dice You Can Trust [1]” as a case study. We’ve gone over the general idea of how one would go about testing dice and presenting results. We’ve also gone over the analysis that was done and what was wrong with it. This week we’re looking at how it might have been done, and what conclusions we can actually draw from the testing that was done and the data that was collected. Using the template established earlier, an ideal analysis would look something like this:

With everything planned, here are our test results:

For the D20s, our first test is a Chi-Square test of homogeneity with 57 degrees of freedom. This test results in a Chi-Square test statistic of 83.41, which equates to a p-value of .013. This is lower than our threshold of .05, so the result is statistically significant and we can say that evidence exists to reject the hypothesis that all of the D20s tested share the same distribution.
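Any stats package will convert a Chi-Square statistic and degrees of freedom into a p-value; as a sketch, the upper-tail probability can also be computed in pure Python using the standard recurrence for the chi-square survival function:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability (survival function) of the chi-square
    distribution. Uses the closed forms for df = 1 and df = 2 as base
    cases, then the standard recurrence
    Q(x, k+2) = Q(x, k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1  # df = 1 base case
    else:
        q, k = math.exp(-x / 2), 2             # df = 2 base case
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# The omnibus D20 test: statistic 83.41 with 57 degrees of freedom
print(round(chi2_sf(83.41, 57), 3))  # ≈ .013, matching the result above
```

In practice you would just call a library routine such as `scipy.stats.chi2.sf`; the point here is only that the statistic-to-p-value step is mechanical.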

Our next step is to do our three follow-up tests. Each of these is a Chi-Square test of homogeneity with 19 degrees of freedom. To control the Family Wise Error Rate we use a threshold of .017 (.05/3) for each of these tests. The test statistics for the Honest Dice D20 versus each of the other dice are 22.78 for CNC #1 (p-value .247), 20.15 for CNC #2 (p-value .386), and 31.85 for the plastic D20 (p-value .032). None of these falls below our threshold, so none is significant, and we cannot say there is any evidence that the Honest Dice D20 differs from any of the other D20s.

Test         df   Chi-Sq   P-Val   Threshold   Conclusion
All D20s     57   83.41    .013    .05         Evidence exists to suggest at least one difference
vs CNC1      19   22.78    .247    .017        No evidence of difference
vs CNC2      19   20.15    .386    .017        No evidence of difference
vs plastic   19   31.85    .032    .017        No evidence of difference
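The Chi-Square statistic in each of these tests comes from a contingency table of face counts. As a sketch with HYPOTHETICAL tallies (the actual roll data from the Kickstarter testing is not reproduced here), using two six-sided dice for brevity:

```python
# Chi-square test of homogeneity from a table of face counts.
# These tallies are hypothetical, chosen only to illustrate the arithmetic.
observed = [
    [18, 22, 20, 19, 21, 20],  # die A: count of each face 1-6
    [25, 15, 22, 18, 20, 20],  # die B
]

rows = len(observed)
cols = len(observed[0])
row_totals = [sum(r) for r in observed]
col_totals = [sum(r[j] for r in observed) for j in range(cols)]
grand = sum(row_totals)

# Expected count in each cell is (row total * column total) / grand total
chi_sq = 0.0
for i in range(rows):
    for j in range(cols):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (observed[i][j] - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (cols - 1),
# e.g. (4 - 1) * (20 - 1) = 57 for the four D20s above
df = (rows - 1) * (cols - 1)
print(df, round(chi_sq, 2))
```

The same computation with four rows of twenty counts each is exactly the 57-degree-of-freedom omnibus D20 test.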

We come to the same conclusion with the Holm step-down procedure. With this procedure, we order our p-values from lowest to highest and compare each to an increasing threshold, rejecting as we go; at the first p-value that exceeds its threshold, that test and every test after it fail to reject.

Index   Test         p-val   Threshold (.05/(3+1-index))   Conclusion
1       vs plastic   .032    .017                          p-val > threshold: fail to reject
2       vs CNC1      .247    .025                          failed to reject above: fail to reject
3       vs CNC2      .386    .05                           failed to reject above: fail to reject

(Fail to reject at the first index where p-val > threshold, and at every index after it.)
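The procedure is simple enough to sketch directly. A minimal implementation, run against the three follow-up p-values from the table above:

```python
def holm_step_down(labels, p_values, alpha=0.05):
    """Holm step-down procedure: sort p-values ascending, compare the
    i-th smallest (1-indexed) against alpha / (m + 1 - i), and stop
    rejecting at the first p-value that exceeds its threshold.
    Returns {label: rejected?} for each test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    results = {}
    rejecting = True
    for rank, i in enumerate(order, start=1):
        threshold = alpha / (m + 1 - rank)
        if p_values[i] > threshold:
            rejecting = False  # this test and all after it fail to reject
        results[labels[i]] = rejecting
    return results

# The three D20 follow-up tests: all fail to reject, matching the table
print(holm_step_down(["vs CNC1", "vs CNC2", "vs plastic"],
                     [.247, .386, .032]))
```

Note the first (smallest) p-value faces the strictest threshold, .05/3, which is the same Bonferroni-style threshold used in the follow-up tests above; the later thresholds relax step by step.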

How does this happen? How can we reject the hypothesis that all dice are the same, yet find no differences between individual dice? There are two possible explanations. First, due to the sample size deficiencies, there may simply be insufficient power to detect differences that are there. Second, the initial test of “at least one difference” covers four dice, but to minimize Family Wise Error Rate we only performed the three follow-up tests of interest, so the difference could lie in one of the three combinations of dice we didn’t test (CNC1 vs CNC2, CNC1 vs plastic, or CNC2 vs plastic). Given the p-values calculated in the follow-up tests, it seems highly likely this is a sample size power issue caused by the stricter thresholds required to avoid Family Wise Error Rate. In any case, further testing is recommended for these D20s.
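The counting behind the second explanation is worth making explicit: four dice yield six possible pairwise comparisons, of which only the three involving the Honest Dice D20 were followed up.

```python
from itertools import combinations

dice = ["Honest Dice", "CNC #1", "CNC #2", "plastic"]

# All possible pairwise comparisons among the four D20s
all_pairs = list(combinations(dice, 2))  # 6 pairs

# Only comparisons against the Honest Dice D20 were of interest,
# so only those 3 were tested
tested = [p for p in all_pairs if "Honest Dice" in p]
untested = [p for p in all_pairs if "Honest Dice" not in p]

print(untested)  # the three pairs where an undetected difference could hide
```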

For the D6s, our first test is a Chi-Square test of homogeneity with 5 degrees of freedom. This test results in a Chi-Square test statistic of 1.10, which equates to a p-value of .954. This is higher than our threshold of .05, so the result is not statistically significant and we have not found any evidence of a difference between the two D6s tested.

Test      df   Chi-Sq   P-Val   Threshold   Conclusion
All D6s   5    1.10     .954    .05         No evidence of difference

For the D4s, our first test is a Chi-Square test of homogeneity with 6 degrees of freedom. This test results in a Chi-Square test statistic of 17.18, which equates to a p-value of .009. This is lower than our threshold of .05, so the result is statistically significant and we can say that evidence exists to reject the hypothesis that all of the D4s tested share the same distribution.

Our next step is to do our two follow-up tests. Each of these is a Chi-Square test of homogeneity with 3 degrees of freedom. To control the Family Wise Error Rate we use a threshold of .025 (.05/2) for each of these tests. The test statistics for the Honest Dice D4 versus each of the other dice are 1.98 for the CNC die (p-value .576) and 9.25 for the plastic die (p-value .026). Neither falls below our threshold, so neither is significant, and we cannot say there is any evidence that the Honest Dice D4 differs from any of the other D4s.

Test         df   Chi-Sq   P-Val   Threshold   Conclusion
All D4s      6    17.18    .009    .05         Evidence exists to suggest at least one difference
vs CNC       3    1.98     .576    .025        No evidence of difference
vs plastic   3    9.25     .026    .025        No evidence of difference

We come to the same conclusion with the Holm step-down procedure, ordering the p-values from lowest to highest and comparing each to its increasing threshold as before.

Index   Test         p-val   Threshold (.05/(2+1-index))   Conclusion
1       vs plastic   .026    .025                          p-val > threshold: fail to reject
2       vs CNC       .576    .05                           failed to reject above: fail to reject

(Fail to reject at the first index where p-val > threshold, and at every index after it.)

This is a very similar situation to the D20s: evidence exists to suggest some difference among the dice tested, but the follow-up tests performed to locate that difference find nothing. Again, it seems highly likely that this is a sample size power issue caused by the stricter thresholds required to avoid Family Wise Error Rate. Further testing is recommended for these D4s.

Our final conclusion is fairly straightforward: evidence exists to suggest some differences between dice, but in general the sample sizes were insufficient for the demands of the Chi-Square test of homogeneity combined with Family Wise Error Rate control. Further testing with larger sample sizes is recommended.
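The sample size recommendation can be made concrete with a quick Monte Carlo sketch. Everything below is hypothetical: the bias (a D20 that rolls a 20 twice as often as it should), the roll counts, and the simulation parameters are illustrative, not drawn from the actual test.

```python
import math
import random

def chi2_sf(x, df):
    """Chi-square upper-tail probability via the standard recurrence."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

def homogeneity_p(counts_a, counts_b):
    """P-value of a two-sample chi-square test of homogeneity."""
    faces = len(counts_a)
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    chi_sq = 0.0
    for j in range(faces):
        col = counts_a[j] + counts_b[j]
        if col == 0:
            continue  # a face nobody rolled contributes nothing
        for obs, tot in ((counts_a[j], total_a), (counts_b[j], total_b)):
            expected = tot * col / grand
            chi_sq += (obs - expected) ** 2 / expected
    return chi2_sf(chi_sq, faces - 1)

def estimate_power(rolls_per_die, sims=500, alpha=0.017, seed=1):
    """Estimated probability of distinguishing a HYPOTHETICAL biased D20
    (face 20 at probability .10 instead of .05) from a fair D20, at the
    per-test threshold used for the follow-up tests above."""
    rng = random.Random(seed)
    fair = [1 / 20] * 20
    biased = [0.90 / 19] * 19 + [0.10]  # faces 1-19 share the remainder
    hits = 0
    for _ in range(sims):
        a, b = [0] * 20, [0] * 20
        for face in rng.choices(range(20), weights=fair, k=rolls_per_die):
            a[face] += 1
        for face in rng.choices(range(20), weights=biased, k=rolls_per_die):
            b[face] += 1
        if homogeneity_p(a, b) < alpha:
            hits += 1
    return hits / sims

# Power grows with sample size; small samples rarely catch even this bias
print(estimate_power(100), estimate_power(1000))
```

Swapping in whatever face probabilities are of practical concern gives a rough sample size target before any rolling starts, which is exactly the planning step the analysis above was missing.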

Hopefully this deep dive into the best practices of dice analysis will be helpful for those looking for a resource on how to do their own analysis in the future as well as those reading analysis put out by others.

First week was part 1: General Approach [2].

Second week was part 2: Review of the Honest Dice Analysis [3].
