
Over the last few weeks we’ve been looking at the analysis done for the Kickstarter for “Honest Dice | Precision Machined Metal Dice You Can Trust” as a case study. We’ve gone over the general idea of how one would go about testing dice and presenting results, and we’ve gone over the analysis that was done and what was wrong with it. This week we’re looking at how the analysis might have been done and what conclusions we can actually draw from the testing that was performed and the data that was collected. Using the template established earlier, an ideal analysis would look something like this:
- The intent of our testing is to show that Honest Dice are fairer (that is, closer to the ideal distribution) than several other die options.
- Since we’re testing dice against each other, the appropriate test is the Chi-Square Test of Homogeneity. This test is specifically designed to determine whether sets of dice or other phenomena have the same distribution. Specifically, this test checks the hypothesis H0: all dice have the same distribution. Statistical significance in this test will tell us that at least one of the dice has a different distribution.
- Since the base test only determines whether at least one die is different, if a difference is detected in the D20s or the D4s, follow-up tests of homogeneity will need to be performed to determine where the differences lie. While there are 6 possible pairings of dice that might be different for the D20s and 3 possible pairings for the D4s, each additional test increases our potential family-wise error rate, and we’re really only interested in differences between the Honest Dice and the other dice being tested. So if a difference is detected in the D20s, we’ll do three follow-up tests: the Honest Dice D20 vs each of the three other D20 options. For the D4s, if a difference is detected, we’ll do two follow-ups: the Honest D4 vs each of the other D4 options. Finally, with these follow-up tests, significance still only shows a difference between the two dice, not which is better. In those cases, it’s finally time to do goodness of fit tests to test the two dice against one another. In this final set of tests we don’t need to worry about thresholds or family-wise error rates because we’re (finally!) just looking to compare the two p-values. This comparison of p-values is only valid if the preceding tests showed significance.
- For these tests, we’re going to use a .05 threshold for significance. This is a common middle-of-the-road threshold. Given the probabilities involved in D20s (a .05 chance of rolling any given side), a .1 threshold seems unreasonably high. Frankly, .05 seems high too, but because of limited data (see below), going with .01 seems unlikely to give a fair chance of finding significance. If I were designing this test from scratch, I would use a .01 threshold and simply increase the sample size, but that’s unfortunately not an option.
- For the follow-up tests, since we have family-wise error rate concerns, we need to pick an adjustment to our significance threshold to account for it. Since all of our follow-up tests are going to be the Honest Die option vs another die, it’s reasonable to assume that if the Honest Die is the one with the different distribution, then the follow-up tests are not independent of one another. Thus for a raw threshold number, we’ll use the Bonferroni adjustment, which is simply to divide the intended threshold by the number of tests. For the D20 follow-ups with three tests, our threshold will be .05/3 = .017, and for the D4 follow-ups with two tests, our threshold will be .025. We’ll also use the Holm step-down procedure, which is similar to the Bonferroni adjustment and which makes fewer assumptions about the distribution than alternative step tests. We don’t need to use two methods, and in fact they may give us conflicting results, but I’m interested in trying both techniques since I don’t have experience with them, and I’ll report the results of both rather than using both and only reporting one.
- For our D20 tests, our sample size will be 2000; for our D6 tests, 1000; and for our D4 tests, 500. We’re using these sample sizes because that’s the size of the data set we have available, not because they are ideal minimum sample sizes. To determine ideal minimum sample sizes, we need to know the effect size (.1), the significance threshold (.05, .017, or .025 as appropriate), the desired power (.95), and the degrees of freedom ((faces-1)*(dice-1)) for each test to feed into G*Power. Thus for each test, the ideal sample sizes I’d like to have for both the .05 and .01 base thresholds are:
| Test | Threshold | Degrees of Freedom | Sample Size at .05 | Sample Size at .01 |
| D20 Test | .05 | (20-1)*(4-1)=57 | 4533 | 5647 |
| D20 Follow-ups | .017 | (20-1)*(2-1)=19 | 3571 | 4354 |
| D6 Test | .05 | (6-1)*(2-1)=5 | 1979 | 2577 |
| D4 Test | .05 | (4-1)*(3-1)=6 | 2086 | 2705 |
| D4 Follow-ups | .025 | (4-1)*(2-1)=3 | 1962 | 2491 |
- Performing all these tests requires a data set. Follow-up analysis like this is precisely why providing your data set is a best practice. However, despite the fact that the data set was not shared, one can be reverse engineered from the graphs provided using the following method (a code sketch of this process follows the list):
- Take the image of the graph provided and find the y coordinate of the pixel that forms the 0 line on the graph.
- For each bar, find the y coordinate of the pixel that forms the top of the bar.
- For each bar, subtract the top of bar coordinate from the 0 line coordinate. This gives you the height of each bar in pixels.
- Sum the total pixels for all bars.
- For each bar, divide the bar height by the total pixels for all bars. This gives you the proportion of all pixels contained in that bar.
- For each bar, multiply the proportion of pixels in the bar by the total number of rolls and round to a whole number. This gives an approximate number of rolls for that die result.
- Find the weighted average roll across all bars and compare it to the reported mean roll of the die to verify that your estimated roll set closely approximates the original data.
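Here is a minimal sketch of that extraction process in Python, assuming a bar chart image with solid dark bars on a light background. The file name, bar column positions, and darkness threshold are hypothetical placeholders that would need to be adapted to the actual graph images.

```python
# A sketch of the bar-height extraction method described above. Assumes
# dark bars on a light background; the threshold of 128 is a guess that
# would need tuning for a real image.
from PIL import Image
import numpy as np

def extract_rolls(image_path, bar_columns, baseline_y, total_rolls):
    """Estimate per-face roll counts from a bar chart image.

    bar_columns: one representative x pixel coordinate per bar, left to right.
    baseline_y:  y pixel coordinate of the chart's zero line.
    total_rolls: total number of rolls reported for the die.
    """
    img = np.asarray(Image.open(image_path).convert("L"))  # grayscale array
    heights = []
    for x in bar_columns:
        column = img[:baseline_y, x]           # pixels above the zero line
        dark = np.nonzero(column < 128)[0]     # pixels belonging to the bar
        top_y = dark.min() if dark.size else baseline_y
        heights.append(baseline_y - top_y)     # bar height in pixels
    heights = np.array(heights, dtype=float)
    proportions = heights / heights.sum()      # share of all bar pixels
    counts = np.rint(proportions * total_rolls).astype(int)
    # Verification step: the weighted mean of the estimated counts should
    # closely match the mean roll reported in the original analysis.
    faces = np.arange(1, len(counts) + 1)
    print("estimated mean roll:", (faces * counts).sum() / counts.sum())
    return counts
```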
With everything planned, here are our test results:
For the D20s, our first test is a Chi-Square test of homogeneity with 57 degrees of freedom. This test results in a Chi-Square test statistic of 83.41, which equates to a p-value of .013. This is lower than our threshold of .05, so this result is statistically significant, and we can say that evidence exists to reject the hypothesis that all the D20s tested share the same distribution.
Our next step is to do our three follow-up tests. Each of these is a Chi-Square test of homogeneity with 19 degrees of freedom. To control for family-wise error rate, we’re using a threshold of .017 for each of these tests. Our test statistics for the Honest Dice D20 vs each of the other dice are 22.78 for CNC #1, which equates to a p-value of .247; 20.15 for CNC #2, a p-value of .386; and 31.85 for the plastic D20, a p-value of .032. None of these is lower than our threshold, so none is significant, and we cannot say there is any evidence that the Honest Dice D20 is any different from any of the other D20s.
| Test | df | Chi-Sq | P-Val | Threshold | Conclusion |
| All D20s | 57 | 83.41 | .013 | .05 | Evidence exists to suggest at least one difference |
| vs CNC1 | 19 | 22.78 | .247 | .017 | No evidence of difference |
| vs CNC2 | 19 | 20.15 | .386 | .017 | No evidence of difference |
| vs plastic | 19 | 31.85 | .032 | .017 | No evidence of difference |
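As an illustration, here is how the omnibus and follow-up homogeneity tests can be run with SciPy. The reconstructed roll counts aren’t reproduced here, so the code below generates random stand-in data; with the real counts laid out the same way (rows are dice, columns are faces), `chi2_contingency` yields the statistics and p-values reported above.

```python
# Sketch of the D20 homogeneity tests. The counts are random stand-ins
# for the reverse-engineered data: 4 dice (Honest first) x 20 faces.
import numpy as np
from scipy.stats import chi2_contingency, chisquare

rng = np.random.default_rng(0)
counts = rng.multinomial(2000, [1 / 20] * 20, size=4)

# Omnibus test: H0 is that all four D20s share the same distribution.
chi2, p, dof, _ = chi2_contingency(counts)
print(f"omnibus: chi2={chi2:.2f}, df={dof}, p={p:.3f}")  # df = (4-1)*(20-1) = 57

# Follow-ups: Honest D20 (row 0) vs each other die, Bonferroni threshold .05/3.
for i in range(1, 4):
    chi2, p, dof, _ = chi2_contingency(counts[[0, i]])
    print(f"vs die {i}: chi2={chi2:.2f}, df={dof}, p={p:.3f}, "
          f"significant={p < 0.05 / 3}")

# Had a follow-up been significant, the final step would be goodness-of-fit
# tests against the ideal (uniform) distribution, e.g. for the Honest die:
print(chisquare(counts[0]))  # expected frequencies default to uniform
```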
We come to the same conclusion with the Holm step-down procedure. With this procedure, we order our p-values lowest to highest, compare them to a set of increasing thresholds, and reject the hypotheses with the lowest p-values until we find one where the p-value is greater than its threshold; that test and every test after it fail to be rejected.
| Index | Test | p-val | Threshold = .05/(3+1-index) | Conclusion |
| 1 | vs plastic | .032 | .017 | p-val > threshold: fail to reject |
| 2 | vs CNC1 | .247 | .025 | failed to reject above: fail to reject |
| 3 | vs CNC2 | .386 | .05 | failed to reject above: fail to reject |
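For reference, the Holm logic is simple enough to sketch in a few lines of plain Python; the p-values below are the three follow-up results from the table above.

```python
# A sketch of the Holm step-down procedure: sort p-values ascending,
# compare each to alpha/(m + 1 - rank), and stop rejecting at the first
# p-value that exceeds its threshold.
def holm(pvals, alpha=0.05):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if pvals[i] > alpha / (m + 1 - rank):
            break  # this test and every larger p-value fail to be rejected
        reject[i] = True
    return reject

print(holm([0.032, 0.247, 0.386]))  # [False, False, False]: no rejections
```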
How does this happen? How do we reject the hypothesis that all dice are the same, yet find no differences between individual dice? There are two possible explanations. First, it’s possible that due to the sample size deficiencies, there is simply insufficient power to detect differences that are there. Another possibility is that, since four dice were included in the initial test of “at least one difference” but we only performed the three follow-up tests of interest in order to minimize family-wise error rate, the difference lies in one of the three combinations of dice we didn’t test (CNC1 vs CNC2, CNC1 vs plastic, or CNC2 vs plastic). Given the p-values calculated in the follow-up tests, it seems highly likely this is a sample size power issue caused by the stricter thresholds required to avoid family-wise error rate. In any case, further testing is recommended for these D20s.
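As a side note on that power question: the ideal sample sizes in the planning table can be approximated without G*Power, because under the alternative hypothesis the Chi-Square statistic follows a noncentral chi-square distribution with noncentrality parameter lambda = N * w^2. A sketch, assuming the .95 desired power discussed earlier:

```python
# Smallest N whose noncentral chi-square power reaches the target, for
# effect size w. A plain linear search: slow but easy to follow.
from scipy.stats import chi2, ncx2

def required_n(w, df, alpha, power=0.95):
    crit = chi2.ppf(1 - alpha, df)  # rejection cutoff under H0
    n = df + 1
    while ncx2.sf(crit, df, n * w * w) < power:
        n += 1
    return n

print(required_n(0.1, 57, 0.05))      # ~4533 (D20 omnibus test)
print(required_n(0.1, 19, 0.05 / 3))  # ~3571 (D20 follow-ups)
print(required_n(0.1, 5, 0.05))       # ~1979 (D6 test)
```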
For the D6s, our first test is a Chi-Square test of homogeneity with 5 degrees of freedom. This test results in a Chi-Square test statistic of 1.10, which equates to a p-value of .954. This is higher than our threshold of .05, so this result is not statistically significant, and we have found no evidence of any difference between the two D6s tested.
| Test | df | Chi-Sq | P-Val | Threshold | Conclusion |
| All D6s | 5 | 1.10 | .954 | .05 | No evidence of difference |
For the D4s, our first test is a Chi-Square test of homogeneity with 6 degrees of freedom. This test results in a Chi-Square test statistic of 17.18, which equates to a p-value of .009. This is lower than our threshold of .05, so this result is statistically significant, and we can say that evidence exists to reject the hypothesis that all the D4s tested share the same distribution.
Our next step is to do our two follow-up tests. Each of these is a Chi-Square test of homogeneity with 3 degrees of freedom. To control for family-wise error rate, we’re using a threshold of .025 for each of these tests. Our test statistics for the Honest Dice D4 vs each of the other dice are 1.98 for the CNC die, which equates to a p-value of .576, and 9.25 for the plastic die, a p-value of .026. Neither of these is lower than our threshold, so neither is significant, and we cannot say there is any evidence that the Honest Dice D4 is any different from any of the other D4s.
| Test | df | Chi-Sq | P-Val | Threshold | Conclusion |
| All D4s | 6 | 17.18 | .009 | .05 | Evidence exists to suggest at least one difference |
| vs CNC | 3 | 1.98 | .576 | .025 | No evidence of difference |
| vs plastic | 3 | 9.25 | .026 | .025 | No evidence of difference |
We come to the same conclusion with the Holm step-down procedure, again ordering our p-values lowest to highest and comparing them to increasing thresholds, this time with two tests instead of three.
| Index | Test | p-val | Threshold = .05/(2+1-index) | Conclusion |
| 1 | vs plastic | .026 | .025 | p-val > threshold: fail to reject |
| 2 | vs CNC | .576 | .05 | failed to reject above: fail to reject |
This is a very similar situation to the D20s: evidence exists to suggest some difference among the dice tested, but the follow-up tests performed to determine where that difference lies find nothing. Again, it seems highly likely that this is a sample size power issue caused by the stricter thresholds required to avoid family-wise error rate. Further testing is recommended for these D4s.
Our final conclusion is fairly straightforward: evidence exists to suggest some differences between dice, but in general the sample sizes were insufficient for the demands of the Chi-Square test of homogeneity and family-wise error rate control. Further testing with larger sample sizes is recommended.
Hopefully this deep dive into the best practices of dice analysis will be helpful both for those looking for a resource on how to do their own analysis in the future and for those reading analyses put out by others.
First week was part 1: General Approach.
Second week was part 2: Review of the Honest Dice Analysis.