- Gnome Stew - https://gnomestew.com -

A Case Study In Dice Stat Tests Part 3: Suggested Analysis

Over the last few weeks we’ve been looking at the analysis done for the Kickstarter for “Honest Dice | Precision Machined Metal Dice You Can Trust [1]” as a case study. We’ve gone over the general idea of how one would go about testing dice and presenting results. We’ve also gone over the analysis that was done and what was wrong with it. This week we’re looking at how it might have been done, and what conclusions we can actually draw from the testing that was done and the data that was collected. Using the template established earlier, an ideal analysis would look something like this:

With everything planned, here are our test results:

For the D20s, our first test is a Chi-Square test of homogeneity with 57 degrees of freedom. This test results in a Chi-Square test statistic of 83.41, which equates to a p-value of .013. This is lower than our threshold of .05, so the result is statistically significant and we can say that evidence exists to reject the hypothesis that all of the D20s tested share the same distribution.
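Any stats package will convert a Chi-Square statistic and degrees of freedom into a p-value; as a sketch, the upper-tail probability can also be computed in pure Python using the standard recurrence for the chi-square survival function:

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability (survival function) of the chi-square
    distribution. Uses the closed forms for df = 1 and df = 2 as base
    cases, then the standard recurrence
    Q(x, k+2) = Q(x, k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1  # df = 1 base case
    else:
        q, k = math.exp(-x / 2), 2             # df = 2 base case
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

# The omnibus D20 test: statistic 83.41 with 57 degrees of freedom
print(round(chi2_sf(83.41, 57), 3))  # ≈ .013, matching the result above
```

In practice you would just call a library routine such as `scipy.stats.chi2.sf`; the point here is only that the statistic-to-p-value step is mechanical.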

Our next step is to do our three follow-up tests. Each of these is a Chi-Square test of homogeneity with 19 degrees of freedom. To control the Family Wise Error Rate we use a threshold of .017 (.05/3) for each of these tests. The test statistics for the Honest Dice D20 versus each of the other dice are 22.78 for CNC #1 (p-value .247), 20.15 for CNC #2 (p-value .386), and 31.85 for the plastic D20 (p-value .032). None of these falls below our threshold, so none is significant, and we cannot say there is any evidence that the Honest Dice D20 differs from any of the other D20s.

Test         df   Chi-Sq   P-Val   Threshold   Conclusion
All D20s     57   83.41    .013    .05         Evidence exists to suggest at least one difference
vs CNC1      19   22.78    .247    .017        No evidence of difference
vs CNC2      19   20.15    .386    .017        No evidence of difference
vs plastic   19   31.85    .032    .017        No evidence of difference
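The Chi-Square statistic in each of these tests comes from a contingency table of face counts. As a sketch with HYPOTHETICAL tallies (the actual roll data from the Kickstarter testing is not reproduced here), using two six-sided dice for brevity:

```python
# Chi-square test of homogeneity from a table of face counts.
# These tallies are hypothetical, chosen only to illustrate the arithmetic.
observed = [
    [18, 22, 20, 19, 21, 20],  # die A: count of each face 1-6
    [25, 15, 22, 18, 20, 20],  # die B
]

rows = len(observed)
cols = len(observed[0])
row_totals = [sum(r) for r in observed]
col_totals = [sum(r[j] for r in observed) for j in range(cols)]
grand = sum(row_totals)

# Expected count in each cell is (row total * column total) / grand total
chi_sq = 0.0
for i in range(rows):
    for j in range(cols):
        expected = row_totals[i] * col_totals[j] / grand
        chi_sq += (observed[i][j] - expected) ** 2 / expected

# Degrees of freedom: (rows - 1) * (cols - 1),
# e.g. (4 - 1) * (20 - 1) = 57 for the four D20s above
df = (rows - 1) * (cols - 1)
print(df, round(chi_sq, 2))
```

The same computation with four rows of twenty counts each is exactly the 57-degree-of-freedom omnibus D20 test.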

We come to the same conclusion with the Holm step-down procedure. With this procedure, we order our p-values from lowest to highest and compare each to an increasing threshold, rejecting as we go; at the first p-value that exceeds its threshold, that test and every test after it fail to reject.

Index   Test         p-val   Threshold (.05/(3+1-index))   Conclusion
1       vs plastic   .032    .017                          p-val > threshold: fail to reject
2       vs CNC1      .247    .025                          failed to reject above: fail to reject
3       vs CNC2      .386    .05                           failed to reject above: fail to reject

(Fail to reject at the first index where p-val > threshold, and at every index after it.)
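The procedure is simple enough to sketch directly. A minimal implementation, run against the three follow-up p-values from the table above:

```python
def holm_step_down(labels, p_values, alpha=0.05):
    """Holm step-down procedure: sort p-values ascending, compare the
    i-th smallest (1-indexed) against alpha / (m + 1 - i), and stop
    rejecting at the first p-value that exceeds its threshold.
    Returns {label: rejected?} for each test."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    results = {}
    rejecting = True
    for rank, i in enumerate(order, start=1):
        threshold = alpha / (m + 1 - rank)
        if p_values[i] > threshold:
            rejecting = False  # this test and all after it fail to reject
        results[labels[i]] = rejecting
    return results

# The three D20 follow-up tests: all fail to reject, matching the table
print(holm_step_down(["vs CNC1", "vs CNC2", "vs plastic"],
                     [.247, .386, .032]))
```

Note the first (smallest) p-value faces the strictest threshold, .05/3, which is the same Bonferroni-style threshold used in the follow-up tests above; the later thresholds relax step by step.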

How does this happen? How can we reject the hypothesis that all dice are the same, yet find no differences between individual dice? There are two possible explanations. First, due to the sample size deficiencies, there may simply be insufficient power to detect differences that are there. Second, the initial test of “at least one difference” covers four dice, but to minimize Family Wise Error Rate we only performed the three follow-up tests of interest, so the difference could lie in one of the three combinations of dice we didn’t test (CNC1 vs CNC2, CNC1 vs plastic, or CNC2 vs plastic). Given the p-values calculated in the follow-up tests, it seems highly likely this is a sample size power issue caused by the stricter thresholds required to avoid Family Wise Error Rate. In any case, further testing is recommended for these D20s.
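The counting behind the second explanation is worth making explicit: four dice yield six possible pairwise comparisons, of which only the three involving the Honest Dice D20 were followed up.

```python
from itertools import combinations

dice = ["Honest Dice", "CNC #1", "CNC #2", "plastic"]

# All possible pairwise comparisons among the four D20s
all_pairs = list(combinations(dice, 2))  # 6 pairs

# Only comparisons against the Honest Dice D20 were of interest,
# so only those 3 were tested
tested = [p for p in all_pairs if "Honest Dice" in p]
untested = [p for p in all_pairs if "Honest Dice" not in p]

print(untested)  # the three pairs where an undetected difference could hide
```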

For the D6s, our first test is a Chi-Square test of homogeneity with 5 degrees of freedom. This test results in a Chi-Square test statistic of 1.10, which equates to a p-value of .954. This is higher than our threshold of .05, so the result is not statistically significant and we have not found any evidence of a difference between the two D6s tested.

Test      df   Chi-Sq   P-Val   Threshold   Conclusion
All D6s   5    1.10     .954    .05         No evidence of difference

For the D4s, our first test is a Chi-Square test of homogeneity with 6 degrees of freedom. This test results in a Chi-Square test statistic of 17.18, which equates to a p-value of .009. This is lower than our threshold of .05, so the result is statistically significant and we can say that evidence exists to reject the hypothesis that all of the D4s tested share the same distribution.

Our next step is to do our two follow-up tests. Each of these is a Chi-Square test of homogeneity with 3 degrees of freedom. To control the Family Wise Error Rate we use a threshold of .025 (.05/2) for each of these tests. The test statistics for the Honest Dice D4 versus each of the other dice are 1.98 for the CNC die (p-value .576) and 9.25 for the plastic die (p-value .026). Neither falls below our threshold, so neither is significant, and we cannot say there is any evidence that the Honest Dice D4 differs from any of the other D4s.

Test         df   Chi-Sq   P-Val   Threshold   Conclusion
All D4s      6    17.18    .009    .05         Evidence exists to suggest at least one difference
vs CNC       3    1.98     .576    .025        No evidence of difference
vs plastic   3    9.25     .026    .025        No evidence of difference

We come to the same conclusion with the Holm step-down procedure, ordering the p-values from lowest to highest and comparing each to its increasing threshold as before.

Index   Test         p-val   Threshold (.05/(2+1-index))   Conclusion
1       vs plastic   .026    .025                          p-val > threshold: fail to reject
2       vs CNC       .576    .05                           failed to reject above: fail to reject

(Fail to reject at the first index where p-val > threshold, and at every index after it.)

This is a very similar situation to the D20s: evidence exists to suggest some difference among the dice tested, but the follow-up tests performed to locate that difference find nothing. Again, it seems highly likely that this is a sample size power issue caused by the stricter thresholds required to avoid Family Wise Error Rate. Further testing is recommended for these D4s.

Our final conclusion is fairly straightforward: evidence exists to suggest some differences between dice, but in general the sample sizes were insufficient for the demands of the Chi-Square test of homogeneity combined with Family Wise Error Rate control. Further testing with larger sample sizes is recommended.
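The sample size recommendation can be made concrete with a quick Monte Carlo sketch. Everything below is hypothetical: the bias (a D20 that rolls a 20 twice as often as it should), the roll counts, and the simulation parameters are illustrative, not drawn from the actual test.

```python
import math
import random

def chi2_sf(x, df):
    """Chi-square upper-tail probability via the standard recurrence."""
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q

def homogeneity_p(counts_a, counts_b):
    """P-value of a two-sample chi-square test of homogeneity."""
    faces = len(counts_a)
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand = total_a + total_b
    chi_sq = 0.0
    for j in range(faces):
        col = counts_a[j] + counts_b[j]
        if col == 0:
            continue  # a face nobody rolled contributes nothing
        for obs, tot in ((counts_a[j], total_a), (counts_b[j], total_b)):
            expected = tot * col / grand
            chi_sq += (obs - expected) ** 2 / expected
    return chi2_sf(chi_sq, faces - 1)

def estimate_power(rolls_per_die, sims=500, alpha=0.017, seed=1):
    """Estimated probability of distinguishing a HYPOTHETICAL biased D20
    (face 20 at probability .10 instead of .05) from a fair D20, at the
    per-test threshold used for the follow-up tests above."""
    rng = random.Random(seed)
    fair = [1 / 20] * 20
    biased = [0.90 / 19] * 19 + [0.10]  # faces 1-19 share the remainder
    hits = 0
    for _ in range(sims):
        a, b = [0] * 20, [0] * 20
        for face in rng.choices(range(20), weights=fair, k=rolls_per_die):
            a[face] += 1
        for face in rng.choices(range(20), weights=biased, k=rolls_per_die):
            b[face] += 1
        if homogeneity_p(a, b) < alpha:
            hits += 1
    return hits / sims

# Power grows with sample size; small samples rarely catch even this bias
print(estimate_power(100), estimate_power(1000))
```

Swapping in whatever face probabilities are of practical concern gives a rough sample size target before any rolling starts, which is exactly the planning step the analysis above was missing.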

Hopefully this deep dive into the best practices of dice analysis will be helpful for those looking for a resource on how to do their own analysis in the future as well as those reading analysis put out by others.

First week was part 1: General Approach [2].

Second week was part 2: Review of the Honest Dice Analysis [3].
