Recently I came across another video featuring the float test for testing die fairness. For those not familiar, the float test consists of floating your dice in a dense bath of salt water, repeatedly spinning, rolling, or shaking them, and letting them settle to see if a certain face or set of faces routinely floats to the top. This result is supposed to be indicative of voids or differences in density in your dice and proof that they do not roll fairly. In theory, if a die fails the float test you shouldn’t use it.

I’ve always been skeptical of the float test though. Yes, it certainly can tell you if your dice have imperfections that make one face or collection of faces lighter or heavier than others, but do those differences really result in a meaningful difference in rolls? So, I set out to do a not at all repeatable, not at all scientific test to see if the results from a float test on my d20s were borne out by a chi-square analysis.

Rough Methodology:

• I wanted to test all my d20s, but I discarded three of them for the purposes of the test: two which had insufficient contrast to read easily, and one that is an old-school double d10, not a true d20
• I was left with 22 d20s. I wanted to perform a float test on all of them and note which ones failed and which face(s) repeatedly floated to the top.
• Then I wanted to perform a chi-square goodness of fit test on those dice. However, since we had a clue which face(s) should be the most (or least) common according to the float test, we should actually be able to do a better test than the standard 19 degree of freedom test vs H0: all faces have a .05 chance of occurrence. Instead we would be able to do the better test against H0: the face(s) indicated by the float test have a combined chance of occurrence equal to .05 times the number of indicated faces. This test is better since we’re able to target the specific faces that should be off rather than general deviation from the ideal distribution.
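To make the difference between the two tests concrete, here is a minimal sketch of how per-face tallies could be collapsed into the two categories and how each chi-square statistic is computed. The function names and the tallies in the usage lines are hypothetical, made up purely for illustration; they are not my actual roll data.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def collapse_counts(face_counts, float_faces):
    """Collapse per-face tallies into ('float test faces', 'other faces').

    face_counts: dict mapping face number -> times rolled
    float_faces: faces the float test flagged
    Returns (observed, expected), each a 2-tuple, assuming a fair die.
    """
    n_rolls = sum(face_counts.values())
    n_faces = len(face_counts)
    flagged = sum(face_counts[f] for f in float_faces)
    expected_flagged = n_rolls * len(float_faces) / n_faces
    observed = (flagged, n_rolls - flagged)
    expected = (expected_flagged, n_rolls - expected_flagged)
    return observed, expected


# Hypothetical tallies for 100 rolls of a d20 (illustration only).
tallies = {face: 5 for face in range(1, 21)}
tallies[20] = 7
tallies[1] = 3

# Standard test: 20 categories, 19 degrees of freedom.
fair = [sum(tallies.values()) / 20] * 20
full_stat = chi_square_statistic([tallies[f] for f in range(1, 21)], fair)

# Targeted test: 2 categories, 1 degree of freedom.
observed, expected = collapse_counts(tallies, float_faces=[20])
targeted_stat = chi_square_statistic(observed, expected)

print(f"19 df statistic: {full_stat:.2f}")
print(f" 1 df statistic: {targeted_stat:.2f}")
```

The two statistics are compared against different chi-square distributions (19 vs 1 degree of freedom), so they can’t be compared to each other directly; each needs its own p-value.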

Execution:

1. The salt stopped dissolving yet again.
2. Impurities and seed crystals introduced into the solution (by adding more salt) caused the salt to crystallize rapidly out of the solution into a thick crust on the top of the pan, which broke loose and sank.

So, I had gotten about as much salt into the water as I was going to be able to in my kitchen. But even after much of the salt crystallized out of the solution, I was able to float four dice (the four pictured above). That’s not a good result out of 22, but it’s something at least. One of them was very recognizable: my PolyHero Wizard die. The other three are generic d20s. If it’s important, the black one pictured above was the first one to float. The Wizard d20 was the second, though it may well have floated better because of its unique textured shape. The two translucent greens were the last to float.

Now that I had four floating dice, I was able to do a float test. The black d20 exclusively had the 16, 19, 6, 9, and 3 cluster at the surface, the Wizard die exclusively had the 20 rise to the surface, and the other two had no discernible tendency. If the float test actually works to detect internal voids and bubbles, the results for the green dice make sense, as they are clear enough to visibly confirm that none exist. This gave me two dice to run chi-square goodness of fit tests on. I had already run a general 19 degree of freedom goodness of fit test on my PolyHero Wizard d20, so I was even more skeptical that I would find anything amiss with it. Still, for the sake of being thorough, I went ahead and tested it again.

Remember that the end result of a chi-square goodness of fit test is a p-value, and “if the p is low, H0 must go”: if your p-value is lower than a standard significance level (usually .1, .05, or .01, depending on how skeptical you want to be), you reject your original hypothesis. Remember also that in this case our hypothesis is that the faces indicated by the float test come up a proportion of the time equal to .05 times the number of indicated faces (i.e., the die follows the normal fair distribution). For each die, I rolled it 100 times and ran a one degree of freedom goodness of fit test on the two categories of “float test faces” and “other faces”.
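Written out, the two-category test looks roughly like this. The symbols are my own shorthand rather than anything standard: n is the number of rolls, k is the number of faces flagged by the float test, O_f and O_o are the observed counts of flagged and other faces, and E_f and E_o are the counts a fair d20 would be expected to produce.

```latex
\begin{align*}
E_f &= n \cdot \frac{k}{20}, & E_o &= n - E_f, & O_o &= n - O_f \\
X^2 &= \frac{(O_f - E_f)^2}{E_f} + \frac{(O_o - E_o)^2}{E_o} \\
p &= P\left(\chi^2_1 \ge X^2\right)
\end{align*}
```

The p-value is the upper tail: the chance that a fair die would produce a statistic at least as large as the one observed, using the chi-square distribution with 1 degree of freedom.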

For the Wizard die, which exclusively had the 20 rise to the surface, a fair die would be expected to roll 5 20s and 95 other faces. Instead we saw 4 20s and 96 other faces. This gives a chi-square statistic of about 0.21 and a p-value for a chi-square goodness of fit test with 2 categories (1 degree of freedom) of about .65. This is not sufficiently low to reject our H0, so we do not have sufficient evidence to conclude that the results of the float test are meaningful. As I stated earlier, this isn’t surprising, as I had already run a standard goodness of fit test on this particular die and not found sufficient evidence to reject its fairness.

For the black die, which exclusively had the 16, 19, 6, 9, 3 cluster of sides rise to the surface, a fair die would be expected to roll those 5 sides 25 times and the other 15 sides 75 times. Instead we saw 26 and 74 occurrences respectively. This gives a chi-square statistic of about 0.05 and a p-value of about .82. This is higher than the result for the Wizard die, and nowhere near low enough to reject our H0. Thus we don’t have sufficient evidence to conclude the float test results are meaningful in this case either.
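For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes both tests from the counts above using only the Python standard library. The function name is my own invention. It relies on the fact that, for 1 degree of freedom, the upper-tail chi-square probability can be written with the complementary error function, so no statistics package is needed.

```python
import math


def two_category_chi_square(flagged_hits, n_rolls, n_flagged_faces, n_faces=20):
    """Return (statistic, p_value) for 'float test faces' vs 'other faces'.

    Assumes H0: every face of the die is equally likely.
    """
    expected_flagged = n_rolls * n_flagged_faces / n_faces
    expected_other = n_rolls - expected_flagged
    other_hits = n_rolls - flagged_hits
    statistic = (
        (flagged_hits - expected_flagged) ** 2 / expected_flagged
        + (other_hits - expected_other) ** 2 / expected_other
    )
    # For 1 degree of freedom, P(chi2 >= x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts from the two dice tested above (100 rolls each).
wizard = two_category_chi_square(flagged_hits=4, n_rolls=100, n_flagged_faces=1)
black = two_category_chi_square(flagged_hits=26, n_rolls=100, n_flagged_faces=5)

for name, (stat, p) in (("Wizard d20", wizard), ("Black d20", black)):
    print(f"{name}: chi-square = {stat:.3f}, upper-tail p = {p:.2f}")
```

One thing to watch for: the p-value is the upper tail of the distribution. Some calculators and spreadsheet functions return the lower tail (the cumulative probability) instead, which is one minus the p-value, so it is worth checking which one you are getting. Also, an expected count of 5 is right at the edge of the usual rule of thumb for the chi-square approximation, so the Wizard die’s result should be read as rough.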

End Conclusion:

Honestly, this debacle is inconclusive. I couldn’t even get 18 of my 22 dice to float. Either better conditions, a better method of making the solution, or a denser solution is required for me to test more dice. If anyone has suggestions for how to improve my results here, I’d love to hear them. I’m willing to give this another go with more dice. Reading sources online, I also find that others have seen the same results I have, with dice floating at wildly different densities of solution. It’s possible that this is related to the presence of bubbles and voids, with more prominent ones making for less dense dice. It’s also possible this relates to the particular type of material used in manufacture.

However, consider the difficulty of successfully executing a float test, the proportion of my dice that resolutely refused to float, and the fact that in the two cases we could test, we found no evidence to support the conclusions of the float test. Given all that, I’m going to tentatively call the float test impractical and not supportable, but I would be very interested in running more tests once I have a better protocol to work with.

Have you had better success with the float test? Swear by it? Have you conducted this or a similar experiment yourself? I’d love to hear from you so I can get tips for another go at this.