Recently I came across another video featuring the float test for testing die fairness. For those not familiar, the float test consists of floating your dice in a dense bath of salt water and repeatedly spinning, rolling, or shaking them and letting them settle to see if a certain face or set of faces routinely floats to the top. This result is supposed to be indicative of voids or differences in density in your dice and proof that they do not roll fairly. In theory, if a die fails the float test you shouldn’t use it.

I’ve always been skeptical of the float test though. Yes, it certainly can tell you if your dice have imperfections that make one face or collection of faces lighter or heavier than others, but do those differences really result in a meaningful difference in rolls? So, I set out to do a not at all repeatable, not at all scientific test to see if the results from a float test on my d20s were borne out by a chi-square analysis.

Rough Methodology:

- I wanted to test all my d20s, but I discarded three of them for the purposes of the test: two which had insufficient contrast to read easily, and one that is an old-school double d10, not a true d20
- I was left with 22 d20s. I wanted to perform a float test on all of them and note which ones failed and which face(s) repeatedly floated to the top.
- Then I wanted to perform a chi-square goodness of fit test on those dice. However, since the float test gives us a clue about which face(s) should be the most (or least) common, we should be able to do better than the standard 19-degree-of-freedom test against *H0: all faces have a .05 chance of occurrence*. Instead we can test against *H0: the face(s) indicated by the float test have a combined chance of occurrence equal to .05 times the number of indicated faces*. This test is better because it targets the specific faces that should be off rather than general deviation from the ideal distribution.
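
To make the difference between the two tests concrete, here is a minimal sketch in R. The rolls are simulated from a fair die purely for illustration, and the flagged faces are a hypothetical float-test result, not data from this experiment.

```r
set.seed(1)                                    # arbitrary seed so the sketch is repeatable
rolls   <- sample(1:20, 100, replace = TRUE)   # hypothetical data: 100 rolls of a fair d20
flagged <- c(3, 6, 9, 16, 19)                  # hypothetical faces singled out by a float test

# Standard test: 20 categories, 19 degrees of freedom.
face_counts <- tabulate(rolls, nbins = 20)
general <- chisq.test(face_counts, p = rep(0.05, 20))

# Targeted test: pool the faces into "flagged" vs "other", 1 degree of freedom.
k <- length(flagged)
pooled_counts <- c(sum(rolls %in% flagged), sum(!(rolls %in% flagged)))
targeted <- chisq.test(pooled_counts, p = c(0.05 * k, 1 - 0.05 * k))

general$parameter    # degrees of freedom of the standard test
targeted$parameter   # degrees of freedom of the targeted test
```

The targeted version spends its single degree of freedom on exactly the faces the float test points at, which is why it should be more sensitive to that particular kind of bias.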

Execution:

I started with 3 cups of water in a small bowl, enough to contain all my d20s at once. I then started adding salt to the bowl, one tablespoon at a time, with the goal of getting all my dice to float. One of the dice started to float after I added about 3 tablespoons of salt (about a 1/16 concentration), but the rest stubbornly refused to float as I added tablespoon after tablespoon of salt. Eventually, at around 10 tablespoons of salt (about a 1/5 concentration), another die started to float, but the salt also stopped dissolving in the water with 20 of my dice still sitting solidly on the bottom of the bowl. I fished out all of the dice and microwaved the solution and was able to get another few tablespoons of salt to dissolve, but no additional dice were floating. So, after a quick Google search to make sure I wasn’t about to ruin my dice, I transferred the entire solution to a pan (featured above) and slowly heated it on the stove with a few of the stubborn dice on the bottom so I’d know when I had enough salt dissolved. I managed to get about 16 total tablespoons of salt to dissolve (about a 1 to 3 ratio, making my solution literally saltier than Poseidon’s trident) before two things happened:

- The salt stopped dissolving yet again.
- Impurities and seed crystals introduced into the solution (by adding salt) caused a rapid crystallization of the salt out of the solution into a thick crust on the top of the pan, which broke loose and sank.
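
For anyone checking the concentrations quoted above, they work out if you read them as salt-to-water ratios by volume. A quick sketch in R, assuming US measures (16 tablespoons to the cup); the variable names are just for illustration.

```r
tbsp_per_cup <- 16                    # assumption: US customary measure
water_tbsp   <- 3 * tbsp_per_cup      # 3 cups of water expressed in tablespoons
salt_tbsp    <- c(first_float = 3, second_float = 10, final = 16)

ratio <- salt_tbsp / water_tbsp       # salt volume per unit of water volume
print(round(ratio, 3))
```

These are volume ratios only; they say nothing directly about the density of the resulting solution.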

So, I had gotten about as much salt into the water as I was going to be able to in my kitchen. But even after much of the salt crystallized out of the solution, I was able to float four dice (the four pictured above). That’s not a good result out of 22, but it’s something at least. One of them was very recognizable: my PolyHero Wizard die. The other three are generic d20s. If it’s important, the black one pictured above was the first one to float, the Wizard d20 was the second one to float, but may well have floated better because of its unique textured shape. The two translucent greens were the last to float.

Now that I had four floating dice, I was able to do a float test. The black d20 exclusively had the 16, 19, 6, 9, and 3 cluster at the surface, the Wizard die exclusively had the 20 rise to the surface, and the other two had no discernible tendency. If the float test actually works to detect internal voids and bubbles though, the results of the green dice would make sense, as they are clear enough to visibly confirm that none exist. This gave me two dice to run a chi-square goodness of fit test on, but I had already run a general 19-degree-of-freedom goodness of fit test on my PolyHero Wizard d20, so I was even more skeptical that I would find anything amiss with it. Still, for the sake of being thorough, I went ahead and tested it again.

Remember that the end result of a chi-square goodness of fit test is a p-value, and “if the p is low, H0 must go”, i.e. if your p-value is lower than a standard significance level (usually .1, .05 or .01 depending on how skeptical you want to be) you must reject your original hypothesis. Remember also that in this case our hypothesis is that the faces indicated by the float test come up a proportion of the time equal to .05 times the number of indicated faces (i.e. the die follows the normal fair distribution). For each die, I rolled it 100 times and ran a one-degree-of-freedom goodness of fit test on the two categories of “float test faces” and “other faces”.

For the wizard die, which exclusively had the 20 rise to the surface, a fair die would be expected to roll 5 20s and 95 other faces in 100 rolls. Instead we saw 4 20s and 96 other faces. This gives a chi-square statistic of about 0.21, and a p-value for a goodness of fit test with 2 categories (1 degree of freedom) of about .65. This is nowhere near low enough to reject our H0, so **we do not have sufficient evidence to conclude that the results of the float test are meaningful.** As I stated earlier, this isn’t surprising, as I had already run a standard goodness of fit test on this particular die and not found sufficient evidence to reject its fairness.

For the black die, which exclusively had the 16, 19, 6, 9, 3 cluster of sides rise to the surface, a fair die would be expected to roll those 5 sides 25 times and the other 15 sides 75 times in 100 rolls. Instead we saw 26 and 74 occurrences respectively. This gives a chi-square statistic of about 0.05 and a p-value of about .82. This is even higher than for the wizard die, and nowhere near low enough to reject our H0. Thus **we don’t have sufficient evidence to conclude the float test results are meaningful in this case either**.
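
Both one-degree-of-freedom tests can be recomputed from the counts above. Here is a minimal sketch in R; the helper name `pooled_test` is mine and not part of any package. Keep in mind that a chi-square p-value is the upper-tail area, so a statistic close to zero (observed counts close to expected) corresponds to a large p-value.

```r
# Observed counts reported in the post: (float-test faces, all other faces).
wizard <- c(4, 96)     # float-test face: 20
black  <- c(26, 74)    # float-test faces: 16, 19, 6, 9, 3

# Hypothetical helper: two-category goodness of fit test against a fair d20.
pooled_test <- function(observed, n_flagged) {
  p0 <- c(0.05 * n_flagged, 1 - 0.05 * n_flagged)   # proportions under H0
  expected <- sum(observed) * p0
  statistic <- sum((observed - expected)^2 / expected)
  # Upper tail: chance of a statistic at least this large if H0 is true.
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  c(statistic = statistic, p_value = p_value)
}

wizard_result <- pooled_test(wizard, 1)
black_result  <- pooled_test(black, 5)
print(round(rbind(wizard = wizard_result, black = black_result), 3))
```

The same numbers should come out of `chisq.test(c(4, 96), p = c(0.05, 0.95))` and `chisq.test(c(26, 74), p = c(0.25, 0.75))`.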

End Conclusion:

Honestly, this debacle is inconclusive. I couldn’t even get 18 of my 22 dice to float. Either better conditions, a better method of making the solution, or a denser solution is required for me to test more dice. If anyone has suggestions for how to improve my results here, I’d love to hear them. I’m willing to give this another go with more dice. Reading sources online, I also find that others have seen the same results I have, with dice floating at wildly different densities of solution. It’s possible that this is related to the presence of bubbles and voids, with more prominent ones changing the overall density of a die. It’s also possible this relates to the particular type of material used in manufacture.

However, given the difficulty in successfully executing a float test, the proportion of my dice that resolutely refused to float, and the fact that in the two cases we could test we found no evidence to support the conclusions of the float test, **I’m going to tentatively call the float test impractical and not supportable**, but would be very interested in running more tests once I have a better protocol to work with.

Have you had better success with the float test? Swear by it? Have you conducted this or a similar experiment yourself? I’d love to hear from you so I can get tips for another go at this.

What kind of salt did you use? Basic table salt? If so, next time try Epsom salt. It dissolves much more readily in water and will reach a higher concentration, hopefully letting you get more of your dice to float for more accurate testing.

I would try using sugar instead of salt. You should be able to get to a more dense solution that way.

You could also try floating the dice in corn syrup which is far denser than water to begin with.

I typed a whole post on power calculations and the mailer ate it. If you could recover it, that would be great.

I’ll cut to the chase this time. With only 100 trials, you don’t have enough power to reject the null hypothesis that the die is balanced if it actually has a 7.5% chance of landing 20.

With 100 trials, there’s only a 22% chance of rejecting the null at p = 0.05. At 1000 trials, it’s 92% and at 2000, it’s 99%.

And remember, we’re talking about a die that rolls 50% more critical hits than normal.

One way to get a handle on this is just to simulate 20 or 30 trials of 100 rolls with a 7.5% chance of success to see how noisy binary data is. Here are the numbers of 20s rolled in 50 experiments of 100 rolls each for a die with a 7.5% chance of producing a 20:

    rbinom(50, 100, 0.075)

    9 11 7 9 8 9 10 7 7 7 9 5 5 7 2 12 10 10 7 12 7 7 10 7 4 9 6 5 11 8 13 6 4 9 6 4 9 7 7 3 6 6 6 7 12 6 6 8 8 6

The proportion of 20s ranges from 2% to 13%, with a lot of variation.

Here’s the R code for the power tests if you want to run it yourself:

    N <- 1000          # number of rolls per experiment
    K <- 10000         # number of simulated experiments
    theta <- 0.075     # true chance of the die rolling a 20
    threshold <- 0.05  # p-value threshold for rejection

    rejections <- 0
    for (k in 1:K) {
      twenties <- rbinom(1, N, theta)   # number of 20s in one experiment
      other <- N - twenties
      # two-category goodness of fit test against a fair die
      pvalue <- chisq.test(x = c(twenties, other),
                           p = c(0.05, 0.95))$p.value
      if (pvalue < threshold) rejections <- rejections + 1
    }

    print(rejections / K, digits = 5)   # estimated power
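
As a cross-check on the simulation, the power can also be computed exactly, since the number of 20s in N rolls is binomial. This is a sketch rather than a definitive implementation: the helper name `exact_power` is mine, and it assumes the same uncorrected two-category chi-square test used above.

```r
# Exact power: sum the binomial probability of every count of 20s
# that the two-category chi-square test would reject.
exact_power <- function(n, theta, p0 = 0.05, alpha = 0.05) {
  x <- 0:n                                   # every possible count of 20s
  e20 <- n * p0                              # expected 20s for a fair die
  e_other <- n * (1 - p0)                    # expected other faces for a fair die
  stat <- (x - e20)^2 / e20 + ((n - x) - e_other)^2 / e_other
  pval <- pchisq(stat, df = 1, lower.tail = FALSE)
  sum(dbinom(x[pval < alpha], n, theta))     # chance of landing in the rejection region
}

power_by_n <- sapply(c(100, 1000, 2000), exact_power, theta = 0.075)
names(power_by_n) <- c("100", "1000", "2000")
print(round(power_by_n, 3))
```

The results should land close to the simulated figures, with the small differences coming from simulation noise.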

I got the post on power calc in the email subscription, but not here. I’ll ask the tech gnomes to look into it.

While I agree with what you’re saying, the problem I run into is that we’re not actually testing whether dice are fair (with either test), because no such animals exist, and for any die a sufficiently powerful test will always reject. What we need to test is whether a die detectably deviates from the expected distribution within the context of a game session. So while you are absolutely correct that more n means more power, more power isn’t necessarily desirable.
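
To put rough numbers on that trade-off, here is a sketch in R using the standard normal-approximation sample size formula for a two-sided test of a single proportion. The helper name `rolls_needed`, the 80% power target, the 50 rolls per session, and the three bias levels are all assumptions chosen for illustration.

```r
# Approximate number of rolls needed to detect a true chance p1 of rolling
# a 20 against the fair chance p0 (normal approximation, two-sided test).
rolls_needed <- function(p1, p0 = 0.05, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)
  numerator <- z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
  ceiling((numerator / (p1 - p0))^2)
}

session_rolls <- 50                  # hypothetical number of d20 rolls in one session
thetas <- c(0.051, 0.06, 0.075)      # hypothetical true chances of rolling a 20

summary_table <- data.frame(
  theta = thetas,
  n_rolls = sapply(thetas, rolls_needed),
  extra_20s_per_session = session_rolls * (thetas - 0.05)
)
print(summary_table)
```

A tiny bias takes a huge number of rolls to detect and adds only a small fraction of an extra 20 per session, which is the sense in which a sufficiently powerful test can reject a die that is fair enough for the table.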