Why Cluster Analysis Is Flawed

Author: Keith York

This article is property of the author and may not be reprinted or distributed without permission

Posted June 18, 2000

 

On April 19, 2000 I published the article "A Response To Roy Reinhold's April 6, 2000 'Statistics in Bible Codes Programs' ", which was generally positive in tone.  However, I did express some concerns, as noted in the quote below from that article.

QUOTE

When I first contacted Roy by e-mail concerning his paper, I expressed two concerns about his method and example calculation.  First of all, he correctly states that terms with negative matrix R-values can be expected to occur one or more times in the matrix simply by chance.  Examining his Sid Roth life array, I noticed that only 6 of the 73 ELS's had positive matrix R-values.  If the other 67 can be expected to occur in the matrix simply by chance, then why should they be included in the array in the first place?  Secondly, I pointed out some factors (which I will not go into here) that led me to believe that the overall probability for the Sid Roth array was worse than he had calculated.  Roy responded that the majority of those 67 terms are found in 8 large clusters found within the larger array.  He even states in Part 3 that if a more detailed statistical analysis was feasible that related the terms within clusters to each other and overall clusters to the central term, that the overall probability of that array would be better.  He then provided some initial calculations which he had performed which took the cluster analyses into account, changing some of the negative matrix R-values to positive ones.  He proceeded to show that when this is done, even when the factors that I brought up were taken into account (and he stated that he will take these factors into account in the future), that the overall probability for the array is even better than what he had presented.  He chose not to make the results of the better calculation public until he had ironed out all the details of how to perform that type of analysis.  I commend him in this decision.  Thus rather than his results being overstated, as I initially thought, he showed that they were actually understated.

UNQUOTE 

When I first posted the review of his statistical method, Roy had just given me the calculations for the cluster analysis of the Sid Roth life array.  (Shortly thereafter, he posted those calculations as Part 4 of his article.)  Even though I was not prepared to accept cluster analysis without question, I gave him the public benefit of the doubt since I did not see any errors in the arithmetic.  Then, when I published my own protocol, I simply stated that my method would not be using cluster analysis.

Shortly after I became aware of Roy's cluster analysis, I discovered what I believe to be serious flaws in the method.  Rather than publicly post these concerns, I first engaged him in e-mail discussion on the subject.  Though I have made him aware of these concerns, I have not changed his mind and the cluster analysis is still used on his website.  Thus I have decided to detail my concerns in this article.

My Concerns

As a hypothetical example, suppose one has created an array with a few dozen terms. Looking at the matrix R-values, one sees that only half a dozen terms have positive values, meaning that the other terms would be considered statistically insignificant, since each is expected to appear in the array one or more times by chance. One decides to perform a cluster analysis. A is the central, or main, term. As one looks at the matrix, it appears that terms B, C, D, E, F, and G sort of cluster together. Drawing a boundary box around these terms shows that the cluster occupies 10% of the total matrix area. A term is expected to occur only one-tenth as often in one-tenth of the area, so for each term its cluster R-value will equal its matrix R-value + 1.000 (the R-values being base-10 logarithms). For discussion purposes, say that these R-values are as follows.

Term    Matrix R-value    Cluster R-value

B            0.700             1.700
C           -0.100             0.900
D           -0.800             0.200
E           -1.300            -0.300
F           -0.400             0.600
G           -0.500             0.500

There is only one term with a positive matrix R-value: B, at 0.700. Looking only at matrix R-values, this set of terms therefore contributes a factor of about 5 to 1 (the antilog of 0.700) to the overall matrix odds. With cluster R-values, however, the picture changes radically. Now 5 of the 6 terms have positive cluster R-values, meaning that four terms are deemed "statistically significant" that were not before. The sum of the positive cluster R-values is 1.700 + 0.900 + 0.200 + 0.600 + 0.500 = 3.900. Roy's proposed method would "correct" this sum for the cluster's 10% share of the matrix area by subtracting 1.000 from it. Once this subtraction is done, the resulting sum is 2.900, or odds of about 794 to 1. Thus the cluster analysis has raised the odds contributed by these six terms by a factor of over 150. Now remember that this is only six of the few dozen terms and only one cluster. If one did similar cluster analyses on the other groupings of terms which sort of cluster together, one might get overall odds improvements of a factor of a million or more. (Just three clusters like the one above would improve the overall odds by 150 X 150 X 150 = 3,375,000 times.)
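The arithmetic for this hypothetical cluster can be checked with a short Python sketch. It is only an illustration of the example as described here, not Roy's actual procedure: the variable names are invented for this purpose, and the 1.000 shift is simply copied from the 10% area figure above.

```python
# Matrix R-values for the hypothetical terms B through G (from the example above).
matrix_r = {"B": 0.700, "C": -0.100, "D": -0.800,
            "E": -1.300, "F": -0.400, "G": -0.500}

# Shift applied for a cluster covering 10% of the matrix area (value taken
# from the example; the name SHIFT is hypothetical).
SHIFT = 1.000


def positive_sum(values):
    """Sum only the positive R-values, as the example does."""
    return sum(v for v in values if v > 0)


# Cluster R-value = matrix R-value + shift, for every term in the cluster.
cluster_r = {term: r + SHIFT for term, r in matrix_r.items()}

# Without clustering, only B counts.
matrix_total = positive_sum(matrix_r.values())

# With clustering, five terms count, and the shift is subtracted once.
cluster_total = positive_sum(cluster_r.values()) - SHIFT

matrix_odds = 10 ** matrix_total      # about 5 to 1
cluster_odds = 10 ** cluster_total    # about 794 to 1
improvement = cluster_odds / matrix_odds

print(f"matrix odds:  {matrix_odds:.1f} to 1")
print(f"cluster odds: {cluster_odds:.0f} to 1")
print(f"improvement:  {improvement:.0f} times")
```

The improvement factor comes out a little above 150, which matches the figure used in the text.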

If one thinks about it, what is happening here is that 1.000 is added to EACH of the six terms' matrix R-values, while 1.000 is subtracted only ONCE from the total cluster R-value sum. This is why I believe that cluster analysis is too subjective, is subject to manipulation, and has the potential to greatly and artificially inflate the statistical results. The numbers one gets from cluster analysis are TOTALLY dependent upon how one groups the terms into clusters. That is subjective. This ability to selectively determine one's clusters also means that the analysis is open to manipulation. A code researcher could play with the results, trying different groupings of terms, in order to "optimize" the final numbers. If the overall results could be made higher by a factor of a million or more, there is an incredible incentive for just such manipulation to take place. How could such results ever be treated as objective or valid? Finally, for the reason that I stated in this paragraph's first sentence, the "improvements" in the overall statistical results are artificially produced.
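The "added to each, subtracted once" point can be made concrete by breaking the gain for the hypothetical six-term cluster into its parts. The following Python sketch is only bookkeeping on the example's numbers; the variable names are invented for illustration and the breakdown is not part of Roy's published method.

```python
SHIFT = 1.000  # shift for a cluster covering 10% of the matrix (from the example)

# Matrix R-values for the hypothetical terms B, C, D, E, F, G.
matrix_r = [0.700, -0.100, -0.800, -1.300, -0.400, -0.500]

# Terms that end up with a positive cluster R-value and so get counted.
counted = [r for r in matrix_r if r + SHIFT > 0]

added = SHIFT * len(counted)   # the shift is added once per counted term
subtracted = SHIFT             # but taken back only once per cluster

# Negative matrix R-values that were excluded before and are now summed in.
newly_counted = sum(r for r in counted if r <= 0)

# Net change in the summed R-value caused by the cluster analysis.
net_gain = added - subtracted + newly_counted

print(f"terms counted:     {len(counted)}")
print(f"shift added:       {added:.3f}")
print(f"shift subtracted:  {subtracted:.3f}")
print(f"negatives brought in: {newly_counted:.3f}")
print(f"net gain in R:     {net_gain:.3f}  (factor of {10 ** net_gain:.0f})")
```

In this example five terms each receive the shift while only one shift is removed, so the summed R-value still rises by 2.2 even after the four formerly excluded negative values are included.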

The Sid Roth Life Array Without Cluster Analysis

Since Roy uses cluster analysis on the Sid Roth Life Array (see http://ad2004.com/Biblecodes/articles/Statisticspt4.html) to calculate overall odds of 9.94 billion to 1 for this array, it is instructive to see what the calculated overall odds would be without cluster analysis.  As noted above, only six of the 73 terms have positive matrix R-values.  These are the central term 'Rothbaum' and the following five terms, listed with their matrix R-values:  (1) 'ben-Yaacov' (0.805); (2) 'thought' (0.899); (3) 'Messianic' (0.791); (4) 'Vision' (0.368); and (5) 'radio' (1.736).  'Ben-Yaacov' is part of Sid Roth's Hebrew name.  'Thought' is part of the title of a book he wrote, "They Thought For Themselves".  'Messianic', 'Vision', and 'radio' refer to the name of the radio program Sid Roth hosts, "Messianic Vision".

Add the matrix R-values of the five terms to obtain 0.805 + 0.899 + 0.791 + 0.368 + 1.736 = 4.599.  Antilog (4.599) = 39,719.  Divide this by the 13.4005 expected occurrences of 'Rothbaum' in the Tanach in the skip distance range of -4478 to +4478 to obtain 2964.  Divide this by the row-split correction of 6 to obtain 494.  In other words, without cluster analysis, the array has calculated overall odds of 494 to 1 instead of 9.94 billion to 1, a result approximately 20 million times lower.
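For readers who wish to verify these figures, the calculation can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, using the numbers given in this article; the constant names are made up for the illustration.

```python
# Matrix R-values of the five positive non-central terms (from the text above).
r_values = {
    "ben-Yaacov": 0.805,
    "thought": 0.899,
    "Messianic": 0.791,
    "Vision": 0.368,
    "radio": 1.736,
}

EXPECTED_CENTRAL = 13.4005   # expected occurrences of 'Rothbaum' (from the text)
ROW_SPLIT = 6                # row-split correction (from the text)
CLUSTER_ODDS = 9.94e9        # overall odds reported with cluster analysis

r_sum = sum(r_values.values())               # 4.599
raw_odds = 10 ** r_sum                       # antilog, about 39,719
odds = raw_odds / EXPECTED_CENTRAL / ROW_SPLIT
ratio = CLUSTER_ODDS / odds                  # how much larger the cluster figure is

print(f"sum of R-values: {r_sum:.3f}")
print(f"odds without cluster analysis: {odds:.0f} to 1")
print(f"cluster figure is about {ratio / 1e6:.0f} million times larger")
```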

The Minimum Threshold Of Plausibility Applied 

In the article just posted (The Minimum Threshold Of Plausibility) I explain why calculated overall matrix odds are unreliable and instead introduce a concept known as the minimum threshold of plausibility for an n-term array.  Does the recalculated Sid Roth life array (with only the six significant terms) meet this threshold?

Since there are 6 terms (the central term and five others), n = 6 and thus n-1 = 5.  Therefore, using the numbers above with the equation presented in the Threshold article, Antilog [4.599 - 5(0.763)]/(13.4055 X 6) = Antilog (0.784)/80.433 = 0.0756.  Since this is less than one, this 6-term matrix is judged not to be a plausible Bible code array.
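The threshold check can likewise be reproduced in Python. The sketch below only re-runs the expression as it is written in this section; the 0.763 constant is taken on trust from the Threshold article, and the variable names are hypothetical. Note that this section uses 13.4055 for the expected occurrences of 'Rothbaum' while the previous section uses 13.4005; either value gives the same result to three significant figures.

```python
R_SUM = 4.599              # summed matrix R-values of the five non-central terms
N_TERMS = 6                # central term plus five others
PER_TERM = 0.763           # per-term constant from the Threshold article
EXPECTED_CENTRAL = 13.4055 # expected occurrences of 'Rothbaum' as given here
ROW_SPLIT = 6              # row-split correction

# Antilog [R_SUM - (n-1)(0.763)] / (expected occurrences X row-split)
exponent = R_SUM - (N_TERMS - 1) * PER_TERM
score = 10 ** exponent / (EXPECTED_CENTRAL * ROW_SPLIT)

# A score below one means the array does not meet the threshold.
below_threshold = score < 1

print(f"exponent: {exponent:.3f}")
print(f"score:    {score:.4f}")
print("below the minimum threshold" if below_threshold else "meets the threshold")
```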

Conclusion

What is the conclusion?  Is it that Roy Reinhold has been deceitful?  No.  In order for one to be deceitful, one must knowingly propagate error.  Since Roy appears to believe in the validity of cluster analysis, I believe (based upon my analysis above) that he has been mistaken but not deceitful.  His effort to put the codes on a sound statistical footing was commendable, but I believe his cluster analysis method is too flawed to be useful.  I have been reluctant to write this article, not wishing to create a public rift between Roy and myself or between his site and this site.  However, given the flaws that I see in the proposed method of cluster analysis, I decided it was time to address the issue in a public forum.  The Bible codes and the statistical analysis of the codes are too important to let such matters go unanswered.