A Protocol For The Statistical Analysis Of Bible Code Arrays
Part 1
Author: Keith York
This article is property of the author and may not be reprinted or distributed without permission
April 22, 2000
This paper describes a protocol for statistical analysis of Bible code arrays. In Part 1 the need for a new protocol is first discussed and then the protocol is described. In Part 2 the protocol is demonstrated in the analysis of two different arrays. In Part 3 some guidelines for critical judgment in use of this protocol as well as in examining any Bible code findings are offered.
The Need For A New Protocol
The currently best known method of statistical analysis of Bible codes is the methodology presented by Doron Witztum, Eliyahu Rips, and Yoav Rosenberg (hereafter referred to as WRR) in the 1994 Statistical Science article "Equidistant Letter Sequences in the Book of Genesis". It uses a measure of compactness and proximity of ELS's which WRR call delta, as well as the concept of domains of minimality. It is a very thorough and powerful methodology that calculates numerical scores for ELS pairings in a Biblical text (such as Genesis or the Torah) and ranks those scores against those calculated for pairings of the same words found in randomly generated control texts.
However, while WRR's methodology is indeed powerful, I and other individuals have seen the need for a new method of statistical analysis, for a number of reasons. (1) Since I and many others possess neither the software to generate numerous control texts for comparison with the results in the search text nor the software to do the actual numerical calculations, the WRR method is not widely available for use. (2) The WRR method is very complicated and is difficult for people who are not professional mathematicians to understand. (3) Since it is out of the reach of so many, it is also ignored by many. I believe that this has contributed to a state of affairs where amateur codes researchers, feeling that statistical analysis is out of their reach, have proceeded without using any statistical analysis at all. As a result, many arrays have been published both here and elsewhere that are simply random word patterns produced by mere chance. Sadly, these researchers and their audiences have not known that these arrays were invalid. To help rectify this state of affairs, I have sought for some time to develop a means of statistical analysis that would be easy to understand and use; in other words, codes analysis for the masses.
A Dead End
The Appendix of The Truth About The Bible Codes presented a detailed example of a statistical analysis of a pair of ELS's. Since, as mentioned above, I do not have software that generates randomized control texts or performs the numerical calculations described by WRR, that paper used permuted spellings of ELS's as an alternate method of statistical analysis. My initial plan was to develop a protocol that would extend the permuted spellings methodology presented there to arrays containing more than two ELS's. However, I came to the realization that a protocol based upon permuted spellings, WRR's delta values, and domains of minimality would be impractical for two basic reasons. (1) It would be very time-consuming to perform. Here we face the same objection raised above. If a method of statistical analysis is too difficult to understand or use, it will be ignored, and its effectiveness will be diminished. (2) It would be very inexact because of limits on the number of permutations possible of ELS's. With WRR's method, one's number of control texts is presumably limited only by how long one wishes to tie up one's computer. One can generate either 100 or 1000 or 10,000 or even 1,000,000 control texts, depending upon the level of accuracy desired in the results.
With permuted spellings, the level of accuracy would be dependent upon the number of letters in each ELS. An ELS with n letters (where each letter is different) can have n! ("n factorial") different permuted spellings. Since half of these are simply the reverse spellings of the other half, there are n!/2 reversal-independent permuted spellings. (n! = 1 X 2 X 3 X ... X (n-1) X n. For example, 5! = 1X2X3X4X5 = 120. Thus there are 120/2 or 60 reversal-independent permuted spellings of a 5-letter ELS.) For a 4-letter ELS, there would only be 12 (1X2X3X4/2) reversal-independent permuted spellings. Thus any calculated probabilities for a 4-letter ELS would be in increments of one-twelfth. This would be very inexact. Much more exact probabilities could be computed for six-letter ELS's, which have 1X2X3X4X5X6/2 = 360 reversal-independent permuted spellings. However, making a search list of 360 terms, performing the searches, and calculating the results would be extremely laborious.
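The counts quoted above can be checked with a short Python sketch. The function names and the Latin placeholder word "ABCD" are illustrative assumptions only; they are not part of any codes software.

```python
from itertools import permutations
from math import factorial


def reversal_independent_count(n):
    """n!/2 spellings for an ELS of n distinct letters (n >= 2)."""
    return factorial(n) // 2


def reversal_independent_spellings(word):
    """List permuted spellings, keeping one of each spelling/reversal pair."""
    seen = set()
    result = []
    for letters in permutations(word):
        spelling = "".join(letters)
        # Skip a spelling if it, or its reversal, has already been kept.
        if spelling not in seen and spelling[::-1] not in seen:
            seen.add(spelling)
            result.append(spelling)
    return result


for n in (4, 5, 6):
    print(n, "letters:", reversal_independent_count(n), "spellings")

# "ABCD" is a stand-in for a 4-letter term with four different letters.
print(len(reversal_independent_spellings("ABCD")))
```

With only 12 spellings available for a 4-letter term, the smallest probability step such a test could resolve is 1/12, which is the coarseness described above.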
Description of the Protocol
Faced with these limitations, I sought a new approach, a way to measure the same underlying factors as WRR but in an approach more amenable to quicker calculations. First it should be noted that WRR's delta value analyzes two ELS's for geometrical compactness and proximity to each other in a particular two-dimensional array. Another way of analyzing for that is by measuring the area of a rectangle enclosing both ELS's. CodeFinder does this automatically for an entire array and calculates what are termed matrix R-values for each term for the whole array. These R-values are statistical measures named after Dr. Alex Rotenberg. Roy Reinhold has written a 3-part article, Statistics in Bible Code Programs, showing a "first step" in how these matrix R-values can be used in a statistical analysis of an entire array. That article, when combined with some ideas I had already developed, gave rise to the protocol I am about to present.
First some definitions must be given. As defined by CodeFinder, the text R-value for an ELS with d skip distance equals log (1/Etext), where Etext is the expected number of occurrences of this ELS in the -d to +d skip distance range in the defined search text. Note that the text R-value is defined by the search text and will differ according to the length of the search text. Thus it is important that, if an array is found entirely in the Torah, the search text be defined as the complete length of the Torah. If the array is found in the Tanach (but not entirely in the Torah), then the search text would be defined as the complete length of the Tanach.
If a term is expected to occur once in the text as an ELS in a given skip distance range, then Etext = 1 and log (1/Etext) = log (1/1) = 0. If the expected number of occurrences is greater than one, then Etext > 1, and 1/Etext < 1. Since the logarithm of a number between zero and one is a negative number, then a text R-value is always negative when an ELS is expected more than once in the text. If the expected number of occurrences is less than one, then Etext < 1, and 1/Etext > 1. Since the logarithm of a number greater than one is always a positive number, then a text R-value is always positive when an ELS is expected less than once in the text.
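The sign behavior just described can be seen in a few lines of Python. This is a minimal sketch that assumes the logarithm is base 10, consistent with the later use of "antilog" as a power of 10; the function name r_value is a hypothetical label, not a CodeFinder function.

```python
import math


def r_value(expected_occurrences):
    """R-value as log10(1 / E), where E is an expected number of occurrences."""
    return math.log10(1.0 / expected_occurrences)


# Illustrative expected counts only; they are not taken from any real search.
for expected in (25.0, 1.0, 0.04):
    print(f"E = {expected:<5} ->  R = {r_value(expected):+.3f}")
```

An expected count above one gives a negative R-value, exactly one gives zero, and a count below one gives a positive R-value.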
A second term to know is the matrix R-value. As defined by CodeFinder, the matrix R-value for an ELS with d skip distance equals log (1/Ematrix), where Ematrix is the expected number of occurrences of this ELS in the -d to +d skip distance range in a matrix of given area A. For the same reasons as given in the preceding paragraph, a matrix R-value which is a negative number means that the ELS is expected to occur more than once in the matrix in the -d to +d skip distance range. A matrix R-value which is a positive number means that the ELS is expected to occur less than once in the matrix in the -d to +d skip distance range. Since ELS's with negative matrix R-values are expected to occur at least once in the matrix anyway, only those with positive matrix R-values will contribute meaningfully to any calculated overall probability for the whole array.
Roy proposes adding the positive matrix R-values for an array to arrive at an overall R-value for the whole array. This approach shows the probability of finding an array containing each of the terms found anywhere inside a rectangle of given size. While this is a good start, what is really needed is a way of quickly analyzing each term in relation to its skip distance, compactness, and proximity to the central term. In other words, if an array has two occurrences of the same word as an ELS (both at the same skip distance and thus of equal geometrical compactness), the one that is found very close to the central term should be thought of as a better code than the one farther away. However, if one looks only at the matrix R-values of the entire array, the two will be considered as "equally good". To perform a more accurate analysis, each needs to be considered separately as being paired with the central term. Whereas CodeFinder automatically calculates matrix R-values for each term in an entire array, it can also do the same for the central term and a particular ELS when clicking on that particular ELS and choosing "View Report". If two ELS's are very compact and in very close proximity, the rectangle's area will be small. If two ELS's are spread out and/or far from each other in an array, the rectangle's area will be large. At first consideration, one might think that using these matrix R-values derived from the pairings of single ELS's with the central term would give a more accurate overall R-value for the array than the one derived from taking the matrix R-values of each term from the entire array examined at once. However, this approach will overestimate the statistical significance of arrays unless modified as described below.
One cannot perform a valid statistical analysis using simply this area that encompasses the central term and a single ELS. One must account for the possibility that (for example) an ELS found below and to the right of the central term would be just as valid as that same ELS found an equal distance above and to the left of the central term. To account for this, one must calculate a value I designate A' (area-prime). This is done by first designating the central, or primary, term. The "central" term is often the main subject of the array. In any case, the primary term must have a y-axis component and be close to vertical. Secondly, one finds the center of the central term. If the central term has an odd number of letters, its center is simply the middle letter. If the central term has an even number of letters, its center is geometrically halfway between the two middle letters. Thirdly, one finds the farthest corner from the central term's center of the smallest rectangle that encloses the central term and the ELS under consideration. Fourthly, one enlarges the rectangle as follows. However many rows there are above the center of the central term will equal the number of rows below that center, and vice-versa. However many columns there are to the left of the center of the central term will equal the number of columns to the right of that center, and vice-versa. Now the rectangle will be symmetrical above and below the center of the central term, as well as symmetrical to the right and left of the center of the central term. In other words, the center of the central term will now be the center of the rectangle.
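The rectangle expansion described above can be sketched in Python. The coordinate model here is an assumption made for illustration: each letter is given as a (row, column) position in the array grid, and the function name area_prime is hypothetical. Doubled coordinates are used so that a center lying halfway between two letters stays a whole number.

```python
def area_prime(central, other):
    """Area A' of the rectangle centered on the central term's center.

    central, other: lists of (row, col) letter positions in the array grid,
    in spelling order. The rectangle is the smallest one that is symmetric
    about the center of the central term and still encloses both terms.
    """
    n = len(central)
    # The two middle letters; they coincide when the term has odd length.
    mid_a, mid_b = central[(n - 1) // 2], central[n // 2]
    # Twice the center coordinates, so a half-way center remains an integer.
    center2_row = mid_a[0] + mid_b[0]
    center2_col = mid_a[1] + mid_b[1]

    letters = list(central) + list(other)
    # Twice the farthest row/column distance from the center.
    reach2_rows = max(abs(2 * r - center2_row) for r, _ in letters)
    reach2_cols = max(abs(2 * c - center2_col) for _, c in letters)

    # Mirroring the farthest reach about the center gives reach2 + 1 lines.
    return (reach2_rows + 1) * (reach2_cols + 1)


# Hypothetical layout: a vertical 5-letter central term in column 10,
# rows 0-4, and a horizontal 3-letter ELS in row 5, columns 12-14.
central_term = [(r, 10) for r in range(5)]
paired_els = [(5, 12), (5, 13), (5, 14)]
print(area_prime(central_term, paired_els))  # 7 rows x 9 columns = 63
```

Note that the mirrored rectangle may extend past the edge of the displayed array; this sketch makes no attempt to clip it.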
What is the probability that a given single occurrence of the ELS under consideration will appear either inside or outside this rectangle? The fraction of the search text that is inside the rectangle is A'/T, where T is the total number of letters in the search text. If the array is found entirely in the Torah, then T is the number of letters in the Torah (304,805). If the array is found in the Tanach (but not entirely in the Torah) then T is the number of letters in the Tanach (1,196,925). The fraction of the search text that is outside the rectangle is (T - A')/T. Here an approximation is made for the express purpose of simplifying the remaining calculations. It is assumed that the probability of a given ELS appearing inside the rectangle is A'/T, and that the expected number of occurrences in a rectangle of area A' is A'/T times the expected number of occurrences in the text. (An exact calculation can be performed using some of the equations contained in the book "Breakthrough" by Ed Sherman which can be ordered at http://www.biblecodedigest.com.)
If one examines a CodeFinder report carefully, it will be noticed that for each term, the matrix R-value minus the text R-value equals a constant number. This constant number is log (T/A), where T is the number of letters in the search text (T = 304,805 for the Torah; T = 1,196,925 for the Tanach) and A is the area of the matrix. Stated another way, matrix R-value = log (T/A) + text R-value. Since this is true for a matrix of area A, then one can use a calculator to quickly calculate a new matrix R-value for a rectangle of area A' as being R(A') = log (T/A') + text R-value, where text R-value is read from the CodeFinder report for each ELS; T is the number of letters in the search text (304,805 for the Torah or 1,196,925 for the Tanach); and A' is the area of the expanded rectangle as described two paragraphs ago. This is called Equation 1.1 and is highlighted below.
[Equation 1.1] R(A') = log (T/A') + text R-value
Expanding the rectangle for each term as paired with the central term (as described above) gives one the area A' (which equals the number of rows times the number of columns), and from this number and the text R-value automatically calculated by CodeFinder for each term, R(A') can be calculated for each term.
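Equation 1.1 is a one-line calculation, sketched here in Python with base-10 logarithms. The function and constant names are illustrative assumptions, and the sample text R-value and area are made-up numbers rather than values from a real report.

```python
import math

TORAH_LETTERS = 304_805      # T when the array lies entirely in the Torah
TANACH_LETTERS = 1_196_925   # T otherwise


def r_a_prime(text_r_value, a_prime, total_letters=TORAH_LETTERS):
    """Equation 1.1: R(A') = log(T/A') + text R-value."""
    return math.log10(total_letters / a_prime) + text_r_value


# Hypothetical term: text R-value of -1.250 and an expanded rectangle of 63 letters.
text_r = -1.250
area = 63
print(round(r_a_prime(text_r, area), 3))

# Consistency check with the approximation stated earlier: the expected count
# in the rectangle is (A'/T) times the expected count in the whole text.
expected_in_text = 10 ** (-text_r)
expected_in_rectangle = expected_in_text * area / TORAH_LETTERS
print(round(math.log10(1 / expected_in_rectangle), 3))
```

Both printed values agree, which is simply the statement that adding log (T/A') to the text R-value rescales the expected count from the whole text to the rectangle.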
Once R(A') is calculated for every term except the central term, sum all R(A') values which are positive numbers to arrive at a value R(sum). Note that R(A') comes from the pairing of an ELS with the central term. Since one cannot meaningfully pair the central term with itself, one must use the text R-value of the central term (which will be designated R0) to reflect the expected number of occurrences of the central term in the -d to +d skip distance range in the entire search text. This number R0 is added to R(sum) to achieve an overall R-value for the entire matrix, R(M). Antilog [R(M)] = the odds for an array of this row split to contain each of the terms when taking into account each ELS's skip distance, compactness, and proximity to the central term. Finally, one must also reflect the row split in the final answer by dividing this result by the row split number. (For example, if the central term has a skip distance of 1100 and the line length is 220, then the "row split" is 5. Dividing the result by 5 reflects the fact that row splits 1,2,3,4, or 5 could have been used to give a central term that is at least as geometrically compact as is actually seen.)
This is the case for several terms all found in one array at one line length. What if several terms were found to be paired with a single occurrence of the central term but were found in two or more arrays (i.e., two or more row splits)? (Note that this is an empirically discovered property of the codes: that changing the row split of the central term sometimes produces new arrays containing terms which provide additional information relevant to the central term.) Each term's R(A') would be calculated within its array and the resulting sum of all positive value R(A'), or R(sum), would be added to R0 as above to get R(M). The value of antilog [R(M)] would be calculated as above. However, since there were more than one different row split used, this result would be divided by the number of different row splits in which pairings were found (termed Sdif). This would then be divided by the maximum row split number in which a pairing was found (termed Smax). Thus the equation looks like this.
[Equation 1.2] Overall matrix odds = Antilog [R(sum) + R0]/(Sdif X Smax)
To give an example, say that the sum of all positive value R(A') is 4.355 and the text R-value of the central term R0 = 0.510. Then R(M) = 4.355 + 0.510 = 4.865. Antilog (4.865) = 73,282 (i.e., 10 to the 4.865 power is 73,282). Say also that the terms were found in two separate arrays with a single occurrence of the central term, and that the row splits of the central term in these two arrays were 3 and 5 (Sdif = 2 and Smax = 5). Then overall matrix odds = 73,282/(2X5) = 7328. This means that the odds of this array being by chance (taking into account how well each term is paired with the central term) are only 1 in 7328.
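Equation 1.2 can be written as a small Python function. This is a sketch under the definitions given above (Sdif is the number of different row splits used, Smax is the largest of them); the function name and the default arguments are assumptions for illustration.

```python
def overall_matrix_odds(r_sum, r0, s_dif=1, s_max=1):
    """Equation 1.2: antilog [R(sum) + R0] / (Sdif x Smax), antilog base 10."""
    return 10 ** (r_sum + r0) / (s_dif * s_max)


# Inputs from the example: R(sum) = 4.355, R0 = 0.510, row splits 3 and 5.
row_splits = [3, 5]
odds = overall_matrix_odds(
    r_sum=4.355,
    r0=0.510,
    s_dif=len(set(row_splits)),   # number of different row splits
    s_max=max(row_splits),        # largest row split used
)
print(f"about {odds:,.0f} to 1")
```

Deriving Sdif and Smax directly from the list of row splits, as done here, avoids entering a row split value where a count of row splits is intended.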
Using the Protocol
The protocol is used by implementing the following steps.
(1) Determine whether the array is completely contained in the Torah or is instead found in the Tanach (Christian Old Testament). If the array is completely contained within the Torah, then the Torah is used as the search text in the following steps and T = 304,805 in the following calculations. Otherwise, the Tanach is used as the search text and T = 1,196,925 in the following calculations. This distinction is in recognition of the fact that some codes proponents (myself not included) do not believe that valid code arrays exist in the Tanach outside the Torah.
(2) Determine the "central" or primary term and label it as E0 (for "ELS 0"). The "central" term is often the main subject of the array. In any case, the primary term must have a y-axis component and be close to vertical. From the CodeFinder "view report", record the text R-value of E0, the central term, and label it as R0. Note that in order for this number to be valid in the following calculations, the Search Area feature in the software must have been set to either the entire Torah or the entire Tanach rather than some shorter text span.
(3) Each of the other terms is labeled E1, E2, etc. From the CodeFinder "view report", record the text R-value of each term. As above, note that in order for this number to be valid in the following calculations, the Search Area feature in the software must have been set to either the entire Torah or the entire Tanach rather than some shorter text span.
(4) For each term E1, E2, etc. examine its pairing with the central term E0 in the array and mentally draw a rectangle around the pairing such that the number of rows above the central term's center is equal to the number of rows below it and the number of columns to the right of the central term's center is equal to the number of columns to its left. (In other words, the center of the central term also becomes the center of the rectangle.) Multiply the number of rows in this rectangle by the number of columns to obtain the area of the rectangle, which is termed A'. Note that the numerical value of A' will be different for each term.
(5) Once A' has been determined for each term's pairing with the central term, use Equation 1.1 to calculate R(A') = log (T/A') + text R-value, where text R-value is read from the CodeFinder report for each ELS as recorded in step 3; T is the number of letters in the search text as determined in step 1 (304,805 for the Torah or 1,196,925 for the Tanach); and A' is the area of the expanded rectangle for each ELS as determined in step 4. Since the numerical value of A' will be different for each term, the numerical value of log (T/A') will be different for each term. (In other words, do not make the mistake of using the same numerical value of log (T/A') for each and every term.)
(6) Sum together all the R(A') values which are positive numbers (and only those which are positive numbers) and call this sum R(sum). (Make sure that each occurrence of an ELS is only counted once. For example, if the same occurrence of an ELS is found in two different arrays of different row splits, then only one R(A') value should be used. However, if two or more separate occurrences of a particular word are found at different row splits, then one R(A') should be used for each one. The important thing is that no occurrence be double-counted.) Add the text R-value of the central term, R0, to R(sum) to obtain an overall R-value for the entire matrix, R(M). Antilog [R(M)] = Antilog [R(sum) + R0] = the odds for an array of this row split to contain each of the terms when taking into account each ELS's skip distance, compactness, and proximity to the central term.
(7) This value is plugged into Equation 1.2, Overall matrix odds = Antilog [R(sum) + R0]/(Sdif X Smax), where Sdif equals the number of different row splits in which pairings were found and Smax equals the maximum row split number in which a pairing was found. If all the terms were found in only one array, then Sdif = 1 and Smax is simply the row split number of the one array. This is the overall odds for the matrix (or matrices, if two or more having the same central term are being analyzed), taking into account the skip distance, compactness, and proximity of each term to the central term; the expected number of occurrences of the central term at its found absolute skip distance or less; and the possibility that the row split of the central term could have been any integer from one to that actually seen in the array(s). Notice that even though the technical measurements are different from those of WRR's method, the same underlying factors (compactness, proximity, near-minimality, and differing possible row splits) are being accounted for.
(8) The probability is simply one divided by the odds. Thus if the overall matrix odds is 10,000 [i.e., "10,000 to 1"], then the overall matrix probability is 1/10,000 = 0.0001 or 1EE-4, where "EE-4" means 10 to the -4 power. Expressed in percent terms, the stated probability is simply multiplied by 100%, so that 0.0001 is the same as 0.01%. As a convention, I will state final odds and probabilities to three significant digits.
(9) A useful statistical analysis report should have enough information for the reader to verify one's calculations. This should include (a) whether the search text is the Torah or Tanach; (b) the Hebrew and English spellings of each term for which a positive number R(A') was found; (c) the skip distance of each term presented in b; (d) the calculated A' for each of those terms; (e) the text R-value for each of those terms; (f) the calculated R(A') value for each of those terms; and (g) the values of Sdif and Smax. A allows one to know whether T = 304,805 or 1,196,925 for the calculations. B and C allow one to verify in CodeFinder whether the text R-value for each term is correct. B and C also allow one to generate an array in CodeFinder for each ELS paired with the central term and ensure that A' was correctly calculated. D allows one to verify the values of log (T/A') and ensure that R(A') was correctly calculated. The verification of each of these allows one to verify the overall matrix odds.
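Steps 5 through 8 can be strung together in a short Python sketch. Everything named here (the Term record, the analyze function, the sample values) is a hypothetical illustration; the text R-values and A' areas would in practice be read from the software report and measured from the array as described in steps 2 through 4.

```python
import math
from dataclasses import dataclass


@dataclass
class Term:
    label: str      # E1, E2, ... as assigned in step 3
    text_r: float   # text R-value recorded from the report
    a_prime: int    # area A' of the expanded rectangle from step 4


def analyze(terms, r0, total_letters, s_dif=1, s_max=1):
    """Return (per-term R(A') rows, R(sum), overall odds, probability)."""
    rows = []
    for term in terms:
        # Step 5, Equation 1.1: each term uses its own A'.
        r = math.log10(total_letters / term.a_prime) + term.text_r
        rows.append((term, r))
    # Step 6: only positive R(A') values contribute to R(sum).
    r_sum = sum(r for _, r in rows if r > 0)
    # Step 7, Equation 1.2.
    odds = 10 ** (r_sum + r0) / (s_dif * s_max)
    # Step 8: the probability is one divided by the odds.
    return rows, r_sum, odds, 1.0 / odds


# Made-up inputs, for illustration only.
sample_terms = [
    Term("E1", text_r=-2.0, a_prime=100),
    Term("E2", text_r=-4.0, a_prime=1000),
]
rows, r_sum, odds, probability = analyze(
    sample_terms, r0=0.5, total_letters=304_805, s_dif=1, s_max=5
)
for term, r in rows:
    note = "counted" if r > 0 else "not counted"
    print(f"{term.label}: A' = {term.a_prime}, R(A') = {r:.3f} ({note})")
# Final figures stated to three significant digits, per the convention in step 8.
print(f"R(sum) = {r_sum:.3f}, odds = {odds:.3g}, probability = {probability:.3g}")
```

Printing A', the text R-value inputs, and R(A') per term mirrors items (d) through (f) of the report contents in step 9, so a reader can retrace the arithmetic.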
A Slight Modification
In most cases, the actual number of occurrences of the central term in the search text in the -d to +d skip distance range is close to the expected number of occurrences. Sometimes, though, they may differ considerably. If one desires to use the actual number of occurrences of the central term rather than the expected number, Equation 1.2 can be modified to produce
[Equation 1.3] Modified overall matrix odds = Antilog [R(sum)]/(N0 X Sdif X Smax)
where N0 is the actual number of occurrences of the central term in the search text in the -d to +d skip distance range. If one uses this modified equation, then it should be stated in one's report that one has done so.
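A brief Python sketch of Equation 1.3 follows, with hypothetical function names and made-up inputs. It also illustrates why the two equations are related: since R0 = log (1/E0), where E0 is the expected number of occurrences of the central term, Antilog [R0] = 1/E0, so Equation 1.3 reduces to Equation 1.2 whenever the actual count N0 equals the expected count E0.

```python
def overall_matrix_odds(r_sum, r0, s_dif=1, s_max=1):
    """Equation 1.2, using the text R-value R0 of the central term."""
    return 10 ** (r_sum + r0) / (s_dif * s_max)


def modified_overall_matrix_odds(r_sum, n0, s_dif=1, s_max=1):
    """Equation 1.3, using the actual occurrence count N0 of the central term."""
    return 10 ** r_sum / (n0 * s_dif * s_max)


# Made-up inputs: R(sum) = 3.2, and a central term expected 4 times in the text.
r_sum = 3.2
expected_count = 4.0
r0 = -0.60206  # approximately log10(1/4)

print(round(overall_matrix_odds(r_sum, r0, s_max=3)))
print(round(modified_overall_matrix_odds(r_sum, n0=expected_count, s_max=3)))
# If the central term actually occurs 7 times, the modified odds are lower.
print(round(modified_overall_matrix_odds(r_sum, n0=7, s_max=3)))
```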