Randomness in Excel

WARNING: unbalanced footnote start tag short code found.

If this warning is irrelevant, please disable the syntax validation feature in the dashboard under General settings > Footnote start and end short codes > Check for balanced shortcodes.

Unbalanced start tag short code found before:

“171 * *s1) % 30269); *s2 =”

When writing formulae in a default installation of Excel, RAND() is the only function from which randomness can be directly derived. According to the documentation, RAND() generates an “evenly distributed random number greater than or equal to 0 and less than 1”. Installation of the Analysis ToolPak also makes the RANDBETWEEN() function available for use, which takes two integers as arguments and, again according to the documentation, returns a “random integer number between the numbers you specify”.

Note that in the case of RAND(), I say derived directly, because it’s also technically possible to use the old Microsoft-endorsed trick of using the sixth decimal place onwards from the NOW() function and multiplying up to the range 0–1. (However, for a given calculation cycle the NOW() technique returns the same random value for every cell in the spreadsheet making use of it, so it’s not always desirable.) Rounding out Excel’s random number generation capabilities are two additional methods, which are slightly more cumbersome but have the key property of allowing a random seed to be specified: the Random Number Generation tool in the Analysis ToolPak and the Rnd() function in Visual Basic for Applications (VBA).

Before we dive into these last two methods, it’s worth walking through the evolution of random number generation in Excel as it’s been a tortuous (indeed, some might say torturous) route getting to where we are at the time of writing.

A brief history of RAND()

Prior to Excel 2003, the RAND() function used a linear congruential generator (LCG) to generate pseudorandom numbers between 0 and 1 (with an initial X value of 0.5):

Xn+1 = (9821 * Xn + 0.211327) % 1

Barreto and Howland characterised the randomness of the pre-2003 Excel RAND() function in their book on Monte Carlo simulation in Excel using a “trapping” technique in which the values generated immediately after a value of between 0.7 and 0.7001 were recorded and plotted over the whole 0–1 range:

Barreto and Howland Figure 9.2.2

Barreto and Howland Figure 9.2.2

Predictably, this failed the then-current DIEHARD randomness test by George Marsaglia and can trivially be shown to be a relatively poor random number generator (RNG). From Excel 2003 onwards, Microsoft switched RAND() to use an implementation of the Wichmann-Hill algorithm. Wichmann-Hill is also a relatively simple generator, based on a summation of three LCGs:

# include 

float wichmann_hill_random (int *s1, int *s2, int *s3) {

	float value;

	*s1 = ((171 * *s1) % 30269);
	*s2 = ((172 * *s2) % 30307);
	*s3 = ((170 * *s3) % 30323);

	value = fmod((float) (*s1) / 30269.0 
	           + (float) (*s2) / 30307.0 
	           + (float) (*s3) / 30323.0, 1.0);

	return value;
	
}

Unfortunately, the first Wichmann-Hill implementation in Excel was found to be incorrect as it would eventually generate negative numbers (an impossibility according to the 1982 Wichmann-Hill paper). A 2004 bug fix was intended to fix the problem. However, McCullough subsequently used “Zeisel recursion” (so named after a 1986 publication by Zeisel, in which Chinese Remainder Theorem was employed to rewrite the Wichmann-Hill RNG as a single LCG) to demonstrate that the updated generator was still not consistent with the Wichmann-Hill RNG as published. Wichmann-Hill should have a periodicity of 6.95 × 1012 (not the 2.78 × 1013 claimed in the original paper) but, as McCullough and Heiser note, the actual periodicity of the Excel implementation is unknown.

While even the incorrect Excel implementation of Wichmann-Hill passed the DIEHARD test, it failed some aspects of all three of L’Ecuyer and Simard’s (PDF) TESTU01 test suites (SmallCrush, Crush, and BigCrush) which were designed to supersede DIEHARD. Specifically, L’Ecuyer and Simard noted that Wichmann-Hill failed 2, 12 and 22 of the SmallCrush, Crush, and BigCrush test suites, respectively.

If the original RAND() documentation is to be believed, the Wichmann-Hill generator is still in use as of Excel 2010, which is listed in the “Applies to” section. However, the Office 2010 documentation notes that the “RAND function now uses a new random number algorithm” and Guy Mélard reports that there is “semi-official” confirmation that Excel 2010 and later in fact use the Mersenne Twister algorithm to power RAND().

The Mersenne Twister has a period length of 219937 − 1, passes the full DIEHARD test suite, all of the SmallCrush tests and all but 2 each of the Crush and BigCrush tests, according to L’Ecuyer and Simard. While many improvements have been made to RNGs since the publication of the Mersenne Twister (even within the same class of generator), using the Mersenne Twister at least brings Excel to parity with many modern programming environments, including the default RNGs in SPSS and R.

However, the exact implementation details of RAND() and RANDBETWEEN() are immaterial for applications requiring reproducible randomness as the functions cannot be seeded. Seeded randomness can only be achieved using one of the two aforementioned approaches: the Random Number Generation tool in the Analysis ToolPak or VBA. In brief (and highly informal) testing, the Random Number Generation tool in the analysis ToolPak was unbearably slow in generating uniformly distributed random numbers, generating around 36 random numbers per second on a 2.3 GHz Core i7. So that leaves VBA.

Generating seeded random numbers using VBA

Based on Microsoft’s documentation, the Rnd() function in Basic has always been based on a power-of-two modulus LCG. In Microsoft Basic versions prior to Visual Basic 1.0, this took the form:

Xn+1 = (214,013 * Xn + 2,531,011) % 224

From Visual Basic versions 1.0 to 6.0, the operands were updated:

Xn+1 = (1,140,671,485 * Xn + 12,820,163) % 224

Note that the performance of the VB 1.0-6.0 LCG was also characterised by Barreto and Howland in the figure, above. It has a periodicity of 16.106. Interestingly, Microsoft notes that the code used to implement the LCG underpinning Rnd() could not be implemented in VBA itself due to the use of an unsigned long data type that’s not available in VBA.

Unfortunately there isn’t much Rnd() documentation for VB versions later than 6, but between L’Ecuyer and Simard (who refer to it as a “toy generator”, in part because of its power-of-two modulus) and Mélard, we can surmise that this implementation is still present in VBA in Excel 2010 (and possibly later).

Thankfully, while Rnd() is an extremely unsophisticated RNG, it is at least easy to use and seed. There are two key functions related to random number generation in VBA: Rnd and Randomize. The documentation states that Randomize should be used to initialize the generator, but in practice, the call isn’t necessary and seeding the generator is as simple as calling Rnd() with a single negative integer, as illustrated in the table below. Making repeated calls to Rnd() with the same negative argument yields exactly the same value and subsequent calls without any argument will continue the sequence using the seed from the first call.

If number is Rnd generates
Less than zero The same number every time, using number as the seed.
Greater than zero The next random number in the sequence.
Equal to zero The most recently generated number.
Not supplied The next random number in the sequence.

If random values are required for use in Excel formulae, techniques such as this can be used to quickly export relatively large arrays of random numbers out to a worksheet for later reference.

Explicit, comprehensible covariance matrices in Java

As part of my code readability push in a recent work project (see also here), I’ve just tided up the code responsible for creating, updating, and referring to covariance matrices of patient characteristics and treatment effects. I first searched the web for examples of how covariance matrices had been implemented previously in Java and found the Covariance class in the Apache Commons Mathematics Library which, as it happens, we were already using for its MersenneTwister implementation and a few subclasses of AbstractRealDistribution.

However, looking at the data structures involved, the Covariance class will calculate and return a covariance matrix as an instance implementing the RealMatrix interface (and accepts the raw data as either a two-dimensional array of doubles or as a RealMatrix). Looking at the Covariance source code, the current implementation returns an instance of BlockRealMatrix, which by default is divided into a series of 52 x 52 blocks (so 2,704 double values per block), which are flattened in row major order into single dimensional arrays, which are themselves stored (again in row major order) in an outer array. This is clearly an extremely high performance approach, with three blocks designed to fit into 64Kb of L1 cache (3 × 2,704 × 8 / 1,024 = 63.4Kb), allowing block multiplication to be conducted entirely within the cache (i.e. C[i][j] += A[i][k] × B[k][j]).

This is all well and good, but passing 2D arrays or RealMatrix instances around runs contrary to my current goal of writing readable code. Not only that, but I want to store Pearson product-moment correlation coefficients alongside each covariance parameter so the model can decide later whether or not to covary parameters whose correlation is not significant. So I’d really need two 2D matrices to retrieve both the covariance and the p-value. Given that I’m dealing with relatively small matrices and not doing anything too computationally intensive with them, I decided to write a thin wrapper class around an EnumMap of EnumMaps, in which the Enum for the inner and outer maps consists of elements describing each covaried characteristic and the inner EnumMaps map each Enum element to a small tuple-like nested class that stores the covariance value and p-value in public instance variables.

Since the model uses a few different covariance matrices (including the aforementioned baseline patient characteristics and treatment effects, for example), I wrote a generic class that takes the Enum class in the constructor, such that it can be instantiated as follows:

CovarianceMatrix<PatientCharacteristic> matrix = new CovarianceMatrix<>(PatientCharacteristic.class);

The constructor then creates the 2D “matrix” of the nested, tuple-like ValuePValuePair classes with all covariance and p-values set to 0.0. This is fairly memory inefficient in that we immediately have n2 instances of ValuePValuePair (where n is the number of enumerable fields), but it helped to get this up and running quickly without hitting any NullPointerExceptions or having to perform null checks. Once the data structure’s set up, values can be added to the matrix as follows (with a couple of static final constants added at the top of the class to improve the readability of the code where it counts):

static final PatientCharacteristic HBA1C = PatientCharacteristic.HBA1C;
static final PatientCharacteristic TOTAL_CHOLESTEROL = PatientCharacteristic.TOTAL_CHOLESTEROL;

matrix.setCovarianceAndPValue(HBA1C, TOTAL_CHOLESTEROL, 0.28, 0.0001);

As a quick aside, it’s worth noting that, since the ability to refer to matrix elements numerically has (apparently) been lost, it initially looks as though a lot of strict matrix-like functionality may have been lost as well. However, since the EnumMap of EnumMap approach relies on the getEnumConstants() method (which always returns Enum elements in the order in which they’re specified) to generate the matrix, the numeric index of each row (and column) can be derived from any given Enum element as follows:

int index = java.util.Arrays.asList(PatientCharacteristic.class.getEnumConstants()).indexOf(PatientCharacteristic.HBA1C);

Using this approach, “lossless” conversion from a RealMatrix instance to this data structure and back would be possible (assuming initial knowledge of which rows and columns map to which characteristics), but even without the numeric references, this meets the current requirement, which is simply to use pre-calculated covariance matrices from SAS in Java in a clear and concise manner.

The full EnumMap wrapper class looks like this:

import java.util.EnumMap;

public class CovarianceMatrix<K extends Enum<K>> {
	
	private EnumMap<K, EnumMap<K, ValuePValuePair>> matrix;
	
	private static final class ValuePValuePair {
		public double value;
		public double pValue;

		public ValuePValuePair(double value, double pValue) {
			this.value = value;
			this.pValue = pValue;
		}
	}
	
	public CovarianceMatrix(Class<K> characteristics) {
		
		this.matrix = new EnumMap<>(characteristics);
		
		for (K initialCharacteristic : characteristics.getEnumConstants()) {
			for (K secondCharacteristic : characteristics.getEnumConstants()) {
				if (this.matrix.get(initialCharacteristic) != null) {
					this.matrix.get(initialCharacteristic).put(secondCharacteristic, new ValuePValuePair(0.0, 0.0));
				} else {
					EnumMap<K, ValuePValuePair> newValue = new EnumMap<>(characteristics);
					newValue.put(secondCharacteristic, new ValuePValuePair(0.0, 0.0));
					this.matrix.put(initialCharacteristic, newValue);
				}
			}
		}
	}
	
	public void setCovarianceAndPValue(K row, K column, double covariance, double pValue) {
		this.matrix.get(row).get(column).value = covariance;
		this.matrix.get(row).get(column).pValue = pValue;
	}

	public void setCovariance(K row, K column, double covariance) {
		this.matrix.get(row).get(column).value = covariance;
	}
	
	public void setCovariancePValue(K row, K column, double pValue) {
		this.matrix.get(row).get(column).pValue = pValue;
	}
	
	public double getCovariance(K row, K column) {
		return this.matrix.get(row).get(column).value;
	}
	
	public double getCovariancePValue(K row, K column) {
		return this.matrix.get(row).get(column).pValue;
	}
	
}

The next step is to perform some profiling on this in situ and probably switch over to instantiating the ValuePValuePair instances lazily with the appropriate null checks added in to the getters. But for now, the class is working very well and achieves the goal of making the clinical modelling portions of the code much more readable.

Replicating Excel’s logarithmic curve fitting in R

For a new work project, we’ve just been provided with a Kaplan-Meier curve showing kidney graft survival over 12 months in two groups of patients. As the data is newly generated, we don’t yet have access to the raw data and there are too many patients in the samples to see any discrete steps on the K-M curve, meaning that we can’t extract data on time to individual events. Since we wanted to get a rough idea of how our budget impact analysis (the main focus of the project) might look with these data in place, we traced the data from the K-M curve using WebPlotDigitizer by Ankit Rohatgi.

Looking at the shape of the data we had and knowing that longer-term kidney graft survival (from donors after brain death) typically looks like that in Figure 1, we opted for a simple logarithmic model of the data:

$$S(t) = \beta ln(t) + c$$

Figure 1 Long-term graft survival after first adult kidney-only transplant from donors after brain death, 2000–2012. (NHS Blood and Transplant (NHSBT) Organ Donation and Transplantation Activity Report 2013/14.)
NHSBT Kidney Survival

It’s very easy to generate a logarithmic fit in Excel by selecting the “Add Trendline…” option for a selected series in a chart, selecting the “Logarithmic” option and checking the “Display equation on chart” and “Display R-squared value on chart”. The latter options display the coefficient and intercept values, and the R-squared value (using the Pearson product-moment correlation coefficient), respectively.

However, since we’re ultimately going to be using R for the analysis (when the raw K-M data become available and we use the R survival package to fit a Weibull model, or similar, to the data), I thought it would make sense to also use R to give us our rough logarithmic fit of the data. As one might expect, it’s very straightforward to replicate the results that are produced in Excel using just a few lines of R:

t <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
survival <- c(1, 0.97909, 0.97171, 0.96186, 0.95694, 0.95387, 0.95264, 0.9520, 0.94956, 0.94526, 0.94279, 0.94033, 0.94033)
s_t <- data.frame(t, survival)
log_estimate <- lm(survival~log(t), data=s_t)
summary(log_estimate)

Running this gives the following output, which tallies exactly with the intercept, log(t) parameters and R squared values reported in Excel for the same data:

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0033502 -0.0016374  0.0000542  0.0014951  0.0037662 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.996234   0.001636  609.13  < 2e-16 ***
log(t)      -0.022377   0.000868  -25.78 3.46e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.002302 on 11 degrees of freedom
Multiple R-squared: 0.9837,	Adjusted R-squared: 0.9822 
F-statistic: 664.6 on 1 and 11 DF,  p-value: 3.458e-11

And if we overlay the resulting curve over the NHSBT data in Figure 1, we can see that the simple logarithmic fit matches the clinical reality pretty well when projected over a five-year time horizon:

Figure 2 NHSBT long-term kidney graft survival with logarithmic fit to K-M data overlaid in black
NHSBT Kidney Survival with Overlay

We can then use this approximation of the K-M curve to derive a survival function that we’ll use in the model until we have access to the full data set. A nice, quick approach that certainly couldn’t be accused of overfitting.