Generating a hyperlinked table of contents for an Excel workbook using VBA

If you’ve ever needed to generate a hyperlinked table of contents (TOC) for an Excel workbook with more than a few worksheets, you’ll appreciate that it can be quite a laborious task. The following snippet of VBA will iterate over the sheets in a workbook, generating a list of hyperlinks to each sheet in a new sheet at the start of the workbook called “TOC”. If a “TOC” worksheet already exists, the script will ask for a new worksheet name in which to place the table of contents. Typing the same name again will overwrite the contents of the sheet with the TOC. Simple as that.

The code’s fairly run-of-the-mill with one possible exception: the implementation of the SheetExists function. SheetExists takes a string and returns true or false depending on whether a sheet with the specified name already exists in the active workbook. The function uses VBA’s On Error Resume Next statement to achieve this. As per the MSDN documentation, this “specifies that when a run-time error occurs, control goes to the statement immediately following the statement where the error occurred where execution continues.”

As such, in the SheetExists, the call to Sheets(SheetName) would ordinarily throw a “Run-time error 9: Subscript out of range” if the sheet didn’t exist. But with On Error Resume Next the code continues uninterrupted and can check to establish whether the assignment to TestWorksheet was successful or not. If it was, the function return value is set to True, the TestWorksheet variable is set to Nothing (so it can be garbage collected) and On Error GoTo 0 is called to disable the On Error Resume Next. Technically this last step isn’t required as custom error handling is disabled automatically when exiting the Function, but it’s a good practice to get into to avoid leaving run-time errors going unchecked over vast swathes of code.

There’s a gist available here for clones, forks or comments.

Randomness in Excel

When writing formulae in a default installation of Excel, RAND() is the only function from which randomness can be directly derived. According to the documentation, RAND() generates an “evenly distributed random number greater than or equal to 0 and less than 1”. Installation of the Analysis ToolPak also makes the RANDBETWEEN() function available for use, which takes two integers as arguments and, again according to the documentation, returns a “random integer number between the numbers you specify”.

Note that in the case of RAND(), I say derived directly, because it’s also technically possible to use the old Microsoft-endorsed trick of using the sixth decimal place onwards from the NOW() function and multiplying up to the range 0–1. (However, for a given calculation cycle the NOW() technique returns the same random value for every cell in the spreadsheet making use of it, so it’s not always desirable.) Rounding out Excel’s random number generation capabilities are two additional methods, which are slightly more cumbersome but have the key property of allowing a random seed to be specified: the Random Number Generation tool in the Analysis ToolPak and the Rnd() function in Visual Basic for Applications (VBA).

Before we dive into these last two methods, it’s worth walking through the evolution of random number generation in Excel as it’s been a tortuous (indeed, some might say torturous) route getting to where we are at the time of writing.

A brief history of RAND()

Prior to Excel 2003, the RAND() function used a linear congruential generator (LCG) to generate pseudorandom numbers between 0 and 1 (with an initial X value of 0.5):

Xn+1 = (9821 * Xn + 0.211327) % 1

Barreto and Howland characterised the randomness of the pre-2003 Excel RAND() function in their book on Monte Carlo simulation in Excel using a “trapping” technique in which the values generated immediately after a value of between 0.7 and 0.7001 were recorded and plotted over the whole 0–1 range:

Barreto and Howland Figure 9.2.2

Barreto and Howland Figure 9.2.2

Predictably, this failed the then-current DIEHARD randomness test by George Marsaglia and can trivially be shown to be a relatively poor random number generator (RNG). From Excel 2003 onwards, Microsoft switched RAND() to use an implementation of the Wichmann-Hill algorithm. Wichmann-Hill is also a relatively simple generator, based on a summation of three LCGs:

Unfortunately, the first Wichmann-Hill implementation in Excel was found to be incorrect as it would eventually generate negative numbers (an impossibility according to the 1982 Wichmann-Hill paper). A 2004 bug fix was intended to fix the problem. However, McCullough subsequently used “Zeisel recursion” (so named after a 1986 publication by Zeisel, in which Chinese Remainder Theorem was employed to rewrite the Wichmann-Hill RNG as a single LCG) to demonstrate that the updated generator was still not consistent with the Wichmann-Hill RNG as published. Wichmann-Hill should have a periodicity of 6.95 × 1012 (not the 2.78 × 1013 claimed in the original paper) but, as McCullough and Heiser note, the actual periodicity of the Excel implementation is unknown.

While even the incorrect Excel implementation of Wichmann-Hill passed the DIEHARD test, it failed some aspects of all three of L’Ecuyer and Simard’s (PDF) TESTU01 test suites (SmallCrush, Crush, and BigCrush) which were designed to supersede DIEHARD. Specifically, L’Ecuyer and Simard noted that Wichmann-Hill failed 2, 12 and 22 of the SmallCrush, Crush, and BigCrush test suites, respectively.

If the original RAND() documentation is to be believed, the Wichmann-Hill generator is still in use as of Excel 2010, which is listed in the “Applies to” section. However, the Office 2010 documentation notes that the “RAND function now uses a new random number algorithm” and Guy Mélard reports that there is “semi-official” confirmation that Excel 2010 and later in fact use the Mersenne Twister algorithm to power RAND().

The Mersenne Twister has a period length of 219937 − 1, passes the full DIEHARD test suite, all of the SmallCrush tests and all but 2 each of the Crush and BigCrush tests, according to L’Ecuyer and Simard. While many improvements have been made to RNGs since the publication of the Mersenne Twister (even within the same class of generator), using the Mersenne Twister at least brings Excel to parity with many modern programming environments, including the default RNGs in SPSS and R.

However, the exact implementation details of RAND() and RANDBETWEEN() are immaterial for applications requiring reproducible randomness as the functions cannot be seeded. Seeded randomness can only be achieved using one of the two aforementioned approaches: the Random Number Generation tool in the Analysis ToolPak or VBA. In brief (and highly informal) testing, the Random Number Generation tool in the analysis ToolPak was unbearably slow in generating uniformly distributed random numbers, generating around 36 random numbers per second on a 2.3 GHz Core i7. So that leaves VBA.

Generating seeded random numbers using VBA

Based on Microsoft’s documentation, the Rnd() function in Basic has always been based on a power-of-two modulus LCG. In Microsoft Basic versions prior to Visual Basic 1.0, this took the form:

Xn+1 = (214,013 * Xn + 2,531,011) % 224

From Visual Basic versions 1.0 to 6.0, the operands were updated:

Xn+1 = (1,140,671,485 * Xn + 12,820,163) % 224

Note that the performance of the VB 1.0-6.0 LCG was also characterised by Barreto and Howland in the figure, above. It has a periodicity of 16.106. Interestingly, Microsoft notes that the code used to implement the LCG underpinning Rnd() could not be implemented in VBA itself due to the use of an unsigned long data type that’s not available in VBA.

Unfortunately there isn’t much Rnd() documentation for VB versions later than 6, but between L’Ecuyer and Simard (who refer to it as a “toy generator”, in part because of its power-of-two modulus) and Mélard, we can surmise that this implementation is still present in VBA in Excel 2010 (and possibly later).

Thankfully, while Rnd() is an extremely unsophisticated RNG, it is at least easy to use and seed. There are two key functions related to random number generation in VBA: Rnd and Randomize. The documentation states that Randomize should be used to initialize the generator, but in practice, the call isn’t necessary and seeding the generator is as simple as calling Rnd() with a single negative integer, as illustrated in the table below. Making repeated calls to Rnd() with the same negative argument yields exactly the same value and subsequent calls without any argument will continue the sequence using the seed from the first call.

If number is Rnd generates
Less than zero The same number every time, using number as the seed.
Greater than zero The next random number in the sequence.
Equal to zero The most recently generated number.
Not supplied The next random number in the sequence.

If random values are required for use in Excel formulae, techniques such as this can be used to quickly export relatively large arrays of random numbers out to a worksheet for later reference.

Replacing cell value assignment loops with single array assignments in VBA

If you’re shifting large amounts of data around between VBA variables and Excel worksheets, there are substantial performance gains that can be achieved by making relatively straightforward changes to your VBA, especially if you’re faced with code similar to the following:

If the intention is to simply move data from VBA to a worksheet in a single, atomic operation, the very first (and simplest) performance optimization is to switch off automatic recalculation before entering the loop:

In the above case of generating 2,500 (in this case, random) values in VBA and writing them out to the RandomValues sheet, switching off automatic recalculation reduced the mean runtime over five runs from 35.22s (standard deviation [SD] = 0.367s) to 0.270s (SD = 0.010s), a performance improvement of just over two orders of magnitude.

But what happens if we increase the number of values to, say, 250,000 in a 500×500 grid? In this case, even with automatic recalculation switched off, we see the runtime of the above code increasing roughly linearly back up to 26.62s (SD = 0.317s). The major bottleneck has now shifted to the cell value assignment, which is being called 250,000 times in the inner loop. The easiest way to alleviate this is to assign the values to an array in the inner loop, then assign the array directly to the desired range:

This once again gives us a performance improvement in excess of two orders of magnitude relative to the manual calculation single-cell assignment loop. With 100x more values than in the first two examples, the runtime dropped back down to 0.138s (SD = 0.004s). Since the VBA is no longer interacting with the worksheet in the inner loop, we could actually leave automatic recalculation on in this final example, although there is still a slight performance hit relative to switching on manual recalculation: leaving automatic recalculation on increased the runtime to 0.163s (SD = 0.004s).

If we examine these three approaches together on a modest number of data points, say 1,000 in a 10×100 grid, we can see the substantial incremental effects on the performance of moving the data out onto the worksheet:

SD, standard deviation; n=5 for each technique
Technique Mean runtime (ms) SD (ms)
Automatic recalculation with individual cell assignment 13,851.80 292.54
Manual recalculation with individual cell assignment 106.04 3.48
Manual recalculation with single array assignment 0.86 0.14

The improvement is so dramatic that if you do have any VBA that’s performing a similar task, it’s definitely worth investigating both of these relatively straightforward optimizations.