Validating Excel Markov models in R

As part of my ongoing campaign to stop using Microsoft Excel for tasks to which it isn’t well suited, I’ve recently started validating all Excel-based Markov implementations using R. Regrettably, Excel is still the preferred “platform” for developing health economic models (hence R being relegated to validation work), but recreating Markov models in R starkly illustrates how much time could be saved if more reimbursement or health technology assessment agencies would consider models written in non-proprietary languages rather than proprietary file formats such as Excel.

As a brief re-cap, strict Markov chains are memoryless, probability-driven processes in which transitions through a defined state space, S, are driven by a transition matrix, briefly as follows:

$$\Pr(X_{n+1}=x\mid X_1=x_1, X_2=x_2, \ldots, X_n=x_n) = \Pr(X_{n+1}=x\mid X_n=x_n)$$

where $$S = \{X_1 \ldots X_n\}$$

A typical Markov modelling approach in Excel is to start with a transition matrix as a contiguous range at the top of sheet, lay out the initial state distributions immediately below it, and model state transitions using a series of rows containing SUMPRODUCT() formulae, which multiply and sum the state distribution in the previous row with the relevant transition probability row. As an illustrative example, a simple 5-state Markov model running over 120 cycles (say, 10 years with a monthly cycle length) would result in 600 SUMPRODUCT() formulae. Despite the ease with which the cells can be populated with Excel’s AutoFill functionality, making any change to the model (e.g. adding a state) requires the whole range to be updated every time. The use of per-row summation checks with a Machine epsilon (2^-1022) can highlight errors in the formulae, but this requires yet more formulae to update should the model structure change.

Conversely, in R, once the transition matrix and initial states have been defined, implementing a similar Markov model requires just one line of code (with the use of the Matrix exponential package):

install.packages("expm")

tm <- matrix(c(
0.9,0.1,0,0,0,
0,0.9,0.1,0,0,
0,0,0.9,0.1,0,
0,0,0,0.9,0.1,
0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

print(initial %*% (tm %^% 120))

[,1]         [,2]         [,3]        [,4]     [,5]
[1,] 3.229246e-06 4.305661e-05 0.0002846521 0.001244035 0.998425


This simply prints out the final state distribution after 120 cycles, but the entire Markov trace can be printed using R’s built-in sapply() and t() functions without a dramatic increase in the complexity of the code:

print(t(sapply(1:120, function(x) initial %*% (tm %^% x))))

[,1]         [,2]         [,3]        [,4]        [,5]
[1,] 1.000000e+00 0.000000e+00 0.0000000000 0.000000000 0.000000000
[2,] 9.000000e-01 1.000000e-01 0.0000000000 0.000000000 0.000000000
...


At this point in Excel, the state distributions might subsequently be used to tally up costs and quality-adjusted life expectancy (in quality-adjusted life years or QALYs) for each state. This would require another two identically-sized ranges of cells to capture cost and QALY estimates for each state, trebling the number of formulas to 1,800. In R, adding in and summing up state costs is much more straightforward:

costs <- c(300, 350, 400, 200, 0)
timehorizon <- 120
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
discount <- 0.035

print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x), c(1), sum))
print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x), c(1), sum))

[1] 2699.991 3499.853 3998.790 1997.099    0.000
[1] 7.649975 7.999664 7.497731 8.487670 0.000000

To output the top-line health economic outcomes over the entire Markov simulation, the code can then actually be simplified (since we don’t need cycle-by-cycle results) to sum the costs and the quality-adjusted life expectancy:

print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)))
print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)))

[1] 6293.488
[1] 16.01336

And finally, with the introduction of a second transition matrix, a complete, two-arm, five-state Markov model with support for discounting and an arbitrary time horizon can be implemented in 10 lines of R as follows:

tm <- matrix(c(
0.9,0.1,0,0,0,
0,0.9,0.1,0,0,
0,0,0.9,0.1,0,
0,0,0,0.9,0.1,
0,0,0,0,1), byrow=TRUE, nrow=5)

tm2 <- matrix(c(
0.91,0.09,0,0,0,
0,0.91,0.09,0,0,
0,0,0.91,0.09,0,
0,0,0,0.91,0.09,
0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

costs <- c(300, 350, 400, 200, 0)
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
timehorizon <- 120
discount <- 0.035

cost <- c(
sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)),
sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * costs * 1/(1+discount)^x))
)
qale <- c(
sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)),
sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * utilities * 1/(1+discount)^x))
)
print( (cost[1] - cost[2])/(qale[1] - qale[2]) )

[1] 395.0946

Having all of this functionality in 10 easily readable lines of code strikes an excellent balance between concision and clarity, and is definitely preferable to 1,800+ formulae in Excel with equivalent functionality (or even to a similar Markov model laid out in TreeAge). The above R code, while still slightly contrived in its simplicity, can be printed on a single sheet of paper and, crucially, can be very easily adapted and modified without needing to worry about filling ranges of cells with formulae including a mix of relative and absolute cell referencing.

While R is not necessarily the very best language for health economic modelling, given how straightforwardly a Markov model (as but one example) can be implemented in the language, it wouldn’t be a bad candidate to be supported by healthcare technology assessment agencies for the purposes of economic evaluation. More importantly, regardless of any specific merits and shortcomings of R, the adoption or acceptance of any similar language would represent an excellent first step away from the current near-ubiquitous use of proprietary modelling technologies or platforms that are, in many cases, ill-suited to the task.

Non-volatile summation of contiguous dynamic ranges in Excel

This is just a very quick post, more as a reminder to my future self than anything else, but when summing a contiguous, dynamically-sized range in Excel, it can be orders of magnitude faster to use INDEX in place OFFSET, the latter of which is volatile and is therefore recalculated in response to changes to any other value on the sheet, regardless of its dependencies. This has already been explored by others, but just for future reference, always try to replace:

=SUM(A1:OFFSET(A1,rowOffset,0))

with:

=SUM(A1:INDEX(A:A,1+rowOffset))

That’s all for now!

The code’s fairly run-of-the-mill with one possible exception: the implementation of the SheetExists function. SheetExists takes a string and returns true or false depending on whether a sheet with the specified name already exists in the active workbook. The function uses VBA’s On Error Resume Next statement to achieve this. As per the MSDN documentation, this “specifies that when a run-time error occurs, control goes to the statement immediately following the statement where the error occurred where execution continues.”

As such, in the SheetExists, the call to Sheets(SheetName) would ordinarily throw a “Run-time error 9: Subscript out of range” if the sheet didn’t exist. But with On Error Resume Next the code continues uninterrupted and can check to establish whether the assignment to TestWorksheet was successful or not. If it was, the function return value is set to True, the TestWorksheet variable is set to Nothing (so it can be garbage collected) and On Error GoTo 0 is called to disable the On Error Resume Next. Technically this last step isn’t required as custom error handling is disabled automatically when exiting the Function, but it’s a good practice to get into to avoid leaving run-time errors going unchecked over vast swathes of code.

There’s a gist available here for clones, forks or comments.

Sub GenerateLinkedTOCFromWorkSheetNames()

Dim ProposedTOCWorksheetName As String
Dim NewTOCWorksheetName As String
Dim CurrentWorksheet As Worksheet
Dim Count As Integer

ProposedTOCWorksheetName = "TOC"
NewTOCWorksheetName = "TOC"
RowCounter = 2

Application.ScreenUpdating = False

Do While SheetExists(NewTOCWorksheetName)
NewTOCWorksheetName = Application.InputBox( _
Prompt:="A sheet named '" & ProposedTOCWorksheetName & "' already exists. " & _
"Enter a new sheet name or type '" & ProposedTOCWorksheetName & "' to overwrite.", _
Type:=2)

If NewTOCWorksheetName = ProposedTOCWorksheetName Then
Exit Do
End If

ProposedTOCWorksheetName = NewTOCWorksheetName
Loop

If SheetExists(NewTOCWorksheetName) Then
Sheets(NewTOCWorksheetName).Cells.Clear
Else
Worksheets(1).Name = NewTOCWorksheetName
End If

For Each CurrentWorksheet In Worksheets
If CurrentWorksheet.Name <> NewTOCWorksheetName Then
Sheets(NewTOCWorksheetName).Range("B" & RowCounter).Value = CurrentWorksheet.Name

Anchor:=Sheets(NewTOCWorksheetName).Range("B" & RowCounter), _
SubAddress:="'" & CurrentWorksheet.Name & "'!A1", _
TextToDisplay:=CurrentWorksheet.Name, _
ScreenTip:=CurrentWorksheet.Name

RowCounter = RowCounter + 1
End If
Next

Application.ScreenUpdating = True

End Sub

Function SheetExists(SheetName As String) As Boolean

Dim TestWorksheet As Worksheet
SheetExists = False

On Error Resume Next
Set TestWorksheet = Sheets(SheetName)
If Not TestWorksheet Is Nothing Then SheetExists = True
Set TestWorksheet = Nothing
On Error GoTo 0

End Function