Validating Excel Markov models in R

As part of my ongoing campaign to stop using Microsoft Excel for tasks to which it isn’t well suited, I’ve recently started validating all Excel-based Markov implementations using R. Regrettably, Excel is still the preferred “platform” for developing health economic models (hence R being relegated to validation work), but recreating Markov models in R starkly illustrates how much time could be saved if more reimbursement or health technology assessment agencies would consider models written in non-proprietary languages rather than proprietary file formats such as Excel.

As a brief re-cap, strict Markov chains are memoryless, probability-driven processes in which transitions through a defined state space, S, are driven by a transition matrix, briefly as follows:

$$ \Pr(X_{n+1}=x\mid X_1=x_1, X_2=x_2, \ldots, X_n=x_n) = \Pr(X_{n+1}=x\mid X_n=x_n) $$

where $$S = \{X_1 \ldots X_n\} $$

A typical Markov modelling approach in Excel is to start with a transition matrix as a contiguous range at the top of sheet, lay out the initial state distributions immediately below it, and model state transitions using a series of rows containing SUMPRODUCT() formulae, which multiply and sum the state distribution in the previous row with the relevant transition probability row. As an illustrative example, a simple 5-state Markov model running over 120 cycles (say, 10 years with a monthly cycle length) would result in 600 SUMPRODUCT() formulae. Despite the ease with which the cells can be populated with Excel’s AutoFill functionality, making any change to the model (e.g. adding a state) requires the whole range to be updated every time. The use of per-row summation checks with a Machine epsilon (2^-1022) can highlight errors in the formulae, but this requires yet more formulae to update should the model structure change.

Conversely, in R, once the transition matrix and initial states have been defined, implementing a similar Markov model requires just one line of code (with the use of the Matrix exponential package):


tm <- matrix(c(
	0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

print(initial %*% (tm %^% 120))

             [,1]         [,2]         [,3]        [,4]     [,5]
[1,] 3.229246e-06 4.305661e-05 0.0002846521 0.001244035 0.998425

This simply prints out the final state distribution after 120 cycles, but the entire Markov trace can be printed using R’s built-in sapply() and t() functions without a dramatic increase in the complexity of the code:

print(t(sapply(1:120, function(x) initial %*% (tm %^% x))))

             [,1]         [,2]         [,3]        [,4]        [,5]
[1,] 1.000000e+00 0.000000e+00 0.0000000000 0.000000000 0.000000000
[2,] 9.000000e-01 1.000000e-01 0.0000000000 0.000000000 0.000000000

At this point in Excel, the state distributions might subsequently be used to tally up costs and quality-adjusted life expectancy (in quality-adjusted life years or QALYs) for each state. This would require another two identically-sized ranges of cells to capture cost and QALY estimates for each state, trebling the number of formulas to 1,800. In R, adding in and summing up state costs is much more straightforward:

costs <- c(300, 350, 400, 200, 0)
timehorizon <- 120
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
discount <- 0.035

print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x), c(1), sum)) 
print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x), c(1), sum))

[1] 2699.991 3499.853 3998.790 1997.099    0.000
[1] 7.649975 7.999664 7.497731 8.487670 0.000000

To output the top-line health economic outcomes over the entire Markov simulation, the code can then actually be simplified (since we don’t need cycle-by-cycle results) to sum the costs and the quality-adjusted life expectancy:

print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)))
print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)))

[1] 6293.488
[1] 16.01336

And finally, with the introduction of a second transition matrix, a complete, two-arm, five-state Markov model with support for discounting and an arbitrary time horizon can be implemented in 10 lines of R as follows:

tm <- matrix(c(
	0,0,0,0,1), byrow=TRUE, nrow=5)
tm2 <- matrix(c(
	0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

costs <- c(300, 350, 400, 200, 0)
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
timehorizon <- 120
discount <- 0.035

cost <- c(
	sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)),
	sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * costs * 1/(1+discount)^x))
qale <- c(
	sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)),
	sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * utilities * 1/(1+discount)^x))
print( (cost[1] - cost[2])/(qale[1] - qale[2]) )

[1] 395.0946

Having all of this functionality in 10 easily readable lines of code strikes an excellent balance between concision and clarity, and is definitely preferable to 1,800+ formulae in Excel with equivalent functionality (or even to a similar Markov model laid out in TreeAge). The above R code, while still slightly contrived in its simplicity, can be printed on a single sheet of paper and, crucially, can be very easily adapted and modified without needing to worry about filling ranges of cells with formulae including a mix of relative and absolute cell referencing.

While R is not necessarily the very best language for health economic modelling, given how straightforwardly a Markov model (as but one example) can be implemented in the language, it wouldn’t be a bad candidate to be supported by healthcare technology assessment agencies for the purposes of economic evaluation. More importantly, regardless of any specific merits and shortcomings of R, the adoption or acceptance of any similar language would represent an excellent first step away from the current near-ubiquitous use of proprietary modelling technologies or platforms that are, in many cases, ill-suited to the task.

Replicating Excel’s logarithmic curve fitting in R

For a new work project, we’ve just been provided with a Kaplan-Meier curve showing kidney graft survival over 12 months in two groups of patients. As the data is newly generated, we don’t yet have access to the raw data and there are too many patients in the samples to see any discrete steps on the K-M curve, meaning that we can’t extract data on time to individual events. Since we wanted to get a rough idea of how our budget impact analysis (the main focus of the project) might look with these data in place, we traced the data from the K-M curve using WebPlotDigitizer by Ankit Rohatgi.

Looking at the shape of the data we had and knowing that longer-term kidney graft survival (from donors after brain death) typically looks like that in Figure 1, we opted for a simple logarithmic model of the data:

$$S(t) = \beta ln(t) + c$$

Figure 1 Long-term graft survival after first adult kidney-only transplant from donors after brain death, 2000–2012. (NHS Blood and Transplant (NHSBT) Organ Donation and Transplantation Activity Report 2013/14.)
NHSBT Kidney Survival

It’s very easy to generate a logarithmic fit in Excel by selecting the “Add Trendline…” option for a selected series in a chart, selecting the “Logarithmic” option and checking the “Display equation on chart” and “Display R-squared value on chart”. The latter options display the coefficient and intercept values, and the R-squared value (using the Pearson product-moment correlation coefficient), respectively.

However, since we’re ultimately going to be using R for the analysis (when the raw K-M data become available and we use the R survival package to fit a Weibull model, or similar, to the data), I thought it would make sense to also use R to give us our rough logarithmic fit of the data. As one might expect, it’s very straightforward to replicate the results that are produced in Excel using just a few lines of R:

t <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
survival <- c(1, 0.97909, 0.97171, 0.96186, 0.95694, 0.95387, 0.95264, 0.9520, 0.94956, 0.94526, 0.94279, 0.94033, 0.94033)
s_t <- data.frame(t, survival)
log_estimate <- lm(survival~log(t), data=s_t)

Running this gives the following output, which tallies exactly with the intercept, log(t) parameters and R squared values reported in Excel for the same data:

       Min         1Q     Median         3Q        Max 
-0.0033502 -0.0016374  0.0000542  0.0014951  0.0037662 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.996234   0.001636  609.13  < 2e-16 ***
log(t)      -0.022377   0.000868  -25.78 3.46e-11 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.002302 on 11 degrees of freedom
Multiple R-squared: 0.9837,	Adjusted R-squared: 0.9822 
F-statistic: 664.6 on 1 and 11 DF,  p-value: 3.458e-11

And if we overlay the resulting curve over the NHSBT data in Figure 1, we can see that the simple logarithmic fit matches the clinical reality pretty well when projected over a five-year time horizon:

Figure 2 NHSBT long-term kidney graft survival with logarithmic fit to K-M data overlaid in black
NHSBT Kidney Survival with Overlay

We can then use this approximation of the K-M curve to derive a survival function that we’ll use in the model until we have access to the full data set. A nice, quick approach that certainly couldn’t be accused of overfitting.