As part of my ongoing campaign to stop using Microsoft Excel for tasks to which it isn’t well suited, I’ve recently started validating all Excel-based Markov implementations using R. Regrettably, Excel is still the preferred “platform” for developing health economic models (hence R being relegated to validation work), but recreating Markov models in R starkly illustrates how much time could be saved if more reimbursement or health technology assessment agencies would consider models written in non-proprietary languages rather than proprietary file formats such as Excel.
As a brief re-cap, strict Markov chains are memoryless, probability-driven processes in which transitions through a defined state space, S, are driven by a transition matrix, briefly as follows:
$$ \Pr(X_{n+1}=x\mid X_1=x_1, X_2=x_2, \ldots, X_n=x_n) = \Pr(X_{n+1}=x\mid X_n=x_n) $$
where $$S = \{X_1 \ldots X_n\} $$
A typical Markov modelling approach in Excel is to start with a transition matrix as a contiguous range at the top of sheet, lay out the initial state distributions immediately below it, and model state transitions using a series of rows containing SUMPRODUCT() formulae, which multiply and sum the state distribution in the previous row with the relevant transition probability row. As an illustrative example, a simple 5-state Markov model running over 120 cycles (say, 10 years with a monthly cycle length) would result in 600 SUMPRODUCT() formulae. Despite the ease with which the cells can be populated with Excel’s AutoFill functionality, making any change to the model (e.g. adding a state) requires the whole range to be updated every time. The use of per-row summation checks with a Machine epsilon (2^-1022) can highlight errors in the formulae, but this requires yet more formulae to update should the model structure change.
Conversely, in R, once the transition matrix and initial states have been defined, implementing a similar Markov model requires just one line of code (with the use of the Matrix exponential package):
install.packages("expm") tm <- matrix(c( 0.9,0.1,0,0,0, 0,0.9,0.1,0,0, 0,0,0.9,0.1,0, 0,0,0,0.9,0.1, 0,0,0,0,1), byrow=TRUE, nrow=5) initial <- c(1,0,0,0,0) print(initial %*% (tm %^% 120)) [,1] [,2] [,3] [,4] [,5] [1,] 3.229246e-06 4.305661e-05 0.0002846521 0.001244035 0.998425
This simply prints out the final state distribution after 120 cycles, but the entire Markov trace can be printed using R’s built-in sapply()
and t()
functions without a dramatic increase in the complexity of the code:
print(t(sapply(1:120, function(x) initial %*% (tm %^% x)))) [,1] [,2] [,3] [,4] [,5] [1,] 1.000000e+00 0.000000e+00 0.0000000000 0.000000000 0.000000000 [2,] 9.000000e-01 1.000000e-01 0.0000000000 0.000000000 0.000000000 ...
At this point in Excel, the state distributions might subsequently be used to tally up costs and quality-adjusted life expectancy (in quality-adjusted life years or QALYs) for each state. This would require another two identically-sized ranges of cells to capture cost and QALY estimates for each state, trebling the number of formulas to 1,800. In R, adding in and summing up state costs is much more straightforward:
costs <- c(300, 350, 400, 200, 0) timehorizon <- 120 utilities <- c(0.85, 0.8, 0.75, 0.85, 0) discount <- 0.035 print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x), c(1), sum)) print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x), c(1), sum)) [1] 2699.991 3499.853 3998.790 1997.099 0.000 [1] 7.649975 7.999664 7.497731 8.487670 0.000000
To output the top-line health economic outcomes over the entire Markov simulation, the code can then actually be simplified (since we don’t need cycle-by-cycle results) to sum the costs and the quality-adjusted life expectancy:
print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x))) print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x))) [1] 6293.488 [1] 16.01336
And finally, with the introduction of a second transition matrix, a complete, two-arm, five-state Markov model with support for discounting and an arbitrary time horizon can be implemented in 10 lines of R as follows:
tm <- matrix(c( 0.9,0.1,0,0,0, 0,0.9,0.1,0,0, 0,0,0.9,0.1,0, 0,0,0,0.9,0.1, 0,0,0,0,1), byrow=TRUE, nrow=5) tm2 <- matrix(c( 0.91,0.09,0,0,0, 0,0.91,0.09,0,0, 0,0,0.91,0.09,0, 0,0,0,0.91,0.09, 0,0,0,0,1), byrow=TRUE, nrow=5) initial <- c(1,0,0,0,0) costs <- c(300, 350, 400, 200, 0) utilities <- c(0.85, 0.8, 0.75, 0.85, 0) timehorizon <- 120 discount <- 0.035 cost <- c( sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)), sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * costs * 1/(1+discount)^x)) ) qale <- c( sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)), sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * utilities * 1/(1+discount)^x)) ) print( (cost[1] - cost[2])/(qale[1] - qale[2]) ) [1] 395.0946
Having all of this functionality in 10 easily readable lines of code strikes an excellent balance between concision and clarity, and is definitely preferable to 1,800+ formulae in Excel with equivalent functionality (or even to a similar Markov model laid out in TreeAge). The above R code, while still slightly contrived in its simplicity, can be printed on a single sheet of paper and, crucially, can be very easily adapted and modified without needing to worry about filling ranges of cells with formulae including a mix of relative and absolute cell referencing.
While R is not necessarily the very best language for health economic modelling, given how straightforwardly a Markov model (as but one example) can be implemented in the language, it wouldn’t be a bad candidate to be supported by healthcare technology assessment agencies for the purposes of economic evaluation. More importantly, regardless of any specific merits and shortcomings of R, the adoption or acceptance of any similar language would represent an excellent first step away from the current near-ubiquitous use of proprietary modelling technologies or platforms that are, in many cases, ill-suited to the task.