Parsing external DTDs with Spring 4.x Jaxb2Marshaller

Having recently upgraded a fairly sizeable Spring project to Spring 4.1.7, I found that a class that talks to an external XML API had started failing with the following stack trace:

org.springframework.http.converter.HttpMessageNotReadableException: Could not read [class com.richpollock.blog.ExampleClass];
nested exception is org.springframework.oxm.UnmarshallingFailureException: JAXB unmarshalling exception;
nested exception is javax.xml.bind.UnmarshalException
 - with linked exception:
	[org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.]
	at org.springframework.http.converter.xml.MarshallingHttpMessageConverter.readFromSource(MarshallingHttpMessageConverter.java:134)
	at org.springframework.http.converter.xml.AbstractXmlHttpMessageConverter.readInternal(AbstractXmlHttpMessageConverter.java:61)
	...

As with most exceptions thrown by large libraries such as Spring, there’s an underlying exception that’s been thrown, wrapped and rethrown. And, as with most exceptions thrown by JAXB, there are also a lot of linked exceptions, which in this case originate from a SAXParseException thrown by Xerces (a JSR 206-compliant, fully-conforming XML Schema 1.0 processor).

In this instance, the error is thrown by the PrologDispatcher (née PrologDriver), a nested class that forms part of Xerces’ XMLDocumentScannerImpl class. That the exception is being thrown in the XML prolog shows that the failure occurs before the start tag of the XML document is reached. Specifically, the following line in PrologDispatcher is responsible for throwing the exception:

switch(fScannerState){
...
case SCANNER_STATE_DOCTYPE: {
	if (fDisallowDoctype) {
		reportFatalError("DoctypeNotAllowed", null);
	}
...

The difficulty is that this code is buried deep in the inner workings of the XML parser, which, in this case, we didn’t even instantiate ourselves. Indeed, the last code under our direct control was a call to the getForObject() method on an autowired RestTemplate instance. Regardless, the fDisallowDoctype check in PrologDispatcher lines up with the problem reported in the initial stack trace: “DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.”

As the name suggests, the disallow-doctype-decl feature prevents an XML document from being parsed if it contains a document type definition (DTD; specified using the DOCTYPE declaration in the XML). Along with the related FEATURE_SECURE_PROCESSING option, this can prevent both XML eXternal Entity (XXE) attacks, which can expose local file content, and XML Entity Expansion (XEE) attacks, which can result in denial of service. As such, the disallow-doctype-decl feature shouldn’t be disabled without giving due consideration to the security implications.
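To see the feature in isolation, here is a minimal sketch using only the JDK's built-in parser, with no Spring or JAXB involved. The class name, method name and sample document are made up for illustration; only the feature URI comes from the stack trace above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class DoctypeFeatureDemo {
    static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    // Hypothetical sample document: any DOCTYPE declaration triggers the check.
    static final String XML_WITH_DOCTYPE =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>\n"
            + "<note>hi</note>";

    /** Returns true if the document parses, false if the parser rejects it. */
    static boolean parses(String xml, boolean disallowDoctype) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(DISALLOW_DOCTYPE, disallowDoctype);
            factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            // With the feature enabled, the DOCTYPE is reported as a fatal error.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("feature on:  " + parses(XML_WITH_DOCTYPE, true));
        System.out.println("feature off: " + parses(XML_WITH_DOCTYPE, false));
    }
}
```

With the feature switched on, the same document that parses cleanly otherwise is rejected as soon as the scanner reaches the DOCTYPE, which is the behaviour surfacing through Spring in the stack trace.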

With that said, a bit of searching around reveals a few options for how the disallow-doctype-decl feature can be configured, but they depend on having direct access to the SAXParserFactory instance, setting and unsetting System properties, setting a JRE-wide jaxp.properties file, or passing command-line flags to the JVM. None of these are particularly desirable (or easily achievable).

So the next step was to identify the calls between the Spring RestTemplate (over which we have direct control in code) and the XML parsing code that’s throwing the exception. Thankfully, in this instance, we have control of the RestTemplate bean configuration in the application context as follows:

<bean id="restTemplate" class="org.springframework.web.client.RestTemplate">
    <constructor-arg ref="httpClientFactory"/>

    <property name="messageConverters"> 
        <list> 
            <bean class="org.springframework.http.converter.xml.MarshallingHttpMessageConverter"> 
                <property name="marshaller" ref="jaxbMarshaller"/> 
                <property name="unmarshaller" ref="jaxbMarshaller"/> 
            </bean> 
        </list> 
    </property>
</bean>

The jaxbMarshaller bean (as referenced above) is also configured in the application context as an instance of Jaxb2Marshaller:

<bean id="jaxbMarshaller" class="org.springframework.oxm.jaxb.Jaxb2Marshaller">

    <property name="classesToBeBound"> 
        <list> 
         ...    
        </list> 
    </property>

</bean>

Having initially tried to set the “http://apache.org/xml/features/disallow-doctype-decl” feature to “false” through the setMarshallerProperties() method on Jaxb2Marshaller, I subsequently noticed the supportDtd property (set through setSupportDtd()), which, per its documentation, “Indicates whether DTD parsing should be supported.” Setting it resolves the issue, so the fix ultimately comes down to configuring the Jaxb2Marshaller bean with the following property:

<property name="supportDtd" value="true" />

Note that setting the Jaxb2Marshaller processExternalEntities property to true also sets supportDtd to true. The difference is that processExternalEntities both allows XML with a DOCTYPE declaration to be parsed and enables processing of external entities referenced from the XML document. Jaxb2Marshaller ultimately uses the logical complement of supportDtd to set the “http://apache.org/xml/features/disallow-doctype-decl” feature on whichever XMLReader implementation is returned from XMLReaderFactory. By default, that implementation is the class named by the JVM-wide org.xml.sax.driver system property, failing that the class specified in META-INF/services/org.xml.sax.driver or, as a final fallback, com.sun.org.apache.xerces.internal.parsers.SAXParser.
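The mapping from the two marshaller properties to parser features can be sketched in plain JDK code. This is an illustration of the behaviour described above, not Spring's actual source: the class and method names are hypothetical, the reader is obtained from SAXParserFactory rather than XMLReaderFactory, and the use of the standard external-general-entities feature for processExternalEntities is my assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class DtdAwareReaderSketch {
    /** Builds an XMLReader from two flags named after the Jaxb2Marshaller properties. */
    static XMLReader newReader(boolean supportDtd, boolean processExternalEntities) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // The feature is the logical complement of supportDtd.
            reader.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", !supportDtd);
            // Assumption: external entity processing maps onto this standard SAX feature.
            reader.setFeature(
                    "http://xml.org/sax/features/external-general-entities",
                    processExternalEntities);
            return reader;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the reader parses the document without a fatal error. */
    static boolean accepts(XMLReader reader, String xml) {
        DefaultHandler handler = new DefaultHandler(); // fatalError() rethrows
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A reader built with newReader(false, false) rejects any document carrying a DOCTYPE, whereas newReader(true, false) accepts the DOCTYPE while still leaving external general entities unprocessed.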

For security reasons, it would make sense to configure two Jaxb2Marshaller instances, one with DOCTYPE support enabled and one without, using the instance with DOCTYPE support only where it’s absolutely necessary and the XML source is trusted.

Validating Excel Markov models in R

As part of my ongoing campaign to stop using Microsoft Excel for tasks to which it isn’t well suited, I’ve recently started validating all Excel-based Markov implementations using R. Regrettably, Excel is still the preferred “platform” for developing health economic models (hence R being relegated to validation work), but recreating Markov models in R starkly illustrates how much time could be saved if more reimbursement or health technology assessment agencies would consider models written in non-proprietary languages rather than proprietary file formats such as Excel.

As a brief recap, strict Markov chains are memoryless, probability-driven processes in which transitions through a defined state space, S, are governed by a transition matrix. The memoryless (Markov) property can be stated as follows:

$$ \Pr(X_{n+1}=x\mid X_1=x_1, X_2=x_2, \ldots, X_n=x_n) = \Pr(X_{n+1}=x\mid X_n=x_n) $$

where each $$X_i$$ is a random variable taking values in the state space $$S$$, and $$x, x_1, \ldots, x_n \in S$$.
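Applying this property repeatedly gives the state distribution after any number of cycles. Writing $$\pi_0$$ for the initial distribution as a row vector and $$P$$ for the transition matrix (both symbols are introduced here purely for illustration, and $$P$$ is assumed not to change between cycles):

$$ \pi_n = \pi_{n-1} P = \pi_0 P^n $$

This is the calculation that both the Excel layout and the R code below carry out.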

A typical Markov modelling approach in Excel is to start with a transition matrix as a contiguous range at the top of a sheet, lay out the initial state distribution immediately below it, and model state transitions using a series of rows containing SUMPRODUCT() formulae, each of which multiplies the state distribution in the previous row by the relevant set of transition probabilities and sums the result. As an illustrative example, a simple 5-state Markov model running over 120 cycles (say, 10 years with a monthly cycle length) would result in 600 SUMPRODUCT() formulae. Despite the ease with which the cells can be populated with Excel’s AutoFill functionality, making any change to the model (e.g. adding a state) requires the whole range to be updated every time. Per-row summation checks against a small tolerance (on the order of machine epsilon, which is 2^-52 or roughly 2.2 × 10^-16 for double-precision values) can highlight errors in the formulae, but this requires yet more formulae to update should the model structure change.
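Outside of Excel, the per-row summation check amounts to a few lines. The sketch below is only illustrative: the class name, method name and choice of tolerance are mine, and in practice the tolerance needs to be a little looser than machine epsilon to allow for accumulated rounding.

```java
final class RowSumCheck {
    /** Returns true if every row of the matrix sums to 1 within the given tolerance. */
    static boolean rowsSumToOne(double[][] matrix, double tolerance) {
        for (double[] row : matrix) {
            double sum = 0.0;
            for (double p : row) {
                sum += p;
            }
            if (Math.abs(sum - 1.0) > tolerance) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical two-state matrix, used only to exercise the check.
        double[][] valid = {{0.9, 0.1}, {0.0, 1.0}};
        System.out.println(rowsSumToOne(valid, 1e-12));
    }
}
```

The same function works on each row of a Markov trace, since every state distribution should also sum to 1.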

Conversely, in R, once the transition matrix and initial states have been defined, implementing a similar Markov model requires just one line of code (using the %^% matrix power operator from the expm package):

install.packages("expm")
library(expm)

tm <- matrix(c(
	0.9,0.1,0,0,0,
	0,0.9,0.1,0,0,
	0,0,0.9,0.1,0,
	0,0,0,0.9,0.1,
	0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

print(initial %*% (tm %^% 120))

             [,1]         [,2]         [,3]        [,4]     [,5]
[1,] 3.229246e-06 4.305661e-05 0.0002846521 0.001244035 0.998425
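The first two entries are easy to sanity-check by hand. The only way to still be in state 1 after 120 cycles is to have stayed put on every cycle, and the only way to be in state 2 is to have made exactly one move on any of the 120 cycles:

$$ 0.9^{120} \approx 3.229 \times 10^{-6} \qquad 120 \times 0.9^{119} \times 0.1 \approx 4.306 \times 10^{-5} $$

Both agree with the first two columns of the output above.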

This simply prints out the final state distribution after 120 cycles, but the entire Markov trace, from the initial distribution (cycle 0) onwards, can be printed using R’s built-in sapply() and t() functions without a dramatic increase in the complexity of the code:

print(t(sapply(0:120, function(x) initial %*% (tm %^% x))))

             [,1]         [,2]         [,3]        [,4]        [,5]
[1,] 1.000000e+00 0.000000e+00 0.0000000000 0.000000000 0.000000000
[2,] 9.000000e-01 1.000000e-01 0.0000000000 0.000000000 0.000000000
...

At this point in Excel, the state distributions might subsequently be used to tally up costs and quality-adjusted life expectancy (in quality-adjusted life years or QALYs) for each state. This would require another two identically sized ranges of cells to capture cost and QALY estimates for each state, trebling the number of formulae to 1,800. In R, adding in and summing up state costs is much more straightforward:

costs <- c(300, 350, 400, 200, 0)
timehorizon <- 120
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
discount <- 0.035

print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x), c(1), sum)) 
print(apply(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x), c(1), sum))

[1] 2699.991 3499.853 3998.790 1997.099    0.000
[1] 7.649975 7.999664 7.497731 8.487670 0.000000

These per-state figures are undiscounted, i.e. they correspond to running the code with discount <- 0 (for example, state 1 gives 300 × 0.9/(1 − 0.9) ≈ 2,700); with discount <- 0.035 each per-state total is lower.

To output the top-line health economic outcomes over the entire Markov simulation, the code can then actually be simplified (since we don’t need a per-state breakdown) to sum the discounted costs and the discounted quality-adjusted life expectancy across all states and cycles:

print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)))
print(sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)))

[1] 6293.488
[1] 16.01336
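Since the whole point is validation, it is worth having an independent check that does not rely on matrix powers at all. The sketch below is plain Java with hypothetical class and method names; it steps the state distribution forward one cycle at a time (much as the Excel rows do), reusing the transition matrix, costs, utilities, time horizon and discount rate from the R code above.

```java
final class MarkovCrossCheck {
    static final double[][] TM = {
        {0.9, 0.1, 0.0, 0.0, 0.0},
        {0.0, 0.9, 0.1, 0.0, 0.0},
        {0.0, 0.0, 0.9, 0.1, 0.0},
        {0.0, 0.0, 0.0, 0.9, 0.1},
        {0.0, 0.0, 0.0, 0.0, 1.0}
    };
    static final double[] INITIAL = {1, 0, 0, 0, 0};
    static final double[] COSTS = {300, 350, 400, 200, 0};
    static final double[] UTILITIES = {0.85, 0.8, 0.75, 0.85, 0};

    /** Advances the state distribution by one cycle (row vector times matrix). */
    static double[] step(double[] state, double[][] tm) {
        double[] next = new double[state.length];
        for (int from = 0; from < state.length; from++) {
            for (int to = 0; to < next.length; to++) {
                next[to] += state[from] * tm[from][to];
            }
        }
        return next;
    }

    /** Sums occupancy times per-state payoff over cycles 1..horizon, discounted per cycle. */
    static double discountedTotal(double[][] tm, double[] initial, double[] payoff,
                                  int horizon, double discount) {
        double[] state = initial.clone();
        double total = 0.0;
        for (int cycle = 1; cycle <= horizon; cycle++) {
            state = step(state, tm);
            double cycleValue = 0.0;
            for (int s = 0; s < state.length; s++) {
                cycleValue += state[s] * payoff[s];
            }
            total += cycleValue / Math.pow(1 + discount, cycle);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("%.3f%n", discountedTotal(TM, INITIAL, COSTS, 120, 0.035));
        System.out.printf("%.5f%n", discountedTotal(TM, INITIAL, UTILITIES, 120, 0.035));
    }
}
```

The two printed totals should agree with the R figures above to within rounding; a disagreement would point to a problem in one of the implementations.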

And finally, with the introduction of a second transition matrix, a complete, two-arm, five-state Markov model with support for discounting and an arbitrary time horizon can be implemented in 10 lines of R as follows:

tm <- matrix(c(
	0.9,0.1,0,0,0,
	0,0.9,0.1,0,0,
	0,0,0.9,0.1,0,
	0,0,0,0.9,0.1,
	0,0,0,0,1), byrow=TRUE, nrow=5)
	
tm2 <- matrix(c(
	0.91,0.09,0,0,0,
	0,0.91,0.09,0,0,
	0,0,0.91,0.09,0,
	0,0,0,0.91,0.09,
	0,0,0,0,1), byrow=TRUE, nrow=5)

initial <- c(1,0,0,0,0)

costs <- c(300, 350, 400, 200, 0)
utilities <- c(0.85, 0.8, 0.75, 0.85, 0)
timehorizon <- 120
discount <- 0.035

cost <- c(
	sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * costs * 1/(1+discount)^x)),
	sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * costs * 1/(1+discount)^x))
)
qale <- c(
	sum(sapply(1:timehorizon, function(x) initial %*% (tm %^% x) * utilities * 1/(1+discount)^x)),
	sum(sapply(1:timehorizon, function(x) initial %*% (tm2 %^% x) * utilities * 1/(1+discount)^x))
)
print( (cost[1] - cost[2])/(qale[1] - qale[2]) )

[1] 395.0946
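
The printed figure is the incremental cost-effectiveness ratio (ICER) between the two arms. In symbols (the notation is introduced here for illustration and does not appear in the code), with $$\pi_0$$ the initial distribution, $$P_i$$ the transition matrix for arm $$i$$, $$c$$ and $$u$$ the per-state costs and utilities, $$d$$ the per-cycle discount rate and $$T$$ the time horizon:

$$ C_i = \sum_{t=1}^{T} \frac{\pi_0 P_i^{t} \cdot c}{(1+d)^{t}} \qquad E_i = \sum_{t=1}^{T} \frac{\pi_0 P_i^{t} \cdot u}{(1+d)^{t}} $$

$$ \text{ICER} = \frac{C_1 - C_2}{E_1 - E_2} $$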

Having all of this functionality in 10 easily readable lines of code strikes an excellent balance between concision and clarity, and is definitely preferable to 1,800+ formulae in Excel with equivalent functionality (or even to a similar Markov model laid out in TreeAge). The above R code, while still slightly contrived in its simplicity, can be printed on a single sheet of paper and, crucially, can be very easily adapted and modified without needing to worry about filling ranges of cells with formulae including a mix of relative and absolute cell referencing.

While R is not necessarily the very best language for health economic modelling, given how straightforwardly a Markov model (as but one example) can be implemented in the language, it wouldn’t be a bad candidate to be supported by health technology assessment agencies for the purposes of economic evaluation. More importantly, regardless of any specific merits and shortcomings of R, the adoption or acceptance of any similar language would represent an excellent first step away from the current near-ubiquitous use of proprietary modelling technologies or platforms that are, in many cases, ill-suited to the task.

Ensuring that spreadsheets created by xlsx4j can be opened by Quick Look and Numbers

Following on from an earlier post on simplifying the addition of text-only cells to an Excel worksheet using xlsx4j, here’s a brief addendum that allows Apple’s Quick Look to preview the spreadsheet and allows it to be opened in Numbers (Apple’s spreadsheet package for OS X and iOS).

In short, Excel’s requirement for a “minimum viable OOXML spreadsheet” seems to be less stringent than Apple’s and, as such, following the steps outlined in the previous post would result in a spreadsheet that could be opened in Excel, but not in Quick Look or Numbers.

Thankfully, the fix is quite straightforward; it transpires that Apple’s software can’t open the spreadsheet if the /xl/styles.xml “part” is missing from the spreadsheet. Even an empty stylesheet is sufficient:

SpreadsheetMLPackage pkg = SpreadsheetMLPackage.createPackage();
WorksheetPart sheet = pkg.createWorksheetPart(new PartName("/xl/worksheets/sheet1.xml"), "Sheet 1", 1);
SheetData sheetData = sheet.getContents().getSheetData();

// Add a minimal stylesheet to the package
Styles styles = new Styles(new PartName("/xl/styles.xml"));
CTStylesheet ss = Context.getsmlObjectFactory().createCTStylesheet();
styles.setJaxbElement(ss);
pkg.getWorkbookPart().addTargetPart(styles);

Row titleRow = Context.getsmlObjectFactory().createRow();
titleRow.setHt(22.0);
titleRow.setCustomHeight(Boolean.TRUE);
Cell headerCell = this.newCellWithInlineString("Sheet 1 Heading");

titleRow.getC().add(headerCell);
sheetData.getRow().add(titleRow);

pkg.save(new File("Example.xlsx"));

See the previous post for the implementation of the newCellWithInlineString method. It’s also worth noting that the getJaxbElement() method on the WorksheetPart class has been deprecated since the last post and replaced with getContents(), as above.
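
Because an .xlsx file is just a ZIP archive, a quick way to confirm that the stylesheet part actually made it into the saved package is to look for the entry directly. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and nothing here is part of xlsx4j.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipFile;

final class XlsxPartCheck {
    /** Returns true if the archive contains the named part, e.g. "xl/styles.xml". */
    static boolean hasPart(File xlsx, String partName) {
        try (ZipFile zip = new ZipFile(xlsx)) {
            // ZIP entry names carry no leading slash, unlike OOXML part names.
            return zip.getEntry(partName) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = new File(args.length > 0 ? args[0] : "Example.xlsx");
        System.out.println("styles part present: " + hasPart(file, "xl/styles.xml"));
    }
}
```

Run against the Example.xlsx written by the code above, this reports whether the stylesheet part is present, which makes for a convenient regression check if the package-building code changes later.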