Parsing external DTDs with Spring 4.x Jaxb2Marshaller

Having recently upgraded a fairly sizeable Spring project to Spring 4.1.7, I found that a class that talks to an external XML API started failing with the following stack trace:

org.springframework.http.converter.HttpMessageNotReadableException: Could not read [class com.richpollock.blog.ExampleClass];
nested exception is org.springframework.oxm.UnmarshallingFailureException: JAXB unmarshalling exception;
nested exception is javax.xml.bind.UnmarshalException
 - with linked exception:
	[org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.]
	at org.springframework.http.converter.xml.MarshallingHttpMessageConverter.readFromSource(MarshallingHttpMessageConverter.java:134)
	at org.springframework.http.converter.xml.AbstractXmlHttpMessageConverter.readInternal(AbstractXmlHttpMessageConverter.java:61)
	...

As with most exceptions thrown by large libraries such as Spring, there’s an underlying exception that’s been thrown, wrapped and rethrown. And, as with most exceptions thrown by JAXB, there are also a lot of linked exceptions, which in this case originate from a SAXParseException thrown by Xerces (a JSR 206-compliant, fully-conforming XML Schema 1.0 processor).

In this instance, the error is thrown by the PrologDispatcher (née PrologDriver), a nested class that forms part of Xerces’ XMLDocumentScannerImpl class. That the exception is being thrown in the XML prolog shows that the failure occurs before the start tag of the XML document is reached. Specifically, the following line in PrologDispatcher is responsible for throwing the exception:

switch(fScannerState){
...
case SCANNER_STATE_DOCTYPE: {
	if (fDisallowDoctype) {
		reportFatalError("DoctypeNotAllowed", null);
	}
...

The difficulty is that the code is buried deep in the inner workings of the XML parser which, in this case, wasn’t even instantiated by us. Indeed, the last code that was under our direct control was a call to the getForObject() method on an autowired RestTemplate instance. Regardless, the fDisallowDoctype check in PrologDispatcher matches the problem reported in the initial stack trace: “DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.”

As the name suggests, the disallow-doctype-decl feature prevents an XML document from being parsed if it contains a document type definition (DTD; specified using the DOCTYPE declaration in the XML). Along with the related FEATURE_SECURE_PROCESSING option, this can prevent both XML eXternal Entity (XXE) attacks, which can expose local file content, and XML Entity Expansion (XEE) attacks, which can result in denial of service. As such, the disallow-doctype-decl feature shouldn’t be disabled without giving due consideration to the security implications.
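To see the feature in isolation, here is a minimal sketch using plain JAXP with no Spring involved. The class name DoctypeFeatureDemo, the helper parseError() and the sample document are all made up for illustration. It enables disallow-doctype-decl on a SAXParserFactory and tries to parse a document containing a DOCTYPE; with the JDK's bundled Xerces parser this should fail with a SAXParseException much like the one in the stack trace above.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DoctypeFeatureDemo {

    public static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    // Hypothetical sample document: any XML with a DOCTYPE declaration will do
    public static final String XML_WITH_DOCTYPE =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE example [ <!ELEMENT example (#PCDATA)> ]>\n"
          + "<example>hello</example>";

    /**
     * Parses the XML with DOCTYPE declarations disallowed and returns the
     * parser's error message, or null if the document was accepted.
     */
    public static String parseError(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(DISALLOW_DOCTYPE, true);
        SAXParser parser = factory.newSAXParser();
        try {
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Parser said: " + parseError(XML_WITH_DOCTYPE));
    }
}
```

The same document with the DOCTYPE line removed is accepted, which is a quick way to confirm that the declaration itself, rather than anything else in the payload, is what trips the parser.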

With that said, a bit of searching around reveals a few options for how the disallow-doctype-decl feature can be configured, but they depend on having direct access to the SAXParserFactory instance, setting and unsetting System properties, setting a JRE-wide jaxp.properties file, or passing command-line flags to the JVM. None of these are particularly desirable (or easily achievable).

So the next step was to identify the calls between the Spring RestTemplate (over which we have direct control in code) and the XML parsing code that’s throwing the exception. Thankfully, in this instance, we have control of the RestTemplate bean configuration in the application context as follows:

<bean id="restTemplate" class="org.springframework.web.client.RestTemplate">
    <constructor-arg ref="httpClientFactory"/>

    <property name="messageConverters"> 
        <list> 
            <bean class="org.springframework.http.converter.xml.MarshallingHttpMessageConverter"> 
                <property name="marshaller" ref="jaxbMarshaller"/> 
                <property name="unmarshaller" ref="jaxbMarshaller"/> 
            </bean> 
        </list> 
    </property>
</bean>

The jaxbMarshaller bean (as referenced above) is also configured in the application context as an instance of Jaxb2Marshaller:

<bean id="jaxbMarshaller" class="org.springframework.oxm.jaxb.Jaxb2Marshaller">

    <property name="classesToBeBound"> 
        <list> 
         ...    
        </list> 
    </property>

</bean>

Having initially tried to set the “http://apache.org/xml/features/disallow-doctype-decl” option to “false” through the setMarshallerProperties() method on Jaxb2Marshaller, I subsequently noticed the supportDtd property (set via setSupportDtd()), which “Indicates whether DTD parsing should be supported.” This resolves the issue and, ultimately, the fix comes down to configuring the Jaxb2Marshaller bean with the following option:

<property name="supportDtd" value="true" />

Note that supportDtd is also set to true when the Jaxb2Marshaller processExternalEntities property is set to true; the difference is that the latter allows both parsing of XML with a DOCTYPE declaration and processing of external entities referenced from the XML document. Jaxb2Marshaller ultimately uses the logical complement of the supportDtd option to set the “http://apache.org/xml/features/disallow-doctype-decl” feature on whichever XMLReader implementation is returned from XMLReaderFactory. By default, that implementation is the class named by the JVM-wide org.xml.sax.driver system property, failing that the class specified in META-INF/services/org.xml.sax.driver or, as a final fallback, com.sun.org.apache.xerces.internal.parsers.SAXParser.
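The relationship between the two options and the parser features can be sketched roughly as follows. This is a simplified illustration written against plain SAX, not Spring's actual source; the class DtdSupportSketch and its method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class DtdSupportSketch {

    /** Creates an XMLReader configured from the two (marshaller-style) options. */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on newer JDKs
    public static XMLReader newReader(boolean supportDtd, boolean processExternalEntities)
            throws SAXException {
        // Processing external entities implies that DTDs must be supported
        boolean effectiveSupportDtd = supportDtd || processExternalEntities;

        XMLReader reader = XMLReaderFactory.createXMLReader();
        // The parser feature is the logical complement of supportDtd
        reader.setFeature("http://apache.org/xml/features/disallow-doctype-decl",
                !effectiveSupportDtd);
        reader.setFeature("http://xml.org/sax/features/external-general-entities",
                processExternalEntities);
        return reader;
    }

    /** Returns true if the reader accepts the document, false if it raises a SAXException. */
    public static boolean parses(XMLReader reader, String xml) throws IOException {
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE example [ <!ELEMENT example (#PCDATA)> ]><example/>";
        System.out.println("supportDtd=false: " + parses(newReader(false, false), xml));
        System.out.println("supportDtd=true:  " + parses(newReader(true, false), xml));
    }
}
```

Setting only supportDtd keeps external entity processing switched off, which is the narrower of the two options and probably the better default when all that is needed is to tolerate a DOCTYPE declaration.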

For security reasons, it would make sense to configure two Jaxb2Marshaller instances, one with DOCTYPE support enabled and one without, using the instance with DOCTYPE support only where it’s absolutely necessary and the XML source is trusted.

Ensuring that spreadsheets created by xlsx4j can be opened by Quick Look and Numbers

Following on from an earlier post on simplifying the addition of text-only cells to an Excel worksheet using xlsx4j, here’s a brief addendum that allows Apple’s Quick Look to preview the spreadsheet and allows it to be opened in Numbers (Apple’s spreadsheet package for OS X and iOS).

In short, Excel’s requirement for a “minimum viable OOXML spreadsheet” seems to be less stringent than Apple’s and, as such, following the steps outlined in the previous post would result in a spreadsheet that could be opened in Excel, but not in Quick Look or Numbers.

Thankfully, the fix is quite straightforward; it transpires that Apple’s software can’t open the spreadsheet if the /xl/styles.xml “part” is missing from the spreadsheet. Even an empty stylesheet is sufficient:

SpreadsheetMLPackage pkg = SpreadsheetMLPackage.createPackage();
WorksheetPart sheet = pkg.createWorksheetPart(new PartName("/xl/worksheets/sheet1.xml"), "Sheet 1", 1);
SheetData sheetData = sheet.getContents().getSheetData();

// Add a minimal stylesheet to the package
Styles styles = new Styles(new PartName("/xl/styles.xml"));
CTStylesheet ss = Context.getsmlObjectFactory().createCTStylesheet();
styles.setJaxbElement(ss);
pkg.getWorkbookPart().addTargetPart(styles);

Row titleRow = Context.getsmlObjectFactory().createRow();
titleRow.setHt(22.0);
titleRow.setCustomHeight(Boolean.TRUE);
Cell headerCell = this.newCellWithInlineString("Sheet 1 Heading");

titleRow.getC().add(headerCell);
sheetData.getRow().add(titleRow);

pkg.save(new File("Example.xlsx"));

See the previous post for the implementation of the newCellWithInlineString method. It’s also worth noting that the getJaxbElement() method on the WorksheetPart class has been deprecated since the last post and replaced with getContents(), as above.

Saving a file to an OS X WebDAV server using Apache VFS and Sardine

I’ve recently been looking into the options for storing a relatively large number of user-uploaded files in a non-hierarchical storage “bucket” with a sufficient degree of abstraction to allow the files to reside either locally or on a server (e.g. WebDAV, SFTP) without any (or many) changes to the code. A virtual file system (VFS) is pretty well suited to this kind of task and, as it happens, there are a number of VFS implementations available for Java including Apache Commons VFS and TrueVFS.

For now, I’ve opted to use Apache Commons VFS with a WebDAV file provider. Overall it’s relatively straightforward to get the various moving parts up and running, but there were a few pitfalls in setting up the WebDAV server, ensuring that Maven is pulling in the right dependencies and copying file content to VFS.

Rationale for using Sardine

From the Commons VFS homepage: “Apache Commons VFS provides a single API for accessing various different file systems. It presents a uniform view of the files from various different sources, such as the files on local disk, on an HTTP server, or inside a Zip archive.” WebDAV is one of the sources supported by Commons VFS, but the default implementation is the same as that used by Apache Jackrabbit. At the time of writing, the Jackrabbit implementation relies on the deprecated Commons HttpClient for its HTTP support, which has long been superseded by Apache HttpComponents. Including a deprecated and unsupported HTTP client library isn’t very appealing from a security standpoint, nor is it desirable in terms of longer-term project maintenance.

Thankfully, Sardine comes to the rescue here. In brief, Sardine is a relatively modern, lightweight WebDAV client focussed on the most common use cases for WebDAV. In combination with Nicolas Delsaux’s excellent commons-vfs-webdav-sardine project, it’s relatively straightforward to use Sardine as a Commons VFS WebDAV provider.

Configuring a WebDAV server compatible with Sardine

One of the most attractive aspects of WebDAV is that it’s extremely easy to configure, especially on machines that are already running Apache. However, even given the relative ease of configuration, there are usually a couple of file permission and directory index issues that need to be resolved.

A basic WebDAV server can be set up by including (or preferably Including) the following in httpd.conf:

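# Module paths as found in the stock OS X Apache install; adjust for other layouts
LoadModule dav_module libexec/apache2/mod_dav.so
LoadModule dav_fs_module libexec/apache2/mod_dav_fs.so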

DavLockDB "/usr/webdav/DavLock"

Alias /uploads "/usr/uploads"

<Directory "/usr/uploads">
    Dav On

    AllowOverride None
    Options Indexes FollowSymLinks
    Order Allow,Deny
    Allow from all

    AuthType Basic
    AuthName WebDAV-Realm

    AuthUserFile "/usr/webdav.passwd"

    <LimitExcept GET OPTIONS>
        require user admin
    </LimitExcept>
</Directory>

The inclusion of the “Options Indexes FollowSymLinks” directive is important as it prevents 403 errors being returned in response to HEAD requests to test for the existence of files on the WebDAV share.
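A quick way to check for the 403 problem independently of VFS and Sardine is to issue a HEAD request by hand. The following is a minimal sketch using only the JDK; the class name HeadCheck and the default URL are assumptions for illustration (the URL simply mirrors the /uploads alias above on localhost).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class HeadCheck {

    /** Sends a HEAD request to the given URL and returns the HTTP status code. */
    public static int headStatus(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the /uploads alias served by a local Apache instance
        String url = args.length > 0 ? args[0] : "http://127.0.0.1/uploads/";
        System.out.println("HEAD " + url + " -> " + headStatus(url));
    }
}
```

A 403 from the share root would point at the directory options or file permissions rather than at the client code.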

The directory containing the DavLockDB file and the WebDAV directory both need to be readable and writable by the current Apache user and group. The following commands should ensure that this is the case (given the above configuration in httpd.conf):

sudo mkdir /usr/webdav
sudo mkdir /usr/uploads
sudo chown www:www /usr/webdav /usr/uploads

With AuthType set to Basic, the only remaining task is to create the “admin” user referenced in the LimitExcept directive and restart Apache:

sudo htpasswd -c /usr/webdav.passwd admin
sudo apachectl graceful

On OS X, the server configuration can then be tested using “Connect to Server…” in the Finder (Cmd-K):

[Image: WebDAV connection]

Authenticating with “admin” and the password set in the htpasswd stage above should mount a readable and writable WebDAV share in the Finder.

Using Sardine with Apache VFS

With a WebDAV server running locally, VFS and Sardine can be used to take a “local” file and upload it to the WebDAV server as follows:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.FileSystemException;
import java.util.UUID;

import org.apache.commons.io.IOUtils;
import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.impl.DefaultFileSystemManager;
import org.apache.commons.vfs2.provider.local.DefaultLocalFileProvider;

// This import causes the Sardine WebdavFileProvider to be used rather than the default VFS provider
import fr.perigee.commonsvfs.webdav.WebdavFileProvider;

public String saveFileToDefaultFileSystem(File file, String baseURL) throws FileSystemException, IOException {
	
	String fileName;
	DefaultFileSystemManager fsManager;

	try {
		fsManager = new DefaultFileSystemManager();
		fsManager.addProvider("webdav", new WebdavFileProvider());
		fsManager.addProvider("file", new DefaultLocalFileProvider());
		fsManager.init();
	} catch (org.apache.commons.vfs2.FileSystemException e) {
		throw new FileSystemException("Exception initializing DefaultFileSystemManager: " + e.getMessage());
	}
	
	UUID uuid = UUID.randomUUID();
	fileName = uuid.toString();
	
	FileObject uploadedFile;
	FileObject destinationFile;
	
	try {
		uploadedFile = fsManager.toFileObject(file);
		destinationFile = fsManager.resolveFile(baseURL + fileName);
		destinationFile.createFile();
	} catch (org.apache.commons.vfs2.FileSystemException e) {
		fsManager.close();
		throw new FileSystemException("Exception resolving file in file store: " + e.getMessage());
	}
	
	try (InputStream in = uploadedFile.getContent().getInputStream();
		 OutputStream out = destinationFile.getContent().getOutputStream()) {
		
		IOUtils.copy(in, out);
	} catch (IOException e) {
		throw new IOException("Exception copying data: " + e.getMessage(), e);
	} finally {
		fsManager.close();
	}
	
	return fileName;
		
}

The above saveFileToDefaultFileSystem() method:

  • Takes a Java File and a baseURL as a String
  • Sets up the VFS DefaultFileSystemManager adding providers for WebDAV and the local filesystem
  • Generates a random filename (UUID) for the bucket store
  • Converts the passed-in File to a VFS FileObject
  • Resolves and creates the (heretofore non-existent) file in the WebDAV file system
  • Uses a try-with-resources block to open an InputStream and an OutputStream from the source and destination FileObjects, respectively
  • Uses the Apache Commons IOUtils to copy the InputStream to the OutputStream and closes everything down
  • Returns the UUID as a String

The only other aspect not covered in the code is the form of the URL passed into the method, which must contain the full URL for the WebDAV server, including credentials as follows:

webdav://username:password@127.0.0.1:80/uploads/
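If the base URL is assembled in code rather than hard-coded, java.net.URI can take care of escaping characters that aren't legal in the credentials. The sketch below is only an illustration: WebdavUrlSketch and baseUrl() are hypothetical names, and it's worth verifying that the WebDAV provider in use decodes percent-encoded credentials as expected before relying on it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.UUID;

public class WebdavUrlSketch {

    /** Builds a base URL of the form webdav://user:password@host:port/path/ */
    public static String baseUrl(String user, String password, String host, int port, String path)
            throws URISyntaxException {
        // The multi-argument URI constructor percent-encodes characters that are
        // not legal in the user-info component (for example '@')
        return new URI("webdav", user + ":" + password, host, port, path, null, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        String base = baseUrl("username", "password", "127.0.0.1", 80, "/uploads/");
        // Mirrors the method above: the destination is the base URL plus a random UUID
        System.out.println(base + UUID.randomUUID());
    }
}
```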

VFS does support use of a UserAuthenticator rather than including credentials in the URL, but that’s probably overkill for getting something up and running quickly. There’s an example of how to use UserAuthenticator on the Commons VFS website, but as a middle ground, the password can be encrypted:

java -cp commons-vfs-2.0.jar org.apache.commons.vfs2.util.EncryptUtil encrypt mypassword

and the output enclosed in curly braces in the URL as follows:

webdav://username:{D7B82198B272F5C93790FEB38A73C7B8}@127.0.0.1:80/uploads/

Maven

Sardine, Apache Commons VFS2 and the Commons VFS Sardine WebDAV file provider will all be added as dependencies when the commons-vfs-webdav-sardine dependency is added to Maven.

Sardine has a dependency on the Apache Commons Codec library. Notably, it makes use of the public Base64(int lineLength) constructor for the Base64 class, which was only introduced in version 1.4. The current project also had a dependency on docx4j, which in turn depends on version 1.3 of the Apache Commons Codec library, resulting in a NoSuchMethodError being thrown when the constructor is used. The solution is to either ensure that the Sardine dependency is declared first in the POM (in Maven versions 2.0.9 and later) or add an explicit dependency on Apache Commons Codec version 1.4 or later to the POM. From the Maven Introduction to the Dependency Mechanism:

“You can always guarantee a version by declaring it explicitly in your project’s POM.”
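To confirm which flavour of the library actually ended up on the runtime classpath, a small reflective check can be run at startup or from a test. This is a rough sketch rather than anything Sardine or Maven provides; CodecCheck and hasPublicConstructor() are hypothetical names, and the check simply reports false if the class isn't on the classpath at all.

```java
public class CodecCheck {

    /**
     * Returns true if the named class can be loaded and declares a public
     * constructor with exactly the given parameter types.
     */
    public static boolean hasPublicConstructor(String className, Class<?>... parameterTypes) {
        try {
            Class.forName(className).getConstructor(parameterTypes);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Base64(int lineLength) only exists in Commons Codec 1.4 and later
        boolean available =
                hasPublicConstructor("org.apache.commons.codec.binary.Base64", int.class);
        System.out.println("Base64(int) constructor available: " + available);
    }
}
```

Running mvn dependency:tree gives the same answer from the build side, showing which version Maven resolved and which dependency pulled it in.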

Conclusion

There’s obviously much, much more to using Commons VFS with Sardine and WebDAV, but the above should get the basic toolchain up and running, providing a good base on which more sophisticated file operations can be built.