Explicit, comprehensible covariance matrices in Java

As part of my code readability push in a recent work project (see also here), I’ve just tided up the code responsible for creating, updating, and referring to covariance matrices of patient characteristics and treatment effects. I first searched the web for examples of how covariance matrices had been implemented previously in Java and found the Covariance class in the Apache Commons Mathematics Library which, as it happens, we were already using for its MersenneTwister implementation and a few subclasses of AbstractRealDistribution.

However, looking at the data structures involved, the Covariance class will calculate and return a covariance matrix as an instance implementing the RealMatrix interface (and accepts the raw data as either a two-dimensional array of doubles or as a RealMatrix). Looking at the Covariance source code, the current implementation returns an instance of BlockRealMatrix, which by default is divided into a series of 52 x 52 blocks (so 2,704 double values per block), which are flattened in row major order into single dimensional arrays, which are themselves stored (again in row major order) in an outer array. This is clearly an extremely high performance approach, with three blocks designed to fit into 64Kb of L1 cache (3 × 2,704 × 8 / 1,024 = 63.4Kb), allowing block multiplication to be conducted entirely within the cache (i.e. C[i][j] += A[i][k] × B[k][j]).

This is all well and good, but passing 2D arrays or RealMatrix instances around runs contrary to my current goal of writing readable code. Not only that, but I want to store Pearson product-moment correlation coefficients alongside each covariance parameter so the model can decide later whether or not to covary parameters whose correlation is not significant. So I’d really need two 2D matrices to retrieve both the covariance and the p-value. Given that I’m dealing with relatively small matrices and not doing anything too computationally intensive with them, I decided to write a thin wrapper class around an EnumMap of EnumMaps, in which the Enum for the inner and outer maps consists of elements describing each covaried characteristic and the inner EnumMaps map each Enum element to a small tuple-like nested class that stores the covariance value and p-value in public instance variables.

Since the model uses a few different covariance matrices (including the aforementioned baseline patient characteristics and treatment effects, for example), I wrote a generic class that takes the Enum class in the constructor, such that it can be instantiated as follows:

The constructor then creates the 2D “matrix” of the nested, tuple-like ValuePValuePair classes with all covariance and p-values set to 0.0. This is fairly memory inefficient in that we immediately have n2 instances of ValuePValuePair (where n is the number of enumerable fields), but it helped to get this up and running quickly without hitting any NullPointerExceptions or having to perform null checks. Once the data structure’s set up, values can be added to the matrix as follows (with a couple of static final constants added at the top of the class to improve the readability of the code where it counts):

As a quick aside, it’s worth noting that, since the ability to refer to matrix elements numerically has (apparently) been lost, it initially looks as though a lot of strict matrix-like functionality may have been lost as well. However, since the EnumMap of EnumMap approach relies on the getEnumConstants() method (which always returns Enum elements in the order in which they’re specified) to generate the matrix, the numeric index of each row (and column) can be derived from any given Enum element as follows:

Using this approach, “lossless” conversion from a RealMatrix instance to this data structure and back would be possible (assuming initial knowledge of which rows and columns map to which characteristics), but even without the numeric references, this meets the current requirement, which is simply to use pre-calculated covariance matrices from SAS in Java in a clear and concise manner.

The full EnumMap wrapper class looks like this:

The next step is to perform some profiling on this in situ and probably switch over to instantiating the ValuePValuePair instances lazily with the appropriate null checks added in to the getters. But for now, the class is working very well and achieves the goal of making the clinical modelling portions of the code much more readable.

Returning the first instance of a Java class in a collection of instances of its superclass

In an ongoing work project, we have a Java class that models a patient with type 1 diabetes. Since a lot of the risk models that operate on the simulated patient are affected by various medications (antihypertensives, antithrombotics, lipid modifying medications, etc.), we have an ArrayList on the patient that holds instances of a Medication class (or any of its many subclasses) to represent which of these the patient is currently taking:

This allows us to write things like the fairly English-sounding:

(We actually opted to use the getter for the riskAdjustingMedications ArrayList and the .add() method directly for the very reason that the code is extremely readable when written as above.)

But how can we write something that allows us to establish whether or not the ArrayList already contains an instance of Medication itself or one of its subclasses? For now, we’re using a private method in the Patient class that accomplishes this as generally as possible:

So that’s a generic method with the type parameters T and E (where E extends T) that takes the class of E (in our specific example, Medication.class or any subclass) and a Collection of type T as arguments. If an instance of type Class is found in the Collection, the method returns the first instance, otherwise it returns null.

In the Patient class, we’ve then implemented a public convenience method to support querying for Medication or any of its subclasses:

For maximum readability in our classes that are evaluating complication risk and patient physiology, we can then declare statics for the specific Medication or subclass types we’re using and write code like the following:

Eminently readable.

Adding text cells to an Excel worksheet using xlsx4j

I’ve just started using xlsx4j (a large collection of classes in what is more widely known as the docx4j project) to implement Excel export as part of a large, ongoing StackThread project. As an extremely brief introduction, docx4j uses the Java Architecture for XML Binding (JAXB) to create an in-memory representation of the Microsoft Open XML format that underlies all Microsoft Office documents in the post-2007 era.

Since xlsx4j is effectively a wrapper around the Open XML format, it becomes immediately apparent in using the library that you need to have some familiarity with the file format in order to get anything working. The first snag I hit was the manner in which Excel handles string storage, namely through the use of a shared string table. The shared string table contains every unique string in the workbook only once. Every cell that contains a string actually just holds a reference to the appropriate string in the shared string table. While I did eventually get this working, it took quite a lot of code to do so:

This is, of course, greatly simplified. Notably, as you’ll see on line 46, it’s not keeping track of unique strings. While this wouldn’t be too difficult to implement (a HashMap<String, Integer> could be used to keep count of unique strings and instances), the above code feels very heavyweight for one of the most basic spreadsheet manipulation tasks. Thankfully, there is an easier way.

The easier way

After a little bit of digging around, I discovered that cells also support inline strings. With the addition of a convenience newCellWithInlineString method to generate INLINE_STR cells, the shared string table code can be removed entirely, leaving this:

While the first example could definitely have been distilled down and refactored to keep the workings of the shared string table hidden away, the lack of string-tracking overhead in the second example makes for much easier-to-read code in my opinion. It’s also much faster to get to the point of functional Excel export without having the hassle of setting up the shared string table. I haven’t yet investigated the performance implications of using INLINE_STR when opening the workbook in Excel, but from a quick look at the XML after opening and saving an Excel workbook created using the INLINE_STR method, it looks as though Excel generates its own shared string table when the workbook’s first saved.