Bringing Back Moore's Law

Harshad Deo

6 minutes read

The sequence of operations that convert input to output can be visualized as a graph. Graph compression is the family name for a suite of SimianQuant algorithms that rewrite the computational graph to reduce the time and/or memory required.

At a conceptual level, they are similar to Common Subexpression Elimination. However, since this is not your grandfather’s compiler, the compression is semantic, i.e. based on the algebraic properties of the graph. The library supports three levels of graph compression:

  1. Subexpression Elimination
  2. Atomic Function Hybridization
  3. Datastructure Compression

This article explains how each improve runtime performance (with no extra development cost) using examples taken from quantitative finance, specifically the Black Scholes call option pricing formula and the Stochastic Alpha Beta Rho (SABR) volatility model. However knowledge of these models (or indeed of quantitative finance) is not necessary to understand the rest of this article, you could mentally replace SABR or Black with fancy_function and not lose out on the meaning.

In summary, graph compression can improve the runtime performance by an order of magnitude. For example, for realistic problems using algorithmic differentiation in Quantitative Finance, a graph compressed SimianQuant implementation is more than 3x faster than one written using Google’s Ceres library.

Subexpression Elimination

This is similar to the variant performed by most optimizing compilers, but the targets for elimination are chosen by analysing the global algebraic properties of the graph. For example, if a graph contains a $\sqrt x$ and $x ^ {0.75}$, the latter will be rewritten as $\frac{x}{\sqrt {\sqrt{x}}}$ and $\sqrt x$ will be the common subexpression.

As an example, consider the SABR model. For those who don’t know, SABR is a classic way of modeling the volatility smile across asset classes. Oweing to its algebraic complexity, most routines to calibrate SABR use finite differences, or at best algorithmic differentiation. Graph compressed symbolic evaluators are an order of magnitude faster without any additional effort. To illustrate:

// context imports

val (alpha, nu, rho, k, f, t) = (sym("alpha"), sym("nu"), sym("rho"), sym("k"), sym("f"), sym("t"))

val value = sabr(alpha, 0.5, nu, rho, k, f, t) // beta fixed to 0.5
val dv_dalpha = value.diff("alpha")
val dv_dnu = value.diff("nu")
val dv_drho = value.diff("rho")

val dabs1 = DeAbstractable(value, "v") // only evaluates value
val dabs2 = DeAbstractable(dv_dalpha, "dv_da") // only evaluates dv_dalpha
val dabs3 = DeAbstractable(value, "v") // evaluates value and three partial derivatives
  .add(dv_dalpha, "dv_da")
  .add(dv_dnu, "dv_dn")
  .add(dv_drho, "dv_dr") 

Interestingly, it requires less programmer effort to evaluate the function and derivatives together than it takes to evaluate them separately. For implementations generated in Scala, the median runtimes are:

Therefore, the time taken to evaluate the value and all derivatives is only marginally greater than the time taken to evaluate any one of the subcomponents. The only reasonable alternative to symbolic graph compression is algorithmic differentiation. To compare runtimes of graph compressed symbolic differentiation and algorithmic differentiation, the SABR value implementation (dabs1) was materialized using the Jet struct from Ceres. The median runtimes are:

In this case, while algorithmic differentiation will do much to spare the user from the conceptual and numerical ugliness of finite differences, it won’t go far in improving runtime performance.

Atomic Function Hybridization

Atomic functions are those that cannot be written as a combination of simpler functions in a reasonable way, for example $\sqrt x$ and $\sin(x)$. Some, like $\sqrt x$ are implemented in hardware while most are implemented using tricky numerical routines. Atomic function hybridization is a class of algorithms that merge two or more of the latter kind of atomic functions into a hybridized atomic function that simultaneously computes all of them. Classic examples are $\sin(x)$ and $\cos(x)$, $\sinh(x)$, $\cosh(x)$ and $\exp(x)$, and $\Phi(x)$ and $\phi(x)$, the normal distribution and density functions.

Axiomatically, atomic function hybridization requires a reimplementation of atomic functions, that’s why it is not supported for generation over standard or open source targets. To illustrate, consider the problem of evaluating the call option price and its vega, probably the most frequently used computation in all of finance. The code looks like:

// context imports 
val (spot, strike, rd, rf, tau, sigma) = (sym("S"), sym("K"), sym("r_d"), sym("r_f"), sym("tau"), sym("sigma"))

val pips = blackVanillaPipsNumeraire(spot, strike, rd, rf, tau, sigma, true) // true := call
val vega = blackVanillaVega(spot, strike, rd, rf, tau, sigma)

val dabs1 = DeAbstractable(pips, "pips") // only evaluates the pips
val dabs2 = DeAbstractable(pips, "pips") // evaluates the pips and the vega
  .add(vega, "vega")

For imlementations generated in Scala, the median runtimes are:

The standard implementation is much slower than the SimianQuant implementation because it uses the error function provided by Apache Commons Math to evaluate the normal distribution function, while the SimianQuant implementation uses a proprietary implementation of the error function.

The SimianQuant implementation takes basically the same time to evaluate the value and vega together as it does to evaluate the value, while the standard implementation requires an additional $\exp$ evaluation. The difference might seem small in this example, but it is significant for the more realistic models used in practice. To quote Paul Graham,

..there’s a big difference between nothing and almost nothing, when it’s multiplied by the area under the sun.

Datastructure Compression

The final class of graph compression algorithms is applicable to implementations generated over non-primitive types (i.e. not Double, Float etc.) that require a custom datastructure to evaluate (for example complex numbers). Datastructure compression generates a temporary datastructure that can be initialized from the standard variant, and that can be used to initialize a standard variant. This custom implementation is, however, much better suited to evaluate the given computational graph. The temporary datastructure is accessible neither to the generator of the graph nor to code that evaluates the graph. By design, both are unaware of its existence. For commercial reasons, datastructure compression is not supported for generation over standard or open source targets.

The most salient application of datastructure compression is algorithmic differentiation. To illustrate, consider the problem of evaluating the market sensitivity of a portfolio of foreign exchange options in the same currencypair. This portfolio will be sensitive to ~ 100 variables (~ 50 points on the two rate curves, ~ 50 points on the volatility surface, and spot). The only reasonable way of evaluating this sensitivity is to use algorithmic differentiation. To capture the runtime delta of different implementations, consider the simplified problem of evaluating the SABR volatility and its sensitivity to 100 parameters.

For users of the library, switching from a implementation that computes the value to one that evaluates the value and sensitivity is equivalent to changing the context import. Therefore the actual user code to generate a function to perform this sensitivity analysis is identical to dabs1 in the listing in the first section. The median runtimes are:

The performance step on the JVM is limited by its poor support for Single Instruction Multiple Data (SIMD) vectorization. Since C++ provides explicit control over vectorization, the step is much more obvious. What is remarkable is that despite these limitations, the SimianQuant implementations are able to achieve the same performance on the JVM as ceres is able to do in C++.


Engineering is design under constraint

Recent posts

See more