Vectorizing Black-Scholes - CPU vs. GPU

April 29, 2019
quantitative finance avx tbb cuda

A data-parallel operation is one in which the same function is applied to many independent inputs. Vectorizing data-parallel operations is an important problem, with applications in market risk, full portfolio evaluation and numerical schemes. This article compares four ways of vectorizing the Black-Scholes formula for a call option:

  1. Using AVX2 instructions
  2. Multi-threading
  3. Using AVX2 instructions and multi-threading
  4. Using GPU computing
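For reference, the scalar evaluation being vectorized can be sketched in C++ as below. This is a minimal illustration of the closed-form call price, not the code generated by the SimianQuant library; the function names are placeholders.

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
static double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call option.
// spot: S, strike: K, rate: r (continuously compounded),
// vol: sigma, expiry: T in years.
static double black_scholes_call(double spot, double strike, double rate,
                                 double vol, double expiry) {
    const double std_dev = vol * std::sqrt(expiry);
    const double d1 =
        (std::log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / std_dev;
    const double d2 = d1 - std_dev;
    return spot * norm_cdf(d1) - strike * std::exp(-rate * expiry) * norm_cdf(d2);
}
```

For example, `black_scholes_call(100, 100, 0.05, 0.2, 1.0)` evaluates to approximately 10.45. The transcendental calls (`log`, `exp`, `erfc`) are what makes this an expensive elementary operation to vectorize.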

The four implementations were generated using the SimianQuant library and therefore require equal programmer effort. They are also significantly faster than typical hand-written implementations, since the library avoids naive techniques such as locks or purely immutable data structures for multi-threading. The trends observed are:

  1. For the purpose of evaluating the Black-Scholes formula, and similar expressions that require computationally intensive atomic operations, a CPU is unambiguously better than a GPU
  2. For fewer than roughly 1000 elements, the additional complexity of multi-threading degrades runtime performance; using AVX2 instructions is the only way to improve it
  3. Multi-threading comes with a significant overhead: doubling the core count does not halve the runtime
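To make the AVX2 variant concrete, the sketch below vectorizes the arithmetic part of the d1 term, four doubles at a time. It is an illustration, not the library-generated code: AVX2 has no transcendental intrinsics, so log(S/K) is assumed precomputed here, and a real implementation needs vectorized log/exp/Φ routines. The code is guarded so it falls back to the scalar loop on non-x86 compilers or CPUs without AVX2.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#endif

// d1 = (log(S/K) + (r + sigma^2/2) * T) / (sigma * sqrt(T))
// log_m[i] holds the precomputed log-moneyness log(S_i / K_i).
static void d1_scalar(const double* log_m, const double* sigma, const double* t,
                      double rate, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const double sd = sigma[i] * std::sqrt(t[i]);
        out[i] = (log_m[i] + (rate + 0.5 * sigma[i] * sigma[i]) * t[i]) / sd;
    }
}

#if defined(__x86_64__) && defined(__GNUC__)
__attribute__((target("avx2")))
static void d1_avx2(const double* log_m, const double* sigma, const double* t,
                    double rate, double* out, std::size_t n) {
    const __m256d vr   = _mm256_set1_pd(rate);
    const __m256d half = _mm256_set1_pd(0.5);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {              // four options per iteration
        const __m256d s  = _mm256_loadu_pd(sigma + i);
        const __m256d tt = _mm256_loadu_pd(t + i);
        const __m256d lm = _mm256_loadu_pd(log_m + i);
        const __m256d var   = _mm256_mul_pd(half, _mm256_mul_pd(s, s));
        const __m256d drift = _mm256_mul_pd(_mm256_add_pd(vr, var), tt);
        const __m256d sd    = _mm256_mul_pd(s, _mm256_sqrt_pd(tt));
        _mm256_storeu_pd(out + i, _mm256_div_pd(_mm256_add_pd(lm, drift), sd));
    }
    d1_scalar(log_m + i, sigma + i, t + i, rate, out + i, n - i); // tail
}
#endif

// Dispatch to the AVX2 kernel when the CPU supports it.
static void compute_d1(const double* log_m, const double* sigma, const double* t,
                       double rate, double* out, std::size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        d1_avx2(log_m, sigma, t, rate, out, n);
        return;
    }
#endif
    d1_scalar(log_m, sigma, t, rate, out, n);
}
```

The 3-4x speedup of the AVX2 column below is consistent with four doubles per 256-bit lane, less the scalar tail and memory traffic.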

Approach

The multi threaded implementations were benchmarked on two AWS instances to compare the benefits of increasing the core count:

  1. A c5.4xlarge instance with 16 cores
  2. A c5.18xlarge instance with 72 cores

The GPU implementations were benchmarked on two AWS instances to compare the benefits of using a newer generation GPU, with a higher clock speed and multiprocessor count:

  1. A p3.2xlarge instance with an Nvidia V100 GPU
  2. A p2.xlarge instance with an Nvidia K80 GPU

The time taken to allocate memory on the CPU and GPU was ignored, because allocation can be done in parallel with converting the business/persisted representations of the option and market data into forms that can be used for pricing.

In Depth

The SimianQuant library uses Intel's Threading Building Blocks (TBB) to implement multi-threading in C++. The benchmarks were run with the default settings. While tuning the parameters could improve performance, the speedup would at best be incremental.
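The data-parallel pattern itself is simple to sketch. The snippet below uses std::thread with fixed chunking for portability rather than TBB's tbb::parallel_for (which adds work stealing and automatic grain-size selection); it also shows why no locks are needed: each thread writes to a disjoint contiguous range.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Apply f to every element of in, writing results to out (pre-sized),
// split across num_threads threads. Each thread owns a contiguous chunk,
// so the threads never touch the same element and no locks are required.
template <typename F>
void parallel_map(const std::vector<double>& in, std::vector<double>& out,
                  F f, unsigned num_threads) {
    const std::size_t n = in.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t lo = std::min<std::size_t>(t * chunk, n);
        const std::size_t hi = std::min<std::size_t>(lo + chunk, n);
        if (lo == hi) break;
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = f(in[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

The per-call thread creation here is exactly the kind of fixed overhead that makes small inputs slower than the scalar loop; TBB amortizes it with a persistent thread pool, but the dispatch and join costs never disappear entirely.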

Runtimes

The simplest way of comparing the different approaches is to compare the median time taken to evaluate the input vector. To simplify the visualization, the results are reported as the log (base 2) of the median evaluation time in nanoseconds against the log (base 2) of the input vector size.

The corresponding data table is:

| log2(Input Size) | Scalar | AVX2 | Scalar x 16 | Scalar x 72 | AVX2 x 16 | AVX2 x 72 | K80 | V100 |
|---|---|---|---|---|---|---|---|---|
| 2 | 8.31 | 6.54 | 11.10 | 13.20 | 8.13 | 8.08 | NA | NA |
| 3 | 9.33 | 7.47 | 11.89 | 13.84 | 10.13 | 11.93 | NA | NA |
| 4 | 10.32 | 8.43 | 12.35 | 14.20 | 11.19 | 13.14 | NA | NA |
| 5 | 11.32 | 9.43 | 12.76 | 14.30 | 11.96 | 13.83 | NA | NA |
| 6 | 12.32 | 10.41 | 13.17 | 14.31 | 12.36 | 14.29 | NA | NA |
| 7 | 13.32 | 11.41 | 13.49 | 14.43 | 12.76 | 14.57 | NA | NA |
| 8 | 14.32 | 12.40 | 13.82 | 14.65 | 13.18 | 14.63 | NA | NA |
| 9 | 15.34 | 13.40 | 14.16 | 14.84 | 13.55 | 14.61 | NA | NA |
| 10 | 16.33 | 14.40 | 14.59 | 15.06 | 13.85 | 14.68 | 16.23 | 15.47 |
| 11 | 17.33 | 15.40 | 15.01 | 15.23 | 14.23 | 14.88 | 16.37 | 15.52 |
| 12 | 18.33 | 16.40 | 15.58 | 15.47 | 14.65 | 15.08 | 16.76 | 15.85 |
| 13 | 19.33 | 17.40 | 16.28 | 15.75 | 15.20 | 15.30 | 17.00 | 16.26 |
| 14 | 20.33 | 18.40 | 17.08 | 16.17 | 15.83 | 15.58 | 17.53 | 16.86 |
| 15 | 21.33 | 19.43 | 17.91 | 16.65 | 16.59 | 15.92 | 18.32 | 17.63 |
| 16 | 22.33 | 20.40 | 18.80 | 17.25 | 17.32 | 16.34 | 19.15 | 18.52 |
| 17 | 23.33 | 21.40 | 19.77 | 17.98 | 18.19 | 16.91 | 20.09 | 19.46 |
| 18 | 24.34 | 22.40 | 20.71 | 18.81 | 19.04 | 17.60 | 21.04 | 20.42 |
| 19 | 25.34 | 23.41 | 21.70 | 19.70 | 20.02 | 18.41 | 22.02 | 21.39 |
| 20 | 26.34 | 24.41 | 22.69 | 20.63 | 21.00 | 19.46 | 23.01 | 22.37 |

Speedup

An alternative way to interpret the results is to compute the speedup of each implementation relative to the single-threaded scalar case.

The corresponding data table is:

| log2(Input Size) | Scalar | AVX2 | Scalar x 16 | Scalar x 72 | AVX2 x 16 | AVX2 x 72 | K80 | V100 |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.00 | 3.41 | 0.15 | 0.03 | 1.13 | 1.18 | NA | NA |
| 3 | 1.00 | 3.62 | 0.17 | 0.04 | 0.57 | 0.16 | NA | NA |
| 4 | 1.00 | 3.70 | 0.24 | 0.07 | 0.55 | 0.14 | NA | NA |
| 5 | 1.00 | 3.69 | 0.37 | 0.13 | 0.64 | 0.18 | NA | NA |
| 6 | 1.00 | 3.75 | 0.55 | 0.25 | 0.97 | 0.25 | NA | NA |
| 7 | 1.00 | 3.76 | 0.89 | 0.46 | 1.47 | 0.42 | NA | NA |
| 8 | 1.00 | 3.77 | 1.41 | 0.79 | 2.21 | 0.80 | NA | NA |
| 9 | 1.00 | 3.82 | 2.27 | 1.41 | 3.45 | 1.65 | NA | NA |
| 10 | 1.00 | 3.80 | 3.34 | 2.41 | 5.57 | 3.14 | 1.05 | 1.81 |
| 11 | 1.00 | 3.81 | 5.00 | 4.27 | 8.55 | 5.46 | 1.91 | 3.51 |
| 12 | 1.00 | 3.82 | 6.72 | 7.30 | 12.88 | 9.52 | 2.94 | 5.60 |
| 13 | 1.00 | 3.82 | 8.28 | 11.99 | 17.52 | 16.33 | 5.03 | 8.39 |
| 14 | 1.00 | 3.80 | 9.54 | 17.84 | 22.57 | 26.89 | 6.96 | 11.09 |
| 15 | 1.00 | 3.74 | 10.67 | 25.56 | 26.74 | 42.57 | 7.83 | 12.97 |
| 16 | 1.00 | 3.81 | 11.56 | 33.94 | 32.27 | 63.46 | 8.73 | 14.07 |
| 17 | 1.00 | 3.81 | 11.85 | 40.77 | 35.31 | 85.76 | 9.58 | 14.59 |
| 18 | 1.00 | 3.82 | 12.32 | 46.16 | 39.26 | 106.99 | 10.11 | 15.14 |
| 19 | 1.00 | 3.80 | 12.45 | 49.81 | 39.77 | 121.45 | 10.39 | 15.45 |
| 20 | 1.00 | 3.79 | 12.52 | 52.20 | 40.42 | 117.25 | 10.40 | 15.60 |

It is observed that for intermediate input vector lengths, i.e. between 2^10 and 2^16, the performance scales superlinearly with input vector size (less than 2x time for 2x input size). This is a consequence of the fixed overheads associated with multi-threading being amortized over more elements.
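A simple cost model makes this concrete: if the multi-threaded evaluation time is modelled as a fixed dispatch/synchronization cost plus perfectly parallel per-element work, then doubling the input size always less than doubles the time, and the ratio approaches 2 only once the per-element work dominates the overhead. The constants below are illustrative, not measured values from the benchmarks.

```cpp
// Illustrative model: t(n) = overhead + n * per_element / threads.
// The constants are hypothetical, chosen only to show the shape of the curve.
static double modelled_time(double n) {
    const double overhead_ns    = 20000.0; // fixed cost of dispatching threads
    const double per_element_ns = 60.0;    // scalar cost per option
    const double threads        = 16.0;
    return overhead_ns + n * per_element_ns / threads;
}

// How much longer evaluation takes when the input size doubles.
static double doubling_ratio(double n) {
    return modelled_time(2.0 * n) / modelled_time(n);
}
```

Under this model the ratio is well below 2 around 2^10 elements and creeps back toward 2 past 2^20, matching the shape of the measured curves.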


Harshad
