Vectorizing Black-Scholes - CPU vs. GPU

April 29, 2019
quantitative finance avx tbb cuda

A data-parallel operation is one in which the same function is applied to many independent inputs. Vectorizing data-parallel operations is an important problem, with applications in market risk, full portfolio evaluation and numerical schemes. This article compares four ways of vectorizing the Black-Scholes formula for a call option:

  1. Using AVX2 instructions
  2. Multi-threading
  3. Using AVX2 instructions and multi-threading
  4. Using GPU computing
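For reference, the scalar evaluation being vectorized can be sketched in C++ as below. This is a minimal illustration of the closed-form call price, not the code generated by the SimianQuant library; the function names are placeholders.

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
static double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Black-Scholes price of a European call option.
// spot: S, strike: K, rate: r (continuously compounded),
// vol: sigma, expiry: T in years.
static double black_scholes_call(double spot, double strike, double rate,
                                 double vol, double expiry) {
    const double std_dev = vol * std::sqrt(expiry);
    const double d1 =
        (std::log(spot / strike) + (rate + 0.5 * vol * vol) * expiry) / std_dev;
    const double d2 = d1 - std_dev;
    return spot * norm_cdf(d1) - strike * std::exp(-rate * expiry) * norm_cdf(d2);
}
```

For example, `black_scholes_call(100, 100, 0.05, 0.2, 1.0)` evaluates to approximately 10.45. The transcendental calls (`log`, `exp`, `erfc`) are what makes this an expensive elementary operation to vectorize.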

The four implementations were generated using the SimianQuant library and therefore require equal programmer effort. They are also significantly faster than typical hand-written implementations, since the library avoids naive techniques such as locks or purely immutable data structures for multi-threading. The trends observed are:

  1. For the purpose of evaluating the Black-Scholes formula, and similar expressions that require computationally intensive atomic operations, a CPU is unambiguously better than a GPU
  2. For fewer than roughly 1000 elements, the additional complexity of multi-threading degrades runtime performance; using AVX2 instructions is the only way to improve it
  3. Multi-threading comes with a significant overhead: doubling the core count does not halve the runtime
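To make the AVX2 variant concrete, the sketch below vectorizes the arithmetic part of the d1 term, four doubles at a time. It is an illustration, not the library-generated code: AVX2 has no transcendental intrinsics, so log(S/K) is assumed precomputed here, and a real implementation needs vectorized log/exp/Φ routines. The code is guarded so it falls back to the scalar loop on non-x86 compilers or CPUs without AVX2.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#endif

// d1 = (log(S/K) + (r + sigma^2/2) * T) / (sigma * sqrt(T))
// log_m[i] holds the precomputed log-moneyness log(S_i / K_i).
static void d1_scalar(const double* log_m, const double* sigma, const double* t,
                      double rate, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const double sd = sigma[i] * std::sqrt(t[i]);
        out[i] = (log_m[i] + (rate + 0.5 * sigma[i] * sigma[i]) * t[i]) / sd;
    }
}

#if defined(__x86_64__) && defined(__GNUC__)
__attribute__((target("avx2")))
static void d1_avx2(const double* log_m, const double* sigma, const double* t,
                    double rate, double* out, std::size_t n) {
    const __m256d vr   = _mm256_set1_pd(rate);
    const __m256d half = _mm256_set1_pd(0.5);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {              // four options per iteration
        const __m256d s  = _mm256_loadu_pd(sigma + i);
        const __m256d tt = _mm256_loadu_pd(t + i);
        const __m256d lm = _mm256_loadu_pd(log_m + i);
        const __m256d var   = _mm256_mul_pd(half, _mm256_mul_pd(s, s));
        const __m256d drift = _mm256_mul_pd(_mm256_add_pd(vr, var), tt);
        const __m256d sd    = _mm256_mul_pd(s, _mm256_sqrt_pd(tt));
        _mm256_storeu_pd(out + i, _mm256_div_pd(_mm256_add_pd(lm, drift), sd));
    }
    d1_scalar(log_m + i, sigma + i, t + i, rate, out + i, n - i); // tail
}
#endif

// Dispatch to the AVX2 kernel when the CPU supports it.
static void compute_d1(const double* log_m, const double* sigma, const double* t,
                       double rate, double* out, std::size_t n) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2")) {
        d1_avx2(log_m, sigma, t, rate, out, n);
        return;
    }
#endif
    d1_scalar(log_m, sigma, t, rate, out, n);
}
```

The 3-4x speedup of the AVX2 column below is consistent with four doubles per 256-bit lane, less the scalar tail and memory traffic.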

Approach

The multi threaded implementations were benchmarked on two AWS instances to compare the benefits of increasing the core count:

  1. A c5.4xlarge instance with 16 cores
  2. A c5.18xlarge instance with 72 cores

The GPU implementations were benchmarked on two AWS instances to compare the benefits of using a newer generation GPU, with a higher clock speed and multiprocessor count:

  1. A p3.2xlarge instance with an Nvidia V100 GPU
  2. A p2.xlarge instance with an Nvidia K80 GPU

The time taken to allocate memory on the CPU and GPU was ignored, because allocation can be done in parallel with converting the business/persisted representations of the option and market data into forms that can be used for pricing.

In Depth

The SimianQuant library uses Intel's Threading Building Blocks (TBB) to implement multi-threading in C++. The benchmarks were run with the default settings. While tuning the parameters could improve performance, the speedup would at best be incremental.
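The data-parallel pattern itself is simple to sketch. The snippet below uses std::thread with fixed chunking for portability rather than TBB's tbb::parallel_for (which adds work stealing and automatic grain-size selection); it also shows why no locks are needed: each thread writes to a disjoint contiguous range.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Apply f to every element of in, writing results to out (pre-sized),
// split across num_threads threads. Each thread owns a contiguous chunk,
// so the threads never touch the same element and no locks are required.
template <typename F>
void parallel_map(const std::vector<double>& in, std::vector<double>& out,
                  F f, unsigned num_threads) {
    const std::size_t n = in.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t lo = std::min<std::size_t>(t * chunk, n);
        const std::size_t hi = std::min<std::size_t>(lo + chunk, n);
        if (lo == hi) break;
        workers.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = f(in[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

The per-call thread creation here is exactly the kind of fixed overhead that makes small inputs slower than the scalar loop; TBB amortizes it with a persistent thread pool, but the dispatch and join costs never disappear entirely.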

Runtimes

The simplest way of comparing the different approaches is to compare the median time taken to evaluate the input vector. To simplify the visualization, the results are reported as the log (base 2) of the median evaluation time in nanoseconds against the log (base 2) of the input vector size.

The corresponding data table is:

| log2(Input Size) | Scalar | AVX2 | Scalar x 16 | Scalar x 72 | AVX2 x 16 | AVX2 x 72 | K80 | V100 |
|---|---|---|---|---|---|---|---|---|
| 2 | 8.31 | 6.54 | 11.10 | 13.20 | 8.13 | 8.08 | NA | NA |
| 3 | 9.33 | 7.47 | 11.89 | 13.84 | 10.13 | 11.93 | NA | NA |
| 4 | 10.32 | 8.43 | 12.35 | 14.20 | 11.19 | 13.14 | NA | NA |
| 5 | 11.32 | 9.43 | 12.76 | 14.30 | 11.96 | 13.83 | NA | NA |
| 6 | 12.32 | 10.41 | 13.17 | 14.31 | 12.36 | 14.29 | NA | NA |
| 7 | 13.32 | 11.41 | 13.49 | 14.43 | 12.76 | 14.57 | NA | NA |
| 8 | 14.32 | 12.40 | 13.82 | 14.65 | 13.18 | 14.63 | NA | NA |
| 9 | 15.34 | 13.40 | 14.16 | 14.84 | 13.55 | 14.61 | NA | NA |
| 10 | 16.33 | 14.40 | 14.59 | 15.06 | 13.85 | 14.68 | 16.23 | 15.47 |
| 11 | 17.33 | 15.40 | 15.01 | 15.23 | 14.23 | 14.88 | 16.37 | 15.52 |
| 12 | 18.33 | 16.40 | 15.58 | 15.47 | 14.65 | 15.08 | 16.76 | 15.85 |
| 13 | 19.33 | 17.40 | 16.28 | 15.75 | 15.20 | 15.30 | 17.00 | 16.26 |
| 14 | 20.33 | 18.40 | 17.08 | 16.17 | 15.83 | 15.58 | 17.53 | 16.86 |
| 15 | 21.33 | 19.43 | 17.91 | 16.65 | 16.59 | 15.92 | 18.32 | 17.63 |
| 16 | 22.33 | 20.40 | 18.80 | 17.25 | 17.32 | 16.34 | 19.15 | 18.52 |
| 17 | 23.33 | 21.40 | 19.77 | 17.98 | 18.19 | 16.91 | 20.09 | 19.46 |
| 18 | 24.34 | 22.40 | 20.71 | 18.81 | 19.04 | 17.60 | 21.04 | 20.42 |
| 19 | 25.34 | 23.41 | 21.70 | 19.70 | 20.02 | 18.41 | 22.02 | 21.39 |
| 20 | 26.34 | 24.41 | 22.69 | 20.63 | 21.00 | 19.46 | 23.01 | 22.37 |

Speedup

An alternative way to interpret the results is to compute the speedup of each implementation relative to the single-threaded scalar case.

The corresponding data table is:

| log2(Input Size) | Scalar | AVX2 | Scalar x 16 | Scalar x 72 | AVX2 x 16 | AVX2 x 72 | K80 | V100 |
|---|---|---|---|---|---|---|---|---|
| 2 | 1.00 | 3.41 | 0.15 | 0.03 | 1.13 | 1.18 | NA | NA |
| 3 | 1.00 | 3.62 | 0.17 | 0.04 | 0.57 | 0.16 | NA | NA |
| 4 | 1.00 | 3.70 | 0.24 | 0.07 | 0.55 | 0.14 | NA | NA |
| 5 | 1.00 | 3.69 | 0.37 | 0.13 | 0.64 | 0.18 | NA | NA |
| 6 | 1.00 | 3.75 | 0.55 | 0.25 | 0.97 | 0.25 | NA | NA |
| 7 | 1.00 | 3.76 | 0.89 | 0.46 | 1.47 | 0.42 | NA | NA |
| 8 | 1.00 | 3.77 | 1.41 | 0.79 | 2.21 | 0.80 | NA | NA |
| 9 | 1.00 | 3.82 | 2.27 | 1.41 | 3.45 | 1.65 | NA | NA |
| 10 | 1.00 | 3.80 | 3.34 | 2.41 | 5.57 | 3.14 | 1.05 | 1.81 |
| 11 | 1.00 | 3.81 | 5.00 | 4.27 | 8.55 | 5.46 | 1.91 | 3.51 |
| 12 | 1.00 | 3.82 | 6.72 | 7.30 | 12.88 | 9.52 | 2.94 | 5.60 |
| 13 | 1.00 | 3.82 | 8.28 | 11.99 | 17.52 | 16.33 | 5.03 | 8.39 |
| 14 | 1.00 | 3.80 | 9.54 | 17.84 | 22.57 | 26.89 | 6.96 | 11.09 |
| 15 | 1.00 | 3.74 | 10.67 | 25.56 | 26.74 | 42.57 | 7.83 | 12.97 |
| 16 | 1.00 | 3.81 | 11.56 | 33.94 | 32.27 | 63.46 | 8.73 | 14.07 |
| 17 | 1.00 | 3.81 | 11.85 | 40.77 | 35.31 | 85.76 | 9.58 | 14.59 |
| 18 | 1.00 | 3.82 | 12.32 | 46.16 | 39.26 | 106.99 | 10.11 | 15.14 |
| 19 | 1.00 | 3.80 | 12.45 | 49.81 | 39.77 | 121.45 | 10.39 | 15.45 |
| 20 | 1.00 | 3.79 | 12.52 | 52.20 | 40.42 | 117.25 | 10.40 | 15.60 |

It is observed that for intermediate input vector lengths, i.e. between 2^10 and 2^16, the performance scales superlinearly with input vector size (less than 2x time for 2x input size). This is a consequence of the fixed overheads associated with multi-threading being amortized over more elements.
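A simple cost model makes this concrete: if the multi-threaded evaluation time is modelled as a fixed dispatch/synchronization cost plus perfectly parallel per-element work, then doubling the input size always less than doubles the time, and the ratio approaches 2 only once the per-element work dominates the overhead. The constants below are illustrative, not measured values from the benchmarks.

```cpp
// Illustrative model: t(n) = overhead + n * per_element / threads.
// The constants are hypothetical, chosen only to show the shape of the curve.
static double modelled_time(double n) {
    const double overhead_ns    = 20000.0; // fixed cost of dispatching threads
    const double per_element_ns = 60.0;    // scalar cost per option
    const double threads        = 16.0;
    return overhead_ns + n * per_element_ns / threads;
}

// How much longer evaluation takes when the input size doubles.
static double doubling_ratio(double n) {
    return modelled_time(2.0 * n) / modelled_time(n);
}
```

Under this model the ratio is well below 2 around 2^10 elements and creeps back toward 2 past 2^20, matching the shape of the measured curves.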


Harshad
