Should You Multithread? An Experiment-Driven Approach

Implement matrix multiplication and measure the overhead of pthreads

Kyrö | Selébou
Better Programming



As an undergraduate, I always wanted to write a multi-threaded program, but I switched from computer science into mathematics before getting to do so. This summer, I have been revisiting some old projects I wanted to do, and I finally got around to multi-threading.

I always assumed that a multi-threaded program is superior to a single-threaded one, but a moment’s reflection shows that this cannot always be the case. There must be some computational overhead to spawning and managing threads; therefore, a simple task like adding two numbers must take longer when multi-threaded.

In this article, we will implement a matrix multiplication routine and try to answer the question: should we even be multithreading? In the process, we examine how much time our code takes to execute and use some simple math to build a ‘smart’ matrix multiplier which will decide whether to run the matrix multiply routine in parallel.

I assume that the reader is familiar with programming in C, has a foundation in linear algebra equivalent to the first few weeks of a standard undergraduate course, has a computer capable of running these code samples, and has sufficient knowledge of their compiler to compile the linked code for themselves.

Implementing Matrix Multiplication

Before we write a multi-threaded matrix multiply routine, we should have a simple matrix library that can run on a single thread.

We implement matrices as a simple struct that keeps track of the number of rows, the number of columns, and the entries themselves. We store the entries as the double type since, on most machines, this is enough precision for standard computational tasks involving matrices.

I prefer using a double* to store the data instead of a double**. The latter allows us to access data in the familiar data[i][j] notation for the i-th row and j-th column. However, a double** takes up more space in memory and, more importantly, does not generalize to working with tensors.
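A minimal sketch of such a struct might look like this (the field name n_rows appears later in the article; n_cols and the rest of the layout are my assumptions):

#include <stddef.h>

typedef struct {
    size_t n_rows;   /* number of rows */
    size_t n_cols;   /* number of columns */
    double *data;    /* entries in a flat array of length n_rows * n_cols */
} matrix;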

Next, we need ways to create and destroy our new matrices. The error handling in the code below is paltry at best, but should be sufficient for our purposes here.
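A plausible minimal version follows; the function names matrix_new and matrix_free are my own:

#include <stdlib.h>

matrix *matrix_new(size_t n_rows, size_t n_cols) {
    matrix *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    m->n_rows = n_rows;
    m->n_cols = n_cols;
    m->data = calloc(n_rows * n_cols, sizeof *m->data);  /* zero-initialized */
    if (m->data == NULL) { free(m); return NULL; }
    return m;
}

void matrix_free(matrix *m) {
    if (m == NULL) return;
    free(m->data);
    free(m);
}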

We cannot effectively work with matrices without being able to get and set data. Since we are using a linear array to store the data, we must implement our own indexer and store the data in row-major (row-first) order. Here’s what that looks like:
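A sketch, assuming the struct above (the accessor names matrix_get and matrix_set are hypothetical):

/* Row-major layout: entry (i, j) lives at offset i * n_cols + j. */
double matrix_get(const matrix *m, size_t i, size_t j) {
    return m->data[i * m->n_cols + j];
}

void matrix_set(matrix *m, size_t i, size_t j, double value) {
    m->data[i * m->n_cols + j] = value;
}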

We finally get to matrix multiplication. Recall that given two matrices A = [a_{ij}] and B = [b_{ij}], the matrix product AB is defined when the number of columns of A is equal to the number of rows of B, and in this case, the (i, j) entry of the product is given by c_{ij} = \sum_{k=1}^N a_{ik}b_{kj}, where N is the number of columns of A.

Implementing matrix multiplication naively in code is O(N³): we must iterate over all rows and columns of the result ( O(N²) ), and at each entry (i, j) of the product, we must compute the dot product of the i-th row of A and the j-th column of B ( O(N) ).

There is an algorithm known as Strassen’s algorithm, which can drop this down to roughly O(N^{2.807}), but it is rarely practical to implement. Our single-threaded matrix multiplication function factors out the inner dot product, which will make the transition to a multi-threaded implementation smoother:
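A sketch of what this might look like; matmul_subroutine is the article’s name, while matmul and the exact signatures are my assumptions:

/* Compute one entry of the product: the dot product of row i of A
   with column j of B, written into C at (i, j). */
void matmul_subroutine(const matrix *A, const matrix *B, matrix *C,
                       size_t i, size_t j) {
    double sum = 0.0;
    for (size_t k = 0; k < A->n_cols; k++)
        sum += matrix_get(A, i, k) * matrix_get(B, k, j);
    matrix_set(C, i, j, sum);
}

/* Single-threaded multiply: one dot product per entry of the result. */
matrix *matmul(const matrix *A, const matrix *B) {
    if (A->n_cols != B->n_rows) return NULL;  /* dimensions must agree */
    matrix *C = matrix_new(A->n_rows, B->n_cols);
    if (C == NULL) return NULL;
    for (size_t i = 0; i < C->n_rows; i++)
        for (size_t j = 0; j < C->n_cols; j++)
            matmul_subroutine(A, B, C, i, j);
    return C;
}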

For testing, it is nice to have a ‘pretty-print’ routine for our matrices. We add this final piece:
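Something like the following, with matrix_print being a name of my choosing:

#include <stdio.h>

void matrix_print(const matrix *m) {
    for (size_t i = 0; i < m->n_rows; i++) {
        for (size_t j = 0; j < m->n_cols; j++)
            printf("%10.4f", matrix_get(m, i, j));
        printf("\n");
    }
}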

We now have a working matrix multiplication program. We can play around with the new code by doing some standard test matrix multiplications:
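For instance, a quick 2x2 sanity check along these lines (the values are illustrative):

int main(void) {
    matrix *A = matrix_new(2, 2);
    matrix *B = matrix_new(2, 2);
    /* A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]] */
    for (size_t i = 0; i < 2; i++)
        for (size_t j = 0; j < 2; j++) {
            matrix_set(A, i, j, (double)(2 * i + j + 1));
            matrix_set(B, i, j, (double)(2 * i + j + 5));
        }
    matrix *C = matmul(A, B);
    matrix_print(C);  /* expected: [[19, 22], [43, 50]] */
    matrix_free(A); matrix_free(B); matrix_free(C);
    return 0;
}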

Naïve Multithreading: Lots of Threads

The first thing to try is to run a thread for each of the N² dot products. This may be close to the number of cores for small matrices, but for larger matrices, we will quickly be spawning thousands of threads.

We create a new struct called thread_params which allows us to pass the necessary arguments to our wrapper code for matmul_subroutine:
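The struct name comes from the article; the exact fields are my guess at what the wrapper needs:

/* Everything one worker thread needs in order to compute entry (i, j). */
typedef struct {
    const matrix *A;
    const matrix *B;
    matrix *C;
    size_t i;
    size_t j;
} thread_params;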

The pthread implementation lets us spawn a thread that runs a function with a void* (*)(void*) signature. The void* argument allows us to pass a parameter struct to the function: we pass sub-arguments by casting the void pointer to a thread_params pointer.

We replace the call to matmul_subroutine with the following threaded wrapper code which wraps our matmul_subroutine from earlier in the required void* (*)(void*) call signature:
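A minimal sketch of that wrapper, assuming the thread_params struct above:

void *matmul_subroutine_thread(void *arg) {
    thread_params *p = (thread_params *)arg;  /* recover the typed arguments */
    matmul_subroutine(p->A, p->B, p->C, p->i, p->j);
    return NULL;
}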

We must do some pthread_t management to pass the parameters to matmul_subroutine_thread, but otherwise, the matrix multiplication implementation is the same:
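A sketch of the naive parallel multiply (matmul_parallel and the array-based bookkeeping are my assumptions; compile with -pthread):

#include <pthread.h>

/* Deliberately naive: one thread per entry of the product. */
matrix *matmul_parallel(const matrix *A, const matrix *B) {
    if (A->n_cols != B->n_rows) return NULL;
    matrix *C = matrix_new(A->n_rows, B->n_cols);
    size_t n = A->n_rows * B->n_cols;
    pthread_t *threads = malloc(n * sizeof *threads);
    thread_params *params = malloc(n * sizeof *params);
    size_t t = 0;
    for (size_t i = 0; i < A->n_rows; i++)
        for (size_t j = 0; j < B->n_cols; j++, t++) {
            params[t] = (thread_params){ A, B, C, i, j };
            pthread_create(&threads[t], NULL,
                           matmul_subroutine_thread, &params[t]);
        }
    for (t = 0; t < n; t++)  /* wait for every worker before returning */
        pthread_join(threads[t], NULL);
    free(threads);
    free(params);
    return C;
}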

In the code above, we create a new thread for each of the entries in the product matrix and compute the row-column dot product in each of these threads. Finally, we wait for all the threads to return before returning a pointer to the product.

We run the following in our main loop:
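Something along these lines, assuming a small wall-clock helper built on clock_gettime (the matrix size and fill values are illustrative; the originals are not shown):

#include <time.h>

/* Wall-clock seconds; CLOCK_MONOTONIC is unaffected by clock adjustments. */
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

int main(void) {
    matrix *A = matrix_new(10, 10);
    matrix *B = matrix_new(10, 10);
    for (size_t i = 0; i < 10; i++)
        for (size_t j = 0; j < 10; j++) {
            matrix_set(A, i, j, (double)(i + j));
            matrix_set(B, i, j, (double)(i * j));
        }

    double start = wall_seconds();
    matrix *C1 = matmul(A, B);
    printf("Linear took   %.8f seconds\n", wall_seconds() - start);

    start = wall_seconds();
    matrix *C2 = matmul_parallel(A, B);
    printf("Parallel took %.8f seconds\n", wall_seconds() - start);

    matrix_free(A); matrix_free(B); matrix_free(C1); matrix_free(C2);
    return 0;
}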

The full version of the above code can be found here, and on my machine, I get the following frustrating result:

Linear took     0.00017200 seconds
Parallel took 0.00313400 seconds

The parallel multiply took about 18 times longer to execute than the linear one! Bad news: multi-threading is not a panacea.

Overhead from pthread Management

I want to improve my routine. I know there is an advantage, somewhere, in using all of the cores available to me; we need to figure out where that advantage lies.

We can quickly guess that either pthread_create or pthread_join is causing the parallel multiply to run much slower: the pointer assignments and array indexing shouldn’t take much time (certainly not enough to account for an 18x slowdown), and the calls to pthread_create and pthread_join are the only material differences from the linear routine.

I discovered rather quickly that the calls to pthread_create are what take the most time. The following code quantifies just how slow these calls are:
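A sketch of such a measurement, reusing the wall_seconds helper from above (this times a create/join pair around a worker that does nothing, so essentially all of the elapsed wall time is thread-management overhead):

/* A worker that does no work: any time spent is pure thread overhead. */
void *noop(void *arg) {
    (void)arg;
    return NULL;
}

int main(void) {
    enum { N_THREADS = 10000 };
    pthread_t thread;
    double start = wall_seconds();
    for (int i = 0; i < N_THREADS; i++) {
        pthread_create(&thread, NULL, noop, NULL);
        pthread_join(thread, NULL);
    }
    double elapsed = wall_seconds() - start;
    printf("Average amortized overhead: %.5f seconds\n", elapsed / N_THREADS);
    printf("Total elapsed overhead: %.5f seconds\n", elapsed);
    return 0;
}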

We measure time using clock_gettime to get the wall time instead of the CPU time: when measuring the performance of a multi-threaded program, the CPU time will always be higher than that of the linear version of the same program, because CPU time measures how many operations were performed, and the thread management functions add operations. Remember that the end goal of multi-threading is to reduce the wall time.

When run, the above program gives me

Average amortized overhead:    0.00010 seconds 
Total elapsed overhead: 0.97482 seconds

For every thread we create, we incur an extra ten-thousandth of a second (0.0001 seconds). Already at a matrix with 10,000 entries (i.e., a product of shape (100, 100)), we add an extra second over the linear implementation (this isn’t quite right; we will address it in a moment).

Multiplying two (1000, 1000) matrices increases the extra overhead by a factor of 100: our program will spend about a minute and a half on just creating the threads! This leads us to guess that we want to minimize the number of threads we create. We address this by finding how large our matrices must be before multi-threading makes sense.

Optimal Threading Condition

Let S be A->n_rows * B->n_cols (the number of entries in the product), T be the amount of time that it takes matmul_subroutine to execute, C be the number of cores, and h be the amount of overhead time that using pthreads incurs per thread. Then (theoretically), the amount of time the linear implementation takes is ST, while the amount of time that the parallel implementation takes is ST/C + Sh. Setting these equal to each other and solving for T, we find that the parallel implementation should outperform the linear one whenever T > hC/(C-1).
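Written out, with the common factor of S dividing through:

ST = ST/C + Sh
T = T/C + h
T(1 - 1/C) = h
T = hC/(C - 1)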

On my machine, I have four cores, so theoretically, the parallel implementation should start outperforming the linear implementation when the time it takes to execute matmul_subroutine exceeds 4h/3 (about 0.000133 seconds).

Note that S divides out from the computation; i.e., the size of the product array is not important in the computation, and the only thing that matters (from the perspective of the parallel implementation outperforming the serial one) is how big the inner dot-product is. As we increase this inner dimension, T increases (linearly).

We have already (approximately) computed h. On my machine, I ran the following code to compute my T (again, approximately):
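The original listing is linked rather than shown; the following sketch reproduces the output format below by timing a raw dot product of length n (timing the bare loop rather than full matrices is my assumption):

int main(void) {
    for (size_t n = 100; n <= 100000000; n *= 10) {
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        for (size_t k = 0; k < n; k++) { a[k] = 1.0; b[k] = 2.0; }

        double start = wall_seconds();
        volatile double sum = 0.0;  /* volatile keeps the loop from being optimized away */
        for (size_t k = 0; k < n; k++)
            sum += a[k] * b[k];
        printf("%zu took: %.7f\n", n, wall_seconds() - start);

        free(a); free(b);
    }
    return 0;
}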

Which outputs:

$ ./ht.o 
100 took: 0.0000010
1000 took: 0.0000040
10000 took: 0.0000350
100000 took: 0.0004900
1000000 took: 0.0038400
10000000 took: 0.0313800
100000000 took: 0.3031080

Note the obvious O(N) behavior here, a consequence of the matmul subroutine running in linear time. Recall from above that the overhead for creating the pthreads was 0.00010 seconds, and therefore (approximately) we should run the routine in parallel when T(n) > 4/3 * 0.00010 ≈ 0.000133 seconds. Interpolating from the timings above, this comes out to approximately n = 27000.

A ‘Smarter’ Matrix Multiply

We can now write a ‘smarter’ version of our matrix multiply routine that considers the host machine’s specifics to obtain optimal performance. For matrix multiplication with “small” inner dimensions, we should use the serial implementation, and for larger inner dimensions, we should switch to the parallel implementation:
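A minimal sketch, hard-coding the cutoff measured above (matmul_smart and PARALLEL_CUTOFF are names of my choosing; a more portable version would measure h and T on the host itself):

/* Inner dimension above which threading pays for its own overhead.
   Measured on my machine; tune this per host. */
#define PARALLEL_CUTOFF 27000

matrix *matmul_smart(const matrix *A, const matrix *B) {
    /* Only the inner dimension matters here: S divided out of the analysis. */
    if (A->n_cols < PARALLEL_CUTOFF)
        return matmul(A, B);          /* small dot products: overhead dominates */
    return matmul_parallel(A, B);     /* long dot products: threads win */
}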

Benchmarking

Perhaps our theoretical analysis missed the mark, and we won’t see any performance gains from our smart matrix multiply. To test this, we run our multiply routine on 42 sets of matrices with increasing inner dimensions, repeating each multiplication 30 times to get an average execution time. Our goal is to show that the slope of the time-versus-size curve decreases once we switch from the serial to the multi-threaded implementation:
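A sketch of the benchmark driver (the step size between inner dimensions and the outer dimensions are assumptions; calloc zero-fills the matrices, which is fine for timing since the arithmetic cost does not depend on the values):

enum { N_SIZES = 42, N_REPS = 30 };

int main(void) {
    for (int s = 1; s <= N_SIZES; s++) {
        size_t inner = (size_t)s * 1000;   /* increasing inner dimension */
        matrix *A = matrix_new(4, inner);
        matrix *B = matrix_new(inner, 4);

        double start = wall_seconds();
        for (int r = 0; r < N_REPS; r++) {
            matrix *C = matmul_smart(A, B);
            matrix_free(C);
        }
        /* Emit CSV rows: inner dimension, average seconds per multiply. */
        printf("%zu, %.8f\n", inner, (wall_seconds() - start) / N_REPS);

        matrix_free(A);
        matrix_free(B);
    }
    return 0;
}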

We can visualize the improvement using a best-fit line. The plotted results below show a clear improvement in performance as compared to the extrapolated non-multi-threaded implementation (blue):

We also note that the approximate value of 27,000 is too high, and with a little more work, we could find a better cutoff value. This is likely the result of accumulated error in the simplistic computation we did above.

The plot above was generated with the following Python script:
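The original script is linked rather than shown; a hypothetical recreation with matplotlib might look like the following (the benchmark.csv filename and column layout are assumptions matching the CSV rows emitted above):

import numpy as np
import matplotlib.pyplot as plt

# Rows of "inner_dimension, average_seconds" from the C benchmark.
data = np.loadtxt("benchmark.csv", delimiter=",")
inner, seconds = data[:, 0], data[:, 1]

# Fit a line to the serial region and extrapolate it across the whole range.
serial = inner < 27000
slope, intercept = np.polyfit(inner[serial], seconds[serial], 1)

plt.scatter(inner, seconds, color="orange", label="smart multiply (measured)")
plt.plot(inner, slope * inner + intercept, "b--",
         label="serial implementation (extrapolated)")
plt.xlabel("inner dimension")
plt.ylabel("average seconds per multiply")
plt.legend()
plt.show()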

Takeaway

Multi-threading is not always the right thing to do. There is additional overhead inherent in using multi-threading, and depending on the application, one should find the critical point at which multi-threading a given routine gives a performance gain.
