Today we will get familiar with some concepts concerning concurrent processing, and we will try to solve problems as we proceed with the material.
We will start with some simple examples of starting and managing threads using standard C++ tools, and we will try to show some common problems and solutions when dealing with concurrent processing.
To compile programs using `std::thread`, use:
g++ thread1.cpp -pthread
The easiest way to manage threads is to add `#include <thread>` to your source and work with `std::thread` objects. New threads are started by creating new `std::thread` instances and passing a callable object (one that has `operator()` and can be called) to be executed in the new thread. Once the thread is started, it executes until it finishes. The first problem to look at is not to let a still-joinable thread's object go out of scope (i.e. out of the `{}` region): the destructor of a joinable `std::thread` calls `std::terminate` and aborts the program. To prevent this, threads should be joined with `.join()` (C++20 introduces `std::jthread`, which joins automatically on destruction).
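A minimal sketch of starting and joining a thread (thread1.cpp may differ; the function name here is illustrative):

```cpp
#include <iostream>
#include <thread>

void work(int id) {
    std::cout << "hello from thread " << id << "\n";
}

int main() {
    std::thread t(work, 1);   // starts a new thread executing work(1)
    // the main thread keeps running here, concurrently with t
    t.join();                 // wait for t to finish before its object goes out of scope
}
```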
See thread1.cpp: compile and run it, and examine the commented-out synchronization section.
A thread started within a scope is owned by a `std::thread` object that lives only within this scope, so it must be dealt with before the scope ends. This might be a problem in situations where one wishes to use some form of thread-launching helper function. To prevent the thread's execution from being abruptly terminated, we can use `.detach()`, which detaches execution from the thread object.
Note: this should be used with care.
See thread2.cpp
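A sketch of the idea (thread2.cpp may look different; the helper function name is made up):

```cpp
#include <chrono>
#include <iostream>
#include <thread>

void start_worker() {
    std::thread t([] { std::cout << "background work running\n"; });
    t.detach();   // t goes out of scope here, but the thread keeps running
}

int main() {
    start_worker();
    // give the detached thread a chance to run; when main returns,
    // any detached threads still running are terminated with the process
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
```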
Examine thread3.cpp, compile and run it. What is the result of the program? Is it the same every time? Is this a problem? At least the program works, right?
Experiment with the `std::atomic` template. Is it any better now?
Modify the incrementation of `thread_counter`: is there any difference between using `thread_counter = thread_counter + 1` and `thread_counter += 1`?
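A sketch of the distinction, assuming thread3.cpp increments a shared counter from several threads (names are illustrative):

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> thread_counter{0};

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 100000; ++i) {
                thread_counter += 1;                    // one atomic read-modify-write
                // thread_counter = thread_counter + 1; // an atomic load followed by an
                //                                      // atomic store: another thread can
                //                                      // run in between, losing increments
            }
        });
    for (auto& t : threads) t.join();
    std::cout << thread_counter << "\n";   // with += this always prints 400000
}
```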
Threads working concurrently use the same memory space, and as we have seen, the resulting race condition is a problem. The region of code in which our threads might interfere is the critical section, one that needs to be appropriately protected. We will use the `mutex` (mutual exclusion lock) mechanism. See thread4.cpp for an example. Compile and run the code. Then modify thread3.cpp to use a `std::mutex` instead of `std::atomic`.
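A sketch of how the critical section could be protected (same illustrative counter as above):

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int thread_counter = 0;     // plain int again, no longer atomic
std::mutex counter_mutex;   // guards every access to thread_counter

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> lock(counter_mutex); // locks; unlocks at scope exit
                thread_counter += 1;   // the critical section
            }
        });
    for (auto& t : threads) t.join();
    std::cout << thread_counter << "\n";
}
```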
Threading Building Blocks (TBB) is a library designed to make parallelization of already existing or new code relatively easy. The objective of TBB is to provide a template library, very much like the STL, for managing the creation of and the work done by threads.
Examine serial.cpp, compile and run it. It is a very simple program that puts data into an array and terminates. Nothing fancy.
Extend the code so that an average value is computed at the end.
The for loop used to initialize the vector contents is a great example of an embarrassingly parallel problem: there is no data race, no problem with concurrent manipulation, and it should parallelize very easily. For this task, TBB offers a simple `parallel_for`, which with little modification to the code manages starting, stopping, and executing worker threads:
tbb::parallel_for(range, kernel);
For `range` we will use `tbb::blocked_range<T>(T n0, T n1)`, which is a template class describing a one-dimensional iteration space from \(n_0\) to \(n_1-1\). For our problem it is the same as the range of the for loop, (0, y.size()).
`kernel` is a callable object that takes a `tbb::blocked_range<T> r` as an argument and processes the chunk of the problem defined by `r`. In our problem we will use a lambda expression, `[&](tbb::blocked_range<int> r){}`, as sketched below.
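Putting the pieces together, the initialization loop might be parallelized like this (a sketch; tbb_parallel_for.cpp may differ in details, the vector size and filled-in values are illustrative; compile with -ltbb):

```cpp
#include <cmath>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

int main() {
    std::vector<double> y(10000000);

    tbb::parallel_for(
        tbb::blocked_range<int>(0, static_cast<int>(y.size())),
        [&](tbb::blocked_range<int> r) {
            // each worker thread receives its own sub-range [r.begin(), r.end())
            for (int i = r.begin(); i < r.end(); ++i) {
                double x = 0.001 * i;   // local variable inside the kernel
                y[i] = std::sin(x);
            }
        });
}
```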
Note: Our problem is very simple, but a rule of thumb is to put frequently accessed values into local variables of the kernel. This should help the compiler optimize the loop better, since local variables are easier for the compiler to track.
Examine the execution times and prepare a speedup plot using tbb_parallel_for.cpp. Then move the definition of `double x` out of the body of the lambda expression (note: this causes a race condition!) and examine the resulting speedup. Is there any difference?
False sharing is a performance-degrading event that results from threads sharing resources that lie too close to each other:
When a system participant attempts to periodically access data that is not being altered by another party, but that data shares a cache block with data that is being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity.
Examine, compile, and run the tbb_false_sharing.cpp example.
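A sketch of the effect (tbb_false_sharing.cpp may demonstrate it differently): two threads update adjacent array elements that share a cache line, so each write invalidates the other core's cached copy even though the threads never touch each other's data.

```cpp
#include <chrono>
#include <iostream>
#include <thread>

constexpr long N = 100000000;

long counters[2] = {0, 0};                      // adjacent: both in one cache line

struct alignas(64) Padded { long value = 0; };  // padded: one cache line each
Padded padded[2];

template <typename F>
double time_two_threads(F f) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread a(f, 0), b(f, 1);
    a.join(); b.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    double shared = time_two_threads([](int i) { for (long k = 0; k < N; ++k) counters[i]++; });
    double apart  = time_two_threads([](int i) { for (long k = 0; k < N; ++k) padded[i].value++; });
    std::cout << "same cache line: " << shared << " s, padded: " << apart << " s\n";
}
```

With optimization the compiler may hoist the increments into registers, so compile without optimization (or make the data volatile) to see the effect clearly.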
The loop initializing the data was easy to parallelize with `parallel_for`. How about summation over all elements? In this operation the elements of an array are reduced into a single result: the sum. TBB offers a `parallel_reduce` function template to perform reduction operations over a range. The simplest syntax is `parallel_reduce(range, identity_value, func, reduction)` (see the sketch after this list), where:

* `range` defines a range, to whose sub-ranges `func` will be applied.
* `identity_value` is the identity for the operation performed by `func` (0 for summation, 1 for multiplication).
* `func` is a callable object; it could be a lambda expression.
* `reduction` defines how sub-range reductions are joined to produce the final result. This can also be defined as a lambda expression, or we could use standard function objects.
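A sketch of summing (and averaging) a vector with this form of `parallel_reduce` (illustrative, not the provided solution; compile with -ltbb):

```cpp
#include <functional>
#include <iostream>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

int main() {
    std::vector<double> y(10000000, 1.0);

    double sum = tbb::parallel_reduce(
        tbb::blocked_range<int>(0, static_cast<int>(y.size())),
        0.0,                                           // identity value for summation
        [&](tbb::blocked_range<int> r, double local) {
            // func: reduce one sub-range, starting from the partial result so far
            for (int i = r.begin(); i < r.end(); ++i) local += y[i];
            return local;
        },
        std::plus<double>());                          // reduction: join two partial sums

    std::cout << "average = " << sum / y.size() << "\n";
}
```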
Solve the linear advection problem \(\frac{\partial u}{\partial t} + c \frac{\partial u}{\partial x} = 0\) with periodic boundary conditions and an appropriate initial condition. Use advection_serial.cpp as a starting point; there, a first-order one-sided finite difference and an explicit time integration are used. Examine the possible speedup due to parallelization. Then consider an implementation of the implicit method.
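A hedged sketch of parallelizing the explicit update step with `parallel_for` (assuming an upwind-style one-sided difference and a std::vector for u; advection_serial.cpp may use a different stencil and layout):

```cpp
#include <cmath>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

int main() {
    const int    N  = 1000000;
    const double c  = 1.0, dx = 1.0 / N, dt = 0.5 * dx / c;   // CFL-limited time step
    const double pi = std::acos(-1.0);

    std::vector<double> u(N), u_new(N);
    for (int i = 0; i < N; ++i) u[i] = std::sin(2.0 * pi * i * dx);  // illustrative IC

    for (int step = 0; step < 100; ++step) {
        tbb::parallel_for(
            tbb::blocked_range<int>(0, N),
            [&](tbb::blocked_range<int> r) {
                for (int i = r.begin(); i < r.end(); ++i) {
                    int im = (i + N - 1) % N;   // periodic boundary condition
                    u_new[i] = u[i] - c * dt / dx * (u[i] - u[im]);
                }
            });
        std::swap(u, u_new);   // swap outside the parallel region: no race on u
    }
}
```

The explicit update is embarrassingly parallel because every u_new[i] depends only on the previous time level; the implicit method instead couples all unknowns through a linear system, which is why its parallelization needs more thought.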