Today we will get familiar with some concepts concerning concurent processing. We will try to solve the problems as we proceed with the material.
We will start with some simple examples of starting and managing threads using standard C++ tolls. We will try to show some common problems and solutions whan dealing with concurent processing.
To compile programs using std::thread
use
g++ thread1.cpp -pthread
The easiest way to manage threads is to add to your source
#include <thread>
and make do wiht
std::thread
objects. New threads asre started by creating
new std::thread
instances, and passing a collable (one that
has () and can be called) object to be executed in the new thread. Once
the thread is started it executes, until it finishes. The first problem
to look at is not to leave alive threads for control to go out of scope
(i.e. to go out of the {} region), since this will cause running threads
to trtminate and is in general an undefined behavior. To prevent this,
threads should be joined with .join()
(C++20 introduces a
jthread
that supports auto-joining).
See thread1.cpp, compile and run it, examine the commented out synchronization section.
Threads started within a scope live only within this scope, and will
be terminated when it finishes. This might be a problem for sytuations
one wishes to use some form of thread caller functions. To prevent the
thread execution from being abruptly terminated we can use
detach
, which detaches execution from the thread
object.
Note: Should be used with care.
See thread2.cpp
Examine thread3.cpp, compile and run. What is the result of the program? Is it the same every time? Is this a problem? At least the program works, right?
Experiment with the atomic
template. Is it any better
now?
Modify incrementation of the thread_counter
, is there any
difference in using thread_counter = thread_counter + 1
or
thread_counter += 1
?
Threads working concurrently use the same memory space, and as we
have seen the consequent race condition is a problem. The region of the
code our threads might interfere is the citical
section, one that needs to be appropriatly protected. We will
use the mutex
(Mutually Exclusive Lock) mechanism. See
thred4.cpp for an example. Compile and run the code. Than, modify
thread3.cpp to use mutex
instead of
atomic
.
Thread Building Blocks is a library designed to make parallelization of the already existing, or new code (relativly easy). The objective of TBB is to provide a template library, very much like STL for managing creation of and working with threads.
Examine serial.cpp, compile and run it. It is a very simple program
that puts data in to an array and terminates. Nothing fancy.
Extend the code, so an average value is evaluated at the end.
The for loop used to initialize the vector content is a great example
of an embarrassingly parallel problem. There should be no data race, and
no problem with concurent manipulation, and it should parallelize very
eaisly. For this task, TBB ofers a simple parallel_for
,
which with little modification to the code manages starting, stoping and
executing working threads.
tbb::parallel_for(range, kernel);
For range
we will use
tbb::blocked_range<T>(T n0, T n1)
, which is a
template class describing a one-dimensional iteration space from \(n_0\) to \(n_1-1\). For our problem it is the same as
the the range of the for (0,y.size())
.
kernel
is a collable object that takes
tbb::blocked_range<T> r
as an argument and processes
the chunk of the problem defined by r
. In our problem we
will use lambda expression
[&](tbb::blocked_range<int> r){}
.
Note: Our problem is very simple, but a rule of thumb is to put
frequently accessed values into local variables of the
kernel
. This should help compiler to optimize the loop
better. It seems local variables are easier for the compiler to
track.
Examine the execution times and prepere a speedup plot using
tbb_parallel_for.cpp
. Than move the definition of
double x
out of the body of the lambda expresion (Note,
this causes the race condition!) and examine resulting speedup, is there
any difference?
is a performance degrading event that results from threads sharing resorurces that lie to close to each other.
When a system participant attempts to periodically access data that is not being altered by another party, but that data shares a cache block with data that is being altered, the caching protocol may force the first participant to reload the whole cache block despite a lack of logical necessity.
Examine, compile and run tbb_false_sharing.cpp
example.
The loop initializing the data was easy to parallelize with
parallel_for
. How about summation over all elements? In
this operation elements of an array are reduced into a single
result - the sum. TBB offers a parallel_reduce
function
template to perform reduction operations over the range. The simplest
syntax is
parallel_reduce(range, identity_value, func, reduction)
,
where: * range
defines a a range, to which sub-ranges
func
will be applied. * identity_value
is
identity for the operation performed by func
(0 for
summation and 1 for multiplication). * func
is a collable
object. Could be a lambda expression. * reduction
defines
how sub-range reductions are joined to produce the final result. This
can also be defined as a lambda expression, or we could use standard
function objects.
Solve linear advection problem \(\frac{\partial u}{\partial t} + c \frac{\partial
u}{\partial x} = 0\), with periodic boundary conditions and an
appropariate initial condition. Use advection_serial.cpp
as
a starting point, here first order forward finite difference and an
explicit time integration is used. Examine possible speedup due to
parallelization. Then consider implementation of the implicit
method.