/** * \page examples Examples * * There are several examples shipping with %Vc. If you have a suggestion for a useful or interesting * example, please contact vc@compeng.uni-frankfurt.de. * * \li \subpage ex-polarcoord * This example explains the basic approaches to vectorization of a given problem. It contains a * discussion of storage issues. * \li \subpage ex-finitediff * \li \subpage ex-matrix * \li \subpage ex-mandelbrot * \li \subpage ex-buddhabrot * ********************************************************************************* * * \page ex-polarcoord Polar Coordinates * * The \c polarcoord example generates 1000 random Cartesian 2D coordinates that are then * converted to polar coordinates and printed to the terminal. * This is a very simple example but shows the concept of vertical versus horizontal * vectorization quite nicely. * * \section ex_polarcoord_background Background * * In our problem we start with the allocation and random initialization of 1000 Cartesian 2D * coordinates. Thus every coordinate consists of two floating-point values (x and y). * \code * struct CartesianCoordinate * { * float x, y; * }; * CartesianCoordinate input[1000]; * \endcode * \image html polarcoord-cartesian.png "Cartesian coordinate" * * We want to convert them to 1000 polar coordinates. * \code * struct PolarCoordinate * { * float r, phi; * }; * PolarCoordinate output[1000]; * \endcode * \image html polarcoord-polar.png "Polar coordinate" * * Recall that: * \f[ * r^2 = x^2 + y^2 * \f]\f[ * \tan\phi = y/x * \f] * (One typically uses \c atan2 to calculate \c phi efficiently.) * * \section ex_polarcoord_vectorization Identify Vectorizable Parts * * When you look into vectorization of your application/algorithm, the first task is to identify the * data parallelism to use for vectorization. * A scalar implementation of our problem could look like this: * \code * for (int i = 0; i < ArraySize; ++i) { * const float x = input[i].x; * const float y = input[i].y; * output[i].r = std::sqrt(x * x + y * y); * output[i].phi = std::atan2(y, x) * 57.295780181884765625f; // 180/pi * if (output[i].phi < 0.f) { * output[i].phi += 360.f; * } * } * \endcode * The data parallelism inside the loop is minimal. It basically consists of two multiplications * that can be executed in parallel. This kind of parallelism is already exploited by all modern * processors via pipelining, which is one form of instruction level parallelism (ILP). * Thus, if one were to put the x and y values into a SIMD vector, this one multiplication could be * executed with just a single SIMD instruction. This vectorization is called \e vertical * vectorization, because the vector is placed vertically into the object. * * There is much more data parallelism in this code snippet, though. The different iteration steps * are all independent, which means that subsequent steps do not depend on results of the preceding steps. * Therefore, several steps of the loop can be executed in parallel. This is the most * straightforward vectorization strategy for our problem: * From a loop, always execute N steps in parallel, where N is the number of entries in the SIMD vector. * The input values to the loop need to be placed into a vector. * Then all intermediate values and results are also vectors. Using the %Vc datatypes a single loop * step would then look like this: * \code * // x and y are of type float_v * float_v r = Vc::sqrt(x * x + y * y); * float_v phi = Vc::atan2(y, x) * 57.295780181884765625f; // 180/pi * phi(output[i].phi < 0.f) += 360.f; * \endcode * This vectorization is called \e horizontal vectorization, because the vector is placed * horizontally over several objects. * * \section ex_polarcoord_data Data Structures * * To form the \c x vector from the previously used storage format, one would thus write: * \code * float_v x; * for (int i = 0; i < float_v::Size; ++i) { * x[i] = input[offset + i].x; * } * \endcode * Notice how the memory access is rather inefficient. * * \subsection ex_polarcoord_data_aos Array of Structs (AoS) * * The data was originally stored as array of * structs (\e AoS). Another way to call it is \e interleaved storage. That's because the entries of * the \c x and \c y vectors are interleaved in memory. * Let us assume the storage format is given and we cannot change it. * We would rather not load and store all our vectors entry by entry as this would lead to * inefficient code, which mainly occupies the load/store ports of the processor. Instead, we can use * a little helper function %Vc provides to load the data as vectors with subsequent deinterleaving: * \code * Vc::float_v x, y; * Vc::deinterleave(&x, &y, &input[i], Vc::Aligned); * \endcode * This pattern can be very efficient if the interleaved data members are always accessed together. * This optimizes for data locality and thus cache usage. * * \subsection ex_polarcoord_data_vectors Interleaved Vectors * * If you can change the data structures, it might be a good option to store interleaved vectors: * \code * struct CartesianCoordinate * { * Vc::float_v x, y; * }; * CartesianCoordinate input[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size]; * \endcode * Accessing vectors of \c x and \c y is then as simple as accessing the members of a \c * CartesianCoordinate object. This can be slightly more efficient than the previous method because * the deinterleaving step is not required anymore. On the downside your data structure now depends * on the target architecture, which can be a portability concern. * In short, the \c sizeof operator returns different values depending on Vc::float_v::Size. * Thus, you would have to ensure correct conversion to target independent data * formats for any data exchange (storage, network). (But if you are really careful about portable * data exchange, you already have to handle endian conversion anyway.) * * Note the unfortunate complication of determining the size of the array. In order to fit 1000 * scalar values into the array, the number of vectors times the vector size must be greater or equal * than 1000. But integer division truncates. * * Sadly, there is one last issue with alignment. If the \c CartesianCoordinate object is allocated * on the stack everything is fine (because the compiler knows about the alignment restrictions of * \c x and \c y and thus of \c CartesianCoordinate). But if \c CartesianCoordinate is allocated on * the heap (with \c new or inside an STL container), the correct alignment is not ensured. %Vc provides * Vc::VectorAlignedBase, which contains the correct reimplementations of the \c new and \c delete operators: * \code * struct CartesianCoordinate : public Vc::VectorAlignedBase * { * Vc::float_v x, y; * } * CartesianCoordinate *input = new CartesianCoordinate[(1000 + Vc::float_v::Size - 1) / Vc::float_v::Size]; * \endcode * To ensure correctly aligned storage with STL containers you can use Vc::Allocator: * \code * struct CartesianCoordinate * { * Vc::float_v x, y; * } * VC_DECLARE_ALLOCATOR(CartesianCoordinate) * std::vector input((1000 + Vc::float_v::Size - 1) / Vc::float_v::Size); * \endcode * * For a thorough discussion of alignment see \ref intro_alignment. * * \subsection ex_polarcoord_data_soa Struct of Arrays (SoA) * * A third option is storage in the form of a single struct instance that contains arrays of the * data members: * \code * template struct CartesianCoordinate * { * float x[Size], y[Size]; * } * CartesianCoordinate<1000> input; * \endcode * Now all \c x values are adjacent in memory and thus can easily be loaded and stored as vectors. * Well, two problems remain: * 1. The alignment of \c x and \c y is not defined and therefore not guaranteed. Vector loads and * stores thus must assume unaligned pointers, which is bad for performance. Even worse, if an * instruction that expects an aligned pointer is executed with an unaligned address the program * will crash. * 2. The size of the \c x and \c y arrays is not guaranteed to be large enough to allow the last * values in the arrays to be loaded/stored as vectors. * * %Vc provides the Vc::Memory class to solve both issues: * \code * template struct CartesianCoordinate * { * Vc::Memory x, y; * } * CartesianCoordinate<1000> input; * \endcode * * \section ex_polarcoord_complete The Complete Example * * Now that we have covered the background and know what we need - let us take a look at the * complete example code. * * \snippet polarcoord/main.cpp includes * The example starts with the main include directive to use for %Vc: \c \#include \c . * The remaining includes are required for terminal output. * Note that we include Vc::float_v into the global namespace. * It is not recommended to include the whole %Vc namespace into the global namespace * except maybe inside a function scope. * * \snippet polarcoord/main.cpp memory allocation * At the start of the program, the input and output memory is allocated. * Of course, you can abstract these variables into structs/classes for Cartesian and polar * coordinates. * The Vc::Memory class can be used to allocate memory on the stack or on the heap. * In this case the memory is allocated on the stack, since the size of the memory is given at * compile time. * The first \c float value of Vc::Memory (e.g. x_mem[0]) is guaranteed to be aligned to the * natural SIMD vector alignment. * Also, the size of the allocated memory may be padded at the end to allow access to the last \c * float value (e.g. x_mem[999]) with a SIMD vector. * Thus, if this example is compiled for a target with a vector width (\c Vc::float_v::Size) of 16 * entries, the four arrays would internally be allocated as size 1008 (63 vectors with 16 entries = * 1008 entries). * * \snippet polarcoord/main.cpp random init * Next the x and y values are initialized with random numbers. * %Vc includes a simple vectorized random number generator. * The floating point RNGs in %Vc produce values in the range from 0 to 1. * Thus the value has to be scaled and subtracted to get into the desired range of -1 to 1. * The iteration over the memory goes from 0 (no surprise) to a value determined by the Vc::Memory * class. In the case of fixed-size allocation, this number is also available to the compiler as a * compile time constant. Vc::Memory has two functions to use as upper bound for iterations: * Vc::Memory::entriesCount and Vc::Memory::vectorsCount. The former returns the same number as was * used for allocation. The latter returns the number of SIMD vectors that fit into the (padded) * allocated memory. Thus, if Vc::float_v::Size were 16, \c x_mem.vectorsCount() would expand to 63. * Inside the loop, the memory i-th vector is then set to a random value. * * \warning Please do not use this RNG until you have read its documentation. It may not be as * random as you need it to be. * * \snippet polarcoord/main.cpp conversion * Finally we arrive at the conversion of the Cartesian coordinates to polar coordinates. * The for loop is equivalent to the one above. * * Inside the loop we first assign the x and y values to local variables. * This is not necessary; but it can help the compiler with optimization. The issue is that when you * access values from some memory area, the compiler cannot always be sure that the pointers to * memory do not alias (i.e. point to the same location). Thus, the compiler might rather take the * safe way out and load the value from memory more often than necessary. By using local variables, * the compiler has an easy task to prove that the value does not change and can be cached in a * register. This is a general issue, and not a special issue with SIMD. In this case mainly serves * to make the following code more readable. * * * * \snippet polarcoord/main.cpp output * ********************************************************************************* * * \page ex-finitediff Finite Differences * ********************************************************************************* * * \page ex-matrix Matrix Class * ********************************************************************************* * * \page ex-buddhabrot Buddhabrot * ********************************************************************************* * * \page ex-mandelbrot Mandelbrot * * This example draws a colorized Mandelbrot image on screen using Qt4 widgets. * * The example uses a simple class to abstract complex numbers. In principle, one could just use * std::complex, if it would perform well enough. But especially the norm function is very slow for * scalar float/double. Also, complex multiplication is correctly implemented to handle NaN and * infinity. This is not required for Mandelbrot as these special cases will not occur. * Additionally, the provided complex abstraction stores the square of the real and imaginary parts * to help the compiler in optimizing the code as good as possible. * \snippet mandelbrot/mandel.cpp MyComplex * * Mandelbrot uses the function z = z² + c for iteration. * \snippet mandelbrot/mandel.cpp P function * ********************************************************************************* * * \page ex-buddhabrot Buddhabrot * */