Thoughts on High Performance Computing

During my work as consultant, I was asked about high performance computing (HPC) and how to implement it. As always,  one of the strongest constraints is a tight budget.

In the last years, techniques for HPC changed as the hardware changed. Several years before HPC was only possible on computers  made of special HPC processors like NEC’s vector CPUs or a large mainframe was installed with thousands of standard CPUs which work together to run in an astonishing speed. Sometimes, combinations of that was installed.The complexity to program such machines is massive and special knowledge is needed about the programming paradigms and the hardware to get optimal results.

Today the situation is a little different due to several  factors:

  1. Standard CPU will not get faster significantly. The physical constraints are reached and downsizing the chips is not that easy anymore or even impossible. In some dimensions production specifications are around atoms. As long as we do not want to split atoms, we can not reduce some dimensions.
  2. Due to the constrains in the point above, CPU architectures changes. The most significant change are the multi core processors. Moore’s law on speed is extended by multiplying the number of cores in a process.
  3. Gaming industry and industry for graphics processing have let the computer industry into a development of high performance graphics cards. As it turns out, with some minor constraints, these cards are very well suited for HPC. Even on my “old” nVidia GeForce 8600 GTS, I found 4 multi core processors with 8 cores per processor.

Possibilities for HPC

I do not want to write about special computer hardware and special designed machines. The standard PC technologies are presented here for customers with small budgets where the purchase of a HPC server with thousands of cores is not an option.

Therefore, the following possibilities for HPC are available today:

  1. Even if it is an older approach, cluster computing with PVM or MPI is still a valid possibility. In cluster computing several PCs or servers are interconnected with a standard Ethernet network. The big drawback are latencies of and the speed  in the network. If large computations can be run in parallel where the time consumption of the latency and the bandwidth are much smaller than the computation time, the approach can and should be used. A very prominent example is movie rendering. The scenery information is sent to a client and the calculation is performed on the client. Hundreds of clients can share the work and speed up the whole process dramatically.
  2. Multi Core and Multi Processor parallelization machine is a common choice today. The current number of cores in a standard PC are limited from 2 to 8. Multi core processors with more cores can be expected within the next years, that is for sure. The total speed up of a software is therefore limited on the number of available cores. Even if not HPC is done, the parallelization of software should be a topic, because customers want their machines running as fast as possible and the investment should be used efficiently. For HPC itself it is not a real option, because standard software should use it, too. So it is not special high performance about it.
  3. Real HPC can be done with GPU programming. One constraint of GPUs are the limitation to single precision floating point operations. It is quite ok for calculation of 3D graphics, but for some scientific calculations it is not good enough. nVidia has met this demand by creating the so called Tesla cards. These cards contain up to 448 cores with 6GB RAM and operate in double precision mode. Programmed with nVidias CUDA framework or the OpenCL language high speed ups can be achieved. This is a real low budget HPC solution for many customers.

Example

For a small test with OpenCL, I programmed a small C program which has to perform a simple matrix multiplication. In C a classical sequential matrix multiplication lools like:

for (i = 0; i < MATRIX_SIZE; i++)
{
    for (j = 0; j < MATRIX_SIZE; j++)
    {
        for (k = 0; k < MATRIX_SIZE; k++)
        {
            result[i * MATRIX_SIZE + k] += a[j * MATRIX_SIZE + i]
                                         * b[k * MATRIX_SIZE + j];
        }
    }
}

I assumed here, that we have quadratic matrices with a size of MATRIX_SIZE in each direction. For a size of 1024 this algorithm needs about 51.9 seconds on my AMD Operton 2600.

The same algorithm was implemented in OpenCL. The Kernel code looks like:

__kernel void matmul(__global float* a,
                     __global float* b,
                     __global float* result,
                     int size)
{
    __private int i = get_global_id(0);
    __private int k = get_global_id(1);
    __private int j;
    __private float r = 0.0;
    for (j = 0; j < size; j++)
    {
        r += a[j * size + i] * b[k * size + j];
    }
    result[i * size + k] = r;
}

Started is the kernel on my nVidia GeForce 8600 GTS after copying the needed matrix data into the graphics card RAM with:

size_t global_item_size[2];
global_item_size[0] = MATRIX_SIZE;
global_item_size[1] = MATRIX_SIZE;
size_t local_item_size[2];
local_item_size[0] = 8;
local_item_size[1] = 8;

ret = clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_item_size,
                             local_item_size, 0, NULL, NULL);

This leads into a start of 1,048,576 threads which are started on 32 cores. The whole operation is finished in roughly 3.3 seconds. This is a total speed up of 15.7.

One of the specialties to be taken into account is, that GPU processors are not cached and that therefore, no cache coherence is to be expected. All processors write directly into RAM. The host process has to take care for concurrency and to avoid it. In the example above the two index variables for the results matrix are independent and the calculation itself, too. So we could create independent threads for these two variables. The third variable is dependent and can not be parallelized without additional locking mechanisms.

The situation on graphics cards are much more interesting as soon as we take the different memories into account which exist on a graphics card, too. In the example above I used the global memory which is accessible for reading and writing by all processors and the private memory which is private for each core. The private variable r was used due to fast read and write capabilities of the private memory. It’s faster to sum up the result first in private memory and to set the result in global memory later on. We also have a read only memory for constants on the graphics boards (read only for the GPU processors, but writable by the host), texture memory and some more…

Conclusion

As shown above, massive parallel GPU programming and OpenCL is a big chance for HPC on a small budget. Taken into account that my graphics card is not state of the art anymore and the nVidia Tesla cards with their performance, HPC is possible for Science and Research Institutes and Organizations with strong budget constraints.

%d bloggers like this: