
High performance C++

  • 13-11-2018 9:06am
    #1
    Moderators, Society & Culture Moderators Posts: 15,723 Mod ✭✭✭✭smacl


    Just read this in another thread and thought I'd start a new one rather than dragging that one off topic:
    14ned wrote: »
    Modern C++ is all about zero allocation, zero threads, zero locks, zero waits, and zero memory copying. Indeed, if you ever write a for or while loop, smack yourself hard and don't do it again (look into <algorithm>). Also view every unbalanced if statement with suspicion. Can you rewrite the code to eliminate if statements, or at worst make them balanced?

    I'm going to admit my ignorance here and ask what you mean by zero threads. A large part of what I do involves getting the best performance out of a given piece of hardware (CPU/GPU/memory/disk/network) when processing and visualizing large data sets (e.g. typically 200 million to 2 billion 3D coordinates in memory on a 32 GB PC with real-time rendering). This involves using divide-and-conquer approaches to get all the hardware involved to best effect. This might be OpenMP in some places (for loops and threads), and GPU elsewhere (with inevitable memory copying). I'd be interested in your thoughts on how modern C++ can be applied to achieve the best performance in this context, as it has always been my language of choice. For leveraging the GPU for general-purpose algorithms, we have AMP, which seems to be dead; OpenCL, which is struggling; CUDA, which is proprietary to nVidia; and HLSL Compute, which is proprietary to MS. Asking on SO draws a blank where I would have thought it would be a reasonably hot topic. Even knowledge of OpenMP performance profiling seems thin on the ground.

    Do you see modern C++ properly addressing heterogeneous computing any time soon? I totally get the minimal locks, minimal memory copying and zero waits, but zero threads, zero locks, zero memory copying and no loops seems counterintuitive.


Comments

  • Registered Users Posts: 768 ✭✭✭14ned


    smacl wrote: »
    I'm going to admit my ignorance here and ask what you mean by zero threads.

    99% of the time if you ever write "std::mutex" or "std::thread" then you're doing C++ wrong.
    A large part of what I do involves getting the best performance out of a given piece of hardware (CPU/GPU/memory/disk/network) when processing and visualizing large data sets (e.g. typically 200 million to 2 billion 3D coordinates in memory on a 32 GB PC with real-time rendering). This involves using divide-and-conquer approaches to get all the hardware involved to best effect. This might be OpenMP in some places (for loops and threads), and GPU elsewhere (with inevitable memory copying). I'd be interested in your thoughts on how modern C++ can be applied to achieve the best performance in this context, as it has always been my language of choice. For leveraging the GPU for general-purpose algorithms, we have AMP, which seems to be dead; OpenCL, which is struggling; CUDA, which is proprietary to nVidia; and HLSL Compute, which is proprietary to MS. Asking on SO draws a blank where I would have thought it would be a reasonably hot topic. Even knowledge of OpenMP performance profiling seems thin on the ground.

    In theory, you can write any C++ you like today and it'll "just work" (TM) on any recent GPU via LLVM.

    In practice, only the very latest nVidia GPUs (as in, just released) with beta-quality compilers have much chance of working as you'd expect. Have a watch of https://www.youtube.com/watch?v=75LcDvlEIYw and you'll see what I mean. He literally says that only the just-released nVidia GPU can run that stuff.
    Do you see modern C++ properly addressing heterogeneous computing any time soon? I totally get the minimal locks, minimal memory copying and zero waits, but zero threads, zero locks, zero memory copying and no loops seems counterintuitive.

    Oh yes. I may be becoming the chair of an "Elsewhere Memory" study group on WG21. Literally an SG for how to get multiple, concurrently running C++ programs to work with memory/storage shared between them. As in, every modern computing system, even your standard PC, has two dozen or more CPUs in it nowadays, and any of them talking to any others is currently undefined behaviour under the standard. I find that unacceptable.

    Just to be clear, the zero X, Y, Z ... is more to provoke thought than anything else. Instead of writing for loops, consider writing std::accumulate(...), for example, because the latter can auto-use a thousand threads on a GPU if running on a GPU, whereas the former must always be sequential.
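
    As a minimal sketch of what I mean (strictly speaking, the parallelisable form is std::reduce, since std::accumulate guarantees in-order evaluation; illustrative only):

    #include <execution>
    #include <numeric>
    #include <vector>

    double sum_loop(const std::vector<double>& v) {
        // Hand-written loop: the semantics force one element at a time.
        double total = 0.0;
        for (double x : v) total += x;
        return total;
    }

    double sum_algo(const std::vector<double>& v) {
        // std::reduce may reorder the operations and, given std::execution::par,
        // the implementation is free to split the work across many threads.
        return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
    }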

    Similarly, nobody is saying don't use threads. But I am saying never use threads directly. Use higher-level abstractions which may use threads. The point is, you shouldn't need to care whether they do or don't (in theory).
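
    For illustration only, something like this, where heavy_work is a hypothetical stand-in for whatever the expensive computation is:

    #include <future>
    #include <numeric>
    #include <vector>

    // Hypothetical stand-in for any expensive computation.
    static double heavy_work(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }

    // Rather than creating a std::thread and joining it by hand, express
    // the task and let the implementation schedule it.
    double process_both(const std::vector<double>& a, const std::vector<double>& b) {
        auto fa = std::async(std::launch::async, [&a] { return heavy_work(a); });
        double rb = heavy_work(b);   // this thread handles the other half meanwhile
        return fa.get() + rb;        // get() waits for (and returns) the async result
    }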

    Niall


  • Moderators, Society & Culture Moderators Posts: 15,723 Mod ✭✭✭✭smacl


    14ned wrote: »
    Just to be clear, the zero X, Y, Z ... is more to provoke thought than anything else. Instead of writing for loops, consider writing std::accumulate(...), for example, because the latter can auto-use a thousand threads on a GPU if running on a GPU, whereas the former must always be sequential.

    Similarly, nobody is saying don't use threads. But I am saying never use threads directly. Use higher-level abstractions which may use threads. The point is, you shouldn't need to care whether they do or don't (in theory).

    Thanks Niall. Unfortunately for me, at this point in time the theory and practice are altogether different animals, e.g. the parallel algorithms (std::reduce rather than the always-sequential std::accumulate) are C++17, which the latest update to VS2017 is only just implementing. As for GPUs, I need to support anything reasonably modern from any of the big manufacturers, so even CUDA is out. The issue I have with letting the standard library, frameworks and development tools handle performance-demanding tasks is that my experience to date has been that they frequently aren't performant, which in turn can lead to hours lost in front of a profiler staring at someone else's code, when you would have been better off writing your own code to meet specific requirements. For an example of this, see the following SO question I had on efficient containers for large data.
    Similarly, nobody is saying don't use threads. But I am saying never use threads directly. Use higher-level abstractions which may use threads. The point is, you shouldn't need to care whether they do or don't (in theory).

    Again, this is where I find theory and practice tend to diverge: if the algorithm isn't specifically designed and optimized for multi-threading, it won't perform well when multi-threaded, e.g. non-thread-local data referenced inside a loop can lead to implicit locking or contention, whereas creating thread-local copies of that data will avoid it. Once you move towards GPUs, you need to consider tiling, buffering and register usage. My first attempt at moving some efficient multi-threaded CPU-based code to the GPU actually slowed it down by about a factor of five. I think high levels of abstraction are fantastic, but you still need a good understanding of what's going on under the hood.
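
    To make that concrete, here's a rough OpenMP sketch of the thread-local-copy pattern (a made-up histogram example, assuming non-negative input values):

    #include <omp.h>
    #include <vector>

    // Each thread fills a private histogram; the partial results are merged
    // once at the end, so the hot loop needs no lock or atomic at all.
    std::vector<long> histogram(const std::vector<int>& data, int bins) {
        std::vector<long> total(bins, 0);
        #pragma omp parallel
        {
            std::vector<long> local(bins, 0);   // thread-local copy
            #pragma omp for nowait
            for (long i = 0; i < (long)data.size(); ++i)
                ++local[data[i] % bins];        // assumes values >= 0
            #pragma omp critical                // one short merge per thread
            for (int b = 0; b < bins; ++b)
                total[b] += local[b];
        }
        return total;
    }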


  • Registered Users Posts: 768 ✭✭✭14ned


    smacl wrote: »
    Thanks Niall. Unfortunately for me, at this point in time the theory and practice are altogether different animals, e.g. the parallel algorithms (std::reduce rather than the always-sequential std::accumulate) are C++17, which the latest update to VS2017 is only just implementing.

    VS2017 isn't too bad in fact. See https://blogs.msdn.microsoft.com/vcblog/2018/09/11/using-c17-parallel-algorithms-for-better-performance/. As Billy points out, QoI (quality of implementation) is a separate issue to function availability :)
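
    Roughly the kind of thing that post demonstrates (my own minimal sketch, not their benchmark):

    #include <algorithm>
    #include <execution>
    #include <random>
    #include <vector>

    int main() {
        std::vector<double> v(100'000'000);   // a deliberately large input
        std::mt19937_64 gen(42);
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        for (auto& x : v) x = dist(gen);

        // Same algorithm as always; the execution policy is the only change.
        std::sort(std::execution::par, v.begin(), v.end());
    }
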
    As for GPUs, I need to support anything reasonably modern from any of the big manufacturers, so even CUDA is out. The issue I have with letting the standard library, frameworks and development tools handle performance-demanding tasks is that my experience to date has been that they frequently aren't performant, which in turn can lead to hours lost in front of a profiler staring at someone else's code, when you would have been better off writing your own code to meet specific requirements. For an example of this, see the following SO question I had on efficient containers for large data.

    I agree that we've been deficient historically on big data. Incidentally, https://ned14.github.io/llfio/trivial__vector_8hpp.html would solve your problem, and it's proposed in WG21 P1031 for C++26 or so.
    Again, this is where I find theory and practice tend to diverge: if the algorithm isn't specifically designed and optimized for multi-threading, it won't perform well when multi-threaded, e.g. non-thread-local data referenced inside a loop can lead to implicit locking or contention, whereas creating thread-local copies of that data will avoid it. Once you move towards GPUs, you need to consider tiling, buffering and register usage. My first attempt at moving some efficient multi-threaded CPU-based code to the GPU actually slowed it down by about a factor of five. I think high levels of abstraction are fantastic, but you still need a good understanding of what's going on under the hood.

    I think data layout optimisation is totally orthogonal to threading or concurrency.

    Absolutely you need to lay out your data for your target architecture if you want maximum performance. Intel CPUs, in particular, allow programmers to be very lazy about doing that (though interestingly they are much less permissive of row-oriented vs column-oriented SIMD mismatches than other CPUs, e.g. ARM NEON).
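
    To illustrate the layout point with the usual array-of-structures vs structure-of-arrays contrast (a toy sketch applied to point data like yours):

    #include <algorithm>
    #include <vector>

    // Array-of-structures: x, y and z interleave in memory, so scanning one
    // coordinate strides over two unused floats per point.
    struct PointAoS { float x, y, z; };

    // Structure-of-arrays: each coordinate is contiguous, which vectorisers
    // and cache prefetchers both like.
    struct PointsSoA { std::vector<float> x, y, z; };

    float max_z(const PointsSoA& pts) {
        return pts.z.empty() ? 0.0f
                             : *std::max_element(pts.z.begin(), pts.z.end());
    }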

    My point is that absolutely, yes, you need to think about data layout, sharing, and data contention. That's unavoidable if you want performance. You should never be thinking about mutexes or locks or condition variables. That's what Java programmers do.

    Re: writing this stuff yourself, sure, if absolute max performance is needed then you do bespoke tuning. But you surely can see that we are entering a world soon where something written using Parallel STL algorithms will have acceptable performance whatever you point your code at. Not optimal, but acceptable. And usually, with a bit of tuning, one can get within 10% of custom code hand-written for the target.

    (Incidentally, have you looked into ISPC? https://ispc.github.io/. That's where future C++ is aiming to reach, but without leaving C++. In the meantime, ISPC delivers great results from a single source file today for a wide variety of compute, including GPUs)

    Niall


  • Moderators, Society & Culture Moderators Posts: 15,723 Mod ✭✭✭✭smacl


    14ned wrote: »
    Re: writing this stuff yourself, sure, if absolute max performance is needed then you do bespoke tuning. But you surely can see that we are entering a world soon where something written using Parallel STL algorithms will have acceptable performance whatever you point your code at. Not optimal, but acceptable. And usually, with a bit of tuning, one can get within 10% of custom code hand-written for the target.

    Not so sure. There seems to be a continuing expectation among certain programmers that improved hardware and better libraries will solve ongoing performance problems. My experience has been that performance demand always outstrips this, and taking this line leads to a game of catch-up that is never won.

    As an example, much of the programming that I've been doing in the last six years involves laser-scanned data. Six years ago a scanned model of ~10 million points would be considered big. Today a good scanner will collect 2 million points per second, and a reasonably big model would be a couple of billion points. So in this case the data has grown by about a factor of 200, whereas in the same period of time we have doubled CPU speed and quadrupled the average number of cores, for a performance increase of ~8x.

    The big gains are algorithmic and, to a lesser extent, from leveraging GPU threads, but they struggle to keep up. A simpler example would be something like 4K screens, where quadrupling the number of pixels is hammering the GPU, particularly where you might still have many algorithms with complexity > O(n) in play when rendering graphics. Parallel STL is great, but most programmers struggling for performance will already be using multi-threading heavily. If you're already optimally multi-threading and you call a multi-threaded library, chances are that you'll see a significant performance drop.

    C++ is the go-to language when it comes to demands for maximum performance, but it has been slow to encapsulate and abstract the gains offered by newer hardware.

    (Incidentally, have you looked into ISPC? https://ispc.github.io/. That's where future C++ is aiming to reach, but without leaving C++. In the meantime, ISPC delivers great results from a single source file today for a wide variety of compute, including GPUs)

    Thanks, I hadn't seen it but will have a play with it when I get a chance. Currently looking at HIP and debating whether to ditch DirectCompute in favour of it.


  • Registered Users Posts: 768 ✭✭✭14ned


    smacl wrote: »
    Not so sure. There seems to be a continuing expectation among certain programmers that improved hardware and better libraries will solve ongoing performance problems. My experience has been that performance demand always outstrips this, and taking this line leads to a game of catch-up that is never won.

    Well sure, exactly the bare minimum viable amount of investment in performance and scalability is always chosen. That's capitalism; it always trails the market.
    As an example, much of the programming that I've been doing in the last six years involves laser-scanned data. Six years ago a scanned model of ~10 million points would be considered big. Today a good scanner will collect 2 million points per second, and a reasonably big model would be a couple of billion points. So in this case the data has grown by about a factor of 200, whereas in the same period of time we have doubled CPU speed and quadrupled the average number of cores, for a performance increase of ~8x.

    There is continuing exponential growth in bandwidth and storage (i.e. density), but only linear growth in computation and latency (i.e. coordination). That's been the case for fifteen years or so now.

    The good news is that everybody expects all remaining exponential growth in hardware to disappear soon. Thereafter, all efficiency gains will need to come from software alone. That's where C++ and other low-level skills ought to see a large bump, as people rewrite everything currently written in Java/.NET/Python into C++. That's exactly what WG21 are positioning C++ to be ready for (and all the upstart proprietary systems languages as well, of course).
    The big gains are algorithmic and, to a lesser extent, from leveraging GPU threads, but they struggle to keep up. A simpler example would be something like 4K screens, where quadrupling the number of pixels is hammering the GPU, particularly where you might still have many algorithms with complexity > O(n) in play when rendering graphics. Parallel STL is great, but most programmers struggling for performance will already be using multi-threading heavily. If you're already optimally multi-threading and you call a multi-threaded library, chances are that you'll see a significant performance drop.

    C++ is the go-to language when it comes to demands for maximum performance, but it has been slow to encapsulate and abstract the gains offered by newer hardware.

    At the standards level, yes. But one must always standardise only existing practice.

    We actually throw away lots of performance at the OS level. Once hardware stops improving at all, it becomes feasible to completely rearchitect OS kernels to better fit actual hardware realities, instead of throwing away latency on the mere possibility that RAM might be paged out, etc. Non-volatile RAM also means a sea change there could be highly beneficial. Obviously, all this will break backwards compatibility with all existing software of any complexity. Such a rearchitecture would need billions of dollars, and would likely deliver only a once-off large improvement.

    One thing I'm very sure of is ever more heterogeneous compute. So instead of your kernel implementing your filesystem, your storage device will implement all of that locally, and your program, when it works with files, will actually be going straight to the hardware. Samsung's upcoming KV-SSD is the first generation of such a device. I am very sure it will be the first of many, and soon it'll be seen as pure backwards compatibility for your CPU to implement filesystems at all.

    Niall

