As suspected, everything is dynamically allocated and no memory mapping (see e.g. http://man7.org/linux/man-pages/man2/mmap.2.html) is used. No wonder this is slow and eats a lot of memory. At the moment I have no information about why this design was chosen, whether there is a justification for it, or whether the developers were simply unaware of the alternatives. Maybe I can find some hints in the paper. From what I've seen so far, it can be safely assumed that with optimal use of data structures and system functions the C++ results would be at least one order of magnitude better.
I think this applies to any kind of advanced software development or programming languages, not only C++.
There may be many reasons why scientists who are not computer scientists feel more comfortable with Go than with C++, but performance is certainly not one of them.
It depends on your definition of "wrong".
You can mess up your domain logic in any language, but the chances of messing up perf (and even some reference vs. copy semantics that affect logic) in C++ are vastly greater than in GC'd languages.
I've ported C++ code to C# with a significant perf gain, because c'tors of every kind were getting called all over the place in the original. Doing it right in C++ is too much cognitive overhead -- at least for me. If you're writing inner game loops or device drivers, paying for that complexity may be justified; for any LOB apps, I can't see choosing C++, or even Rust for that matter.
Rust certainly comes with its share of complexity, but it might be worth clarifying that Rust doesn't have this specific problem of implicit constructors called all over the place. Instead, everything uses move semantics by default.
But most life scientists using the library for their work are not; Go is a language they can master, as well as Python or Lua; if they're biophysicists they're likely to know Fortran and C as well, but I rarely meet scientists fluent in C++. Also not sure how much experience the authors have with such software projects.
What is easy is to write C++ as if it was Java. (You just stick to the standard containers and use smart pointers when allocating objects on the heap.)
They address that somewhat in the discussion section:
> C++ provides many features for more explicit memory management than is possible with reference counting. For example, it provides allocators [35] to decouple memory management from handling of objects in containers. In principle, this may make it possible to use such an allocator to allocate temporary objects that are known to become obsolete during the deallocation pause described above. Such an allocator could then be freed instantly, removing the described pause from the runtime. However, this approach would require a very detailed, error-prone analysis which objects must and must not be managed by such an allocator, and would not translate well to other kinds of pipelines beyond this particular use case. Since elPrep’s focus is on being an open-ended software framework, this approach is therefore not practical.
Custom allocators are a relatively niche topic, and I agree that library users cannot be expected to customize memory behaviour to that level of detail (especially if the library users typically are not professional developers).
However, a certain level of proficiency must be expected, and in the case of C++ this includes "know when to use const&, unique_ptr<T> or shared_ptr<T>". If this cannot be expected of the user, the comparison becomes less a question about performance and more about which language is the best at being the lowest common denominator.
Of course you can use better allocators; but it's faster to avoid dynamic allocation (e.g. by pointing to memory mapped from the input file by the OS) altogether. If they allocate memory for each flyspeck of a 200 GB file and also create and change a reference counter for it, nobody should be surprised about the low performance. Have a look what e.g. shared_ptr does behind the scenes.
Unless you're streaming, in which case mmap'ed access on Linux is generally slower than read/write. At least it was the last time we checked for the Ardour project (probably about 3 years ago).
> elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run
> Phase 1 allocates various data structures while parsing the read representations from BAM files into heap objects. A subset of these objects become obsolete after phase 1.
> Therefore, manual memory management is not a practical candidate for elPrep,
TL;DR: it's non-deterministic which objects will become garbage when, so some form of dynamic memory management is needed.
Well, one thing that jumps out immediately to me is that everything appears to be using shared_ptr. And by that I mean everything. Why is everything shared? What does it even mean to have a shared_ptr<stringstream>? Arbitrary mixed writes to a string seems like a bad idea, so shouldn't that be unique_ptr?
Or like this:
auto alns = make_shared<deque<shared_ptr<sam_alignment>>>();
A shared_ptr to a deque of shared_ptrs? deque isn't thread-safe, why would it be shared? And why does the deque instance need to be heap allocated at all? It just contains a pointer to the actual allocation anyway, moving it around by value is super cheap?
It's like make_shared is the only way they know to allocate an object. They even put string inside of shared_ptr:
class istream_wrapper {
...
shared_ptr<string> buffer;
It could be that it does need to be shared for some reason, but this looks like a pointer to a pointer for no obvious reason. Even ignoring the atomics that shared_ptr results in, the dependent data loads are going to be brutal.
EDIT: And they don't even seem to use std::move anywhere :/ There's a huge gap between this code and custom allocator territory.
They heap allocate strings because their string_range implementation is a shared_ptr to the original string plus two indices. That might be somewhat worth it if the strings are large enough. But if most strings are small, passing them around by move-value would probably be an overall win. One would need to benchmark it.
In that case you'd still want a shared_string, not a shared_ptr<string>. The double pointer chases add up quickly.
Edit: It might help to keep in mind that string is essentially an alias for unique_ptr<char[]>. As in, string is already heap allocated. They heap allocated a pointer to a heap allocation.
In my experience, this kind of error is the cause of 90% of the "wait, I thought C++ was supposed to be fast so why is it chugging compared to my Java implementation" problems. People turn everything into two dereferences.
The problem starts with the design decision to stream the file into memory; if you map the file instead you can directly use the mapped data and only have to allocate supporting structures (if required). And anyway, the "Therefore" part does not follow from the "Phase 1" statement.
For objects allocated in phase 1 and not needed afterwards, I would use std::pmr::monotonic_buffer_resource, added in C++17. Its implementation consists of little more than a pointer into a large contiguous buffer. You want n bytes of memory? Return the current pointer and advance it by n. Deallocation does nothing; destroying the monotonic_buffer_resource object releases the memory all at once. If the objects have trivial destructors, which appears to be the case from glancing at the code, this is very efficient.