As suspected, everything is dynamically allocated and no memory mapping (see e.g. http://man7.org/linux/man-pages/man2/mmap.2.html) is used. No wonder this is slow and eats a lot of memory. At the moment I have no information about why this design was chosen, whether there is a justification for it, or whether the developers were simply unaware of the alternatives. Maybe I can find some hints in the paper. From what I've seen so far, it can be safely assumed that with optimal use of data structures and system functions the C++ results would be at least one order of magnitude better.
I think this applies to any kind of advanced software development or programming languages, not only C++.
There may be many reasons why scientists who are not computer scientists feel more comfortable with Go than with C++, but performance is certainly not one of them.
It depends on your definition of "wrong".
You can mess up your domain logic in any language, but the chances of messing up perf (and even some reference vs. copy semantics that affect logic) in C++ are vastly greater than in GC'd languages.
I've ported C++ code to C# with a significant perf gain, because c'tors of every kind were getting called all over the place in the original. Doing it right in C++ is too much cognitive overhead -- at least for me. If you're writing inner game loops or device drivers, paying for that complexity may be justified; for any LOB apps, I can't see choosing C++, or even Rust for that matter.
Rust certainly comes with its share of complexity, but it might be worth clarifying that Rust doesn't have this specific problem of implicit constructors called all over the place. Instead, everything uses move semantics by default.
But most life scientists using the library for their work are not; Go is a language they can master, as well as Python or Lua; if they're biophysicists they're likely to know Fortran and C as well, but I rarely meet scientists fluent in C++. Also not sure how much experience the authors have with such software projects.
What is easy is to write C++ as if it was Java. (You just stick to the standard containers and use smart pointers when allocating objects on the heap.)
They address that somewhat in the discussion section:
> C++ provides many features for more explicit memory management than is possible with reference counting. For example, it provides allocators [35] to decouple memory management from handling of objects in containers. In principle, this may make it possible to use such an allocator to allocate temporary objects that are known to become obsolete during the deallocation pause described above. Such an allocator could then be freed instantly, removing the described pause from the runtime. However, this approach would require a very detailed, error-prone analysis which objects must and must not be managed by such an allocator, and would not translate well to other kinds of pipelines beyond this particular use case. Since elPrep’s focus is on being an open-ended software framework, this approach is therefore not practical.
Custom allocators are a relatively niche topic, and I agree that library users cannot be expected to customize memory behaviour to that level of detail (especially if the library users typically are not professional developers).
However, a certain level of proficiency must be expected, and in the case of C++ this includes "know when to use const&, unique_ptr<T> or shared_ptr<T>". If this cannot be expected of the user, the comparison becomes less a question about performance and more about which language is the best at being the lowest common denominator.
Of course you can use better allocators; but it's faster to avoid dynamic allocation (e.g. by pointing to memory mapped from the input file by the OS) altogether. If they allocate memory for each flyspeck of a 200 GB file and also create and change a reference counter for it, nobody should be surprised about the low performance. Have a look what e.g. shared_ptr does behind the scenes.
Unless you're streaming, in which case mmap'ed access on Linux is generally slower than read/write. At least it was the last time we checked for the Ardour project (probably about 3 years ago).
> elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run
> Phase 1 allocates various data structures while parsing the read representations from BAM files into heap objects. A subset of these objects become obsolete after phase 1.
> Therefore, manual memory management is not a practical candidate for elPrep,
TL;DR: it's non-deterministic which objects will become garbage when, so some form of dynamic memory management is needed.
Well, one thing that jumps out immediately to me is that everything appears to be using shared_ptr. And by that I mean everything. Why is everything shared? What does it even mean to have a shared_ptr<stringstream>? Arbitrary mixed writes to a string seems like a bad idea, so shouldn't that be unique_ptr?
Or like this:
auto alns = make_shared<deque<shared_ptr<sam_alignment>>>();
A shared_ptr to a deque of shared_ptrs? deque isn't thread-safe, why would it be shared? And why does the deque instance need to be heap allocated at all? It just contains a pointer to the actual allocation anyway, moving it around by value is super cheap?
It's like make_shared is the only way they know to allocate an object. They even put string inside of shared_ptr:
class istream_wrapper {
...
shared_ptr<string> buffer;
It could be that it does need to be shared for some reason, but this looks like a pointer to a pointer for no obvious reason. Even ignoring the atomics that shared_ptr results in, the dependent data loads are going to be brutal.
EDIT: And they don't even seem to use std::move anywhere :/ There's a huge gap between this code and custom allocator territory.
They heap allocate strings because their string_range implementation is a shared_ptr to the original string plus two indices. That might be somewhat worth it if the strings are large enough. But if most strings are small, passing them around by move-value would probably be an overall win. One would need to benchmark it.
In that case you'd still want a shared_string, not a shared_ptr<string>. The double pointer chases add up quickly.
Edit: It might help to keep in mind that string is essentially an alias for unique_ptr<char[]>. As in, string is already heap allocated. They heap allocated a pointer to a heap allocation.
In my experience, this kind of error is the cause of 90% of the "wait, I thought C++ was supposed to be fast so why is it chugging compared to my Java implementation" problems. People turn everything into two dereferences.
The problem starts with the design decision to stream the file into memory; if you map the file instead you can directly use the mapped data and only have to allocate supporting structures (if required). And anyway, the "Therefore" part does not follow from the "Phase 1" statement.
For objects allocated in phase 1 and not needed afterwards, I would use std::pmr::monotonic_buffer_resource, added in C++17. Its implementation consists of little more than a pointer into a large contiguous buffer. You want n bytes of memory? Return the current pointer and advance it by n. Deallocation does nothing; destroying the monotonic_buffer_resource object releases the memory all at once. If the objects have trivial destructors, which appears to be the case from glancing at the code, this is very efficient.