Making programs fast has always been an interest of mine.  Most of us have seen that system a small breeze can knock over and render flat, where the request time graph looks like an altitude map of Mount Kilimanjaro when any real load is applied.  Up, up, and away!  But system speed isn't the only kind of speed in development, and oftentimes it isn't even the most important.  There are other measures, like development time, or how long it takes a user to learn your system.

Back in college I was given the opportunity to build and tune a ray tracer as an exercise in algorithms.  Ray tracing has been around since 1968 and has been explored quite extensively; the prevailing view was that we had pretty much exhausted the acceleration options in the 30+ years between the concept's origin and when I took that class.  Our professor challenged us to improve things, but he also still held that position.  Only, I found something new in that class.  It was a variant of what already existed, but it still had fundamental advantages over and differences from the existing strategies.  My structure was faster to build, faster to render (5%-50% faster depending on the scene), and used perhaps 5% more memory on average, with more variability in total memory used.  I also had other metrics I deemed more important than the classic measurements at the time, such as the sum of the total volume of the bounding hierarchy at a given depth.  I had a good time tuning this system and thinking through the various approaches and their limitations.  I was eventually cited in a ray tracing book written by a fellow student, cited in an academic paper, and even invited to update the school's RTRT (Real Time Ray Tracer, built for a massively parallel system), all as a shiny new undergrad.

This experience, however fun, has not proven similar to any industry system I've helped build.  In my experience developing products in modern languages, for about 95% of your code it simply doesn't matter how efficiently it runs.  Correctness, good design, maintainability, and development velocity are generally more important.  The remaining 5% can spend significant amounts of time running and is usually at fault when your system is slow; this is where it apparently attempts to climb Mount Kilimanjaro before getting around to your request.

I shall call this 5% the critical path.  Some projects don't even have one (this is common in the front end and desktop software I've built), but every API or server I've worked with does.  This critical path deserves extra attention and should avoid certain types of operations that tend to be slow.  The sneakiest culprit I've found in modern systems is allocating memory, and this article will focus on that.  In my ray tracing algorithm I deviated only slightly from this theme, where I thought it was prudent, and I was right.  But rarely does your O(n) really matter; that has held true for 95% of the software I've written since school.  The common operations are already solved for you, baked into language features or libraries, and you should not generally spend time rewriting them.  And even when you have considered your O(n) extensively, memory is king: if you ignore it, whatever other advantages you found will be swallowed up in the sting of bad memory management.

Many junior, and even not-so-junior, devs don't have a particularly good understanding of the memory system operating underneath them to support their convenient high-level language features.  Every developer ought to become sensitive to how the innards of the system they are building with operate.

Some new development fads, like functional programming, insist on the immutability of data passed into functions.  This means that if you are performing operations on a list, you end up with multiple copies of that list along the way, whether you need those copies or not.  listData.map().filter().map() will end up with roughly four copies of the data, depending on how much filter() removes.  That's not a big deal if it's not on the critical path, but if it is, that data adds up quickly.
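To make the copying concrete, here is a minimal sketch (the data and callbacks are invented for illustration) contrasting the chained style with a single-pass loop that only allocates the final output:

```js
// Each chained call allocates a fresh array, so the original data plus
// three intermediates can be live on the heap at the same time.
const listData = Array.from({ length: 1_000_000 }, (_, i) => i);

const chained = listData
  .map((n) => n * 2)           // copy #2
  .filter((n) => n % 3 === 0)  // copy #3 (size depends on the predicate)
  .map((n) => n + 1);          // copy #4

// Same result with a single pass and a single output allocation.
const singlePass = [];
for (let i = 0; i < listData.length; i++) {
  const doubled = listData[i] * 2;
  if (doubled % 3 === 0) singlePass.push(doubled + 1);
}
```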

Some caching strategies clone the data extracted from the cache before handing it to you.  Every single JavaScript cache library I've looked at on npm does this.  It's intentional: it prevents callers from altering the cached original through the copy, something the language generally lets you do.  But it also takes an operation you believe will speed the system up, by potentially avoiding additional requests to APIs, and turns it into one that will slow the system down significantly under load.
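This isn't any particular npm library's implementation, just a hypothetical sketch of the two patterns: a cache that deep-clones on every read versus one that hands back the stored reference and relies on the value being treated as read-only:

```js
// Cloning pattern: structuredClone allocates a deep copy on every get(),
// which protects the cached original but burns memory and GC time on the
// critical path.
class CloningCache {
  #store = new Map();
  set(key, value) { this.#store.set(key, structuredClone(value)); }
  get(key) {
    const hit = this.#store.get(key);
    return hit === undefined ? undefined : structuredClone(hit);
  }
}

// Allocation-free alternative: return the cached reference directly and
// freeze it once when stored (note: Object.freeze is shallow).
class ReferenceCache {
  #store = new Map();
  set(key, value) { this.#store.set(key, Object.freeze(value)); }
  get(key) { return this.#store.get(key); }
}
```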

Why is allocating memory so bad for performance? CPU cache thrashing isn't normally on the radar in network-based systems, but a similar concept is. Allocated memory must eventually be freed, which means your virtual machine needs to find it and release it back to the managed heap. It takes time to determine which segments of memory are no longer needed; those are cycles built into the language for every bit of memory you use. To make things worse, the cost isn't even linear: as you allocate more memory more quickly, the search space becomes larger and the search takes longer. On top of that, the garbage collector needs to run more frequently to keep up.
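To illustrate what this looks like on a hot path, here is a contrived sketch (the function names and workload are invented) of an allocation-heavy routine versus one that does the same work while reusing a single result object:

```js
// Allocation-heavy: every call copies and sorts the input, then builds a
// new result object — all of it becomes garbage the collector must trace.
function summarizeAllocating(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  return { min: sorted[0], max: sorted[sorted.length - 1] };
}

// Allocation-light: a single pass over the input, writing into one result
// object reused across calls, so steady-state allocation is near zero.
const reusedResult = { min: 0, max: 0 };
function summarizeReusing(samples) {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < samples.length; i++) {
    if (samples[i] < min) min = samples[i];
    if (samples[i] > max) max = samples[i];
  }
  reusedResult.min = min;
  reusedResult.max = max;
  return reusedResult; // callers must copy if they need to keep it
}
```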

With all of these compounding factors eating at your cycles, careless memory management can put you in a state where, upon reaching a certain load, application performance craters and the majority of your application time is spent in the garbage collector; worse, your performance statistics may not even show why the system is so slow. I've seen such thrashing in various systems. In one case the developer I worked with doubled down on the caching being employed, which happened to be of the cloning variety in a JavaScript system, and it ultimately did very little to solve his fundamental performance issues.

I've also seen systems that try to paper over the issue by throwing more cloud resources at the problem bits.  Micro-services are another path people take to that end.  Performance issues can be solved by paying more!  That certainly is a valid route that works, and it's a trade-off between that and paying for the engineering time to build a better system.  But I believe that with an accurate view of some of the primary culprits that make systems slow, the cost calculation can land clearly in favor of making a better system.